Scraping the Web for Repetitive Array Information

I'm struggling to eliminate duplicate articles from my web scraper's results. This is the code I'm using:

app.get("/scrape", function (req, res) {

  request("https://www.nytimes.com/", function (error, response, html) {

    // Parsing the HTML using cheerio
    var $ = cheerio.load(html);
    var uniqueResults = [];
    
    $("div.collection").each(function (i, element) {
      var results = [];
      
      // Scrape relevant data
      results.link = $(element).find("a").attr("href");
      results.title = $(element).find("a").text();
      results.summary = $(element).find("p.summary").text().trim();

      db.Article.create(results)
        .then(function (dbArticle) {
          res.json(dbArticle);
        }).catch(function (err) {
          return res.json(err);
        });

    });
    res.send("Data successfully scraped.");
  });
});

// Route for fetching Articles from the database
app.get("/articles", function (req, res) {
  db.Article.find()
    .then(function (dbArticle) {

      res.json(dbArticle);
    })
    .catch(function (err) {
      res.json(err);
    });
});

I currently get multiple copies of each article. I have tried db.Article.distinct and similar methods without success. Any suggestions?

Answer №1

Summary: I resolved the issue by changing var results = [] (an array) to var results = {} (an object). I am still investigating the root cause of the duplicate documents being inserted into the database and will post an update once I know more.

Detailed Explanation:

Your code contains several mistakes and areas that can be improved. Let's go through them so that it runs without errors.

Mistakes

1. Mongoose's Model.create (and possibly the model constructor) may work with arrays, but the array form is meant for creating many documents at once, not one at a time. Here results is declared as an array and then given link, title, and summary properties as if it were a plain object. To create documents one by one, represent each of them with an object instead of an array.

So change

var results = [];

to

var results = {};
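
To see why the array form is a poor fit for a single document, here is a small standalone sketch in plain JavaScript (no Mongoose involved; the headline value is made up for illustration). It only shows how the language treats named properties on an array, not what Mongoose does internally with such a value:

```javascript
// Named properties set on an array are not array elements.
var asArray = [];
asArray.title = "Example headline"; // hypothetical value

var asObject = {};
asObject.title = "Example headline";

console.log(asArray.length);           // 0
console.log(JSON.stringify(asArray));  // []
console.log(JSON.stringify(asObject)); // {"title":"Example headline"}
```

The array still has length 0 and serializes as an empty list, while the object carries the field as expected.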

2. Sending a response after one has already been sent results in an error ("Cannot set headers after they are sent to the client"). If it is not handled properly, this could lead to an unhandled promise rejection and prevent further documents from being stored. Because db.Article.create is asynchronous, the $("div.collection").each(function (i, element) { ... }) loop does not wait for the documents to be saved: control moves on and executes res.send("Data successfully scraped."); first, and every res.json call inside the then and catch callbacks then tries to respond a second time.

To avoid sending a second response, comment out the res.json statements inside the then and catch callbacks of .create. The response then ends immediately while the articles continue to be saved in the background.
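
The ordering problem described above can be reproduced without Express or MongoDB. In the sketch below, fakeCreate is a hypothetical stand-in for db.Article.create (it simply returns a resolved promise), and the order array records what happens when:

```javascript
var order = [];

// Hypothetical stand-in for db.Article.create: resolves asynchronously.
function fakeCreate(doc) {
  return Promise.resolve(doc);
}

// Synchronous loop, like .each(): it only schedules the saves.
["a", "b"].forEach(function (title) {
  fakeCreate({ title: title }).then(function () {
    order.push("saved " + title);
  });
});

// This line runs before either .then callback above.
order.push("response sent");

// Inspect the order once the pending promise callbacks have run.
var done = new Promise(function (resolve) {
  setTimeout(function () { resolve(order); }, 0);
});
done.then(function (result) {
  console.log(result); // [ 'response sent', 'saved a', 'saved b' ]
});
```

The "response sent" entry comes first, which is why any res.json inside the callbacks ends up being a second response.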

If you would rather send the response only after the data has been saved successfully, modify your route handler as follows:

...
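
The code for this step is not shown above. As a rough sketch of one way to respond only once, after every save has finished, the create promises can be collected and awaited with Promise.all. This is not necessarily what the answer's author had in mind. saveAll, fakeDb, and fakeRes are hypothetical names introduced here so the example runs on its own; in the real route, db and res would be the Mongoose models and the Express response object.

```javascript
// Hypothetical stand-in for the Mongoose models (assumption, not the real API surface).
var nextId = 1;
var fakeDb = {
  Article: {
    create: function (doc) {
      return Promise.resolve(Object.assign({ _id: nextId++ }, doc));
    }
  }
};

// Hypothetical stand-in for the Express response; counts how often json() is called.
var fakeRes = {
  calls: 0,
  statusCode: 200,
  body: null,
  status: function (code) { this.statusCode = code; return this; },
  json: function (body) { this.calls += 1; this.body = body; return this; }
};

// Start every save, wait for all of them, then respond exactly once.
function saveAll(db, scraped, res) {
  var pending = scraped.map(function (item) {
    return db.Article.create(item);
  });
  return Promise.all(pending)
    .then(function (saved) {
      res.json(saved);
    })
    .catch(function (err) {
      res.status(500).json(err);
    });
}

var finished = saveAll(fakeDb, [{ title: "First" }, { title: "Second" }], fakeRes);
finished.then(function () {
  console.log(fakeRes.calls, fakeRes.body.length); // 1 2
});
```

In the actual handler, the scraped items would be gathered inside the .each loop as plain objects, and saveAll (or its body) would be called after the loop in place of the unconditional res.send.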
