What is the best way to cycle through various URLs to collect data?

After receiving valuable input from fellow commenters, I made some tweaks to the code. A brief overview: the goal is to extract product information from over 800 HTML pages, convert that data to JSON, and store it in a JSON file. The code works fine when processing around 20 pages at once, but running it over all of them triggers the following error:

Error: Max redirects exceeded.

Here is the complete code snippet:

// Necessary module imports
const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

const url = "http://johndevisser.marktplaza.nl/?p=";

async function getProductsHtml(data) {
    const $ = await cheerio.load(data);
    let productsHTML = [];
    $("div.item").each((i, prod) => {
        productsHTML.push(($(prod).html()));
    });
    return productsHTML;
};

async function parseProducts(html) {
  let products = [];
  for (const item in html) {
    // Retain existing data
    const $ = await cheerio.load(html[item]);
    let product = {};
    let mpUrl = $("a").attr("href");
    product["title"] = $("a").attr("title");
    product["mpUrl"] = mpUrl;
    product["imgUrl"] = $("img").attr("src");
    let priceText = $("span.subtext").text().split("\xa0")[1].replace(",", ".");
    product["price"] = parseFloat(priceText);
    products.push(product);
  }
  return products;
}

async function addDescriptionToProducts(prods) {
  for (const i in prods) {
    const response = await axios.get(prods[i]["mpUrl"]);
    const $ = cheerio.load(response.data);
    const description = $("div.description p").text();
    prods[i]["descr"] = description;
  }
  return prods
}

async function getProductsFromPage(i) {
  try {
      const page = await axios.get(`http://johndevisser.marktplaza.nl/?p=${i}`);
      console.log("GET request succeeded!");
      // Extract HTML array for each product
      const productsHTML = await getProductsHtml(page.data);
      console.log("Obtained HTML array!");
      // Parse meta info into object array
      const productsParsed = await parseProducts(productsHTML);
      console.log("Products parsed!")
      // Add descriptions to products
      const productsMeta = await addDescriptionToProducts(productsParsed);
      console.log("Descriptions added!")
      // Return complete product information array
      return productsMeta;
    } catch(e) {
      console.log(e);
    }
};

async function saveAllProducts() {
  try {
    const allProducts = await getAllProducts();
    const jsonProducts = JSON.stringify(allProducts);
    fs.writeFile("products.json", jsonProducts, "utf8", (e) => {
      if (e) {
        console.log(e);
      }
    });
  } catch(e) {
    console.log(e);
  }
}

async function getAllProducts() {
  try {
    let allProducts = [];
    for (let i = 1; i < 855; i++) {
      const productsFromPage = await getProductsFromPage(i);
      allProducts = [...allProducts, ...productsFromPage];
      console.log("Saved products from page " + i);
    }
    return allProducts
  } catch(e) {
    console.log(e);
  }
}

saveAllProducts();

Answer №1

Before attempting to fetch all 800 pages, it is worth reassessing your current code structure. Several aspects of it make the script hard to run repeatedly and hard to debug.

  1. getProducts fetches the page HTML and, as a side effect, stores the product HTML in a global variable, which adds unnecessary complexity.
  2. parseProducts accepts an array of product HTML but never uses it, relying on the global variable instead.
  3. parseProducts parses each product's HTML and saves the metadata in yet another global variable.
  4. fetchAndUpdateProducts both parses a page and writes to JSON, mixing two concerns in a single function.

These issues make the flow of fetchAndUpdateProducts convoluted and harder to debug.

My suggestion would be to create a new method structured like the following:

async function getProductsFromPage(i) {
  try {
    const page = await axios.get(`http://johndevisser.marktplaza.nl/?p=${i}`);

    // Obtain an array containing the HTML of each product
    const productsHTML = await getProductsHtml(page.data);

    // Obtain an array of objects with the meta information
    const productsParsed = await parseProducts(productsHTML);

    // Add a description to each product
    const productsMeta = await addDescriptionToProducts(productsParsed);

    // Return an array with all product information
    return productsMeta;
  } catch (e) {
    // Log the failure and return an empty array so callers can still spread the result
    console.log(e);
    return [];
  }
}

With this restructuring in place, you can run something like the following inside an async function:

const p1 = await getProductsFromPage(1);
const p2 = await getProductsFromPage(2);
const p3 = await getProductsFromPage(3);
// and so on

You can also consolidate all the data into a single array:

let allProducts = [];

// Pages are numbered from 1 in the question's URL scheme
for (let i = 1; i <= 800; i++) {
  const productsFromPage = await getProductsFromPage(i);
  allProducts = [...allProducts, ...productsFromPage];
}

// Write to JSON
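
The "Write to JSON" step could look roughly like the sketch below. The helper name saveProducts and the sample data are assumptions made for illustration, not part of the original code. Because writing the file is the last thing the script does, a synchronous write is used here to keep the example short; the callback-based fs.writeFile from the question would work as well.

```javascript
const fs = require('fs');

// Hypothetical helper: serialize the collected products and write them to a file.
function saveProducts(filePath, products) {
  // Pretty-print with a 2-space indent so the file is easy to inspect
  const json = JSON.stringify(products, null, 2);
  fs.writeFileSync(filePath, json, 'utf8');
}

// Usage sketch with made-up sample data
const sampleProducts = [{ title: 'Example product', price: 12.5 }];
saveProducts('products-sample.json', sampleProducts);
```

Keeping this step in its own function means the scraping functions never touch the file system, which matches the separation of concerns suggested above.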

Answer №2

Here is a simple way to fetch all the pages and loop through the results, assuming the pages are numbered from 1 to 800:

const axios = require('axios');

const pageCount = 800;
const promises = [];
for (let page = 1; page <= pageCount; page++) {
  promises.push(axios.get(`https://someurl?page=${page}`));
}

Promise.all(promises).then(responses => {
  // The responses are in the same order as the requests, one per page
  // Now you can process this data and save it as needed

  // Each axios response carries the page body in its `data` property
  const data = responses.map(response => response.data);
});
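
Starting all 800 requests at the same moment may be more than the server tolerates. Since the question reports that about 20 pages at a time works, one possible middle ground is to run the requests in batches. The sketch below is only an illustration: mapInBatches is a hypothetical helper, and the worker in the usage example is a stand-in for a real request such as getProductsFromPage. Whether batching avoids the "Max redirects exceeded" error depends on the server, so treat that as an assumption to verify.

```javascript
// Hypothetical helper: run `worker` over `items` in batches of `batchSize`,
// waiting for each batch to finish before starting the next one.
async function mapInBatches(items, batchSize, worker) {
  const results = [];
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // Calls inside one batch run concurrently; batches run one after another
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
}

// Usage sketch with a stand-in worker; in the real script the worker could be
// something like (page) => getProductsFromPage(page).
const pages = Array.from({ length: 45 }, (_, i) => i + 1); // [1, 2, ..., 45]
mapInBatches(pages, 20, async (page) => `page ${page}`)
  .then((results) => console.log(results.length)); // logs 45
```

The results keep the order of the input pages, because Promise.all preserves the order of the promises it is given and the batches are appended in sequence.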
