Puppeteer fails to retrieve all available results after infinite scroll has completed

Below is the code snippet from my data scraping file:

const puppeteer = require('puppeteer');
const db = require('../db');
const Job = require('../models/job');

(async() => {
  try {
    const browser = await puppeteer.launch({
      headless: false,
      defaultViewport: null,
      // args: ['--no-zygote', '--no-sandbox']
    });
    const url = 'https://www.linkedin.com/jobs/search?keywords=Junior%20Software%20Developer&location=Indianapolis%2C%20IN&geoId=&trk=homepage-jobseeker_jobs-search-bar_search-submit&position=1&pageNum=0';

    // Initiate a new page in the browser
    const page = await browser.newPage({
      waitUntil: 'networkidle0'
    });
    console.log(`Navigating to ${url}`);
    await page.goto(url);

    // Scroll to the bottom of the page, click on 'See More Jobs', and repeat   
    let lastHeight = await page.evaluate('document.body.scrollHeight');
    const scroll = async() => {
      while (true) {
        await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
        await page.waitForTimeout(2000);
        let newHeight = await page.evaluate('document.body.scrollHeight');
        if (newHeight === lastHeight) {
          console.log('Done scrolling!');
          break;
        }
        lastHeight = newHeight;
        seeMoreJobs();
      }
      console.log(data);
    }
    // Click on 'See More Jobs'
    const seeMoreJobs = async() => {
      await page.evaluate(() => {
        document.querySelector('button[data-tracking-control-name="infinite-scroller_show-more"]').click();
      });
    }
    // Fetch and collect data
    const data = await page.evaluate(() => {
      const allJobsArr = Array.from(document.querySelectorAll('a[data-tracking-control-name="public_jobs_jserp-result_search-card"]'));
      const namesAndUrls = allJobsArr.map(job => {
        return {
          name: job.innerText,
          url: job.href,
          path: job.pathname
        }
      });
      return namesAndUrls;
    });
    scroll();
  } catch (err) {
    console.log(err);
  }
})();

The above script aims to open the specified URL, then scroll continuously until it reaches the end of the page. After that, I intend to output an array containing three properties for each job listing: name, url, and path. When I run the Immediately Invoked Function Expression (IIFE), I can scrape the initial 24-25 job postings that are displayed before any scrolling occurs.

However, the problem appears when I try to evaluate the entire document after all the scrolling has completed, using the 'data' expression: I still get only those initial results.

I have made several attempts and thoroughly analyzed the script's behavior, yet I am unable to find a solution. My ultimate objective is to iterate through every job posting visible after scrolling and log all the retrieved data with the desired properties to the console, not limiting to the first 24-25 results.

Appreciate any assistance provided.

Answer №1

After some investigation, I finally found why the script was only extracting the first 25 results. I first assumed it was a scope problem, as if two separate instances of 'page' were involved, but there is only one 'page'; the real cause is the order of execution. In the original script, the line 'const data = await page.evaluate(...)' ran as soon as it was reached, which was before scroll() was ever called. As a result, 'data' held a snapshot of the 24-25 listings present on the initial load, and the console.log(data) inside scroll() simply printed that stale snapshot. Moving the 'data' expression inside scroll(), after the loop, makes the extraction run against the fully loaded page. Thank you.
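
Before the full script, a minimal sketch of the timing issue may help. It does not use Puppeteer at all: 'createFakePage', 'scrollOnce', and 'demo' are hypothetical names invented for this illustration, and the 'items' array merely stands in for the job cards rendered in the DOM. Under those assumptions, it shows that a snapshot awaited before scrolling never sees results that load afterwards.

```javascript
// Hypothetical stand-in for a Puppeteer page (assumption, not the real API):
// `items` plays the role of the job cards currently rendered in the DOM.
function createFakePage() {
  return {
    items: ['job 1', 'job 2'],
    // Mimics page.evaluate(): returns a copy of what is rendered right now.
    async evaluate() {
      return [...this.items];
    },
    // Mimics one scroll step that causes another result to load.
    async scrollOnce() {
      this.items.push(`job ${this.items.length + 1}`);
    },
  };
}

async function demo() {
  const page = createFakePage();

  // Snapshot taken BEFORE scrolling, as in the original script.
  const early = await page.evaluate();

  await page.scrollOnce();
  await page.scrollOnce();

  // Snapshot taken AFTER scrolling, as in the corrected script.
  const late = await page.evaluate();

  return { early, late };
}

demo().then(({ early, late }) => {
  console.log(early.length, late.length); // prints "2 4"
});
```

The early snapshot is not updated when more items appear, because the awaited result is a plain array that was computed once. The corrected script follows.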

const puppeteer = require('puppeteer');
const db = require('../db');
const Job = require('../models/job');

(async() => {
  try {
    const browser = await puppeteer.launch({
      headless: false,
      defaultViewport: null,
      // args: ['--no-zygote', '--no-sandbox']
    });
    const url = 'https://www.linkedin.com/jobs/search?keywords=Junior%20Software%20Developer&location=Indianapolis%2C%20IN&geoId=&trk=homepage-jobseeker_jobs-search-bar_search-submit&position=1&pageNum=0';

    // Open a new page (tab) in the browser
    const page = await browser.newPage();
    console.log(`Navigating to ${url}`);
    // waitUntil is an option of page.goto(), not of browser.newPage()
    await page.goto(url, { waitUntil: 'networkidle0' });

    // Scroll to the bottom of the page, click 'See More Jobs', and repeat
    let lastHeight = await page.evaluate('document.body.scrollHeight');
    const scroll = async() => {
      while (true) {
        await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
        // Plain delay; page.waitForTimeout() was removed in newer Puppeteer releases
        await new Promise(resolve => setTimeout(resolve, 2000));
        let newHeight = await page.evaluate('document.body.scrollHeight');
        if (newHeight === lastHeight) {
          break;
        }
        lastHeight = newHeight;
        // Await the click so it finishes before the next scroll
        await seeMoreJobs();
      }
      // Scrape the job cards only now, after all scrolling is done
      const data = await page.evaluate(() => {
        const allJobsArr = Array.from(document.querySelectorAll('a[data-tracking-control-name="public_jobs_jserp-result_search-card"]'));
        const namesAndUrls = allJobsArr.map(job => {
          return {
            name: job.innerText,
            url: job.href,
            path: job.pathname
          }
        });
        // Keep junior-level titles; the parentheses make url and path required for every keyword
        const juniorJobs = namesAndUrls.filter(function(job) {
          return (job.name.includes('Junior') || job.name.includes('Jr') || job.name.includes('Entry')) && job.url && job.path;
        });
        return juniorJobs;
      });
      console.log(data);
    }
    // Click 'See More Jobs' if the button is currently present
    const seeMoreJobs = async() => {
      await page.evaluate(() => {
        const button = document.querySelector('button[data-tracking-control-name="infinite-scroller_show-more"]');
        // The button is not always rendered; calling click() on null would throw
        if (button) button.click();
      });
    }
    // Await scroll() so that any error reaches the catch block below
    await scroll();
  } catch (err) {
    console.log(err);
  }
})();
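
One more detail in the filter is worth a small sketch. In JavaScript, && binds more tightly than ||, so a condition of the form a || b || c && url && path only requires url and path for the last keyword. The snippet below is an illustration under assumptions, not part of the script: 'isJuniorJob', 'KEYWORDS', and the sample 'jobs' array (with example.com URLs) are hypothetical.

```javascript
// Hypothetical helper and sample data, invented for this illustration.
const KEYWORDS = ['Junior', 'Jr', 'Entry'];

// Requires a keyword match AND a non-empty url AND a non-empty path.
function isJuniorJob(job) {
  const matchesKeyword = KEYWORDS.some(keyword => job.name.includes(keyword));
  return matchesKeyword && Boolean(job.url) && Boolean(job.path);
}

const jobs = [
  { name: 'Junior Developer', url: '', path: '' },
  { name: 'Entry Level Engineer', url: 'https://example.com/a', path: '/a' },
  { name: 'Senior Engineer', url: 'https://example.com/b', path: '/b' },
];

// Unparenthesized condition: parsed as a || b || (c && url && path).
const loose = jobs.filter(job =>
  job.name.includes('Junior') || job.name.includes('Jr') || job.name.includes('Entry') && job.url && job.path
);

// Explicit grouping through the helper.
const strict = jobs.filter(isJuniorJob);

console.log(loose.length, strict.length); // prints "2 1"
```

The unparenthesized version keeps 'Junior Developer' even though its url and path are empty, while the grouped version drops it. Which behavior is wanted depends on the data, so the grouping is best written out explicitly either way.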
