Puppeteer fails to retrieve all available results after infinite scroll completes

Question

Below is the code snippet from my data scraping file:

const puppeteer = require('puppeteer');
const db = require('../db');
const Job = require('../models/job');

(async() => {
  try {
    const browser = await puppeteer.launch({
      headless: false,
      defaultViewport: null,
      // args: ['--no-zygote', '--no-sandbox']
    });
    const url = 'https://www.linkedin.com/jobs/search?keywords=Junior%20Software%20Developer&location=Indianapolis%2C%20IN&geoId=&trk=homepage-jobseeker_jobs-search-bar_search-submit&position=1&pageNum=0';

    // Initiate a new page in the browser
    const page = await browser.newPage({
      waitUntil: 'networkidle0'
    });
    console.log(`Navigating to ${url}`);
    await page.goto(url);

    // Scroll to the bottom of the page, click on 'See More Jobs', and repeat   
    let lastHeight = await page.evaluate('document.body.scrollHeight');
    const scroll = async() => {
      while (true) {
        await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
        await page.waitForTimeout(2000);
        let newHeight = await page.evaluate('document.body.scrollHeight');
        if (newHeight === lastHeight) {
          console.log('Done scrolling!');
          break;
        }
        lastHeight = newHeight;
        seeMoreJobs();
      }
      console.log(data);
    }
    // Click on 'See More Jobs'
    const seeMoreJobs = async() => {
      await page.evaluate(() => {
        document.querySelector('button[data-tracking-control-name="infinite-scroller_show-more"]').click();
      });
    }
    // Fetch and collect data
    const data = await page.evaluate(() => {
      const allJobsArr = Array.from(document.querySelectorAll('a[data-tracking-control-name="public_jobs_jserp-result_search-card"]'));
      const namesAndUrls = allJobsArr.map(job => {
        return {
          name: job.innerText,
          url: job.href,
          path: job.pathname
        }
      });
      return namesAndUrls;
    });
    scroll();
  } catch (err) {
    console.log(err);
  }
})();

The script above is meant to open the specified URL and keep scrolling until it reaches the end of the page. Once that is done, I want to output an array with three properties for each job listing: name, url (the link's href), and path. When the immediately invoked function expression (IIFE) runs, I can scrape the first 24-25 job postings, the ones displayed before any scrolling occurs.

However, the problem arises when I try to evaluate the whole document with the 'data' expression after all the scrolling has completed: I still get only those initial results.

I have tried several approaches and traced the script's behavior closely, but I cannot find a solution. My goal is to iterate over every job posting visible after scrolling and log the desired properties for all of them to the console, not just the first 24-25 results.

I would appreciate any help.

javascript express puppeteer infinite-scroll selectors-api

Answer 1

After some investigation, I found out why the script was extracting only the first 25 results. It appears to be a scope problem, as I suspected when I posted the question. By moving the 'data' expression inside the scroll() function, I made sure the same 'page' is processed throughout. Before that, it behaved as if there were two separate instances of 'page', which caused the discrepancy. If anyone can provide a more precise explanation, I would greatly appreciate it.

Here is the solution that worked for me. Thank you.

const puppeteer = require('puppeteer');
const db = require('../db');
const Job = require('../models/job');

(async() => {
  try {
    const browser = await puppeteer.launch({
      headless: false,
      defaultViewport: null,
      // args: ['--no-zygote', '--no-sandbox']
    });
    const url = 'https://www.linkedin.com/jobs/search?keywords=Junior%20Software%20Developer&location=Indianapolis%2C%20IN&geoId=&trk=homepage-jobseeker_jobs-search-bar_search-submit&position=1&pageNum=0';

    // Open a new page in the browser
    const page = await browser.newPage();
    console.log(`Navigating to ${url}`);
    // waitUntil is a page.goto() option; passing it to browser.newPage() has no effect
    await page.goto(url, { waitUntil: 'networkidle0' });

    // Scrolling to the bottom of the page, clicking on 'See More Jobs,' and repeating   
    let lastHeight = await page.evaluate('document.body.scrollHeight');
    const scroll = async() => {
      while (true) {
        await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
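        // Pause so new results can load (plain timer; does not rely on page.waitForTimeout)
        await new Promise(resolve => setTimeout(resolve, 2000));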
        let newHeight = await page.evaluate('document.body.scrollHeight');
        if (newHeight === lastHeight) {
          break;
        }
        lastHeight = newHeight;
        // Await the click so it completes before the next scroll
        await seeMoreJobs();
      }
      // Scraping all junior job titles
      const data = await page.evaluate(() => {
        const allJobsArr = Array.from(document.querySelectorAll('a[data-tracking-control-name="public_jobs_jserp-result_search-card"]'));
        const namesAndUrls = allJobsArr.map(job => {
          return {
            name: job.innerText,
            url: job.href,
            path: job.pathname
          }
        });
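        // Keep junior-level titles that also have a URL and a path.
        // Without the grouping, && binds tighter than || and only 'Entry' matches are checked for url/path.
        const juniorJobs = namesAndUrls.filter(function(job) {
          const isJunior = job.name.includes('Junior') || job.name.includes('Jr') || job.name.includes('Entry');
          return isJunior && job.url && job.path;
        });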
        return juniorJobs;
      });
      console.log(data);
    }
    // Click 'See More Jobs' if the button is present
    const seeMoreJobs = async() => {
      await page.evaluate(() => {
        const button = document.querySelector('button[data-tracking-control-name="infinite-scroller_show-more"]');
        if (button) button.click();
      });
    }
    // Await scroll() so that any error it throws reaches the catch block below
    await scroll();
  } catch (err) {
    console.log(err);
  }
})();
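
A more precise explanation is probably execution order rather than scope. In the question's script, the line const data = await page.evaluate(...) runs to completion before scroll() is ever called, and page.evaluate returns a serialized snapshot of the document at that moment. The console.log(data) inside scroll() therefore prints the cards that existed before any scrolling. There is only one 'page' object throughout. The sketch below reproduces the effect without a browser; makeFakePage, scrollOnce, and readCards are hypothetical stand-ins invented for this illustration, not Puppeteer APIs.

```javascript
// Pause helper built on a plain timer.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for a page: starts with 2 cards and gains 2 more per scroll.
function makeFakePage() {
  const cards = ['job 1', 'job 2'];
  return {
    // Each scroll loads two more cards after a short delay.
    async scrollOnce() {
      await sleep(10);
      cards.push(`job ${cards.length + 1}`, `job ${cards.length + 2}`);
    },
    // Returns a copy of what is on the page right now (a snapshot, not a live view).
    async readCards() {
      return [...cards];
    },
  };
}

// Mirrors the question: the snapshot is taken before any scrolling happens.
async function scrapeBeforeScroll(page, rounds) {
  const data = await page.readCards(); // initial cards only
  for (let i = 0; i < rounds; i++) {
    await page.scrollOnce();
  }
  return data; // still the initial snapshot
}

// Mirrors the fix: finish scrolling first, then take the snapshot.
async function scrapeAfterScroll(page, rounds) {
  for (let i = 0; i < rounds; i++) {
    await page.scrollOnce();
  }
  return page.readCards();
}

async function main() {
  const early = await scrapeBeforeScroll(makeFakePage(), 3);
  const late = await scrapeAfterScroll(makeFakePage(), 3);
  console.log(early.length, late.length); // prints: 2 8
  return { early, late };
}

main();
```

Moving the 'data' evaluation inside scroll(), after the while loop, works for the same reason scrapeAfterScroll does: the snapshot is taken only once the page has finished growing.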
