What is the best way to utilize Puppeteer to scrape a website for both titles and images, and then store them in a single object where the images are directly associated with their corresponding titles?

Question

I have successfully extracted the image src and title into separate variables using the following code snippet:

const puppeteer = require("puppeteer");

(async () => {
  let theOfficeUrl =
    "https://www.cardboardconnection.com/funko-pop-the-office-vinyl-figures";

  let browser = await puppeteer.launch({
    headless: true,
    defaultViewport: null,
  });
  let page = await browser.newPage();

  // The options object must be passed as the second argument to goto()
  await page.goto(theOfficeUrl, { waitUntil: "networkidle2" });

  let data = await page.evaluate(() => {
    // Collect the src of every image in the gallery container
    var images = Array.from(
      document.querySelectorAll("div.post_anchor_divs.gallery img")
    ).map((img) => img.src);

    // Extracting all h3 titles on the page
    var titles = Array.from(document.querySelectorAll("h3")).map(
      (title) => title.innerText
    );
    // Drop empty headings and the comment-form heading
    let forDeletion = ["", "Leave a Comment:"];
    titles = titles.filter((item) => !forDeletion.includes(item));

    return {
      images,
      titles,
    };
  });
  console.log("Running Scraper...");
  console.log({ data });
  console.log("======================");

  await browser.close();
})();

The output looks like this:

{
  data: {
    images: [Array of image srcs],
    titles: [Array of title text]
  }
}

However, I need them to be an array of objects with corresponding titles and image srcs as shown below:

{
  data: [
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    ....and so on
  ]
}

The challenge I am facing is that all the images and titles are within a single container div without individual identifiers. The titles are enclosed in h3 tags without class names, and the images are found in p tags or sometimes even within h3 tags. The specific website I am trying to scrape is:

https://www.cardboardconnection.com/funko-pop-yu-gi-oh-vinyl-figures

I want to extract the entries in the "Funko Pop Yu-Gi-Oh! Figures Gallery" section, where each Funko Pop name is accompanied by an image.

If you have any advice or tips on how to navigate this situation, I would greatly appreciate it!

javascript web-scraping puppeteer

Answer №1

If the two arrays are in the same order and have the same length, you can combine them into a single array of objects by pairing the entries at each index:

const data = {
    images: ["image1 src", "image2 src", "image3 src", "image4 src"],
    titles: ["title1", "title2", "title3", "title4"]
};

// Pair each image with the title at the same index
const data_new = [];
for (let i = 0; i < data.images.length; i++) {
    data_new.push({ image: data.images[i], title: data.titles[i] });
}

This produces:

data_new = [
    {
        "image": "image1 src",
        "title": "title1"
    },
    {
        "image": "image2 src",
        "title": "title2"
    },
    {
        "image": "image3 src",
        "title": "title3"
    },
    {
        "image": "image4 src",
        "title": "title4"
    }
]
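
Pairing by index silently misaligns the results if the filter step drops a title or the page contains an extra image. As a minimal sketch (the helper name zipTitlesAndImages is hypothetical and not part of Puppeteer), you could check the lengths before pairing:

```javascript
// Hypothetical helper: pair two index-aligned arrays into { title, image } objects.
// Throws instead of returning misaligned pairs when the lengths differ.
function zipTitlesAndImages(titles, images) {
  if (titles.length !== images.length) {
    throw new Error(
      `Length mismatch: ${titles.length} titles vs ${images.length} images`
    );
  }
  return titles.map((title, i) => ({ title, image: images[i] }));
}

// Example with placeholder values
const paired = zipTitlesAndImages(
  ["title1", "title2"],
  ["image1 src", "image2 src"]
);
console.log(paired);
```

If this throws on the real page, the two arrays are not aligned there and pairing by index is not safe.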

Answer №2

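The images on this page appear to be lazy-loaded, so read each image's data-src attribute instead of src. Selecting the h3 elements inside the gallery and stepping to the element that follows each one keeps every image paired with its title: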

import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();

try {
  // Reuse the tab that the browser opens on launch
  const [page] = await browser.pages();

  await page.goto('https://www.cardboardconnection.com/funko-pop-the-office-vinyl-figures');

  const data = await page.evaluate(() => {
    // Only the non-empty h3 headings inside the gallery are titles
    const titles = Array.from(
      document.querySelectorAll("div.post_anchor_divs.gallery h3")
    ).filter(title => title.innerText !== '');

    return titles.map(title => ({
      title: title.innerText,
      // The image is expected in the element right after the heading;
      // fall back to null instead of throwing when it is missing
      image: title.nextElementSibling?.querySelector('img')?.dataset.src ?? null,
    }));
  });
  console.log(data);
} catch (err) {
  console.error(err);
} finally {
  await browser.close();
}
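
Since the question mentions that some images sit in p tags and others inside h3 tags, another option is to collect the h3 and img elements in document order and attach each image to the most recent non-empty title. The sketch below works on plain objects so it can run outside the browser. The function name pairInDocumentOrder, the node shape { tag, text, src, dataSrc }, and the selector shown in the comment are assumptions for illustration and have not been verified against the live page.

```javascript
// Hypothetical helper: `nodes` is a flat list of headings and images in
// document order. Each image goes to the most recent non-empty title.
// Inside page.evaluate() such a list could be built roughly like this
// (assumed selector, untested against the real site):
//   Array.from(document.querySelectorAll(
//     "div.post_anchor_divs.gallery h3, div.post_anchor_divs.gallery img"
//   )).map((el) => ({
//     tag: el.tagName,
//     text: el.innerText || "",
//     src: el.src || "",
//     dataSrc: el.dataset.src || "",
//   }));
function pairInDocumentOrder(nodes) {
  const items = [];
  let current = null;
  for (const node of nodes) {
    if (node.tag === "H3") {
      const title = node.text.trim();
      // Empty headings do not start a new item
      if (title !== "") {
        current = { title, image: null };
        items.push(current);
      }
    } else if (node.tag === "IMG" && current && current.image === null) {
      // Prefer the lazy-load attribute, fall back to the regular src
      current.image = node.dataSrc || node.src || null;
    }
  }
  return items;
}

// Example with placeholder values
const sample = [
  { tag: "H3", text: "First figure", src: "", dataSrc: "" },
  { tag: "IMG", text: "", src: "placeholder.gif", dataSrc: "first.jpg" },
  { tag: "H3", text: "", src: "", dataSrc: "" },
  { tag: "H3", text: "Second figure", src: "", dataSrc: "" },
  { tag: "IMG", text: "", src: "second.jpg", dataSrc: "" },
];
console.log(pairInDocumentOrder(sample));
```

Titles that have no image keep image: null, which makes gaps easy to spot. Only the first image after each title is kept.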
