What is the best way to utilize Puppeteer to scrape a website for both titles and images, and then store them in a single object where the images are directly associated with their corresponding titles?

I have successfully extracted the image src and title into separate variables using the following code snippet:

const puppeteer = require("puppeteer");

(async () => {
  let theOfficeUrl =
    "https://www.cardboardconnection.com/funko-pop-the-office-vinyl-figures";

  let browser = await puppeteer.launch({
    headless: true,
    defaultViewport: null,
  });
  let page = await browser.newPage();

  // The options object must be passed inside the goto() call
  await page.goto(theOfficeUrl, { waitUntil: "networkidle2" });

  let data = await page.evaluate(() => {
    var images = Array.from(
      document.querySelectorAll("div.post_anchor_divs.gallery img")
    ).map((img) => img.src);

    // Extracting all h3 titles on the page
    var titles = Array.from(document.querySelectorAll("h3")).map(
      (title) => title.innerText
    );
    let forDeletion = ["", "Leave a Comment:"];
    titles = titles.filter((item) => !forDeletion.includes(item));

    return {
      images,
      titles,
    };
  });
  console.log("Running Scraper...");
  console.log({ data });
  console.log("======================");

  await browser.close();
})();

The output looks like this:

{
  data: {
    images: [Array of image srcs],
    titles: [Array of title text]
  }
}

However, I need them to be an array of objects with corresponding titles and image srcs as shown below:

{
  data: [
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    ....and so on
  ]
}


The challenge I am facing is that all the images and titles are within a single container div without individual identifiers. The titles are enclosed in h3 tags without class names, and the images are found in p tags or sometimes even within h3 tags. The specific website I am trying to scrape is:

https://www.cardboardconnection.com/funko-pop-yu-gi-oh-vinyl-figures

I aim to extract information from the Funko Pop Yu-Gi-Oh! Figures Gallery section where each funko pop name is accompanied by an image.

If you have any advice or tips on how to navigate this situation, I would greatly appreciate it!

Answer №1

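To combine the separate arrays in the data object into a single array of objects, loop over one array and pair the elements by index. This only works if both arrays have the same length and are in the same order: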

const data = {
    image: ["image1 src", "image2 src", "image3 src", "image4 src"],
    title: ["title1", "title2", "title3", "title4"]
};

// Build one object per index from the two parallel arrays
const data_new = [];
for (let i = 0; i < data.image.length; i++) {
  data_new.push({ image: data.image[i], title: data.title[i] });
}

Running this snippet yields:

data_new = [
    {
        "image": "image1 src",
        "title": "title1"
    },
    {
        "image": "image2 src",
        "title": "title2"
    },
    {
        "image": "image3 src",
        "title": "title3"
    },
    {
        "image": "image4 src",
        "title": "title4"
    }
]
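
The same index-based pairing can also be written with Array.prototype.map. The sketch below is only an illustration: the function name pairTitlesWithImages is hypothetical, and it uses the images/titles keys from the question rather than the image/title keys above. It assumes both arrays are in the same order, and it throws when their lengths differ, since a mismatch would silently attach images to the wrong titles.

```javascript
// Hypothetical helper: zips two parallel arrays into an array of objects.
// Assumes titles[i] and images[i] describe the same item.
function pairTitlesWithImages({ titles, images }) {
  if (titles.length !== images.length) {
    throw new Error(
      `Cannot pair ${titles.length} titles with ${images.length} images`
    );
  }
  return titles.map((title, i) => ({ title, image: images[i] }));
}

// Sample data standing in for the scraped result.
const scraped = {
  images: ["image1 src", "image2 src"],
  titles: ["title1", "title2"],
};

const paired = pairTitlesWithImages(scraped);
console.log(paired);
```

The length check matters for this page in particular, because the question's code collects every h3 on the page but only the images inside the gallery div, so the two arrays are not guaranteed to line up.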

Answer №2

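Read each image URL from the data-src attribute instead of src, which is where lazily loaded images usually keep their real URL. Selecting the h3 elements and taking the image from the element that follows each one keeps every title paired with its image: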

import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();

try {
  // Reuse the tab that the browser opens on launch
  const [page] = await browser.pages();

  await page.goto('https://www.cardboardconnection.com/funko-pop-the-office-vinyl-figures');

  const data = await page.evaluate(() => {
    // Keep only the gallery headings that have text
    const titles = Array.from(
      document.querySelectorAll("div.post_anchor_divs.gallery h3")
    ).filter(title => title.innerText !== '');

    return titles.map(title => ({
      title: title.innerText,
      // Assumes the image's container is the second sibling node after the h3
      // (usually a whitespace text node comes first); dataset.src reads data-src
      image: title.nextSibling.nextSibling.querySelector('img').dataset.src,
    }));
  });
  console.log(data);
} catch (err) {
  console.error(err);
} finally {
  await browser.close();
}
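
If the sibling layout varies (the question mentions images sometimes sitting inside the h3 itself), an alternative is to walk the h3 and img elements in document order and attach each image to the most recent title. This is a sketch under assumptions and has not been run against the site: pairInDocumentOrder is a hypothetical name, it assumes each title has at most one relevant image after it, it ignores empty headings, and it falls back to src when data-src is absent. The function only reads tagName, innerText, dataset, and src, so it is exercised here with plain objects. Inside page.evaluate it would need to be defined within the callback and given Array.from(document.querySelectorAll("div.post_anchor_divs.gallery h3, div.post_anchor_divs.gallery img")), which returns the elements in document order.

```javascript
// Hypothetical helper: pairs each title with the first image that follows it
// in document order. `nodes` are elements (or element-like objects).
function pairInDocumentOrder(nodes) {
  const items = [];
  let current = null; // most recent titled item, possibly still without an image

  for (const node of nodes) {
    if (node.tagName === "H3") {
      const title = (node.innerText || "").trim();
      // Ignore empty headings; they do not start a new item.
      if (title === "") continue;
      current = { title, image: null };
      items.push(current);
    } else if (node.tagName === "IMG" && current && current.image === null) {
      // Prefer the lazy-load attribute, fall back to the plain src.
      current.image = (node.dataset && node.dataset.src) || node.src || null;
    }
  }
  return items;
}

// Plain objects standing in for DOM elements.
const sampleNodes = [
  { tagName: "H3", innerText: "title1" },
  { tagName: "IMG", dataset: { src: "image1 src" }, src: "placeholder" },
  { tagName: "H3", innerText: "" },
  { tagName: "H3", innerText: "title2" },
  { tagName: "IMG", dataset: {}, src: "image2 src" },
  { tagName: "IMG", dataset: {}, src: "extra image" },
];

const pairedInOrder = pairInDocumentOrder(sampleNodes);
console.log(pairedInOrder);
```

A title with no following image keeps image: null instead of throwing, so a single odd entry does not abort the whole scrape.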
