I have successfully extracted the image srcs and titles into two separate arrays using the following code snippet:
const puppeteer = require("puppeteer");

(async () => {
  let theOfficeUrl =
    "https://www.cardboardconnection.com/funko-pop-the-office-vinyl-figures";
  let browser = await puppeteer.launch({
    headless: true,
    defaultViewport: null,
  });
  let page = await browser.newPage();
  // Pass the options object to goto() itself so waitUntil takes effect
  await page.goto(theOfficeUrl, { waitUntil: "networkidle2" });
  let data = await page.evaluate(() => {
    // Collect every image src inside the gallery container
    var images = Array.from(
      document.querySelectorAll("div.post_anchor_divs.gallery img")
    ).map((img) => img.src);
    // Extracting all h3 titles on the page
    var titles = Array.from(document.querySelectorAll("h3")).map(
      (title) => title.innerText
    );
    // Drop empty headings and the comment-form heading
    let forDeletion = ["", "Leave a Comment:"];
    titles = titles.filter((item) => !forDeletion.includes(item));
    return {
      images,
      titles,
    };
  });
  console.log("Running Scraper...");
  console.log({ data });
  console.log("======================");
  await browser.close();
})();
The output looks like this:
{
  data: {
    images: [Array of image srcs],
    titles: [Array of title text]
  }
}
However, I need a single array of objects, each pairing a title with its corresponding image src, as shown below:
{
  data: [
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    ....and so on
  ]
}
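If the two arrays were guaranteed to line up by index, the conversion itself could be sketched roughly as follows. This is only an illustration: `pairTitlesWithImages` is a hypothetical helper name, the sample values are placeholders rather than real scraped data, and the sketch assumes that `titles[i]` and `images[i]` belong to the same figure.

```javascript
// Hypothetical helper: zips two parallel arrays into
// [{ item: { title, image } }, ...].
// Assumption: titles[i] and images[i] describe the same figure.
function pairTitlesWithImages(titles, images) {
  // Stop at the shorter array so no entry ends up with an undefined field
  const count = Math.min(titles.length, images.length);
  const result = [];
  for (let i = 0; i < count; i++) {
    result.push({ item: { title: titles[i], image: images[i] } });
  }
  return result;
}

// Placeholder data standing in for the scraped arrays
const sample = pairTitlesWithImages(
  ["Figure A", "Figure B"],
  ["https://example.com/a.jpg", "https://example.com/b.jpg"]
);
console.log(JSON.stringify({ data: sample }, null, 2));
```

This only helps when both arrays have the same length and order, which I cannot rely on here.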
The challenge I am facing is that all the images and titles are within a single container div without individual identifiers. The titles are enclosed in h3 tags without class names, and the images are found in p tags or sometimes even within h3 tags. The specific website I am trying to scrape is:
https://www.cardboardconnection.com/funko-pop-yu-gi-oh-vinyl-figures
I want to extract the entries from the Funko Pop Yu-Gi-Oh! Figures Gallery section, where each Funko Pop name is accompanied by an image.
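One idea I have been considering, sketched below under several assumptions, is to walk the headings and images in document order and attach each image to the most recent non-empty heading. `groupByHeading` is a hypothetical helper, and the plain objects in `sampleNodes` are placeholders that mimic the `tagName`, `innerText`, and `src` properties of real elements. The sketch assumes that each image follows its title in the markup and that only the first image after a title is wanted.

```javascript
// Hypothetical helper: groups a document-ordered list of h3/img nodes
// into [{ item: { title, image } }, ...].
// Assumption: an image always appears after the heading it belongs to.
function groupByHeading(nodes) {
  const items = [];
  let current = null;
  for (const node of nodes) {
    const tag = String(node.tagName).toUpperCase();
    if (tag === "H3") {
      const title = (node.innerText || "").trim();
      // An h3 with no text (for example one that only wraps an image)
      // does not start a new item; its image goes to the previous title.
      if (!title) continue;
      current = { item: { title, image: null } };
      items.push(current);
    } else if (tag === "IMG" && current && current.item.image === null) {
      // Keep only the first image found after each title
      current.item.image = node.src;
    }
  }
  // Drop headings that never received an image
  return items.filter((entry) => entry.item.image !== null);
}

// Placeholder nodes standing in for real DOM elements
const sampleNodes = [
  { tagName: "H3", innerText: "Figure A" },
  { tagName: "IMG", src: "https://example.com/a.jpg" },
  { tagName: "H3", innerText: "Figure B" },
  { tagName: "H3", innerText: "" },
  { tagName: "IMG", src: "https://example.com/b.jpg" },
  { tagName: "H3", innerText: "Leave a Comment:" },
];
console.log(JSON.stringify({ data: groupByHeading(sampleNodes) }, null, 2));
```

Inside `page.evaluate`, the function would have to be defined within the callback, and the node list could come from something like `Array.from(document.querySelectorAll("div.post_anchor_divs.gallery h3, div.post_anchor_divs.gallery img"))`, since `querySelectorAll` returns elements in document order. Whether the headings actually sit inside that container on this page is an assumption I have not verified.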
If you have any advice or tips on how to navigate this situation, I would greatly appreciate it!