Is there a way to convert HTML websites into JSON format using Cheerio and Puppeteer?

I'm just dipping my toes into the world of JS. My question: how exactly can I scrape an HTML website and store the result as a JSON file?

https://i.sstatic.net/BJtwO.png

The website in question is an examination test portal. It contains many elements with the class "view-question", like the one shown in the image above. My goal is to extract all of these questions into a JSON file named "data.json".

My current tools of choice are Puppeteer and Cheerio:

const puppeteer = require('puppeteer');
const url = 'http://tracnghiem.itrithuc.vn/tra-cuu-cau-hoi?grade=12&subject=1&level='
const cheerio = require('cheerio');
const jsonfile = require('jsonfile');
const request = require('request')


puppeteer
.launch()
.then(function(browser){
    return browser.newPage();
})
.then(function(page){
    return page.goto(url).then(function(){
        return page.content()
    })
})
.then(function(html){
    const $ = cheerio;
    $('.view-question', html).each(function(){
        jsonfile.writeFile('data.json',$(this).text())
    })
})
.catch(function(err){
    console.warn(err);

})

The resulting data looks something like this:

"\n\n                                            Câu 113915. Phần thực và phần ảo của số phức z=3+i lần lượt là:\n                                            \n                                            A. 3 và 1\n                                            B. 1 và 3\n\n                                                                                            C. 3 và 0\n                                                                                                                                        D. 3 và i\n                                            \n\n                                            \n\n                                                Câu trả lời đúng: Đáp án A\n                                                Hướng dẫn giải: \n                                                \n                                                                                                            Nếuz=a+bithì:\n+ Phần thực là a\n+ Phần thực là b\nSuy ra z=3+i có phần thực là 3, phần ảo là 1\n                                                    \n                                                \n                                                                                                                                            \n\n                                        "
         "
 \n\n                                        "
                            "
                                 "
     "
         \n\n                                        "

What puzzles me: why do all those \n sequences and long runs of spaces show up?

SOS! Please help!

UPDATE 1

Tried switching things up with $(this).html(), but no luck yet. Here's how it turned out:

"\n\n                                            <p><b>Câu 113915.</b> Phần thực và phần ảo của số phức $z=3+i$ lần lượt là:</p>\n                                            \n                                            <p><b>A.</b> 3 và 1</p>\n                                            <p><b>B.</b> 1 và 3</p>\n\n                                                                                            <p><b>C.</b> 3 và 0</p>\n                                                                                                                                        <p><b>D.</b> 3 và i</p>\n                                            \n\n                                            <div class=\"box-guide\" style=\"display: none;\" id=\"div-113915\">\n\n                                                <p>Câu trả lời đúng: <b>Đáp án A</b></p>\n                                                <p><i>Hướng dẫn giải: </i></p>\n                                                <div class=\"view-guide\" id=\"view-question-guide\">\n                                                                                                            <p>$Nếu z=a+bi  thì:$\n+ Phần thực là a\n+ Phần thực là b\nSuy ra $z=3+i$ có phần thực là 3, phần ảo là 1</p>\n                                                    \n                                                </div>\n                                                                                                                                            </div>\n\n                                        " <
/div>\n\n                                        " <
/div>\n\n                                        "
"\
n < /div>\n                                                                                                                                            </div > \n\ n " <
    /div>\n\n                                        " <
    /div>\n\n                                        "

Answer №1

Apologies if I have misunderstood the requirements.

  1. When using puppeteer, there is no need for cheerio (unless jQuery-style functions are essential): puppeteer can execute JavaScript in the document context and return the data directly, so the HTML source does not have to be re-parsed with cheerio.

  2. It appears that $(this).text() and $(this).html() behave similarly to element.textContent and element.innerHTML: they preserve the source markup as written, including all the extra whitespace and line breaks used for indentation. To obtain readable text, consider using element.innerText, which returns the text roughly as it is rendered. Note that innerText also skips content hidden with CSS, such as the display: none answer box in this page.
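If cheerio has to stay in the pipeline, the extra whitespace can also be collapsed by hand. The following is only a minimal sketch: normalizeWhitespace is a hypothetical helper name, not part of cheerio or puppeteer, and the sample string is a shortened stand-in for the output shown in the question.

```javascript
// Hypothetical helper (not a cheerio or puppeteer API): collapse every run of
// whitespace, including newlines, into a single space and trim both ends.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Shortened sample shaped like the textContent-style output in the question.
const raw = '\n\n      Câu 113915. Phần thực và phần ảo\n      \n      A. 3 và 1\n      B. 1 và 3\n\n  ';

console.log(normalizeWhitespace(raw));
```

In the question's code this could be applied as normalizeWhitespace($(this).text()). Unlike innerText, it flattens every line break, so the options end up on the same line as the question.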

One suggested code variant:

const puppeteer = require('puppeteer');
const { writeFileSync } = require('fs');

(async function main() {
  let browser;
  try {
    browser = await puppeteer.launch();
    // Reuse the blank tab that the browser opens at launch.
    const [page] = await browser.pages();

    await page.goto('http://tracnghiem.itrithuc.vn/tra-cuu-cau-hoi?grade=12&subject=1&level=');

    // Runs inside the page: collect the rendered text of every question.
    const data = await page.evaluate(() => {
      return Array.from(
        document.querySelectorAll('.view-question'),
        element => element.innerText
      );
    });

    // Write the array of strings as pretty-printed JSON.
    writeFileSync('data.json', JSON.stringify(data, null, '  '));
  } catch (err) {
    console.error(err);
  } finally {
    // Close the browser even if one of the steps above failed.
    if (browser) await browser.close();
  }
})();

This will produce the following JSON output:

[
  "Question 115340: Given a, b, c as real numbers and z=-\n1\n\n\n2\n+i\n√\n3\n\n\n2\n. The value of (a+bz+cz2)(a+bz2+cz) equals\n\nA. a+b+c.\n\nB. a2+b2+c2−ab−bc−ca.\n\nC. a2+b2+c2+ab+bc+ca.\n\nD. a2+b2+c2+ab+bc+ca.",
  "Question 115339: Calculate the sum S of the real parts of all complex numbers z satisfying the condition \nˉ\nz\n=\n√\n3\nz2.\n\nA. S=\n√\n3\n.\n\nB. S=\n√\n3\n\n\n6\n.\n\nC. S=\n2\n√\n3\n\n\n3\n.\n\nD. S=\n√\n3\n\n\n3\n.",
  
  // Additional questions removed for brevity
  
  "Question 113915: The real and imaginary parts of the complex number z=3+i respectively are:\n\nA. 3 and 1\n\nB. 1 and 3\n\nC. 3 and 0\n\nD. 3 and i"
]
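Each entry above is still one flat string. If separate fields are wanted in data.json, the text can be split further. The sketch below is only an illustration: parseQuestion and the field names id, question and options are assumptions, and it relies on each option starting on its own line with "A." to "D.", as in the last entry of the output above.

```javascript
// Hypothetical parser; the field names (id, question, options) are assumptions.
function parseQuestion(text) {
  // Keep only non-empty lines, trimmed of surrounding whitespace.
  const lines = text.split('\n').map(line => line.trim()).filter(Boolean);
  const stem = [];
  const options = {};
  let current = null; // letter of the option currently being read, if any

  for (const line of lines) {
    const match = /^([A-D])\.\s*(.*)$/.exec(line);
    if (match) {
      current = match[1];
      options[current] = match[2];
    } else if (current) {
      // Continuation of an option that spans several lines (e.g. a formula).
      options[current] = (options[current] + ' ' + line).trim();
    } else {
      stem.push(line);
    }
  }

  // Split a leading "<word> <number>:" or "<word> <number>." off the stem.
  const stemText = stem.join(' ');
  const header = /^\S+\s+(\d+)[.:]\s*([\s\S]*)$/.exec(stemText);
  return {
    id: header ? header[1] : null,
    question: header ? header[2] : stemText,
    options
  };
}

const sample = 'Question 113915: The real and imaginary parts of the complex number z=3+i respectively are:\n\nA. 3 and 1\n\nB. 1 and 3\n\nC. 3 and 0\n\nD. 3 and i';
console.log(JSON.stringify(parseQuestion(sample), null, 2));
```

With the script above, this would amount to mapping the array through the parser before writing it, for example data.map(parseQuestion). A question line that itself begins with "A." to "D." would be misread as an option, so treat this as a starting point rather than a robust parser.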
