Having difficulty retrieving text data from a web URL using JavaScript

Question

I am trying to extract text data from a PDF file at a web URL.

My approach involved using two node modules.

1) Using crawler-Request

it('Read Pdf Data using crawler',function(){
        const crawler = require('crawler-request');
        function response_text_size(response){
            response["size"] = response.text.length;
            return response;
        }
        crawler("http://www.africau.edu/images/default/sample.pdf",response_text_size).then(function(response){
            // handle response

            console.log("Response =" + response.size);
        });

    });

The issue here is that nothing is printed to the console, although I expected the response size to be logged.

2) Using pfd2json/pdfparser

it('Read Data from url',function(){
        var request = require('request');
        var pdf = require('pfd2json/pdfparser');
        var fs = require('fs');
        var pdfUrl = "http://www.africau.edu/images/default/sample.pdf";
        let databuffer = fs.readFileSync(pdfUrl);
        pdf(databuffer).then(function(data){
            var arr:Array<String> = data.text;
            var n = arr.includes('Thursday 02 May');
            console.log("Print Array " + n);
        });

    });

This fails with the following error:

Failed: ENOENT: no such file or directory, open ''

While I can access data from a local path successfully, extracting it from a URL seems to be causing issues.

javascript selenium protractor

Answer 1

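The problem is that you are using the fs (File System) module to read a file from a remote server. fs only reads from the local file system, so it treats the URL as a local path and fails with ENOENT.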
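To make the failure concrete, here is a minimal sketch that reproduces the error with the URL from the question and shows one way to tell a remote URL from a local path before deciding how to read it. The helper names isRemoteUrl and tryLocalRead are made up for illustration and are not part of any library.

```javascript
const fs = require('fs');

// Hypothetical helper: true when the string parses as an http(s) URL.
function isRemoteUrl(value) {
    try {
        const parsed = new URL(value);
        return parsed.protocol === 'http:' || parsed.protocol === 'https:';
    } catch (err) {
        return false; // not a parsable URL, so treat it as a local path
    }
}

// Hypothetical helper: try to read a string with fs and return the
// error code if the read fails, or 'ok' if it succeeds.
function tryLocalRead(pathOrUrl) {
    try {
        fs.readFileSync(pathOrUrl);
        return 'ok';
    } catch (err) {
        return err.code;
    }
}

const pdfUrl = 'http://www.africau.edu/images/default/sample.pdf';
console.log(isRemoteUrl(pdfUrl));         // remote URL: needs an HTTP client
console.log(isRemoteUrl('./sample.pdf')); // local path: fs is fine
console.log(tryLocalRead(pdfUrl));        // fs looks on the local disk and fails
```

A check like this can guard the call site: use fs only for local paths, and download remote files over HTTP first.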

You also made a mistake in how you use the pdf2json module, which would likely have resulted in an error as well.

Make sure you have imported the request module. This will enable you to fetch the file from the remote location. Here's one approach to achieve this:

it('Read Data from url', function (done) {
    var request = require('request');
    var PDFParser = require('pdf2json');

    var pdfUrl = 'http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf';

    // The second argument (1) tells pdf2json to keep the raw text for getRawTextContent()
    var pdfParser = new PDFParser(this, 1);

    // Executed if there's an error during parsing
    // (done.fail assumes Jasmine, which Protractor uses by default)
    pdfParser.on("pdfParser_dataError", errData => done.fail(errData.parserError));
    // Executed when parsing is complete
    pdfParser.on("pdfParser_dataReady", pdfData => {
        console.log(pdfParser.getRawTextContent());
        done(); // tell the test runner the asynchronous work has finished
    });

    // Download the pdf file as a raw Buffer (encoding: null) and pass it to the pdf parser
    request({ url: pdfUrl, encoding: null }, (error, response, body) => {
        if (error) {
            return done.fail(error); // network failure: there is no body to parse
        }
        pdfParser.parseBuffer(body);
    });
});

By following these steps, you should be able to access the remote .pdf file from your application.
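The encoding: null option in the request call matters. With it, request returns the body as a Buffer; without it, the body is decoded to a string first, which mangles binary data such as a PDF. The sketch below illustrates that effect using only Node's built-in Buffer (no network access). The sample bytes are invented, and looksLikePdf is a hypothetical helper, not part of request or pdf2json.

```javascript
// A PDF starts with the ASCII header "%PDF-" and typically also contains
// bytes above 0x7F. These sample bytes are invented for illustration.
const original = Buffer.from([0x25, 0x50, 0x44, 0x46, 0x2d, 0xe2, 0xe3, 0xcf, 0xd3]);

// Hypothetical sanity check to run on a downloaded body before parsing it.
function looksLikePdf(body) {
    return Buffer.isBuffer(body) &&
        body.subarray(0, 5).toString('latin1') === '%PDF-';
}

// Decode the binary data as UTF-8 text and encode it again: invalid byte
// sequences are replaced during decoding, so the original bytes are lost.
const roundTripped = Buffer.from(original.toString('utf8'), 'utf8');

console.log(looksLikePdf(original));            // a Buffer with the PDF header
console.log(looksLikePdf(original.toString())); // a string is rejected
console.log(original.equals(roundTripped));     // bytes no longer match
```

A check like looksLikePdf(body) could be placed just before pdfParser.parseBuffer(body) to catch the case where the server returned an HTML error page instead of a PDF.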

If you want to explore further, refer to the pdf2json documentation, which describes how to extract the text content of the .pdf file once parsing has completed.
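Once the raw text is available, the check from the question (whether the PDF contains a phrase such as 'Thursday 02 May') comes down to a string search. Here is a minimal sketch, assuming the extracted text is a single string in which a line break may fall in the middle of the phrase. The helpers normalizeWhitespace and containsPhrase and the sample text are made up for illustration.

```javascript
// Collapse runs of whitespace (including line breaks) into single spaces.
function normalizeWhitespace(text) {
    return text.replace(/\s+/g, ' ').trim();
}

// Hypothetical helper: does the extracted PDF text contain the phrase?
function containsPhrase(rawText, phrase) {
    return normalizeWhitespace(rawText).includes(normalizeWhitespace(phrase));
}

// Invented sample standing in for the output of pdfParser.getRawTextContent().
const sampleText = 'Report generated on Thursday 02\r\nMay at the main office';

console.log(containsPhrase(sampleText, 'Thursday 02 May')); // true
console.log(containsPhrase(sampleText, 'Friday 03 May'));   // false
```

In the test above, this could be called inside the pdfParser_dataReady handler, for example expect(containsPhrase(pdfParser.getRawTextContent(), 'Thursday 02 May')).toBe(true), assuming a Jasmine-style expect.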
