PDF.js extracts the full text content of the entire document, displaying it as the text content of each individual page

Question

I'm working on a project where I'm using PDF.js in a client-side app to analyze the content of selected PDF files. However, I've encountered an unexpected issue.

Initially, everything appears to be functioning correctly. The code successfully loads the PDF.js PDF object, iterates through the pages of the document, and retrieves the textContent for each page.

Upon examining the data in the browser tools after running the code provided below, I've discovered that the textContent object for each page actually contains the entire document's text, rather than just the content from the specific page.

Has anyone else run into this issue before?

I obtained (and modified) most of the code from resources related to PDF.js on this platform, and it seems relatively straightforward and performs as expected—except for this particular problem:

testLoop: function (event) {
    var file = event.target.files[0];
    var fileReader = new FileReader();
    // Attach the handler before starting the read.
    fileReader.onload = function () {
        var typedArray = new Uint8Array(this.result);
        PDFJS.getDocument(typedArray).then(function (pdf) {
            // PDF.js page numbers start at 1.
            for (var i = 1; i <= pdf.numPages; i++) {
                pdf.getPage(i).then(function (page) {
                    page.getTextContent().then(function (textContent) {
                        // Each promise resolves on its own, so the log order may not match the page order.
                        console.log(textContent);
                    });
                });
            }
        });
    };
    fileReader.readAsArrayBuffer(file);
},

Furthermore, the size of the returned textContent objects varies slightly from page to page, even though they all end with the same piece of text: the last part of the whole document.

The screenshot of my inspector below shows the sizes of the returned objects.

From inspecting the objects manually, I can tell that the data for Page #1, for instance, should contain only about 140 array items. So why does the object for that page include roughly 700 items? And why do the sizes differ between pages?

https://i.sstatic.net/dQ2Ds.jpg

javascript pdf.js

Answer 1

It seems like the issue at hand is related to the structure of the PDF file I'm attempting to analyze. The PDF comprises government records in a tabular layout, which appears to deviate from the modern PDF standards.

After testing the script with other PDF documents (that adhere to proper formatting), I found that the Page textContent objects are accurately segmented based on their respective Pages.

If others encounter similar issues in the future, there are a couple of potential solutions that come to mind:

Attempt to reformat the faulty PDF to comply with current standards before processing it. However, I am not sure whether this is feasible or how it would be done.
Opt for the largest Page textContent object among those retrieved (as they all contain the document's complete text) and perform necessary operations on that specific object.
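
The second workaround could be sketched roughly as follows. This is only an illustration, not a tested fix for the PDF in question: `pickLargestTextContent` and `textContentToString` are hypothetical helper names, and `samplePages` is made-up stand-in data. It assumes each textContent object has an `items` array whose entries carry a `str` field, which is the shape `getTextContent()` resolves with.

```javascript
// Hypothetical helper: given the textContent objects collected from every
// page, return the one with the most items (or null for an empty list).
function pickLargestTextContent(textContents) {
    var largest = null;
    for (var i = 0; i < textContents.length; i++) {
        var current = textContents[i];
        if (largest === null || current.items.length > largest.items.length) {
            largest = current;
        }
    }
    return largest;
}

// Hypothetical helper: join the `str` field of every item into one string.
function textContentToString(textContent) {
    return textContent.items.map(function (item) {
        return item.str;
    }).join(' ');
}

// Made-up stand-in data shaped like getTextContent() results.
var samplePages = [
    { items: [{ str: 'Record' }, { str: 'A' }] },
    { items: [{ str: 'Record' }, { str: 'A' }, { str: 'B' }] },
    { items: [{ str: 'Record' }] }
];

var largest = pickLargestTextContent(samplePages);
console.log(textContentToString(largest)); // prints "Record A B"
```

In the real script, the per-page textContent objects would first have to be gathered into one array, for example by pushing each `getTextContent()` promise into a list and waiting on it with `Promise.all`, before handing the result to the helper.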
