PDF.js returns the entire document's text as the text content of each individual page

I'm working on a project where I'm using PDF.js in a client-side app to analyze the content of selected PDF files. However, I've encountered an unexpected issue.

Initially, everything appears to be functioning correctly. The code successfully loads the PDF.js PDF object, iterates through the pages of the document, and retrieves the textContent for each page.

Upon examining the data in the browser tools after running the code provided below, I've discovered that the textContent object for each page actually contains the entire document's text, rather than just the content from the specific page.

Has anyone else run into this issue before?

I obtained (and modified) most of the code from resources related to PDF.js on this platform, and it seems relatively straightforward and performs as expected—except for this particular problem:

testLoop: function (event) {
    var file = event.target.files[0];
    var fileReader = new FileReader();
    // Register the handler before starting the read.
    fileReader.onload = function () {
        // PDF.js accepts the file contents as a typed array.
        var typedArray = new Uint8Array(this.result);
        PDFJS.getDocument(typedArray).then(function (pdf) {
            // Page numbers are 1-based.
            for (var i = 1; i <= pdf.numPages; i++) {
                pdf.getPage(i).then(function (page) {
                    page.getTextContent().then(function (textContent) {
                        // Logged in completion order, which may differ from page order.
                        console.log(textContent);
                    });
                });
            }
        });
    };
    fileReader.readAsArrayBuffer(file);
},

Furthermore, the size of the returned textContent objects varies slightly for each page, even though they all contain the same final piece of text - the last part of the whole document.

The screenshot of my inspector (linked below) shows that the objects differ slightly in size.

From manually inspecting the objects in the inspector, I can tell that Page #1, for instance, should only have about 140 array items. So why does the object for that page contain roughly 700 items? And why do the sizes differ from page to page?

https://i.sstatic.net/dQ2Ds.jpg

Answer №1

It seems like the issue at hand is related to the structure of the PDF file I'm attempting to analyze. The PDF comprises government records in a tabular layout, which appears to deviate from the modern PDF standards.

After testing the script with other PDF documents (that adhere to proper formatting), I found that the Page textContent objects are accurately segmented based on their respective Pages.
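To check whether a given PDF shows the same symptom, something like the following rough sketch might help. It assumes that `pdf` is the object from the question's code (with `numPages`, `getPage` and `getTextContent`) and that each textContent exposes an `items` array whose entries have a `str` field, which may not hold for every PDF.js version. The function names `collectTextContents` and `pagesShareLastItem` are made up for illustration.

```javascript
// Collect the textContent of every page, in page order.
// `pdf` is assumed to be the object that PDFJS.getDocument() resolves to
// in the question's code.
function collectTextContents(pdf) {
    var pending = [];
    for (var i = 1; i <= pdf.numPages; i++) {
        pending.push(pdf.getPage(i).then(function (page) {
            return page.getTextContent();
        }));
    }
    // Promise.all keeps the results in the same order as the requests.
    return Promise.all(pending);
}

// Hypothetical helper: true when every page ends with the same text item,
// which is the symptom described in the question.
// Assumes each textContent has an `items` array of objects with a `str` field.
function pagesShareLastItem(textContents) {
    if (textContents.length < 2) {
        return false;
    }
    var lastStrings = textContents.map(function (tc) {
        var last = tc.items[tc.items.length - 1];
        return last ? last.str : undefined;
    });
    return lastStrings.every(function (s) {
        return s !== undefined && s === lastStrings[0];
    });
}
```

Note that a document with an identical footer on every page would also make `pagesShareLastItem` return true, so treat the result as a hint rather than proof.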

If others encounter similar issues in the future, there are a couple of potential solutions that come to mind:

  1. Attempt to reformat the faulty PDF to comply with current standards before processing it. However, I'm not sure how feasible this is or how to go about it.

  2. Opt for the largest Page textContent object among those retrieved (as they all contain the document's complete text) and perform necessary operations on that specific object.
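The second option could be sketched roughly as follows. This is only an illustration: it assumes you already have an array of the per-page textContent objects (however they were collected) and that each one exposes an `items` array. The name `pickLargestTextContent` is hypothetical.

```javascript
// Hypothetical helper: return the textContent object with the most items.
// Assumes every entry has an `items` array; returns null for an empty list.
function pickLargestTextContent(textContents) {
    var largest = null;
    textContents.forEach(function (tc) {
        // On a tie, the earliest page wins because only a strictly
        // larger object replaces the current candidate.
        if (largest === null || tc.items.length > largest.items.length) {
            largest = tc;
        }
    });
    return largest;
}
```

The returned object can then be processed as if it were the text of the whole document.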
