I'm working on a project where I'm using PDF.js in a client-side app to analyze the content of selected PDF files. However, I've encountered an unexpected issue.
Initially, everything appears to be functioning correctly. The code successfully loads the PDF.js PDF object, iterates through the pages of the document, and retrieves the textContent for each page.
However, when I examine the data in the browser's developer tools after running the code below, the textContent object for each page appears to contain the entire document's text, rather than just the content of that specific page.
Has anyone else run into this issue before?
I took (and adapted) most of the code from PDF.js-related posts on this site. It seems relatively straightforward and performs as expected, except for this particular problem:
    testLoop: function (event) {
        var file = event.target.files[0];
        var fileReader = new FileReader();
        fileReader.readAsArrayBuffer(file);
        fileReader.onload = function () {
            var typedArray = new Uint8Array(this.result);
            PDFJS.getDocument(typedArray).then(function (pdf) {
                for (var i = 1; i <= pdf.numPages; i++) {
                    pdf.getPage(i).then(function (page) {
                        page.getTextContent().then(function (textContent) {
                            console.log(textContent);
                        });
                    });
                }
            });
        }
    },
Furthermore, the size of the returned textContent objects varies slightly for each page, even though they all contain the same final piece of text - the last part of the whole document.
In my inspector, the objects all appear to be nearly the same size.
By manually inspecting the objects shown in the inspector, I've worked out that the data for Page #1, for instance, should only have about 140 array items. So why does the object for that page include roughly 700 items? And why do the sizes differ from page to page?
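To make the per-page counts easier to compare, the page number can be recorded alongside each result instead of logging the raw objects. This is only a minimal sketch: `collectItemCounts` is a hypothetical helper name (not part of PDF.js), and it assumes that `pdf` is the object resolved by `PDFJS.getDocument` as in my code above and that the text content exposes its entries as an `items` array, which may differ between PDF.js versions.

```javascript
// Hypothetical helper, not part of PDF.js.
// Resolves to [{ page, itemCount }, ...] in page order.
// Assumes pdf.getPage(n) and page.getTextContent() return promises
// and that the resolved text content has an `items` array.
function collectItemCounts(pdf) {
  var pending = [];
  for (var i = 1; i <= pdf.numPages; i++) {
    // The wrapper function gives each callback its own page number,
    // since `var i` is shared across all iterations of the loop.
    pending.push((function (pageNumber) {
      return pdf.getPage(pageNumber).then(function (page) {
        return page.getTextContent();
      }).then(function (textContent) {
        return { page: pageNumber, itemCount: textContent.items.length };
      });
    })(i));
  }
  // Promise.all keeps the results in the order the promises were pushed.
  return Promise.all(pending);
}
```

Inside the `getDocument` callback this could be called as `collectItemCounts(pdf).then(function (rows) { console.table(rows); });`, which would show whether Page #1 really reports about 700 items or whether the logged objects are just being matched to the wrong pages.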