Puppeteer-created PDF text may result in strange characters when copied and pasted

Question

I created the attached PDF with the most recent version of Puppeteer. When I copy text from it in Adobe Acrobat and paste it elsewhere, a string that is displayed as

This is a test string.

is pasted as

Țħįș įș ǻ țěșț șțřįňģ.

Below is the code snippet used for generating the PDF.

const puppeteer = require('puppeteer');
const argv = require('minimist')(process.argv.slice(2));
const fileName = argv.fileName || "page";
const timeout = 90; // seconds to wait for the report data

(async () => {
  const pageUrl = "my-url-here";
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Report the timeout, close the browser, and exit with a failure code.
  async function onTimeout() {
    console.log("Timed out waiting for data after " + timeout + " seconds.");
    await browser.close();
    process.exit(1);
  }

  console.log("Opening " + pageUrl + " and waiting for page to load...");
  await page.goto(pageUrl, {waitUntil: 'networkidle2'});

  console.log("Waiting for data to load...");
  await page.waitForSelector('#print-report-loaded', {timeout: timeout * 1000}).catch(onTimeout);

  const fileFullName = fileName + ".pdf";
  console.log("Saving PDF as " + fileFullName);
  await page.pdf({path: fileFullName});
  console.log("PDF saved successfully as " + fileFullName);

  await browser.close();
})();

Click here to view the generated PDF

If you have any suggestions on how to resolve this issue, please feel free to share. Your help is greatly appreciated!

javascript pdf puppeteer

Answer 1

Acrobat does not alter the text; it simply copies the Unicode characters that the PDF stores for these fonts. The glyphs you see are Type 3 outlines that look like "normal" characters, but the Unicode code points assigned to them are those of heavily accented characters.

As far as both Acrobat Reader and the official PDF specification are concerned, everything is working as intended.

Let's delve into your PDF file.

One font would seem to be enough, but your tool adds unnecessary complexity by generating two: F0, whose ToUnicode table maps character codes to the following Unicode characters,

<(01)> <( )>
<(0D)> <(.)>
<(26)> <(Ț)>
<(32)> <(ǻ)>
<(35)> <(ě)>
<(37)> <(ģ)>
<(38)> <(ħ)>
<(39)> <(į)>
<(3E)> <(ň)>
<(42)> <(ț)>

and F1 mapping to

<(15)> <(ř)>
<(16)> <(ș)>

The character codes are written out as a string, one character at a time (with some other commands interspersed, omitted here for brevity):

<26><38><39>{16}<01><39>{16}<01><32><01><42><35>{16}<42><01>{16}<42>{15}<39><3E><37><0D>

Hex codes in <..> are drawn with font F0 and those in {..} with F1. Replacing each code with its Unicode character, one by one, gives the Unicode string:

Țħįș įș ǻ țěșț șțřįňģ.
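The substitution described above can be reproduced mechanically. The following is only a sketch: the two tables are typed in by hand from the listings in this answer, the <..>/{..} notation is this answer's shorthand rather than real PDF syntax, and the function name decodeShownString is made up for illustration.

```javascript
// ToUnicode entries as listed above, keyed by hex character code.
const toUnicodeF0 = {
  '01': ' ', '0D': '.', '26': 'Ț', '32': 'ǻ', '35': 'ě',
  '37': 'ģ', '38': 'ħ', '39': 'į', '3E': 'ň', '42': 'ț',
};
const toUnicodeF1 = { '15': 'ř', '16': 'ș' };

// The string in this answer's shorthand: <..> selects F0, {..} selects F1.
const shown =
  '<26><38><39>{16}<01><39>{16}<01><32><01><42><35>{16}<42><01>{16}<42>{15}<39><3E><37><0D>';

// Hypothetical helper: look each code up in the table of the font it uses.
function decodeShownString(s) {
  const tokens = s.match(/<[0-9A-F]{2}>|\{[0-9A-F]{2}\}/g) || [];
  return tokens
    .map((token) => {
      const code = token.slice(1, 3);
      const table = token[0] === '<' ? toUnicodeF0 : toUnicodeF1;
      return table[code];
    })
    .join('');
}

const decoded = decodeShownString(shown);
console.log(decoded);
```

A PDF viewer performs the equivalent lookup when text is copied, which is why the clipboard receives the accented string rather than the one drawn on the page.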

The "fonts" employed here are Type 3 PostScript fonts, entirely embedded inside the PDF. For instance, font F0 is described as

8 0 obj @ 1059      % "F0"
<<
  /Type     /Font
  /Subtype  /Type3
  /CIDToGIDMap  /Identity
  /CharProcs    
  <<
    /g0     11 0 R      % -> stream
    /g1     12 0 R      % -> stream
    /g26    14 0 R      % -> stream
    /g32    15 0 R      % -> stream
    /g35    16 0 R      % -> stream
    /g37    17 0 R      % -> stream
    /g38    18 0 R      % -> stream
    /g39    19 0 R      % -> stream
    /g3E    20 0 R      % -> stream
    /g42    21 0 R      % -> stream
    /gD     13 0 R      % -> stream
  >>
  /Encoding     
  <<
    /Type   /Encoding
    /Differences [ 0 /g0 /g1 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /gD /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0
        /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g26 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0
        /g0 /g32 /g0 /g0 /g35 /g0 /g37 /g38 /g39 /g0 /g0 /g0 /g0 /g3E /g0 /g0 /g0 /g42 ]
  >>
  /FirstChar    0
  /FontBBox     [ -1 202 598 -801 ]
  /FontDescriptor 10 0 R        
  /FontMatrix   [ 0.082254 0 0 -0.082254 0 0 ]
  /LastChar     66
  /ToUnicode    9 0 R       
  /Widths   [ 500 300 0 0 0 244 0 641 579 592 664 616 263 616 404 ]
>>
endobj

Most of this information is irrelevant here, except for the Encoding array, which associates character codes with glyph names, and the CharProcs dictionary, which connects those glyph names to the actual drawing instructions. To display a string, the viewer goes from "font name plus character code" to a glyph name in the encoding and from there to its drawing stream. To copy text, it looks the same character code up in the ToUnicode table to find the Unicode value reported for that character.
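As a rough model of these two lookups, the font can be pictured as three small tables. This is a simplified sketch, not a PDF parser: the object layout and the function names glyphStreamFor and copiedTextFor are invented for illustration, and only two character codes from the listing are filled in.

```javascript
// Simplified, hand-built model of font F0 (only two codes included).
const fontF0 = {
  // Encoding: character code -> glyph name (anything else falls back to g0).
  encoding: { 0x01: 'g1', 0x26: 'g26' },
  // CharProcs: glyph name -> reference to the stream with drawing instructions.
  charProcs: { g0: '11 0 R', g1: '12 0 R', g26: '14 0 R' },
  // ToUnicode: character code -> Unicode text reported on copy.
  toUnicode: { 0x01: ' ', 0x26: 'Ț' },
};

// Rendering path: code -> glyph name -> drawing stream.
function glyphStreamFor(font, code) {
  const glyphName = font.encoding[code] || 'g0';
  return font.charProcs[glyphName];
}

// Copy path: code -> Unicode text, independent of what the glyph looks like.
function copiedTextFor(font, code) {
  return font.toUnicode[code];
}

console.log(glyphStreamFor(fontF0, 0x26)); // stream that draws the glyph
console.log(copiedTextFor(fontF0, 0x26));  // text placed on the clipboard
```

The point of the model is that the two paths never meet: nothing forces the outline drawn for a code to agree with the Unicode value stored for it.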

The drawing instructions for each character (the streams referenced by each /gX entry) are ordinary move, line, and fill operators. That is nothing unusual, although other PDF engines usually embed the original font rather than only the literal drawing instructions.

The ToUnicode table, however, is what breaks copying. Instead of stating "character code 0x26 maps to U+0054 'Latin Capital Letter T'", it says "U+021A 'Latin Capital Letter T with Comma Below'", for no apparent reason. The substitution is clearly not random, since every letter maps to an accented variant of itself, which leaves one wondering why plain text would be encoded this way, unless someone out there is secretly pleased and thinking, "yes, this is what I had envisioned," which would imply intentional obfuscation.
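To see the pattern in the substitution, the displayed and pasted strings from the question can be compared position by position. This is a small sketch that assumes both strings have the same length and that every character is a single precomposed code point; the variable and function names are made up for illustration.

```javascript
const displayed = 'This is a test string.';
const copied = 'Țħįș įș ǻ țěșț șțřįňģ.';

// Format a character's code point in the usual U+XXXX notation.
const codePoint = (ch) =>
  'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');

// Collect each distinct plain -> accented substitution.
const substitutions = new Map();
for (let i = 0; i < displayed.length; i++) {
  if (displayed[i] !== copied[i]) {
    substitutions.set(displayed[i], copied[i]);
  }
}

for (const [plain, accented] of substitutions) {
  console.log(`${plain} ${codePoint(plain)} -> ${accented} ${codePoint(accented)}`);
}
```

Each plain letter is paired with exactly one accented look-alike, while the space and the period are left alone.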

The Puppeteer code on GitHub does not appear to generate PDFs itself; it relies on Chromium, which internally uses the Skia PDF engine (as indicated by the binary comment in the PDF header reading "D3 EB E9 E1", which is "Skia" with the highest bit of each byte set). This behavior was reported as a bug back in 2012; however, reports from 2017 suggest that fixing it is not considered urgent on their end.
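The reading of those four header bytes is easy to verify: masking off the top bit of each byte leaves plain ASCII. A minimal check, with the byte values taken from the sentence above:

```javascript
// The four bytes quoted from the PDF header.
const marker = [0xd3, 0xeb, 0xe9, 0xe1];

// Clear bit 7 of each byte and interpret the result as ASCII.
const ascii = String.fromCharCode(...marker.map((byte) => byte & 0x7f));

console.log(ascii); // Skia
```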
