Puppeteer-created PDF text may result in strange characters when copied and pasted

After creating the attached PDF with the most recent version of Puppeteer, I noticed that text copied from it in Adobe Acrobat pastes as different characters. For example, the text

This is a test string.

is pasted as

Țħįș įș ǻ țěșț șțřįňģ.

Below is the code snippet used for generating the PDF.

const puppeteer = require('puppeteer');
const argv = require('minimist')(process.argv.slice(2));

const fileName = argv.fileName || "page";
const timeout = 90; // seconds to wait for the report data

(async () => {
  const pageUrl = "my-url-here";
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Close the browser and exit with a non-zero code if the data never arrives.
  async function onTimeout() {
    console.log("Timed out waiting for data after " + timeout + " seconds.");
    await browser.close();
    process.exit(1);
  }

  console.log("Opening " + pageUrl);
  console.log("Waiting for page to load...");
  await page.goto(pageUrl, {waitUntil: 'networkidle2'});

  // The page adds #print-report-loaded once its data has been rendered.
  console.log("Waiting for data to load...");
  await page.waitForSelector('#print-report-loaded', {timeout: timeout * 1000}).catch(onTimeout);

  const fileFullName = fileName + ".pdf";
  console.log("Saving PDF as " + fileFullName);
  await page.pdf({path: fileFullName});
  console.log("PDF saved successfully as " + fileFullName);

  await browser.close();
})();

Click here to view the generated PDF

If you have any suggestions on how to resolve this issue, please feel free to share. Your help is greatly appreciated!

Answer №1

Acrobat doesn't actually alter the text; it merely copies the Unicode characters stored for these fonts. The 'characters' displayed are Type 3 glyph outlines that look like "normal" characters, but the Unicode code points associated with them are, in fact, those of heavily accented characters.

According to both Acrobat Reader and the official PDF specifications, everything is functioning as intended.

Let's delve into your PDF file.

One font would have been enough here, yet, adding unnecessary complexity, your tool generated two: F0, which maps character codes to specific Unicode characters,

<(01)> <( )>
<(0D)> <(.)>
<(26)> <(Ț)>
<(32)> <(ǻ)>
<(35)> <(ě)>
<(37)> <(ģ)>
<(38)> <(ħ)>
<(39)> <(į)>
<(3E)> <(ň)>
<(42)> <(ț)>

and F1 mapping to

<(15)> <(ř)>
<(16)> <(ș)>

The page content stores the character codes as a string, one character at a time (with some commands interspersed, omitted here for brevity):

<26><38><39>{16}<01><39>{16}<01><32><01><42><35>{16}<42><01>{16}<42>{15}<39><3E><37><0D>

Hex codes enclosed in <..> belong to font F0 and those in {..} belong to F1. Replacing each code with its Unicode character, one by one, gives the Unicode string:

Țħįș įș ǻ țěșț șțřįňģ.
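The substitution above can be reproduced with a short script. This is only a sketch: the two tables are copied from the mappings listed earlier, while the function name decode and the bracket notation it parses are conveniences of this answer, not anything found in the PDF itself.

```javascript
// ToUnicode mappings for the two fonts, copied from the tables above.
const f0 = {
  0x01: ' ', 0x0D: '.', 0x26: 'Ț', 0x32: 'ǻ', 0x35: 'ě',
  0x37: 'ģ', 0x38: 'ħ', 0x39: 'į', 0x3E: 'ň', 0x42: 'ț',
};
const f1 = { 0x15: 'ř', 0x16: 'ș' };

// The character codes as written above: <..> is font F0, {..} is font F1.
const encoded =
  '<26><38><39>{16}<01><39>{16}<01><32><01><42><35>{16}<42><01>{16}<42>{15}<39><3E><37><0D>';

// Hypothetical helper: look each code up in the table of the font it belongs to.
function decode(input) {
  const pattern = /<([0-9A-Fa-f]{2})>|\{([0-9A-Fa-f]{2})\}/g;
  let out = '';
  for (const match of input.matchAll(pattern)) {
    const table = match[1] !== undefined ? f0 : f1;
    const code = parseInt(match[1] !== undefined ? match[1] : match[2], 16);
    // Codes without a mapping become U+FFFD (replacement character).
    out += table[code] !== undefined ? table[code] : '\uFFFD';
  }
  return out;
}

console.log(decode(encoded));
```

Running it prints the accented string shown above, which is exactly what ends up on the clipboard.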

The "fonts" employed here are Type 3 PostScript fonts, entirely embedded inside the PDF. For instance, Font #0 is described as

8 0 obj @ 1059      % "F0"
<<
  /Type     /Font
  /Subtype  /Type3
  /CIDToGIDMap  /Identity
  /CharProcs    
  <<
    /g0     11 0 R      % -> stream
    /g1     12 0 R      % -> stream
    /g26    14 0 R      % -> stream
    /g32    15 0 R      % -> stream
    /g35    16 0 R      % -> stream
    /g37    17 0 R      % -> stream
    /g38    18 0 R      % -> stream
    /g39    19 0 R      % -> stream
    /g3E    20 0 R      % -> stream
    /g42    21 0 R      % -> stream
    /gD     13 0 R      % -> stream
  >>
  /Encoding     
  <<
    /Type   /Encoding
    /Differences [ 0 /g0 /g1 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /gD /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0
        /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g26 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0
        /g0 /g32 /g0 /g0 /g35 /g0 /g37 /g38 /g39 /g0 /g0 /g0 /g0 /g3E /g0 /g0 /g0 /g42 ]
  >>
  /FirstChar    0
  /FontBBox     [ -1 202 598 -801 ]
  /FontDescriptor 10 0 R        
  /FontMatrix   [ 0.082254 0 0 -0.082254 0 0 ]
  /LastChar     66
  /ToUnicode    9 0 R       
  /Widths   [ 500 300 0 0 0 244 0 641 579 592 664 616 263 616 404 ]
>>
endobj

Most of this information is irrelevant here, except for the Encoding array, which maps character codes to glyph names, and the CharProcs dictionary, which maps those glyph names to the actual drawing instructions. When a string is displayed, this chain turns "font name plus character code" into a glyph name and then into a drawing procedure. Separately, the ToUnicode table maps the same character code to the Unicode value reported for that character.
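To make the two lookup paths concrete, here is a minimal sketch using two entries taken from the F0 listing and its Unicode mapping above. The object layout and the function name lookup are illustrative assumptions; a real PDF reader parses these structures from the file rather than from a literal like this.

```javascript
// Two character codes from font F0, with values copied from the listings above.
const f0 = {
  encoding: { 0x26: 'g26', 0x38: 'g38' },       // character code -> glyph name
  charProcs: { g26: '14 0 R', g38: '18 0 R' },  // glyph name -> drawing stream
  toUnicode: { 0x26: 'Ț', 0x38: 'ħ' },          // character code -> reported text
};

// Hypothetical helper showing that drawing and copying use separate tables.
function lookup(font, code) {
  const glyphName = font.encoding[code];
  return {
    glyphName,
    drawnBy: font.charProcs[glyphName], // what you see on the page
    copiedAs: font.toUnicode[code],     // what lands on the clipboard
  };
}

console.log(lookup(f0, 0x26));
```

Nothing ties drawnBy to copiedAs: the stream can draw a plain "T" while the ToUnicode entry reports a different character.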

The drawing instructions for each character (the streams referenced by each /gX entry) consist of routine move, line, and fill operators. There is nothing unusual about that, although other PDF engines usually embed the original font rather than only the literal drawing instructions.

However, the ToUnicode table is what breaks copying. Instead of stating "character code 16#26 (hex 26) maps to U+0054 Latin Capital Letter T", it states "U+021A Latin Capital Letter T with Comma Below", for no apparent reason! The translation is clearly not random, since every letter maps to an accented variant of the correct letter, which leaves one puzzled as to why plain text would be encoded this way... unless someone out there is secretly pleased and thinking, "yes, this is what I had envisioned," which would imply intentional obfuscation.
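If you only need to recover text already copied from such a PDF, one rough workaround is to fold the accented letters back to their base letters. The sketch below assumes every wrong character is an accented variant of the right one, as in the sample; the name foldAccents and the small extra table are hypothetical. Be aware that it would also strip accents that legitimately belong in the text.

```javascript
// Letters with no Unicode decomposition (such as h with stroke) need an
// explicit table; this one only covers what the sample string requires.
const EXTRA_FOLDS = { 'ħ': 'h', 'Ħ': 'H' };

function foldAccents(text) {
  return text
    .normalize('NFD')                           // split letters from combining marks
    .replace(/\p{M}/gu, '')                     // drop the combining marks
    .replace(/[ħĦ]/g, (ch) => EXTRA_FOLDS[ch]); // handle non-decomposable letters
}

console.log(foldAccents('Țħįș įș ǻ țěșț șțřįňģ.')); // prints "This is a test string."
```

This repairs the pasted text only; the PDF itself still carries the misleading ToUnicode table.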

The Puppeteer code on GitHub does not appear to generate PDFs itself, which suggests it relies on Chromium, which in turn uses the Skia PDF engine internally (as indicated by the binary comment in the PDF header, "D3 EB E9 E1", which spells "Skia" once the highest bit of each byte is cleared). The issue was reported as a bug back in 2012; however, reports from 2017 suggest that fixing it is not considered urgent on their end.
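The header bytes are easy to check by hand or with a couple of lines of code. This sketch simply clears the highest bit of each of the four bytes quoted above; the function name clearHighBits is made up for the example.

```javascript
// The four bytes quoted from the PDF's binary header comment.
const headerBytes = [0xD3, 0xEB, 0xE9, 0xE1];

// Hypothetical helper: clear bit 7 of every byte and read the result as ASCII.
function clearHighBits(bytes) {
  return String.fromCharCode(...bytes.map((b) => b & 0x7F));
}

console.log(clearHighBits(headerBytes)); // prints "Skia"
```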
