Acrobat does not actually alter the text: it copies out exactly the Unicode characters that the PDF stores for these fonts. The "characters" displayed are Type 3 outlines that look like normal letters, but the Unicode code points recorded for them are, in fact, those of heavily accented characters.
As far as Acrobat Reader and the official PDF specification are concerned, everything is working as intended.
Let's dig into your PDF file.
To complicate matters unnecessarily, where one would assume a single font suffices, your tool created two fonts: F0, which maps character codes to the following Unicode characters,
<(01)> <( )>
<(0D)> <(.)>
<(26)> <(Ț)>
<(32)> <(ǻ)>
<(35)> <(ě)>
<(37)> <(ģ)>
<(38)> <(ħ)>
<(39)> <(į)>
<(3E)> <(ň)>
<(42)> <(ț)>
and F1, which maps
<(15)> <(ř)>
<(16)> <(ș)>
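Both tables come straight from each font's ToUnicode stream. If you want to dump them from your own file, here is a minimal sketch using the pikepdf library (my choice of tool, not anything your toolchain requires; the filename is hypothetical, and it assumes the text sits on the first page and that the page carries its own /Resources):

import pikepdf

pdf = pikepdf.open('yourfile.pdf')            # hypothetical filename
fonts = pdf.pages[0]['/Resources']['/Font']
for name, font in fonts.items():
    print(name, font['/Subtype'])
    if '/ToUnicode' in font:
        # the CMap stream that tells viewers what text to copy out
        print(font['/ToUnicode'].read_bytes().decode('latin-1'))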
The character codes are written out as a string, one character at a time (with some operators in between, omitted here for brevity):
<26><38><39>{16}<01><39>{16}<01><32><01><42><35>{16}<42><01>{16}<42>{15}<39><3E><37><0D>
Hex codes enclosed in <..> correspond to font F0, and those in {..} to F1. Replacing each with its Unicode character, one by one, you arrive at the Unicode string:
Țħįș įș ǻ țěșț șțřįňģ.
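You can replay that decoding mechanically. Here is a small Python sketch, with the two ToUnicode tables hard-coded from the dumps above and the font switches taken from the content stream:

# ToUnicode tables of the two fonts, as listed above
F0 = {0x01: ' ', 0x0D: '.', 0x26: 'Ț', 0x32: 'ǻ', 0x35: 'ě',
      0x37: 'ģ', 0x38: 'ħ', 0x39: 'į', 0x3E: 'ň', 0x42: 'ț'}
F1 = {0x15: 'ř', 0x16: 'ș'}

# the character codes, each paired with the font active at that point
codes = [(F0, 0x26), (F0, 0x38), (F0, 0x39), (F1, 0x16), (F0, 0x01),
         (F0, 0x39), (F1, 0x16), (F0, 0x01), (F0, 0x32), (F0, 0x01),
         (F0, 0x42), (F0, 0x35), (F1, 0x16), (F0, 0x42), (F0, 0x01),
         (F1, 0x16), (F0, 0x42), (F1, 0x15), (F0, 0x39), (F0, 0x3E),
         (F0, 0x37), (F0, 0x0D)]

print(''.join(font[code] for font, code in codes))   # Țħįș įș ǻ țěșț șțřįňģ.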
The "fonts" employed here are Type 3 PostScript fonts, entirely embedded inside the PDF. For instance, Font #0 is described as
8 0 obj @ 1059 % "F0"
<<
/Type /Font
/Subtype /Type3
/CIDToGIDMap /Identity
/CharProcs
<<
/g0 11 0 R % -> stream
/g1 12 0 R % -> stream
/g26 14 0 R % -> stream
/g32 15 0 R % -> stream
/g35 16 0 R % -> stream
/g37 17 0 R % -> stream
/g38 18 0 R % -> stream
/g39 19 0 R % -> stream
/g3E 20 0 R % -> stream
/g42 21 0 R % -> stream
/gD 13 0 R % -> stream
>>
/Encoding
<<
/Type /Encoding
/Differences [ 0 /g0 /g1 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /gD /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0
/g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g26 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0 /g0
/g0 /g32 /g0 /g0 /g35 /g0 /g37 /g38 /g39 /g0 /g0 /g0 /g0 /g3E /g0 /g0 /g0 /g42 ]
>>
/FirstChar 0
/FontBBox [ -1 202 598 -801 ]
/FontDescriptor 10 0 R
/FontMatrix [ 0.082254 0 0 -0.082254 0 0 ]
/LastChar 66
/ToUnicode 9 0 R
/Widths [ 500 300 0 0 0 244 0 641 579 592 664 616 263 616 404 ]
>>
endobj
Most of this information is irrelevant here, except for the Encoding array, which associates character indexes with glyph names, and the CharProcs dictionary, which connects the names from the Encoding array to the actual drawing instructions. That is the chain followed when displaying a string: "font name plus character index" leads to a glyph name in the Encoding, and from there to a drawing procedure; separately, the ToUnicode table supplies the Unicode value that gets reported for each character index.
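For completeness, expanding a /Differences array into a per-code map follows one simple rule: a number sets the running character code, and every following name claims that code and advances it by one. A sketch of that rule in Python (the short test array is illustrative, not the full dump above):

def expand_differences(differences):
    # a PDF /Differences array -> {character code: glyph name}
    mapping = {}
    code = 0
    for item in differences:
        if isinstance(item, int):
            code = item            # a number resets the running code
        else:
            mapping[code] = item   # a name claims the code and advances it
            code += 1
    return mapping

# /gD lands on code 13 (16#D) and /g26 on code 38 (16#26),
# matching the glyph names, which simply spell out the hex codes:
print(expand_differences([0, 'g0', 'g1', 13, 'gD', 38, 'g26']))
# {0: 'g0', 1: 'g1', 13: 'gD', 38: 'g26'}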
The drawing instructions for each character (the streams referenced for each /gX) consist of plain move, line, and fill operations – business as usual, although other PDF producers usually embed the original font rather than only its literal drawing instructions.
It is the ToUnicode table, however, that sabotages copying. Instead of saying "character 16#26 maps to U+0054 Latin Capital Letter T", it declares it maps to "U+021A Latin Capital Letter T with Comma Below" – for no apparent reason! It is certainly not a random translation, which leaves one wondering why plain text gets deliberately encoded this way... unless someone, somewhere, is gleefully thinking "yes, this is exactly what I wanted", i.e. it is intentional obfuscation.
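If you only need to salvage text that was already copied out of such a file, and assuming the obfuscation never does more than pile diacritics onto plain Latin letters (true for this sample), Unicode normalization undoes most of it. Only overlaid strokes such as the bar of "ħ" need a small manual table, because they have no canonical decomposition:

import unicodedata

# diacritics attached as combining marks vanish under NFD decomposition;
# overlaid strokes do not decompose, so map those by hand
STROKES = {'ħ': 'h', 'Ħ': 'H', 'đ': 'd', 'Đ': 'D', 'ł': 'l', 'Ł': 'L'}

def strip_accents(text):
    text = ''.join(STROKES.get(c, c) for c in text)
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents('Țħįș įș ǻ țěșț șțřįňģ.'))   # This is a test string.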
The Puppeteer code on GitHub does not seem to do any PDF processing of its own, so presumably it defers to Chromium, which internally uses the Skia PDF engine (the binary comment near the top of the file reads D3 EB E9 E1 – "Skia" with the high bit of each byte set). This behaviour was reported as a bug back in 2012, but as late as 2017 it apparently was not deemed important enough to fix on their end.
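That origin is easy to sniff for yourself. A minimal check (assuming, as here, that the marker appears in a comment near the top of the file):

def made_by_skia(path):
    with open(path, 'rb') as f:
        head = f.read(1024)
    # 'Skia' with the high bit of every byte set: D3 EB E9 E1
    marker = bytes(b | 0x80 for b in b'Skia')
    return marker in head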