Can Node be prompted to utilize surrogate pairs for writing Unicode characters in JSON when creating a file?

Question

From what I have read on this topic, JSON is supposed to be written with surrogate pairs automatically.

However, that has not been my experience.

When I run the code below on Node.js 6.9.2, some characters are still not encoded as surrogate pairs in the output file.

const fs = require('fs')

const infile = fs.readFile('raw.json', 'utf8', (err, data) => {
    if (err) {
        throw err
    }

    data = JSON.stringify(data)

    fs.writeFile('final.json', data, 'utf8', (err) => {
        if (err) {
            throw err
        }
        console.log('done')
    })

})

In my text editor, which has good Unicode support and uses a font with glyphs for all the characters, I can see special characters such as "題" in the contents of raw.json.

In final.json those characters appear unchanged; they have not been converted into surrogate pairs.

I also tried switching the output encoding from utf8 to utf16le, but that did not help either.

Is there a way to force the use of surrogate pairs during JSON encoding?

javascript json unicode utf-8

Answer 1

The question as stated can be misleading, because it assumes that JSON.stringify converts Unicode characters outside the Basic Multilingual Plane (BMP) into a sequence of \u-escaped surrogate pair values. It does not. Another answer explains this more clearly: JSON.stringify escapes only the backslash (\), the double quotation mark ("), and control characters.
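A quick way to check this behaviour is sketched below. The `sample` string is made up for illustration; it mixes characters that JSON.stringify escapes with characters it leaves alone.

```javascript
// A quote, a backslash, a newline, a BMP character and an emoji.
const sample = 'a"b\\c\n題😍';

const encoded = JSON.stringify(sample);

// The quote, backslash and newline are escaped; 題 and 😍 are left as-is.
console.log(encoded); // "a\"b\\c\n題😍"

// No \u escape was produced for the non-ASCII characters.
console.log(encoded.includes('\\u')); // false
```

JSON.parse(encoded) gives back the original string, so nothing is lost by leaving these characters unescaped.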

As a result, a character that takes more than one octet to encode (such as the '題' used as an example) is written to the output file as that character, not as an escape sequence. If the file is written successfully and then read back using the same encoding it was written with, whether UTF-8 or UTF-16, the character should appear as intended.
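The following is a minimal sketch of such a write-then-read round trip, assuming the file is read back with the same encoding it was written with. The scratch file name and the `original` object are hypothetical.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical scratch file; any writable path would do.
const tmpFile = path.join(os.tmpdir(), 'surrogate-demo.json');

const original = { title: '題', mood: '😍' };

// Written as UTF-8: the non-ASCII characters go out as raw bytes, not \u escapes.
fs.writeFileSync(tmpFile, JSON.stringify(original), 'utf8');

// Read back with the same encoding, then parse.
const roundTripped = JSON.parse(fs.readFileSync(tmpFile, 'utf8'));

console.log(roundTripped.title === original.title); // true
console.log(roundTripped.mood === original.mood);   // true

fs.unlinkSync(tmpFile); // clean up the scratch file
```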

If the goal is to convert JSON text to ASCII, using \u escapes for non-ASCII values and surrogate pairs for characters outside the BMP, then the already-serialized JSON string can be processed one character at a time. A simple range check is enough, because JSON.stringify has already taken care of the quote, backslash, and control characters:

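var jsonComponent = '"2®π≤題😍"'; // example: an already-serialized JSON string

// Replace every UTF-16 code unit at or above 0x7F with a \uXXXX escape.
// Characters outside the BMP are stored as two code units, so they come
// out as a surrogate pair of escapes.
function jsonToAscii(jsonText) {
    var s = "";

    for (var i = 0; i < jsonText.length; ++i) {
        var c = jsonText[i];
        if (c >= '\x7F') {
            c = c.charCodeAt(0).toString(16);
            // Left-pad the hex value to four digits.
            switch (c.length) {
              case 2: c = "\\u00" + c; break;
              case 3: c = "\\u0" + c; break;
              default: c = "\\u" + c; break;
            }
        }
        s += c;
    }
    return s;
}

console.log(jsonToAscii(jsonComponent)); // "2\u00ae\u03c0\u2264\u984c\ud83d\ude0d"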

This approach works because JavaScript strings are already UTF-16 (surrogate pairs included), even though array-notation lookup and the .charAt method access them as a sequence of 16-bit code units, as in UCS-2. Note that '題' lies inside the BMP and needs only two octets in UTF-16, whereas the emoji lies outside plane 0 and needs four octets in UTF-16.
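These sizes can be checked directly. The sketch below uses the two characters from the example; the variable names are only for illustration.

```javascript
const bmpChar = '題';    // U+984C, inside the BMP
const astralChar = '😍'; // U+1F60D, outside the BMP

// .length counts 16-bit code units, not characters.
console.log(bmpChar.length);    // 1
console.log(astralChar.length); // 2

// The two code units of the emoji are its surrogate pair.
const units = [astralChar.charCodeAt(0), astralChar.charCodeAt(1)]
  .map((n) => n.toString(16));
console.log(units); // [ 'd83d', 'de0d' ]

// Size in octets when encoded as UTF-16.
console.log(Buffer.byteLength(bmpChar, 'utf16le'));    // 2
console.log(Buffer.byteLength(astralChar, 'utf16le')); // 4
```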

If ASCII-only output is not the goal, there is probably nothing to worry about.
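If ASCII-only output is what is wanted, the same idea can be applied to the workflow in the question. This is a sketch under assumptions, not a definitive implementation: the helper name `escapeNonAscii` and the `rawText` sample (standing in for the contents of raw.json) are made up. The sketch parses the text before calling JSON.stringify, because calling JSON.stringify on the raw file text, as the question's code does, wraps the whole file in a single JSON string.

```javascript
// Escape every UTF-16 code unit from 0x7F upward as \uXXXX.
// Without the "u" flag the regex matches surrogates one unit at a time,
// so characters outside the BMP become a pair of escapes.
function escapeNonAscii(jsonText) {
  return jsonText.replace(/[\u007f-\uffff]/g, (ch) =>
    '\\u' + ('0000' + ch.charCodeAt(0).toString(16)).slice(-4));
}

// Stand-in for the text read from raw.json.
const rawText = '{"title":"題","mood":"😍"}';

const value = JSON.parse(rawText);
const asciiJson = escapeNonAscii(JSON.stringify(value));

console.log(asciiJson); // {"title":"\u984c","mood":"\ud83d\ude0d"}

// The escaped text still parses back to the same data.
console.log(JSON.parse(asciiJson).mood === value.mood); // true
```

The resulting `asciiJson` string could then be passed to fs.writeFile in place of `data`.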
