Can a regular expression be created to specifically target and match a singular grapheme cluster?

Question

Characters as users perceive them, known as graphemes, can consist of multiple code points in Unicode.

According to Unicode® Standard Annex #29:

Users may perceive a character as a single unit of writing in a language, but it could actually be represented by several Unicode code points. This concept is called a user-perceived character to avoid confusion with the computer's use of the term character. For example, "G" + grave-accent forms a user-perceived character which consists of two Unicode code points. These characters are approximated by grapheme clusters that can be determined programmatically.
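To make the quoted example concrete, here is a minimal sketch that lists the code points behind "G" + grave accent. The helper name codePointsOf is hypothetical, introduced only for illustration.

```javascript
// List the Unicode code points that make up a string, formatted as U+XXXX.
// codePointsOf is a hypothetical helper name, not a built-in.
function codePointsOf(text) {
  // Array.from iterates a string by code point, not by UTF-16 code unit.
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// "G" followed by U+0300 COMBINING GRAVE ACCENT renders as one character.
const userPerceived = "G\u0300";
console.log(codePointsOf(userPerceived)); // [ 'U+0047', 'U+0300' ]
```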

Is there a regular expression (in JavaScript) that matches a single grapheme cluster? For example:

"한bar".match(/*?*/)[0] === "한"
"நிbaz".match(/*?*/)[0] === "நி"
"aa".match(/*?*/)[0] === "a"
"\r\n".match(/*?*/)[0] === "\r\n"
"💆‍♂️foo".match(/*?*/)[0] === "💆‍♂️"

javascript regex unicode

Answer 1

Full, easy-to-use built-in support: no. Approximations for various matching tasks: yes. From the regex tutorial:

To match a single grapheme, whether it consists of a single code point or multiple code points with combining marks, various programming languages like Perl, PCRE, PHP, Boost, Ruby 2.0, Java 9, and Just Great Software applications provide an easy solution using \X. Think of \X as the Unicode equivalent of the dot metacharacter. One key distinction is that while \X matches line break characters, the dot does not unless you activate the dot matches newline mode.

In .NET, Java 8 and prior, and Ruby 1.9, you can use \P{M}\p{M}*+ or (?>\P{M}\p{M}*) as a reasonably close substitute. To match any number of graphemes, use (?>\P{M}\p{M}*)+ as a substitute for \X+.

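\X is the closest fit, but it does not exist in any version of JavaScript through ES6. \P{M}\p{M}*+ approximates \X, but it cannot be written in that form because JavaScript has no possessive quantifiers. In engines that support the u flag together with Unicode property escapes (added to the language in ES2018, after ES6), whether natively or through transpilation, you can use /(\P{Mark})(\p{Mark}+)/gu, which matches a base code point followed by one or more combining marks.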
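A minimal sketch of the base-plus-marks approximation in JavaScript, assuming an engine with Unicode property escapes. It uses a plain greedy * rather than the possessive *+, and the name firstApproxGrapheme is hypothetical.

```javascript
// One non-mark code point followed by any number of combining marks.
// Needs the u flag and Unicode property escape support.
const APPROX_GRAPHEME = /\P{M}\p{M}*/u;

// firstApproxGrapheme is a hypothetical helper name used for illustration.
function firstApproxGrapheme(text) {
  const match = text.match(APPROX_GRAPHEME);
  return match ? match[0] : null;
}

// "G" + U+0300 COMBINING GRAVE ACCENT is kept together.
console.log(firstApproxGrapheme("G\u0300bar") === "G\u0300"); // true
// A plain letter with no marks matches on its own.
console.log(firstApproxGrapheme("aa") === "a"); // true
```

Unlike /(\P{Mark})(\p{Mark}+)/gu, this variant also matches a base code point that has no marks at all.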

Even then, this is not sufficient: a base-plus-marks pattern does not cover every kind of grapheme cluster, such as "\r\n" or emoji joined by a zero-width joiner. Read the linked page for details.
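One way to see where the approximation falls short is to run it against two of the inputs from the question. This is a sketch, assuming an engine with Unicode property escapes; the emoji is written with escapes so its code points are explicit.

```javascript
// Base code point followed by any number of combining marks.
const APPROX_GRAPHEME = /\P{M}\p{M}*/u;

// CR LF is a single grapheme cluster, but LF is not a combining mark,
// so the pattern stops after CR.
const crlfMatch = "\r\n".match(APPROX_GRAPHEME)[0];
console.log(JSON.stringify(crlfMatch)); // "\r"

// U+1F486 U+200D U+2642 U+FE0F: an emoji ZWJ sequence.
// U+200D ZERO WIDTH JOINER is a format character, not a mark,
// so the match ends after the first emoji code point.
const emoji = "\u{1F486}\u200D\u2642\uFE0F";
const emojiMatch = (emoji + "foo").match(APPROX_GRAPHEME)[0];
console.log(emojiMatch === "\u{1F486}"); // true
console.log(emojiMatch === emoji);       // false
```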

A proposal to add text segmentation to the language is described in this repository. Where it is not yet available, Chrome users can try the non-standard Intl.v8BreakIterator to split text into clusters and match them manually.
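Assuming the proposal referred to is the one that has since shipped as Intl.Segmenter, and that the runtime provides it (recent browsers and Node.js builds with ICU do), a sketch of taking the first grapheme cluster without a regular expression could look like this. The helper name firstGrapheme and the default locale "en" are illustrative choices.

```javascript
// firstGrapheme is a hypothetical helper; it assumes Intl.Segmenter exists.
function firstGrapheme(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  // segment() returns an iterable of { segment, index, input } records.
  const first = segmenter.segment(text)[Symbol.iterator]().next();
  return first.done ? "" : first.value.segment;
}

console.log(JSON.stringify(firstGrapheme("\r\n")));  // "\r\n"
console.log(firstGrapheme("aa"));                    // a
console.log(
  firstGrapheme("\u{1F486}\u200D\u2642\uFE0Ffoo") ===
    "\u{1F486}\u200D\u2642\uFE0F"
);                                                   // true
```

Because segmentation follows the grapheme cluster rules directly, this covers cases such as CR LF and emoji ZWJ sequences that a base-plus-marks pattern splits apart.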
