Tips on inserting SSML breaks after periods that are outside quotes and not followed by '<', using regular expressions

I am trying to write a regex that matches every period that is not enclosed in double quotes and not followed by a '<'.

The goal is to convert text into SSML (Speech Synthesis Markup Language). The regex will be used to automatically insert <break time="200ms"/> after each matching period.

I've managed to devise a pattern that detects periods outside of quotes:

/\.(?=(?:[^"]|"[^"]*")*$)/g

The above regex produces the following results: (^ = match)

This. is.a.<break time="0.5s"/> test sentence.
    ^   ^ ^                                  ^
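
For reference, the matches marked above can be reproduced with a short script. This is only a minimal sketch: the helper name matchIndices and the variable names are illustrative assumptions, not part of the original code.

```javascript
// Pattern from the question: a period followed only by balanced double quotes
// up to the end of the input, i.e. a period outside of a quoted section.
const outsideQuotes = /\.(?=(?:[^"]|"[^"]*")*$)/g;

// Hypothetical helper: collect the start index of every match.
function matchIndices(regex, input) {
  return Array.from(input.matchAll(regex), (m) => m.index);
}

const sample = 'This. is.a.<break time="0.5s"/> test sentence.';
console.log(matchIndices(outsideQuotes, sample)); // [ 4, 8, 10, 45 ]
```

Index 10 is the period directly in front of the <break> tag, which is the match that should be excluded.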

However, I need a regex that does not match the third period, because it is immediately followed by '<'. The expected matches are:

This. is.a.<break time="0.5s"/> test sentence.
    ^   ^                                    ^

If anyone can offer some guidance, it would be greatly appreciated!

Answer №1

Capture groups are useful here.

To find or manipulate the dots reliably, capture each dot in its own group, with the text before and after it in separate groups:

/((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g

The expression [^"\.] matches any character that is not a dot or a double quote.

The expression "(?:\\\\|\\"|[^"])*" matches a double-quoted string, which may contain escaped backslashes, escaped double quotes, or dots.

Therefore, (?:[^"\.]|"(?:\\\\|\\"|[^"])*")* consumes as much text as possible up to the next dot outside a quoted string, skipping over any dots inside quoted strings. The dot itself is matched by \.(?!\s*<), whose negative lookahead rejects a dot followed by optional whitespace and a '<'.
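
The pieces described above can also be assembled from named fragments, which may make the pattern easier to read. This is a sketch under assumptions: the constant names QUOTED, PLAIN, SKIP and DOT are made up for illustration, and the composed pattern is intended to behave like the literal above rather than to match it character for character.

```javascript
// Each fragment is a raw string so the backslashes reach RegExp unchanged.
const QUOTED = String.raw`"(?:\\\\|\\"|[^"])*"`; // double-quoted string with \\ and \" escapes
const PLAIN = String.raw`[^".]`;                 // any character except a dot or a double quote
const SKIP = `(?:${PLAIN}|${QUOTED})*`;          // text with no dot outside a quoted string
const DOT = String.raw`\.(?!\s*<)`;              // a dot not followed by optional whitespace and '<'

// Three capture groups: text before the dot, the dot, text after the dot.
const composed = new RegExp(`(${SKIP})(${DOT})(${SKIP})`, 'g');

const sample = 'He said "wait." twice.';
const m = composed.exec(sample);
console.log(m[1]); // He said "wait." twice
console.log(m[2]); // .
```

The dot inside "wait." is skipped because the whole quoted string is consumed as one unit by the QUOTED fragment.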

Running this regex on the following test string:

"Thi\\\"s." is..a.<break time="0\".5s"/> test sentence.

The following matches will be generated:

Match 1

  • Full match, from character 0 to 15: "Thi\\\"s." is.
  • Group 2, from character 14 to 15: .

Match 2

  • Full match, from character 15 to 17: .a
  • Group 2, from character 15 to 16: .

Match 3

  • Full match, from character 18 to 55:
    <break time="0\".5s"/> test sentence.
  • Group 2, from character 54 to 55: .

You can verify this with a tool such as Regex101.

Note that the dot is always captured in the second group. Because group 1 holds everything in the match before the dot, the index of the dot is match.index + match[1].length (group 1 may be empty, in which case the dot sits at match.index).

Note: The expression handles escaped double quotes (\") and escaped backslashes (\\) inside quoted strings, so they do not end a quoted string prematurely.
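
If escaped quotes are not a concern, a simpler option may be to keep the pattern from the question and add the same negative lookahead to it. The sketch below assumes that quoted sections never contain escaped double quotes; the names periodPattern and withBreaks are illustrative.

```javascript
// Assumption: quotes are never escaped, so a quoted section is simply "...".
// Pattern from the question plus a negative lookahead that rejects a period
// followed by optional whitespace and '<'.
const periodPattern = /\.(?!\s*<)(?=(?:[^"]|"[^"]*")*$)/g;

const sample = 'This. is.a.<break time="0.5s"/> test sentence.';
const indices = Array.from(sample.matchAll(periodPattern), (m) => m.index);
console.log(indices); // [ 4, 8, 45 ]

// "$&" in the replacement stands for the matched period itself.
const withBreaks = sample.replace(periodPattern, '$&<break time="200ms"/>');
console.log(withBreaks);
```

The trailing lookahead rescans the rest of the input for every period, so this variant is best suited to short strings.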

A short working version of this approach:

// To gather all matches, 'g' flag is essential
const regexp = /((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g;

function getMatchingPointsExcludingChevronAndStrings(input) {
  let match;
  const result = [];

  // Resetting the lastIndex of regexp since it's reused per call
  regexp.lastIndex = 0;
 
  while ((match = regexp.exec(input))) {
    // Index of the dot = match index + length of group 1 if present
    result.push(match.index + (match[1] ? match[1].length : 0));
  }

  // Result comprises indices of all '.' adhering to the specified criteria
  return result;
}

// Every backslash is doubled inside the template literal; console.log shows the actual string
const testString = `"Thi\\\\\\"s." is..a.<break time="0\\".5s"/> test sentence.`;
console.log(testString);

// Final outcome
console.log(
    getMatchingPointsExcludingChevronAndStrings(testString)
);
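
As an alternative to adding offsets by hand, the 'd' flag (hasIndices, available in ES2022 engines) exposes the position of each capture group directly. The sketch below assumes such an engine; the names regexpWithIndices and getDotIndices are illustrative and not part of the code above.

```javascript
// Same pattern as above with the 'd' flag added, so each match exposes the
// [start, end] offsets of every capture group in match.indices.
const regexpWithIndices = /((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/dg;

function getDotIndices(input) {
  // match.indices[2] is the [start, end] pair of group 2, the dot itself.
  return Array.from(input.matchAll(regexpWithIndices), (match) => match.indices[2][0]);
}

console.log(getDotIndices('This. is.a.<break time="0.5s"/> test sentence.')); // [ 4, 8, 45 ]
```

Since matchAll works on a copy of the regex, there is no lastIndex to reset between calls.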

Edit:

The asker wants to insert the break markup after the matching periods in raw HTML content.

Here is a complete working solution:

// To collect all matches, include 'g' flag
const regexp = /((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g;

function addPausesAfterPeriods(input) {
    let match;
    const dotOffsets = [];

    // Reset lastIndex, because the global regexp is shared between calls
    regexp.lastIndex = 0;

    // Initially compile offsets for all period occurrences
    while ((match = regexp.exec(input))) {
        // Offset of the dot = match index + length of first group if applicable
        dotOffsets.push(match.index + (match[1] ? match[1].length : 0));
    }

    // If no periods found, return input untouched
    if (dotOffsets.length === 0) {
        return input;
    }

    // Reconstruct the string with added breaks following each period
    const restructuredContent = dotOffsets.reduce(
        (result, offset, index) => {
            // A segment represents substring from one period to the next (or beginning)
            const segment = input.substring(
                index <= 0 ? 0 : dotOffsets[index - 1] + 1,
                offset + 1
            );
            return `${result}${segment}<break time="200ms"/>`;
        },
        ''
    );

    // Add remaining portion from last period till end of string
    const remainder = input.substring(dotOffsets[dotOffsets.length - 1] + 1);
    return `${restructuredContent}${remainder}`;
}

const testString = `
<p>
    This is a sample from Wikipedia.
    It is used as an example for this snippet.
</p>
<p>
    <b>Hypertext Markup Language</b> (<b>HTML</b>) is the standard
    <a href="/wiki/Markup_language.html" title="Markup language">
        markup language
    </a> for documents designed to be displayed in a
    <a href="/wiki/Web_browser.html" title="Web browser">
        web browser
    </a>.
    It can be assisted by technologies such as
    <a href="/wiki/Cascading_Style_Sheets" title="Cascading Style Sheets">
        Cascading Style Sheets
    </a>
    (CSS) and
    <a href="/wiki/Scripting_language.html" title="Scripting language">
        scripting languages
    </a>
    such as
    <a href="/wiki/JavaScript.html" title="JavaScript">JavaScript</a>.
</p>
`;


console.log(`Initial raw html:\n${testString}\n`);

console.log(`Result (added 2 pauses):\n${addPausesAfterPeriods(testString)}\n`);
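
Because the pattern already captures the text before and after each dot, the same insertion can probably be written as a single replace call. This is a sketch, not the original solution: the names BREAK, periodRegexp and addPausesWithReplace are assumptions, and the output is expected to match addPausesAfterPeriods for the same input.

```javascript
const BREAK = '<break time="200ms"/>';
const periodRegexp = /((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g;

function addPausesWithReplace(input) {
  // For a global regex, replace() starts from index 0 on every call and
  // copies any text between matches through unchanged.
  return input.replace(
    periodRegexp,
    (match, before, dot, after) => `${before}${dot}${BREAK}${after}`
  );
}

console.log(addPausesWithReplace('This. is.a.<break time="0.5s"/> test sentence.'));
// This.<break time="200ms"/> is.<break time="200ms"/>a.<break time="0.5s"/> test sentence.<break time="200ms"/>
```

A replacer function is used instead of a '$1$2...' replacement string so that the inserted markup is never interpreted as a replacement pattern.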
