Capturing groups are useful here.
To locate or manipulate the dots reliably, capture each dot in its own group:
/((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g
The expression [^"\.] matches any character that is neither a dot nor a double quote.
The pattern "(?:\\\\|\\"|[^"])*" matches a quoted string, which may contain escaped double quotes or dots.
Therefore, (?:[^"\.]|"(?:\\\\|\\"|[^"])*")* consumes everything up to the next dot (.) outside a quoted string, skipping dots enclosed in quoted strings as far as possible.
The middle group, (\.(?!\s*<)), then captures a dot only when it is not followed by optional whitespace and a <.
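To make the structure of the pattern easier to see, it can be assembled from its parts. This is only a sketch: the names NON_DOT, QUOTED, CHUNK, DOT and composed are made up for illustration, and String.raw is used so that the backslashes stay exactly as they appear in the regex literal.

```javascript
// Hypothetical names, used only to label the pieces of the pattern
const NON_DOT = String.raw`[^"\.]`;              // any char except a dot or a double quote
const QUOTED = String.raw`"(?:\\\\|\\"|[^"])*"`; // a quoted string, escaped quotes allowed
const CHUNK = `(?:${NON_DOT}|(?:${QUOTED}))*`;   // text that contains no dot outside a string
const DOT = String.raw`\.(?!\s*<)`;              // a dot not followed by optional whitespace and '<'

// Three capturing groups: text before the dot, the dot itself, text after the dot
const composed = new RegExp(`(${CHUNK})(${DOT})(${CHUNK})`, 'g');

console.log(composed.source);
```

Building the pattern this way should yield the same source as the literal above, while keeping each piece readable on its own.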
Upon executing this regex pattern on the provided string:
"Thi\\\"s." is..a.<break time="0\".5s"/> test sentence.
The following matches will be generated:
Match 1
- Full match, from character 0 to 15:
"Thi\\\"s." is.
- Group 2, from character 14 to 15:
.
Match 2
- Full match, from character 15 to 17:
.a
- Group 2, from character 15 to 16:
.
Match 3
You can verify this with a tool such as Regex101.
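The offsets can also be checked locally. The following is a minimal sketch, assuming a runtime that supports String.prototype.matchAll (for example Node.js 12 or later); the names pattern, sampleText and rows are illustrative only.

```javascript
const pattern = /((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g;

// String.raw keeps every backslash exactly as typed
const sampleText = String.raw`"Thi\\\"s." is..a.<break time="0\".5s"/> test sentence.`;

// One row per match: span of the full match and span of the captured dot (group 2)
const rows = [...sampleText.matchAll(pattern)].map((m) => {
  const dotStart = m.index + m[1].length;
  return {
    fullStart: m.index,
    fullEnd: m.index + m[0].length,
    dotStart,
    dotEnd: dotStart + m[2].length,
  };
});

console.table(rows);
```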
Notably, the captured dot always ends up in the second group because of how the expression is structured. The index of the dot is therefore match.index + match[1].length; match[1] is always defined, although it may be an empty string.
Note: The expression handles escaped double quotes, so an escaped quote inside a string does not end the string early.
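The effect of the escape handling is easiest to see on the string sub-pattern alone. The sketch below compares it with a naive "[^"]*" pattern; the names quoted, naive and sample are made up for this illustration.

```javascript
// Sub-pattern from the answer: a quoted string that may contain \\ or \"
const quoted = /"(?:\\\\|\\"|[^"])*"/;
// Naive alternative that stops at the first double quote, escaped or not
const naive = /"[^"]*"/;

// String.raw keeps the backslashes exactly as typed
const sample = String.raw`say "a \"quoted\" word." now`;

console.log(sample.match(quoted)[0]); // "a \"quoted\" word."
console.log(sample.match(naive)[0]);  // "a \"
```

With the naive pattern, the dot after word would be treated as lying outside any string, which is exactly what the longer sub-pattern is meant to avoid.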
A concise and functional version of the working solution is outlined below:
// The 'g' flag is required to collect every match
const regexp = /((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g;

function getMatchingPointsExcludingChevronAndStrings(input) {
  let match;
  const result = [];
  // Reset lastIndex because the global regexp is shared between calls
  regexp.lastIndex = 0;
  while ((match = regexp.exec(input))) {
    // Index of the dot = match index + length of group 1 (which may be empty)
    result.push(match.index + (match[1] ? match[1].length : 0));
  }
  // Indices of every '.' that is outside a string and not followed by '<'
  return result;
}

// Backslashes must be doubled inside a template literal; the logged string shows the actual input
const testString = `"Thi\\\\\\"s." is..a.<break time="0\\".5s"/> test sentence.`;
console.log(testString);
// Final result: [14, 15, 54]
console.log(
  getMatchingPointsExcludingChevronAndStrings(testString)
);
Edit:
The asker wants to insert pause markup after each period in text given as raw HTML.
Here’s a fully operational solution:
// The 'g' flag is required to collect every match
const regexp = /((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g;

function addPausesAfterPeriods(input) {
  let match;
  const dotOffsets = [];
  // Reset lastIndex before each use because the global regexp is shared
  regexp.lastIndex = 0;
  // First, collect the offsets of all matching periods
  while ((match = regexp.exec(input))) {
    // Offset of the dot = match index + length of group 1 (which may be empty)
    dotOffsets.push(match.index + (match[1] ? match[1].length : 0));
  }
  // If no periods were found, return the input untouched
  if (dotOffsets.length === 0) {
    return input;
  }
  // Rebuild the string, adding a break after each period
  const restructuredContent = dotOffsets.reduce(
    (result, offset, index) => {
      // A segment runs from just after the previous period (or the start) up to and including this one
      const segment = input.substring(
        index <= 0 ? 0 : dotOffsets[index - 1] + 1,
        offset + 1
      );
      return `${result}${segment}<break time="200ms"/>`;
    },
    ''
  );
  // Append whatever follows the last period
  const remainder = input.substring(dotOffsets[dotOffsets.length - 1] + 1);
  return `${restructuredContent}${remainder}`;
}
const testString = `
<p>
This is a sample from Wikipedia.
It is used as an example for this snippet.
</p>
<p>
<b>Hypertext Markup Language</b> (<b>HTML</b>) is the standard
<a href="/wiki/Markup_language.html" title="Markup language">
markup language
</a> for documents designed to be displayed in a
<a href="/wiki/Web_browser.html" title="Web browser">
web browser
</a>.
It can be assisted by technologies such as
<a href="/wiki/Cascading_Style_Sheets" title="Cascading Style Sheets">
Cascading Style Sheets
</a>
(CSS) and
<a href="/wiki/Scripting_language.html" title="Scripting language">
scripting languages
</a>
such as
<a href="/wiki/JavaScript.html" title="JavaScript">JavaScript</a>.
</p>
`;
console.log(`Initial raw html:\n${testString}\n`);
console.log(`Result (added 2 pauses):\n${addPausesAfterPeriods(testString)}\n`);
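Because the pattern already captures the text before the dot, the dot itself, and the text after it, the same result can probably be obtained with a single String.prototype.replace call. This is a sketch of that alternative, not part of the solution above: the helper name addPausesWithReplace, the variable dotPattern and the pause parameter are hypothetical.

```javascript
const dotPattern = /((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)(\.(?!\s*<))((?:[^"\.]|(?:"(?:\\\\|\\"|[^"])*"))*)/g;

// Hypothetical helper: inserts `pause` right after every matched dot
function addPausesWithReplace(input, pause = '<break time="200ms"/>') {
  // before = group 1, dot = group 2, after = group 3.
  // A replacer function is used so that '$' characters in `pause` are never
  // interpreted as replacement patterns.
  return input.replace(
    dotPattern,
    (fullMatch, before, dot, after) => `${before}${dot}${pause}${after}`
  );
}

console.log(addPausesWithReplace('Hello. World.<br/> "x.y" end.'));
```

Text that no match covers, such as a dot directly followed by a tag, is left untouched by replace, so no manual bookkeeping of offsets is needed.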