Regular Expression - Locate all numerical values within text formatted with HTML

Question

I am attempting to locate all the numbers within an HTML document. However, I want to ensure that I exclude numbers that are part of a word, such as "o365", "high5", and similar instances.

Here is my current approach, but it does not successfully avoid words:

regular expression:

[\s+>][-.0-9]+

sample HTML snippet:

<p ng-if="e.element != 'attachment'" ng-bind-html="::e.value" class="ng-binding ng-scope">123 Hello need 123 help with 0365 thanks</p>

javascript regex

Answer №1

One option is to utilize a straightforward regex pattern:

\b\d+\b

Explore demo here

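The idea is to match runs of digits that have a word boundary (\b) on both sides. Since letters, digits and underscores all count as word characters, digits embedded in a word such as "o365" or "high5" are not matched.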
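
As a quick check of the word-boundary idea, here is a minimal JavaScript sketch. The helper name findStandaloneNumbers and the extended sample string are assumptions made for illustration; they are not part of the original answer.

```javascript
// Hypothetical helper: returns every run of digits that has a word boundary on both sides.
function findStandaloneNumbers(text) {
  // \b matches between a word character [A-Za-z0-9_] and a non-word character
  // (or the start/end of the string), so digits glued to letters never qualify.
  return text.match(/\b\d+\b/g) || [];
}

// Text from the question's sample, extended with "o365" and "high5" (an assumption).
const sampleText = "123 Hello need 123 help with 0365 thanks o365 high5";

console.log(findStandaloneNumbers(sampleText));   // [ '123', '123', '0365' ]
console.log(findStandaloneNumbers("pi is 3.14")); // [ '3', '14' ]
```

Note that the dot is a non-word character, so a decimal such as 3.14 comes back as two separate matches.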

Answer №2

Since your sample regexp contains a dot, it looks like you want to match both floating point numbers and integers. To allow for a sign, start with an optional sign:

[+-]?

Following this, there must be a sequence of digits (at least one):

[0-9][0-9]*

(this can also be written as \d+). Next comes an optional dot followed by another sequence of digits, which may be empty:

(\.\d*)?
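
The three pieces can be tried in isolation before any boundaries are added. The sketch below joins them with `new RegExp` and anchors the result with ^ and $ so that a whole string is tested; the anchors and the variable names are assumptions for this illustration only.

```javascript
// Each piece is a plain string, so backslashes are doubled.
const sign = "[+-]?";          // optional sign
const digits = "\\d+";         // at least one digit
const fraction = "(\\.\\d*)?"; // optional dot followed by zero or more digits

// Anchored variant (an assumption) that must match the entire input.
const wholeNumber = new RegExp("^" + sign + digits + fraction + "$");

console.log(wholeNumber.test("42"));    // true
console.log(wholeNumber.test("-3.14")); // true
console.log(wholeNumber.test("+7."));   // true
console.log(wholeNumber.test(".5"));    // false
console.log(wholeNumber.test("o365"));  // false
```

Because the digits before the dot are mandatory, an input such as .5 is not accepted by this pattern.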

Finally, to make sure these numbers are not attached to letters, put word boundaries on both ends. The final regex then looks like this:

\b[+-]?\d+(\.\d*)?\b

As demonstrated in demo.

The demo showcases three unusual cases which warrant attention:

The right boundary prevents +15350.16f from matching as a whole, because there is no word boundary between the 6 and the f. With the pattern exactly as written, the engine backtracks and stops at the dot, since the position between the dot and the following digit is itself a boundary, so only the leading part (15350 and the dot) is captured.
Here the + sign is a non-word character, so it provides the word boundary on the left of the first digit and the number is scanned correctly.
Here the left boundary forces the scan to skip the first part of the number (e25). The dot then acts as a word boundary for the fractional part, so 42 is matched as a number on its own. This case is harder; extra context is needed to handle it.
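
The behaviour around these cases can be checked directly in JavaScript. The inputs below are assumptions modelled on the cases described (the demo itself is not reproduced here), and the helper name matchNumbers is hypothetical. The results come from the pattern exactly as written above.

```javascript
// The pattern from the answer, with the global flag so every match is returned.
const boundedNumber = /\b[+-]?\d+(\.\d*)?\b/g;

// Hypothetical helper: list all matches in a string.
function matchNumbers(text) {
  return text.match(boundedNumber) || [];
}

console.log(matchNumbers("+15350.16f")); // [ '15350.' ]
console.log(matchNumbers("say +42 ok")); // [ '42' ]
console.log(matchNumbers("e25.42"));     // [ '42' ]
console.log(matchNumbers("3.14 and 7")); // [ '3.14', '7' ]
```

When a sign follows a space or the start of the string, it is left out of the match: \b cannot match between two non-word characters, so the match starts at the first digit.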

To handle the last case, we add context in front of the number and accept or reject the match based on it. The context is captured in a first group, and if that group matches anything the whole match is discarded:

([a-zA-Z]?)

When prepended to our regexp:

([a-zA-Z]?)([+-]?[0-9]+(\.[0-9]+)?)

The match is rejected if group 1 is non-empty. If group 1 is empty, the number is taken from group 2. Refer to demo2.

The demo shows that a signed number directly preceded by a letter may well be valid, yet it is rejected because the letter lands in the first group. To prevent this, two regular expressions are or-ed together as alternatives. The first one has no sign:

([a-zA-Z]?)([0-9]+(\.[0-9]*)?)

The second one is the original expression with the sign, which is mandatory in this case:

([+-][0-9]+(\.[0-9]*)?)

If group 1 contains anything, the match is rejected as not being a valid number. Otherwise, group 2 holds an *unsigned floating point or integer number*, while group 4 holds a *signed floating point or integer number*. The final regexp is:

([a-zA-Z]?)([0-9]+(\.[0-9]*)?)|([+-][0-9]+(\.[0-9]*)?)

Refer to demo3.
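
A minimal JavaScript sketch of how the groups might be consumed follows. The helper name extractNumbers and the test string are assumptions for illustration; only the pattern itself comes from the answer.

```javascript
// Final pattern from the answer, with the global flag added for matchAll.
const numberPattern = /([a-zA-Z]?)([0-9]+(\.[0-9]*)?)|([+-][0-9]+(\.[0-9]*)?)/g;

// Hypothetical helper: keep a match only when group 1 (the leading letter) is empty.
function extractNumbers(text) {
  const numbers = [];
  for (const match of text.matchAll(numberPattern)) {
    // Index 1 = leading letter, 2 = unsigned number, 4 = signed number.
    const [, letter, unsigned, , signed] = match;
    if (letter) continue; // a letter precedes the digits: reject
    numbers.push(signed !== undefined ? signed : unsigned);
  }
  return numbers;
}

console.log(extractNumbers("123 Hello need 123 help with 0365 thanks o365 high5 -7.5 x+2"));
// [ '123', '123', '0365', '-7.5', '+2' ]
```

This pattern only inspects what comes before the number. A trailing letter, as in 16f, is not checked, so it would need an extra test if that matters.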
