The goal of this question is to extract each URL, along with its dimensions, from an HTML attribute string called `srcset`. The specific criteria are as follows:
- URL must start with either `http` or `https`
- URL may contain a comma `,`
- URL should not have any spaces
- Dimensions consist of digits followed by either `x` or `w`, but they don't necessarily need to be followed by those characters.
Given these criteria, the ideal approach would be to find the `http`/`https` prefix and keep matching until a space is encountered. Then match the run of digits, together with a trailing `w` or `x` if one is present, optionally followed by a comma. The match ends at the next space.
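To make the intended matching concrete, here is a minimal Python sketch of the approach described above. The pattern, the `parse_srcset` helper, and the sample string are all illustrative assumptions rather than a tested solution; in particular, the sketch also accepts the end of the string as a terminator, not only a space.

```python
import re

# Hypothetical pattern for one srcset entry, following the criteria above.
SRCSET_ENTRY = re.compile(
    r"(https?://\S+)"  # URL: http/https up to the first whitespace (commas allowed)
    r"\s+(\d+)"        # dimension digits after the separating whitespace
    r"([wx]?)"         # optional unit, w or x
    r",?(?=\s|$)"      # optional comma, then whitespace or end of string
)

def parse_srcset(srcset):
    """Return a list of (url, digits, unit) tuples; unit is '' when absent."""
    return SRCSET_ENTRY.findall(srcset)

# Made-up sample input, not taken from the question.
sample = (
    "https://url.com/a,b.jpg 650w, "
    "https://url.com/c.jpg 2x, "
    "https://url.com/d.jpg 300"
)

for url, digits, unit in parse_srcset(sample):
    print(url, digits, unit or "(no unit)")
```

Because the URL part is `\S+`, a comma inside the URL is kept, and the separating whitespace is what ends the URL.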
A typical entry looks like `https://url.com 650w`, `https://url.com 650`, or `https://url.com 650x`. Keep in mind that there is no strict standard format here.
Below is my attempted regex pattern along with a Regex101 demo link. The issue with this regex is that it doesn't group correctly:
(https?:\/\/(?:.*(?:\s+\d+[wx])(?:,\s*)?)+)
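A likely cause of the grouping problem is the greedy `.*`, which runs to the end of the line and only backtracks as far as the last dimension. The short Python check below illustrates this on a made-up two-entry string (the sample is an assumption, not the real input): the whole string comes back as a single match.

```python
import re

# The attempted pattern from above, unchanged.
attempt = re.compile(r"(https?:\/\/(?:.*(?:\s+\d+[wx])(?:,\s*)?)+)")

# Made-up sample with two entries.
sample = "https://a.com/1.jpg 650w, https://b.com/2.jpg 2x"

matches = attempt.findall(sample)
print(len(matches))  # one match instead of two
print(matches[0])    # the entire sample string
```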
Here's a sample string to parse:
[Sample URL Strings Here]
The expected output should be:
[Extracted URLs and Dimensions Here]