Using RegEx in Google Apps Script to extract HTML content

Question

I am working with Google Apps Script and need to extract content from an HTML page that is stored as a string, using RegEx. Specifically, I need to retrieve data in the following format:

<font color="#FF0101">
        Data that needs to be extracted
</font>

Which RegEx pattern extracts the data enclosed between the opening and closing <font> tags? I only want data from tags that carry the color attribute and value shown in the snippet above.

javascript regex google-apps-script

Answer №1

There is no need to struggle with RegEx to parse HTML: Google Apps Script's XmlService can parse the text for you, provided the HTML is well-formed XML.

function myFunction() {
  // XmlService.parse() throws on input that is not well-formed XML.
  var xml = '<font color="#FF0101">Data which is want to fetch</font>';
  var doc = XmlService.parse(xml);
  var root = doc.getRootElement();  // the <font> element
  var content = root.getText();
  Logger.log(content);  // "Data which is want to fetch"
  var color = root.getAttribute('color').getValue();
  Logger.log(color);    // "#FF0101"
}

Answer №2

Where a browser DOM is available, JavaScript can parse the HTML itself, so there is no need to resort to regex.

var container = document.createElement('div');
container.innerHTML = "Insert your HTML content here";

var results = container.querySelectorAll("font[color='#FF0101']");
// Iterate through the `results` and extract desired information
// For example: results[0].textContent.replace(/^\s+|\s+$/g,'')

Answer №3

If Apps Script supported the full browser JavaScript environment, a DOM-based solution would work:

var html = "<font color=\"#FF0202\">NOT THIS ONE</font><font color=\"#FF0101\">\n        Data which is want to fetch\n</font>";
var faketag = document.createElement('faketag');
faketag.innerHTML = html;
var arr = [];
[].forEach.call(faketag.getElementsByTagName("font"), function(v) {
    // Keep only <font> elements whose color attribute is exactly #FF0101.
    if (v.getAttribute("color") === "#FF0101") {
        // Trim leading and trailing whitespace from the element's text.
        arr.push(v.innerText.replace(/^\s+|\s+$/g, ""));
    }
});
document.body.innerHTML = JSON.stringify(arr);

However, according to the Google Apps Script (GAS) reference:

Apps Script code runs on Google's servers and does not support browser-based features like DOM manipulation or the Window API.

To extract the inner text of <font color="#FF0101"> tags, a regex can be used instead:

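function myFunction() {
  var doc = DocumentApp.getActiveDocument();
  var paras = doc.getParagraphs();
  var MyRegex = new RegExp('<font\\b[^<]*\\s+color="#FF0101"[^<]*>([\\s\\S]*?)</font>','ig');
  var match;
  // Declare the loop variables locally instead of leaking implicit globals.
  for (var i = 0; i < paras.length; ++i) {
    var text = paras[i].getText();
    MyRegex.lastIndex = 0;  // restart the global regex for each paragraph
    while ((match = MyRegex.exec(text)) !== null) {
      Logger.log(match[1]);  // text between the opening and closing tags
    }
  }
}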

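The regex matches any font tag whose color attribute is set to #FF0101 and captures everything up to the next closing </font> tag. Regex is not a reliable way to parse HTML in general, so consider a more robust technique where one is available. An unrolled variant of the same pattern, shown in the same string-escaped form, avoids the lazy quantifier: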

<font\\b[^<]*\\s+color="#FF0101"[^<]*>([^<]*(?:<(?!/font>)[^<]*)*)</font>
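As a rough check that the two patterns capture the same text, here is a minimal sketch in plain JavaScript (no Apps Script services, so it can be tried anywhere). The helper name `collectCaptures` and the sample string are made up for illustration; the patterns are written as regex literals, so they use single backslashes rather than the doubled ones needed inside a string.

```javascript
// The lazy pattern from the answer, as a regex literal.
var lazyPattern = /<font\b[^<]*\s+color="#FF0101"[^<]*>([\s\S]*?)<\/font>/ig;
// The unrolled variant: runs of non-"<" characters, plus any "<" that
// does not start a closing </font> tag.
var unrolledPattern = /<font\b[^<]*\s+color="#FF0101"[^<]*>([^<]*(?:<(?!\/font>)[^<]*)*)<\/font>/ig;

// Hypothetical helper: collect capture group 1 from every match.
function collectCaptures(pattern, text) {
  var captures = [];
  var match;
  pattern.lastIndex = 0; // a global regex remembers its position between calls
  while ((match = pattern.exec(text)) !== null) {
    captures.push(match[1]);
  }
  return captures;
}

// Illustrative input with a nested tag inside the wanted element.
var sample = '<font color="#FF0202">NOT THIS ONE</font>' +
             '<font color="#FF0101">keep <b>this</b> text</font>';

var lazyResult = collectCaptures(lazyPattern, sample);
var unrolledResult = collectCaptures(unrolledPattern, sample);
// Both arrays hold the single string 'keep <b>this</b> text'.
```

Both versions stop at the first closing </font> tag, so neither copes with font tags nested inside each other.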

To handle HTML data spread across multiple paragraphs:

function myFunction() {
  var doc = DocumentApp.getActiveDocument();
  var text = doc.getBody().getText();
  var MyRegex = new RegExp('<font\\b[^<]*\\s+color="#FF0101"[^<]*>([\\s\\S]*?)</font>','ig');
  var match;  // declared locally instead of leaking an implicit global
  while ((match = MyRegex.exec(text)) !== null) {
    Logger.log(match[1]);  // text between the opening and closing tags
  }
}

Given this input:

<font color="#FF0202">NOT THIS ONE</font>
<font color="#FF0101">
         Data which is want to fetch
</font>

The result would be:

https://i.sstatic.net/ebDcZ.png
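Note that the captured group keeps the line breaks and indentation around the text. If you want just the text, and want to reuse the pattern for other colors, something along these lines might do. This is only a sketch: `extractFontText` and `escapeRegExp` are names I made up, and it assumes the color attribute is always written with double quotes, as in the question.

```javascript
// Hypothetical helper: escape characters that have special meaning in a regex.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: return the trimmed inner text of every <font> tag
// whose color attribute equals `color`.
function extractFontText(html, color) {
  var pattern = new RegExp(
    '<font\\b[^<]*\\s+color="' + escapeRegExp(color) + '"[^<]*>([\\s\\S]*?)</font>',
    'ig');
  var results = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    // Strip the leading and trailing whitespace captured with the text.
    results.push(match[1].replace(/^\s+|\s+$/g, ''));
  }
  return results;
}

// Same input as above.
var input = '<font color="#FF0202">NOT THIS ONE</font>\n' +
            '<font color="#FF0101">\n' +
            '         Data which is want to fetch\n' +
            '</font>';
var extracted = extractFontText(input, '#FF0101');
// extracted is ["Data which is want to fetch"]
```

In Apps Script you would pass in the string you already have, for example the value of `doc.getBody().getText()`, and write each entry out with `Logger.log`.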
