Extracting data from website's table using JavaScript and opening the link in href

My goal is to extract the details page for each link found on this particular page.

The link provides access to all the information required: PAGE

However, I'm interested in extracting details from pages that have links like this:

href="javascript:subOpen('9ca8ed0fae15d43dc1257e7300345b99')"

I've shared a sample spreadsheet using the ImportHTML feature to get an overview.

Google Spreadsheet

Any ideas on how to proceed with retrieving details from these individual pages?

UPDATE

I tried implementing the following method:

function doGet(e){
  var base = 'http://www.ediktsdatei.justiz.gv.at/edikte/ex/exedi3.nsf/'
  var feed =  UrlFetchApp.fetch(base + 'suche?OpenForm&subf=e&query=%28%5BVKat%5D%3DEH%20%7C%20%5BVKat%5D%3DZH%20%7C%20%5BVKat%5D%3DMH%20%7C%20%5BVKat%5D%3DMW%20%7C%20%5BVKat%5D%3DMSH%20%7C%20%5BVKat%5D%3DGGH%20%7C%20%5BVKat%5D%3DRH%20%7C%20%5BVKat%5D%3DHAN%20%7C%20%5BVKat%5D%3DWE%20%7C%20%5BVKat%5D%3DEW%20%7C%20%5BVKat%5D%3DMAI%20%7C%20%5BVKat%5D%3DDTW%20%7C%20%5BVKat%5D%3DDGW%20%7C%20%5BVKat%5D%3DGA%20%7C%20%5BVKat%5D%3DGW%20%7C%20%5BVKat%5D%3DUL%20%7C%20%5BVKat%5D%3DBBL%20%7C%20%5BVKat%5D%3DLF%20%7C%20%5BVKat%5D%3DGL%20%7C%20%5BVKat%5D%3DSE%20%7C%20%5BVKat%5D%3DSO%29%20AND%20%5BBL%5D%3D0').getContentText();

       var d = document.createElement('div'); //assuming you can do this
       d.innerHTML = feed;//make the text a dom structure
       var arr = d.getElementsByTagName('a') //iterate over the page links
       var response = "";
       for(var i = 0;i<arr.length;i++){
         var atr = arr[i].getAttribute('onclick');
         if(atr) atr = atr.match(/subOpen\((.*?)\)/) //if onclick calls subOpen
         if(atr && atr.length > 1){ //get the id
            var detail = UrlFetchApp.fetch(base + '0/'+atr[1]).getContentText();
            response += detail//process the relevant part of the content and append to the reposnse text
         }
        }      
       return ContentService.createTextOutput(response);
}

Unfortunately, I encountered an error when running this method:

ReferenceError: "document" is not defined. (line 6, file "")

What exactly does the object document refer to?

I have updated the Google Spreadsheet with a webapp integration.

Answer №1

You can use Firebug to inspect a page's content and JavaScript. Doing so shows that subOpen is an alias for subOpenXML, which is declared in xmlhttp01.js:

function subOpenXML(unid) {/*open found doc from search view*/
 if (waiting) return alert(bittewar);
 var wState = dynDoc.getElementById('windowState');
 wState.value = 'H';/*httpreq pending*/
 var last = '';
 if (unid==docLinks[0]) {last += '&f=1'; thisdocnum = 1;}
 if (unid==docLinks[docLinks.length-1]) {
  last += '&l=1';
  thisdocnum = docLinks.length;
 } else {
  for (var i=1;i<docLinks.length-1;i++)
   if (unid==docLinks[i]) {thisdocnum = i+1; break;}
 }
 var url = unid + html_delim + 'OpenDocument'+last + '&bm=2';
 httpreq.open('GET',    // &rand=' + Math.random();
  /*'/edikte/test/ex/exedi31.nsf/0/'+*/ '0/'+url, true);
 httpreq.onreadystatechange=onreadystatechange;
// httpreq.setRequestHeader('Accept','text/xml');
 httpreq.send(null);
 waiting = true;
 title2src = firstTextChild(dynDoc.getElementById('title2')).nodeValue;
}

To see which URL the function requests, you can redefine it in Firebug's Console tab with a console.log(url) inserted before the HTTP call, like so:

 var url = unid + html_delim + 'OpenDocument'+last + '&bm=2';
 console.log(url)
 httpreq.open('GET',    // &rand=' + Math.random();
  /*'/edikte/test/ex/exedi31.nsf/0/'+*/ '0/'+url, true);

Running the modified function declaration in the Console tab updates subOpen with the new source. When you then click a link, the log shows that the requested URL is the passed ID prefixed by '0/' and followed by the OpenDocument parameters. For example, this results in a GET request to:

http://www.ediktsdatei.justiz.gv.at/edikte/ex/exedi3.nsf/0/1fd2313c2e0095bfc1257e49004170ca?OpenDocument&f=1&bm=2

You can confirm this by examining the Network tab within Firebug and following the link.
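
As a rough sketch, the URL construction in subOpenXML can be reproduced outside the page. The function name buildDetailUrl and its parameters are hypothetical, and the '?' delimiter is an assumption inferred from the example URL above (the page stores it in html_delim, whose value is not shown in the source):

```javascript
// Sketch: rebuild the detail URL that subOpenXML requests.
// Assumption: html_delim is '?', inferred from the example URL.
// isFirst / isLast mirror the '&f=1' / '&l=1' flags the page adds
// for the first and last document in the result list.
function buildDetailUrl(base, unid, isFirst, isLast) {
  var flags = '';
  if (isFirst) flags += '&f=1';
  if (isLast) flags += '&l=1';
  return base + '0/' + unid + '?OpenDocument' + flags + '&bm=2';
}
```

Calling it with the base URL, the ID 1fd2313c2e0095bfc1257e49004170ca, and isFirst set to true yields the example URL shown above.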

To scrape details from the page, the following steps are required:

  1. Extract the ID passed to subOpen
  2. Issue a GET request to '0/' followed by that ID
  3. Parse the response from the request

Judging by the response shown in the Network tab, the page does some parsing of its own to produce the displayed content, so similar parsing may be necessary. I have not investigated this further.
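
The first step can be sketched with a plain regular expression, which needs no DOM. The helper name extractSubOpenIds is hypothetical, and the sketch assumes the IDs always appear as single-quoted arguments, as in the href quoted in the question:

```javascript
// Sketch: collect the IDs passed to subOpen('...') in a page's HTML.
// Assumption: each ID is a single-quoted string argument.
function extractSubOpenIds(html) {
  var ids = [];
  var re = /subOpen\('([^']+)'\)/g;
  var m;
  while ((m = re.exec(html)) !== null) {
    // Keep each ID only once, preserving the order of first appearance
    if (ids.indexOf(m[1]) === -1) ids.push(m[1]);
  }
  return ids;
}
```

Each returned ID can then be appended to the base URL plus '0/' to request the corresponding detail page.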

UPDATE: For this scraping task, the importHTML function is probably not the best tool. Google Apps Script's HTML Service or Content Service is likely a better fit. I recommend building a web app and implementing the doGet function:

function doGet(e){
  var base = 'http://www.ediktsdatei.justiz.gv.at/edikte/ex/exedi3.nsf/'
  var feed =  UrlFetchApp.fetch(base + 'suche?OpenForm&subf=e&query=%28%5BVKat%5D%3DEH%20%7C%20%5BVKat%5D%3DZH%20%7C%20%5BVKat%5D%3DMH%20%7C%20%5BVKat%5D%3DMW%20%7C%20%5BVKat%5D%3DMSH%20%7C%20%5BVKat%5D%3DGGH%20%7C%20%5BVKat%5D%3DRH%20%7C%20%5BVKat%5D%3DHAN%20%7C%20%5BVKat%5D%3DWE%20%7C%20%5BVKat%5D%3DEW%20%7C%20%5BVKat%5D%3DMAI%20%7C%20%5BVKat%5D%3DDTW%20%7C%20%5BVKat%5D%3DDGW%20%7C%20%5BVKat%5D%3DGA%20%7C%20%5BVKat%5D%3DGW%20%7C%20%5BVKat%5D%3DUL%20%7C%20%5BVKat%5D%3DBBL%20%7C%20%5BVKat%5D%3DLF%20%7C%20%5BVKat%5D%3DGL%20%7C%20%5BVKat%5D%3DSE%20%7C%20%5BVKat%5D%3DSO%29%20AND%20%5BBL%5D%3D0').getContentText();
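  var response = "";
  // Collect every subOpen('<id>') call in the search results page
  var match = feed.match(/subOpen\('.*?'\)/g);
  if (match) {
    for (var i = 0; i < match.length; i++) {
      // Capture the id between the quotes
      var m = match[i].match(/\('(.*)'\)/);
      if (m && m.length > 1) {
        // Fetch the detail page for this id as text
        var detailText = UrlFetchApp.fetch(base + '0/' + m[1]).getContentText();
        // Process detailText as needed; here it is appended unchanged
        response += detailText + '\n';
      }
    }
  }
  return ContentService.createTextOutput(response);
}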
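
Apps Script runs on Google's servers, where no browser DOM exists, which is why document is not defined there. As one possible way to process each detail text without a DOM, the sketch below reduces HTML to plain text using regular expressions. The helper name htmlToText is hypothetical, and since the actual structure of the detail responses is not shown here, treat it as a starting point only:

```javascript
// Sketch: crude HTML-to-text conversion that needs no DOM.
// Assumption: the detail response is HTML-like markup; adjust the
// patterns once the real response format is known.
function htmlToText(html) {
  return html
    .replace(/<(script|style)[^>]*>[\s\S]*?<\/\1>/gi, ' ') // drop script/style blocks
    .replace(/<[^>]+>/g, ' ')                              // drop remaining tags
    .replace(/&nbsp;/g, ' ')                               // decode non-breaking spaces
    .replace(/\s+/g, ' ')                                  // collapse whitespace
    .trim();
}
```

Inside the loop, the result of htmlToText(detailText) could be appended to the response instead of the raw text.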
