I'm trying to crawl a website that appears to generate its content dynamically with DWR, and I've hit a challenge. The page source is only a minimal HTML 'shell' with no useful links to follow; the actual content is fetched through POST requests whose responses contain JavaScript:
throw 'allowScriptTagRemoting is false.';
//#DWR-INSERT
//#DWR-REPLY
var a1 = {}; var a2 = {}; var a3 = {}; // ... and so on.
a1.configs=a3;a1.defaultSite=true;a1.defaultValues=a4; // ... and more.
The full response runs to around 150 lines of this.
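For what it's worth, I can replay the POST outside the browser with a sketch like the one below (Node 18+ with the built-in fetch). The endpoint, service/method names, and session values are placeholders; the real ones would have to be copied from the browser's network tab.

// Replay the DWR POST directly; all names and ids below are placeholders.
(async () => {
  const body = [
    'callCount=1',
    'page=/index.html',
    'httpSessionId=',
    'scriptSessionId=',
    'c0-scriptName=SiteService',  // placeholder DWR service name
    'c0-methodName=getConfig',    // placeholder method name
    'c0-id=0',
    'batchId=0',
  ].join('\n');

  const res = await fetch('https://example.com/dwr/call/plaincall/SiteService.getConfig.dwr', {
    method: 'POST',
    headers: { 'Content-Type': 'text/plain' },
    body,
  });

  // The reply is JavaScript (as shown above), not JSON, so it has to be
  // parsed or evaluated rather than handed to res.json().
  console.log(await res.text());
})();

So getting the raw reply is easy; the problem is turning it into crawlable HTML.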
I am aware that this is how DWR normally operates, but I am curious what strategies web crawlers use for pages like this. Is there a way for them to execute the JavaScript in the AJAX response and then wait for the DOM to finish updating?
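For instance, I experimented with a headless browser along these lines (Puppeteer here, with a placeholder URL), though I don't know whether this is how crawlers typically handle it:

const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless browser so the page's own JavaScript (including the
  // DWR calls) actually runs, then wait until the network goes quiet.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/page', { waitUntil: 'networkidle0' });

  // By now the DWR replies have been evaluated and the DOM updated, so the
  // serialized page includes the dynamically generated HTML.
  const html = await page.content();
  console.log(html);

  await browser.close();
})();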
This seems different from a standard Ajax request, where the response either contains HTML that is inserted into the DOM when the request completes, or contains data that the page's existing JavaScript then uses to update the DOM. In neither case does the response itself need to be executed as a script.
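The other approach I've considered is skipping the rendered DOM entirely and capturing the DWR replies as raw data, roughly like this (Puppeteer again; the '/dwr/' filter is a guess based on DWR's usual servlet mapping):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Log the raw DWR replies as they come back, instead of waiting for
  // the DOM to be updated.
  page.on('response', async (response) => {
    if (response.url().includes('/dwr/') &&
        response.request().method() === 'POST') {
      const body = await response.text();
      console.log('DWR reply:', body.slice(0, 200));
    }
  });

  await page.goto('https://example.com/page', { waitUntil: 'networkidle0' });
  await browser.close();
})();

But that still leaves me parsing 150 lines of variable assignments rather than HTML.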
Any insights or advice on how to approach this would be greatly appreciated.