Scraping data without a browser: Selenium, PhantomJS, and Geb

Question

I need to extract data from a website every week. The data only becomes visible after a click on the page, which triggers a JavaScript function; it is then loaded into a table that can be identified by its unique ID. The script has to run on a server without browser support. Here is my attempt using Geb:

    @Grab("org.gebish:geb-core:0.13.1")
    @Grab("org.seleniumhq.selenium:selenium-firefox-driver:2.52.0")
    @Grab("org.seleniumhq.selenium:selenium-support:2.52.0")
    @GrabExclude('org.codehaus.groovy:groovy-all')

    import geb.Browser

    Browser.drive {
        // driver.webClient.javaScriptEnabled = true
        go "mysite"
        js.loadWeekData()
        println $("div.data-listing").text()
    }

I have researched this extensively but could not find a solution for headless scraping with JavaScript support. The line below was recorded with Selenium IDE:

    driver.findElement(By.linkText("Next")).click();

I also ran into difficulties when trying to integrate PhantomJS with Geb.

Edit 1: This is the error message I get with PhantomJS:

    java.lang.NoClassDefFoundError: org/openqa/selenium/browserlaunchers/Proxies

I tried to sort out version compatibility issues but was unable to resolve it.

    @Grab("org.gebish:geb-core:0.13.1")
    @Grab("org.seleniumhq.selenium:selenium-firefox-driver:2.52.0")
    @Grab("org.seleniumhq.selenium:selenium-support:2.52.0")
    @Grab("com.codeborne:phantomjsdriver:1.3.0")
    import org.openqa.selenium.By
    import org.openqa.selenium.WebDriver
    import org.openqa.selenium.WebElement
    import org.openqa.selenium.phantomjs.PhantomJSDriver

    WebDriver driver = new PhantomJSDriver();

    // Load Google.com
    driver.get("http://www.google.com");
    // Locate the Search field on the Google page
    WebElement element = driver.findElement(By.name("q"));

In summary, I am looking for a way to run the first script in headless mode, if possible without installing Xvfb. A solution in Groovy or Java would be ideal.

javascript selenium web-scraping phantomjs geb

Answer 1

In the end I went with HtmlUnit. The code below still needs some tidying up, but it works. The main annoyance with HtmlUnit, its warnings and errors, is taken care of by turning off its logging:

    @Grab(group='net.sourceforge.htmlunit', module='htmlunit', version='2.21')

    import com.gargoylesoftware.htmlunit.WebClient
    import com.gargoylesoftware.htmlunit.html.HtmlButton
    import com.gargoylesoftware.htmlunit.html.HtmlPage
    import java.util.logging.Level

    // Turn off HtmlUnit's warnings and errors
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF)

    WebClient webClient = new WebClient()
    webClient.waitForBackgroundJavaScriptStartingBefore(10000)
    HtmlPage currentPage = webClient.getPage("mysite")

    // Alternative: click the button instead of calling the function directly
    /* HtmlButton button = (HtmlButton) currentPage.getElementById("tomorrow")
       button.click() */

    // String javaScriptCode = "loadTomorrowTrain();"
    String javaScriptCode = "loadYesterdayTrain();"

    def result = currentPage.executeJavaScript(javaScriptCode)
    // Block until background JavaScript due to start within 10 seconds has finished
    webClient.waitForBackgroundJavaScriptStartingBefore(10000)
    println result.getJavaScriptResult()
    println "result: " + result

    // Try several ways of locating the loaded data
    def newpage = result.getNewPage()
    def table = newpage.getElementById("training-days")
    println table
    // Matches div elements that carry a "training-days" attribute
    def spans = currentPage.getByXPath("//div[@training-days]")
    println spans
    def spans1 = newpage.getByXPath("//div[@class='training-days']//a")
    println spans1
    def spans2 = currentPage.getByXPath("//div[@class='training-days']//a")
    println spans2
    def spans3 = currentPage.getByXPath("//table[@id='training']")
    println spans3
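
For what it's worth, here is a minimal plain-Java sketch of why that single logging line is enough. It assumes HtmlUnit's output ends up in java.util.logging (HtmlUnit logs through Apache Commons Logging, which typically falls back to java.util.logging when no other backend is on the classpath). The class name QuietLogging and its two helper methods are made up for illustration; only the java.util.logging calls are real API. Loggers inherit their effective level from the nearest configured ancestor, so switching off "com.gargoylesoftware" also silences every logger below it.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper class, not part of HtmlUnit or Geb.
final class QuietLogging {
    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a configured logger could otherwise be garbage-collected.
    private static final Logger HTMLUNIT_ROOT =
            Logger.getLogger("com.gargoylesoftware");

    private QuietLogging() {
    }

    /** Turns off the parent logger; child loggers inherit the OFF level. */
    static void silenceHtmlUnit() {
        HTMLUNIT_ROOT.setLevel(Level.OFF);
    }

    /** Returns true if even a SEVERE message on this logger would be dropped. */
    static boolean isSilenced(String loggerName) {
        return !Logger.getLogger(loggerName).isLoggable(Level.SEVERE);
    }

    public static void main(String[] args) {
        silenceHtmlUnit();
        // A logger below com.gargoylesoftware, as HtmlUnit classes would use
        System.out.println(isSilenced("com.gargoylesoftware.htmlunit.WebClient"));
        // An unrelated logger keeps the default configuration
        System.out.println(isSilenced("org.example.Other"));
    }
}
```

If you would rather keep real errors visible, setting the level to Level.SEVERE instead of Level.OFF is a reasonable compromise.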
