Scraping data without a browser: Selenium, PhantomJS, and Geb

I need to extract data from a website on a weekly basis. The data only becomes visible after clicking on the page, which triggers a JavaScript function. It is then loaded into a table that can be identified by its unique ID. The script has to run on a server without browser support. Below is my code snippet using Geb:

    @Grab("org.gebish:geb-core:0.13.1")
    @Grab("org.seleniumhq.selenium:selenium-firefox-driver:2.52.0")
    @Grab("org.seleniumhq.selenium:selenium-support:2.52.0")
    @GrabExclude('org.codehaus.groovy:groovy-all')

    import geb.Browser

    Browser.drive {
        // driver.webClient.javaScriptEnabled = true
        go "mysite"
        js.loadWeekData()
        println $("div.data-listing").text()
    }

I've researched this extensively but couldn't find a solution for headless scraping with JavaScript support. This is the step recorded by Selenium IDE:

    driver.findElement(By.linkText("Next")).click();

Unfortunately, I faced difficulties trying to integrate PhantomJS with Geb.

Edit 1: Below is the error message from PhantomJS:

    java.lang.NoClassDefFoundError: org/openqa/selenium/browserlaunchers/Proxies

I tried to sort out the version compatibility issues but was unable to resolve it. The code I tried:

    @Grab("org.gebish:geb-core:0.13.1")
    @Grab("org.seleniumhq.selenium:selenium-firefox-driver:2.52.0")
    @Grab("org.seleniumhq.selenium:selenium-support:2.52.0")
    @Grab("com.codeborne:phantomjsdriver:1.3.0")
    import org.openqa.selenium.By
    import org.openqa.selenium.WebDriver
    import org.openqa.selenium.WebElement
    import org.openqa.selenium.phantomjs.PhantomJSDriver

    WebDriver driver = new PhantomJSDriver();

    // Load Google.com
    driver.get("http://www.google.com");
    // Locate the Search field on the Google page
    WebElement element = driver.findElement(By.name("q"));

In summary, I am looking for a way to run the first script in headless mode, if possible without installing Xvfb. A solution in Groovy or Java would be ideal.

Answer №1

In the end I solved this with HtmlUnit. The script below still needs some tidying up, but it works.

HtmlUnit's main annoyance, its stream of warnings and errors, is taken care of by the logging setting near the top of the script.
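As a side note before the script itself, here is a minimal plain-Java sketch of why a single logging line can be enough. It assumes HtmlUnit's log output ends up in java.util.logging (which is what Commons Logging falls back to when no other backend is on the classpath; worth verifying in your setup). Loggers in java.util.logging form a hierarchy by name, so switching off "com.gargoylesoftware" also silences every logger below it that has no level of its own. The class and method names here are made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class QuietHtmlUnitLogging {
    // Keep a strong reference: java.util.logging only holds loggers weakly,
    // so a configured logger that nothing else references may be collected
    // and its level setting lost.
    private static final Logger HTMLUNIT_ROOT =
            Logger.getLogger("com.gargoylesoftware");

    /** Turns off this logger and every descendant that has no level of its own. */
    static void silence() {
        HTMLUNIT_ROOT.setLevel(Level.OFF);
    }

    /** Reports whether the named logger would still emit a record at the given level. */
    static boolean wouldLog(String loggerName, Level level) {
        return Logger.getLogger(loggerName).isLoggable(level);
    }

    public static void main(String[] args) {
        // A logger name of the kind HtmlUnit classes use (package + class name).
        String child = "com.gargoylesoftware.htmlunit.WebClient";
        System.out.println("loggable before: " + wouldLog(child, Level.SEVERE));
        silence();
        System.out.println("loggable after:  " + wouldLog(child, Level.SEVERE));
    }
}
```

Holding the logger in a static field is a deliberate choice; the one-liner in the script works the same way but relies on the logger not being garbage collected before HtmlUnit starts logging.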

    @Grab(group='net.sourceforge.htmlunit', module='htmlunit', version='2.21')

    import com.gargoylesoftware.htmlunit.AlertHandler;
    import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
    import com.gargoylesoftware.htmlunit.Page;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlButton;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import java.util.logging.Level;

    // Silence HtmlUnit's warnings and errors
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);

    WebClient webClient = new WebClient();
    webClient.waitForBackgroundJavaScriptStartingBefore(10000);
    HtmlPage currentPage = webClient.getPage("mysite");
    /* HtmlButton button = (HtmlButton) currentPage.getElementById("tomorrow");
       button.click(); */

    // Call the page's own JavaScript function directly instead of clicking the button
    //String javaScriptCode = "loadTomorrowTrain();";
    String javaScriptCode = "loadYesterdayTrain();";

    def result = currentPage.executeJavaScript(javaScriptCode);
    //def result = page.executeJavaScript(javaScriptCode);
    // Block until background JavaScript scheduled to start within 10 seconds has finished
    webClient.waitForBackgroundJavaScriptStartingBefore(10000);
    println result.getJavaScriptResult();
    println "result: " + result

    // Try several ways of locating the loaded data and print what each one finds
    def newpage = result.getNewPage()
    def table = result.getNewPage().getElementById("training-days");
    println table
    // Note: [@training-days] matches div elements that have an attribute named training-days
    def spans = currentPage.getByXPath("//div[@training-days]");
    println spans
    def spans1 = newpage.getByXPath("//div[@class='training-days']//a");
    println spans1
    def spans2 = currentPage.getByXPath("//div[@class='training-days']//a");
    println spans2
    def spans3 = currentPage.getByXPath("//table[@id='training']");
    println spans3
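
The XPath queries at the end of the script mix two kinds of predicate, and it may help to see how they differ. The sketch below uses only the JDK's javax.xml.xpath on a small piece of hypothetical markup invented for illustration (the real page is not shown in the post, and HtmlUnit evaluates XPath against its own HTML DOM rather than this parser). `[@training-days]` tests whether an attribute with that name exists, while `[@class='training-days']` and `[@id='training']` test an attribute's value.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathPredicateDemo {
    // Hypothetical, well-formed sample markup (an assumption, not the real site).
    static final String SAMPLE =
          "<html><body>"
        + "<div class='training-days'><a href='#mon'>Mon</a><a href='#tue'>Tue</a></div>"
        + "<table id='training'><tr><td>row</td></tr></table>"
        + "</body></html>";

    /** Returns how many nodes in SAMPLE the given XPath expression selects. */
    static int count(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(SAMPLE)),
                    XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Attribute-existence test: no div here has an attribute *named* training-days.
        System.out.println(count("//div[@training-days]"));            // 0
        // Attribute-value tests: match on the value of class or id.
        System.out.println(count("//div[@class='training-days']//a")); // 2
        System.out.println(count("//table[@id='training']"));          // 1
    }
}
```

If the first query in the script comes back empty on the real page, this difference is one possible reason.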
