Downloading a file with HtmlUnit from a JavaScript link is proving quite challenging. The process starts at this page. In a regular browser, clicking the "Authenticate with Java Web Start (new method)" link downloads a .jnlp file, which launches a Java program window for authentication. Once authentication succeeds, the original browser window loads the information to be scraped.
The source code snippet for the link on the starting page looks like this:
<tr>
<!-- onClick="return launchWebStart('authenticate');" -->
<td><a href="javascript:void(0)" id="webstart-authenticate" ><font size="5">Authenticate with Java Web Start (new method)</font></a>
</tr>
The JavaScript file that handles the click can be found here. It encodes a cookie, appends it to a URL, and requests the .jnlp file. Emulating this process directly would go against the advice in the HtmlUnit documentation, which recommends interacting with page elements as a user would.
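For reference, the "encode a cookie, append it to a URL" step described above can be sketched in plain Java as follows. This is only an illustration under assumptions: the class name `JnlpUrlSketch`, the helper `buildJnlpUrl`, the parameter name `session`, and the example.com URL are all hypothetical, and the script's real encoding scheme is not shown here, so percent-encoding is used as a stand-in.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class JnlpUrlSketch {

    // Hypothetical helper: percent-encodes a cookie string and appends it
    // to the base URL as a query parameter. The real script may encode differently.
    static String buildJnlpUrl(String baseUrl, String paramName, String cookie) {
        try {
            String encoded = URLEncoder.encode(cookie, "UTF-8");
            // Use '&' if the base URL already carries a query string
            String separator = baseUrl.contains("?") ? "&" : "?";
            return baseUrl + separator + paramName + "=" + encoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder values, not taken from the real site
        System.out.println(buildJnlpUrl("https://example.com/app.jnlp", "session", "a=b c"));
    }
}
```

A sketch like this only shows the shape of the request the script builds; it is not a substitute for letting HtmlUnit execute the page's own JavaScript.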
The problem in HtmlUnit is that clicking the anchor element does not return the expected .jnlp file. The approaches from the following related questions have been tried: "HtmlUnit and JavaScript in links" and "HtmlUnit to invoke javascript from href to download a file".
One of the suggested implementations that was tried is shown below:
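import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Test {
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
        // Open the starting webpage
        HtmlPage page = webClient.getPage("https://ppair.uspto.gov/TruePassWebStart/AuthenticationChooser.html");
        // Locate the Web Start link by its id and click it
        String linkID = "webstart-authenticate";
        HtmlAnchor anchor = (HtmlAnchor) page.getElementById(linkID);
        Page p = anchor.click();
        // Print whatever the click returned, byte by byte
        InputStream is = p.getWebResponse().getContentAsStream();
        int b = 0;
        while ((b = is.read()) != -1) {
            System.out.print((char) b);
        }
        webClient.close();
    }
}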
However, running this code prints the HTML of the initial page instead of the expected .jnlp file. The JavaScript WebConsole also prints status updates, which suggests that the functions in the separate WebStart.js file are being executed.
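To tell the two outcomes apart programmatically rather than by reading the printed output, a small check along these lines could be used. It is a minimal sketch under assumptions: the class `JnlpCheckSketch` and the helper `looksLikeJnlp` are hypothetical names, and the inputs are assumed to come from `p.getWebResponse().getContentType()` and `p.getWebResponse().getContentAsString()`. The check relies on `application/x-java-jnlp-file` being the usual MIME type for JNLP files and on a JNLP descriptor containing a `<jnlp` root element.

```java
import java.util.Locale;

final class JnlpCheckSketch {

    // The MIME type normally used when serving JNLP descriptors
    static final String JNLP_MIME = "application/x-java-jnlp-file";

    // Hypothetical helper: true if the response looks like a JNLP descriptor,
    // judged first by content type and then by the presence of a <jnlp element.
    static boolean looksLikeJnlp(String contentType, String body) {
        if (contentType != null
                && contentType.toLowerCase(Locale.ROOT).startsWith(JNLP_MIME)) {
            return true;
        }
        return body != null && body.contains("<jnlp");
    }

    public static void main(String[] args) {
        // Placeholder inputs standing in for a real response
        System.out.println(looksLikeJnlp("text/html", "<html><body></body></html>"));
        System.out.println(looksLikeJnlp(JNLP_MIME, "<jnlp spec=\"1.0+\"></jnlp>"));
    }
}
```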
An alternative approach using a CollectingAttachmentHandler, as outlined here, was also attempted:
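import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.attachment.Attachment;
import com.gargoylesoftware.htmlunit.attachment.CollectingAttachmentHandler;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Test2 {
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
        // Open the starting webpage
        HtmlPage page = webClient.getPage("https://ppair.uspto.gov/TruePassWebStart/AuthenticationChooser.html");
        String linkID = "webstart-authenticate";
        HtmlAnchor anchor = (HtmlAnchor) page.getElementById(linkID);
        // Register a handler that collects attachment responses
        CollectingAttachmentHandler attachmentHandler = new CollectingAttachmentHandler();
        webClient.setAttachmentHandler(attachmentHandler);
        // Note: this passes the page returned by click() to the handler directly,
        // so that page is itself stored as a collected attachment
        attachmentHandler.handleAttachment(anchor.click());
        List<Attachment> attachments = attachmentHandler.getCollectedAttachments();
        // Print the content of every collected attachment
        for (Attachment attachment : attachments) {
            Page attachedPage = attachment.getPage();
            WebResponse attachmentResponse = attachedPage.getWebResponse();
            String content = attachmentResponse.getContentAsString();
            System.out.println(content);
        }
        webClient.close();
    }
}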
As with the first attempt, this code prints the content of the initial page rather than the desired file. Nothing has worked so far, so any guidance or suggestions on how to obtain the .jnlp file would be appreciated.