I am currently facing an issue while scraping dynamic data from a website. The PageSource I obtain using the get() method seems to be incomplete, unlike when viewing directly from Chrome or Firefox browsers. I am seeking a solution that will allow me to fully scrape all the data from the page.
My project requires scraping programmatically with a .NET web browser control or a similar tool. I have tried Selenium WebDriver 2.48.2 with ChromeDriver, PhantomJSDriver, WebClient, and HttpWebRequest, all on .NET 4.6.1.
The URL in question:
None of the following approaches have proven successful...
Attempt #1: Using HttpWebRequest
// Runs inside an async method; `url` and `cookies` are supplied by the caller.
var urlContent = "";
try
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.CookieContainer = new CookieContainer();
    if (cookies != null)
    {
        foreach (Cookie cookie in cookies)
        {
            request.CookieContainer.Add(cookie);
        }
    }

    var responseTask = Task.Factory.FromAsync<WebResponse>(request.BeginGetResponse, request.EndGetResponse, null);
    using (var response = (HttpWebResponse)await responseTask)
    {
        // Guard against a null collection, matching the check above.
        if (cookies != null && response.Cookies != null)
        {
            foreach (Cookie cookie in response.Cookies)
            {
                cookies.Add(cookie);
            }
        }

        using (var sr = new StreamReader(response.GetResponseStream()))
        {
            urlContent = sr.ReadToEnd();
        }
    }
}
catch (WebException)
{
    // Error handling omitted.
    throw;
}
Attempt #2: Utilizing WebClient
// Requires an enclosing async method.
using (var client = new WebClient())
{
    var content = await client.DownloadStringTaskAsync(url);
    return content;
}
Attempt #3: Implementing PhantomJSDriver
var driverService = PhantomJSDriverService.CreateDefaultService();
driverService.HideCommandPromptWindow = true;
using (var driver = new PhantomJSDriver(driverService))
{
    driver.Navigate().GoToUrl(url);
    WaitForAjax(driver);
    string source = driver.PageSource;
    return source;
}

public static void WaitForAjax(PhantomJSDriver driver)
{
    while (true) // Handle timeout somewhere
    {
        // jQuery.active is the number of AJAX requests still in flight.
        var ajaxIsComplete = (bool)((IJavaScriptExecutor)driver).ExecuteScript("return jQuery.active == 0");
        if (ajaxIsComplete)
            break;
        Thread.Sleep(100);
    }
}
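The `while (true)` loop in WaitForAjax has no timeout, so it never returns if the condition is never met. A minimal sketch of a bounded polling helper follows; the names `Polling` and `WaitUntil` are hypothetical and not part of Selenium, and this is only one possible way to handle the timeout.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Polling
{
    // Polls `condition` every `interval` until it returns true or `timeout` elapses.
    // Returns true if the condition was met, false if the timeout expired first.
    public static bool WaitUntil(Func<bool> condition, TimeSpan timeout, TimeSpan interval)
    {
        var stopwatch = Stopwatch.StartNew();
        while (true)
        {
            if (condition())
                return true;
            if (stopwatch.Elapsed >= timeout)
                return false;
            Thread.Sleep(interval);
        }
    }
}
```

With the driver from Attempt #3, the call might look like `Polling.WaitUntil(() => (bool)((IJavaScriptExecutor)driver).ExecuteScript("return jQuery.active == 0"), TimeSpan.FromSeconds(30), TimeSpan.FromMilliseconds(100));`, where the 30-second limit is an arbitrary choice.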
I also tried ChromeDriver with a page object model approach. That code is too long to paste here, but it produced the same result as the three attempts above.
Expected Results
I expect to retrieve the complete data table from the URL, with nothing missing. The screenshot below shows an example for comparison with the actual results further down. Note the actual data where the ellipsis ('...') is marked. This can be verified by viewing the page source in Firefox or Chrome.
https://i.sstatic.net/ebqio.png
Actual Results
In the screenshot below, note the large gap marked by the arrow, at the spot corresponding to the '...'. Numerous rows of content should appear there. The discrepancy is the same across all of the attempts above.
https://i.sstatic.net/xwv3M.png
Please bear in mind that the URL serves dynamic data, so results may not match the screenshots exactly. Even so, a quick comparison of line counts in the page source shows nearly twice as many rows in the "complete" HTML.
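The line-count check can be sketched as below. `PageSourceComparer` and `CountLines` are hypothetical names used only for illustration; the idea is to feed it the HTML saved from the browser and the HTML returned by each attempt, then compare the two numbers.

```csharp
using System;

public static class PageSourceComparer
{
    // Counts the lines in an HTML string, treating \r\n, \n, and \r as line breaks.
    // Returns 0 for null or empty input.
    public static int CountLines(string html)
    {
        if (string.IsNullOrEmpty(html))
            return 0;
        return html.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None).Length;
    }
}
```

For example, comparing `PageSourceComparer.CountLines(driver.PageSource)` against the count for a copy saved from Chrome's "View page source" should expose the gap described above.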