What is the best way to download the entire page source, rather than just a portion of it

Question

I am having trouble scraping dynamic data from a website. The PageSource I obtain using the get() method is incomplete compared with the source shown when I view the page directly in Chrome or Firefox. I am looking for a solution that lets me scrape all the data from the page.

My project requires me to scrape programmatically using a .NET web browser or a similar tool. I have tried several methods, including Selenium WebDriver 2.48.2 with ChromeDriver, PhantomJSDriver, WebClient, and HttpWebRequest, all on .NET 4.6.1.

The URL in question:

None of the following approaches have proven successful...

Attempt #1: Using HttpWebRequest

    var urlContent = "";

    try
    {
        var request = (HttpWebRequest) WebRequest.Create(url);
        request.CookieContainer = new CookieContainer();
        if (cookies != null)
        {
            foreach (Cookie cookie in cookies)
            {
                request.CookieContainer.Add(cookie);
            }
        }

        var responseTask = Task.Factory.FromAsync<WebResponse>(request.BeginGetResponse,request.EndGetResponse,null);

        using (var response = (HttpWebResponse)await responseTask)
        {

            if (response.Cookies != null)
            {
                foreach (Cookie cookie in response.Cookies)
                {
                    cookies.Add(cookie);
                }
            }

            using (var sr = new StreamReader(response.GetResponseStream()))
            {
                urlContent = sr.ReadToEnd();
            }
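        }
    }
    catch (WebException)
    {
        // Assumed: the original snippet was cut off before the try block closed.
        throw;
    }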

Attempt #2: Utilizing WebClient

    // async method signature required
    using (WebClient client = new WebClient())
    {
        var task = await client.DownloadStringTaskAsync(url);

        return task;
    }

Attempt #3: Implementing PhantomJSDriver

    var driverService = PhantomJSDriverService.CreateDefaultService();
    driverService.HideCommandPromptWindow = true;
    using (var driver = new PhantomJSDriver(driverService))
    {
        driver.Navigate().GoToUrl(url);

        WaitForAjax(driver);

        string source = driver.PageSource;

        return source;
    }

    public static void WaitForAjax(PhantomJSDriver driver)
    {
        while (true) // Handle timeout somewhere
        {
            var ajaxIsComplete = (bool)(driver as IJavaScriptExecutor).ExecuteScript("return jQuery.active == 0");
            if (ajaxIsComplete)
                break;
            Thread.Sleep(100);
        }
    }
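
The "Handle timeout somewhere" comment above could be addressed with a bounded polling loop. The following is a minimal sketch, not the asker's code: the `Polling` class and `WaitUntil` helper are hypothetical names, and the lambda in `Main` is a stand-in for the `jQuery.active == 0` check so that the example runs without a browser.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Polling
{
    // Hypothetical helper: polls `condition` until it returns true or `timeout` elapses.
    // Returns true if the condition was met, false if the timeout was reached first.
    public static bool WaitUntil(Func<bool> condition, TimeSpan timeout, TimeSpan pollInterval)
    {
        var stopwatch = Stopwatch.StartNew();
        while (stopwatch.Elapsed < timeout)
        {
            if (condition())
                return true;
            Thread.Sleep(pollInterval);
        }
        // One last check so a condition that became true during the final sleep is not missed.
        return condition();
    }

    public static void Main()
    {
        // Stand-in for the AJAX check: becomes true on the third poll.
        int polls = 0;
        bool completed = WaitUntil(() => ++polls >= 3,
                                   TimeSpan.FromSeconds(5),
                                   TimeSpan.FromMilliseconds(100));
        Console.WriteLine(completed);

        // A condition that never becomes true gives up once the timeout elapses.
        bool gaveUp = !WaitUntil(() => false,
                                 TimeSpan.FromMilliseconds(300),
                                 TimeSpan.FromMilliseconds(100));
        Console.WriteLine(gaveUp);
    }
}
```

With a real driver, the condition would presumably wrap the same `ExecuteScript("return jQuery.active == 0")` call used in WaitForAjax. Note that `jQuery.active` only counts requests issued through jQuery, so this check may report completion while content loaded by other means is still pending.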

I also attempted to use ChromeDriver with a page object model approach. The length of this code prevents pasting here; however, it yielded the same outcome as the previous 3 attempts.

Expected Results

I expect to retrieve the complete data table from the URL, with nothing missing. The screenshot below shows the complete source for comparison. Note that it contains actual data rather than an ellipsis ('...'). This can be verified by viewing the page source in Firefox or Chrome.

https://i.sstatic.net/ebqio.png

Actual Results

Notice the large gap, marked by an arrow in the screenshot, where only '...' appears. Numerous rows of content should appear there instead. The discrepancy is the same across all of the attempts above.

https://i.sstatic.net/xwv3M.png

Please bear in mind that the URL serves dynamic data, so results may not match the screenshots exactly. Nonetheless, a quick comparison of line counts in the page source shows nearly twice as many rows in the "complete" HTML.

javascript c# selenium

Answer 1

Sure thing, happy to assist! :)

May I ask where you sourced the C# code snippet from? Specifically the line urlContent = sr.ReadToEnd(); How exactly are you extracting and copying the output from this? Could it possibly be the object inspector within the debugger that is causing truncation? Have you attempted retrieving the content from urlContent and saving it to a file? For example, using

    System.IO.File.WriteAllText(@"temp.txt", urlContent);
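
To make that check concrete, here is a minimal sketch that writes the string to disk and compares its in-memory length with what was actually saved. The `SourceDump` class and `WriteAndMeasure` helper are hypothetical names, and the generated rows are placeholder data standing in for the downloaded `urlContent`.

```csharp
using System;
using System.IO;
using System.Text;

public static class SourceDump
{
    // Hypothetical helper: writes content to a file and returns
    // the number of characters read back from disk.
    public static int WriteAndMeasure(string path, string content)
    {
        File.WriteAllText(path, content);
        return File.ReadAllText(path).Length;
    }

    public static void Main()
    {
        // Placeholder for the downloaded page; in the question this would be urlContent.
        var builder = new StringBuilder();
        for (int i = 0; i < 10000; i++)
            builder.Append("<tr><td>row ").Append(i).Append("</td></tr>\n");
        string urlContent = builder.ToString();

        string path = Path.Combine(Path.GetTempPath(), "temp.txt");
        int lengthOnDisk = WriteAndMeasure(path, urlContent);

        // If these two numbers match, the string itself is not truncated,
        // and any cut-off text is likely a display limit in the debugger.
        Console.WriteLine("In memory: {0} chars, on disk: {1} chars",
                          urlContent.Length, lengthOnDisk);
        Console.WriteLine("Lines: {0}", urlContent.Split('\n').Length);
    }
}
```

If the lengths match and the file still lacks the expected rows, the missing data was probably never in the response, which would point toward content loaded after the initial request.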
