What is the best way to extract text from HTML generated by JavaScript?

Question

I am new to Scrapy and want to scrape some datasets for a data mining project from "". My crawler already extracts data with XPath and CSS selectors, but I am stuck on a tabbed table whose content is populated by JavaScript. The XPath is identical for every tab, so I cannot address the tabs individually. Specifically, I need the stock gain percentage from each tab, which sits in the 5th row of the last column, as shown in this image of the tabbed element.

I am comfortable scraping data using XPath and CSS methods, but extracting data that is generated by JavaScript poses a challenge. How can I achieve this? Additionally, if there is a way to extract data from each tab without using JSON (which I am not familiar with), please provide guidance as most solutions online involve JSON.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

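class NewsItem(scrapy.Item):
    # Every key assigned in parse_news must be declared here,
    # otherwise Scrapy raises KeyError on assignment.
    name = scrapy.Field()
    time1 = scrapy.Field()
    news1 = scrapy.Field()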

class StationDetailSpider(CrawlSpider):
    name = 'test2'
    start_urls = ["http://www.moneycontrol.com/india/stockpricequote/"]
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='bl_12']"), follow=False, callback='parse_news'),
        Rule(LinkExtractor(allow=r"/diversified/.*$"), callback='parse_news'),
    )

    def parse_news(self, response):
        item = NewsItem()
        NEWS1_SELECTOR = 'div#disp_nse_hist tr:nth-child(5) > td:nth-child(4)::text'
        TIME1_SELECTOR = 'div#disp_nse_hist tr:nth-child(5) > td:nth-child(4)::text'
        NAME_SELECTOR = 'div#disp_nse_hist tr:nth-child(5) > td:nth-child(4)::text'

        print("------------------------------------starting extraction------------")
        item['name'] = response.css(NAME_SELECTOR).extract_first()
        item['time1'] = response.css(TIME1_SELECTOR).extract_first()
        item['news1'] = response.css(NEWS1_SELECTOR).extract()
        
        return item

javascript json xpath web-scraping scrapy

Answer №1

To render JavaScript-driven pages from Scrapy, take a look at Splash. It is a rendering service that executes the page's JavaScript and returns the resulting HTML, so your existing XPath and CSS selectors can run against the rendered page.
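As a rough illustration of the Splash route: one common way to connect Scrapy to Splash is the scrapy-splash plugin, which the answer does not name, so treat that choice as an assumption. The fragment below sketches the project settings the plugin typically needs. The Splash address is also an assumption (a local instance on port 8050), and newer plugin versions may configure request de-duplication differently, so check the plugin's README for your version.

```python
# settings.py (fragment): a sketch of wiring Scrapy to Splash via scrapy-splash.
# Assumption: a Splash instance is reachable at localhost:8050.
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep cookies and compression working.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoid storing duplicate Splash arguments with every request.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering aware of the Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With settings along these lines, a spider would yield `SplashRequest(url, callback, args={"wait": 1})` (imported from `scrapy_splash`) instead of a plain request, and the callback receives the rendered HTML.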

Alternatively, you can write your own downloader middleware that hands requests to Selenium and returns the rendered page to Scrapy. This related question covers the approach: "How to write customize Downloader Middleware for selenium and Scrapy?"

I hope you find this information useful!

Answer №2

If you're interested, I found some helpful information on AJAX scraping here.

Scraping AJAX pages involves extracting content that loads dynamically without refreshing the entire page.

Follow the instructions given there. For instance, when you switch timeframes (week, month, year) on the page in question, the browser sends a request to a specific URL:

The URL takes three query parameters; the last two are the company ID and the historical pricing range in days. Open the link in a browser to see what it returns.
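To make the idea concrete, here is a minimal sketch of building one such URL per tab. The endpoint and the parameter names (`section`, `company_id`, `days`) are hypothetical placeholders, not the site's real ones; the real values are whatever the browser's network tab shows when you click a tab.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the URL seen in the browser's network tab.
HISTORY_ENDPOINT = "https://example.com/stock/history"


def build_history_url(company_id, days, section="nse"):
    """Return the AJAX URL for one company and one time range in days.

    All three parameter names are illustrative assumptions.
    """
    params = {"section": section, "company_id": company_id, "days": days}
    return f"{HISTORY_ENDPOINT}?{urlencode(params)}"


def history_urls(company_id, ranges=(7, 30, 365)):
    """Build one URL per tab (here: week, month, year)."""
    return [build_history_url(company_id, days) for days in ranges]


# "ABC01" is a made-up company ID used only for demonstration.
for url in history_urls("ABC01"):
    print(url)
```

In a spider, each of these URLs could be passed to `scrapy.Request(url, callback=...)`, so every tab becomes its own response and the shared XPath is no longer a problem.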

Armed with this understanding, adjusting your spider to scrape such data should be straightforward.
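As a sketch of the extraction step, assuming the AJAX response is an HTML fragment containing the same table that the tab displays (this is an assumption; inspect the actual response first), the value in the 5th row of the last column can be read as shown below. The sample markup is invented for the demonstration, and the standard-library parser is used only to keep the sketch self-contained. Inside a Scrapy callback the usual `response.css(...)` or `response.xpath(...)` calls would do the same job.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per table row
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def last_cell_of_row(html, row):
    """Return the text of the last cell in the given 1-based row."""
    parser = TableCells()
    parser.feed(html)
    parser.close()
    return parser.rows[row - 1][-1]


# Invented sample: six rows, two columns, gain percentage in the last column.
SAMPLE = "<table>" + "".join(
    f"<tr><td>r{i}</td><td>{i}.5%</td></tr>" for i in range(1, 7)
) + "</table>"

print(last_cell_of_row(SAMPLE, 5))
```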
