What is the best way to extract text from HTML generated by JavaScript?

As a newcomer to Scrapy, I am looking to scrape some datasets for a data mining project. My current crawler extracts data using XPath and CSS selectors, but I have run into a problem with a tabbed table whose content is populated by JavaScript. The XPath is the same for every tab, so I cannot address the data in each tab individually. Specifically, I need to fetch the stock gain percentage from each tab, which sits in the 5th row of the last column of the tabbed element.

I am comfortable scraping data with XPath and CSS, but data generated by JavaScript is a challenge. How can I extract it? Also, most solutions online involve JSON, which I am not familiar with, so guidance on extracting the data from each tab without JSON would be appreciated.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NewsItem(scrapy.Item):
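    name = scrapy.Field()
    # parse_news also assigns these keys; an undeclared field raises KeyError
    time1 = scrapy.Field()
    news1 = scrapy.Field()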

class StationDetailSpider(CrawlSpider):
    name = 'test2'
    start_urls = ["http://www.moneycontrol.com/india/stockpricequote/"]
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='bl_12']"), follow=False, callback='parse_news'),
        Rule(LinkExtractor(allow=r"/diversified/.*$"), callback='parse_news')
    )

    def parse_news(self, response):
        item = NewsItem()
        NEWS1_SELECTOR = 'div#disp_nse_hist tr:nth-child(5) > td:nth-child(4)::text'
        TIME1_SELECTOR = 'div#disp_nse_hist tr:nth-child(5) > td:nth-child(4)::text'
        NAME_SELECTOR = 'div#disp_nse_hist tr:nth-child(5) > td:nth-child(4)::text'

        print("------------------------------------starting extraction------------")
        item['name'] = response.css(NAME_SELECTOR).extract_first()
        item['time1'] = response.css(TIME1_SELECTOR).extract_first()
        item['news1'] = response.css(NEWS1_SELECTOR).extract()
        
        return item

Answer №1

If you need to render JavaScript-based websites when using Scrapy, take a look at Splash. It is a rendering service that executes the page's JavaScript and returns the resulting HTML, which you can then query with your usual XPath and CSS selectors.
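As a rough sketch of what wiring Splash into a project can look like, the settings below assume the scrapy-splash plugin is installed and that a Splash instance is listening on localhost port 8050 (both are assumptions, not something stated in the question). The setting names and priorities follow the plugin's documented setup.

```python
# settings.py fragment, assuming the scrapy-splash plugin is installed
# and a Splash instance is reachable at this address (assumed).
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep compression handling after it.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoid storing duplicate Splash arguments with every queued request.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make request de-duplication and the HTTP cache aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

With this in place, a spider would typically yield `SplashRequest(url, callback, args={'wait': 0.5})` from `scrapy_splash` instead of a plain request, and the callback receives the rendered HTML.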

Alternatively, you can write your own downloader middleware that uses Selenium to load each page. See the related question "How to write customize Downloader Middleware for selenium and Scrapy?"

I hope you find this information useful!

Answer №2

If you're interested, I found some helpful information on AJAX scraping here.

Scraping AJAX pages involves extracting content that loads dynamically without refreshing the entire page.

For instance, each time you switch timeframes (week, month, year) on the page in question, the browser sends a separate request for that tab's data. The request URL includes 3 query parameters, the last two being the company ID and the historical pricing range in days. Opening that URL directly in a browser shows what it returns.

Armed with this understanding, adjusting your spider to scrape such data should be straightforward.
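A minimal sketch of that idea follows. The endpoint, the parameter names, and the range values are all hypothetical placeholders; the real ones have to be copied from the request shown in the browser's network tab when a tab is clicked. It only builds one URL per tab, keeping the company ID and the range in days as the last two parameters.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; replace it with the URL seen in the network tab.
AJAX_ENDPOINT = "https://example.com/stock/hist_data"

# Assumed historical ranges in days, one entry per tab.
RANGES_IN_DAYS = {"week": 7, "month": 30, "year": 365}


def build_tab_urls(company_id, section="nse"):
    """Return one request URL per tab for the given company ID.

    The parameter names (section, sc_id, range) are illustrative only.
    """
    urls = {}
    for label, days in RANGES_IN_DAYS.items():
        query = urlencode({"section": section, "sc_id": company_id, "range": days})
        urls[label] = f"{AJAX_ENDPOINT}?{query}"
    return urls


if __name__ == "__main__":
    # "ABC01" is a made-up company ID used only for demonstration.
    for label, url in build_tab_urls("ABC01").items():
        print(label, url)
```

Inside the spider, each of these URLs could be passed to `scrapy.Request(url, callback=...)`. If the endpoint returns an HTML fragment rather than JSON, the existing `tr:nth-child(5) > td:nth-child(4)::text` style of selector should still apply to the response, so no JSON handling would be needed.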
