I am new to Scrapy and want to scrape some datasets for a data mining project from "". My crawler can already extract data using XPath and CSS selectors, but I am stuck on a tabbed table whose content is populated by JavaScript. The XPath is identical for every tab, so I cannot tell the tabs apart when extracting. Specifically, I need the stock gain percentage from each tab, which sits in the 5th row of the last column, as shown in this image of the tabbed element.
How can I extract data that is generated by JavaScript rather than present in the static HTML? Most solutions I have found online involve JSON, which I am not familiar with, so if there is a way to extract the data from each tab without using JSON, I would appreciate guidance on that as well. My current spider is below.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class NewsItem(scrapy.Item):
    # Every key assigned in parse_news must be declared as a Field,
    # otherwise scrapy.Item raises KeyError on assignment.
    name = scrapy.Field()
    time1 = scrapy.Field()
    news1 = scrapy.Field()


class StationDetailSpider(CrawlSpider):
    name = 'test2'
    start_urls = ["http://www.moneycontrol.com/india/stockpricequote/"]
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='bl_12']"), follow=False, callback='parse_news'),
        Rule(LinkExtractor(allow=r"/diversified/.*$"), callback='parse_news'),
    )

    def parse_news(self, response):
        item = NewsItem()
        # The selector is the same for every tab: 5th row, 4th column of the table.
        NEWS1_SELECTOR = 'div#disp_nse_hist tr:nth-child(5) > td:nth-child(4)::text'
        TIME1_SELECTOR = 'div#disp_nse_hist tr:nth-child(5) > td:nth-child(4)::text'
        NAME_SELECTOR = 'div#disp_nse_hist tr:nth-child(5) > td:nth-child(4)::text'
        print("------------------------------------starting extraction------------")
        item['name'] = response.css(NAME_SELECTOR).extract_first()
        item['time1'] = response.css(TIME1_SELECTOR).extract_first()
        item['news1'] = response.css(NEWS1_SELECTOR).extract()
        return item
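
To make the problem concrete, here is a minimal self-contained sketch of what I think is happening. The HTML in it is made up (I am assuming the tab container arrives empty in the raw response and is only filled in later by JavaScript), and it uses the standard library's ElementTree as a stand-in for Scrapy's selectors so it runs without a crawl. The names `RAW`, `build_rendered`, and `gain_cell` are hypothetical helpers, not part of my spider.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: what the server might send before any JavaScript runs.
RAW = '<html><body><div id="disp_nse_hist"></div></body></html>'


def build_rendered(gain):
    """Hypothetical markup: the same container after JavaScript filled one tab."""
    rows = "".join(
        "<tr><td>r{0}</td><td>x</td><td>y</td><td>{1}</td></tr>".format(
            i, gain if i == 5 else "z"
        )
        for i in range(1, 6)
    )
    return (
        '<html><body><div id="disp_nse_hist"><table>'
        + rows
        + "</table></div></body></html>"
    )


# Rough XPath equivalent of
# 'div#disp_nse_hist tr:nth-child(5) > td:nth-child(4)' for this markup.
CELL_XPATH = ".//div[@id='disp_nse_hist']//tr[5]/td[4]"


def gain_cell(html):
    """Return the text of the 5th-row, 4th-column cell, or None if it is absent."""
    cell = ET.fromstring(html).find(CELL_XPATH)
    return None if cell is None else cell.text


raw_result = gain_cell(RAW)                            # container is empty
rendered_result = gain_cell(build_rendered("12.5%"))   # container is filled
print(raw_result, rendered_result)
```

If this assumption holds, the selector itself is fine and the issue is that the response Scrapy receives never contains the per-tab rows.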