I am currently attempting to scrape data from a specific website, https://www.getwines.com/category_Wine.
While I have been successful in scraping the initial page, I am encountering difficulties navigating to subsequent pages. There are a couple of obstacles that I have come across:
Upon inspecting the next_page button, I am unable to retrieve a relative or absolute URL. Instead, I am presented with JavaScript:getPage(2), which does not allow me to follow the link. The href of the next page button can be selected with (//table[@class='tbl_pagination']//a//@href)[11] when on the first page. However, from the 2nd page onwards, the next page button becomes the 12th item, i.e., (//table[@class='tbl_pagination']//a//@href)[12].
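For reference, here is a minimal illustration of how the page number could be pulled out of that javascript: href; the regex pattern is just my assumption based on the value shown above.

import re

# Example href copied from the next-page button on the first page.
href = "JavaScript:getPage(2)"

# Assumed pattern: the target page number sits inside getPage(...).
match = re.search(r"getPage\((\d+)\)", href, flags=re.IGNORECASE)
if match:
    next_page_number = int(match.group(1))  # -> 2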
Consequently, my main question is: how can I efficiently proceed to ALL subsequent pages and extract the necessary data?
Solving this issue may be straightforward, but as I am a novice in web scraping, any input would be highly valued. Please find below the code I have used.
Thank you for your assistance.
import scrapy
from scrapy_selenium import SeleniumRequest


class WinesSpider(scrapy.Spider):
    name = 'wines'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.getwines.com/category_Wine',
            wait_time=3,
            callback=self.parse
        )

    def parse(self, response):
        products = response.xpath("(//div[@class='layMain']//tbody)[5]/tr")
        for product in products:
            yield {
                'product_name':
                    product.xpath(".//a[@class='Srch-producttitle']/text()").get(),
                'product_link':
                    product.xpath(".//a[@class='Srch-producttitle']/@href").get(),
                'product_actual_price':
                    product.xpath(".//td//td[3]//td/span[2]/text()").get(),
                'product_price_onsale':
                    product.xpath(".//td//td[3]//td/span[4]/text()").get()
            }

        # next_page = response.xpath("(//table[@class='tbl_pagination']//a//@href)[11]").get()
        # if next_page:
        #     absolute_url = f"https://www.getwines.com/category_Wine"
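For what it is worth, below is a rough, untested sketch of the direction I have been considering: parse the page number out of the javascript: href and re-request the same URL with scrapy-selenium's script argument, so the browser runs the site's own getPage() call. Whether the new products are already present in the page source when the response is captured is an assumption on my part (an explicit wait might still be needed).

import re
from scrapy_selenium import SeleniumRequest

# Sketch of a parse method for the spider above (product extraction omitted).
def parse(self, response):
    # ... yield the product dictionaries exactly as in the code above ...

    # Collect every pagination href instead of relying on the 11th/12th position.
    hrefs = response.xpath("//table[@class='tbl_pagination']//a/@href").getall()
    page_numbers = []
    for href in hrefs:
        m = re.search(r"getPage\((\d+)\)", href, flags=re.IGNORECASE)
        if m:
            page_numbers.append(int(m.group(1)))

    current = response.meta.get("page", 1)
    if current + 1 in page_numbers:
        yield SeleniumRequest(
            url="https://www.getwines.com/category_Wine",
            wait_time=3,
            script=f"getPage({current + 1})",  # run the site's own pagination JS
            callback=self.parse,
            meta={"page": current + 1},
            dont_filter=True,  # same URL every time, so bypass the duplicate filter
        )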