Navigating through paginated content with scrapy-selenium and handling POST requests

Question

I am trying to scrape data from this website: https://www.getwines.com/category_Wine

I can scrape the first page, but I cannot navigate to the subsequent pages. I have run into two obstacles:

First, inspecting the next_page button does not give me a relative or absolute URL. Instead, its href is JavaScript:getPage(2), which cannot be followed as a link.

Second, on the first page the link of the next-page button can be selected with
```
(//table[@class='tbl_pagination']//a//@href)[11]
```
but from the 2nd page onwards the next-page button becomes the 12th item, i.e.,
```
(//table[@class='tbl_pagination']//a//@href)[12]
```

So my main question is: how can I efficiently move through ALL subsequent pages and extract the data I need?

The solution may be straightforward, but I am new to web scraping, so any input would be appreciated. The code I have used is below.

Thank you for your assistance.

```
import scrapy
from scrapy_selenium import SeleniumRequest


class WinesSpider(scrapy.Spider):
    name = 'wines'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.getwines.com/category_Wine',
            wait_time=3,
            callback=self.parse
        )

    def parse(self, response):
        products = response.xpath("(//div[@class='layMain']//tbody)[5]/tr")
        for product in products:
            yield {
                'product_name':
                product.xpath(".//a[@class='Srch-producttitle']/text()").get(),
                'product_link':
                product.xpath(".//a[@class='Srch-producttitle']/@href").get(),
                'product_actual_price':
                product.xpath(".//td//td[3]//td/span[2]/text()").get(),
                'product_price_onsale':
                product.xpath(".//td//td[3]//td/span[4]/text()").get()
            }
        # Unfinished attempt at following the next page:
        # next_page = response.xpath("(//table[@class='tbl_pagination']//a//@href)[11]").get()
        # if next_page:
        #     absolute_url = f"https://www.getwines.com/category_Wine"
```

javascript selenium post pagination scrapy

Answer 1

The revised code below addresses the question.

To summarize, I restructured the code, which resolved the issues. The key points:

First, the page source of every page is stored in a list (responses).
Second, adding "except NoSuchElementException" to the try block inside the while loop was crucial. It ends the loop cleanly on the last page, where there is no next-page button, instead of letting the spider fail.
Finally, the items are extracted by parsing the saved page sources in responses.
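
The control flow behind these points can be sketched without a browser. This is only an illustration: `FakePager`, `NoNextPage`, and `collect_pages` are hypothetical stand-ins (for the Selenium driver, `NoSuchElementException`, and the loop in `parse`), not part of Scrapy or Selenium.

```python
# Sketch of the "collect every page, stop when there is no next button" loop.
# FakePager and NoNextPage are hypothetical stand-ins so the control flow
# can run without Selenium or a network connection.


class NoNextPage(Exception):
    """Stand-in for selenium's NoSuchElementException."""


class FakePager:
    """Hypothetical driver substitute that serves a fixed list of pages."""

    def __init__(self, pages):
        self._pages = pages
        self._index = 0

    @property
    def page_source(self):
        return self._pages[self._index]

    def click_next(self):
        # Raise on the last page, like a failed element lookup would.
        if self._index + 1 >= len(self._pages):
            raise NoNextPage
        self._index += 1


def collect_pages(pager):
    """Return the source of every page, in order."""
    sources = [pager.page_source]
    while True:
        try:
            pager.click_next()
        except NoNextPage:
            break  # last page reached, stop paginating
        sources.append(pager.page_source)
    return sources


pages = collect_pages(FakePager(["<p>page 1</p>", "<p>page 2</p>", "<p>page 3</p>"]))
print(len(pages))
```

Only after the loop has finished are the collected sources parsed, which keeps the browser interaction and the item extraction separate.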

In conclusion, organizing the spider this way worked for integrating Selenium with Scrapy. However, I am new to web scraping, so any further advice on combining Selenium with Scrapy efficiently would be welcome.

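```
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class WinesSpider(scrapy.Spider):
    name = 'wines'

    responses = []

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.getwines.com/category_Wine',
            callback=self.parse
        )

    def parse(self, response):
        # scrapy-selenium exposes the live WebDriver through response.meta.
        driver = response.meta['driver']
        # Save the HTML of the page that is already loaded.
        self.responses.append(driver.page_source)
        while True:
            try:
                # Locate the ">>" button by its label, not by its position.
                next_page = driver.find_element(By.XPATH, "//b[text()= '>>']/parent::a")
                # The href is "javascript:getPage(N)"; running it loads the next page.
                href = next_page.get_attribute('href')
                driver.execute_script(href)
                # implicitly_wait() only sets a lookup timeout and does not pause,
                # so wait until the old button is detached from the DOM instead.
                # The 10 second limit is an assumed value.
                WebDriverWait(driver, 10).until(EC.staleness_of(next_page))
                self.responses.append(driver.page_source)
            except (NoSuchElementException, TimeoutException):
                # No ">>" button (last page) or the page never changed: stop.
                break

        # Parse every saved page source and yield one item per product row.
        for resp in self.responses:
            r = Selector(text=resp)
            products = r.xpath("(//div[@class='layMain']//tbody)[5]/tr")
            for product in products:
                yield {
                    'product_name':
                    product.xpath(".//a[@class='Srch-producttitle']/text()").get(),
                    'product_link':
                    product.xpath(".//a[@class='Srch-producttitle']/@href").get(),
                    'product_actual_price':
                    product.xpath(".//span[@class='RegularPrice']/text()").get(),
                    'product_price_onsale':
                    product.xpath(".//td//td[3]//td/span[4]/text()").get()
                }
```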
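
The XPath in the loop finds the next-page button by its `>>` label rather than by its position, which is why it is not affected by the shift from the 11th to the 12th link described in the question. A minimal sketch of that idea follows. It makes two assumptions: the pagination markup is a simplified, hypothetical version of the real page, and the standard library's `xml.etree.ElementTree` (with its limited XPath subset) is swapped in for Selenium so the example runs without a browser. `next_href` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup; the real site's HTML may differ.
# Page 2 has an extra "<<" link, so the ">>" link moves one position down.
PAGE_1 = """<table class="tbl_pagination"><tr><td>
<a href="javascript:getPage(1)">1</a>
<a href="javascript:getPage(2)">2</a>
<a href="javascript:getPage(2)"><b>&gt;&gt;</b></a>
</td></tr></table>"""

PAGE_2 = """<table class="tbl_pagination"><tr><td>
<a href="javascript:getPage(1)"><b>&lt;&lt;</b></a>
<a href="javascript:getPage(1)">1</a>
<a href="javascript:getPage(2)">2</a>
<a href="javascript:getPage(3)"><b>&gt;&gt;</b></a>
</td></tr></table>"""


def next_href(markup):
    """Return the href of the link whose <b> child reads '>>', or None."""
    root = ET.fromstring(markup)
    # ElementTree's XPath subset: an <a> that has a <b> child with text '>>'.
    link = root.find(".//a[b='>>']")
    return None if link is None else link.get("href")


print(next_href(PAGE_1))
print(next_href(PAGE_2))
```

Because the lookup depends only on the label, the same expression works on every page, and a missing match (here `None`, in Selenium a `NoSuchElementException`) marks the last page.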
