Navigating through paginated content with scrapy-selenium and handling POST requests

I am currently attempting to scrape data from a specific website: https://www.getwines.com/category_Wine

While I have been successful in scraping the initial page, I am encountering difficulties navigating to subsequent pages. There are a couple of obstacles that I have come across:

  1. Upon inspecting the next_page button, I am unable to retrieve a relative or absolute URL. Instead, I am presented with JavaScript:getPage(2), which does not allow me to follow links.

  2. The link for the next page button can be selected with

    (//table[@class='tbl_pagination']//a//@href)[11]

    on the first page. From the 2nd page onwards, however, the next page button becomes the 12th item, i.e.,

    (//table[@class='tbl_pagination']//a//@href)[12]

So my main question is: how can I efficiently step through ALL subsequent pages and extract the necessary data?

Solving this issue may be straightforward, but as a novice in web scraping, any input would be highly valued. Please find below the code I have used.

Thank you for your assistance.

import scrapy
from scrapy_selenium import SeleniumRequest


class WinesSpider(scrapy.Spider):
    name = 'wines'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.getwines.com/category_Wine',
            wait_time=3,
            callback=self.parse
        )

    def parse(self, response):
        products = response.xpath("(//div[@class='layMain']//tbody)[5]/tr")
        for product in products:
            yield {
                'product_name':
                product.xpath(".//a[@class='Srch-producttitle']/text()").get(),
                'product_link':
                product.xpath(".//a[@class='Srch-producttitle']/@href").get(),
                'product_actual_price':
                product.xpath(".//td//td[3]//td/span[2]/text()").get(),
                'product_price_onsale':
                product.xpath(".//td//td[3]//td/span[4]/text()").get()
            }

        # Unfinished attempt at following the next page:
        # next_page = response.xpath("(//table[@class='tbl_pagination']//a//@href)[11]").get()
        # if next_page:
        #     absolute_url = f"'https://www.getwines.com/category_Wine"

Answer №1

Below is the revised code to address the question:

To summarize, I made structural changes to the code which resolved the issues it was facing. Here are some key points:

  1. The page source of every page is first collected in a list (responses).
  2. Adding "except NoSuchElementException" to the try block inside the while loop was crucial. It ends the loop cleanly on the last page, where there is no next-page link, instead of letting the spider fail.
  3. The saved page sources (responses) are then parsed one by one to extract the product data.

In conclusion, organizing your Scrapy code in this manner is effective for integrating Selenium with Scrapy. However, being new to Web Scraping, any further advice on efficiently combining Selenium with Scrapy would be welcomed.
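The pattern behind points 1 and 2 can be tried without a browser. The sketch below is only an illustration: FakeDriver, go_to_next_page and NoSuchElement are hypothetical stand-ins for the Selenium driver, the ">>" click and NoSuchElementException, not part of Scrapy or Selenium.

```python
class NoSuchElement(Exception):
    """Hypothetical stand-in for selenium's NoSuchElementException."""


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver over a fixed list of pages."""

    def __init__(self, pages):
        self._pages = pages
        self._index = 0

    @property
    def page_source(self):
        return self._pages[self._index]

    def go_to_next_page(self):
        # Mimics "find the >> link and run its script"; raises on the last page.
        if self._index + 1 >= len(self._pages):
            raise NoSuchElement("no next-page link")
        self._index += 1


def collect_pages(driver):
    """Return the source of every page, starting from the current one."""
    sources = [driver.page_source]
    while True:
        try:
            driver.go_to_next_page()
        except NoSuchElement:
            break  # last page reached: stop instead of crashing
        sources.append(driver.page_source)
    return sources


pages = collect_pages(FakeDriver(["<p>page 1</p>", "<p>page 2</p>", "<p>page 3</p>"]))
print(len(pages))
```

Collecting first and parsing afterwards keeps the browser navigation separate from the data extraction, which is the structure the spider follows.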

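# -*- coding: utf-8 -*-
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


class WinesSpider(scrapy.Spider):
    name = 'wines'

    # Page sources collected while stepping through the pagination
    responses = []

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.getwines.com/category_Wine',
            callback=self.parse
        )

    def parse(self, response):
        # scrapy-selenium exposes the Selenium driver through the response meta
        driver = response.meta['driver']
        initial_page = driver.page_source
        self.responses.append(initial_page)
        while True:
            try:
                # Locate the ">>" link by its label, so its position in the pagination does not matter
                next_page = driver.find_element(By.XPATH, "//b[text()= '>>']/parent::a")
                # The href is a "javascript:getPage(n)" snippet, so execute it instead of following it
                href = next_page.get_attribute('href')
                driver.execute_script(href)
                # Note: implicitly_wait only sets the timeout used for element lookups
                driver.implicitly_wait(2)
                self.responses.append(driver.page_source)

            except NoSuchElementException:
                # No ">>" link left: the last page has been reached
                break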

        for resp in self.responses:
            r = Selector(text=resp)
            products = r.xpath("(//div[@class='layMain']//tbody)[5]/tr")
            for product in products:
                yield {
                    'product_name':
                    product.xpath(".//a[@class='Srch-producttitle']/text()").get(),
                    'product_link':
                    product.xpath(".//a[@class='Srch-producttitle']/@href").get(),
                    'product_actual_price':
                    product.xpath(".//span[@class='RegularPrice']/text()").get(),
                    'product_price_onsale':
                    product.xpath(".//td//td[3]//td/span[4]/text()").get()
                }
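Why selecting the ">>" link by its label avoids the 11th/12th index problem from the question can be shown with the standard library alone. This is a rough sketch: the PAGE_1 and PAGE_2 markup is a simplified, hypothetical version of the pagination table (the real page may differ), and next_page_number is a helper invented for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup; the real page may differ.
PAGE_1 = """
<table class="tbl_pagination"><tr><td>
  <a href="javascript:getPage(1)">1</a>
  <a href="javascript:getPage(2)">2</a>
  <a href="javascript:getPage(2)"><b>&gt;&gt;</b></a>
</td></tr></table>
"""
PAGE_2 = """
<table class="tbl_pagination"><tr><td>
  <a href="javascript:getPage(1)"><b>&lt;&lt;</b></a>
  <a href="javascript:getPage(1)">1</a>
  <a href="javascript:getPage(2)">2</a>
  <a href="javascript:getPage(3)"><b>&gt;&gt;</b></a>
</td></tr></table>
"""


def next_page_number(markup):
    """Return the page number behind the '>>' link, or None if there is none."""
    root = ET.fromstring(markup)
    # Pick the <a> whose <b> child reads ">>", wherever it sits among the links.
    link = root.find(".//a[b='>>']")
    if link is None:
        return None
    match = re.search(r"getPage\((\d+)\)", link.get("href", ""))
    return int(match.group(1)) if match else None


print(next_page_number(PAGE_1), next_page_number(PAGE_2))
```

In the sample markup the ">>" link is the 3rd link on the first page and the 4th on the second, yet the same expression finds it in both cases. When the link is absent the helper returns None, which corresponds to the NoSuchElementException case in the spider.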
