How to extract dynamic content efficiently by combining Selenium and Scrapy for multiple initial URLs

I've been asked to build a web scraper for a property website; the collected data will be stored for later analysis. The site is a national platform that makes users pick a region before it shows any results. To get around this, I wrote a Scrapy spider with multiple start URLs, each pointing directly at one of the regions I want. Because the site is rendered with JavaScript, I use Selenium to load each page and click through the pagination until all the data for a region has been scraped.

Everything works with a single start URL. With multiple start URLs, the scraper starts out fine but switches to the next region (the next start URL) before it has finished paginating through the current one, so the data for each region is incomplete.

I've searched extensively but haven't found anyone describing the same problem. Any advice would be greatly appreciated. The code is below:

from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

from selenium_spider.items import DemoSpiderItem


class DemoSpider(CrawlSpider):
    name = "Demo"
    allowed_domains = ['example.com']
    start_urls = [
        "http://www.example.co.uk/locationIdentifier=REGION    1234",
        "http://www.example.co.uk/property-for-sale/locationIdentifier=REGION    5678",
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One browser instance is shared by every parse() call.
        self.driver = webdriver.Firefox()

    def closed(self, reason):
        # Called by Scrapy when the spider finishes; shut the browser down.
        self.driver.quit()

    def parse(self, response):
        self.driver.get(response.url)
        source = 'aTest'

        while True:
            try:
                # Wait until the "next page" button can be clicked.
                element = WebDriverWait(self.driver, 10).until(
                    EC.element_to_be_clickable((
                        By.CSS_SELECTOR,
                        ".pagination-button.pagination-direction.pagination-direction--next",
                    ))
                )
            except TimeoutException:
                # No clickable "next" button within 10 seconds: stop paginating.
                break

            # Parse the page as rendered by Selenium rather than the static
            # Scrapy response, so each click yields the new page's results.
            selector = Selector(text=self.driver.page_source)
            result = selector.xpath('//*[@class="l-searchResults"]')
            print("Scraping new page --------------->", result)

            for properties in result:
                # Leading "." keeps the XPath relative to this result block.
                titles = properties.xpath('.//*[@class="property-title"]/text()').extract()
                addresses = properties.xpath('.//*[@class="property-address"]/text()').extract()

                sale_or_rent = titles[0] if titles else ''
                if 'for sale' in sale_or_rent:
                    sale_or_rent = 'For Sale'
                elif 'to rent' in sale_or_rent:
                    sale_or_rent = 'To Rent'

                for address in addresses:
                    item = DemoSpiderItem()
                    item["saleOrRent"] = sale_or_rent
                    item["source"] = source
                    item["address"] = address
                    item["response"] = response
                    yield item

            element.click()

Answer №1

After experimenting a bit, I found this is simpler than I first thought. Put only the first URL in start_urls, keep the remaining URLs in a separate list, and use a counter to pick the next URL from that list when you issue a new request with the parse function as its callback.

This lets you control exactly when the next URL is loaded: you only request it once the current URL stops returning results. The one downside is that the URLs are processed sequentially, but sometimes that's just how it goes!

Check out the code snippet below for reference:

import scrapy
from scrapy.http.request import Request
from selenium import webdriver
from selenium.webdriver.common.by import By

from products_scraper.items import ProductItem


class ProductsSpider(scrapy.Spider):
    name = "products_spider"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/first']

    # URLs to visit, in order, after the start URL has been exhausted.
    manual_urls = [
        'http://www.example.com/second',
        'http://www.example.com/third',
    ]
    # Index of the next entry in manual_urls to request.
    manual_url_index = 0

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def closed(self, reason):
        # Called by Scrapy when the spider finishes; shut the browser down.
        self.driver.quit()

    def parse(self, response):
        self.driver.get(response.url)

        has_postings = True

        while has_postings:
            next_link = self.driver.find_element(By.XPATH, '//dd[@class="next-page"]/a')

            try:
                next_link.click()
                self.driver.set_script_timeout(30)
                products = self.driver.find_elements(By.CSS_SELECTOR, '.products-list article')

                if len(products) == 0:
                    # No more results for this URL: move on to the next
                    # manual URL, if there is one left.
                    if self.manual_url_index < len(self.manual_urls):
                        next_url = self.manual_urls[self.manual_url_index]
                        self.manual_url_index += 1
                        yield Request(next_url, callback=self.parse)

                    has_postings = False

                for product in products:
                    item = ProductItem()
                    # store product info here
                    yield item

            except Exception as e:
                print(str(e))
                break
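
To see the counter idea in isolation, here is a minimal sketch that leaves Scrapy and Selenium out entirely. Everything in it is hypothetical and only for illustration: crawl_sequentially, fake_fetch_page and the in-memory FAKE_SITE stand in for the spider, the browser and the real site. It only shows that the URL index advances once a URL returns an empty page, which is the same rule the spider above applies.

```python
from typing import Callable, Iterator, List, Tuple


def crawl_sequentially(
    urls: List[str],
    fetch_page: Callable[[str, int], List[str]],
) -> Iterator[Tuple[str, str]]:
    """Yield (url, record) pairs, finishing each URL before starting the next."""
    url_index = 0  # plays the role of manual_url_index in the spider
    while url_index < len(urls):
        url = urls[url_index]
        page_number = 1
        while True:
            records = fetch_page(url, page_number)
            if not records:
                # Empty page: pagination for this URL is exhausted.
                break
            for record in records:
                yield url, record
            page_number += 1
        # Advance to the next URL only after the current one is done.
        url_index += 1


# Hypothetical in-memory "site": each URL maps to its list of result pages.
FAKE_SITE = {
    "http://www.example.com/first": [["a1", "a2"], ["a3"]],
    "http://www.example.com/second": [["b1"]],
}


def fake_fetch_page(url: str, page_number: int) -> List[str]:
    """Return the records on the given 1-based page, or [] past the last page."""
    pages = FAKE_SITE.get(url, [])
    return pages[page_number - 1] if page_number <= len(pages) else []


results = list(
    crawl_sequentially(
        ["http://www.example.com/first", "http://www.example.com/second"],
        fake_fetch_page,
    )
)
for url, record in results:
    print(url, record)
```

Every record from the first URL comes out before any record from the second, which is the ordering the question was missing. In the real spider, fetching a page corresponds to clicking the next link in Selenium and reading the rendered results.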
