How to extract dynamic content efficiently by combining Selenium and Scrapy for multiple initial URLs

Question

I have been asked to build a web scraper for a property website; the data will be stored for later analysis. The site is a national platform that requires users to select a region before it shows any results. To get around this, I wrote a Scrapy spider with multiple start URLs, each pointing directly at one of the regions I want. Because the site is rendered with JavaScript, I use Selenium to load each page and click through the pagination until all the data for that region has been scraped.

Everything works when there is only one start URL. With multiple start URLs, the scraper starts out fine but switches to the next region (start URL) before it has finished the current one, so the extracted data is incomplete. I have searched extensively and have not found anyone describing the same problem. Any advice or suggestions would be greatly appreciated. An example of the code is below:

from scrapy.spiders import CrawlSpider
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

from selenium_spider.items import DemoSpiderItem


class DemoSpider(CrawlSpider):
    name = "Demo"
    allowed_domains = ['example.com']
    start_urls = ["http://www.example.co.uk/locationIdentifier=REGION    1234",
                  "http://www.example.co.uk/property-for-sale/locationIdentifier=REGION    5678"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One browser instance is shared by every start URL.
        self.driver = webdriver.Firefox()

    def __del__(self):
        self.driver.quit()

    def parse(self, response):
        self.driver.get(response.url)

        result = response.xpath('//*[@class="l-searchResults"]')
        source = 'aTest'
        while True:
            try:
                # Wait up to 10 seconds for the "next" pagination button.
                element = WebDriverWait(self.driver, 10).until(
                    EC.element_to_be_clickable((
                        By.CSS_SELECTOR,
                        ".pagination-button.pagination-direction.pagination-direction--next")))
                print("Scraping new site --------------->", result)
                print("This is the result----------->", result)
                for properties in result:
                    saleOrRent = properties.xpath('//*[@class = "property-title"]/text()').extract()
                    addresses = properties.xpath('//*[@class="property-address"]/text()').extract()
                    if saleOrRent:
                        saleOrRent = saleOrRent[0]
                        if 'for sale' in saleOrRent:
                            saleOrRent = 'For Sale'
                        elif 'to rent' in saleOrRent:
                            saleOrRent = 'To Rent'
                for a in addresses:
                    item = DemoSpiderItem()
                    address = a
                    item["saleOrRent"] = saleOrRent
                    item["source"] = source
                    item["address"] = address
                    item["response"] = response
                    yield item
                element.click()
            except TimeoutException:
                # No clickable "next" button appeared in time: stop paginating.
                break

javascript selenium webdriver scrapy

Answer 1

After some experimenting, I found that this is simpler than I first thought. Put only the first URL in start_urls, keep the remaining URLs in a separate list, and use a counter to pick the next URL from that list when you issue a new request with parse as its callback.

This lets you control exactly when the next URL is loaded, for example once the current one stops returning results. The only downside is that the URLs are processed sequentially rather than concurrently.

The code below shows the approach:

import scrapy
from scrapy.http import Request
from selenium import webdriver
from selenium.webdriver.common.by import By

from products_scraper.items import ProductItem


class ProductsSpider(scrapy.Spider):
    name = "products_spider"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/first']

    # URLs to visit one at a time once the start URL has been exhausted.
    manual_urls = [
        'http://www.example.com/second',
        'http://www.example.com/third'
    ]
    # Position of the next URL to request from manual_urls.
    manual_url_index = 0

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        hasPostings = True

        while hasPostings:
            next_link = self.driver.find_element(By.XPATH, '//dd[@class="next-page"]/a')

            try:
                next_link.click()
                self.driver.set_script_timeout(30)
                products = self.driver.find_elements(By.CSS_SELECTOR, '.products-list article')

                if len(products) == 0:
                    # No more results here: request the next manual URL, if any.
                    if self.manual_url_index < len(self.manual_urls):
                        next_url = self.manual_urls[self.manual_url_index]
                        self.manual_url_index += 1
                        yield Request(next_url, callback=self.parse)

                    hasPostings = False

                for product in products:
                    item = ProductItem()
                    # store product info here
                    yield item

            except Exception as e:
                print(str(e))
                break

    def closed(self, reason):
        # Scrapy calls this when the spider finishes; shut the browser down.
        self.driver.quit()
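
The ordering this approach relies on can be checked without a browser or Scrapy. The sketch below is a pure-Python simulation, not the spider itself: FAKE_SITE, SequentialSpider, the Request dataclass, and run are hypothetical names made up for illustration, and run is only a simplified stand-in for Scrapy's engine. Under those assumptions, it shows that when parse yields the request for the next URL only after the current region's pages are exhausted, the items come out region by region.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical stand-in for the site: each region URL maps to its result pages.
FAKE_SITE = {
    "region-a": [["a1", "a2"], ["a3"]],
    "region-b": [["b1"], ["b2", "b3"]],
    "region-c": [["c1"]],
}


@dataclass
class Request:
    """Minimal stand-in for a Scrapy request: just the URL to visit next."""
    url: str


class SequentialSpider:
    def __init__(self, first_url, manual_urls, site):
        self.start_url = first_url
        self.manual_urls = list(manual_urls)
        self.manual_url_index = 0  # counter into manual_urls
        self.site = site

    def parse(self, url):
        # Walk every page of this region first (the pagination loop).
        for page in self.site[url]:
            for listing in page:
                yield {"region": url, "listing": listing}
        # Region exhausted: only now hand back a request for the next URL.
        if self.manual_url_index < len(self.manual_urls):
            next_url = self.manual_urls[self.manual_url_index]
            self.manual_url_index += 1
            yield Request(next_url)


def run(spider):
    """Follow requests in the order they are yielded and collect the items."""
    pending = deque([Request(spider.start_url)])
    items = []
    while pending:
        request = pending.popleft()
        for output in spider.parse(request.url):
            if isinstance(output, Request):
                pending.append(output)
            else:
                items.append(output)
    return items


if __name__ == "__main__":
    demo = SequentialSpider("region-a", ["region-b", "region-c"], FAKE_SITE)
    for item in run(demo):
        print(item["region"], item["listing"])
```

Because only one request is ever pending at a time, no second region can start while the first is still being paginated, which is the trade-off of sequential processing mentioned above.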
