I have been tasked with developing a web scraper for a property website; the scraped data will be stored for later analysis. The site is a national platform that requires users to select a region before it displays any results, so I work around this by running a Scrapy spider with multiple start URLs, each pointing directly at one region. Because the pages are rendered with JavaScript, I use Selenium to load each page and click through the pagination until every page of a region has been scraped.

Everything runs smoothly with a single start URL, but as soon as there are several, the scraper switches to the next region (start URL) before it has finished paginating through the current one, so I end up with incomplete data.

I've searched extensively but haven't found anyone describing the same issue, so any advice or suggestions would be greatly appreciated. My code is below, followed by a sketch of one workaround I have been considering.
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

from selenium_spider.items import DemoSpiderItem
class DemoSpider(CrawlSpider):
    name = "Demo"
    allowed_domains = ['example.com']
    start_urls = [
        "http://www.example.co.uk/locationIdentifier=REGION 1234",
        "http://www.example.co.uk/property-for-sale/locationIdentifier=REGION 5678",
    ]

    def __init__(self, *args, **kwargs):
        super(DemoSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def closed(self, reason):
        # Scrapy calls this when the spider finishes; quit the browser here
        # instead of relying on __del__, which is not guaranteed to run.
        self.driver.quit()
    def parse(self, response):
        self.driver.get(response.url)
        source = 'aTest'

        while True:
            try:
                element = WebDriverWait(self.driver, 10).until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR,
                        ".pagination-button.pagination-direction.pagination-direction--next"))
                )
                # Re-parse the Selenium-rendered page on every iteration;
                # the Scrapy response never changes after the first request,
                # so selecting from it would return page one's results forever.
                sel = Selector(text=self.driver.page_source)
                result = sel.xpath('//*[@class="l-searchResults"]')
                print("Scraping page --------------->", self.driver.current_url)

                for properties in result:
                    # Relative paths (.//) keep each query scoped to this
                    # result block rather than the whole document.
                    saleOrRent = properties.xpath('.//*[@class="property-title"]/text()').extract()
                    addresses = properties.xpath('.//*[@class="property-address"]/text()').extract()
                    if saleOrRent:
                        saleOrRent = saleOrRent[0]
                        if 'for sale' in saleOrRent:
                            saleOrRent = 'For Sale'
                        elif 'to rent' in saleOrRent:
                            saleOrRent = 'To Rent'
                    for address in addresses:
                        item = DemoSpiderItem()
                        item["saleOrRent"] = saleOrRent
                        item["source"] = source
                        item["address"] = address
                        item["response"] = response
                        yield item
                element.click()
            except TimeoutException:
                # No clickable "next" button left: last page of this region.
                break
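
One workaround I have been considering, although I have not verified it, is to stop handing Scrapy all the regions up front and instead chain the requests myself: only yield the Request for the next region once the pagination loop for the current region has finished, so the shared driver can never be redirected mid-region. Here is a minimal sketch of that idea (SequentialDemoSpider, region_urls and region_index are names I made up, not from any library):

import scrapy

class SequentialDemoSpider(scrapy.Spider):
    # A plain Spider is enough here since no crawl rules are used.
    name = "DemoSequential"
    allowed_domains = ['example.com']

    # Region URLs kept out of start_urls so Scrapy never schedules more
    # than one of them at a time.
    region_urls = [
        "http://www.example.co.uk/locationIdentifier=REGION 1234",
        "http://www.example.co.uk/property-for-sale/locationIdentifier=REGION 5678",
    ]

    def start_requests(self):
        # Seed the crawl with the first region only.
        yield scrapy.Request(self.region_urls[0], callback=self.parse,
                             meta={'region_index': 0})

    def parse(self, response):
        # ... run the full Selenium pagination loop for this region here,
        # yielding items exactly as in the spider above ...

        # Only after the while loop has exhausted this region's pages,
        # queue the next region, so regions are processed strictly in order.
        next_index = response.meta['region_index'] + 1
        if next_index < len(self.region_urls):
            yield scrapy.Request(self.region_urls[next_index],
                                 callback=self.parse,
                                 meta={'region_index': next_index})

I am not sure whether this is idiomatic Scrapy, or whether simply setting CONCURRENT_REQUESTS = 1 would be enough on its own, so pointers either way would be welcome.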