Managing JavaScript with Scrapy

Spider for reference:

import scrapy
from script.items import ScriptItem

class RunSpider(scrapy.Spider):
    name = "run"
    allowed_domains = ["stopitrightnow.com"]
    start_urls = (
        'http://www.stopitrightnow.com/',
    )

    def parse(self, response):
        for widget in response.xpath('//div[@class="shopthepost-widget"]'):
            item = ScriptItem()
            # Collect the href of every link inside the widget.
            item['url'] = widget.xpath('.//a/@href').extract()
            yield item

Running this code in the terminal produces the following output:

2015-08-21 14:23:51 [scrapy] DEBUG: Scraped from <200 http://www.stopitrightnow.com/>
{'url': []}
<div class="shopthepost-widget" data-widget-id="708473">
<script type="text/javascript">!function(d,s,id){var e, p = /^http:/.test(d.location) ? 'http' : 'https';if(!d.getElementById(id)) {e = d.createElement(s);e.id = id;e.src = p + '://' + 'widgets.rewardstyle.com' + '/js/shopthepost.js';d.body.appendChild(e);}if(typeof window.__stp === 'object') if(d.readyState === 'complete') {window.__stp.init();}}(document, 'script', 'shopthepost-script');</script><br>

This is the HTML structure:

<div class="shopthepost-widget" data-widget-id="708473" data-widget-uid="1"><div id="stp-55d44feabd0eb" class="stp-outer stp-no-controls">
    <a class="stp-control stp-left stp-hidden">&lt;</a>
    <div class="stp-inner" style="width: auto">
        <div class="stp-slide" style="left: -0%">
                        <a href="http://rstyle.me/iA-n/zzhv34c_" target="_blank" rel="nofollow" class="stp-product " data-index="0" style="margin: 0 0px 0 0px">
                <span class="stp-help"></span>
                <img src="//images.rewardstyle.com/img?v=2.13&amp;p=n_24878713">
                            </a>
                        <a href="http://rstyle.me/iA-n/zzhvw4c_" target="_blank" rel="nofollow" class="stp-product " data-index="1" style="margin: 0 0px 0 0px">
                <span class="stp-help"></span>
                <img src="//images.rewardstyle.com/img?v=2.13&amp;p=n_24878708">

The problem appears to be that the links are inserted by the JavaScript in the widget, which Scrapy does not execute. Although Scrapy cannot run JavaScript, there might still be a way to get at those links. I have looked into Selenium but have not managed to get it working.

Any assistance or guidance is greatly appreciated.

Answer №1

I managed to solve the issue using ScrapyJS.

Simply follow the setup instructions in the official documentation, and check out this helpful answer.
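
For orientation, the project-side part of that setup comes down to a few lines in settings.py. The following is only a sketch: it assumes a Splash instance is already running and reachable at localhost on port 8050 (the address is an assumption, use wherever your Splash actually listens), and it uses the module paths of the scrapyjs package. The package was later renamed scrapy-splash, where the paths differ, so check the documentation for the version you install.

```python
# settings.py (sketch; the values below are assumptions, adjust to your setup)

# Address of the running Splash instance (assumed: local machine, port 8050).
SPLASH_URL = 'http://localhost:8050'

# Enable the ScrapyJS middleware, which sends requests that carry
# meta['splash'] through Splash for rendering.
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

# Duplicate filter that takes the Splash arguments into account.
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
```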

This is the test spider code that I implemented:

# -*- coding: utf-8 -*-
import scrapy


class TestSpider(scrapy.Spider):
    name = "run"
    allowed_domains = ["stopitrightnow.com"]
    start_urls = (
        'http://www.stopitrightnow.com/',
    )

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        for widget in response.xpath('//div[@class="shopthepost-widget"]'):
            # print() with a single argument behaves the same on Python 2 and 3.
            print(widget.xpath('.//a/@href').extract())

And here is the console output:

[u'http://rstyle.me/iA-n/7bk8r4c_', u'http://rstyle.me/iA-n/7bk754c_', u'http://rstyle.me/iA-n/6th5d4c_', u'http://rstyle.me/iA-n/7bm3s4c_', u'http://rstyle.me/iA-n/2xeat4c_', u'http://rstyle.me/iA-n/7bi7f4c_', u'http://rstyle.me/iA-n/66abw4c_', u'http://rstyle.me/iA-n/7bm4j4c_']
[u'http://rstyle.me/iA-n/zzhv34c_', u'http://rstyle.me/iA-n/zzhvw4c_', u'http://rstyle.me/iA-n/zwuvk4c_', u'http://rstyle.me/iA-n/zzhvr4c_', u'http://rstyle.me/iA-n/zzh9g4c_', u'http://rstyle.me/iA-n/zzhz54c_', u'http://rstyle.me/iA-n/zwuuy4c_', u'http://rstyle.me/iA-n/zzhx94c_']

Answer №2

As an alternative to rendering the JavaScript, as in Alecxe's answer, you can inspect where the page loads its content from and replicate those requests yourself (see this SO thread for more insight).

In this particular scenario, we discover the following: https://i.sstatic.net/4hq90.png

For <div class="shopthepost-widget" data-widget-id="708473">, the JavaScript embeds the URL "widgets.rewardstyle.com/stps/708473.html".

You can handle this yourself by requesting these URLs manually:

# Methods of the spider class; assumes "import scrapy" at the top of the module.
def parse(self, response):
    for widget in response.xpath('//div[@class="shopthepost-widget"]'):
        widget_id = widget.xpath('@data-widget-id').extract()[0]
        widget_url = "http://widgets.rewardstyle.com/stps/{id}.html".format(id=widget_id)
        # Request the widget page directly instead of relying on JavaScript.
        yield scrapy.Request(widget_url, callback=self.parse_widget)

def parse_widget(self, response):
    # contains() is used because the class attribute has a trailing space.
    for link in response.xpath('//a[contains(@class, "stp-product")]'):
        item = JavasItem()  # Name recommended by the author, see comments below
        item['link'] = link.xpath("@href").extract()
        yield item

    # Proceed with any additional actions desired on the opened page.

If you need to keep the association between these widgets and their corresponding post/article, pass that information along in the request using meta.

UPDATE: parse_widget() has been revised. It now uses contains() to match the class, because the class attribute has a trailing space. Alternatively, you could use a CSS selector.
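
To see why the trailing space matters, here is a small standard-library sketch (not Scrapy code) that compares an exact comparison of the class attribute with a match on whitespace-separated class names, which is how a CSS class selector behaves. The HTML fragment is adapted from the widget markup in the question; the class and function names are hypothetical. In Scrapy itself, the CSS route would look something like response.css('a.stp-product::attr(href)').

```python
from html.parser import HTMLParser

# Fragment adapted from the widget markup; note the trailing space in class.
WIDGET_HTML = '''
<div class="stp-slide">
  <a href="http://rstyle.me/iA-n/zzhv34c_" class="stp-product " data-index="0"></a>
  <a href="http://rstyle.me/iA-n/zzhvw4c_" class="stp-product " data-index="1"></a>
  <a class="stp-control stp-left stp-hidden">&lt;</a>
</div>
'''


class ProductLinkParser(HTMLParser):
    """Collects the href of <a> tags whose class matches "stp-product"."""

    def __init__(self, exact):
        super().__init__()
        self.exact = exact
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        css_class = attributes.get("class") or ""
        if self.exact:
            # Like @class="stp-product": the whole attribute must be equal.
            matched = css_class == "stp-product"
        else:
            # Like a CSS class selector: match one whitespace-separated name.
            matched = "stp-product" in css_class.split()
        if matched and attributes.get("href"):
            self.links.append(attributes["href"])


def product_links(html, exact=False):
    parser = ProductLinkParser(exact)
    parser.feed(html)
    return parser.links


print(product_links(WIDGET_HTML, exact=True))
print(product_links(WIDGET_HTML))
```

The exact comparison finds nothing because "stp-product " and "stp-product" differ by one space, while the token-based match returns both product links. Keep in mind that XPath contains() is a substring test, so it would also match a class such as "stp-product-old"; a CSS selector is stricter in that respect.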
