How can I extract JavaScript-generated data with Scrapy 1.4.0?

Apologies for my limited English. I am new to Scrapy and need help with a problem I ran into while scraping a particular website. Here is my spider:

import scrapy
from bs4 import BeautifulSoup as bs

class SomeSiteSpider(scrapy.Spider):
    name = 'somesite'

    def start_requests(self):
        urls = [
            'http://somesite.ru/proxies/'
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        token = response.css('input[name="xf0"]::attr(value)').extract_first()
        data = {
            'xpp': '4',
            'xf1': '4',
            'xf0': token,
            'xf2': '0',
            'xf4': '0'
        }
        yield scrapy.FormRequest(url='http://somesite.ru/proxies/', formdata=data, callback=self.parse_proxy, method='POST')

    def parse_proxy(self, response):
        page = bs(response.body, "html.parser")
        table = page.select('td[align="center"] > table[cellspacing="1"]')
        table = bs(str(table), 'html.parser')
        print(table.prettify())

I am aiming to extract the following information:

<font class="spy14">
  "200.200.200.200"
  <script type="text/javascript"></script>
  <font class="spy2">:</font>
  "8080"
</font>

However, the output from my spider includes some unexpected elements:

<font class="spy14">
    200.200.200.200
    <script type="text/javascript">
     document.write("<font class=spy2>:<\/font>"+(l2k1o5^f6l2)+(j0s9i9^e5z6)+(i9w3m3^s9p6)+(g7u1q7^u1j0)+(h8x4r8^n4s9))
    </script>
</font>

This site does not make use of AJAX requests.

A screenshot of the spider's output was attached to the original post.

Answer №1

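Scrapy does not execute JavaScript on its own, so anything the page writes with document.write never appears in the response your spider receives. To run the JavaScript, you need a headless browser or rendering service such as PhantomJS or Splash. Another option is Selenium, which runs the JavaScript in a real browser, but this can be more complex.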

For beginners, I suggest starting with Splash: it is well documented and integrates cleanly with Scrapy, since it was developed by the same team. The scrapy-splash plugin is a good starting point: https://github.com/scrapy-plugins/scrapy-splash
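As a rough sketch of what the integration involves, the fragment below shows the project settings that the scrapy-splash README describes. It assumes a Splash instance is already running and reachable at http://localhost:8050 (for example, started from the Splash Docker image); that address is an assumption, not something from the question.

```python
# settings.py fragment (sketch): the values follow the scrapy-splash README.
# Assumption: a Splash instance is reachable at this address.
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep cookies and compression working.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoid storing duplicate Splash arguments for each scheduled request.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering aware of the Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With these settings in place, the spider would yield scrapy_splash.SplashRequest(url, callback, args={'wait': 0.5}) instead of scrapy.Request, and the response passed to the callback should then contain the rendered HTML, including the port produced by document.write. The wait value here is a guess and may need tuning. For the POST step in the question, scrapy-splash also provides SplashFormRequest as a counterpart to FormRequest.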
