Looking to extract data from JavaScript using Scrapy 1.4.0?

Question

Apologies for my English. I am new to Scrapy and ran into a problem while scraping a particular website. Here is my spider:

import scrapy
from bs4 import BeautifulSoup as bs

class SomeSiteSpider(scrapy.Spider):
    name = 'somesite'

    def start_requests(self):
        urls = [
            'http://somesite.ru/proxies/'
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        token = response.css('input[name="xf0"]::attr(value)').extract_first()
        data = {
            'xpp': '4',
            'xf1': '4',
            'xf0': token,
            'xf2': '0',
            'xf4': '0'
        }
        yield scrapy.FormRequest(url='http://somesite.ru/proxies/', formdata=data, callback=self.parse_proxy, method='POST')

    def parse_proxy(self, response):
        page = bs(response.body, "html.parser")
        table = page.select('td[align="center"] > table[cellspacing="1"]')
        table = bs(str(table), 'html.parser')
        print(table.prettify())

I am aiming to extract the following information:

<font class="spy14">
  "200.200.200.200"
  <script type="text/javascript"></script>
  <font class="spy2">:</font>
  "8080"
</font>

However, the output from my spider includes some unexpected elements:

<font class="spy14">
    200.200.200.200
    <script type="text/javascript">
     document.write("<font class=spy2>:<\/font>"+(l2k1o5^f6l2)+(j0s9i9^e5z6)+(i9w3m3^s9p6)+(g7u1q7^u1j0)+(h8x4r8^n4s9))
    </script>
</font>

This site does not make use of AJAX requests.

For reference, here is an image showing the output generated by the spider: Spider Output Picture

Answer 1

Scrapy does not execute JavaScript itself, so the response you get back is the raw HTML with the script still unevaluated. To run the JavaScript you need a headless browser or rendering service such as PhantomJS or Splash. Another option is to drive a real browser with Selenium, but that is more complex to set up.

For beginners, I suggest starting with Splash: it is well documented and integrates cleanly with Scrapy, since it was developed by the same team. A good starting point is the scrapy-splash plugin: https://github.com/scrapy-plugins/scrapy-splash
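
To make the Splash suggestion concrete, here is a minimal sketch that only builds the URL for Splash's `render.html` HTTP endpoint, which returns the page's HTML after its scripts have run. It assumes a Splash instance is already running at `http://localhost:8050` (an assumption, adjust it to your setup), and `splash_render_url` is a hypothetical helper name, not part of Scrapy or Splash. The `url`, `wait`, `http_method` and `body` arguments are Splash rendering parameters; whether the site accepts the form token when the POST comes from Splash rather than from your spider is untested here.

```python
from urllib.parse import urlencode

# Assumption: Splash is reachable here; change host/port to match your setup.
SPLASH_RENDER_HTML = 'http://localhost:8050/render.html'


def splash_render_url(page_url, form=None, wait=0.5):
    """Build a render.html URL asking Splash to load page_url and run its scripts.

    form -- optional dict of form fields; if given, Splash is asked to POST it.
    wait -- seconds Splash should wait after loading so document.write() can run.
    """
    args = {'url': page_url, 'wait': wait}
    if form is not None:
        # Ask Splash to send the form as the body of a POST request.
        args['http_method'] = 'POST'
        args['body'] = urlencode(form)
    return SPLASH_RENDER_HTML + '?' + urlencode(args)


if __name__ == '__main__':
    # 'TOKEN' stands in for the xf0 value extracted in parse().
    form = {'xpp': '4', 'xf1': '4', 'xf0': 'TOKEN', 'xf2': '0', 'xf4': '0'}
    print(splash_render_url('http://somesite.ru/proxies/', form=form))
```

In the spider above, `parse` could then yield `scrapy.Request(splash_render_url(url, form=data), callback=self.parse_proxy)` instead of the `FormRequest`, so that `parse_proxy` receives the rendered HTML. The scrapy-splash plugin linked above wraps the same endpoints in a `SplashRequest` class and is likely the more convenient route once it is configured.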
