Attempting to extract JavaScript URLs using scraping methods, however, receiving an empty string when utilizing

Question

I need help accessing and extracting data from a URL that is embedded within a specific tag. The tag in question looks like this:

<script src="http://includes.mpt-static.com/data/7CE5047496" type="text/javascript" charset="utf-8"></script>

So far, I have attempted to use Selenium to open the URL, but it just returns an empty string. It seems that when I manually click on the source URL, a page opens displaying a table of the desired data. However, pasting the URL directly into a browser results in an empty response. Additionally, each time I refresh the page, a new source URL is generated. Can someone explain why this behavior is occurring?

The URL in question is: view-source:

Below is the relevant portion of my code:

import time
from fake_useragent import UserAgent
import urllib2
import csv
import requests
from bs4 import BeautifulSoup
import json
from selenium import webdriver

#FAKE-USER_AGENT
ua = UserAgent(cache = False)
headers = {'User-Agent': ua.random}


#SENDING REQUEST TO PRICETRACKER WEBSITE
product = 'B00N2BW2PK'
page = requests.get('http://www.mypricetrack.com/amazon/'+str(product), headers = headers)
soup = BeautifulSoup(page.text)
#print(soup.prettify())

#GETTING URL FOR DATA
data_link = []
for tag in soup.findAll('script',{'charset':'utf-8'}):
    data_link = data_link + [tag['src']]
string2 = data_link[1]
print string2
#OPENING URL FOR DATA

driver = webdriver.Firefox()
driver.get(string2)
time.sleep(5)
htmlSource = driver.page_source
print htmlSource

javascript selenium selenium-webdriver web-scraping

Answer 1

To download the JavaScript file, your request must include the correct "Referer" header, which here is the URL of the product page that embeds the script.

Instead of using Selenium, a more lightweight option is to fetch it using Python requests:

import requests
import re
from bs4 import BeautifulSoup

# Use a session so headers and cookies persist across requests
session = requests.Session()
# Set browser-like headers
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36',
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language':'en-US,en;q=0.8,es;q=0.6'
})
# Visit product page and parse it
product_page = 'http://mypricetrack.com/amazon/B00N2BW2PK'
res = session.get(product_page)
soup = BeautifulSoup(res.text, 'html.parser')
# Find the script tag whose src points at the data URL
link = soup.find('script', {'src':re.compile('http://includes.mpt-static.com/data')})
link_src = link['src']
# Get JavaScript content, sending the product page as the Referer
js_content = session.get(link_src, headers={'Referer':product_page}).text
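
For anyone who wants to check the two steps from this answer without hitting the site, here is a minimal offline sketch using only the Python 3 standard library. It is not the answer's exact method: html.parser stands in for BeautifulSoup, and the request is built but never sent. The names ScriptSrcCollector, find_data_script, build_data_request and the sample HTML are made up for illustration; only the script URL prefix and the product page URL come from the post.

```python
from html.parser import HTMLParser
from urllib.request import Request

# Prefix taken from the <script> tag shown in the question.
DATA_PREFIX = "http://includes.mpt-static.com/data"


class ScriptSrcCollector(HTMLParser):
    """Collect the src of every <script> tag whose URL starts with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src and src.startswith(self.prefix):
            self.sources.append(src)


def find_data_script(html, prefix=DATA_PREFIX):
    """Return the first matching script URL, or None if there is none."""
    collector = ScriptSrcCollector(prefix)
    collector.feed(html)
    return collector.sources[0] if collector.sources else None


def build_data_request(script_url, product_page):
    """Build (but do not send) a request that carries the Referer header."""
    return Request(script_url, headers={"Referer": product_page})


# Hypothetical page fragment, modelled on the tag in the question.
sample_html = (
    '<html><head>'
    '<script src="http://example.com/other.js"></script>'
    '<script src="http://includes.mpt-static.com/data/7CE5047496" '
    'type="text/javascript" charset="utf-8"></script>'
    '</head></html>'
)
product_page = "http://mypricetrack.com/amazon/B00N2BW2PK"

script_url = find_data_script(sample_html)
request = build_data_request(script_url, product_page)
```

Passing the request object to urllib.request.urlopen would perform the actual download. Whether the server then returns the data depends on the site still behaving as described in the question.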
