Guide to parsing a Web page with Python (Beautiful Soup?) that has been rendered using Javascript

Question

Hey everyone, I'm currently using BS4 to parse a webpage. However, the block that BS4 returns is JavaScript source held as a string, so the URLs I'm trying to extract never show up as ordinary HTML attributes.

I have identified the part I need to extract in BS4:

              var vd1="\x3c\x73\x6f\x75\x72\x63\x65\x20\x73\x72\x63\x3d\x27";
              var vd2="\x27\x20\x74\x79\x70\x65\x3d\x27\x76\x69\x64\x65\x6f\x2f\x6d\x70\x34\x27\x3e";

              var luu=pkl("uggc://navzrurnira.rh/tvs.cuc?vcqrgrpgrq");

                    // Other encrypted variables...

              document.write("<video "+" class='vid'  id='videodiv' width='100%' autoplay='autoplay' preload='none'>"+ vd1 +soienfu+ vd2 + vd1+iusfdb+ vd2 + vd1+ufbjhse+ vd2 +"Your browser does not support the video tag.</video> ");

But when I view this on the website's HTML, all I see is:

Your browser does not support the video tag.

My goal is to retrieve the video URL from this HTML block.

This is the code snippet I am using to achieve this:

import requests,bs4,re,sys,os
url="http://animeheaven.eu/watch.php?a=Fairy%20Tail&e=55"
mainsite="http://animeheaven.eu/"
r2=requests.get(url)
r2.raise_for_status()
soup2=bs4.BeautifulSoup(r2.text,"html.parser")
dlink=soup2.select("script")

However, I am facing issues parsing 'dlink' for the URL due to the JavaScript content. As I'm new to web scraping and not very familiar with JS, I'm struggling to figure out a solution.

# would extract standard url
mylink=re.compile(r"href='(.*)'")
downlink=mylink.search(str(dlink[3]))[1]

javascript python-3.x selenium web-scraping beautifulsoup

Answer 1

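When dealing with a JavaScript-rendered webpage, the markup you want is not in the HTML that requests downloads; it only exists after a browser runs the script. Here the script is also obfuscated (hex-escaped strings and encoded variables), which makes it unreadable to humans, and Beautiful Soup cannot help because it parses markup and never executes JavaScript.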
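To make "obfuscated" concrete, here is a minimal sketch that decodes the three strings visible in the question using plain Python. The `\xNN` sequences are ordinary JavaScript hex escapes. Reading `pkl` as ROT13 is only an inference from the one sample argument shown (it decodes to a URL on the same host); the question does not confirm it, and `decode_js_hex` is a hypothetical helper name. The remaining variables are elided in the question, so this does not by itself recover the video URLs.

```python
import codecs
import re

# Strings copied verbatim from the question's script block (raw strings keep the backslashes).
vd1_js = r"\x3c\x73\x6f\x75\x72\x63\x65\x20\x73\x72\x63\x3d\x27"
vd2_js = r"\x27\x20\x74\x79\x70\x65\x3d\x27\x76\x69\x64\x65\x6f\x2f\x6d\x70\x34\x27\x3e"
luu_arg = "uggc://navzrurnira.rh/tvs.cuc?vcqrgrpgrq"


def decode_js_hex(text):
    # Replace each JavaScript \xNN escape with the character it stands for.
    return re.sub(r"\\x([0-9a-fA-F]{2})", lambda m: chr(int(m.group(1), 16)), text)


vd1 = decode_js_hex(vd1_js)
vd2 = decode_js_hex(vd2_js)
# Assumption: pkl() behaves like ROT13 for this sample input.
luu = codecs.decode(luu_arg, "rot13")

print(vd1)  # <source src='
print(vd2)  # ' type='video/mp4'>
print(luu)  # http://animeheaven.eu/gif.php?ipdetected
```

So `vd1` and `vd2` are just the opening and closing halves of a `<source>` tag wrapped around each encoded URL variable. Decoding by hand like this is brittle, which is why letting a browser run the script is the more general route.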

Selenium provides a solution for executing the necessary javascript to render the site so that we can then access and manipulate the content.

Interested in learning how to use selenium? Here are the steps:

1. Obtain selenium by using the command pip install selenium

2. Next, you'll need to install a driver. (Personally, I recommend using the chrome driver)

3. Utilize the 'inspect element' function on your preferred browser to identify elements such as the video source. In this example, the video is identified by an id attribute with the value videodiv.

<video id="videodiv" width="100%" height="100%" style="display: block; cursor: none;" autoplay="autoplay" preload="none">
  <source src="http://s5vkxea.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s4tyh.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s3sd.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">Your browser does not support the video tag.</video>

https://i.sstatic.net/EyF2m.png

4. With the discovered id and tag from step 3, you can now write Python code to retrieve the source:

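from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Raw string: otherwise "\U" in the Windows path is read as a Unicode escape (a SyntaxError in Python 3)
browser = webdriver.Chrome(service=Service(r"C:\Users\yourname\Desktop\chromedriver.exe"))
url = "http://animeheaven.eu/watch.php?a=Fairy%20Tail&e=55"
browser.get(url)
# Selenium 4 API; the older find_element_by_* helpers have been removed
viddiv = browser.find_element(By.ID, "videodiv")
source = viddiv.find_element(By.TAG_NAME, "source")
source.get_attribute("src")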

Output:

'http://s5vkxea.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130'
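The snippet above returns only the first `<source>`. Once the browser has rendered the page, the resulting HTML is available as a string (in Selenium, `browser.page_source`) and can be parsed offline to collect every mirror URL. The sketch below is one way to do that, under some assumptions: it uses the standard library's `html.parser` rather than Beautiful Soup so that it runs without extra packages, `SourceCollector` and `extract_sources` are illustrative names, and the sample markup is copied from step 3 instead of being fetched live.

```python
from html.parser import HTMLParser


class SourceCollector(HTMLParser):
    """Collect src attributes of <source> tags inside <video id="videodiv">."""

    def __init__(self):
        super().__init__()
        self.in_video = False
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "video" and attrs.get("id") == "videodiv":
            self.in_video = True
        elif tag == "source" and self.in_video and attrs.get("src"):
            self.urls.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "video":
            self.in_video = False


def extract_sources(html):
    # Feed the rendered HTML through the collector and return the URLs found.
    parser = SourceCollector()
    parser.feed(html)
    return parser.urls


# Stand-in for browser.page_source: the rendered markup shown in step 3.
rendered_html = """
<video id="videodiv" width="100%" height="100%" autoplay="autoplay" preload="none">
  <source src="http://s5vkxea.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s4tyh.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s3sd.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
Your browser does not support the video tag.</video>
"""

urls = extract_sources(rendered_html)
for u in urls:
    print(u)
```

With Beautiful Soup installed, the same selection could be written as `bs4.BeautifulSoup(browser.page_source, "html.parser").select("video#videodiv source")`, reading `src` from each result.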
