Guide to parsing a Web page with Python (Beautiful Soup?) that has been rendered using Javascript

Hey everyone, I'm currently using BS4 to parse a webpage. However, the block that BS4 returns is JavaScript source held as a string, so the URLs I'm trying to extract never appear in it as plain text.

I have identified the part I need to extract in BS4:

              var vd1="\x3c\x73\x6f\x75\x72\x63\x65\x20\x73\x72\x63\x3d\x27";
              var vd2="\x27\x20\x74\x79\x70\x65\x3d\x27\x76\x69\x64\x65\x6f\x2f\x6d\x70\x34\x27\x3e";

              var luu=pkl("uggc://navzrurnira.rh/tvs.cuc?vcqrgrpgrq");

                    // Other encrypted variables...

              document.write("<video "+" class='vid'  id='videodiv' width='100%' autoplay='autoplay' preload='none'>"+ vd1 +soienfu+ vd2 + vd1+iusfdb+ vd2 + vd1+ufbjhse+ vd2 +"Your browser does not support the video tag.</video> ");

But when I view this on the website's HTML, all I see is:

Your browser does not support the video tag.

My goal is to retrieve the video URL from this HTML block, which looks like this:

This is the code snippet I am using to achieve this:

import requests,bs4,re,sys,os
url="http://animeheaven.eu/watch.php?a=Fairy%20Tail&e=55"
mainsite="http://animeheaven.eu/"
r2=requests.get(url)
r2.raise_for_status()
soup2=bs4.BeautifulSoup(r2.text,"html.parser")
dlink=soup2.select("script")

However, I am facing issues parsing 'dlink' for the URL due to the JavaScript content. As I'm new to web scraping and not very familiar with JS, I'm struggling to figure out a solution.

# would extract standard url
mylink=re.compile(r"href='(.*)'")
downlink=mylink.search(str(dlink[3]))[1]

Answer №1

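On a JavaScript-rendered page like this one, the markup you want does not exist in the HTML the server sends. It is assembled in the browser by a script, and here that script is also obfuscated (escaped strings and encoded variables), which makes it hard for a human to read. Beautiful Soup only parses the HTML it is given; it does not execute scripts, so it never sees the finished video tag.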
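As an aside, the two kinds of strings visible in the question look decodable by hand. The sketch below assumes that the `\xNN` sequences are ordinary JavaScript hex escapes and that `pkl` is a ROT13 function. The second point is only a guess from the single sample shown. The helper names `decode_js_hex` and `rot13` are made up for illustration, and the variables elided in the question (`soienfu` and friends) may be encoded differently, so this is not a general solution.

```python
import codecs
import re


def decode_js_hex(text):
    """Replace each literal \\xNN escape with the character it encodes."""
    return re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


def rot13(text):
    """Apply ROT13 (assumed here to be what the page's pkl() does)."""
    return codecs.decode(text, "rot_13")


# Raw strings keep the backslashes literal, the way they would arrive
# inside the text of a <script> tag pulled out with Beautiful Soup.
vd1 = decode_js_hex(r"\x3c\x73\x6f\x75\x72\x63\x65\x20\x73\x72\x63\x3d\x27")
vd2 = decode_js_hex(
    r"\x27\x20\x74\x79\x70\x65\x3d\x27\x76\x69\x64\x65\x6f\x2f\x6d\x70\x34\x27\x3e"
)
luu = rot13("uggc://navzrurnira.rh/tvs.cuc?vcqrgrpgrq")

print(vd1)  # <source src='
print(vd2)  # ' type='video/mp4'>
print(luu)  # http://animeheaven.eu/gif.php?ipdetected
```

Decoding by hand avoids launching a browser, but it breaks as soon as the site changes its encoding, which is why letting a real browser run the script is usually the more robust route.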

Selenium provides a solution for executing the necessary javascript to render the site so that we can then access and manipulate the content.

Interested in learning how to use selenium? Here are the steps:

1. Obtain selenium by using the command pip install selenium

2. Next, you'll need to install a driver. (Personally, I recommend using the chrome driver)

3. Utilize the 'inspect element' function on your preferred browser to identify elements such as the video source. In this example, the video is identified by an id attribute with the value videodiv.

<video id="videodiv" width="100%" height="100%" style="display: block; cursor: none;" autoplay="autoplay" preload="none">
  <source src="http://s5vkxea.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s4tyh.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s3sd.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">Your browser does not support the video tag.</video>

https://i.sstatic.net/EyF2m.png

4. With the discovered id and tag from step 3, you can now write Python code to retrieve the source:

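from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Raw string: in a normal string, "\U" in the Windows path is a syntax error in Python 3
service = Service(executable_path=r"C:\Users\yourname\Desktop\chromedriver.exe")
browser = webdriver.Chrome(service=service)
url="http://animeheaven.eu/watch.php?a=Fairy%20Tail&e=55"
browser.get(url)
# Selenium 4 removed find_element_by_id / find_element_by_tag_name; use find_element with By
viddiv = browser.find_element(By.ID, 'videodiv')
source = viddiv.find_element(By.TAG_NAME, 'source')
source.get_attribute('src')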

Output:

'http://s5vkxea.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130'
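The call above returns only the first mirror. If all of the source URLs are wanted, one option is to hand the rendered HTML back to Beautiful Soup, which the question already uses. This is a minimal sketch: the function name `video_sources` is hypothetical, and a trimmed copy of the markup from step 3 stands in for the live page so the example runs without a browser.

```python
import bs4


def video_sources(rendered_html):
    """Return the src of every <source> tag under the element with id 'videodiv'."""
    soup = bs4.BeautifulSoup(rendered_html, "html.parser")
    return [
        tag["src"]
        for tag in soup.select("#videodiv source")
        if tag.has_attr("src")
    ]


# Stand-in for the rendered page, trimmed from the markup shown in step 3.
sample = """
<video id="videodiv" autoplay="autoplay" preload="none">
  <source src="http://s5vkxea.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s4tyh.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
</video>
"""

urls = video_sources(sample)
for url in urls:
    print(url)
```

In the Selenium session, the same function would be called as `video_sources(browser.page_source)` once the page has finished rendering.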
