Scraping script tags with type "application/ld+json" and the "data-react-helmet" attribute using BeautifulSoup

Looking for guidance on web scraping with Python. I've successfully pulled data from a job portal site using Selenium and BeautifulSoup, following these steps:

  1. Scrape the links of job postings on the site
  2. Retrieve detailed information from each job posting link by looping through them

The problem is in step 2. I call BeautifulSoup's find_all on script tags that have type='application/ld+json' and the data-react-helmet attribute, then index into the result, and I get the error 'list index out of range'. Any suggestions on troubleshooting this?

https://i.sstatic.net/mXmIJ.png

import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

# URL_job_list holds the job posting links collected in step 1.
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Referer': 'https://google.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Pragma': 'no-cache',
}

job_main_data = pd.DataFrame()
for i, url in enumerate(URL_job_list):
    response = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    script_tags = soup.find_all('script', attrs={'data-react-helmet': 'true', 'type': 'application/ld+json'})
    # Raises IndexError ('list index out of range') if find_all returned an empty list.
    metadata = script_tags[-1].text

    temp_dict = {}

    try:
        job_info_json = json.loads(metadata, strict=False)
        try:
            jobID = job_info_json['identifier']['value']
            temp_dict['Job ID'] = jobID
            print('Job ID = ' + jobID)
        except (KeyError, TypeError):  # a missing key raises KeyError, not AttributeError
            jobID = ''

        try:
            jobTitle = job_info_json['title']
            temp_dict['Job Title'] = jobTitle
            print('Title = ' + jobTitle)
        except (KeyError, TypeError):
            jobTitle = ''

        try:
            occupationalCategory = job_info_json['occupationalCategory']
            temp_dict['occupationalCategory'] = occupationalCategory
            print('Occupational Category = ' + occupationalCategory)
        except (KeyError, TypeError):
            occupationalCategory = ''

        temp_dict['Job Link'] = url  # the current link, not the whole URL_job_list

        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
        job_main_data = pd.concat([job_main_data, pd.DataFrame([temp_dict])], ignore_index=True)

    except json.JSONDecodeError:
        print("Empty response")

Answer №1

The page loads its data dynamically with JavaScript through API calls, so the script tags you are searching for may not be present in the HTML that requests receives. The data can be extracted in several ways. The example below pulls it straight from the API using only the requests module.

import requests
import json

payload={
   "requests":[
      {
         "indexName":"job_postings",
         "params":"query=&hitsPerPage=20&maxValuesPerFacet=1000&page=0&facets=%5B%22*%22%2C%22city.work_country_name%22%2C%22position.name%22%2C%22industries.vertical_name%22%2C%22experience%22%2C%22job_type.name%22%2C%22is_salary_visible%22%2C%22has_equity%22%2C%22currency.currency_code%22%2C%22salary_min%22%2C%22taxonomies.slug%22%5D&tagFilters=&facetFilters=%5B%5B%22city.work_country_name%3AIndonesia%22%5D%5D"
      },
      {
         "indexName":"job_postings",
         "params":"query=&hitsPerPage=1&maxValuesPerFacet=1000&page=0&attributesToRetrieve=%5B%5D&attributesToHighlight=%5B%5D&attributesToSnippet=%5B%5D&tagFilters=&analytics=false&clickAnalytics=false&facets=city.work_country_name"
      }
   ]
}
headers={'content-type': 'application/x-www-form-urlencoded'}
api_url = "https://219wx3mpv4-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0%3BJS%20Helper%202.26.1&x-algolia-application-id=219WX3MPV4&x-algolia-api-key=b528008a75dc1c4402bfe0d8db8b3f8e"

jsonData = requests.post(api_url, data=json.dumps(payload), headers=headers).json()
# print(jsonData)

for item in jsonData['results'][0]['hits']:
    title = item['_highlightResult']['title']['value']
    company = item['_highlightResult']['company']['name']['value']
    skill = item['_highlightResult']['job_skills'][0]['name']['value']
    salary_max = item['salary_max']
    salary_min = item['salary_min']

    print(title)
    print(company)
    print(skill)
    print(salary_max)
    print(salary_min)

Output:

Corporate PR
Rocketindo
Sales Strategy & Management
12000000
7000000
Social Media Specialist
Rocketindo
Content Marketing
12000000
7000000
Performance Marketing Analyst (Mama's Choice)
The Parent Inc (theAsianparent)
Marketing Strategy
12000000
5000000
Business Development (Associate Consultant) - CRM
Mekari (PT. Mid Solusi Nusantara)
Business Development & Partnerships
7000000
5000000
Account Payable
Ritase
Corporate Finance
0
0
Data Engineer
Topremit
Databases
0
0
Public Relation KOL
Rocketindo
Business Development & Partnerships
7000000
5000000
Graphic Designer
Rocketindo
Adobe Illustrator
12000000
7000000
Yogyakarta City Coordinator
Deliveree Indonesia
Business Operations
6000000
5250000
Marketing Manager
Deliveree Indonesia
Marketing Strategy
0
0
Graphic Designer
Deliveree Indonesia
Graphic Design
6000000
5250000
Quality Assurance
PT Rekeningku Dotcom Indonesia
Javascript
10000000
4500000
Internship Program
TADA
Attention to Detail
3700000
3000000
Product Management Support
Hangry
Data Warehouse
0
0
Content Writer
Bobobox Indonesia
Copywriting
0
0
UX Researcher
Bobobox Indonesia
UI/UX Design
0
0
UX Copywriter
Bobobox Indonesia
Problem Solving
0
0
Internship HR (Recruitment)
PT Formasi Agung Selaras (Famous Allstars)
Human Resources
1500000
1000000
Fullstack Developer - Banking Industry
SIGMATECH
React.js
12000000
8000000
REACT NATIVE DEVELOPER
BGT Solution
MySQL
16000000
6000000
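
One caveat about the loop above: it indexes job_skills[0] directly, so if a posting ever came back with an empty job_skills list, the loop itself would fail with 'list index out of range'. The following is a minimal sketch of a more tolerant lookup, not part of the original answer. The helper name dig and the sample hit dictionary are hypothetical; they only mirror the field names used in the code above.

```python
def dig(data, *path, default=''):
    """Follow keys and list indexes in `path`; return `default` if any step is missing."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hypothetical hit shaped like the entries in jsonData['results'][0]['hits'].
hit = {
    '_highlightResult': {
        'title': {'value': 'Data Engineer'},
        'company': {'name': {'value': 'Topremit'}},
        'job_skills': [],  # assumed case: no skills listed for this posting
    },
    'salary_max': 0,
    'salary_min': 0,
}

title = dig(hit, '_highlightResult', 'title', 'value')
skill = dig(hit, '_highlightResult', 'job_skills', 0, 'name', 'value')
print(repr(title), repr(skill))
```

Inside the loop, each chained lookup could be swapped for a dig(item, ...) call so that one incomplete posting does not stop the whole run.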

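As for the original error: find_all returns an empty list when the HTML that requests receives contains no matching script tag, and script_tags[-1] then raises the IndexError. Below is a sketch of a guard, assuming the tag contents have already been collected into a list of strings, for example [tag.text for tag in script_tags]. The function name last_json_ld and the sample JSON are hypothetical.

```python
import json


def last_json_ld(script_texts):
    """Parse the last JSON-LD block in `script_texts`, or return None if there is none."""
    if not script_texts:  # nothing matched, e.g. the tags are only added client-side
        return None
    try:
        return json.loads(script_texts[-1], strict=False)
    except json.JSONDecodeError:
        return None


# Hypothetical tag contents; with BeautifulSoup they would come from
# [tag.text for tag in soup.find_all('script', attrs={...})].
found = ['{"title": "Data Engineer", "identifier": {"value": "123"}}']
missing = []

job = last_json_ld(found)
print(job['title'] if job else 'no JSON-LD found')
print(last_json_ld(missing))
```

A None result is a signal to skip that URL, or to fetch the data another way (such as the API call shown above), instead of letting the loop crash.
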