Tips for scraping websites using BeautifulSoup with the attribute "application/ld+json" and "data-react-helmet"

Question

I'm looking for guidance on web scraping with Python. I've successfully pulled data from a job portal site using Selenium and BeautifulSoup in two steps:

1. Scrape the links to the job postings on the site.
2. Loop through those links and retrieve the detailed information from each job posting.

The problem comes in the second step, when I call BeautifulSoup's find_all on script tags with type='application/ld+json' and the data-react-helmet attribute. I get the error 'list index out of range'. Any suggestions on troubleshooting this?

https://i.sstatic.net/mXmIJ.png

import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

# The headers are identical for every request, so build them once.
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'referer': 'https://google.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Pragma': 'no-cache',
}

rows = []
for url in URL_job_list:  # the links scraped in the first step
    response = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    script_tags = soup.find_all('script', attrs={'data-react-helmet': 'true', 'type': 'application/ld+json'})
    metadata = script_tags[-1].text  # 'list index out of range' is raised here

    temp_dict = {'Job Link': url}

    try:
        job_info_json = json.loads(metadata, strict=False)
    except json.JSONDecodeError:
        print('Empty response')
        continue

    # A missing key raises KeyError (not AttributeError), so catch that instead.
    try:
        temp_dict['Job ID'] = job_info_json['identifier']['value']
        print('Job ID = ' + str(temp_dict['Job ID']))
    except (KeyError, TypeError):
        pass

    try:
        temp_dict['Job Title'] = job_info_json['title']
        print('Title = ' + str(temp_dict['Job Title']))
    except (KeyError, TypeError):
        pass

    try:
        temp_dict['occupationalCategory'] = job_info_json['occupationalCategory']
        print('Occupational Category = ' + str(temp_dict['occupationalCategory']))
    except (KeyError, TypeError):
        pass

    rows.append(temp_dict)

# DataFrame.append was removed in pandas 2.0; build the frame from the collected rows.
job_main_data = pd.DataFrame(rows)

javascript python-3.x selenium web-scraping beautifulsoup

Answer 1

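The site loads its data dynamically with JavaScript through API calls, so the script tags may not be present in the HTML that requests receives, in which case find_all returns an empty list and indexing it fails. The data can be extracted in several ways; the example below pulls it straight from the API using only the requests module.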

import requests
import json

payload={
   "requests":[
      {
         "indexName":"job_postings",
         "params":"query=&hitsPerPage=20&maxValuesPerFacet=1000&page=0&facets=%5B%22*%22%2C%22city.work_country_name%22%2C%22position.name%22%2C%22industries.vertical_name%22%2C%22experience%22%2C%22job_type.name%22%2C%22is_salary_visible%22%2C%22has_equity%22%2C%22currency.currency_code%22%2C%22salary_min%22%2C%22taxonomies.slug%22%5D&tagFilters=&facetFilters=%5B%5B%22city.work_country_name%3AIndonesia%22%5D%5D"
      },
      {
         "indexName":"job_postings",
         "params":"query=&hitsPerPage=1&maxValuesPerFacet=1000&page=0&attributesToRetrieve=%5B%5D&attributesToHighlight=%5B%5D&attributesToSnippet=%5B%5D&tagFilters=&analytics=false&clickAnalytics=false&facets=city.work_country_name"
      }
   ]
}
headers={'content-type': 'application/x-www-form-urlencoded'}
api_url = "https://219wx3mpv4-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0%3BJS%20Helper%202.26.1&x-algolia-application-id=219WX3MPV4&x-algolia-api-key=b528008a75dc1c4402bfe0d8db8b3f8e"

# Send the payload as a JSON string in the POST body.
jsonData = requests.post(api_url, data=json.dumps(payload), headers=headers).json()
# print(jsonData)

# The first entry in 'results' holds the job postings ("hits").
for item in jsonData['results'][0]['hits']:
    title = item['_highlightResult']['title']['value']
    company = item['_highlightResult']['company']['name']['value']
    skill = item['_highlightResult']['job_skills'][0]['name']['value']
    salary_max = item['salary_max']
    salary_min = item['salary_min']

    print(title)
    print(company)
    print(skill)
    print(salary_max)
    print(salary_min)

Output:

Corporate PR
Rocketindo
Sales Strategy & Management
12000000
7000000
Social Media Specialist
Rocketindo
Content Marketing
12000000
7000000
Performance Marketing Analyst (Mama's Choice)
The Parent Inc (theAsianparent)
Marketing Strategy
12000000
5000000
Business Development (Associate Consultant) - CRM
Mekari (PT. Mid Solusi Nusantara)
Business Development & Partnerships
7000000
5000000
Account Payable
Ritase
Corporate Finance
0
0
Data Engineer
Topremit
Databases
0
0
Public Relation KOL
Rocketindo
Business Development & Partnerships
7000000
5000000
Graphic Designer
Rocketindo
Adobe Illustrator
12000000
7000000
Yogyakarta City Coordinator
Deliveree Indonesia
Business Operations
6000000
5250000
Marketing Manager
Deliveree Indonesia
Marketing Strategy
0
0
Graphic Designer
Deliveree Indonesia
Graphic Design
6000000
5250000
Quality Assurance
PT Rekeningku Dotcom Indonesia
Javascript
10000000
4500000
Internship Program
TADA
Attention to Detail
3700000
3000000
Product Management Support
Hangry
Data Warehouse
0
0
Content Writer
Bobobox Indonesia
Copywriting
0
0
UX Researcher
Bobobox Indonesia
UI/UX Design
0
0
UX Copywriter
Bobobox Indonesia
Problem Solving
0
0
Internship HR (Recruitment)
PT Formasi Agung Selaras (Famous Allstars)
Human Resources
1500000
1000000
Fullstack Developer - Banking Industry
SIGMATECH
React.js
12000000
8000000
REACT NATIVE DEVELOPER
BGT Solution
MySQL
16000000
6000000
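
If a page does ship the ld+json block in its static HTML, the lookup from the question can be guarded so that a missing tag no longer raises 'list index out of range'. The following is only a minimal sketch: SAMPLE_HTML and extract_job_posting are hypothetical names made up for illustration, the sample markup is invented, and the built-in html.parser is used instead of lxml to keep the example dependency-light.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical sample page; a real job page may differ or lack this tag entirely.
SAMPLE_HTML = """
<html><head>
<script data-react-helmet="true" type="application/ld+json">
{"@type": "JobPosting", "title": "Data Engineer", "identifier": {"value": "12345"}}
</script>
</head><body></body></html>
"""


def extract_job_posting(html):
    """Return the last matching ld+json block as a dict, or None if absent or invalid."""
    soup = BeautifulSoup(html, "html.parser")
    script_tags = soup.find_all(
        "script",
        attrs={"data-react-helmet": "true", "type": "application/ld+json"},
    )
    if not script_tags:
        # Nothing matched: the tag is missing or is injected later by JavaScript.
        return None
    raw = script_tags[-1].string
    if not raw:
        return None
    try:
        return json.loads(raw, strict=False)
    except json.JSONDecodeError:
        return None


job = extract_job_posting(SAMPLE_HTML)
print(job["title"] if job else "no ld+json tag found")
print(extract_job_posting("<html><head></head></html>"))
```

Returning None instead of indexing an empty list lets the calling loop skip pages without the tag and carry on with the next URL.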

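To fetch more than the first 20 hits, the params string in the payload above could be generated per page rather than hard-coded. This is a rough sketch only: build_payload is a hypothetical helper, it keeps just a subset of the original parameters (query, hitsPerPage, page, facetFilters), and it assumes the API accepts that trimmed set with page counted from zero as in the payload above. It has not been run against the live endpoint.

```python
import json
from urllib.parse import urlencode


def build_payload(page, hits_per_page=20, country="Indonesia"):
    """Build a request body like the one above for a given zero-based page (hypothetical helper)."""
    params = urlencode({
        "query": "",
        "hitsPerPage": hits_per_page,
        "page": page,
        # Same filter as in the original payload, URL-encoded by urlencode.
        "facetFilters": json.dumps([["city.work_country_name:" + country]]),
    })
    return {"requests": [{"indexName": "job_postings", "params": params}]}


for page in range(3):
    body = json.dumps(build_payload(page))
    # requests.post(api_url, data=body, headers=headers) would be called here.
    print(build_payload(page)["requests"][0]["params"])
```

Each generated body can be posted exactly like the original payload, and the loop can stop once a page comes back with no hits.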