Looking for guidance on web scraping with Python. I've successfully pulled data from a job portal site using Selenium and BeautifulSoup, following these steps:
- Scrape the links of job postings on the site
- Retrieve detailed information from each job posting link by looping through them
My issue arises in the second step. I call BeautifulSoup's find_all method on script tags that have type='application/ld+json' and the data-react-helmet attribute, then index into the result, and I get the error 'IndexError: list index out of range'. Any suggestions on how to troubleshoot this?
https://i.sstatic.net/mXmIJ.png
import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

job_main_data = pd.DataFrame()
for i, url in enumerate(URL_job_list):
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
        'referrer': 'https://google.com',
        'Accept':
            'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Pragma': 'no-cache',
    }
    response = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    script_tags = soup.find_all('script', attrs={'data-react helmet':'true','type':'application/ld+json'})
    metadata = script_tags[-1].text
    temp_dict = {}
    try:
        job_info_json = json.loads(metadata, strict=False)
        # A missing key in a dict raises KeyError, not AttributeError
        try:
            jobID = job_info_json['identifier']['value']
            temp_dict['Job ID'] = jobID
            print('Job ID = ' + jobID)
        except (KeyError, TypeError):
            jobID = ''
        try:
            jobTitle = job_info_json['title']
            temp_dict['Job Title'] = jobTitle
            print('Title = ' + jobTitle)
        except (KeyError, TypeError):
            jobTitle = ''
        try:
            occupationalCategory = job_info_json['occupationalCategory']
            temp_dict['occupationalCategory'] = occupationalCategory
            print('Occupational Category = ' + occupationalCategory)
        except (KeyError, TypeError):
            occupationalCategory = ''
        # Store the URL of this posting, not the whole list of URLs
        temp_dict['Job Link'] = url
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        job_main_data = pd.concat([job_main_data, pd.DataFrame([temp_dict])], ignore_index=True)
    except json.JSONDecodeError:
        print("Empty response")
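
For troubleshooting, here is a minimal sketch of how the lookup could be guarded so that an empty find_all result does not raise. The only list indexing in the loop is script_tags[-1], so the error suggests that find_all matched nothing for that page. The helper names pick_job_posting and job_fields are hypothetical, and the check for '@type' == 'JobPosting' is an assumption about how the site labels its JSON-LD, not something confirmed above.

```python
import json
from typing import List, Optional


def pick_job_posting(script_texts: List[str]) -> Optional[dict]:
    """Return the first JSON-LD object whose @type is JobPosting, else None.

    `script_texts` holds the text of each matched <script> tag. An empty
    list (no tag matched) yields None instead of raising IndexError.
    """
    for text in script_texts:
        try:
            data = json.loads(text, strict=False)
        except json.JSONDecodeError:
            continue  # skip tags whose content is not valid JSON
        # JSON-LD can be a single object or a list of objects
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            # Assumption: the job data is typed as a schema.org JobPosting
            if isinstance(item, dict) and item.get('@type') == 'JobPosting':
                return item
    return None


def job_fields(posting: dict) -> dict:
    """Pull the three fields used in the question, with '' for missing ones."""
    identifier = posting.get('identifier')
    job_id = identifier.get('value', '') if isinstance(identifier, dict) else ''
    return {
        'Job ID': job_id,
        'Job Title': posting.get('title', ''),
        'occupationalCategory': posting.get('occupationalCategory', ''),
    }
```

Inside the loop this might be fed with `[tag.string or '' for tag in soup.find_all('script', type='application/ld+json')]`, and when the result is None it would help to print the URL, `response.status_code`, and the number of script tags found before moving on with `continue`. Two things seem worth checking first: the filter in the code is written as 'data-react helmet' (with a space) while the attribute is described as data-react-helmet, and requests does not execute JavaScript, so tags that are only injected in the browser may be missing from `response.text` even though they appear in Selenium.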