Extracting data from a website using R

Question

I'm trying to extract information such as the names, cities, states, and emails of professionals from the website using rvest. However, I'm having trouble identifying the CSS selectors with SelectorGadget, and the email addresses appear to be protected with JavaScript.

I have searched various forums but haven't come across a similar issue.

javascript r web-scraping css-selectors

Answer №1

This snippet uses the RSelenium and seleniumPipes packages. As written, it starts a Selenium server with Chrome through rsDriver(); to use PhantomJS instead, download it, unzip it, and place the .exe file in your R working directory.
The technique drives a browser (optionally a headless one such as PhantomJS) that mimics user actions, so it can read values generated by JavaScript.

library(rvest)
library(RSelenium) # start a server with utility function
library(seleniumPipes)
rD <- rsDriver(browser = 'chrome', chromever = "latest", port = 4444L)
#open browser
remDr <- remoteDr(browserName = "chrome")

main_page_url <- "http://www.napo.net/search/newsearch.asp"
#go to home page
remDr %>% go(main_page_url)
#switch to iframe
remDr %>% switchToFrame(Id = "SearchResultsFrame")
#get the relative paths of all member pages
relative_path <- remDr %>% getPageSource() %>% html_nodes(".lineitem a[href]") %>% html_attr("href")
#build the full URL of each member page
full_paths <- paste0("http://www.napo.net", relative_path)
#scrape the email address from each member page
email_address <- list()
#retrieve the email address from the first three results
for(i in seq_along(head(full_paths, 3))){
    remDr %>% go(full_paths[i])
    email <- remDr %>% getPageSource() %>% html_nodes('a[href^="mailto"]') %>% html_text()
    email_address <- c(email_address, list(email = email))
    Sys.sleep(3) #pause between requests to avoid overloading the server
}
#display the first result
email_address[1]
# $email
# [1] "[email protected]"

The code above handles page one of the results. To move on to page two:

remDr %>% go(main_page_url)
remDr %>% switchToFrame(Id = "SearchResultsFrame")
#click on page two within the iframe to proceed to page 2:
remDr %>% findElement(using = "css selector", value = ".DotNetPager a:nth-child(2)") %>% elementClick()
#get the relative and full paths again
relative_path <- remDr %>% getPageSource() %>% html_nodes(".lineitem a[href]") %>% html_attr("href")
full_paths <- paste0("http://www.napo.net", relative_path)
#repeat the for loop for the first three results on page two
for(i in seq_along(head(full_paths, 3))){
    remDr %>% go(full_paths[i])
    email <- remDr %>% getPageSource() %>% html_nodes('a[href^="mailto"]') %>% html_text()
    email_address <- c(email_address, list(email = email))
    Sys.sleep(3) #pause between requests to avoid overloading the server
}
#display the sixth result (the third one from page two)
email_address[6]
# $email
# [1] "[email protected]"

#display everything collected so far
email_address
#You can also use a loop to scrape all pages
#-----
#delete session and close server
remDr %>% deleteSession()
rD[["server"]]$stop()
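
The loops above read the visible link text of each mailto anchor. As a minimal sketch of an alternative, the address can be taken from the href attribute instead, which still works when the link text is a label such as "Email me". The helper name extract_emails and the sample HTML are made up for illustration and are not part of the original answer; the inline HTML stands in for a real member page.

```r
library(rvest)

# Hypothetical helper: pull addresses out of the mailto links of a parsed page.
extract_emails <- function(page) {
  hrefs <- page %>%
    html_nodes('a[href^="mailto:"]') %>%
    html_attr("href")
  # Strip the "mailto:" scheme and any "?subject=..." query string
  emails <- sub("\\?.*$", "", sub("^mailto:", "", hrefs))
  unique(emails)
}

# Made-up HTML standing in for one member page
sample_page <- read_html(
  '<html><body>
     <a href="mailto:jane@example.com?subject=Hello">Email Jane</a>
     <a href="/members/?id=1">Profile</a>
   </body></html>'
)

emails <- extract_emails(sample_page)
print(emails)
```

Inside the Selenium loop, the same helper could presumably be applied to the result of getPageSource(), since the code above already pipes that result straight into html_nodes().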

Answer №2

I would do this in two steps.

First, get the link to the embedded search results page:

library(rvest)
library(magrittr)
yourlink <- "http://www.napo.net/search/newsearch.asp"
linktoresult <- yourlink %>% read_html() %>%
                html_nodes("iframe") %>% extract(1) %>%
                html_attr("src")

# /searchserver/people.aspx?id=FE0436D0-08ED-4763-8588-09112794521D&cdbid=&canconnect=0&canmessage=0&map=True&toggle=False&hhSearchTerms=

Second, scrape the data from the actual search results page:

pagelink <- paste0("http://www.napo.net", linktoresult)
# "http://www.napo.net/searchserver/people.aspx?id=FE0436D0-08ED-4763-8588-09112794521D&cdbid=&canconnect=0&canmessage=0&map=True&toggle=False&hhSearchTerms="

yourresult <- pagelink %>% read_html() %>%
              html_nodes("#SearchResultsGrid>.lineitem") %>%
              html_nodes("a") %>% 
              html_attr("href")
#/members/?id=42241027
#NA
#/members/?id=46636113
#/members/?id=37474237
#/members/?id=39530420
#...
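
The NA in the output above appears where html_attr() found an anchor without an href attribute. A minimal sketch of turning these relative paths into full member URLs, dropping the NA first, could look like this. The names member_paths and member_urls are illustrative, and the sample vector simply copies a few values from the output above.

```r
base_url <- "http://www.napo.net"

# Sample hrefs copied from the output above, including the NA entry
hrefs <- c("/members/?id=42241027", NA, "/members/?id=46636113")

# Drop missing hrefs first; paste0() would otherwise turn NA into the text "NA"
member_paths <- hrefs[!is.na(hrefs)]
member_urls <- paste0(base_url, member_paths)
print(member_urls)
```

Filtering before paste0() matters because the concatenated value would no longer be NA, so a later is.na() check would not catch the broken URL.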
