Extracting data from a website using R

I'm trying to extract information such as the names, cities, states, and email addresses of professionals from the website using rvest. However, I'm having trouble identifying the CSS selectors with SelectorGadget, and the email addresses appear to be protected with JavaScript.

I have searched various forums but haven't come across a similar issue.

Answer №1

This snippet uses the RSelenium and seleniumPipes packages together with rvest. As posted, the code starts a Chrome session through rsDriver(); if you would rather use phantomjs, download it, unzip it, and place the .exe file in your R working directory.
The technique relies on a browser driven from R that behaves like a real user, so values generated by JavaScript are present in the page source that rvest parses.

library(rvest)
library(RSelenium)
library(seleniumPipes)
#start a Selenium server with the rsDriver() utility function
rD <- rsDriver(browser = 'chrome', chromever = "latest", port = 4444L)
#open a browser session on the same port
remDr <- remoteDr(browserName = "chrome", port = 4444L)

main_page_url <- "http://www.napo.net/search/newsearch.asp"
#go to home page
remDr %>% go(main_page_url)
#switch to iframe
remDr %>% switchToFrame(Id = "SearchResultsFrame")
#get all relative path
relative_path <- remDr %>% getPageSource() %>% html_nodes(".lineitem a[href]") %>% html_attr("href")
#all individual urls:
full_paths <- paste0("http://www.napo.net", relative_path)
#scrape email from each page
email_address <- list()
#Retrieve email address from the first three results (or fewer, if fewer were found)
for(i in seq_len(min(3, length(full_paths)))){
    remDr %>% go(full_paths[i])
    #text of every mailto link on the current page
    page_email <- remDr %>% getPageSource() %>% html_nodes('a[href^="mailto"]') %>% html_text()
    email_address <- c(email_address, list(email = page_email))
    Sys.sleep(3)
}
#display result
email_address[1]
$email
[1] "[email protected]"

The above pertains to page one; to navigate to page two:

remDr %>% go(main_page_url)
remDr %>% switchToFrame(Id = "SearchResultsFrame")
#click on page two within the iframe to proceed to page 2:
remDr %>% findElement(using = "css selector", value = ".DotNetPager a:nth-child(2)") %>% elementClick()
#get relative and full paths again
relative_path <- remDr %>% getPageSource() %>% html_nodes(".lineitem a[href]") %>% html_attr("href")
full_paths <- paste0("http://www.napo.net", relative_path)
#Repeat the for loop, appending to the same email_address list
for(i in seq_len(min(3, length(full_paths)))){
    remDr %>% go(full_paths[i])
    #text of every mailto link on the current page
    page_email <- remDr %>% getPageSource() %>% html_nodes('a[href^="mailto"]') %>% html_text()
    email_address <- c(email_address, list(email = page_email))
    Sys.sleep(3)
}
#display the sixth result
email_address[6]
$email
[1] "[email protected]"

#You can also use a loop to scrape all pages
#-----
#delete session and close server
remDr %>% deleteSession()
rD[["server"]]$stop()
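The per-page step in the loops above can be sketched as a reusable function. This is a minimal sketch rather than the answer's own method: `extract_emails()` and `scrape_emails()` are hypothetical helper names, and the addresses are read from the `href` attribute instead of the link text, on the assumption that the link text is not always the address itself. The sample HTML is made up so the extraction step can be checked without a browser.

```r
library(rvest)

# Pull addresses out of the href of every mailto link on a parsed page.
extract_emails <- function(page) {
  hrefs <- page %>% html_nodes('a[href^="mailto"]') %>% html_attr("href")
  # Drop the "mailto:" scheme and any "?subject=..." query string.
  sub("\\?.*$", "", sub("^mailto:", "", hrefs))
}

# Hypothetical helper: visit each URL with an already open seleniumPipes
# session (remDr) and collect the addresses found on each page.
scrape_emails <- function(remDr, urls, delay = 3) {
  lapply(urls, function(u) {
    seleniumPipes::go(remDr, u)
    page <- seleniumPipes::getPageSource(remDr)
    Sys.sleep(delay)  # pause between requests, as in the loops above
    extract_emails(page)
  })
}

# Offline check on a made-up HTML fragment (not taken from the site).
sample_page <- read_html('<p><a href="mailto:jane@example.com?subject=Hi">Email Jane</a></p>')
extract_emails(sample_page)
```

With a live session, `scrape_emails(remDr, full_paths)` would return one character vector per member page; call it before deleting the session and stopping the server.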

Answer №2

I would do this in two steps.

First, get the link to the embedded search results page:

require(rvest)
require(magrittr)
yourlink <- "http://www.napo.net/search/newsearch.asp"
linktoresult <- yourlink %>% read_html() %>%
                html_nodes("iframe") %>% extract(1) %>%
                html_attr("src")

# /searchserver/people.aspx?id=FE0436D0-08ED-4763-8588-09112794521D&cdbid=&canconnect=0&canmessage=0&map=True&toggle=False&hhSearchTerms=

Second, scrape the actual search results page:

pagelink <- paste0("http://www.napo.net", linktoresult)
# "http://www.napo.net/searchserver/people.aspx?id=FE0436D0-08ED-4763-8588-09112794521D&cdbid=&canconnect=0&canmessage=0&map=True&toggle=False&hhSearchTerms="

yourresult <- pagelink %>% read_html() %>%
              html_nodes("#SearchResultsGrid>.lineitem") %>%
              html_nodes("a") %>% 
              html_attr("href")
#/members/?id=42241027
#NA
#/members/?id=46636113
#/members/?id=37474237
#/members/?id=39530420
#...
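The result above contains an NA where an anchor has no href, and `paste0()` would turn that into the string "http://www.napo.netNA". A possible follow-up, assuming the goal is a clean list of member page URLs, is to drop the missing values first. The vector below is a short sample of the output shown above, and `member_paths` and `member_urls` are illustrative names.

```r
# Sample of the values shown above; NA marks an <a> without an href.
yourresult <- c("/members/?id=42241027", NA, "/members/?id=46636113")

# Keep only real, distinct paths, then prepend the site root.
member_paths <- unique(yourresult[!is.na(yourresult)])
member_urls <- paste0("http://www.napo.net", member_paths)
member_urls
```

Each entry of `member_urls` can then be read in turn to collect the details for that member.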
