I'm looking for a way to programmatically scrape all available files for a data file series on archives.gov using R. The site appears to rely on JavaScript. My objective is to extract the URL and name of each available file.
The Home Mortgage Disclosure Act data file series consists of 153 entries.
In a browser, clicking the "Export" button gives me a CSV file; the first exported record looks like this (shown as dput() output):
first_exported_record <-
  structure(list(resultType = structure(1L, .Label = "fileUnit", class = "factor"),
    creators.0 = structure(1L, .Label = "Federal Reserve System. Board of Governors. Division of Consumer and Community Affairs. ca. 1981- (Most Recent)", class = "factor"),
    date = structure(1L, .Label = "1981 - 2013", class = "factor"),
    # ... truncated for brevity
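For reference, here is a minimal sketch of reading that export back into R as plain character columns rather than factors. It assumes the browser export was saved locally as export.csv (a hypothetical filename) and that the header names match the dput() output above:

    # Assumes the browser export was saved as "export.csv" (hypothetical filename)
    exported <- read.csv("export.csv", stringsAsFactors = FALSE)
    head(exported[, c("resultType", "creators.0", "date")])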
In addition to these entries, each record points to a file unit page containing several files available for download. For example, the first exported record leads to:
Since both of these pages appear to be rendered with JavaScript, I am not sure whether I need a tool like PhantomJS or Selenium, or whether I can export the catalog with simpler tools like rvest.
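For context, this is the sort of browser-automation approach I have been considering: render the page with RSelenium and hand the rendered HTML to rvest. The search URL and the CSS selector are placeholders I have not worked out yet:

    # Render the JavaScript page in a real browser, then parse it with rvest.
    # The search URL and the "a" selector below are placeholders, not verified.
    library(RSelenium)
    library(rvest)

    driver <- rsDriver(browser = "firefox", verbose = FALSE)
    remDr  <- driver$client

    remDr$navigate("https://catalog.archives.gov/search?q=Home+Mortgage+Disclosure+Act")
    Sys.sleep(5)  # give the JavaScript time to populate the result list

    page  <- read_html(remDr$getPageSource()[[1]])
    links <- html_attr(html_nodes(page, "a"), "href")  # needs a narrower selector

    remDr$close()
    driver$server$stop()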
Once I have the URLs of the files, I can download them easily:
# Download a single file to a temporary location in binary mode
tf <- tempfile()
download.file("https://catalog.archives.gov/catalogmedia/lz/electronic-records/rg-082/hmda/233_32LU_TSS.pdf?download=false",
              tf, mode = "wb")
The resulting file would be titled:
"Technical Specifications Summary, 2012 Ultimate LAR."
Many thanks!
Update:
The main question is how to systematically get from the initial link (the series ID) to the titles and URLs of all downloadable files in the series. I have tried rvest and httr commands but have not had much success so far. Thank you!
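For completeness, this is the direction I suspect the answer lies in: if the catalog has a JSON search API behind its JavaScript front end, then httr plus jsonlite should be enough and no browser automation is needed. The endpoint and parameter names below are guesses on my part, not verified against any documentation:

    # Guessing at a JSON API behind the catalog's JavaScript front end.
    # The endpoint and parameter names here are assumptions, not verified.
    library(httr)
    library(jsonlite)

    series_naid <- "0000000"  # placeholder -- the series ID from the catalog URL

    res <- GET(
      "https://catalog.archives.gov/api/v1",   # assumed endpoint
      query = list(
        `description.fileUnit.parentSeries.naId` = series_naid,  # assumed filter
        rows   = 200,
        offset = 0
      )
    )
    stop_for_status(res)

    parsed <- fromJSON(content(res, as = "text", encoding = "UTF-8"),
                       simplifyVector = FALSE)

    # Inspect the nested structure to find file unit titles and object URLs
    str(parsed, max.level = 3)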