A step-by-step guide on extracting all documents within a directory collection from the official national archives website using the R programming language

I'm looking for a way to programmatically scrape all available files for a data file series on archives.gov using R. The site appears to render its pages with JavaScript. My goal is to extract the URL of each available file along with its name.

The Home Mortgage Disclosure Act data file series consists of 153 entries.

When in a browser, by clicking the "Export" button I can obtain a CSV file structured as follows:

first_exported_record <-
    structure(list(resultType = structure(1L, .Label = "fileUnit", class = "factor"),
    creators.0 = structure(1L, .Label = "Federal Reserve System. Board of Governors. Division of Consumer and Community Affairs. ca. 1981- (Most Recent)", class = "factor"),
    date = structure(1L, .Label = "1981 - 2013", class = "factor"),
    # ... truncated for brevity

In addition, beyond these entries, there are file unit pages containing several files available for downloading. For example, the first exported record leads to:

Both of these pages appear to be rendered with JavaScript, so I am unsure whether I need a tool like PhantomJS or Selenium, or whether I can export the catalog with something simpler such as rvest.

Once I have the URL of each file, I can download them easily:

tf <- tempfile()
download.file("https://catalog.archives.gov/catalogmedia/lz/electronic-records/rg-082/hmda/233_32LU_TSS.pdf?download=false", tf, mode = 'wb')

The resulting file would be titled:

"Technical Specifications Summary, 2012 Ultimate LAR."

Many thanks!

Update:

The main question at hand is determining how I can systematically navigate from the initial link (the series ID) to the titles and URLs of all downloadable files within the series. I have attempted commands with rvest and httr but have not had much success... :/ Thank you

Answer №1

There is no need to manually load and process the page content as the records are fetched through a straightforward Ajax request.

If you want to view the requests, you can monitor them using the browser's devtools and select the initial one that returns JSON data. After that, utilize the jsonlite library to retrieve the same URL with R, which will automatically handle the parsing of the response.

To display all the files (along with their descriptions and URLs) for the total of 153 entries:

library(jsonlite)
options(timeout = 60000)  # raise the download timeout; the value is in seconds (R's default is 60)

json = fromJSON("https://catalog.archives.gov/OpaAPI/iapi/v1?action=search&f.level=fileUnit&f.parentNaId=2456161&q=*:*&offset=0&rows=10000&tabType=all")
ids = json$opaResponse$results$result$naId

for (id in ids) { # loop through each id
    json = fromJSON(sprintf("https://catalog.archives.gov/OpaAPI/iapi/v1/id/%s", id))
    records = json$opaResponse$content$objects$objects$object

    for (r in seq_len(NROW(records))) {  # iterate over each record; does nothing when a file unit has no objects
        # output the file description and URL
        print(records[r, 'description'])
        print(records[r, '@renditionBaseUrl'])
    }
}
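
If you would rather keep the results than print them, the inner loop can be swapped for a small helper that stacks everything into one data frame. This is only a sketch: `collect_files` and `records_by_id` are hypothetical names, and it assumes each `records` table has the `description` and `@renditionBaseUrl` columns used above, with one plain value per row. The `demo` list is made-up data standing in for real API responses.

```r
# Combine per-file-unit object tables into one data frame of titles and URLs.
# `records_by_id` is a hypothetical named list: one data frame per file unit,
# each shaped like `records` in the loop above (NULL when a unit has no objects).
collect_files <- function(records_by_id) {
  rows <- lapply(names(records_by_id), function(id) {
    records <- records_by_id[[id]]
    if (is.null(records) || NROW(records) == 0) return(NULL)
    data.frame(
      naId        = id,
      description = as.character(records[["description"]]),
      url         = as.character(records[["@renditionBaseUrl"]]),
      stringsAsFactors = FALSE
    )
  })
  rows <- Filter(Negate(is.null), rows)  # drop file units without objects
  do.call(rbind, rows)
}

# Made-up stand-in for what the API loop would gather.
demo <- list(
  "111" = data.frame(
    description = c("Doc A", "Doc B"),
    `@renditionBaseUrl` = c("https://example.org/a.pdf", "https://example.org/b.pdf"),
    check.names = FALSE, stringsAsFactors = FALSE
  ),
  "222" = NULL
)
all_files <- collect_files(demo)
```

In the real loop you would store each `records` object in a list keyed by `id` and call `collect_files()` once at the end.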

Answer №2

If you're familiar with utilizing httr, you may want to consider leveraging the National Archives Catalog API to communicate with their server directly. By exploring the website, it seems there is a method for querying and requesting data without needing to scrape the webpage.

Experimenting in the sandbox environment without an API key, I tried translating your webpage query into an API query:

https://catalog.archives.gov/api/v1?&q=*:*&resultTypes=fileUnit&parentNaId=2456161

Unfortunately, the parentNaId field name does not seem to be recognized, possibly because I lack sufficient privileges without an API key. I'm not familiar with R myself, so you will need to work out how to wire this up with httr.

Hopefully, this insight proves to be somewhat helpful.

Answer №3

Provided by the team behind the API at the National Archives and Records Administration.

Hello Anthony,

No need to resort to scraping; NARA's catalog offers an open API. To clarify, you are interested in retrieving all media files (referred to as "objects" in our catalog) within the file units of the series "Home Mortgage Disclosure Data Files" (NAID 2456161).

The API supports fielded searches on any data field, meaning instead of having a specific search parameter like "parentNaId," it would be more effective to search based on that specific field - bringing back records where the parent series NAID is 2456161. By exploring one of these file units using its identifier (e.g., ), you can identify the field containing the parent series as "description.fileUnit.parentSeries." Therefore, all your file units and their objects will be accessible at . If you only require the objects without the file unit records, you can include the "&type=object" parameter. Similarly, for the file unit metadata, you can refine results with "type=description", as each file unit record includes data for its child objects. As there are over 1000 results, you may need to use the "rows" parameter to retrieve all results or paginate using the "offset" parameter with smaller "rows" values since the default response displays only the first 10 results.

Upon examining the object metadata, you will find URLs for downloading media along with other pertinent information. Notably, some objects are classified as electronic records sourced from agencies while others are NARA-generated technical documentation, evident in the "designator" field.

Feel free to reach out if further questions arise.

Thank you! Dominic
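
The paging advice above ("rows" plus "offset") could be sketched in base R roughly as follows. `paged_urls` is a hypothetical helper, the example URL is a placeholder rather than a real NARA endpoint, and the sketch assumes a zero-based offset and that the total number of results is already known (for instance from the first response).

```r
# Build one query URL per page using the "rows" and "offset" parameters.
# `base_url` is whatever search URL you are paging through.
paged_urls <- function(base_url, total, rows = 100) {
  # zero-based offsets: 0, rows, 2 * rows, ... up to the last page
  offsets <- seq(from = 0, to = max(total - 1, 0), by = rows)
  # append with "&" if the URL already has a query string, otherwise "?"
  sep <- if (grepl("?", base_url, fixed = TRUE)) "&" else "?"
  sprintf("%s%srows=%d&offset=%d", base_url, sep,
          as.integer(rows), as.integer(offsets))
}

# Placeholder URL; 1050 results fetched 500 at a time need three requests.
urls <- paged_urls("https://example.invalid/search?q=*:*", total = 1050, rows = 500)
```

Each element of `urls` can then be passed to `jsonlite::fromJSON()` in turn, ideally with a short `Sys.sleep()` between requests.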

Answer №4

If you would like to use RSelenium and rvest, you can use the following code:

library(RSelenium)
library(rvest)
url <- "https://catalog.archives.gov/search?q=*:*&f.parentNaId=2456161&f.level=fileUnit&sort=naIdSort%20asc&rows=500"
rD <- rsDriver()
remDr <- rD[["client"]]

remDr$navigate(url)
page <- read_html(remDr$getPageSource()[[1]])
links <- page %>% html_nodes(".row.result .titleResult a") %>% html_attr("href")
links <- gsub("\\?\\&.{1,}","",links)
links <- paste0("https://catalog.archives.gov",links)

files <- NULL
names <- NULL

for (link in links) {
    remDr$navigate(link)
    Sys.sleep(3)  # give the JavaScript-rendered page time to load
    page <- read_html(remDr$getPageSource()[[1]])
    file <- page %>% html_nodes(".uer-list.documents .uer-row1 a") %>%
        html_attr("href")
    name <- page %>% html_nodes(".uer-list.documents .uer-row1 a span") %>% html_text()
    ind <- which(regexpr("Technical", name) != -1)  # keep only entries whose title contains "Technical"
    file <- file[ind]
    name <- name[ind]
    files <- c(files, file)  # accumulate file URLs
    names <- c(names, name)  # accumulate the matching titles
    Sys.sleep(1)
}

Hopefully this works on your setup. I am running Windows 10 x64.
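
Once you have a vector of URLs and a vector of titles, downloading could look something like the sketch below. `safe_filename` and `download_all` are hypothetical helpers, not part of any package, and the sketch assumes the URLs are absolute (hrefs scraped from a page may need the site prefix pasted on first).

```r
# Turn a document title into a safe local file name, keeping the URL's extension.
safe_filename <- function(title, url) {
  path <- sub("[?#].*$", "", url)            # drop query string and fragment
  ext  <- tools::file_ext(path)              # e.g. "pdf", or "" if none
  stem <- gsub("[^A-Za-z0-9]+", "_", title)  # replace runs of unsafe characters
  stem <- gsub("^_+|_+$", "", stem)          # trim leading/trailing underscores
  ifelse(nzchar(ext), paste0(stem, ".", ext), stem)
}

# Download every file into `dest_dir` and return the local paths.
download_all <- function(urls, titles, dest_dir = tempdir()) {
  dest <- file.path(dest_dir, safe_filename(titles, urls))
  for (i in seq_along(urls)) {
    download.file(urls[i], dest[i], mode = "wb")
    Sys.sleep(1)  # pause between requests to go easy on the server
  }
  dest
}

# File name derived from the title and URL quoted in the question.
example_name <- safe_filename(
  "Technical Specifications Summary, 2012 Ultimate LAR.",
  "https://catalog.archives.gov/catalogmedia/lz/electronic-records/rg-082/hmda/233_32LU_TSS.pdf?download=false"
)
```

Calling `download_all(files, names)` would then fetch everything; titles that repeat across file units would overwrite each other, so you may want to prefix the file unit id.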

Gottavianoni
