I'm looking for a way to programmatically scrape all available files for a data file series on archives.gov using R. The site appears to rely on JavaScript. My objective is to extract the URL and name of each available file.
The Home Mortgage Disclosure Act data file series consists of 153 entries.
In a browser, clicking the "Export" button gives me a CSV file; the first exported record looks like this (shown as dput() output):
first_exported_record <-
  structure(list(resultType = structure(1L, .Label = "fileUnit", class = "factor"),
    creators.0 = structure(1L, .Label = "Federal Reserve System. Board of Governors. Division of Consumer and Community Affairs. ca. 1981- (Most Recent)", class = "factor"),
    date = structure(1L, .Label = "1981 - 2013", class = "factor"),
    # ... truncated for brevity
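For reference, here is a minimal sketch of reading that export back into R as plain character columns rather than factors. It assumes the browser export was saved locally as export.csv (a hypothetical filename) and that the header names match the dput() output above:

    # Assumes the browser export was saved as "export.csv" (hypothetical filename)
    exported <- read.csv("export.csv", stringsAsFactors = FALSE)
    head(exported[, c("resultType", "creators.0", "date")])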
In addition to these entries, each record points to a file unit page containing several files available for download. For example, the first exported record leads to:
Since both of these pages appear to be rendered with JavaScript, I am not sure whether I need a tool like PhantomJS or Selenium, or whether I can export the catalog with simpler tools like rvest.
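For context, this is the sort of browser-automation approach I have been considering: render the page with RSelenium and hand the rendered HTML to rvest. The search URL and the CSS selector are placeholders I have not worked out yet:

    # Render the JavaScript page in a real browser, then parse it with rvest.
    # The search URL and the "a" selector below are placeholders, not verified.
    library(RSelenium)
    library(rvest)

    driver <- rsDriver(browser = "firefox", verbose = FALSE)
    remDr  <- driver$client

    remDr$navigate("https://catalog.archives.gov/search?q=Home+Mortgage+Disclosure+Act")
    Sys.sleep(5)  # give the JavaScript time to populate the result list

    page  <- read_html(remDr$getPageSource()[[1]])
    links <- html_attr(html_nodes(page, "a"), "href")  # needs a narrower selector

    remDr$close()
    driver$server$stop()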
Once I have the URLs of the files, I can download them easily:
# Download a single file to a temporary location in binary mode
tf <- tempfile()
download.file("https://catalog.archives.gov/catalogmedia/lz/electronic-records/rg-082/hmda/233_32LU_TSS.pdf?download=false",
              tf, mode = "wb")
The resulting file would be titled:
"Technical Specifications Summary, 2012 Ultimate LAR."
Many thanks!
Update:
The main question is how to systematically get from the initial link (the series ID) to the titles and URLs of all downloadable files in the series. I have tried rvest and httr commands but have not had much success so far. Thank you!
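For completeness, this is the direction I suspect the answer lies in: if the catalog has a JSON search API behind its JavaScript front end, then httr plus jsonlite should be enough and no browser automation is needed. The endpoint and parameter names below are guesses on my part, not verified against any documentation:

    # Guessing at a JSON API behind the catalog's JavaScript front end.
    # The endpoint and parameter names here are assumptions, not verified.
    library(httr)
    library(jsonlite)

    series_naid <- "0000000"  # placeholder -- the series ID from the catalog URL

    res <- GET(
      "https://catalog.archives.gov/api/v1",   # assumed endpoint
      query = list(
        `description.fileUnit.parentSeries.naId` = series_naid,  # assumed filter
        rows   = 200,
        offset = 0
      )
    )
    stop_for_status(res)

    parsed <- fromJSON(content(res, as = "text", encoding = "UTF-8"),
                       simplifyVector = FALSE)

    # Inspect the nested structure to find file unit titles and object URLs
    str(parsed, max.level = 3)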