OAI interface request, response, and data download with R

This tutorial shows how to query an OAI interface with R. It walks through example queries against the OAI-PMH interface of the Staatsbibliothek zu Berlin – PK and explains how to download the data and save it as a CSV or JSON file.

Setting up the working environment

First, define the working directory in which the files generated in the course of this tutorial will be saved. Use getwd() to find out where R currently saves your files, and change this location with setwd() if necessary.

getwd()
setwd("path/to/my/working/directory")

The working environment is then set up by installing the required packages.

install.packages("oai")
install.packages("stringr")
install.packages("jsonlite")
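If some of these packages are already present, reinstalling them is unnecessary. The following is a minimal sketch, assuming you only want to know which packages are still missing; the helper name missing_packages is hypothetical and not part of any package.

```r
# Return the subset of `pkgs` that cannot be loaded on this machine.
# `missing_packages` is a hypothetical helper written for this illustration.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

to_install <- missing_packages(c("oai", "stringr", "jsonlite"))
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install only what is missing
}
```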

To use the installed packages, load them now so that R can access the functions they contain.

library(oai)
library(stringr)
library(jsonlite)

Querying the OAI interface

In the first step, we look at the basic information about the OAI-PMH interface of the Staatsbibliothek zu Berlin – PK. To do this, we send an Identify request with the id() function and the base URL of the OAI interface (https://oai.sbb.berlin).

id("https://oai.sbb.berlin/")
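Behind each of these functions is a plain HTTP request: the base URL followed by a verb parameter (such as Identify, ListSets or ListIdentifiers) and optional arguments like metadataPrefix and set. The following sketch only illustrates how such a request URL is composed; build_oai_url is a hypothetical helper, not a function of the oai package, and it does not send anything over the network.

```r
# Compose an OAI-PMH request URL from a base URL, a verb and optional arguments.
# `build_oai_url` is a hypothetical helper for illustration only.
build_oai_url <- function(base_url, verb, ...) {
  params <- c(list(verb = verb), list(...))
  values <- vapply(
    params,
    function(p) utils::URLencode(as.character(p), reserved = TRUE),
    character(1)
  )
  paste0(base_url, "?", paste0(names(params), "=", values, collapse = "&"))
}

identify_url <- build_oai_url("https://oai.sbb.berlin/", "Identify")
print(identify_url)

identifiers_url <- build_oai_url(
  "https://oai.sbb.berlin/", "ListIdentifiers",
  metadataPrefix = "oai_dc",
  set = "illustrierte.liedflugschriften"
)
print(identifiers_url)
```

Pasting such a URL into a browser shows the raw XML response that the R functions parse for you.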

Query all data sets

In the next step, all available data sets are queried and displayed using the list_sets function.

list_sets("https://oai.sbb.berlin/?verb=ListSets")

To display the query results as an easy-to-read table, store them in an object and open it with View().

SBB_Sets <- list_sets("https://oai.sbb.berlin/?verb=ListSets")
View(SBB_Sets)
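With many sets, scrolling through the table can be tedious. The sketch below searches the set names for a keyword. It assumes that the table has the columns setSpec and setName (the element names of an OAI-PMH ListSets response); the small data frame is a made-up stand-in for SBB_Sets so that the example runs without a network connection, and its second row is purely illustrative.

```r
# Toy stand-in for SBB_Sets; the second row is invented for illustration.
sets_example <- data.frame(
  setSpec = c("illustrierte.liedflugschriften", "example.set"),
  setName = c("Illustrierte Liedflugschriften", "Example set"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose set name contains `pattern` (case-insensitive).
# `find_sets` is a hypothetical helper, not part of the oai package.
find_sets <- function(sets, pattern) {
  sets[grepl(pattern, sets$setName, ignore.case = TRUE), , drop = FALSE]
}

hits <- find_sets(sets_example, "lied")
print(hits)
```

The value in the setSpec column is what you pass as the set argument in later queries.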

To find out which metadata formats the OAI interface offers, we can use the following code.

Metadata <- list_metadataformats(url = "https://oai.sbb.berlin/")
View(Metadata)

In addition, we can display how many data records are available in each metadata format. The format is passed via the prefix argument.

count_identifiers(url = "https://oai.sbb.berlin/", prefix = "mets")

Total = 216,854 mets records

count_identifiers(url = "https://oai.sbb.berlin/", prefix = "oai_dc")

Total = 216,854 oai_dc records

In the following examples, we will look at the set “Illustrierte Liedflugschriften”. Firstly, let’s look at how many data records this set contains.

count_identifiers(url = "https://oai.sbb.berlin/?verb=ListIdentifiers&set=illustrierte.liedflugschriften", 
                  prefix = "oai_dc")

Total = 1589 records

Next, we retrieve all data records of this set and store them as a table for better readability.

record_list <- list_records("https://oai.sbb.berlin/", 
                            prefix = "oai_dc", 
                            set = "illustrierte.liedflugschriften")

View the table:

View(record_list)

Output of titles and authors of a set

If the authors and titles of the song leaflets are particularly important to you, you can display them with the following code. The output is limited to the first 10 entries here; to display more, change the number in the brackets.

head(record_list$title, 10)
head(record_list$creator, 10)

In the output you will see the value “NA”, which stands for “not available”. It means that an author is not known for every song leaflet.
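To see how often an author is missing, you can count the NA values. The sketch below uses a small made-up vector in place of record_list$creator so that it runs on its own; the names in it are placeholders.

```r
# Toy stand-in for record_list$creator; the names are placeholders.
creators <- c("Author A", NA, "Author B", NA, NA)

n_missing <- sum(is.na(creators))       # number of entries without an author
share_missing <- mean(is.na(creators))  # proportion of entries without an author

print(n_missing)
print(share_missing)
```

Applied to the real data, the same two expressions would be sum(is.na(record_list$creator)) and mean(is.na(record_list$creator)).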

The following code requires one more package, which we now install and load.

install.packages("dplyr")
library(dplyr)

The following code builds the links to the digitised objects. We have to create these ourselves by replacing the OAI prefix of each identifier with the address of the work's presentation view (Werkansicht).

record_list <- record_list %>% 
  # fixed() matches the prefix literally, so its dots are not treated as regex wildcards
  mutate(Werkansicht = str_replace_all(identifier,
                                       fixed("oai:digital.staatsbibliothek-berlin.de:"),
                                       "https://digital.staatsbibliothek-berlin.de/werkansicht?ppn="))

We will now output the first 10 links to the digitised objects. If you want to output more links, simply change the corresponding number in the brackets.

head(record_list$Werkansicht, 10)
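To make the replacement step easier to follow, here is a base R sketch applied to a single identifier. The prefix and target address are taken from the code above; the identifier suffix PPN000000000 is a hypothetical placeholder, not a real record, and to_viewer_url is a helper name invented for this illustration.

```r
oai_prefix <- "oai:digital.staatsbibliothek-berlin.de:"
viewer_prefix <- "https://digital.staatsbibliothek-berlin.de/werkansicht?ppn="

# Replace the OAI prefix with the viewer address (literal match, no regex).
to_viewer_url <- function(identifier) {
  sub(oai_prefix, viewer_prefix, identifier, fixed = TRUE)
}

# Hypothetical identifier used only to show the before/after form.
example_id <- paste0(oai_prefix, "PPN000000000")
example_url <- to_viewer_url(example_id)
print(example_url)
```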

Downloading the data as a CSV file and in JSON format

Let us now assume that you need the following metadata for further use: date, coverage, publisher and creator. As an example, we will include these in a table for all 1589 objects in the “Illustrierte Liedflugschriften” set.

my_metadata <- select(record_list, date, coverage, publisher, creator)
View(my_metadata)
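Not every set necessarily fills every metadata field, and selecting a column that does not exist stops with an error. As a hedged sketch, you could check beforehand which of the desired columns are present; the data frame below is a made-up stand-in for record_list in which the coverage column is deliberately absent.

```r
wanted <- c("date", "coverage", "publisher", "creator")

# Toy stand-in for record_list; values are placeholders.
toy_records <- data.frame(
  date = "1550",
  publisher = "Printer A",
  creator = NA_character_,
  stringsAsFactors = FALSE
)

present <- intersect(wanted, names(toy_records))  # columns that can be selected
absent <- setdiff(wanted, names(toy_records))     # columns missing from the table

subset_records <- toy_records[, present, drop = FALSE]
print(present)
print(absent)
```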

We will now save this table as a CSV file in the working folder.

write.csv(my_metadata, file = "my_metadata_liedflugschriften.csv", fileEncoding = "UTF-8", row.names = FALSE, na = "")

If you would like to save the file with all metadata as a CSV file, please use the following code.

write.csv(record_list, file = "record_list_liedflugschriften.csv", fileEncoding = "UTF-8", row.names = FALSE, na = "")
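Because na = "" writes missing values as empty cells, it can be worth reading a file back to confirm that they are recognised as NA again. The sketch below does this round trip with a small made-up table and a temporary file, so it does not touch your working directory; the values are placeholders.

```r
# Toy table with one missing author; values are placeholders.
toy_metadata <- data.frame(
  date = c("1550", "1560"),
  creator = c("Author A", NA),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
write.csv(toy_metadata, file = csv_path, fileEncoding = "UTF-8",
          row.names = FALSE, na = "")

# na.strings = "" turns the empty cells back into NA when reading.
check <- read.csv(csv_path, fileEncoding = "UTF-8", na.strings = "",
                  stringsAsFactors = FALSE)
print(check)
```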

However, if you would like to query the data records in OAI-DC metadata format and save them as a JSON file, you can use the following code.

record_data_oai_dc_xml <- get_records(ids = record_list$identifier,
                                      url="https://oai.sbb.berlin/",
                                      prefix="oai_dc",
                                      as = "parsed")
json_obj_dc <- toJSON(record_data_oai_dc_xml, pretty = TRUE, auto_unbox = TRUE)
write(json_obj_dc, "illustrierte.liedflugschriften.oai_dc.json")
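To get a feeling for what the JSON conversion does, the following sketch converts a small nested list to JSON and reads it back with fromJSON(). The list is a made-up stand-in whose shape (one named entry per identifier) is an assumption for illustration; the actual structure returned by get_records() may differ.

```r
library(jsonlite)

# Toy stand-in for parsed records; identifiers and values are placeholders.
toy_parsed <- list(
  "oai:example:1" = list(title = "Title A", creator = c("Author A", "Author B")),
  "oai:example:2" = list(title = "Title B", creator = "Author C")
)

# auto_unbox = TRUE writes length-one vectors as plain values instead of arrays.
json_txt <- toJSON(toy_parsed, pretty = TRUE, auto_unbox = TRUE)
cat(json_txt, "\n")

# Read the JSON text back into an R list.
restored <- fromJSON(json_txt)
```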

Video tutorial

For inspiration on how to prepare the data visually, helpful tutorials are available at https://infovis.fh-potsdam.de/tutorials/. These tutorials were created by Prof Dr Marian Dörk from the Potsdam University of Applied Sciences.

Download the R script:
https://github.com/joergleh/OAI-Tutorial-mit-R