OAI interface: requests, responses and downloading data with Python

This tutorial shows how to query the OAI-PMH interface of the Staatsbibliothek zu Berlin - PK with Python. It walks through example requests and explains how to download the data and save it in a CSV file.

Setting up the working environment

First, we set up the working environment by importing the required Python libraries. The "sickle" library is used for queries via the OAI interface, while the "etree" module of the "lxml" library is used for processing the XML data.
from sickle import Sickle
from lxml import etree

Querying all sets

The OAI base URL of the Staatsbibliothek zu Berlin - PK is https://oai.sbb.berlin. The ListSets request returns the sets into which the holdings are grouped; the setSpec value of a set can later be used to harvest only that set. Please note that the output is limited to 10 sets for the sake of clarity. If you want to view all sets, delete the corresponding lines of the code or comment them out with a #.
sickle = Sickle('https://oai.sbb.berlin')
oai_sets = sickle.ListSets()
counter = 0
for oai_set in oai_sets:
    print('setSpec value for selective harvesting: ' + oai_set.setSpec)
    print('Name of the set (setName): ' + oai_set.setName + '\n')
    # The following lines of code are there to limit the displayed results to 10. 
    # If you want to see all the results,
    # delete the segment below or insert a # before all the lines.
    counter += 1
    if counter >= 10:
        break
setSpec value for selective harvesting: all
Name of the set (setName): Alle Kategorien

setSpec value for selective harvesting: historische.drucke
Name of the set (setName): Historische Drucke

setSpec value for selective harvesting: theologie
Name of the set (setName): Theologie

setSpec value for selective harvesting: rechtswissenschaft
Name of the set (setName): Rechtswissenschaft

setSpec value for selective harvesting: militaerwesen
Name of the set (setName): Militärwesen

setSpec value for selective harvesting: geschichte.ethnographie.geographie
Name of the set (setName): Geschichte / Ethnographie / Geographie

setSpec value for selective harvesting: krieg.1914.1918
Name of the set (setName): Krieg 1914-1918

setSpec value for selective harvesting: landwirtschaft
Name of the set (setName): Landwirtschaft / Forstwirtschaft

setSpec value for selective harvesting: politik.staat.gesellschaft.wirtschaft
Name of the set (setName): Politik / Staat / Gesellschaft / Wirtschaft

setSpec value for selective harvesting: sprachen.literaturen
Name of the set (setName): Sprachen / Literaturen
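Each request to an OAI-PMH interface is an HTTP request whose verb parameter names the operation, for example ListSets or ListRecords. The following minimal sketch builds such request URLs by hand with the standard library, which can be handy for inspecting a response in the browser. The helper name build_oai_url is hypothetical and not part of Sickle; the set "theologie" is taken from the list above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://oai.sbb.berlin/'


def build_oai_url(verb, **params):
    """Return the request URL for an OAI-PMH verb (hypothetical helper)."""
    query = {'verb': verb}
    query.update(params)
    # urlencode keeps the order of the dictionary and escapes special characters
    return BASE_URL + '?' + urlencode(query)


list_sets_url = build_oai_url('ListSets')
list_records_url = build_oai_url('ListRecords', metadataPrefix='oai_dc', set='theologie')

print(list_sets_url)
print(list_records_url)
```

Opening either URL in a browser shows the XML that the Python code receives.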

Output of titles and authors of a set

In the following example, we list the titles and authors from the set "Illustrierte Liedflugschriften" (illustrated song pamphlets). The metadata format and the set must be specified for this request. Here we query the Dublin Core metadata (DC for short), identified by the metadata prefix oai_dc. The retrieved data is contained in the XML response as descriptive elements such as dc:title and dc:creator, which the code addresses with XPath expressions; the titles are numbered as they are printed. Only records whose XML contains the string 'ger', the language code for German, are considered. For more information on the DC metadata format, refer to the specifications at http://www.openarchives.org/OAI/2.0/oai_dc/ and http://purl.org/dc/elements/1.1/. Further OAI functions are specified at http://www.openarchives.org/OAI/2.0/ and described on the Stabi-Lab page under the "Data" tab. Please note that, for the sake of clarity, the output is limited to 10 results. If you want to view all results, delete the corresponding lines of the code or comment them out with a #.
namespaces = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
    'dc': 'http://purl.org/dc/elements/1.1/'
}

count=0
counter = 0

for record in sickle.ListRecords(**{'metadataPrefix': 'oai_dc', 'set': 'illustrierte.liedflugschriften'}):
    
    if ('ger' in record.raw):
        tree = etree.ElementTree(record.xml)
        result = tree.xpath('/oai:record/oai:metadata/oai_dc:dc/dc:title/text()', namespaces=namespaces)
        if (result):
            count += 1
            author = tree.xpath('/oai:record/oai:metadata/oai_dc:dc/dc:creator/text()', namespaces=namespaces)
            print(str(count) + ": " + result[0])
            # Print the author before checking the limit, so that the
            # author of the last displayed record is not skipped.
            if author:
                print(author[0])
            # The following lines of code are there to limit the displayed results to 10.
            # If you want to see all the results,
            # delete the segment below or insert a # before all the lines.
            counter += 1
            if counter >= 10:
                break
            # delete until here or insert a # before all the lines.
1: Sechs Geystliche Wey=||nacht Lieder/ Von der geburt Christi/|| vnd von den heiligen drey Kœnigen/|| Jm thon/ Wie bey einem jeden || Lied verzeichnet ist.||
2: Alte vnd || Newe Geistli=||che Lieder vnd Lob=||gesenge/ von der Ge=||burt Christi/ vnsers || Herrn/ fuer die Jun=||ge Christen.|| Johan Spang.||
Spangenberg, Johann
3: Ein schœn geist||lich Liedt/ Ach Herre || Gott/ mich treybt die not.|| Ein ander geistlich Lied/|| Der genaden Brunn thut fliessen. Jm thon/|| Die Bruenlein die thun fliessen.||
4: Ein new Lied/ der Je=||ger Geistlich.|| Ein ander geistlich lied/|| Jn dem thon/ Auss hertem || wee klagt sich ein Held.||
5: Ein schœn New Geist=||lich Lied/ Wach auff wach auff O || menschen kindt/ [et]c. Jm Thon/|| Kompt her zu mir spricht || Gottes Son.||
6: Der 103. Psalm/ Nu lob || mein seele den Herren. Jn || gesangs weyß. [Übers. v. Johannes Poliander] || Mer drey schöner Geist||licher Lieder. Das erst/ Herr Gott deine || gewalt/ ist vber jung vnd alt. Das ander/|| der mensch lebt nicht allein in brod.|| Das dritte/ Allein zu dir Herr || Jhesu Christ. [v. Johannes Schneesing]||
Schneesing, Johannes
7: Ein schœn new || Liede/ von Herrn D. Martini || Luthers sterben/ darinn kuertzlich be=||griffen/ was er inn der letzten zeit geredt || sehr trœstlich allen Christen/ durch || Leonhardum Ketner von || Herßbruck.|| Jm thon/ Jch rueff zu dir Herr || Jesu Christ.||
Kettner, Leonhart
8: Ein Christli=||cher Abentreien/ vom Leben || vnd ampt Johannis des Tauf=||fers/ fuer Christliche/ zuechtige || Jungfrawlein. N.H.||
Herman, Nikolaus
9: Zwey schœne newe Geist-||licher Liede/ auß Gœttlicher Schrifft/|| Von dem wuest~e wesen der jetzig~e bœsen || welt ... || ~Jm thon/|| Frisch auff jhr Landts=||knecht alle/ [et]c.|| Das Ander Lied zu bit=||ten vmb vergebung der Suend/ vnnd || vmb sterck~ug des glaubens/ Auch vmb || ein seliges endt. ~Jm thon/ wie der 13.|| Psalm/ HErr Gott wie lang || vergissest mein/ [et]c.||
Müntzer, M. R.
10: Zwey Schœne newe Lie=||der. Das erst/ Jch armer Suender klag || mich sehr. Das ander/ O Got Vat=||er im hœchsten Thron/ vnd || sind in dem Thon/ Jch || armes Meydlein || klag mich || seh/ etc.||
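The role of the namespaces dictionary is easier to see on a small record that does not require a network request. The following sketch uses xml.etree.ElementTree from the standard library instead of lxml and collects all titles and creators of a record rather than only the first. The sample record is hand-written and shortened for illustration; it is an assumption, not a verbatim server response, and the function name titles_and_creators is hypothetical.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

# Hand-written, shortened sample record (illustration only)
SAMPLE_RECORD = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Alte vnd Newe Geistliche Lieder</dc:title>
      <dc:creator>Spangenberg, Johann</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>
"""


def titles_and_creators(record_xml):
    """Return two lists: all dc:title and all dc:creator values of a record."""
    root = ET.fromstring(record_xml)
    # The prefixes in the path are resolved through the NAMESPACES dictionary
    dc = root.find('oai:metadata/oai_dc:dc', NAMESPACES)
    if dc is None:
        return [], []
    titles = [element.text for element in dc.findall('dc:title', NAMESPACES)]
    creators = [element.text for element in dc.findall('dc:creator', NAMESPACES)]
    return titles, creators


titles, creators = titles_and_creators(SAMPLE_RECORD)
print(titles)
print(creators)
```

Returning lists avoids an IndexError for records without a creator and keeps every value when an element occurs more than once.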
The following code outputs the links to the digitised objects. For this purpose, the XPath expression /oai:record/oai:metadata/oai_dc:dc/dc:identifier/text() selects the identifiers of a record; the first one, the PPN, is appended to the viewer URL and printed together with the title. Please note that the output is limited to 10 results for the sake of clarity. If you want to view all results, delete the corresponding lines of the code or comment them out with a #.
namespaces = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
    'dc': 'http://purl.org/dc/elements/1.1/'
}

count=0
counter=0

for record in sickle.ListRecords(**{'metadataPrefix': 'oai_dc', 'set': 'illustrierte.liedflugschriften'}):
    
    if ('ger' in record.raw):
        tree = etree.ElementTree(record.xml)
        result = tree.xpath('/oai:record/oai:metadata/oai_dc:dc/dc:title/text()', namespaces=namespaces)
        if (result):
            count += 1
            urn = tree.xpath('/oai:record/oai:metadata/oai_dc:dc/dc:identifier/text()', namespaces=namespaces)
            print(str(count) + ": " + result[0])
            if urn:
                print("https://digital.staatsbibliothek-berlin.de/werkansicht?ppn=" + urn[0])
                # The following lines of code are there to limit the displayed results to 10. 
                # If you want to see all the results,
                # delete the segment below or insert a # before all the lines.
                counter += 1
                if counter >= 10:
                    break
1: Sechs Geystliche Wey=||nacht Lieder/ Von der geburt Christi/|| vnd von den heiligen drey Kœnigen/|| Jm thon/ Wie bey einem jeden || Lied verzeichnet ist.||
https://digital.staatsbibliothek-berlin.de/werkansicht?ppn=PPN66805073X
2: Alte vnd || Newe Geistli=||che Lieder vnd Lob=||gesenge/ von der Ge=||burt Christi/ vnsers || Herrn/ fuer die Jun=||ge Christen.|| Johan Spang.||
https://digital.staatsbibliothek-berlin.de/werkansicht?ppn=PPN668091029
3: Ein schœn geist||lich Liedt/ Ach Herre || Gott/ mich treybt die not.|| Ein ander geistlich Lied/|| Der genaden Brunn thut fliessen. Jm thon/|| Die Bruenlein die thun fliessen.||
https://digital.staatsbibliothek-berlin.de/werkansicht?ppn=PPN668097213
4: Ein new Lied/ der Je=||ger Geistlich.|| Ein ander geistlich lied/|| Jn dem thon/ Auss hertem || wee klagt sich ein Held.||
https://digital.staatsbibliothek-berlin.de/werkansicht?ppn=PPN668102799
5: Ein schœn New Geist=||lich Lied/ Wach auff wach auff O || menschen kindt/ [et]c. Jm Thon/|| Kompt her zu mir spricht || Gottes Son.||
https://digital.staatsbibliothek-berlin.de/werkansicht?ppn=PPN66814730X
6: Der 103. Psalm/ Nu lob || mein seele den Herren. Jn || gesangs weyß. [Übers. v. Johannes Poliander] || Mer drey schöner Geist||licher Lieder. Das erst/ Herr Gott deine || gewalt/ ist vber jung vnd alt. Das ander/|| der mensch lebt nicht allein in brod.|| Das dritte/ Allein zu dir Herr || Jhesu Christ. [v. Johannes Schneesing]||
https://digital.staatsbibliothek-berlin.de/werkansicht?ppn=PPN668148160
7: Ein schœn new || Liede/ von Herrn D. Martini || Luthers sterben/ darinn kuertzlich be=||griffen/ was er inn der letzten zeit geredt || sehr trœstlich allen Christen/ durch || Leonhardum Ketner von || Herßbruck.|| Jm thon/ Jch rueff zu dir Herr || Jesu Christ.||
https://digital.staatsbibliothek-berlin.de/werkansicht?ppn=PPN668153601
8: Ein Christli=||cher Abentreien/ vom Leben || vnd ampt Johannis des Tauf=||fers/ fuer Christliche/ zuechtige || Jungfrawlein. N.H.||
https://digital.staatsbibliothek-berlin.de/werkansicht?ppn=PPN668165987
9: Zwey schœne newe Geist-||licher Liede/ auß Gœttlicher Schrifft/|| Von dem wuest~e wesen der jetzig~e bœsen || welt ... || ~Jm thon/|| Frisch auff jhr Landts=||knecht alle/ [et]c.|| Das Ander Lied zu bit=||ten vmb vergebung der Suend/ vnnd || vmb sterck~ug des glaubens/ Auch vmb || ein seliges endt. ~Jm thon/ wie der 13.|| Psalm/ HErr Gott wie lang || vergissest mein/ [et]c.||
https://digital.staatsbibliothek-berlin.de/werkansicht?ppn=PPN668409428
10: Zwey Schœne newe Lie=||der. Das erst/ Jch armer Suender klag || mich sehr. Das ander/ O Got Vat=||er im hœchsten Thron/ vnd || sind in dem Thon/ Jch || armes Meydlein || klag mich || seh/ etc.||
https://digital.staatsbibliothek-berlin.de/werkansicht?ppn=PPN66841247X
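The PPN also appears at the end of a record's OAI identifier, which has the form oai:digital.staatsbibliothek-berlin.de:PPN668091029 in the GetRecord request used later in this tutorial. As a minimal sketch, the viewer link can therefore be derived from that identifier as well. The helper name viewer_link is hypothetical, and the sketch assumes that the PPN is always the part after the last colon.

```python
VIEWER_URL = 'https://digital.staatsbibliothek-berlin.de/werkansicht?ppn='


def viewer_link(oai_identifier):
    """Build a viewer link from an OAI identifier (hypothetical helper)."""
    # Assumption: the PPN is the part after the last colon
    ppn = oai_identifier.rsplit(':', 1)[-1]
    if not ppn.startswith('PPN'):
        raise ValueError('No PPN found in identifier: ' + oai_identifier)
    return VIEWER_URL + ppn


link = viewer_link('oai:digital.staatsbibliothek-berlin.de:PPN668091029')
print(link)
```

With Sickle, the OAI identifier of a record should be available as record.header.identifier, so no XPath query is needed for this variant.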

Downloading the data as a CSV file

Below we show how to read out the metadata, load it into a pandas DataFrame and save it as a CSV file. To identify the metadata elements that suit your enquiry, we first take a look at the sub-elements of a single object, retrieved with a GetRecord request. Please note that we are looking at the DC metadata here, not the METS metadata.
import requests

response = requests.get('https://oai.sbb.berlin/?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:digital.staatsbibliothek-berlin.de:PPN668091029')
flugschrift1_xml = response.text
print(flugschrift1_xml)
<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet href="/oai-static/oai2.xsl" type="text/xsl"?>

2024-01-04T09:46:32Z
https://oai.sbb.berlin/

<header>oai:digital.staatsbibliothek-berlin.de:PPN668091029
2018-07-13T13:55:50Z
historische.drucke
musik
theologie</header>Alte vnd || Newe Geistli=||che Lieder vnd Lob=||gesenge/ von der Ge=||burt Christi/ vnsers || Herrn/ fuer die Jun=||ge Christen.|| Johan Spang.||
Spangenberg, Johann
Historische Drucke
Musik
Theologie
Sachse, Melchior
monograph
text
application/mets+xml
image/jpeg
PPN668091029
PPN567450481
http://resolver.staatsbibliothek-berlin.de/SBB00005DC200000000
ger
1544
Erfurt
Public Domain Mark 1.0
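Raw XML is hard to scan, so it can help to print each Dublin Core element name next to its value. The following sketch does this with the standard library for a hand-written fragment that contains only four elements of the record; the fragment and the function name list_elements are assumptions for illustration, not the complete server response.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment with four of the record's elements (illustration only)
SAMPLE_DC = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator>Spangenberg, Johann</dc:creator>
  <dc:publisher>Sachse, Melchior</dc:publisher>
  <dc:date>1544</dc:date>
  <dc:coverage>Erfurt</dc:coverage>
</oai_dc:dc>
"""


def list_elements(dc_xml):
    """Return (element name, value) pairs for all children of an oai_dc:dc element."""
    root = ET.fromstring(dc_xml)
    pairs = []
    for child in root:
        # ElementTree stores tags as '{namespace}name'; keep only the name
        local_name = child.tag.split('}', 1)[-1]
        pairs.append((local_name, child.text))
    return pairs


for name, value in list_elements(SAMPLE_DC):
    print(f'dc:{name}: {value}')
```

For a full response, the oai_dc:dc element would first have to be located, for example with root.find('.//oai_dc:dc', ns) and a suitable namespace dictionary.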
Let us now assume that you need the following metadata for further use: dc:date, dc:coverage, dc:publisher and dc:creator. For the record above, these are 1544, Erfurt, "Sachse, Melchior" and "Spangenberg, Johann". As an example, we will collect these elements for all 1589 objects in the "Illustrierte Liedflugschriften" set in a DataFrame (a table in which data is organised in rows and columns).
# Import the necessary libraries
import requests
import xml.etree.ElementTree as ET
import pandas as pd

# Defining the OAI parameters
base_url = 'https://oai.sbb.berlin/'
metadata_prefix = 'oai_dc'
set_name = 'illustrierte.liedflugschriften'

# Define the function for the extraction of data from the OAI response
def extract_data(response_xml):
    ns = {
        "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
        "dc": "http://purl.org/dc/elements/1.1/"
    }
    dates = []
    places = []
    publishers = []
    creators = []

    flugschriften = ET.fromstring(response_xml)
    for item in flugschriften.findall(".//oai_dc:dc", ns):
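        # To query other metadata elements, replace .//dc:date
        # with the element you need.
        # Missing elements are stored as None so that all lists keep the same length.
        date_element = item.find('.//dc:date', ns)
        dates.append(date_element.text if date_element is not None else None)
        coverage_element = item.find('.//dc:coverage', ns)
        places.append(coverage_element.text if coverage_element is not None else None)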
        publisher_element = item.find('.//dc:publisher', ns)
        if publisher_element is not None:
            publishers.append(publisher_element.text)
        else:
            publishers.append(None)
        creator_element = item.find('.//dc:creator', ns)
        if creator_element is not None:
            creators.append(creator_element.text)
        else:
            creators.append(None)

    return dates, places, publishers, creators

# Define the function for the OAI request and extraction of the data
def get_data(url):
    response = requests.get(url)
    response_xml = response.text
    dates, places, publishers, creators = extract_data(response_xml)
    return dates, places, publishers, creators, response_xml

# Create empty lists and query the data from all pages
all_dates = []
all_places = []
all_publishers = []
all_creators = []

next_url = f"{base_url}?verb=ListRecords&metadataPrefix={metadata_prefix}&set={set_name}"
while next_url:
    dates, places, publishers, creators, response_xml = get_data(next_url)
    all_dates.extend(dates)
    all_places.extend(places)
    all_publishers.extend(publishers)
    all_creators.extend(creators)
    
    # Check whether there is a resumption token and further data.
    # The last page of a list carries an empty resumptionToken element,
    # so the text of the token has to be checked as well.
    root = ET.fromstring(response_xml)
    resumption_token = root.find('.//{http://www.openarchives.org/OAI/2.0/}resumptionToken')
    if resumption_token is not None and resumption_token.text:
        next_url = f"{base_url}?verb=ListRecords&resumptionToken={resumption_token.text}"
    else:
        next_url = None

# Create a data frame with all the data collected
flugschriften_df = pd.DataFrame({'date': all_dates, 'place': all_places, 'publisher': all_publishers, 'creator': all_creators})

flugschriften_df
      date  place     publisher             creator
0     1555  Nürnberg  Gutknecht, Friedrich  None
1     1544  Erfurt    Sachse, Melchior      Spangenberg, Johann
2     1560  Nürnberg  Neuber, Valentin      None
3     1560  s.l.      None                  None
4     1560  Nürnberg  Gutknecht, Friedrich  None
...   ...   ...       ...                   ...
1584  1850  [S.l.]    None                  None
1585  1850  [S.l.]    None                  None
1586  1870  [Berlin]  Queva                 Queva, A.
1587  1850  [S.l.]    None                  Mosen, Julius
1588  1850  [S.l.]    None                  None
1589 rows × 4 columns
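The paging logic of the loop in the code above can be isolated into a small generator, which makes it possible to try it out without contacting the server. In this sketch, fetch stands for any function that takes a URL and returns the XML text; the two pages, the token "abc" and the address https://example.org/ are invented for the test run, and the name iter_pages is hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'


def iter_pages(fetch, first_url, base_url):
    """Yield every response page, following resumption tokens until none is left."""
    url = first_url
    while url:
        xml_text = fetch(url)
        yield xml_text
        token = ET.fromstring(xml_text).find(f'.//{OAI_NS}resumptionToken')
        # An empty resumptionToken element marks the last page
        if token is not None and token.text:
            url = f'{base_url}?verb=ListRecords&resumptionToken={quote(token.text)}'
        else:
            url = None


# Two invented pages stand in for the server (illustration only)
FAKE_PAGES = {
    'start': (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
        '<resumptionToken>abc</resumptionToken></ListRecords></OAI-PMH>'
    ),
    'https://example.org/?verb=ListRecords&resumptionToken=abc': (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
        '<resumptionToken/></ListRecords></OAI-PMH>'
    ),
}

pages = list(iter_pages(FAKE_PAGES.__getitem__, 'start', 'https://example.org/'))
print(len(pages))
```

For real requests, fetch could be a function such as lambda url: requests.get(url).text, combined with the base URL and first request URL from the code above.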
If you want to save this DataFrame as a CSV file and use it locally, use the following code. Remove the # in front of the code to execute it. This writes the file FS1.csv to your working directory, which you can then open and process, for example in a Jupyter notebook or in Excel.
# flugschriften_df.to_csv('FS1.csv')
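If pandas is not available, the collected lists can also be written with the csv module from the standard library. This is an alternative sketch, not the method used in this notebook. The function name write_csv is hypothetical, the two rows are the first two rows of the table above, and the file is written to a temporary directory so that nothing is overwritten.

```python
import csv
import os
import tempfile


def write_csv(path, dates, places, publishers, creators):
    """Write the four metadata lists as columns of a CSV file."""
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(['date', 'place', 'publisher', 'creator'])
        # zip() combines the n-th entry of every list into one row;
        # None is written as an empty field
        writer.writerows(zip(dates, places, publishers, creators))


csv_path = os.path.join(tempfile.mkdtemp(), 'FS1.csv')
write_csv(
    csv_path,
    ['1555', '1544'],
    ['Nürnberg', 'Erfurt'],
    ['Gutknecht, Friedrich', 'Sachse, Melchior'],
    [None, 'Spangenberg, Johann'],
)

# Read the file back to check its content
with open(csv_path, newline='', encoding='utf-8') as handle:
    rows = list(csv.DictReader(handle))
print(rows)
```

Values that contain a comma, such as the names, are quoted automatically by the csv module, so the columns stay intact when the file is opened again.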

Video tutorials

For inspiration on how to prepare the data visually, helpful tutorials are available at https://infovis.fh-potsdam.de/tutorials/. These tutorials were created by Prof Dr Marian Dörk from the Potsdam University of Applied Sciences. This Jupyter notebook was created by Ulrike Förstel as part of her bachelor's thesis in the Information and Data Management programme at the Potsdam University of Applied Sciences.