Data dumps
Libraries such as the Staatsbibliothek zu Berlin – Berlin State Library generally provide three types of data: images (scans of books, images contained in the scanned material, and others), full texts (OCR output from digitised books or manuscripts), and metadata.
Datasets that may be of interest to users in different contexts are listed here as examples. The starting point is mostly the entry page of the Digitized Collections of the Staatsbibliothek zu Berlin or, for metadata, the interfaces described here.
Metadata and Full Texts
Materials:
Metadata is a little-researched resource, which is regrettable: it is of high quality, having been created by trained librarians, archivists and other cultural heritage practitioners. The publication of a dataset comprising more than 200,000 lines of metadata therefore aims to make this high-quality type of data available in bundled form. The dataset presented here was generated from the METS files that are created when each individual work is digitised; these files were converted into a tabular format.
The dataset consists of a single .csv table (comma-separated values, UTF-8 encoded) containing metadata of all 206,411 works available in the digitised collections of the Berlin State Library (SBB) on 23 January 2023.
Scope:
206,411 metadata records (csv) as a data dump (approx. 216 MB).
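A table of this size can be processed row by row without loading it into memory at once. The following is a minimal sketch using only the Python standard library; the column names (`ppn`, `title`, `language`) and the in-memory sample are hypothetical stand-ins, since the actual column layout of the dump is not described here.

```python
import csv
import io
from collections import Counter


def count_values(csv_file, column):
    """Count how often each value occurs in one column of a CSV stream."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        # Empty or absent cells are grouped under a single label.
        value = row.get(column) or "(missing)"
        counts[value] += 1
    return counts


# Tiny stand-in for the real dump; the column names are hypothetical.
sample = io.StringIO(
    "ppn,title,language\n"
    "PPN001,Erstes Werk,ger\n"
    "PPN002,Second work,eng\n"
    "PPN003,Drittes Werk,ger\n"
    "PPN004,Untitled,\n"
)
language_counts = count_values(sample, "language")
print(language_counts.most_common())
```

For the real file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded .csv.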
Specifics:
The dataset does not contain any annotations beyond the information available in the METS source files.
The only pre-processing step performed is the conversion of the METS files into .csv format. This includes the clean-up that mods4pandas performs when converting a hierarchical format (METS/MODS) into a tabular one.
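To illustrate what converting a hierarchical format into a tabular one involves, here is a small sketch that flattens nested XML elements into a single row of named columns. It is not the mods4pandas implementation: the column naming scheme (element names joined by underscores, with an index for repeated elements) and the sample record are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def flatten(element, prefix="", row=None):
    """Turn nested XML elements into a flat {column_name: text} dict."""
    if row is None:
        row = {}
    # Count siblings with the same name so that repeats get their own index.
    seen = {}
    for child in element:
        name = local_name(child.tag)
        index = seen.get(name, 0)
        seen[name] = index + 1
        column = f"{prefix}{name}{index}"
        text = (child.text or "").strip()
        if text:
            row[column] = text
        # Descend into the child; its columns are prefixed with its own name.
        flatten(child, prefix=column + "_", row=row)
    return row


# Hypothetical, heavily shortened MODS-like record.
sample = """
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Ein Beispiel</title></titleInfo>
  <name><namePart>Erste Person</namePart></name>
  <name><namePart>Zweite Person</namePart></name>
</mods>
""".strip()

row = flatten(ET.fromstring(sample))
for column, value in row.items():
    print(column, "=", value)
```

Repeated elements such as the two `name` entries are the reason a flattened table tends to have many sparsely filled columns: every repetition needs a column of its own.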
Licence:
Contact person:
Zenodo DOI:
Materials:
The dataset comprises descriptive metadata for 2,619,397 titles, which together form the "Alte Realkatalog" of the Staatsbibliothek zu Berlin – Berlin State Library. The data are stored in a columnar format with 375 columns. They were downloaded from the German Library Network System (CBS) in December 2023. Example tasks that this dataset supports include studies on the history of books between 1500 and 1955, on the paratextual formatting of scientific books between 1800 and 1955, and on pattern recognition based on bibliographic metadata.
Scope:
2,619,397 metadata records in .parquet format as a data dump (approx. 960 MB).
Specifics:
The dataset does not contain any annotations. The data for the 2.6 million titles were converted from the format available in the CBS into a columnar format, with each field available in the CBS forming a column.
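The idea of one column per field can be sketched as follows. This is an illustration under stated assumptions, not the conversion that was actually used: the field names (`ppn`, `title`, `year`, `language`) are hypothetical, and the real data uses the fields defined in the CBS.

```python
def to_columns(records):
    """Convert a list of sparse field dicts into a dict of equal-length columns."""
    # The column set is the union of all field names seen in any record.
    field_names = sorted({name for record in records for name in record})
    columns = {name: [] for name in field_names}
    for record in records:
        for name in field_names:
            # None marks a field that this record does not have.
            columns[name].append(record.get(name))
    return columns


# Hypothetical records; not every record carries every field.
records = [
    {"ppn": "100000001", "title": "Erstes Beispiel", "year": "1801"},
    {"ppn": "100000002", "title": "Zweites Beispiel"},
    {"ppn": "100000003", "title": "Drittes Beispiel", "language": "ger"},
]
columns = to_columns(records)
for name, values in columns.items():
    print(name, values)
```

Because each field becomes a column regardless of how often it is filled, many of the columns are likely to be sparse. With a Parquet reader such as `pandas.read_parquet`, the `columns` argument can restrict loading to the fields a given study needs.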
Licence:
Contact person:
Zenodo DOI:
Materials:
The intention in creating this comprehensive dataset was to facilitate research based on the full texts available at the Staatsbibliothek zu Berlin – Berlin State Library. These full texts are usually generated by running optical character recognition (OCR) on the books in the library's digitised collections. There, the full texts can also be downloaded manually and individually, work by work. Publishing a set of around 5 million OCR pages facilitates access to the full texts and enables distant reading of a treasure trove of texts.
The dataset contains all full texts available in the digitised collections of the Berlin State Library on 21 August 2019.
Scope:
4,998,099 pages belonging to 28,909 individual works (identifiable via the Pica Production Number, PPN) as a data dump (approx. 17 GB).
Specifics:
The dataset does not contain any annotations.
The following pre-processing steps were carried out:
– extracting the OCR results from the ALTO.xml files generated by the OCR pipeline
– matching the unique identifiers (PPNs, Pica Production Numbers) used for individual digitised works with the ALTO.xml file names containing the output of each individual page processed
– providing metrics (confidence, entropy) relating to the language(s) found on each page
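The three steps above can be sketched as follows, using only the Python standard library. This is a hedged illustration, not the pipeline that produced the dataset. It assumes a path layout of the form `<PPN>/<page>.xml`, and it assumes that entropy means the Shannon entropy of a per-page language probability distribution. The sample page and the helper names are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml):
    """Return the page text of an ALTO document, one output line per TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            # In ALTO, each recognised word is a String element with a CONTENT attribute.
            words = [
                child.attrib.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return "\n".join(lines)


def ppn_and_page(path):
    """Split a path of the assumed form '<PPN>/<page>.xml' into PPN and page."""
    p = PurePosixPath(path)
    return p.parent.name, p.stem


def entropy(probabilities):
    """Shannon entropy (in bits) of a language probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical, heavily shortened ALTO page.
sample = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Der"/><SP/><String CONTENT="goldne"/></TextLine>
    <TextLine><String CONTENT="Topf"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>
""".strip()

page_text = alto_to_text(sample)
ppn, page = ppn_and_page("PPN123456789/00000001.xml")
print(ppn, page)
print(page_text)
print(entropy([0.5, 0.5]))
```

Under this reading, a page whose language detector is certain of a single language has an entropy of zero, while a page split evenly between two candidate languages has an entropy of one bit, so higher values hint at mixed-language or poorly recognised pages.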
Licence:
Contact person:
Zenodo DOI:
As of: 17.03.2023