Data dumps

09.08.2024

Libraries such as the Staatsbibliothek zu Berlin – Berlin State Library generally provide three types of data: full texts (OCR output from digitised books or manuscripts), metadata, and images (scans of books, images contained in the scanned material, or others).

Datasets that may be of interest to users in different contexts are listed here as examples. The starting point is mostly the entry page of the Digitized Collections of the Staatsbibliothek zu Berlin or, for metadata, the interfaces described here.


Full Texts, Metadata and Images


☍ Full texts of the Digitised Collections of the Berlin State Library

Materials:

The intention in creating this comprehensive dataset was to facilitate research based on the full texts available at the Staatsbibliothek zu Berlin – Berlin State Library. These full texts are usually generated by applying optical character recognition (OCR) to the books in the digitised collections of the Staatsbibliothek zu Berlin – Berlin State Library. There, the full texts can also be downloaded manually and individually, work by work. The publication of a set of around 5 million OCR pages facilitates access to the full texts and enables distant reading of a veritable “golden pot” of texts.

The dataset contains all full texts available in the digitised collections of the Berlin State Library on 21 August 2019.

Scope:

4,998,099 pages belonging to 28,909 individual works (identifiable via the Pica Production Number, PPN) as a data dump (approx. 17 GB).

Specifics:

The dataset does not contain any annotations.

The following pre-processing steps were carried out:

– extraction of the OCR results from the ALTO XML files generated by the OCR pipeline

– matching of the unique identifiers (PPNs, Pica Production Numbers) used for individual digitised works with the ALTO XML file names containing the output of each individual page processed

– provision of metrics (confidence, entropy) relating to the language(s) found on each page
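The first of these steps can be sketched as follows: ALTO stores recognised words in the `CONTENT` attribute of `String` elements, grouped into `TextLine`s. The sample document and the exact ALTO namespace version are assumptions here, not a description of the pipeline's actual output:

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"

# Inline miniature of an ALTO file; the namespace version is an assumption
# about the output of the OCR pipeline.
SAMPLE = f"""<alto xmlns="{ALTO_NS}">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Der"/><SP/><String CONTENT="goldne"/><SP/><String CONTENT="Topf"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

def alto_to_text(xml_string: str) -> str:
    """Join the CONTENT attributes of all String elements, one line per TextLine."""
    root = ET.fromstring(xml_string)
    lines = []
    for line in root.iter(f"{{{ALTO_NS}}}TextLine"):
        words = [s.get("CONTENT", "") for s in line.iter(f"{{{ALTO_NS}}}String")]
        lines.append(" ".join(words))
    return "\n".join(lines)

print(alto_to_text(SAMPLE))  # → Der goldne Topf
```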

Licence:

Creative Commons Attribution 4.0 International

Contact person:

Clemens Neudecker

Zenodo DOI:

https://doi.org/10.5281/zenodo.7716097


☍ Metadata of the works in the digitised collections

Materials:

Metadata is a little-researched resource, which is regrettable: these metadata are of high quality, as they have been created by trained librarians, archivists and other cultural heritage practitioners. The publication of a dataset comprising more than 200,000 lines of metadata therefore aims to make a little-researched, high-quality type of data available in a bundled form. The dataset presented here was generated from the METS files that are created when each individual work is digitised; they were converted into a tabular format.

The dataset consists of a single .csv table (comma-separated values, UTF-8 encoded) containing metadata of all 206,411 works available in the digitised collections of the Berlin State Library (SBB) on 23 January 2023.

Scope:

206,411 metadata records (csv) as a data dump (approx. 216 MB).

Specifics:

The dataset does not contain any annotations beyond the information available in the METS source files.

The only pre-processing step performed is the conversion of the METS datasets into .csv format. This includes the clean-up processes performed by mods4pandas resulting from the conversion of a hierarchical format (METS/MODS) into a tabular format.
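Once downloaded, a table like this can be inspected with nothing but the Python standard library. The column names below (`PPN`, `title`, `year`) are hypothetical; the actual columns are defined by the mods4pandas conversion and should be checked against the published file:

```python
import csv
import io

# Hypothetical two-row excerpt; the real column names come from the
# mods4pandas conversion, so verify them against the published .csv.
SAMPLE = (
    "PPN,title,year\n"
    "PPN123456789,Der goldne Topf,1814\n"
    "PPN987654321,Phantasiestuecke,1815\n"
)

# csv.DictReader yields one dict per metadata record.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(len(rows), rows[0]["title"])  # → 2 Der goldne Topf
```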

Licence:

Creative Commons Attribution 4.0 International

Contact person:

Clemens Neudecker

Zenodo DOI:

https://doi.org/10.5281/zenodo.7716032


☍ Metadata of the “Alter Realkatalog” (ARK) of Staatsbibliothek zu Berlin – Berlin State Library

Materials:

The dataset comprises descriptive metadata of 2,619,397 titles, which together form the “Alter Realkatalog” of the Staatsbibliothek zu Berlin – Berlin State Library. The data are stored in a columnar format and comprise 375 columns. They were downloaded from the German Library Network System (CBS) in December 2023. Examples of tasks that can be tackled with this dataset include studies on the history of books between 1500 and 1955, on the paratextual formatting of scientific books between 1800 and 1955, and on pattern recognition based on bibliographic metadata.

Scope:

2,619,397 metadata records in .parquet format as a data dump (approx. 960 MB).

Specifics:

The dataset does not contain any annotations. The data for the 2.6 million titles were converted from the format available in the CBS into a columnar format, with each field available in the CBS forming a column.

Licence:

Creative Commons Attribution 4.0 International

Contact person:

Clemens Neudecker

Zenodo DOI:

https://doi.org/10.5281/zenodo.12783813


☍ Metadata of the “Verzeichnis der im deutschen Sprachraum erschienenen Drucke”

Materials:

This metadata publication can be regarded as the German national bibliography for the period 1500–1800. It consists of two files: the first contains all bibliographic master records of printed works published in the German-speaking world between 1500 and 1800, as they have been published in the union catalogue K10plus, i.e. the joint database of the Bibliotheksservice-Zentrum Baden-Württemberg (BSZ) and the Verbundzentrale des GBV (VZG). The second file lists the unique identifiers (“Pica Production Number”, PPN) of all master data records available in the union catalogue K10plus for those works that have been digitised, and it contains the links to their digital copies. The aim of this data publication was to provide large, machine-readable datasets consisting exclusively of bibliographic metadata. On the one hand, it is intended to stimulate research and the development of AI applications and, on the other, to provide a basis for a computational analysis of available data on German book history between 1500 and 1800.

The dataset consists of two tables in .parquet format. They contain the metadata of all master data records listed in VD16, VD17 and VD18 that were available in the union catalogue K10plus in February 2025.

Scope:

750,342 master records in the bibliography (“VD-Bib-Metadata.parquet”, ca. 487 MB), 590,528 master records in the list of digitised items (“VD-Digi-Metadata.parquet”, ca. 105 MB).

Specifics:

The dataset does not contain any annotations beyond the information available in the union catalogue K10plus.
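How the two tables relate can be sketched with plain dictionaries: the second file acts as a lookup from PPN to digital copy. The field names (`ppn`, `title`, `link`) and the example values below are assumptions, not the real schema; check the actual parquet files (e.g. with `pandas.read_parquet`) before relying on them:

```python
# Toy records standing in for rows of the two parquet tables; field names
# ("ppn", "title", "link") and values are assumptions, not the real schema.
bib = [   # VD-Bib-Metadata: one master record per printed work
    {"ppn": "PPN100", "title": "Simplicissimus"},
    {"ppn": "PPN200", "title": "Der Messias"},
]
digi = [  # VD-Digi-Metadata: PPNs of digitised works plus links
    {"ppn": "PPN100", "link": "https://example.org/PPN100"},
]

# Join on the shared PPN identifier: which works have a digital copy?
links = {d["ppn"]: d["link"] for d in digi}
for rec in bib:
    rec["link"] = links.get(rec["ppn"])  # None if not digitised

digitised = [r["ppn"] for r in bib if r["link"] is not None]
print(digitised)  # → ['PPN100']
```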

Licence:

Creative Commons Zero v1.0 Universal

Contact person:

Clemens Neudecker

Zenodo DOI:

https://doi.org/10.5281/zenodo.15167938


☍ Colibri: Illustrations in 19th Century Children's and Youth Books

Materials:

This data publication was created with the intent to provide a single annotated computer vision dataset for research purposes and the development of AI applications. It comprises 53,533 illustrations extracted from 3,412 children's and youth books published between 1800 and 1925, accompanied by metadata, annotations and example images for each annotated class.

The primary intention of this data publication was to provide a large, computationally amenable dataset consisting exclusively of image files in order to stimulate research and the development of AI applications. Even in 2025, large (meta-)datasets from the field of historical cultural data remain scarce; the data publication provided here aims to fill that gap. The files are suitable for the computational use of digitised collections according to the “Collections as Data” principles.

Scope:

53,533 JPEG files in 3,412 directories in a ZIP container (“ColibriImagesDataset.zip”, ca. 44.6 GB), 53,533 rows of metadata including annotations in a CSV file (“ColibriMetadataAnnotations.csv”, ca. 22 MB), example images for each annotated class in a PDF file (“ColibriExamplesForTheImagesInTheDataset.pdf”, ca. 10 MB).

Specifics:

The dataset contains annotations that were added to the metadata partly manually and partly automatically.

Licence:

Creative Commons Attribution 4.0 International

Contact person:

Clemens Neudecker

Zenodo DOI:

https://doi.org/10.5281/zenodo.15535927


☍ ZEFYS2025: A German Dataset for Named Entity Recognition and Entity Linking for Historical Newspapers

Materials:

Historical newspaper collections were amongst the first materials to be scanned in order to preserve them for the future. To expand the ways in which specific types of information from digitised newspapers can be searched, explored and analysed, appropriate technologies need to be developed. Named entity recognition (NER) and entity linking (EL) are such information extraction techniques, aiming at recognising, classifying, disambiguating and linking entities that carry a name, in particular proper names. However, large annotated datasets for historical newspapers are still rare. In order to enable the training of machine learning models capable of correctly identifying named entities and linking them to authority files such as Wikidata, we provide a corpus of 100 German-language newspaper pages published between 1837 and 1940. The machine learning task for which this dataset was collected falls into the domain of token classification and, more generally, of natural language processing.

Scope:

100 TSV files in a ZIP container (“ZEFYS2025.zip”, ca. 2.1 MB).

Specifics:

The dataset contains manual annotations of 4,389 entities of the class PER, 6,049 entities of the class LOC, and 3,223 entities of the class ORG.
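A minimal way to tally such annotations is shown below, assuming a CoNLL-style two-column token/tag layout with BIO prefixes; the actual ZEFYS2025 column layout is an assumption and should be checked against one of the TSV files:

```python
from collections import Counter

# Inline stand-in for one of the TSV files; the two-column token/tag
# layout with BIO prefixes is an assumption about the ZEFYS2025 format.
SAMPLE_TSV = "Berlin\tB-LOC\nist\tO\nHauptstadt\tO\nPreussens\tB-LOC\n"

counts = Counter()
for line in SAMPLE_TSV.strip().splitlines():
    token, tag = line.split("\t")
    if tag.startswith("B-"):  # a B- tag marks the first token of an entity
        counts[tag[2:]] += 1

print(dict(counts))  # → {'LOC': 2}
```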

Licence:

Creative Commons Attribution 4.0 International

Contact person:

Clemens Neudecker

Zenodo DOI:

https://doi.org/10.5281/zenodo.15771822


☍ OCR-D-GT-VD-SBB: A Ground Truth Dataset for Optical Character Recognition

Materials:

A ground truth (GT) dataset created within the OCR-D project and consisting of 348 pages extracted from historical documents pertaining to the “Verzeichnis der im deutschen Sprachraum erschienenen Drucke” (VD), all of which have been digitised by Staatsbibliothek zu Berlin – Berlin State Library (SBB).

The data publication consists of 348 .xml files with transcriptions for 348 .tif facsimile image files. The image files pertain to 67 distinct works: from 65 of these, four images each were extracted; from the two remaining works, 49 and 39 images respectively were extracted to create the GT.

The dataset is complemented by a .csv file which contains a mapping between the identifiers used in this dataset and the unique identifiers used in the digitised collections of Staatsbibliothek zu Berlin – Berlin State Library, as well as a filelisting in .csv format.

Scope:

348 PAGE-XML files as well as 348 TIFF files in a ZIP container (“OCR-D-VD-SBB.zip”, ca. 797 MB), a mapping between the directory names used in this dataset and the unique identifiers (PPNs) used in the digitised collections of Staatsbibliothek zu Berlin – Berlin State Library in .csv format (“MappingDirectoryName-PPN.csv”, 8 kB), and a file listing in .csv format (“OCR-D-GT-VD-SBB-filelisting.csv”, 39 kB).

Specifics:

The dataset contains both annotations of text regions and their labels on the respective individual pages and manually transcribed GT with a data-capture accuracy of 99.95% at character level.
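Reading the text regions and their labels out of one such PAGE-XML file can be sketched as below; the namespace date and the exact element nesting are assumptions about the PAGE flavour used in this GT:

```python
import xml.etree.ElementTree as ET

# Inline miniature of a PAGE-XML file; the namespace date and the exact
# nesting are assumptions about the PAGE flavour used in this GT.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page>
    <TextRegion id="r1" type="heading">
      <TextEquiv><Unicode>Vorrede</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2" type="paragraph">
      <TextEquiv><Unicode>Geneigter Leser</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""

root = ET.fromstring(SAMPLE)
# Collect (region label, transcribed text) pairs for every TextRegion.
regions = [
    (r.get("type"), r.findtext(f"{{{PAGE_NS}}}TextEquiv/{{{PAGE_NS}}}Unicode"))
    for r in root.iter(f"{{{PAGE_NS}}}TextRegion")
]
print(regions)  # → [('heading', 'Vorrede'), ('paragraph', 'Geneigter Leser')]
```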

Licence:

Creative Commons Attribution 4.0 International

Contact person:

Clemens Neudecker

Zenodo DOI:

https://doi.org/10.5281/zenodo.17395955


Last update: 06.11.2025
