In my June 2020 post in this series, “Infrastructure for heritage institutions – change of course”, I said:
“The results of both Data Licences and the Data Quality projects (Object PID’s, Controlled Vocabularies, Metadata Set) will go into the new Data Publication project, which will be undertaken in the second half of 2020. This project is aimed at publishing our collection data as open and linked data in various formats via various channels. A more detailed post will be published separately.”
In November 2020 we implemented ARK Persistent Identifiers for the central catalogue of the Library of the University of Amsterdam (see Infrastructure for heritage institutions – ARK PID’s). And now, in May 2021, we present our open and linked data portals:
- The Open Data website with information on datasets, licences, formats, publication channels and links to downloads and harvesting endpoints
- A separate Linked Data portal
Here I will provide some background information on the choices made in the project, the workflow, the features, and the options made possible by publishing the data.
The general approach for publishing collection data is: determine data sources, define datasets, select and apply data licences, determine publication channels, define the applicable data, record and syntax formats, apply transformations to obtain the desired formats, and publish. A mix of expertise types is required: content, technical and communication. In the current project this general approach is implemented in a very pragmatic manner. This means that we haven’t always taken the ideal paths forward, but mainly the best possible options at the time. There will be shortcomings, but we are aware of them and they will be addressed in due course. It is also a learning project.
The Library maintains a number of systems/databases containing collection data, although the intention is to eventually minimize the number of systems and data sources in the context of the Digital Infrastructure programme. The bulk of the collection data is managed in the central Ex Libris Alma catalogue. Besides that there is an ArchivesSpace installation, as well as several Adlib systems and a KOHA installation originating from the adoption of collections of other organisations. Most of these databases will probably be incorporated into the central catalogue in the near future.
In this initial project we have focused on the Alma central catalogue of the University only (and not yet of our partner the Amsterdam University of Applied Sciences).
According to official policy, the Library strives to make its collections as open as possible, including the data and metadata required to access them. For this reason the standard default licence for collection data is the Open Data Commons Public Domain Dedication and License (PDDL), which applies to databases as a whole.

However, there is one important exception. A large part of the metadata records in the central catalogue originates from OCLC WorldCat. This situation is inherited from the good old Dutch national PICA shared catalogue. Of course there is nothing wrong with shared cataloguing, but unfortunately OCLC requires attribution to the WorldCat community, using an Open Data Commons Attribution License (ODC-BY), according to the WorldCat Community Norms. In practice this means you have to be able to distinguish between metadata originating from WorldCat and metadata that does not. On the bright side: OCLC considers referencing a WorldCat URI in the data sufficient attribution in itself. In the open data from the central catalogue canonical WorldCat URI’s are present when applicable, so the required licence is implied.

But on the dark side, especially in the case of linked data (to which the OCLC ODC-BY licence explicitly applies), “the attribution requirement makes reuse and data integration more difficult in linked open data scenarios like WikiData”, as Olaf Janssen of the National Library of the Netherlands noted on Twitter, citing the DBLP Data Licence Change announcement. An attribution licence might make sense if the database is reused as a whole, or if, in the case of the implicit URI reference, full database records are reused. But especially in linked data contexts it is not uncommon to reuse and combine individual data elements or properties, leaving out the URI references. This makes an ODC-BY licence practically unworkable. It is time that OCLC reconsidered their licence policy and adapted it to the modern world.
The central catalogue contains descriptions of over four million items, more than three million of which are books. The rest consists of maps, images, audio, video, sheet music, archaeological objects, museum objects, etc. For various practical reasons it is not feasible to make the full database available as one large dataset. That is why it was decided to split the database into smaller segments and publish datasets of physical objects by material type. A separate dataset was defined for digitised objects (images) published in the Image Repository. Because of the large number of books and manuscripts, for these two material types only datasets of incunabula and letters are published. Other book datasets are available on demand.
In Alma these datasets are defined as “Logical Sets”, which are basically saved queries with dynamic result records. These Logical Sets serve as input for Alma Publishing Profiles, used for creating export files and harvesting endpoints (see below).
Data format: the published datasets only contain public metadata from Alma. Internal business and confidential data are filtered out before publishing. Creator/contributor and subject fields are enriched with URI’s, based on available identifiers from external authority files (Library of Congress Name Authority File and OCLC FAST for more recent records, Dutch National Authors and Subjects Thesaurus for older records). Through these URI’s relations to other global authority files can be established, such as VIAF, Wikidata and AAT. This is especially important for linked data (see below).
If these fields only contain textual descriptions without identifiers, enrichment is not applied. This lack of identifiers is input for the data quality improvement activities currently taking place. Available OCLC numbers are converted to canonical WorldCat URI’s, as mentioned in the Licences section. These data format transformations are performed using Alma Normalization Rules Sets, from within the Publishing Profiles.
Record and syntax formats: currently the datasets are made available in MARC-XML and Dublin Core Unqualified, two of the standard Alma export formats. For linked data formats, see below.
For each Alma Logical Set, two export files are generated once a month and written to a separate server. Two separate Alma Publishing Profiles are needed, one for each output format (MARC-XML and Dublin Core). The file names are generated using the syntax [institution-data source-dataset-format], for instance “uva_alma_maps_marc”, “uva_alma_maps_dc”. Alma automatically adds “_new” and zips the files, so the results are, for instance, “uva_alma_maps_marc_new.tar.gz” and “uva_alma_maps_dc_new.tar.gz”. A shell script moves these export files to a publicly accessible directory on the same server, replacing the existing files in that directory. On the Library Open Data website the links to all (currently twenty) files are published on a static webpage.
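The naming convention and the monthly move can be sketched as follows; this is a Python stand-in for the shell script mentioned above, and the directory paths are of course placeholders:

```python
from pathlib import Path
import shutil

def export_filename(institution: str, source: str, dataset: str, fmt: str) -> str:
    """Build the export file name per the [institution-data source-dataset-format]
    convention, including the "_new" suffix and .tar.gz extension Alma adds."""
    return f"{institution}_{source}_{dataset}_{fmt}_new.tar.gz"

def publish_exports(datasets: list[str], formats: list[str],
                    export_dir: Path, public_dir: Path) -> None:
    """Move each monthly export into the public directory,
    replacing the copy from the previous month."""
    for dataset in datasets:
        for fmt in formats:
            name = export_filename("uva", "alma", dataset, fmt)
            src = export_dir / name
            if src.exists():
                shutil.move(str(src), str(public_dir / name))
```

With ten datasets and two formats this yields the twenty files linked on the static webpage.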
OAI-PMH harvesting endpoints are created using the same Alma Publishing Profiles, one for each output format. The set_spec and set_name are [dataset-format] and [dataset] respectively. The set_spec is used in the Alma system OAI-PMH call.
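Such a call can be sketched as a standard OAI-PMH ListRecords request; the host name and institution code below are hypothetical, the real values are institution specific and listed on the Open Data website:

```python
from urllib.parse import urlencode

# Hypothetical Alma OAI-PMH base URL; substitute the real host and
# institution code.
OAI_BASE = "https://example.alma.exlibrisgroup.com/view/oai/01UVA_INST/request"

def oai_listrecords_url(set_spec: str, metadata_prefix: str = "marc21") -> str:
    """Build an OAI-PMH ListRecords request for one dataset/format
    set_spec, e.g. "maps_marc"."""
    query = urlencode({
        "verb": "ListRecords",
        "set": set_spec,
        "metadataPrefix": metadata_prefix,
    })
    return f"{OAI_BASE}?{query}"
```

A harvester would then follow the resumptionToken in each response to page through the full set.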
The harvesting links for all datasets/formats are also published on the same static webpage.
For the Alma data source Ex Libris provides a number of API’s, both for the Alma backend and for the Alma Discovery/Primo frontend. However, there are some serious limitations in using these. The Alma API’s only expose raw data from the full Alma database: Logical Sets cannot be used, nor can data transformations using Normalization Rules. This means that data can’t be enriched with PID’s and URI’s, non-public data can’t be hidden, and individual datasets can’t be addressed. For our business case this means that the Alma API’s are not useful. Alternatively the Primo API’s could be used, where the display data is enriched with PID’s and URI’s. However, it is again not possible to publish only specific sets or to filter out private data, and the internal local field labels (“lds01”, “lds02”, etc.) can’t be replaced by more meaningful labels. Moreover, all API’s require API keys and are subject to API call limits.
For our business case an alternative API approach is required, either developing and maintaining our own API’s, or using a separate data and/or API platform.
Just like the API’s, the linked data features Ex Libris provides for Alma and Primo are not useful for implementing real linked data (yet). Linked data is essentially a combination of specific formats (RDF) and publication channels (Sparql endpoints, content negotiation). Alma provides specific RDF formats (BIBFRAME, JSON-LD, RDA-RDF) with URI enrichment, but it is not possible to publish the RDF with your own PID-URI’s (in our case ARK’s and Handles); instead, internal system dependent URI’s are used. The Alma RDF formats can be used in the Alma Publishing Profiles to generate downloadable files, and in the Alma API’s. We have already seen that the Alma API’s have serious limitations. Moreover, Ex Libris currently does not support Sparql endpoints and content negotiation, although these features appear to be on the roadmap. It is a pity that I have not been able to implement the Ex Libris Alma and Primo linked data features that ultimately resulted from the first linked data session I helped organise at the 2011 IGeLU annual conference, and from the subsequent establishment of the IGeLU/ELUNA Linked Open Data Working Group, ten years ago.
Anyway, we ended up implementing a separate linked data platform that serves as an API platform at the same time: Triply. To publish the collection data on this new platform, another tool is required for transforming the collection’s MARC data to RDF. For this we currently use Catmandu. We gained experience with both tools during the AdamLink project some years ago.
RDF transformation with Catmandu
Catmandu is a multipurpose data transformation toolset, maintained by an international open source community. It provides import and export modules for a large number of formats, not only RDF. Between import and export the data can be transformed using all kinds of “fix” commands. In our case we depend heavily on the Catmandu MARC modules library and the example fix file MARC2RDF by Patrick Hochstenbach as a starting point.
The full ETL process makes use of the MARC-XML dataset files exported by the Alma Publishing profiles. These MARC-XML files are transformed to RDF using Catmandu, and the resulting RDF files are then imported into the Triply data platform using the Triply API.
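A single conversion step in this ETL process can be sketched as a call to the Catmandu command line tool. The exact importer/exporter options below are assumptions based on the Catmandu::MARC and Catmandu::RDF modules, and the fix file name is a placeholder for the Library’s actual MARC-to-EDM mapping rules:

```python
import subprocess
from pathlib import Path

def catmandu_cmd(fix_file: str) -> list[str]:
    """Command line for one MARC-XML-to-Turtle conversion."""
    return [
        "catmandu", "convert",
        "MARC", "--type", "XML",       # importer: MARC-XML
        "to", "RDF", "--type", "ttl",  # exporter: Turtle
        "--fix", fix_file,             # the MARC-to-RDF mapping rules
    ]

def marc_to_rdf(marc_xml: Path, fix_file: Path, out_ttl: Path) -> None:
    """Run the conversion for one exported dataset file."""
    with marc_xml.open("rb") as src, out_ttl.open("wb") as dst:
        subprocess.run(catmandu_cmd(str(fix_file)), stdin=src,
                       stdout=dst, check=True)
```

The resulting Turtle files are what gets uploaded to Triply via its API.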
The pragmatic approach resulted in the adoption of a simplified version of the Europeana Data Model (EDM) as the local RDF model for the library collection metadata. EDM largely fits the MARC21 record format used in the catalogue for all material types. EDM is based on Qualified Dublin Core. A MARC to DC mapping is used based on the official Library of Congress MARC to DC mapping, adapted to our own situation.
Furthermore, the three original EDM RDF core classes (Provided Cultural Heritage Object, Web Resource, Aggregation) are merged into one, Provided Cultural Heritage Object, with additional subclasses for the individual material types. The Library RDF model description is available from the Open Data website.
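A fragment of such a MARC-to-DC mapping, following the official Library of Congress MARC-to-Dublin-Core crosswalk, might look like the table below; the selection shown is illustrative, not the Library’s complete adapted mapping:

```python
# MARC21 field (optionally with subfield) -> Dublin Core property,
# per the Library of Congress MARC-to-DC crosswalk.
MARC_TO_DC = {
    "245":   "dc:title",        # title statement
    "100":   "dc:creator",      # main entry, personal name
    "700":   "dc:contributor",  # added entry, personal name
    "260$b": "dc:publisher",    # imprint: publisher
    "260$c": "dc:date",         # imprint: date
    "650":   "dc:subject",      # topical subject heading
    "300$a": "dc:format",       # physical description: extent
}
```

In the actual workflow this mapping is expressed as Catmandu fix rules rather than a lookup table, and the resulting DC properties are attached to the merged Provided Cultural Heritage Object class.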
Triply data platform
The Library of the University of Amsterdam Triply platform currently contains each of the ten datasets defined in Alma, as well as one combined Catalogue dataset for the nine physical material type datasets. Sparql and ElasticSearch endpoints are defined for this Catalogue dataset and for the Image Repository dataset only.
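Querying such a Sparql endpoint is plain SPARQL over HTTP. The endpoint URL below is a placeholder (the real one is listed on the platform), and the query only assumes Dublin Core titles are present in the data:

```python
import urllib.parse
import urllib.request

# Placeholder endpoint URL; the real one is published on the Triply platform.
SPARQL_ENDPOINT = "https://api.example.org/datasets/uva/catalogue/sparql"

QUERY = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?item ?title WHERE { ?item dc:title ?title } LIMIT 10
"""

def sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL-over-HTTP POST request asking for JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode()
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
```

Sending the request with `urllib.request.urlopen` returns the standard SPARQL JSON results format.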
Content negotiation is the differentiated resolution of a PID-URI to different targets based on requested response formats. This way one PID-URI for a specific item can lead to different representations of the item, for instance a record display for human consumption in a web interface, or a data representation for machine readable interfaces. The Triply API supports a number of response formats (such as Turtle, JSON, JSON-LD, N3 etc.), both in HTTP headers and as HTTP URL parameters.
We have implemented content negotiation for our ARK PID’s as simple redirect rules to these Triply API response formats.
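The redirect logic amounts to mapping a requested media type to a Triply API target. This is a minimal sketch of that idea; the domain, URL patterns and parameter name are illustrative assumptions, not the actual redirect rules:

```python
# Requested media type -> response format short name.
FORMAT_TARGETS = {
    "text/turtle": "ttl",
    "application/ld+json": "jsonld",
    "application/n-triples": "nt",
}

def resolve_ark(ark: str, accept: str) -> str:
    """Resolve one ARK PID-URI to a target based on the Accept header."""
    fmt = FORMAT_TARGETS.get(accept)
    if fmt is None:
        # Default: human-readable record display in the web interface.
        return f"https://example.org/catalogue/{ark}"
    # Machine-readable: redirect to the data platform API.
    return f"https://example.org/api/{ark}?format={fmt}"
```

One and the same PID-URI thus serves both human readers and machine clients, which is the whole point of content negotiation.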
Data publication workflows can differ greatly in composition and maintenance, depending on the underlying systems, metadata quality and infrastructure. The extent of dependency on specific systems is an important factor.
For the central catalogue certain required actions and transformations are performed using internal Alma system utilities: Logical Sets, Publishing Profiles, Normalization Rules, harvesting endpoints. This way basic transformations and publication channels are implemented with system specific utilities.
The more complicated transformations and publication channels (linked data, API’s, etc.) are implemented using generic external tools. In time, it might become possible to implement all data publication features with the Ex Libris toolset. By that time the Library should have decided on its strategic position in this matter: implement open data as much as possible with the features of the source systems in use, or be as system independent as possible? Depending fully on system features makes overall maintenance easier, but in the case of a system migration everything has to be developed from scratch again. Depending on generic external utilities means developing everything from scratch yourself, but in the case of a system migration most of the externally implemented functionality will continue working.
After delivering these first open data utilities, the time has come for evaluation, improvement and expansion. Shortcomings, as mentioned above, can be identified based on user feedback and analysis of usage statistics. Datasets and formats can be added or updated, based on user requests and communication with target audiences. New data sources can be published, with appropriate workflows, datasets and formats. The current workflow can be evaluated internally and adapted if needed. The experiences with the data publication project will also have a positive influence on the internal digital infrastructure, data quality and data flows of the library. Last but not least, time will tell whether, and in what expected and unexpected ways, the library’s open collection data will be used.