In the Digital Infrastructure program at the Library of the University of Amsterdam we have reached a first milestone. In my previous post in the Infrastructure for heritage institutions series, “Change of course”, I mentioned the coming implementation of ARK persistent identifiers for our collection objects. Since November 3, 2020, ARK PID’s have been available for our university library Alma catalogue through the Primo user interface. Implementation of ARK PID’s for the other collection description systems will follow in due course.
The new ARK system will coexist with the two other PID systems we already have in place: Handle for the PURE/DARE institutional scholarly output repository and the Allard Pierson image repository, and DataCite DOI for the institutional figshare research datasets.
First of all, it is good to remember the hybrid, dual function of persistent identifiers:
- uniquely identify a specific object (the actual PID)
- provide persistent availability information for that object on the web (the PID-URI)
Handle or ARK
Among the available PID systems (as described in my article “Persistent identifiers for heritage objects”) the final choice was between Handle and ARK. It would seem logical to select Handle because we already maintain two Handle subsystems, but this option has a few disadvantages, depending on configuration choices, most importantly in the way the PID’s are constructed and managed.
A PID always consists of at least an indication of the PID system, a code identifying the assigning institution and the actual unique identifier within that context, optionally including a code for the “assignment stream” to differentiate between possible multiple collections, organizational units, sources, etc. For the actual identifier strings within the institutional PID namespace, two options are available:
- minting independent unique identifier strings
- using existing internal unique system identifiers
The internal system identifiers are made globally unique by the PID system and institutional context, just like the minted identifiers. Because the workflow of minting independent identifier strings for Alma is complicated and error-prone, we decided to use the internal Alma identifiers. The same method is already in use for the institutional repository. This approach requires that the internal system identifiers are stable and robust, and that the current system identifiers can be migrated to future system environments where they can be accessed directly. Such migrations have already happened a couple of times with the institutional repository: the current PURE environment holds three types of Handles, two based on system identifiers from two previous systems, and the new ones based on UUID’s generated in PURE.
In the case of PID’s based on existing identifiers there is no need for minting new PID’s and storing them in a mapping table for resolving, or storing them in the cataloguing system for publishing (although that still would be the preferred solution). PID-URI’s are simply formed based on a template consisting of a base URL, including the institutional ID and the prefix identifying the source or assignment stream, + a placeholder for the internal identifier.
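The template idea can be illustrated with a minimal Python sketch. It is not the code we run; the class name, the function name and the example values (pid.example.org, NAAN 12345, shoulder x1) are placeholders chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssignmentStream:
    """One combination of base URL, institutional ID and stream prefix (all values hypothetical)."""
    base_url: str   # web address of the PID redirection server
    naan: str       # institutional ID within the PID system
    shoulder: str   # prefix identifying the source or assignment stream

    def pid(self, internal_id: str) -> str:
        # The actual PID: PID system label + institution + stream prefix + internal identifier
        return f"ark:/{self.naan}/{self.shoulder}{internal_id}"

    def pid_uri(self, internal_id: str) -> str:
        # The actionable PID-URI: base URL + PID
        return f"{self.base_url.rstrip('/')}/{self.pid(internal_id)}"


# Placeholder stream, used only to show the template at work
catalogue = AssignmentStream("https://pid.example.org", "12345", "x1")
print(catalogue.pid_uri("000123"))
```

Keeping the template in a single definition like this is one way to make sure every publication channel produces identical PID-URI’s.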
If the PID’s are not stored in the cataloguing system, then the template must be implemented in the system’s user interface in order to publish and present the PID-URI’s. The same procedure must be implemented in all other data publication channels, like OAI, API’s and download options. In order to avoid maintenance errors, the idea is to minimize the number of places to implement this template procedure. If the PID is stored in the source system’s metadata, then the whole procedure can be omitted.
For resolving template-based PID-URI’s, the PID redirection web server maps the incoming PID-URI to the target according to the template: it replaces the URL with the target system’s URL syntax for retrieving a single record and inserts the internal system identifier, taken from the PID’s actual identifier part, in the correct location. If it turns out in a future system migration that the old system identifiers can’t be used after all, Plan B is to store the existing PID’s in a newly created mapping table and switch to the minted identifier method, resolving PID-URI’s by reading the mapping table.
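The relation between template-based resolving and the Plan B mapping table can be sketched as follows. This is a simplified Python illustration, not our web server configuration; the prefix, the target URL syntax and the mapping table entry are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical values, for illustration only
PID_PREFIX = "ark:/12345/x1"
TARGET_TEMPLATE = "https://newsystem.example.org/record/{internal_id}"

# Plan B: filled only for PID's whose old internal identifier
# can no longer be used directly in the target system
MAPPING_TABLE: Dict[str, str] = {
    "ark:/12345/x1OLD001": "https://newsystem.example.org/record/NEW-42",
}


def resolve(pid: str) -> Optional[str]:
    """Return the target URL for a PID, or None if the PID is not recognised."""
    # An explicit mapping table entry takes precedence over the template
    if pid in MAPPING_TABLE:
        return MAPPING_TABLE[pid]
    # Template resolving: insert the internal identifier into the target URL syntax
    if pid.startswith(PID_PREFIX):
        internal_id = pid[len(PID_PREFIX):]
        if internal_id:
            return TARGET_TEMPLATE.format(internal_id=internal_id)
    return None


print(resolve("ark:/12345/x1ABC"))
print(resolve("ark:/12345/x1OLD001"))
```

In this sketch the published PID’s stay the same in both situations; only the way the redirection server finds the target changes.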
Back to the Handle or ARK question. In the case of Handle, a separate full local Handle server installation with additional add-on software is always necessary, even for template handles used for PID’s based on internal system identifiers. The template handle configuration in itself is quite complex as well. For ARK, in this case a simple web server configuration with redirects for each combination of institution/assignment stream is sufficient. No dedicated ARK software is necessary.
Moreover, ARK registration is completely free, while Handle charges a fee for each institutional prefix.
In the end, the ease of implementation and maintenance in combination with the method chosen turned the scales in favour of ARK.
An ARK consists of the ARK label “ark:/”, a Name Assigning Authority Number (NAAN) for the institution assigning the ARK, a unique string within the ARK/NAAN namespace (“name”) and an optional qualifier, for specific versions or representations. The “name” can be prefixed with a “shoulder” indicating the “assignment stream”. This part implements the unique identifier function (PID). The ARK is prefixed with a base URL (NMA – Name Mapping Authority) to make it actionable on the web. This implements the web availability function (PID-URI).
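To make the anatomy concrete, here is a small Python sketch that splits an ARK PID-URI into the parts described above. The regular expression is a simplification made for this post, not the full grammar of the ARK specification, and the names Ark, parse_ark and strip_shoulder are my own. Note that a shoulder is a convention of the assigning institution and cannot be recognised from the string alone, so it has to be passed in explicitly.

```python
import re
from typing import NamedTuple, Optional


class Ark(NamedTuple):
    nma: Optional[str]        # base URL (Name Mapping Authority), if present
    naan: str                 # Name Assigning Authority Number
    name: str                 # unique string, including the shoulder if one is used
    qualifier: Optional[str]  # optional version or representation part


# Simplified pattern: optional base URL, "ark:/", NAAN, name, optional qualifier
ARK_PATTERN = re.compile(
    r"^(?:(?P<nma>https?://[^/]+)/)?"
    r"ark:/?(?P<naan>[^/]+)/"
    r"(?P<name>[^/.?#]+)"
    r"(?P<qualifier>[/.][^?#]*)?$"
)


def parse_ark(text: str) -> Ark:
    match = ARK_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not an ARK: {text!r}")
    return Ark(match["nma"], match["naan"], match["name"], match["qualifier"])


def strip_shoulder(name: str, shoulder: str) -> str:
    """Remove a known shoulder from the name, leaving the remaining identifier."""
    if not name.startswith(shoulder):
        raise ValueError(f"{name!r} does not start with shoulder {shoulder!r}")
    return name[len(shoulder):]


ark = parse_ark("https://pid.uba.uva.nl/ark:/88238/b1990020797420205131")
print(ark)
print(strip_shoulder(ark.name, "b1"))
```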
In our case we installed an Apache web server on the hostname pid.uba.uva.nl, with simple template-based redirect configurations for each combination of NAAN/shoulder. ARK PID’s can be resolved directly in the local web server environment using the local base URL/namespace, without an intermediate global redirecting server, as is usually the case with Handle (https://hdl.handle.net). However, this configuration is also still an option for ARK, by means of the global resolving/redirection server http://n2t.net, if the local base URL is correctly entered in the registration for the NAAN in question. The n2t.net server also resolves a large number of other PID’s, such as Handle and DOI (using labels like hdl:/, doi:/, etc.). The base URL for the ARK PID-URI’s of the Library of the University of Amsterdam, https://pid.uba.uva.nl, can be replaced with http://n2t.net at all times:
- https://pid.uba.uva.nl/ark:/88238/b1990020797420205131
- http://n2t.net/ark:/88238/b1990020797420205131
Here “88238” is the NAAN for the Library of the University of Amsterdam, and “b1” is the shoulder for the Alma assignment stream. The string “ark:/88238/b1990020797420205131” is the actual PID.
Both actionable PID-URI’s resolve to:
https://lib.uva.nl/discovery/fulldisplay?vid=31UKB_UAM1_INST:UVA&docid=alma990020797420205131
In the web server the first part of the URL up to and including the “b1” shoulder is replaced by the Primo URL + syntax “https://lib.uva.nl/discovery/fulldisplay?vid=31UKB_UAM1_INST:UVA&docid=alma“.
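The effect of this redirect can be emulated in a few lines of Python. This is a sketch of the behaviour only, not our Apache configuration, and the forwarding step for n2t.net is a simplified assumption about how the global resolver hands an ARK over to the base URL registered for the NAAN. The function names are my own.

```python
import re
from typing import Optional

LOCAL_BASE = "https://pid.uba.uva.nl"
GLOBAL_BASE = "http://n2t.net"
ARK_PREFIX = "/ark:/88238/b1"
PRIMO_PREFIX = "https://lib.uva.nl/discovery/fulldisplay?vid=31UKB_UAM1_INST:UVA&docid=alma"

# Local rule: everything up to and including the "b1" shoulder is replaced
LOCAL_RULE = re.compile(
    "^" + re.escape(LOCAL_BASE + ARK_PREFIX) + r"(?P<alma_id>[^/?#]+)$"
)


def forward_from_global(pid_uri: str) -> str:
    """Assumed behaviour of the global resolver: swap its base URL for the local one."""
    if pid_uri.startswith(GLOBAL_BASE + "/ark:/88238/"):
        return LOCAL_BASE + pid_uri[len(GLOBAL_BASE):]
    return pid_uri


def redirect_target(pid_uri: str) -> Optional[str]:
    """Return the Primo URL for an ARK PID-URI, or None if no rule matches."""
    match = LOCAL_RULE.match(forward_from_global(pid_uri))
    return PRIMO_PREFIX + match["alma_id"] if match else None


for uri in (
    "https://pid.uba.uva.nl/ark:/88238/b1990020797420205131",
    "http://n2t.net/ark:/88238/b1990020797420205131",
):
    print(redirect_target(uri))
```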
On the Alma/Primo user interface side, the ARK PID-URI for the object described in the displayed record is generated on the fly using internal “Alma normalization rules for display”. If a handle is available in the metadata (for items from the institutional and image repository), this handle is displayed as “Persistent identifier”; in all other cases the ARK PID-URI is constructed by prefixing the internal Alma system identifier with “https://pid.uba.uva.nl/ark:/88238/b1”.
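The display logic amounts to a simple choice, sketched here in Python. The actual implementation uses Alma normalization rules, not Python; the function name and the example handle are hypothetical.

```python
from typing import Optional

ARK_URI_PREFIX = "https://pid.uba.uva.nl/ark:/88238/b1"


def persistent_identifier(alma_id: str, handle: Optional[str] = None) -> str:
    """Pick the value to display as "Persistent identifier" for a record."""
    # A handle in the metadata wins; otherwise build the ARK PID-URI on the fly
    if handle:
        return handle
    return ARK_URI_PREFIX + alma_id


# Record without a handle: the ARK PID-URI is generated
print(persistent_identifier("990020797420205131"))
# Record with a handle (hypothetical value): the handle is shown unchanged
print(persistent_identifier("990020797420205131", "https://hdl.handle.net/PREFIX/SUFFIX"))
```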
Ideally institutions only assign persistent identifiers to their own unique (or semi-unique) collection objects. Usually, at first these PID’s are assigned retroactively in batch, using an automated tool, after which PID’s are assigned individually for each new object added and catalogued. If an institution uses separate cataloguing environments for “PID worthy” and “non PID worthy” objects, then there is no problem. However, if only one cataloguing environment is used for all types of collection objects, it is necessary to be able to differentiate between unique and non-unique items based on the available metadata, in order to be able to automatically assign PID’s to PID worthy objects only.
At the Library of the University of Amsterdam, the Alma system is used for all material types: books, journals, images, museum objects, archival items, etcetera. Unfortunately, because of long standing work procedures and standard cataloguing profiles, it is not really possible to make a sharp distinction between all PID worthy and non PID worthy objects in the database. There is no indication of “uniqueness” or something similar in the metadata. Even a combination of specific selection criteria, such as material type + date + location is not completely conclusive. A book from before 1850 might very well be a unique item, but it does not have to be. A poster located in the museum might be unique, but it probably is not. Assessing all objects in the catalogue individually would probably be a matter of decades.
We have opted for a pragmatic approach, publishing ARK PID’s for all objects described in our local Alma catalogue system. In this way our ARK PID’s function both as real unique identifiers for PID worthy objects, and as truly persistent web links for all types of objects, replacing the default, not persistent, system and instance dependent Primo permalinks.
Another issue brought about by existing workflows is the fact that newly created digital representations of existing physical collection objects have been and will be assigned handles (as mentioned above). The corresponding physical source objects are assigned the new ARK PID’s in Alma/Primo. Both the physical object and its digital representation have their own record in the Alma catalogue, one identified by an ARK, the other by a handle, without a direct relation available in the metadata. Fortunately, most of these pairs are displayed as a cluster in Primo, so the implicit relationship is visible. And in the corresponding record in the image repository the links to the original records are displayed in the form of the new ARK PID’s.
The practice of assigning separate PID’s to digital representations of an object is not incorrect as such, but it would be better for the usability of the data if the relation between them were made explicit. It is probably possible to fix this omission retroactively.
There are still a couple of other collections catalogued in other systems that will have to get persistent identifiers. These will also be handled in a pragmatic way, depending on the options of the systems, the available metadata, etcetera.
At the moment we do not use ARK qualifiers. We will look into this matter when we investigate the options for content negotiation in the context of linked data in the very near future.
All in all, the implementation of persistent identifiers for all collection objects of the Library is a big step towards a more efficient and usable digital infrastructure, both internally and externally. One of the next steps will be the publication of linked open usable collection data, for which persistent identifiers are essential.
Without the discussions in the Digital Infrastructure Team of the Library of the University of Amsterdam/University of Applied Sciences Amsterdam (consisting of metadata specialists, IT staff, project coordinators and information specialists) and other colleagues, it would not have been possible to reach this milestone.
I would like to thank Herbert Van de Sompel for the exchange of ideas leading to the theoretical, philosophical and practical foundations of this important infrastructural step, described in the article mentioned above.