Infrastructure for heritage institutions – first results

Permalink: https://purl.org/cpl/3017


In July 2019 I published the post Infrastructure for heritage institutions, in which I described our plans to realise a “coherent and future proof digital infrastructure” for the Library of the University of Amsterdam. Time to look back: how far have we come? And time to look forward: what’s in store for the near future?

Ongoing activities

I mentioned three “currently ongoing activities”: 

  • Monitoring and advising on infrastructural aspects of new projects
  • Maintaining a structured dynamic overview of the current systems and dataflow environment
  • Communication about the principles, objectives and status of the programme

So, what is the status of these ongoing activities?

Monitoring and advising

We have established a small dedicated “governance” team that is charged with assessing, advising on and monitoring large and small projects that impact the digital infrastructure, and with creating awareness among stakeholders about the role of the larger core infrastructure team. The person managing the institutional project portfolio has agreed to take on the role of governance team coordinator, which is a perfect combination of responsibilities.

Dynamic overview

Until now we have had a number of unrelated instruments for describing infrastructural components and their relations, each with different objectives. The two main ones are a huge static diagram that tries to capture all internal and external systems and relationships without detailed specifications, and the dynamic DataMap repository, which describes all dataflows between systems and datastores. The latter uses a home-made extended version of the Data Flow Diagram (DFD) methodology, as described in an earlier post, Analysing library data flows for efficient innovation (see also my ELAG 2015 presentation Datamazed). In that post I already mentioned Archimate as a possible way forward, and this is exactly what we are going to do now. DFD is fine for describing dataflows, but not for documenting the entire digital infrastructure, including digital objects, protocols, etc. Archimate version 3.1 can be used for digital and physical structures as well as for data, application and business structures. We are currently deciding on the templates and patterns to use (Archimate is very flexible and can be used in many different ways). The plan is to collaborate with the central university architecture community and document our infrastructure in the tool that they are already using.
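As a toy illustration of the kind of mapping we are considering, the sketch below represents one DataMap-style dataflow in terms of Archimate-like application components and a flow relationship. The element types, system names and fields are illustrative only, not our actual templates or patterns.

```python
# Hypothetical sketch: one DataMap-style dataflow expressed as Archimate-like
# concepts (two application components connected by a flow relationship).
from dataclasses import dataclass

@dataclass
class ApplicationComponent:
    name: str

@dataclass
class FlowRelationship:
    source: ApplicationComponent
    target: ApplicationComponent
    payload: str      # what is exchanged, e.g. "MARC21 bibliographic records"
    protocol: str     # how it is exchanged, e.g. "OAI-PMH"

alma = ApplicationComponent("Alma")
discovery = ApplicationComponent("Discovery layer")

flow = FlowRelationship(alma, discovery, "MARC21 bibliographic records", "OAI-PMH")
print(f"{flow.source.name} --[{flow.payload} via {flow.protocol}]--> {flow.target.name}")
```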

Communication

This series of posts is one of the ways we communicate about the programme externally. For internal communication we have set up a section on the university library intranet.


Projects

I mentioned thirteen short-term projects. How are they coming along? For all projects we are adopting a pragmatic approach: use what is already available, set realistic short-term goals, and avoid overly complicated solutions.

Object PIDs

I did some research into persistent identifiers (PIDs) and documented my findings in an internal memo. It consists of a general theoretical description of PIDs (what they are, their administration and use, a characterisation of existing PID systems, the object types PIDs can be assigned to, and linked data requirements), and a practical part describing current practices, the pros and cons of existing PID systems, a list of requirements, practical considerations and recommendations. A generic English version of this document has been published in Code4Lib Journal issue 47 under the title “Persistent identifiers for heritage objects”.

In January 2020 we started testing the different scenarios for implementing PIDs.
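To give an idea of what one such scenario might look like, here is a minimal sketch of minting an opaque identifier and turning it into a resolvable URI via a resolver under our own control. The resolver base URL, the prefix and the identifier scheme are purely illustrative assumptions, not a chosen solution.

```python
# Hypothetical PID scenario: mint an opaque identifier and expose it as a
# resolvable URI. Base URL and prefix are placeholders, not a real service.
import uuid

RESOLVER_BASE = "https://hdl.example.org"   # placeholder resolver
PREFIX = "12345"                            # placeholder naming authority / prefix

def mint_pid() -> str:
    """Mint an opaque, meaningless identifier (no semantics that can go stale)."""
    return f"{PREFIX}/{uuid.uuid4()}"

def pid_to_uri(pid: str) -> str:
    """Turn a PID into the persistent URI that the resolver redirects to the object."""
    return f"{RESOLVER_BASE}/{pid}"

pid = mint_pid()
print(pid_to_uri(pid))   # e.g. https://hdl.example.org/12345/0f8c...
```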

Object platform/Digital objects/IIIF

The library is currently carrying out an exploratory study into options for a digital object platform. There have been conversations with a number of institutions similar to the university library (university libraries, archives, museums) about their existing and future solutions. There will also be discussions with vendors, among them Ex Libris, the supplier of our current central collection management platform Alma. The study will result in a recommendation in the first half of 2020, after which an implementation project will be started.

The Digital Objects and IIIF topics are part of this comprehensive project, and obviously Alma is considered a candidate. The library has already developed an IIIF test environment as a separate pilot project.
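As an illustration of what an IIIF environment exposes to clients, here is a small sketch based on the standard IIIF Image API URL pattern. The server URL and object identifier are placeholders for our pilot; only the region/size/rotation/quality.format pattern comes from the Image API specification.

```python
# Sketch of addressing an IIIF image server via the Image API URL pattern.
IIIF_BASE = "https://iiif.example.uva.nl/iiif"   # hypothetical image server
identifier = "map-atlas-001"                     # hypothetical object identifier

def iiif_image_url(identifier: str, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg") -> str:
    return f"{IIIF_BASE}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

def iiif_info_url(identifier: str) -> str:
    # info.json describes the image (sizes, tiles, profiles) to IIIF viewers
    return f"{IIIF_BASE}/{identifier}/info.json"

print(iiif_image_url(identifier))                    # full-size image
print(iiif_image_url(identifier, size="!400,400"))   # thumbnail fitting 400x400
print(iiif_info_url(identifier))
```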

Licensing

We are taking the first steps in setting up a dedicated team for deciding on default standard licences and regulations for collections, metadata and digital objects, per type where relevant. Furthermore, the team will assess dedicated licences and regulations in cases where the default ones do not apply. We are currently thinking along the lines of the Public Domain Mark or Creative Commons CC0 for content that is not subject to copyright, CC-BY or CC-BY-SA for copyrighted content, and rightsstatements.org for content for which the copyright status is unclear.

For metadata the corresponding Open Data Commons licences are being considered. For the part of the metadata in our central cataloguing system Alma that originates in WorldCat, OCLC recommends applying an ODC-BY licence, in accordance with the OCLC WorldCat Rights and Responsibilities. For the remaining metadata we are considering a public domain mark or ODC-BY.

If feasible, the assigned licences and regulations for objects may be added to the metadata of the relevant digital objects in the collection management systems, both as text and in machine-readable form. In any case, the licences and regulations will be published in all online end user interfaces and in all machine/application interfaces.
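As a rough sketch of what “machine-readable” could mean here, the snippet below embeds a licence URI and a human-readable rights statement in a schema.org JSON-LD record. The property choices and example values are assumptions for illustration only, not a decided metadata profile.

```python
# Illustrative only: licence information added to object metadata as both a
# machine-readable URI and a human-readable statement (schema.org JSON-LD).
import json

record = {
    "@context": "https://schema.org",
    "@type": "CreativeWork",
    "identifier": "https://hdl.example.org/12345/abc",   # placeholder PID
    "name": "Example digitised map",
    # machine-readable licence: a dereferenceable URI
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    # human-readable statement shown in end user interfaces
    "usageInfo": "Public domain (CC0 1.0); no known copyright restrictions.",
}

print(json.dumps(record, indent=2))
```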

Metadata set/Controlled vocabularies

Both defining the standard minimum required metadata for various use cases and selecting and implementing controlled vocabularies/authority files are aspects of data quality assurance. Both issues will be addressed simultaneously.

Defining the metadata sets required for the various use cases and target audiences is a long-term process, which will have to be carried out in smaller batches focused on specific audiences and use cases. Moreover, because of the large number of catalogued objects it is practically impossible to extend and enrich the metadata for all objects manually. New items are catalogued using the RDA Core Elements, which define the minimum elements required for describing resources by type. There is also a huge base of legacy metadata records with many non-standard descriptions. Hopefully automated tools can be employed in the future to improve and extend metadata for specific use cases. This will be explored in the Data enrichment and Georeference projects.
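Purely as an illustration of how a minimum metadata set could eventually be checked automatically, the sketch below validates a record against a hypothetical list of required elements. The element names are placeholders, not our agreed RDA-based profile.

```python
# Illustrative sketch: check a record against a hypothetical minimum element set.
MINIMUM_ELEMENTS = {"title", "creator", "date", "resource_type", "identifier"}

def missing_elements(record: dict) -> set[str]:
    """Return the required elements that are absent or empty in a record."""
    return {e for e in MINIMUM_ELEMENTS if not record.get(e)}

record = {"title": "Example atlas", "creator": "Unknown", "date": "1870"}
print(missing_elements(record))   # {'identifier', 'resource_type'} (in some order)
```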

Regarding the controlled vocabularies, by contrast, there are short-term practical solutions available. Libraries have been using authority files for cataloguing for a long time, especially for people and organisations (creators, contributors) and subjects. In most cases, besides the string values, the identifiers of the terms in the authority files have also been recorded in our cataloguing records. In the past we used national authority files for The Netherlands; currently we are using international authority files: the Library of Congress Name Authority File and FAST. Fortunately, all these authority files have been published on the web as open and linked data, with persistent URIs for each term. This means that we can dynamically construct and publish these persistent URIs through human- and machine-readable interfaces for all vocabulary terms that we have registered. We are currently testing the options.
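A minimal sketch of what that dynamic construction could look like, using the published linked data URI patterns of these authority files as we understand them; the example identifiers are illustrative.

```python
# Construct persistent URIs from authority identifiers already in our records.
def lcnaf_uri(lccn: str) -> str:
    """Library of Congress Name Authority File, e.g. 'n79021164'."""
    return f"https://id.loc.gov/authorities/names/{lccn}"

def fast_uri(fast_id: str) -> str:
    """OCLC FAST heading; identifiers are often recorded as 'fst' + zero-padded number."""
    number = fast_id.removeprefix("fst").lstrip("0")
    return f"http://id.worldcat.org/fast/{number}"

print(lcnaf_uri("n79021164"))
print(fast_uri("fst01919741"))
```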

Data enrichment/Georeference

The Data enrichment and Georeference projects are closely related to the Open Maps pilot, in which a set of digitised maps from a unique 19th century atlas serves as a practical test bed and implementation case for the Digital Infrastructure programme. As such, these projects do not contribute to the improvement of the digital infrastructure in the narrow sense; however, they demonstrate the extended possibilities of such an improved digital infrastructure. Both projects are directly related to all other projects defined in the programme, and offer valuable input for them.

Essentially, both projects are aimed at creating additional object metadata on top of the basic metadata set, derived from the objects themselves and targeted at specific audiences.

An initial draft action plan was created for both projects, to be executed simultaneously in collaboration with a university digital humanities group and the central university ICT department. For the Data enrichment project the idea is to use OCR, text mining and named entity recognition methods to derive valuable metadata from the various types of text printed on maps. The Georeference project is targeted at obtaining various georeferences for the maps themselves and for selected items on the maps. All new data should have resolvable identifiers/URIs so that they can be used as linked data.
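To make the Data enrichment idea a bit more concrete, here is an exploratory sketch that runs named entity recognition over a snippet of OCR output and keeps the place-like entities as candidate metadata to georeference. The spaCy model, the sample text and the entity labels are assumptions; the actual pipeline still has to be designed with the digital humanities group.

```python
# Exploratory sketch: extract candidate place names from OCR output of a map.
import spacy

nlp = spacy.load("nl_core_news_sm")   # assumed Dutch model; any spaCy model with NER works

ocr_text = "Kaart van de provincie Noord-Holland, met Amsterdam en Haarlem, 1870."

doc = nlp(ocr_text)
# Which entity labels count as "place" depends on the model used.
place_candidates = [ent.text for ent in doc.ents if ent.label_ in {"GPE", "LOC"}]
print(place_candidates)   # candidate place names to georeference and add as metadata
```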

Other projects

The remaining projects (ETL Hub, Digitisation Workflow, Access/Reuse, Linked Data) are dependent on the other activities carried out in the programme.

An Extract-Transform-Load platform for streamlining data flows and data conversions can only be implemented effectively when a more or less complete overview of the system and dataflow environment is available, and the extent of Alma’s role as central data hub has become clear. Moreover, the standardisation of the basic metadata set, controlled vocabularies and persistent identifiers is required. In the end it could also turn out that an ETL Hub is not necessary at all.
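Purely to illustrate the kind of step such a hub would streamline, here is a conceptual extract-transform-load sketch. The system names, fields and mapping are placeholders, and whether a dedicated hub is needed at all remains an open question.

```python
# Conceptual ETL sketch: pull records from a source, map them to a target schema, push them on.
def extract() -> list[dict]:
    """Pretend to pull records from a source system (e.g. an export from the catalogue)."""
    return [{"mms_id": "991234", "title": "Example atlas ", "lang": "dut"}]

def transform(records: list[dict]) -> list[dict]:
    """Map source fields to the target schema and normalise values."""
    lang_map = {"dut": "nl"}
    return [
        {"id": r["mms_id"], "title": r["title"].strip(), "language": lang_map.get(r["lang"], r["lang"])}
        for r in records
    ]

def load(records: list[dict]) -> None:
    """Pretend to push the transformed records to a target system."""
    for r in records:
        print("loading", r)

load(transform(extract()))
```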

The Digitisation Workflow can only be defined when a Digital Object Platform is up and running, and digital object formats are sorted out. It is also dependent on functioning PID and licence workflows and on established metadata sets and controlled vocabularies.

Access and Reuse of metadata and digital objects depends on the availability of a Digital Object Platform, standardised metadata sets, controlled vocabularies, PIDs and licence policies.

Last but not least, linked data can only be published once PIDs as linked data URIs, open licences, standardised metadata sets and controlled vocabularies with URIs are in place. For linked data an ETL Hub might be required.
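As a closing sketch of how these pieces could come together, the snippet below combines a placeholder PID, a licence URI and a FAST subject URI into a small RDF graph using rdflib. It is an illustration of the idea, not the eventual linked data model.

```python
# Sketch: a PID as subject URI, plus licence and controlled vocabulary URIs, as RDF.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
obj = URIRef("https://hdl.example.org/12345/abc")          # placeholder PID

g.add((obj, DCTERMS.title, Literal("Example digitised map", lang="en")))
g.add((obj, DCTERMS.subject, URIRef("http://id.worldcat.org/fast/1919741")))                 # FAST term
g.add((obj, DCTERMS.license, URIRef("https://creativecommons.org/publicdomain/zero/1.0/")))  # licence URI

print(g.serialize(format="turtle"))
```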