Ten years linked open data

This post is the English translation of my original article in Dutch, published in META (2016-3), the Flemish journal for information professionals.

Ten years after the term “linked data” was introduced by Tim Berners-Lee it appears to be time to take stock of the impact of linked data for libraries and other heritage institutions in the past and in the future. I will do this from a personal historical perspective, as a library technology professional, systems and database designer, data infrastructure specialist, social scientist, internet citizen and information consumer.

Linked data is a set of universal methods for connecting information from multiple web sources in order to generate enriched or new information and to prevent redundancy and ambiguity. This is achieved by describing information as “triples” (relationships between two objects) in RDF (Resource Description Framework), in which both objects and relationships are represented as URIs (Uniform Resource Identifiers) that point to their definitions on the web. An object’s type and attributes can also be represented as triples.
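As a minimal sketch of what such triples look like, the Python snippet below describes a play with three statements and prints them as N-Triples, one of the standard plain-text RDF serialisations. The `example.org` identifiers and the helper names (`Triple`, `to_ntriples`) are assumptions made up for illustration; only the `rdf:type` and Dublin Core predicate URIs are real vocabulary terms.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str    # URI of the thing being described
    predicate: str  # URI of the relationship or attribute
    obj: str        # URI of another thing, or a plain literal value


def is_uri(value: str) -> bool:
    """Crude check for this sketch only: treat http(s) strings as URIs."""
    return value.startswith(("http://", "https://"))


def to_ntriples(triple: Triple) -> str:
    """Serialise one triple as a single N-Triples line."""
    def term(value: str) -> str:
        if is_uri(value):
            return f"<{value}>"
        # Simplification: only backslashes and double quotes are escaped here.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    return f"{term(triple.subject)} {term(triple.predicate)} {term(triple.obj)} ."


# Hypothetical identifiers under example.org, not taken from any real dataset.
PLAY = "http://example.org/work/waiting-for-godot"
AUTHOR = "http://example.org/person/samuel-beckett"

triples = [
    # The object's type is itself a triple.
    Triple(PLAY, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "http://example.org/type/Play"),
    # An attribute: the object is a literal value.
    Triple(PLAY, "http://purl.org/dc/terms/title", "Waiting for Godot"),
    # A relationship: the object is another URI.
    Triple(PLAY, "http://purl.org/dc/terms/creator", AUTHOR),
]

for t in triples:
    print(to_ntriples(t))
```

Because the author is identified by a URI rather than by a name string, any other source that uses the same URI is talking about the same person, which is what makes linking possible.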

“Open data” means that the information concerned actually can and may be used.

In hindsight, the concept of “linked data” came too early for the library and heritage world in general. Most libraries, particularly public libraries, simply did not have the context and expertise at that time to do something meaningful with it. Only larger institutions with sufficient expertise, technical staff and funding, such as national libraries, scientific institutes, library consortia and renowned heritage institutions, were capable of executing linked data pilot projects and implementing linked data services. Furthermore, many institutions depend on external system, database and content providers. Only in the last couple of years (roughly since 2014) have influential organisations in the international library and heritage world seriously begun exploring linked data. These include large commercial system vendors like OCLC and Ex Libris, and national and regional umbrella organisations like national libraries and library consortia.

My own first use of the term “linked data” is documented on the web, in a blog post dated June 19, 2009, titled ‘Linked Data for Libraries’, so already in reference to libraries. The main assertion of my argument was “data is relationships”, which still holds in full. The gist of my story was rather optimistic, focusing on a couple of technical and modelling aspects (URIs, RDF, ontologies, content negotiation, etc.) for which there simply seemed to be a number of solutions at hand. In practice, however, these technical and modelling aspects turned out to be the subject of much discussion among linked data theorists and evangelists. Theoretical discussions like these, however necessary, usually mean that consensus on standards and best practices is not reached very swiftly, which in turn holds back the development of universal and practical applications.

At that time I already worked at the Library of the University of Amsterdam (UvA), in charge of a number of library systems. However, I had applied the concepts underlying linked data years earlier, before the term “linked data” even existed: in the period 2000-2002 at the former NIWI (Netherlands Institute for Scientific Information Services), in collaboration with my colleague Niek van Baalen. Essentially we are dealing here with nothing more than very elementary and universal principles that can make life a lot easier for system and database designers. Our basic premise was that everything to be described was a thing or an object with a unique ID, to which a type or concept was assigned, such as ‘person’, ‘publication’, ‘organisation’ etc. Depending on the type, the object could have a number of attributes (such as ‘name’, ‘start date’, etc.) and relationships with other objects. The objects could be denoted with various textual labels in specific languages. All of this was implemented in an independent relational database, with a fully decoupled web frontend based on object-oriented software as a middle layer. This approach was a logical answer to the problem of integrating the various databases and information systems of the six former institutes of the KNAW (Royal Netherlands Academy of Arts and Sciences) that constituted NIWI [See: http://www.slideshare.net/lukask/concepts-and-relations-2595603 and http://www.niekvanbaalen.net/swiftbox/].
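The premise described above (things with a unique ID and a type, attributes, relationships, and labels per language) could be sketched as a small relational schema. This is not the NIWI implementation, which is not shown here; the table names, column names and sample data are all hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE thing (
    id   INTEGER PRIMARY KEY,   -- the unique ID
    type TEXT NOT NULL          -- e.g. 'person', 'publication', 'organisation'
);
CREATE TABLE label (
    thing_id   INTEGER NOT NULL REFERENCES thing(id),
    lang       TEXT NOT NULL,   -- language code of the label
    label_text TEXT NOT NULL
);
CREATE TABLE attribute (
    thing_id INTEGER NOT NULL REFERENCES thing(id),
    name     TEXT NOT NULL,     -- e.g. 'name', 'start date'
    value    TEXT NOT NULL
);
CREATE TABLE relationship (
    subject_id INTEGER NOT NULL REFERENCES thing(id),
    kind       TEXT NOT NULL,   -- e.g. 'works for'
    object_id  INTEGER NOT NULL REFERENCES thing(id)
);
""")

# Made-up sample data, purely for illustration.
conn.executemany("INSERT INTO thing (id, type) VALUES (?, ?)",
                 [(1, "person"), (2, "organisation")])
conn.executemany("INSERT INTO label (thing_id, lang, label_text) VALUES (?, ?, ?)",
                 [(1, "en", "Example Researcher"),
                  (2, "en", "Example Institute"),
                  (2, "nl", "Voorbeeldinstituut")])
conn.execute("INSERT INTO relationship (subject_id, kind, object_id) VALUES (?, ?, ?)",
             (1, "works for", 2))


def related(conn, thing_id, lang="en"):
    """Return (relationship kind, target type, target label) rows for one thing."""
    rows = conn.execute(
        """
        SELECT r.kind, t.type, l.label_text
        FROM relationship r
        JOIN thing t ON t.id = r.object_id
        JOIN label l ON l.thing_id = t.id AND l.lang = ?
        WHERE r.subject_id = ?
        ORDER BY r.kind
        """,
        (lang, thing_id),
    )
    return rows.fetchall()


print(related(conn, 1, "en"))
print(related(conn, 1, "nl"))
```

The same relationship is shown with a different label depending on the requested language, because the label is stored separately from the thing it denotes. The `relationship` table is in effect a table of triples with local IDs instead of URIs.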

Unfortunately both our concept-relational approach and NIWI were premature. The ideas on system-independent concepts and relationships did not fall on fertile ground, and the time was not right for an interdisciplinary scientific institute either. The current Dutch data archiving institute DANS emerged from the late NIWI and continues the activities of the former Steinmetz Institute and the Dutch Historical Data Archive. One of the main areas of research for DANS nowadays is linked data.

Anyway, when I first learned about the concept of linked data in 2009, I was immediately converted. In 2010 I had the opportunity to carry out a linked data pilot in collaboration with Ad Aerts of the former Theatre Institute of the Netherlands (TIN) and my UvA colleague Roxana Popistasu, in which theatre texts in the UvA Aleph OPAC were enriched with related information about performances of the play in question from the TIN Adlib Theatre Productions Database. The objective of this pilot was to show the added value of enrichment of search results via linked data with relevant information from other databases, while at the same time exposing bottlenecks in the data used. In particular the lack of universally used identifiers for objects, people and subjects at that time appeared to be a barrier for successfully implementing linked data.
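The role of shared identifiers in this kind of enrichment can be illustrated with a small sketch. The records, field names (`work_uri`, `play_uri`) and URIs below are invented for the example and do not reflect the actual Aleph or Adlib data; the point is only that a catalogue record can be enriched when both sources use the same identifier for the play, and cannot when that identifier is missing.

```python
from collections import defaultdict

# Hypothetical catalogue records (text editions of plays).
catalogue = [
    {"title": "Waiting for Godot", "work_uri": "http://example.org/work/godot"},
    {"title": "Wachten op Godot", "work_uri": "http://example.org/work/godot"},
    {"title": "Endgame", "work_uri": None},  # no shared identifier recorded
]

# Hypothetical production records from a second database.
productions = [
    {"play_uri": "http://example.org/work/godot", "company": "Example Company A"},
    {"play_uri": "http://example.org/work/godot", "company": "Example Company B"},
    {"play_uri": "http://example.org/work/endgame", "company": "Example Company C"},
]


def enrich(records, productions):
    """Attach to each catalogue record the productions that share its work URI."""
    by_uri = defaultdict(list)
    for production in productions:
        by_uri[production["play_uri"]].append(production)

    enriched = []
    for record in records:
        uri = record["work_uri"]
        matches = by_uri[uri] if uri in by_uri else []
        enriched.append({**record, "productions": matches})
    return enriched


for record in enrich(catalogue, productions):
    companies = [p["company"] for p in record["productions"]]
    print(record["title"], "->", companies or "no shared identifier, no enrichment")
```

Both the English and the Dutch edition pick up the same productions because they point at the same work, while the record without an identifier stays unenriched even though a matching production exists in the other database.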

Example theatre linked data pilot: Waiting for Godot

2010 was also the year that I first attended the SWIB conference (Semantic Web In Libraries). As it was only the second time the conference was organised, SWIB was still a largely German-language meeting for a predominantly German audience. In the meantime SWIB has developed into one of the most important international linked open data conferences, held completely in English. Attending linked data conferences like SWIB often generates mixed feelings. On the one hand the discussions and the projects presented are a source of motivation; on the other hand they also give rise to frustration, because after returning to your own place of work it becomes clear once more that what large institutions can do in projects is not possible in everyday life. It is particularly the dependence on system providers that makes it difficult for libraries to implement linked data. In the theatre pilot mentioned before, which used the Ex Libris Aleph library system, it was only possible to use JavaScript add-ons in the HTML pages of the user interface, not to adjust the internal system architecture or the international bibliographic MARC standard.

This vendor dependence was the immediate reason for establishing the Linked Open Data Special Interest Working Group (LOD SIWG) within IGeLU, the International Group of Ex Libris Users. This group’s objective was and is to convince the global library systems provider Ex Libris to implement linked data options in their systems. Some effort was needed to make Ex Libris appreciate the value of this, but after five years the company has officially initiated a “Linked Data Collaboration Program”, in which the Library of the University of Amsterdam is a partner. Besides the LOD SIWG activities, parallel developments in the library world have of course contributed to this as well, such as the Library of Congress BIBFRAME project and the linked data activities of competitor OCLC.

The BIBFRAME project is concerned with storing bibliographic data as linked data in RDF, replacing the international bibliographic MARC format. OCLC is primarily focused on publishing WorldCat and authority information as linked data through URIs and enhancing its findability in search engines like Google through schema.org. Storing linked data should in principle utilize information published as linked data elsewhere, especially authority files such as VIAF and LoC Vocabularies.

BIBFRAME basic schema

Consuming data published elsewhere is of course the actual goal of implementing linked data, in particular for the purpose of presenting end users with additional relevant information about topics they are interested in, without the need to execute similar searches in other systems. Academic libraries for example are increasingly developing an interest in presenting research output not only in the form of scholarly publications, but also in the form of related information about research projects, research data, procedures, networks, etc.

In 2012-2013 I carried out a pilot in this context, linking scholarly publications, harvested from the UvA institutional repository and loaded into the UvA Primo discovery index, to related information in the Dutch national research information repository NARCIS, which has been managed for a number of years by the previously mentioned DANS. NARCIS offers a limited subset of “Enhanced Publications”, in which all available research information is connected. These publications can also be retrieved as linked data/RDF. Unfortunately the only workable result of this test was adding an external link to author information in NARCIS. Processing of URIs and linked data was not, and is not yet, available in Primo. But this is going to change now with the aforementioned Ex Libris Linked Data Collaboration Program.

Example of NARCIS Enhanced Publications

However, even if one has access to software that is targeted at storing and processing linked data and RDF, that does not suffice to actually tie together information from multiple sources. This was the outcome of another UvA pilot in the area of linked data and research information, using the open source linked data research information tool VIVO. This pilot showed that the data available in the internal university research information system was not good and complete enough. The objective of registering research information had always been limited to monitoring and publishing research output in an optimal way, mainly in the form of scholarly publications.

In 2016 the odds appear to be steadily turning in favour of a broader application of linked data in libraries and other heritage institutions, in any case in my own experience. The Library of the University of Amsterdam is a partner in the Ex Libris Linked Data Collaboration Program Discovery Track. And the term “linked data” appears more and more in official library policy documents.

Looking back on ten years of linked data and libraries one can conclude that successful implementation depends on the state of affairs in the full heritage information processing ecosystem. In this respect five preconditions within individual organisations are of importance: business case, tools, data, workflow and lifecycle.

Business case: an organisation always requires a business case for applying linked data. It is not a goal in its own right. For instance, plans may exist for providing new services or improving efficiency in existing tasks for which linked data can be employed, such as presenting integrated research information, providing background information about the creation of works of art, or simply eliminating redundant information in multiple databases.

Tools: the software used must be suited for linked data, which means publishing RDF, maintaining a SPARQL endpoint, processing external linked data through URIs, and storing data in a triple store. Specialised expertise is required in the case of homegrown software. For third party software this must be provided by the vendors.

Data: internal and external data must be available and suitable for publishing and consuming as linked data. The local information architecture and interoperability require close attention. Excessive focus on individual systems with closed databases prevents this.

Workflow: working procedures must be adapted to the processing of linked data. Existing working procedures are targeted at existing objectives, functionality and systems. Because all of these change when linked data is implemented, procedures, jobs and the division of tasks will have to be adapted too. In particular the use, continuity and reliability of internal and external linked data sources will have to be taken into account.

Lifecycle: new tools, data infrastructures and workflows will have to be secured in the organisation for the long term. It is important to adhere to existing standards and best practices, and to participate in collaboratives like open source communities, library consortia and user groups, if possible.
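To make the tools precondition a little more concrete: at its core, a triple store holds triples and answers pattern queries in which some positions are left open. The toy class below is a deliberately simplified stand-in under that assumption. It is not SPARQL and not any real triple store product; the class name, method names and `example.org` URIs are hypothetical.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class TinyTripleStore:
    """In-memory set of triples with wildcard pattern matching."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add((subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield triples matching the pattern; None means 'any value'."""
        for triple in sorted(self._triples):  # sorted for a stable order
            s, p, o = triple
            if subject is not None and s != subject:
                continue
            if predicate is not None and p != predicate:
                continue
            if obj is not None and o != obj:
                continue
            yield triple


EX = "http://example.org/"  # hypothetical namespace
CREATOR = "http://purl.org/dc/terms/creator"
TITLE = "http://purl.org/dc/terms/title"

store = TinyTripleStore()
store.add(EX + "work/godot", CREATOR, EX + "person/beckett")
store.add(EX + "work/endgame", CREATOR, EX + "person/beckett")
store.add(EX + "work/godot", TITLE, "Waiting for Godot")

# "Which works were created by this person?" Only the subject is left open.
works = [s for s, _, _ in store.match(predicate=CREATOR, obj=EX + "person/beckett")]
print(works)
```

A real SPARQL endpoint combines many such patterns in one query and can draw on external sources, which is the kind of capability that has to come from specialised in-house expertise or from the vendor.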

For the coming years I expect a number of standards and initiatives in the realm of linked data to reach maturity, which will enable individual libraries, archives and museums to get involved when they have practical implementations in mind, such as the aforementioned new services or efficiency improvements.
