Posted on January 5th, 2012 33 comments
2011 has in a sense been the year of library linked data. Not that libraries of all kinds are now publishing and consuming linked data in great numbers. No. But we have witnessed the publication of the final report of the W3C Library Linked Data Incubator Group, the Library of Congress announcement of the new Bibliographic Framework for the Digital Age based on Linked Data and RDF, the release by a number of large libraries and library consortia of their bibliographic metadata, many publications, sessions and presentations on the subject.
All these events focus mainly on publishing library bibliographic metadata as linked open data. Personally I am not convinced that this is the most interesting type of data that libraries can provide. Bibliographic metadata as such describe publications, in the broadest sense, providing information about title, authors, subjects, editions, dates, urls, but also physical attributes like dimensions, number of pages, formats, etc. This type of information, in FRBR terms: Work, Expression and Manifestation metadata, is typically shared among a large number of libraries, publishers, booksellers, etc. ‘Shared’ in this case means ‘multiplied and redundantly stored in many different local systems‘. It doesn’t really make sense if all libraries in the world publish identical metadata side by side, does it?
In essence only really unique data is worth publishing. You link to the rest.
Currently, library data that is really unique and interesting is administrative information about holdings and circulation. After having found metadata about a potentially relevant publication it is very useful for someone to know how and where to get access to it, if it’s not freely available online. Do you need to go to a specific library location to get the physical item, or to have access to the online article? Do you have to be affiliated to a specific institution to be entitled to borrow or access it?
Usage data about publications, both print and digital, can be very useful in establishing relevance and impact. This way information seekers can be supported in finding the best possible publications for their specific circumstances. There are some interesting projects dealing with circulation data already, such as the research project by Magnus Pfeffer and Kai Eckert as presented at the SWIB 11 conference, and the JISC funded Library Impact Data project at the University of Huddersfield. The Ex Libris bX service presents article recommendations based on SFX usage log analysis.
The consequence of this assertion is that if libraries want to publish linked open data, they should focus on holdings and circulation data, and for the rest link to available bibliographic metadata as much as possible. It is to be expected that the Library of Congress’ New Bibliographic Framework will take care of that part one way or another.
In order to achieve this libraries should join forces with each other and with publishers and aggregators to put their efforts into establishing shared global bibliographic metadata pools accessible through linked open data. We can think of already existing data sources like WorldCat, OpenLibrary, Summon, Primo Central and the like. We can only hope that commercial bibliographic metadata aggregators like OCLC, SerialsSolutions and Ex Libris will come to realise that it’s in everybody’s interest to contribute to the realisation of the new Bibliographic Framework. The recent disagreement between OCLC and the Swedish National Library seems to indicate that this may take some time. For a detailed analysis of this see the blog post ‘Can linked library data disrupt OCLC? Part one’.
An interesting initiative in this respect is LibraryCloud, an open, multi-library data service that aggregates and delivers library metadata. And there is the HBZ LOBID project, which is targeted at ‘the conversion of existing bibliographic data and associated data to Linked Open Data‘.
So what would the new bibliographic framework look like? If we take the FRBR model as a starting point, the new framework could look something like this. See also my slideshow “Linked Open Data for libraries”, slides 39-42.
The basic metadata about a publication or a unit of content, on the FRBR Work level, would be an entry in a global datastore identified by a URI ( Uniform Resource Identifier). This datastore could for instance be WorldCat, or OpenLibrary, or even a publisher’s datastore. It doesn’t really matter. We don’t even have to assume it’s only one central datastore that contains all Work entries.
The thing identified by the URI would have a text string field associated with it containing the original title, let’s say “The Da Vinci Code” as an example of a book. But also articles can and should be identified this way. The basic information we need to know about the Work would be attached to it using URIs to other things in the linked data web. A set of two things linked by a URI is called a ‘triple’. ‘Author’ could for instance be a link to OCLC’s VIAF (http://viaf.org/viaf/102403515 = Dan Brown), which would then constitute a triple. If there are more authors, you simply add a URI for every person or institution. Subjects could be links to DBPedia/Wikipedia, Freebase, the Library of Congress Authority files, etc. There could be some more basic information, maybe a year, or a URI to a source describing the background of the work.
At the Expression level, a Dutch translation would have it’s own URI, stored in the same or another datastore. I could imagine that the publisher who commissioned the translation would maintain a datastore with this information. Attached to the Expression there would be the URI of the original Work, a URI pointing to the language, a URI identifying the translator and a text string contaning the Dutch title, among others.
Every individual edition of the work could have it’s own Manifestation level URI, with a link to the Expression (in this case the Dutch translation), a publisher URI, a year, etc. For articles published according to the long standing tradition of peer reviewed journals, there would also be information about the journal. On this level there should also be URIs to the actual content when dealing with digital objects like articles, ebooks, etc., no matter if access is free or restricted.
So far we have everything we need to know about publications “in the cloud”, or better: in a number of datastores available on a number of servers connected to the world wide web. This is more or less the situation described by OCLC’s Lorcan Dempsey in his recent post ‘Linking not typing … knowledge organization at the network level’. The only thing we need now is software to present all linked information to the user.
No libraries in sight yet. For accessing freely available digital content on the web you actually don’t need a library, unless you need professional assistance finding the correct and relevant information. Here we have identified a possible role of librarians in this new networked information model.
Now we have reached the interesting part: how to link local library data to this global shared model? We immediately discover that the original FRBR model is inadequate in this networked environment, because it implies a specific local library situation. Individual copies of a work (the Items) are directly linked to the Manifestation, because FRBR refers to the old local catalogue which describes only the works/publications one library actually owns.
In the global shared library linked data network we need an extra explicit level to link physical Items owned by the library or online subscriptions of the library to the appropriate shared network level. I suggest to use the “Holding” level. A Holding would have it’s own URI and contain URIs of the Manifestation and of the Library. A specific Holding in this way would indicate that a specific library has one or more copies (Items) of a specific edition of a work (Manifestation), or offers access to an online digital article by way of a subscription.
If a Holding refers to physical copies (print books or journal issues for instance) then we also need the Item level. An Item would have it’s own URI and the URI of the Holding. For each Item, extra information can be provided, for instance ‘availability’, ‘location’, etc. Local circulation administration data can be registered for all Holdings and Items. For online digital content we don’t need Items, only subscription information directly attached to the Holding.
Local Holding and Item information can reside on local servers within the library’s domain or just as well on some external server ‘in the cloud’.
It’s on the level of the Holding that usage statistics per library can be collected and aggregated, both for physical items and for digital material.
Now, this networked linked library data model still allows libraries to present a local traditional catalogue type interface, showing only information about the library’s own print and digital holdings. What’s needed is software to do this using the local Holdings as entry level.
But the nice thing about the model is that there will also be a lot of other options. It will also be possible to start at the other end and search all bibliographic metadata available in the shared global network, and then find the most appropriate library to get access to a specific publication, much like WorldCat does, but on an even larger scale.
Another nice thing of using triples, URIs and linked data, is that it allows for adding all kinds of other, non-traditional bibliographic links to the old inward looking library world, making it into a flexible and open model, ready for future developments. It will for instance be possible for people to discover links to publications and library holdings from any other location on the web, for instance a Wikipedia page or a museum website. And the other way around, from an item in local library holdings to let’s say a recorded theatre performance on YouTube.
When this new data and metadata framework will be in place, there will be two important issues to be solved:
- Getting new software, systems and tools for both back end administrative functions and front end information finding needs. For this we need efforts from traditional library systems vendors but also from developers in libraries.
- Establishing future roles for libraries, librarians and information professionals in the new framework. This may turn out to be the most important issue.
Posted on June 19th, 2009 8 commentsLinked Data and bibliographic metadata models
Some time after I wrote “UMR – Unified Metadata Resources“, I came across Chris Keene’s post “Linked data & RDF : draft notes for comment“, “just a list of links and notes” about Linked Data, RDF and the Semantic Web, put together to start collecting information about “a topic that will greatly impact on the Library / Information management world“.
While reading this post and working my way through the links on that page, I started realising that Linked Data is exactly what I tried to describe as One single web page as the single identifier of every book, author or subject. I did mention Semantic Web, URI’s and RDF, but the term “Linked Data” as a separate protocol had escaped me.
The concept of Linked Data was described by Tim Berners Lee, the inventor of the World Wide Web. Whereas the World Wide Web links documents (pages, files, images), which are basically resources about things, (“Information Resources” in Semantic Web terms), Linked Data (or the Semantic Web) links raw data and real life things (“Non-Information Resources”).
There are several definitions of Linked Data on the web, but here is my attempt to give a simple definition of it (loosely based on the definition in Structured Dynamics’ Linked Data FAQ):Linked Data is a methodology for providing relationships between things (data, concepts and documents) anywhere on the web, using URI’s for identifying, RDF for describing and HTTP for publishing these things and relationships, in a way that they can be interpreted and used by humans and software.
I will try to illustrate the different aspects using some examples from the library world. The article is rather long, because of the nature of the subject, then again the individual sections are a bit short. But I do supply a lot of links for further reading.
Data is relationships
The important thing is that “data is relationships“, as Tim Berners Lee says in his recent presentation for TED.
Before going into relationships between things, I have to point out the important distinction between abstract concepts and real life things, which are “manifestations” of the concepts. In Object modeling these are called “classes” (abstract concepts, types of things) and “objects” (real life things, or “instances” of “classes“).
- the class book can have the instances/objects “Cloud Atlas“, “Moby Dick“, etc.
- the class person can have the instances/objects “David Mitchell“, “Herman Melville“, etc.
In the Semantic Web/RDF model the concept of triples is used to describe a relationship between two things: subject – predicate – object, meaning: a thing has a relation to another thing, in the broadest sense:
- a book (subject) is written by (predicate) a person (object)
You can also reverse this relationship:
- a person (subject) is the author of (predicate) a book (object)
The person in question is only an author because of his or her relationship to the book. The same person can also be a mother of three children, an employee of a library, and a speaker at a conference.
Moreover, and this is important: there can be more than one relationship between the same two classes or types of things. A book (subject) can also be about (predicate) a person (object). In this case the person is a “subject” of the book, that can be described by a “keyword”, “subject heading”, or whatever term is used. A special case would be a book, written by someone about himself (an autobiography).
The problem with most legacy systems, and library catalogues as an example of these, is that a record for let’s say a book contains one or more fields for the author (or at best a link to an entry in an authority file or thesaurus), and separately one or more fields for subjects. This way it is not possible to see books written by an author and books about the same author in one view, without using all kinds of workarounds, link resolvers or mash-ups.
Using two different relationships that link to the same thing would provide for an actual view or representation of the real world situation.
Another important option of Linked Data/RDF: a certain thing can have as a property a link to a concept (or “class”) , describing the nature of the thing: “object Cloud Atlas” has type “book“; “object David Mitchell” has type “person“; “object Cloud Atlas” is written by “object David Mitchell“.
And of course, the property/relationship/predicate can also link to a concept describing the nature of the link.
Anywhere on the web
So far so good. But you may argue that this relationship theory is not very new. Absolutely right, but up until now this data-relationship concept has mainly been used with a view to the inside, focused on the area of the specific information system in question, because of the nature and the limitations of the available technology and infrastructure.
The “triple” model is of course exactly the same as the long standing methodology of Entity Relationship Diagrams (ERD), with which relationships between entities (=”classes“) are described. An ERD is typically used to generate a database that contains data in a specific information system. But ERD’s could just as well be used to describe Linked Data on the web.
Information systems, such as library catalogs, have been, and still are, for the greatest part closed containers of data, or “silos” without connections between them, as Tim Berners Lee also mentions in his TED presentation.
Lots of these silo systems are accessible with web interfaces, but this does not mean that items in these closed systems with dedicated web front ends can be linked to items in other databases or web pages. Of course these systems can have API‘s that allow system developers to create scripts to get related information from other systems and incorporate that external information in the search results of the calling system. This is what is being done in web 2.0 with so-called mash-ups.
But in this situation you need developers who know how to make scripts using specific scripting languages for all the different proprietary API’s that are being supported for all the individual systems.
If Linked Data was a global standard and all open and closed systems and websites supported RDF, then all these links would be available automatically to RDF enabled browser and client software, using SPARQL, the RDF Query Language.
- Linked Data/RDF can be regarded as a universal API.
The good thing about Linked Data is, that it is possible to use Linked Data mechanisms to link to legacy data in silo databases. You just need to provide an RDF wrapper for the legacy system, like has been done with the Library of Congress Subject Headings.
Some examples of available tools for exposing legacy data as RDF:
- Triplify – a web applications plugin that converts relational database structures into RDF triples
- D2R Server – a tool for publishing relational databases on the Semantic Web
- wp-RDFa – a wordpress plugin that adds some RDF information about Author and Title to WordPress blog posts
Of course, RDF that is generated like this will very probably only expose objects to link TO, not links to RDF objects external to the system.
Also, Linked Data can be used within legacy systems, for mixing legacy and RDF data, open and closed access data, etc. In this case we have RDF triples that have a subject URI from one data source and an object URI from another data source. In a situation with interlinked systems it would for instance be possible to see that the author of a specific book (data from a library catalog) is also speaking at a specific conference (data from a conference website). Objects linked together on the web using RDF triples are also known as an “RDF graph”. With RDF-aware client software it is possible to navigate through all the links to retrieve additional information about an object.
URI’s (“Uniform Resource Identifiers”) are necessary for uniquely identifying and linking to resources on the web. A URI is basically a string that identifies a thing or resource on the web. All “Information Resources”, or WWW pages, documents, etc. have a URI, which is commonly known as a URL (Uniform Resource Locator).
With Linked Data we are looking at identifying “Non-information Resources” or “real world objects” (people, concepts, things, even imaginary things), not web pages that contain information about these real world objects. But it is a little more complicated than that. In order to honour the requirement that a thing and its relations can be interpreted and used by humans and software, we need at least 3 different representations of one resource (see: How to publish Linked Data on the web):
- Resource identifier URI (identifies the real world object, the concept, as such)
- RDF document URI (a document readable for semantic web applications, containing the real world object’s RDF data and relationships with other objects)
- HTML document URI (a document readable for humans, with information about the real world object)
For instance, there could be a Resource Identifier URI for a book called “Cloud Atlas“. The web resource at that URI can redirect an RDF enabled browser to the RDF document URI, which contains RDF data describing the book and its properties and relationships. A normal HTML web browser would be redirected to the HTML document URI, for instance a web page about the book at the publisher’s website.
There are several methods of redirecting browsers and application to the required representation of the resource. See Cool URIs for the Semantic Web for technical details.
There are also RDF enabled browsers that transform RDF into web pages readable by humans, like the FireFox addon “Tabulator“, or the web based Disco and Marbles browsers, both hosted at the Free University Berlin.
RDF, vocabularies, ontologies
RDF or Resource Description Framework, is, like the name suggests, just a framework. It uses XML (or a simpler non-XML method N3) to describe resources by means of relationships. RDF can be implemented in vocabularies or ontologies, which are sets of RDF classes describing objects and relationships for a given field.
Basically, anybody can create an RDF vocabulary by publishing an RDF document defining the classes and properties of the vocabulary, at a URI on the web. The vocabulary can then be used in a resource by referring to the namespace (the URI) and the classes in that RDF document.
A nice and useful feature of RDF is that more than one vocabularies can be mixed and used in one resource.
Also, a vocabulary itself can reference other vocabularies and thereby inherit well established classes and properties from other RDF documents.
Another very useful feature of RDF is that objects can be linked to similar object resources describing the same real world thing. This way confusion about which object we are talking about, can be avoided.
A couple of existing and well used RDF vocabularies/ontologies:
- RDF – the base RDF vocabulary
- RDFS (for RDF Schema)
- DC (for Dublin Core)
- FOAF (for FOAF- Friend of a Friend) – online identities and social networks
- SKOS (for SKOS – Simple Knowledge Organisation System) – thesauri, classification schemes, subject heading systems and taxonomies
- OWL (for OWL -Ontology Web Language)
(By the way, the links in the first column (to the RDF files themselves) may act as an illustration of the redirection mechanism described before. Some of them may link to either the RDF file with the vocabulary definition itself, or to a page about the vocabulary, depending on the type of browser you use: rdf-aware or not.)
A special case is:
<?xml version=”1.0″ encoding=”UTF-8″ ?>
- RDFa – a sort of microformat without a vocabulary of its own, which relies on other vocabularies for turning XHTML page attributes into RDF
<dc:publisher>Random House Trade Paperbacks</dc:publisher>
<dc:title>Cloud Atlas: A Novel</dc:title>
<rdfs:label>Cloud Atlas: A Novel</rdfs:label>
<rdfs:label>RDF document about the book: Cloud Atlas: A Novel</rdfs:label>
<rdfs:label>Review number 1 about: Cloud Atlas: A Novel</rdfs:label>
<rdfs:label>RDF Book Mashup</rdfs:label>
A partial view on this RDF file with the Marbles browser:
It seems obvious that Linked Data can be very useful in providing a generic infrastructure for linking data, metadata and objects, available in numerous types of data stores, in the online library world. With such a networked online data structure, it would be fairly easy to create all kinds of discovery interfaces for bibliographic data and objects. Moreover, it would also be possible to link to non-bibliographic data that might interest the users of these interfaces.
A brief and incomplete list of some library related Linked Data projects, some of which already mentioned above:
- RDF BookMashup – Integration of Web 2.0 data sources like Amazon, Google or Yahoo into the Semantic Web.
- Library of Congress Authorities – Exposing LoC Autorities and Vocabularies to the web using URI’s
- DBPedia – Exposing structured data from WikiPedia to the web
- LIBRIS – Linked Data interface to Swedish LIBRIS Union catalog
- Scriblio+Wordpress+Triplify – “A social, semantic OPAC Union Catalogue”
And what about MARC, AACR2 and RDA? Is there a role for them in the Linked Data environment? RDA is supposed to be the successor of AACR2 as a content standard that can be used with MARC, but also with other encoding standards like MODS or Dublin Core.
The RDA Entity Relationship Diagram, that incorporates FRBR as well, can of course easily be implemented as an RDF vocabulary, that could be used to create a universal Linked Data library network. It really does not matter what kind of internal data format the connected systems use.
Posted on April 12th, 2009 7 comments
One single web page as the single identifier of every book, author or subject
I like the concept of “the web as common publication platform for libraries“, and “every book its own url“, as described by Owen Stephens in two blog posts:
“Its time to change library systems ”
I’d suggest what we really need to think about is a common ‘publication’ platform – a way of all of our systems outputting records in a way that can then be easily accessed by a variety of search products – whether our own local ones, remote union ones, or even ones run by individual users. I’d go further and argue that platform already exists – it is the web!
and “The Future is Analogue ”
If every book in your catalogue had it’s own URL – essentially it’s own address on your web, you would have, in a single step, enabled anyone in the world to add metadata to the book – without making any changes to the record in your catalogue.
This concept of identifying objects by URL:Unified Resource Locator (or maybe better URI: Unified Resource Identifier) is central to the Semantic Web, that uses RDF (resource Description Framework) as a metadata model.
As a matter of fact at ELAG 2008 I saw Jeroen Hoppenbrouwers (“Rethinking Subject Access “) explaining his idea of doing the same for Subject Headings using the Semantic Web concept of triplets. Every subject its own URL or web page. He said: “It is very easy. You can start doing this right away“.
To make the picture complete we only need the third essential component: every author his or her or its own URL!
This ideal situation would have to conform to the Open Access guidelines of course. One single web page serving as the single identifier of every book, author or subject, available for everyone to link their own holdings, subscriptions, local keywords and circulation data to.
In real life we see a number of current initiatives on the web by commercial organisations and non commercial groups, mainly in the area of “books” (or rather “publications”) and “authors”. “Subjects” apparently is a less appealing area to start something like this, because obviously stand-alone “subjects” without anything to link them to are nothing at all, whereas you always have “publications” and “authors”, even without “subjects”. The only project I know of is MACS (Multilingual Acces to Subjects), which is hosted on Jeroen Hoppenbrouwers’ domain.
For publications we have OCLC’s WorldCat, Librarything, Open Library, to name just a few. And of course these global initiatives have had their regional and local counterparts for many years already (Union Catalogues, Consortia models). But this is again a typical example of multiple parallel data stores of the same type of entities. The idea apparently is that you want to store everything in one single database aiming to be complete, instead of the ideal situation of single individual URI’s floating around anywhere on the web.
Ex Libris’ new Unified Resource Management development (URM, and yes: the title of this blog post is an ironic allusion to that acronym), although it promotes sharing of metadata, it does this within another separate system into which metadata from other systems can be copied.
Of course, the ideal picture sketched above is much too simple. We have to be sure which version of a publication, which author and which translation of a subject for instance we are dealing with. For publications this means that we need to implement FRBR (in short: an original publication/work and all of its manifestations), for authors we need author names thesauri, for subjects multilingual access.
I have tried to illustrate this in this simplified and incomplete diagram:
In this model libraries can use their local URI-objects representing holdings and copies for their acquisitions and circulation management, while the bibliographic metadata stay out there in the global, open area. Libraries (and individuals of course) can also attach local keywords to the global metadata, which in turn can become available globally (“social tagging”).
It is obvious that the current initiatives have dealt with these issues with various levels of success. Some examples to illustrate this:
- Work: Desiderius Erasmus – Encomium Moriae (Greek), Laus Stultitiae (Latin), Lof der Zotheid (Dutch), Praise of Folly (English)
- Author: David Mitchell
- Erasmus in WorldCat Identities (one ID, many forms)
- David Mitchell in WorldCat Identities (one id per author)
- David Mitchell in VIAF (one id per author)
- Erasmus in OpenLibrary (one id, one incomplete form)
- Erasmus in VIAF (one id, although from The Netherlands, preferred forms are Swedish, French and German)
- Erasmus in Librarything (no identifier, numerous forms and occurrences)
- David Mitchell in Librarything (one form, “David Mitchell is composed of at least 12 distinct authors“, no way to distinguish)
- David Mitchell in OpenLibrary (one id for multiple authors)
- Erasmus “Praise of folly” in Librarything (numerous entries for all different title variations)
- Erasmus “Praise of folly” in OpenLibrary (numerous entries for all different title variations)
These findings seem to indicate that some level of coordination (which the commercial initiatives apparently have implemented better than the non-commercial ones) is necessary in order to achieve the goal of “one URI for each object”.
Who wants to start?