Posted on January 5th, 2012 33 comments
2011 has in a sense been the year of library linked data. Not that libraries of all kinds are now publishing and consuming linked data in great numbers. No. But we have witnessed the publication of the final report of the W3C Library Linked Data Incubator Group, the Library of Congress announcement of the new Bibliographic Framework for the Digital Age based on Linked Data and RDF, the release by a number of large libraries and library consortia of their bibliographic metadata, many publications, sessions and presentations on the subject.
All these events focus mainly on publishing library bibliographic metadata as linked open data. Personally I am not convinced that this is the most interesting type of data that libraries can provide. Bibliographic metadata as such describe publications, in the broadest sense, providing information about title, authors, subjects, editions, dates, urls, but also physical attributes like dimensions, number of pages, formats, etc. This type of information, in FRBR terms: Work, Expression and Manifestation metadata, is typically shared among a large number of libraries, publishers, booksellers, etc. ‘Shared’ in this case means ‘multiplied and redundantly stored in many different local systems‘. It doesn’t really make sense if all libraries in the world publish identical metadata side by side, does it?
In essence only really unique data is worth publishing. You link to the rest.
Currently, library data that is really unique and interesting is administrative information about holdings and circulation. After having found metadata about a potentially relevant publication it is very useful for someone to know how and where to get access to it, if it’s not freely available online. Do you need to go to a specific library location to get the physical item, or to have access to the online article? Do you have to be affiliated to a specific institution to be entitled to borrow or access it?
Usage data about publications, both print and digital, can be very useful in establishing relevance and impact. This way information seekers can be supported in finding the best possible publications for their specific circumstances. There are some interesting projects dealing with circulation data already, such as the research project by Magnus Pfeffer and Kai Eckert as presented at the SWIB 11 conference, and the JISC funded Library Impact Data project at the University of Huddersfield. The Ex Libris bX service presents article recommendations based on SFX usage log analysis.
The consequence of this assertion is that if libraries want to publish linked open data, they should focus on holdings and circulation data, and for the rest link to available bibliographic metadata as much as possible. It is to be expected that the Library of Congress’ New Bibliographic Framework will take care of that part one way or another.
In order to achieve this libraries should join forces with each other and with publishers and aggregators to put their efforts into establishing shared global bibliographic metadata pools accessible through linked open data. We can think of already existing data sources like WorldCat, OpenLibrary, Summon, Primo Central and the like. We can only hope that commercial bibliographic metadata aggregators like OCLC, SerialsSolutions and Ex Libris will come to realise that it’s in everybody’s interest to contribute to the realisation of the new Bibliographic Framework. The recent disagreement between OCLC and the Swedish National Library seems to indicate that this may take some time. For a detailed analysis of this see the blog post ‘Can linked library data disrupt OCLC? Part one’.
An interesting initiative in this respect is LibraryCloud, an open, multi-library data service that aggregates and delivers library metadata. And there is the HBZ LOBID project, which is targeted at ‘the conversion of existing bibliographic data and associated data to Linked Open Data‘.
So what would the new bibliographic framework look like? If we take the FRBR model as a starting point, the new framework could look something like this. See also my slideshow “Linked Open Data for libraries”, slides 39-42.
The basic metadata about a publication or a unit of content, on the FRBR Work level, would be an entry in a global datastore identified by a URI ( Uniform Resource Identifier). This datastore could for instance be WorldCat, or OpenLibrary, or even a publisher’s datastore. It doesn’t really matter. We don’t even have to assume it’s only one central datastore that contains all Work entries.
The thing identified by the URI would have a text string field associated with it containing the original title, let’s say “The Da Vinci Code” as an example of a book. But also articles can and should be identified this way. The basic information we need to know about the Work would be attached to it using URIs to other things in the linked data web. A set of two things linked by a URI is called a ‘triple’. ‘Author’ could for instance be a link to OCLC’s VIAF (http://viaf.org/viaf/102403515 = Dan Brown), which would then constitute a triple. If there are more authors, you simply add a URI for every person or institution. Subjects could be links to DBPedia/Wikipedia, Freebase, the Library of Congress Authority files, etc. There could be some more basic information, maybe a year, or a URI to a source describing the background of the work.
At the Expression level, a Dutch translation would have it’s own URI, stored in the same or another datastore. I could imagine that the publisher who commissioned the translation would maintain a datastore with this information. Attached to the Expression there would be the URI of the original Work, a URI pointing to the language, a URI identifying the translator and a text string contaning the Dutch title, among others.
Every individual edition of the work could have it’s own Manifestation level URI, with a link to the Expression (in this case the Dutch translation), a publisher URI, a year, etc. For articles published according to the long standing tradition of peer reviewed journals, there would also be information about the journal. On this level there should also be URIs to the actual content when dealing with digital objects like articles, ebooks, etc., no matter if access is free or restricted.
So far we have everything we need to know about publications “in the cloud”, or better: in a number of datastores available on a number of servers connected to the world wide web. This is more or less the situation described by OCLC’s Lorcan Dempsey in his recent post ‘Linking not typing … knowledge organization at the network level’. The only thing we need now is software to present all linked information to the user.
No libraries in sight yet. For accessing freely available digital content on the web you actually don’t need a library, unless you need professional assistance finding the correct and relevant information. Here we have identified a possible role of librarians in this new networked information model.
Now we have reached the interesting part: how to link local library data to this global shared model? We immediately discover that the original FRBR model is inadequate in this networked environment, because it implies a specific local library situation. Individual copies of a work (the Items) are directly linked to the Manifestation, because FRBR refers to the old local catalogue which describes only the works/publications one library actually owns.
In the global shared library linked data network we need an extra explicit level to link physical Items owned by the library or online subscriptions of the library to the appropriate shared network level. I suggest to use the “Holding” level. A Holding would have it’s own URI and contain URIs of the Manifestation and of the Library. A specific Holding in this way would indicate that a specific library has one or more copies (Items) of a specific edition of a work (Manifestation), or offers access to an online digital article by way of a subscription.
If a Holding refers to physical copies (print books or journal issues for instance) then we also need the Item level. An Item would have it’s own URI and the URI of the Holding. For each Item, extra information can be provided, for instance ‘availability’, ‘location’, etc. Local circulation administration data can be registered for all Holdings and Items. For online digital content we don’t need Items, only subscription information directly attached to the Holding.
Local Holding and Item information can reside on local servers within the library’s domain or just as well on some external server ‘in the cloud’.
It’s on the level of the Holding that usage statistics per library can be collected and aggregated, both for physical items and for digital material.
Now, this networked linked library data model still allows libraries to present a local traditional catalogue type interface, showing only information about the library’s own print and digital holdings. What’s needed is software to do this using the local Holdings as entry level.
But the nice thing about the model is that there will also be a lot of other options. It will also be possible to start at the other end and search all bibliographic metadata available in the shared global network, and then find the most appropriate library to get access to a specific publication, much like WorldCat does, but on an even larger scale.
Another nice thing of using triples, URIs and linked data, is that it allows for adding all kinds of other, non-traditional bibliographic links to the old inward looking library world, making it into a flexible and open model, ready for future developments. It will for instance be possible for people to discover links to publications and library holdings from any other location on the web, for instance a Wikipedia page or a museum website. And the other way around, from an item in local library holdings to let’s say a recorded theatre performance on YouTube.
When this new data and metadata framework will be in place, there will be two important issues to be solved:
- Getting new software, systems and tools for both back end administrative functions and front end information finding needs. For this we need efforts from traditional library systems vendors but also from developers in libraries.
- Establishing future roles for libraries, librarians and information professionals in the new framework. This may turn out to be the most important issue.
Posted on May 15th, 2009 22 comments
Why use a non-normalised metadata exchange format for suboptimal data storage?
This week I had a nice chat with André Keyzer of Groningen University library and Peter van Boheemen of Wageningen University Library who attended OCLC’s Amsterdam Mashathon 2009. As can be expected from library technology geeks, we got talking about bibliographic metadata formats, very exciting of course. The question came up: what on earth could be the reason for storing bibliographic metadata in exchange formats like MARC?
Exactly my idea! As a matter of fact I think I may have used the same words a couple of times in recent years, probably even at ELAG2008. The thing is: it really does not matter how you store bibliographic metadata in your database, as long as you can present and exchange the data in any format requested, be it MARC or Dublin Core or anything else.
Of course the importance of using internationally accepted standards is beyond doubt, but there clearly exists widespread misunderstanding of the functions of certain standards, like for instance MARC. MARC is NOT a data storage format. In my opinion MARC is not even an exchange format, but merely a presentation format.
With a background and experience in data modeling, database and systems design (among others), I was quite amazed about bibliographic metadata formats when I started working with library systems in libraries, not having a librarian training at all. Of course, MARC (“MAchine Readable Cataloging record“) was invented as a standard in order to facilitate exchange of library catalog records in a digital era.
But I think MARC was invented by old school cataloguers who did not have a clue about data normalisation at all. A MARC record, especially if it corresponds to an official set of cataloging rules like AARC2, is nothing more than a digitised printed catalog card.
In pre-computer times it made perfect sense to have a standardised uniform way of registering bibliographic metadata on a printed card in this way. The catalog card was simultaneously used as a medium for presenting AND storing metadata. This is where the confusion originates from!
But when the Library of Congress says “If a library were to develop a “home-grown” system that did not use MARC records, it would not be taking advantage of an industry-wide standard whose primary purpose is to foster communication of information” it is saying just plain nonsense.
Actually it is better NOT to use something like MARC for other purposes than exchanging, or better, presenting data. To illustrate this I will give two examples of MARC tags that have been annoying me since my first day as a library employee:
- 100 – Main Entry-Personal Name (NR) – subfield $a – Personal name (NR)
- 773 – Host Item Entry (R) – subfield $g – Relationship information (R)
100 – Main Entry-Personal Name
Besides storing an author’s name as a string in each individual bibliographic record instead of using a code, linking to a central authority table (“foreign key” in relational database terms), it is also a mistake to use a person’s name as one complete string in one field. Examples on the Library of Congress MARC website use forms like “Adams, Henry”, “Fowler, T. M.” and “Blackbeard, Author of”. To take only the simple first example, this author could also be registered as “Henry Adams”, “Adams, H.”, “H. Adams”. And don’t say that these forms are not according to the rules! They are out there! There is no way to match these variations as being actually one and the same.
In a normalised relational database, this subfield $a would be stored something like this (simplified!):
- First name=Henry
773 – Host Item Entry
Subfield $g of this MARC tag is used for storing citation information for a journal article, volume, issue, year, start page, end page, all in one string, like: “Vol. 2, no. 2 (Feb. 1976), p. 195-230“. Again I have seen this used in many different ways. In a normalised format this would look something like this, using only the actual values:
- Start page=195
- End page=230
In a presentation of this normalised data record extra text can be added like “Vol.” or “Volume“, “Issue” or “No.“, brackets, replacing codes by descriptions (Month 2 = Feb.) etc., according to the format required. So the stored values could be used to generate the text “Vol. 2, no. 2 (Feb. 1976), p. 195-230” on the fly, but also for instance “Volume 2, Issue 2, dated February 1976, pages 195-230“.
The strange thing with this bibliographic format aimed at exchanging metadata is that it actually makes metadata exchange terribly complicated, especially with these two tags Author and Host Item. I can illustrate this with describing the way this exchange is handled between two digital library tools we use at the Library of the University of Amsterdam, MetaLib and SFX , both from the same vendor, Ex Libris.
The metasearch tool MetaLib is using the described and preferred mechanism of on the fly conversion of received external metadata from any format to MARC for the purpose of presentation.
But if we want to use the retrieved record to link to for instance a full text article using the SFX link resolver, the generated MARC data is used as a source and the non-normalised data in the 100 and 773 MARC tags has to be converted to the OpenURL format, which is actually normalised (example in simple OpenUrl 0.1):
isbn=;issn=0927-3255;date=1976; volume=2;issue=2;spage=195;epage=230; aulast=Adams;aufirst=Henry;auinit=;
In order to do this all kinds of regular expressions and scripting functions are needed to extract the correct values from the MARC author and citation strings. Wouldn’t it be convenient, if the record in MetaLib would already have been in OpenURL or any other normalised format?
The point I am trying to make is of course that it does not matter how metadata is stored, as long as it is possible to get the data out of the database in any format appropriate for the occasion. The SRU/SRW protocol is particularly aimed at precisely this: getting data out of a database in the required format, like MARC, Dublin Core, or anything else. An SRU server is a piece of middleware that receives requests, gets the requested data, converts the data and then returns the data in the requested format.
Currently at the Library of the University of Amsterdam we are migrating our ILS which also involves converting our data from one bibliographic metadata format (PICA+) to another (MARC). This is extremely complicated, especially because of the non-normalised structure of both formats. And I must say that in my opinion PICA+ is even the better one.
Also all German and Austrian libraries are meant to migrate from the MAB format to MARC, which also seems to be a move away from a superior format.
All because of the need to adhere to international standards, but with the wrong solution.
Maybe the projected new standard for resource description and access RDA will be the solution, but that may take a while yet.