Posted on January 7th, 2013 98 comments
The future of the academic library as a data services hub
Is there a future for libraries, or more specifically: is there a future for academic libraries? This has been the topic of lots of articles, blog posts, books and conferences. See for instance Aaron Tay’s recent post about his favourite “future of libraries” articles. But the question needs to be addressed over and over again, because libraries, and particularly academic libraries, continue to persevere in their belief that they will stay relevant in the future. I’m not so sure.
I will focus here on academic libraries. I work for one, the Library of the University of Amsterdam. Academic libraries in my view are completely different from public libraries in audience, content, funding and mission. As far as I’m concerned, they only have the name in common. For a vision on the future of public libraries, see Ed Summer’s excellent post “The inside out library”. As for research and special libraries, some of what I am about to say will apply to these libraries as well.
So, is there a future for academic libraries? Personally I think in the near future we will see the end of the academic library as we know it. Let’s start with looking at what are perceived to be the core functions of libraries: discovery and delivery, of books and articles.
For a complete overview of the current library ecosystem you should read Lorcan Dempsey’s excellent article “Thirteen Ways of Looking at Libraries, Discovery, and the Catalog: Scale, Workflow, Attention”.
“Discovery happens elsewhere”. Lorcan Dempsey said this already in 2007 . What this means is that the audience the library aims at, primarily searches for and finds information via other platforms than the library’s website and search interfaces. Several studies (for instance OCLC’s “Perceptions of libraries, 2010“) show that the most popular platforms are general search engines like Google and Wikipedia but also specific databases. And of course, if you’re looking for instant information, you don’t go to the library catalogue, because it only points you to items that you have to read in order to ascertain that they may or may not contain the information you need.
And if you are indeed looking for publications (books, articles, etc.) you could of course search your library’s catalogue and discovery interface. But you can find exactly the same and probably even more results elsewhere: in other libraries’ search interfaces, or aggregators that collect bibliographic metadata from all over the world. Moreover, academic libraries are doing their best to get their local holdings metadata in WorldCat and their journal holdings in Google Scholar. As I said in my EMTACL12 talk: you can’t find all you need with one local discovery tool.
Also, the traditional way of discovery through browsing the shelves is disappearing rapidly. The physical copies at the University of Amsterdam Library for instance are all stored in a storage facility in a suburb. Apart from some reference works and latest journal issues there is nothing to find in the library buildings. There is no need for a university library building for discovery purposes anymore.
Utrecht University Library has taken the logical next step: they decided not to acquire a new discovery tool, discontinue their local homegrown article search index and focus on delivery. See the article “Thinking the unthinkable: a library without a catalogue” .
So, if discovery is something that academic libraries should not invest in anymore, is delivery really the only core responsibility left? Let’s have a closer look.
Delivery in the traditional academic library sense means: giving the customer access to the publications he or she selected, both in print and digital form. In the case of subscription based e-journal articles, delivery consists of taking a subscription and leading the customer to the appropriate provider website to obtain the online article. Taking subscriptions is an administrative and financial activity. For historical reasons the university library has been taking care of this task. Because they handled the print subscriptions, they also started taking care of the digital versions. But actually it’s not the library that holds the subscription, it’s the university. And it really does not require librarian skills to handle subscriptions. This could very well be taken care of by the central university administration. For free and open access journals you don’t even need that.
The selection and procurement of journal packages from a large number of publishers and content providers is a different issue. Specific expertise is required for this. I will come to that later.
The task of leading the customer to the appropriate online copy is only a technical procedure, involving setting up link resolvers. Again, no librarian skills needed. This task could be done by some central university agency, maybe even using an external global linking registry.
As for the delivery of physical print copies, this is obviously nothing more than a logistics workflow, no different from delivery of furniture, tools, food, or any other physical business. The item is ordered, it is fetched from the shelf, sometimes by huge industrial robot installations, put in a van or cart, transported to the desired location and put in the customer’s locker or something similar. Again: no librarian skills whatsoever. Physical delivery only needs a separate internal or external logistics unit.
So, if discovery and delivery will cease to be core activities of the central university library organisation, what else is there?
Selection of print and digital material was already mentioned. It is evident that the selection of printed and digital books and journal subscriptions needs to be governed by expert knowledge and decisions in order to provide staff and students with the best possible material, because there is a lot of money involved. Typically this task is carried out by subject specialists (also called subject librarians), not by generalists. These ‘faculty liaisons’ usually have had an education in the disciplines they are responsible for, and they work closely together with their customers (academic staff and students). Many universities have semiautonomous discipline oriented sublibraries. The recent development of Patron Driven Acquisition (PDA) also fits into this construction.
The actual comparison, selection and procurement of journal packages from a large number of publishers and content providers requires a certain generic specific expertise which is not discipline dependent. This is a task that could well continue to be the responsibility of some central organisational unit, which may or may not be called the university library.
And what about cataloguing, a definite librarian skill? If discovery happens elsewhere, and libraries don’t need to maintain their own local catalogues, then it seems obvious that libraries don’t need to catalogue anything anymore. In fact, in the current situation most libraries don’t catalogue that much already. All the main bibliographical metadata for books (title, author, date, etc.) are already provided by publishers, by external central library service centres, or by other libraries in a shared cataloguing environment. And libraries have never catalogued journal articles anyway, only journals and issues. Article metadata are provided by the publishers or aggregators. Libraries pay for these services.
It is usual for libraries to add their own subject headings and classification terms to the already existing ones. But as Karen Coyle said at EMTACL12: “Library classification is a knowledge prevention system“, because it offers only one specific object oriented view on the information world. So maybe libraries should stop doing this, which would be in line with the “discovery happens elsewhere” argument anyway.
What remains of cataloguing is adding local holdings, items and subscription information. This is very useful information for library customers, but again this doesn’t seem to require very detailed librarian skills. As a matter of fact most of these metadata are already provided in the selection and acquisition process by acquisition staff and vendors.
The recent Library of Congress BIBFRAME initiative developments in theory make it possible to replace all local cataloguing efforts by linking local holdings information to global metadata.
There is still one area that may require the full local cataloguing range: the university’s own scientific output, as long as it is not published in journals or as books. The fulltext material is made available through institutional repositories, which obviously requires metadata to make the publications findable. However, the majority of the institutional publications are made available through other channels as well, as mentioned, so the need for local cataloguing in these cases is absent.
More and more students are coming to the library buildings every day, that’s what you hear all the time. Large amounts of money are spent on creating new study centres and meeting places in existing library buildings, even on new buildings. But that’s exactly the point: students don’t come to the library for discovery anymore, because the building no longer provides that. They come for places to study, use network pc’s or the university wifi, meet with fellow students, pick up their print items on loan, or view not-for-loan material. The physical locations are nothing more or less than study centres. There’s absolutely nothing wrong with that, they are very important, but they do not have to be associated with the university library, but can be provided by the university, on any location.
The reference desk, or its online counterpart, is a weird phenomenon. It seems to emphasise the fact that if you want instant information, books are of no use. On the other hand, it suggests that you should come to the library if you need specific information right now. In my view, although the reference desk partly embodies the actual original objective of a library, namely giving access to information, this could function very well outside the library context.
The reference desk service is also somewhat ambiguous. In some cases subject specialist expertise is needed, other cases require a more general knowledge of how to search and find information.
Statistics of the use of library holdings, both print and electronic, are an important source of information for making decisions on acquisitions and subscriptions. These statistics are provided by local and remote delivery systems and vendors. Usage statistics can also be used for other purposes, like identifying certain trends in scholarly processes, mapping of information sources to specific user groups, etc. Administering and providing statistics once again is not a librarian task, but can be done by internal or external service providers.
Special Collections are a Special Case. Most university libraries have a Special Collections division, for historical reasons. But of course Special Collections divisions are nothing less than a Museum and Archive division with specific skills, expertise and procedures. Most of the time they are autonomous units within the university anyway.
Now, if the traditional library tasks of selection, cataloguing, discovery and delivery will increasingly be carried out by non-librarian staff and units inside and outside the university, is there still a valid reason for maintaining an autonomous central university library organisation? Should academic libraries shift focus? There are a number of possible new services and responsibilities for the library that are being discussed or already being implemented.
Content curation can be seen as the task of bringing together information on a specific subject, of all kinds, from different sources on the web to be consumed by people in an easy way. This is something that can be done and is already done by all kinds of organisations and people. Libraries, academic, public and other types, can and should play a bigger role in this area. This involves looking at other units and sources of information than just the traditional library ones: books and journals. This new service type evidently is closely related to the traditional reference desk service.
Obviously this can best be taken care of by subject specialists. To do this, they need tools and infrastructure. These tools and infrastructure are of a generic nature and can be provided by technical specialists inside or outside the libraries or universities.
Techniques are often referred to as “mashups” or “linked data”, depending on the background of the people involved.
Linked data deserves its own section here, because it has been an ever widening movement since a number of years. It finally reached the library world the last couple of years with developments like the W3C Library Linked Data Incubator Group, the Library of Congress BIBFRAME initiative and the IFLA Semantic Web Special Interest Group. Linked data is a special type of data source mashup infrastructure. It requires the use of URIs for all separately usable data entities, and triples as the format for the actual linking (subject-predicate-object), mostly using the RDF structure.
There are two sides to linked data: the publishing of data in RDF and consequently the consumption of data elsewhere. A special case is the linked data based infrastructure, combining both publication and consumption in a specific way, as is the objective of the above mentioned BIBFRAME project.
Again, we need both subject specialists and generic technology experts to make this work in libraries, both academic and public ones.
University libraries are more and more expected to increase the level of support for researchers. It’s not only about providing access to scholarly publications anymore, but also about maintaining research information systems, virtual research environments, and long term preservation, availability and reusability of research data sets.
Again, here we see the need for discipline specific support because the needs of researchers for communication, collaboration and data varies greatly per discipline. And again, for the technical and organisational infrastructure we need internal or external generic technology experts and services. Apart from metadata expertise there are no traditional librarian skills required.
The Final Frontier: the library turning 180 degrees and switching from consumption to production of publications. According to some people university libraries are very suitable and qualified to become scholarly publishers (see for instance Björn Brembs‘ “Libraries Are Better Than Corporate Publishers Because…”). I am not sure that this is actually the case. Publishing as it currently exists requires a number of specific skills that have nothing to do with librarian expertise. A number of universities already have dedicated university press publishing agencies. But of course the publishing process can and probably will change. There is the open access movement, there is the rebellion against large scientific publishers, and last but not least, there is the slow rise of nanopublications, which could revolutionise the form that scholarly publishing will take. In the future publishing can originate at the source, making use of all kinds of new technologies of linking different types of data into new forms of non-static publications. Universities or university libraries could play a role here. Again we see here the need for both subject specialists and generic technology.
Special and general
So what is the overall picture? Of the current academic library tasks, only a few may still be around in the university in the future: selection, acquisition, cataloguing (if any), reference desk, usage statistics, and only a small part actually requires traditional librarian skills. Together with the new service areas of content curation, linked data, research support and publishing, this is rather an odd collection of very different fields of expertise. There does not seem to be a nice matching set of tasks for one central university division, let alone a library.
But what all these areas have in common is that they depend on linking and coordination of data from different sources.
And another interesting conclusion is that virtually all of these areas have two distinct components:
- Discipline or subject specific expertise
- Generic technical and organisational data infrastructure
I see a new duality in the realm of information management in universities. Selection, content curation, reference desk, linking data, cataloguing and research support will all be the domain of subject specialists directly connected to departments responsible for teaching and research in specific disciplines. These discipline related services will depend on generic technological and organisational infrastructures, available inside and outside the university, maintained by generic technical specialists.
These generic infrastructures could function completely separately, or they could somehow be interlinked and coordinated by some central university organisational unit. This would make sense, because there is a lot of overlap in information between these areas. Some kind of central data coordination unit would make it possible to provide a lot more useful data services than can be imagined now. Also, usage statistics, acquisition and the potential new publishing framework, yes even the special collections, could benefit from a central data services unit.
Such a unit would be different from the existing university ICT department. The latter mainly provides generic hardware, network, storage and security, and is focused on the internal infrastructure, trying to keep out as much external traffic as possible.
The new unit would be targeted at providing data services, possibly built on top of the internal technical infrastructure, but mainly using existing external ones. And it is obvious that there is added value in cooperation with similar bodies outside the university.
“Data services” then stands for providing storage, use, reuse, creation and linking of internal and external metadata and datasets by means of system administration, tools selection and implementation, and explicitly also programming when needed.
Such a unit would up to a point resemble current library service providers like the German regional library consortia and service centres such as hbz, KOBV or GBV, or high level organisations like the Dutch National Library Catalogue project.
Paraphrasing the conclusion of my own SWIB12 talk: it is time to stop thinking publications and start thinking data. This way the academic library could transform itself into a new central data services hub.
(Subject expertise AND data infrastructure) OR else!
Posted on January 5th, 2012 33 comments
2011 has in a sense been the year of library linked data. Not that libraries of all kinds are now publishing and consuming linked data in great numbers. No. But we have witnessed the publication of the final report of the W3C Library Linked Data Incubator Group, the Library of Congress announcement of the new Bibliographic Framework for the Digital Age based on Linked Data and RDF, the release by a number of large libraries and library consortia of their bibliographic metadata, many publications, sessions and presentations on the subject.
All these events focus mainly on publishing library bibliographic metadata as linked open data. Personally I am not convinced that this is the most interesting type of data that libraries can provide. Bibliographic metadata as such describe publications, in the broadest sense, providing information about title, authors, subjects, editions, dates, urls, but also physical attributes like dimensions, number of pages, formats, etc. This type of information, in FRBR terms: Work, Expression and Manifestation metadata, is typically shared among a large number of libraries, publishers, booksellers, etc. ‘Shared’ in this case means ‘multiplied and redundantly stored in many different local systems‘. It doesn’t really make sense if all libraries in the world publish identical metadata side by side, does it?
In essence only really unique data is worth publishing. You link to the rest.
Currently, library data that is really unique and interesting is administrative information about holdings and circulation. After having found metadata about a potentially relevant publication it is very useful for someone to know how and where to get access to it, if it’s not freely available online. Do you need to go to a specific library location to get the physical item, or to have access to the online article? Do you have to be affiliated to a specific institution to be entitled to borrow or access it?
Usage data about publications, both print and digital, can be very useful in establishing relevance and impact. This way information seekers can be supported in finding the best possible publications for their specific circumstances. There are some interesting projects dealing with circulation data already, such as the research project by Magnus Pfeffer and Kai Eckert as presented at the SWIB 11 conference, and the JISC funded Library Impact Data project at the University of Huddersfield. The Ex Libris bX service presents article recommendations based on SFX usage log analysis.
The consequence of this assertion is that if libraries want to publish linked open data, they should focus on holdings and circulation data, and for the rest link to available bibliographic metadata as much as possible. It is to be expected that the Library of Congress’ New Bibliographic Framework will take care of that part one way or another.
In order to achieve this libraries should join forces with each other and with publishers and aggregators to put their efforts into establishing shared global bibliographic metadata pools accessible through linked open data. We can think of already existing data sources like WorldCat, OpenLibrary, Summon, Primo Central and the like. We can only hope that commercial bibliographic metadata aggregators like OCLC, SerialsSolutions and Ex Libris will come to realise that it’s in everybody’s interest to contribute to the realisation of the new Bibliographic Framework. The recent disagreement between OCLC and the Swedish National Library seems to indicate that this may take some time. For a detailed analysis of this see the blog post ‘Can linked library data disrupt OCLC? Part one’.
An interesting initiative in this respect is LibraryCloud, an open, multi-library data service that aggregates and delivers library metadata. And there is the HBZ LOBID project, which is targeted at ‘the conversion of existing bibliographic data and associated data to Linked Open Data‘.
So what would the new bibliographic framework look like? If we take the FRBR model as a starting point, the new framework could look something like this. See also my slideshow “Linked Open Data for libraries”, slides 39-42.
The basic metadata about a publication or a unit of content, on the FRBR Work level, would be an entry in a global datastore identified by a URI ( Uniform Resource Identifier). This datastore could for instance be WorldCat, or OpenLibrary, or even a publisher’s datastore. It doesn’t really matter. We don’t even have to assume it’s only one central datastore that contains all Work entries.
The thing identified by the URI would have a text string field associated with it containing the original title, let’s say “The Da Vinci Code” as an example of a book. But also articles can and should be identified this way. The basic information we need to know about the Work would be attached to it using URIs to other things in the linked data web. A set of two things linked by a URI is called a ‘triple’. ‘Author’ could for instance be a link to OCLC’s VIAF (http://viaf.org/viaf/102403515 = Dan Brown), which would then constitute a triple. If there are more authors, you simply add a URI for every person or institution. Subjects could be links to DBPedia/Wikipedia, Freebase, the Library of Congress Authority files, etc. There could be some more basic information, maybe a year, or a URI to a source describing the background of the work.
At the Expression level, a Dutch translation would have it’s own URI, stored in the same or another datastore. I could imagine that the publisher who commissioned the translation would maintain a datastore with this information. Attached to the Expression there would be the URI of the original Work, a URI pointing to the language, a URI identifying the translator and a text string contaning the Dutch title, among others.
Every individual edition of the work could have it’s own Manifestation level URI, with a link to the Expression (in this case the Dutch translation), a publisher URI, a year, etc. For articles published according to the long standing tradition of peer reviewed journals, there would also be information about the journal. On this level there should also be URIs to the actual content when dealing with digital objects like articles, ebooks, etc., no matter if access is free or restricted.
So far we have everything we need to know about publications “in the cloud”, or better: in a number of datastores available on a number of servers connected to the world wide web. This is more or less the situation described by OCLC’s Lorcan Dempsey in his recent post ‘Linking not typing … knowledge organization at the network level’. The only thing we need now is software to present all linked information to the user.
No libraries in sight yet. For accessing freely available digital content on the web you actually don’t need a library, unless you need professional assistance finding the correct and relevant information. Here we have identified a possible role of librarians in this new networked information model.
Now we have reached the interesting part: how to link local library data to this global shared model? We immediately discover that the original FRBR model is inadequate in this networked environment, because it implies a specific local library situation. Individual copies of a work (the Items) are directly linked to the Manifestation, because FRBR refers to the old local catalogue which describes only the works/publications one library actually owns.
In the global shared library linked data network we need an extra explicit level to link physical Items owned by the library or online subscriptions of the library to the appropriate shared network level. I suggest to use the “Holding” level. A Holding would have it’s own URI and contain URIs of the Manifestation and of the Library. A specific Holding in this way would indicate that a specific library has one or more copies (Items) of a specific edition of a work (Manifestation), or offers access to an online digital article by way of a subscription.
If a Holding refers to physical copies (print books or journal issues for instance) then we also need the Item level. An Item would have it’s own URI and the URI of the Holding. For each Item, extra information can be provided, for instance ‘availability’, ‘location’, etc. Local circulation administration data can be registered for all Holdings and Items. For online digital content we don’t need Items, only subscription information directly attached to the Holding.
Local Holding and Item information can reside on local servers within the library’s domain or just as well on some external server ‘in the cloud’.
It’s on the level of the Holding that usage statistics per library can be collected and aggregated, both for physical items and for digital material.
Now, this networked linked library data model still allows libraries to present a local traditional catalogue type interface, showing only information about the library’s own print and digital holdings. What’s needed is software to do this using the local Holdings as entry level.
But the nice thing about the model is that there will also be a lot of other options. It will also be possible to start at the other end and search all bibliographic metadata available in the shared global network, and then find the most appropriate library to get access to a specific publication, much like WorldCat does, but on an even larger scale.
Another nice thing of using triples, URIs and linked data, is that it allows for adding all kinds of other, non-traditional bibliographic links to the old inward looking library world, making it into a flexible and open model, ready for future developments. It will for instance be possible for people to discover links to publications and library holdings from any other location on the web, for instance a Wikipedia page or a museum website. And the other way around, from an item in local library holdings to let’s say a recorded theatre performance on YouTube.
When this new data and metadata framework will be in place, there will be two important issues to be solved:
- Getting new software, systems and tools for both back end administrative functions and front end information finding needs. For this we need efforts from traditional library systems vendors but also from developers in libraries.
- Establishing future roles for libraries, librarians and information professionals in the new framework. This may turn out to be the most important issue.
Posted on September 2nd, 2011 10 commentsShifting focus from information carriers back to information
Library catalogues have traditionally been used to describe and register books and journals and other physical objects that together constitute the holdings of a library. In an integrated library system (ILS), the public catalogue is combined with acquisition and circulation modules to administer the purchases of book copies and journal subscriptions on one side, and the loans to customers on the other side. The “I” for “Integrated” in ILS stands for an internal integration of traditional library workflows. Integration from a back end view, not from a customer perspective.
Because of the very nature of such a catalogue, namely the description of physical objects and the administration of processing them, there are no explicit relations between the different editions and translations of the same book, nor are there descriptions of individual journal articles. If you do a search on a specific person’s name, you may end up with a large number of result records, written by that person or someone with a similar name, or about that person, even with identical titles, without knowing if there is a relationship between them, and what that relationship might be. What’s certain is that you will not find journal articles written by or about that person. The same applies to a search on title. There is no way of telling if there is any relation between identical titles. A library catalogue user would have to look at specific metadata in the records (like MARC 76X-78X – Linking Entries, 534 – Original Version Note or 580 – Linking Entry Complexity Note), if available, to reach their own conclusions.
Most libraries nowadays also purchase electronic versions of books and journals (ebooks and ejournals) and have free or paid subscriptions to online databases. Sometimes these digital items (ebooks, ejournals and databases) are also entered into the traditional library catalogues, but they are sometimes also made available through other library systems, like federated search tools, integrated discovery tools, A-Z lists, etc. All kinds of combinations occur.
In traditional library catalogues digital items are treated exactly the same as their physical counterparts. They are all isolated individual items without relations. As Karen Coyle put it November 2010 at the SWIB10 conference: “The main goal of cataloguing today is to keep things apart” .
Basically, integrated library systems and traditional catalogues are nothing more than inventory and logistics systems for physical objects, mainly focused on internal workflows. Unfortunately in newer end user interfaces like federated search and integrated discovery tools the user experience in this respect has in general been similar to that of traditional public catalogues.
At some point in time during the rise of electronic online catalogues, apparently the lack of relations between different versions of the same original work became a problem. I’m not sure if it was library customers or librarians who started feeling the need to see these implicit connections made explicit. The fact is that IFLA (International Federation of Library Associations) started developing FRBR in 1998.
FRBR (Functional Requirements for Bibliographic Records) is an attempt to provide a model for describing the relations between physical publications, editions, copies and their common denominator, the Work.
FRBR Group 1 describes publications in terms of the entities Work, Expression, Manifestation and Item (WEMI).
FRAD (Functional Requirements for Authority Data – ‘authors’) and FRSAD (Functional Requirements for Subject Authority Data – ‘subjects’) have been developed later on as alternatives for the FRBR Group 2 and 3 entities.
As an example let’s have a look at The Diary of Anne Frank. The original handwritten diary may be regarded as the Work. There are numerous adaptations and translations (Expressions) of the original unfinished and unedited Work. Each of these Expressions can be published in the form of one or more prints, editions, etc. These are the Manifestations, especially if they have different ISBN’s. Finally a library can have one or more physical copies of a Manifestation, the Items.
Some might even say the actual physical diary is the only existing Item embodying one specific (the first) Expression of the Work (Anne’s thoughts) and/or the only Manifestation of that Expression.
Of course, this model, if implemented, would be an enormous improvement to the old public catalogue situation. It makes it possible for library customers to have an automatic overview of all editions, translations, adaptations of one specific original work through the mechanism of Expressions and Manifestations. RDA (Resource Description and Access) is exactly doing this.
However there are some significant drawbacks, because the FRBR model is an old model, based on the traditional way of library cataloguing of physical items (books, journals, and cd’s, dvd’s), etc. (Karen Coyle at SWIB10).
- In the first place the FRBR model only shows the Works and related Manifestations and Expressions of physical copies (Items) that the library in question owns. Editions not in the possession of the library are ignored. This would be a bit different in a union catalogue of course, but then the model still only describes the holdings of the participating libraries.
- Secondly, the focus on physical copies is also the reason that the original FRBR model does not have a place for journal titles as such, only for journal issues. So there will be as many entries for one journal as the library has issues of it.
- Thirdly, it’s a hierarchical model, which incorporates only relations from the Work top down. There is no room for relations like: ‘similar works’, ‘other material on the same subject’, ‘influenced by’, etc.
- In the fourth place, FRBR still does not look at content. It is document centric, instead of information centric. It does however have the option for describing parts of a Work, if they are considered separate entities/works, like journal articles or volumes of a trilogy.
- Finally, the FRBR Item entity is only interesting in a storage and logistics environment for physical copies, such as the Circulation function in libraries, or the Sales function in bookstores. It has no relation to content whatsoever.
FRBR definitely is a positive and necessary development, but it is just not good enough. Basically it still focuses on information carriers instead of information (it’s a set of rules for managing Bibliographic Records, not for describing Information). It is an introverted view of the world. This was OK as long as it was dictated by the prevailing technological, economical and social conditions.
In a new networked digital information world libraries should shift their focus back to their original objective: being gateways to information as such. This entails replacing an introverted hierarchical model with an extroverted networked one, and moving away from describing static information aggregates in favour of units of content as primary objects.
The linked data concept provides the framework of such a networked model. In this model anything can be related to anything, with explicit declarations of the nature of the relationship. In the example of the Diary of Anne Frank one could identify relations with movies and theater plays that are based on the diary, with people connected to the diary or with the background of World War 2, antisemitism, Amsterdam, etc.
In traditional library catalogues defining relations with movies or theater plays is not possible from the description of the book. They could however be entered as a textual reference in the description of a movie, if for instance a DVD of that movie is catalogued. Relations to people, World War 2, antisemitism and Amsterdam would be described as textual or coded references to a short concept description, which in turn could provide lists of other catalogue items indexed with these subjects.
In a networked linked data model these links could connect to information entities in their own right outside the local catalogue, containing descriptions and other material about the subject, and providing links to other related information entities.
FRBR would still be a valuable part of such a universal networked model, as a subset for a specific purpose. In the context of physical information carriers it is a useful model, although with some missing features, as described above. It could be used in isolation, as originally designed, but if it’s an open model, it would also provide the missing links and options to describe and find related information.
Also, the FRBR model is essential as a minimal condition for enabling links from library catalogue items to other entity types through the Work common denominator.
In a completely digital information environment, the model could be simplified by getting rid of the Item entity. Nobody needs to keep track of available copies of online digital information, unless publishers want to enforce the old business models they have been using in order to keep making a profit. Ebooks for instance are essentially Expressions or Manifestations, depending on their nature, as I stated in my post ’Is an e-book a book?’.
The FRBR model can be used and is used also in other subject areas, like music, theater performances, etc. The Work – Expression – Manifestation – Item hierarchy is applicable to a number of creative professions.
The networked model provides the option of describing all traditional library objects, but also other and new ones and even objects that currently don’t exist, because it is an open and adaptable model.
In the traditional library models it is for instance impossible, or at least very hard, to describe a story that continues through all volumes of a trilogy as a central thread, apart from and related to the descriptions of the three separate physical books and their own stories. In the Millennium trilogy by Stieg Larsson, Lisbeth Salander’s life story is the central thread, but it can’t be described as a separate “Work” in MARC/FRBR/RDA because it is not the main subject of one physical content carrier (unless we are dealing with an edition in one physical multi part volume). The three volumes will be described with the subjects ‘Missing girl mystery‘, ‘Sex trafficking‘ and ‘Illegal secret service unit‘ respectively.
In an open networked information model on the contrary it would be entirely possible to describe such a ‘roaming story’.
New forms of information objects could appear in the form of new types of aggregates, other than books or journal articles, for instance consisting of text, images, statistics and video, optionally of a flexible nature (dynamic instead of static information objects).
Existing library systems (ILS’s and Integrated Discovery tools alike), using bibliographic metadata formats and frameworks like MARC, FRBR and RDA, can’t easily deal with new developments without some sort of workaround. Obviously this means that if libraries want to continue playing a role in the information gateway world, they need completely different systems and technology. Library system vendors should take note of this.
Finally, instead of only describing information objects, libraries could take up a new role in creating new objects, in the form of subject based virtual information aggregates, like for instance the Anne Frank Timeline, or Qwiki.This would put libraries back in the center of the information access business.
Posted on March 28th, 2011 1 comment
The challenges of generating linked data from legacy databases
Some time ago I wrote a blog post about the linked data proof of concept project I am involved in, connecting bibliographic metadata from the OPAC of the Library of the University of Amsterdam with the theatre performances database maintained by the Theatre Institute of The Netherlands.
I ended that post with a list of next steps to take:
- select/adapt/create a vocabulary for the Production/Performance subject area
- select/adapt/create vocabularies for Persons (FOAF?) and Subjects (SKOS?)
- add internal relationships with the other entities (Play, Production, etc.) in the JSON structure (implement RDF in JSON)
- Add RDF/XML as output option, besides JSON
- add external relationships (to other linked data sources like DBPedia, etc.)
- extend the number of possible URI formats (for Play, Production, etc.)
- add content negotiation to serve both human and machine readable redirects
- extend the options on the OPAC side
- publish UBA bibliographic data as linked open data (probably an entirely new project)
So, what have we achieved so far? I can be brief about all the ‘real’ linked data stuff (RDF, vocabularies, external links, content negotiation): we are not there yet. This will be dealt with in the next phase.
Instead, we have focused on getting the simple JSON implementation right, both on the data publishing side and on the data using side. We have added more URIs and internal relationships, and we are using these in the OPAC user interface.
But we have also encountered a number of crucial problems that are in my view inherent to the type of legacy data models used in libraries and cultural heritage institutions.
First let me describe the improvements we have added so far.
The URI for ‘person’ <baseurl>/person/<personname> now also returns a link to all the ‘titles’ that person is connected to (not only with the ‘author’ role, but for all roles, like director, performer, etc.): <baseurl>/gettitles/<personname>. This link will return a set of URIs of the form <baseurl>/title/<personname>/<title>. The /<personname>/<title> bit is at the moment the only way that a more or less unique identifier can be constructed from the OPAC metadata for the ‘play’ in the TIN database. There are a number of really important problems related to this that I will discuss below.
returns among others:
/title/Beckett, Samuel/Waiting for Godot
/title/Beckett, Samuel/En attendant Godot
The URI for a ‘play’ <baseurl>/title/<personname>/<title> now returns a set of ‘production’ URIs of the form <baseurl>/production/<personname>/<title>/<openingdate>/<idnr>.
The ‘production’ URI returns information about ‘theatre company’, ‘venue‘ and all persons connected to that production, including their URIs, and when available also a link to an image of a poster, and a video.
<baseurl>/title/Beckett, Samuel/Waiting for Godot
/production/Beckett, Samuel/Waiting for Godot/1988-07-28/5777
/production/Beckett, Samuel/Waiting for Godot/1988-11-22/6750
/production/Beckett, Samuel/Waiting for Godot/1992-04-16/10728
/production/Beckett, Samuel/Waiting for Godot/1981-02-18/43032
The last ‘production’ URI returns:
“title”:”Waiting For Godot”,
“description”:”Beckett, Samuel (auteur: toneelspel van)”,
“description”:”Hartnett, John (regie)”,
“description”:”Muller, Frans (decor: ontwerp)”,
“description”:”Newell, Kym (licht: ontwerp)”,
“description”:”Zaal, Kees (geluid)”,
“description”:”Tolstoj, Alexander (uitvoerende: Lucky)”,
“description”:”Weeks, David (uitvoerende: Estragon)”,
“description”:”Coburn, Grant (uitvoerende: Vladimir)”,
“description”:”Evans, Rhys (uitvoerende: Pozzo)”,
“description”:”Geiringer, Karl (uitvoerende: A Boy)”,
“description”:”Guidi, Peter (uitvoering muziek)”,
“description”:”Kimmorley, Roxanne (uitvoering muziek)”,
“description”:”Vries, Hessel de (uitvoering muziek)”,
“uri”:”/person/Vries, Hessel de”
“description”:”Phillips, Margot (uitvoering muziek)”,
Now, the problems (or challenges) that we are facing here are essential to the core concept of linked data:
- we don’t have actual matching unique identifiers (URIs)
- we don’t have explicit internal relations with a common entity in both sources
- part of the data consists of literal strings in a specific language
These three problems are interrelated, they are linked problems, so to speak.
To start with the identifiers. Of course we have internal system identifiers in our local Aleph catalogue database. Because we contribute to the Dutch Union Catalogue (originally a PICA system, now OCLC), our bibliographic records also have national Dutch PICA identifiers. And because the Dutch Union Catalogue records are copied to WorldCat, these records in WorldCat also have OCLC numbers.
Also the Theatre Institute has internal system identifiers in their Adlib database. But at the moment we do not have a match between these separate internal identifier schemes. The Theatre Production database records are not in WorldCat because they’re not bibliographic records.
We are more or less forced to use the string values of the title and author fields to construct a usable URI, on both sides. Clearly this is the basis of lots of errors, because of the great number of possible variations in author and title descriptions.
But even if the Theatre Institute’s records were in the Union Catalogue or WorldCat as well, then we still would not have an automatic match without some kind of broker mechanism ascertaining that the library catalogue record describes the same thing as the theatre production database record. The same applies to the author, which of course should be a relation of the type “written by” between the play and a person record instead of string values. Both systems do have internal author or person authority files, but there is no direct matching. For authors this could theoretically be achieved by linking to an online person authority file like VIAF. But in the current situation this is not available.
This brings me to the second problem. The fact that we are using the string values of title instead of unique identifiers, means that we connect plays and productions with a specific title variety or language. In our current implementation this means that we are not able to link to all versions of one specific play.
For instance, from our OPAC the following URIs are constructed (two in English, one in French, one in Dutch):
/title/Beckett, Samuel/Waiting for Godot
/title/Beckett, Samuel/Waiting for Godot : a tragicomedy in two acts
/title/Beckett, Samuel/En attendant Godot : pièce en deux actes
/title/Beckett, Samuel/Wachten op Godot
In the Theatre Production database (two in English, four in Dutch, one in French, one in German):
/title/Beckett, Samuel/Waiting for Godot
/title/Beckett, Samuel/Waiting For Godot
/title/Beckett, Samuel/Wachten op Godot
/title/Beckett, Samuel/Wachtend op Godot
/title/Beckett, Samuel/Wachten op Godot (De favorieten)
/title/Beckett, Samuel/Wachten op Godot (eerste bedrijf)
/title/Beckett, Samuel/En attendant Godot
/title/Beckett, Samuel/Warten auf Godot
Only the first and fourth URI from the OPAC will find corresponding titles in the Theatre Production database. The second and third one, using a subtitle within the main title, don’t even have equivalents. And only two of the eight entries from the Theatre Production database have a match in the catalogue.
In a library catalogue environment we are used to this problem, because catalogues are used for describing physical objects in the form of editions and copies. Unfortunately, also the Theatre Production database just contains records describing productions of a specific ‘edition’ or translation of a play, with only the opening performance information attached.
This is where I need to talk about FRBR. Basically in a library catalogue environment this means that we should describe the relations between the ‘work’ (original text), the ‘expression’ (the version or translation), the ‘manifestation’ (edition, format, etc.) and the ‘items’ (the physical copies). Via the relations with higher level expression and work, the physical copy could be linked to the unifying work level, and then ideally through some universally valid unique identifier to, in our case, the theatre plays.
Although FRBR is a publication centered schema used only in libraries, the same concepts can be applied to theatre performances: the original work (which is the same as the work in a bibliographical sense) has expressions (adaptations, translations, etc.), which have manifestations (productions), and in the end the individual items (actual performances on a specific date, time and location).
If both the library catalogue and the theatre production database were FRBRised, we could in theory link on the Work level and cover all individual versions. But we would still need a matching mechanism on that Work level of course.
In reality however we can only try to link on the Manifestation level in an imperfect way.
At the moment, in our project, on the catalogue side we extract the title and author from the generated OPAC HTML. It could be an option to get available linking information form the internal MARC records (like the 240, 246, 765, 767, 775 tags), but that is not easy because of a number of reasons. Something similar could be done in the theatre production database, making implicit links explicit. But all this makes the effort to get something sensible out there much bigger.
The third problem, the literal strings in Dutch both in the library catalogue and in the theatre production database, prevents the effective use of the data in multilingual environments, equally in the traditional native interfaces and as linked data. Obviously for English speaking end users the Dutch terms mean nothing. And in a linked data environment the Dutch strings can’t easily be linked to other data, in Dutch, English, or any language, without unique identifiers.
Implicit to explicit
People calling on institutions to publish their data as linked open data tend to say it’s easy once you know how to do it . And of course it must be done. But if the published datasets have a flat internal structure designed to fulfill a specific local business objective, then they just don’t provide sufficient added value for third party use. In order to make your published open data useful for others, you have to make implicit relations explicit. And this requires something more than just making the data available in RDF ‘as is’, it requires a lot of processing.
Posted on August 20th, 2009 12 comments
On August 17, after I tested a search in our new Aleph OPAC and mentioned my surprise on Twitter, the following discussion unfolded between me (lukask), Ed Summers of the Library of Congress and Till Kinstler of GBV (German Union Library Network):
- lukask: Just found out we only have one item about RDF in our catalogue: http://tinyurl.com/lz75c4
- edsu: @lukask broaden that search http://is.gd/2l6vB
- lukask: @edsu Ha! Thanks! But I’m sure that RDF will be mentioned in these 29 titles! A case for social tagging!
- edsu: @lukask or better cataloging
- edsu: @lukask i guess they both amount to the same thing eh?
- lukask: @edsu That’s an interesting position…”social tagging=better cataloging”. I will ask my cataloguing co-workers about this specific example
- edsu: @lukask make sure to wear body-armor
- lukask: @edsu Yes I know! I will bring it up at tomorrow’s party for the celebration of our ALEPH STP (after some drinks…)
- tillk: @edsu @lukask or fulltext search… SCNR…
- edsu: @tillk yeah, totally — with projects like @googlebooks and @hathitrust we may look back on the age of cataloging with different eyes …
- lukask: @tillk @edsu Fulltext search yes, or “implicit automatic metadata generation”?
What happened here was:
- A problem with findability of specific bibliographic items was observed: although it is highly unlikely that books about the Semantic Web will not cover RDF-Resource Description Framework, none of the 29 titles found with “Semantic Web” could be found with the search term “Resource Description Framework“; on the other side, the only item found with “Resource Description Framework” was NOT found with “Semantic Web“. I must add that the “Semantic web” search was an “All words” search. Only 20 of the results were indexed with the Dutch subject heading “Semantisch web” (which term is never used in real life as far as I know; the English term is an international concept). Some results were off topic, they just happened to have “semantic” and “web” somewhere in their metadata. A better search would have been a phrase search (adjacent) with “semantic web” in actual quotes, which gives 26 items. But of these, a small number were not indexed with subject heading “Semantisch web“. Another note: searching with “RDF” gets you all kinds of results. Read more on the issue of searching and relevance in my post Relevance in context.
- Four possible solutions were suggested:
- social tagging
- better cataloging
- fulltext searching
- automatic metadata generation
Clearly, the 26 items found with the search “Semantic web” are not indexed by the “Resource description framework” or “RDF” subject heading. There is not even a subject heading for “Resource description framework” or “RDF“. In my personal view, from my personal context, this is an omission. Mind you, this is not only an issue in the catalogue of the Library of the University of Amsterdam, it is quite common. I tried it in the British Library Integrated Catalogue with similar results. Try it in your own OPAC!
I presume that our professional cataloging colleagues can’t know everything about all subjects. That is completely understandable. I would not know how to catalog a book about a medical subject myself either! But this is exactly the point. If you allow end users to add their own tags to your bibliographic records, you enhance the findability of these records for specific groups of end users.
I am not saying that cataloguing and indexing by library specialists using controlled vocabularies should be replaced by social tagging! No, not at all. I am just saying that both types of tagging/indexing are complementary. Sure, some of the tags added by end users may not follow cataloging standards, but who cares? Very often the end users adding tags of their own will be professional experts in their field. In any case, items with social tags will be found more often because specific end user groups can find them searching with their own terms.
I suppose Ed Summers was trying to say the same thing as I just did above, when he commented “or better cataloging, I guess they both amount to the same thing eh?“, which I summarised as “social tagging=better cataloging“, but he can correct me if I’m wrong.
Anyway, I hope I made it clear that I would not say “social tagging=better cataloging“, but rather “controlled vocabularies+social tagging=better cataloging“.
Or alternatively, could we improve cataloging by professional library catalogers? I must admit I do not know enough about library training and practice to say something about that. I am not a trained librarian. Don’t hesitate to comment!
Is fulltext searching the miracle cure for findability problems, as Till Kinstler seems to suggest? Maybe.
Suppose all our print material was completely digitised and available for fulltext search, I have no doubt that all 26 items mentioned above (the results of the “semantic web” all words search) would be found with the “resource description framework” or “rdf” search as well. But because fulltext search is by its very nature an “all words” search, the “rdf” fulltext search would also give a lot of “noise”, or items not having any relation to “semantic web” at all (author’s initials “R.D.F”, other acronyms “R.D.F.”, just see RDF in the BL catalogue). Again, see my post Relevance in context for an explanation of searching without context.
Also, there will be books or articles about a subject that will not contain the actual subject term at all. With fulltext search these items will not be found.
Moreover, fulltext searching actually limits the findable items to text, excluding other types, like images, maps, video, audio etc.
This brings me to the “final solution”:
Automatic metadata generation
Of course this is mostly still wishful thinking. But there are a number of interesting implementations already.
What I mean when I say “(implicit) automatic metadata generation” is: metadata that is NOT created deliberately by humans, but either generated and assigned as static metadata, or generated on the fly, by software, applying intelligent analysis to objects, of all types (text, images, audio, video, etc.).
In the case of our “rdf” example, such a tool would analyse a text and assign “rdf” as a subject heading based on the content and context of this text, even if the term “rdf” does not appear in the text at all. It would also discard texts containing the string “rdf” that refer to something completely different. Of course for this to succeed there should be some kind of contextual environment with links to other records or even other systems to be able to determine if certain terminology is linked to frequently used terms not mentioned in the text itself (here the Linked Data developments could play a major role).
The same principle should also apply to non-textual objects, so that images, audio, video etc. about the same subject can be found in one go. Google has some interesting implementations in this field already: image search by colour and content type: see for example the search for “rdf” in Google Images with colour “red”and content type “clip art”.
But of course there still needs a lot to be done.
Posted on August 11th, 2009 5 comments
If you do a search in a bibliographic database, you should find what you need, not just what you are looking for, or what the database “thinks” you are looking for. If you find what you are looking for, then you will not be surprised and you will not discover anything new. And that’s not what you want, is it? But if you find things you did not look for but also do not need, you’re not just surprised, you are confused! And that’s not what you want either.
You want the results that are the most relevant for your search, with your specific objectives, at that specific point in time time, for your specific circumstances, and you want them immediately.
So, how should search systems behave to make you find what you need? There are two conditions that need to be met:
- The search terms must be interpreted correctly
- The most relevant search results must be presented
First of all, let’s take a look at current practice.
Search systems cannot cope with ambiguous search terms. My favorite example and test search term is “voc“. This can stand for a number of things in various disciplines: V.O.C. (Dutch: “Verenigde Oostindische Compagnie” or “Dutch United East Indies Company”) in historical databases; “vocals” in musical databases; “volatile organic compounds” in physics databases. So if you do a search for “voc” in a standard library catalogue, you get all kinds of results. Even more so if you use a metasearch or federated search tool for searching several databases simultaneously.
You are confused. You would like the system to “understand” which one of these concepts you are referring to instead of just using the literal string. You would like the system to take into account your context.
In most databases search results can be sorted or filtered by a number of fields, most commonly by year, title, author, and also by more specific fields in dedicated databases. But unless you are interested in a specific year, author or title, this will not do. Recently many systems have implemented “faceted” and “clustered” browsing of results, enabling “drilling down” on specific terms or subjects. This basically comes down to setting the context after the fact.
But after the system has interpreted your search terms, the results should also be ordered in a specific way, the ones you need most should be on top. This is where “relevance ranking” of search results comes in. Most catalogues and databases use a system specific default relevance ranking algorithm. Search results are assigned a rank, based on a number of criteria, that can differ between databases, depending on the nature of the database.
Some databases just present the most recent results on top. For medical and physical sciences this may be right, but for history and literature databases this may just be wrong.
Sometimes the search terms are taken into account: the number of times the given search terms are present in the result fields is important, but also the specific fields in which search terms appear. The appearance of search terms in “Title” and “Subject” may rank higher than in “Abstract” or “Publisher”. Moreover, the search indexes used can have a major influence on rank: if you search for “Subject” = “flu”, then results with “flu” as subject will be ranked higher than results with “flu” in the title only.
To come back to my example, with ambiguous search terms like “voc” this type of relevance ranking will definitely not be enough, because results from the three different conceptual areas will be completely mixed up.
When searching with a metasearch or federated search tool things get even more complicated. Each of the remote databases that you search in has its own default way of ranking. Usually the metasearch tool fetches the first 30 or so results from each remote database (one set sorted by date, the other by internal rank, the next by title), merges these into one list and then applies its own local ranking mechanism to this subset only. Confusion! And I did not even mention searching databases with metadata in multiple languages. Moreover, databases containing only metadata will produce different results and relevance than databases with full text articles. There is absolutely no way of telling if you actually have the most relevant results for your situation.
Again, with relevance ranking search systems do not take into account the context either. You could say it is an introverted, internally focused way of ranking, the confusing results of which are multiplied in case of metasearching.
Most metasearch tools give users the option of searching in sets of pre-selected databases, based on subject or type. This way you can limit your search to those databases that are known to have data about that specific subject. You more or less set the context in advance. But this mechanism only eliminates results from databases that probably do not have data on your subject at all, so they would not have shown up in the results anyway. Moreover, the same issues that were discussed above apply to this limited set of databases.
The metasearch tool that I know best (MetaLib) offers the option of setting a relative rank per database, so results from databases with a higher rank will have a higher relevance in merged result sets. But this is a system wide option, set by system administration, so it is not taking into account any context at all. It would be better if you could make the relative database rank dependent on the set or subject the search is done from (for instance: if a history database is searched in the context of a “History” set, the results get a higher rank than in a search from a “Music” set).
The best solution for this “internal” relevance problem regarding distributed databases is a central database of harvested indexes. In this case all harvested metadata is normalised and ranked in a uniform way, and users do not have to select databases in advance. But these systems still do not take into account “external” relevance: there is no context!
A very interesting and intelligent solution for the problem of pre-selecting databases is provided in PurpleSearch, the integrated front end to MetaLib (among other things), developed by the Library of the University of Groningen. The system records which databases actually produce results for specific search terms. As soon as the user enters search terms in the single search box, the system knows which databases will have results, and the search is automatically carried out in these databases, without asking the user to select the databases or subject area he or she wants to search in. Simultaneously a background search in all other databases is performed in order to check additional new results, and the information about results in databases is updated.
Of course, all other usual options are available as well, like pre-selecting databases (setting context in advance) and faceted results drilling down (setting context after the fact). But again, no external contextual settings.
- Conclusion: the only way to find what you need, is to make search systems take into account the context in which the search is done, both for searching and for relevance ranking.
Now, let’s have a look at a couple of conditions that would make contextual searches possible.
Personal context: a system should “know” about your personal interests, field of study, job situation, age, etc. so it can “decide” which databases to search in and which results are the most relevant for you. Some systems, like university systems, have access to information about their users. Once you log in, the system potentially knows which subjects you are studying or teaching and could use this information for setting the context for searching and ranking.
But what if you are a student in Law AND Social Siences, which subject area should the system choose? Or: if you are a History teacher, and you have a personal interest in Ecology, which the system does not know about, what then? Somehow you still need to set context yourself.
Some systems also offer the opportunity of setting personal preferences, like: area of interest, specific databases, type of material (only digital or print), only recent material, etc. Again: you must be able to deviate from these preferences, depending on your situation, which means setting context manually.
Different search systems will have different user profiles (user data and preferences). It would be nice if search systems could take advantage of universal external personal profiles (like Google Profiles for instance) using some kind of universal API.
Situational context: a system should also “know” about the situation you are in, both in a functional sense and in a physical sense.
Functional context means: wich role are you playing? Are you in your law student role or in your social sciences student role? Are you in your professional role or in your private role? But also: to which resources do you have access?
An interesting idea: if you work Monday to Friday during office hours, study in the evenings and spend time on your personal interests on the weekends, it would be nice if you could link times of day and days of the week to your different roles, so search systems could use the correct context for your searches depending on time and date: “if it’s Tuesday evening then use study profile and search in ‘History’; if it’s Sunday, use private profile and search in ‘Ecology’“.
This temporal context was also referred to by Till Kinstler in a (German) blog post about the new “Suchkiste” search system prototype of the German Union Library Network (GBV): ‘the search for “Charlie Brown” in October should result in “It’s the Great Pumpkin, Charlie Brown” at number 1, and in December in “A Charlie Brown Christmas“‘.
Physical context means: where are you? It would be nice if a library catalogue search system would take into account your actual location, so it could show you the records of the copies of the FRBR-ized results available in the library locations nearest to you (this idea came up in a recent Twitter discussion between @librarianbe and @gbierens). This is what Worldcat does when you supply it with your location manually. In Worldcat this is a static preference. But it would be nice if it would respond to your actual location, for instance by using the GPS coordinates of your mobile phone. Alternatively, search systems could derive your location from the IP address you are sending your search from.
This information could also be used to determine if records for digital or physical copies should be ranked the most relevant in this case. If you are inside the library building and you have a preference for physical books and journals, then records for available print copies should be on top of the results list. If you are at home, then records for digital copies that you have access to should come first.
Contextual searching and ranking should always be a combination of all possible conditions, personal, situational and internal system ones.
Of course it goes without saying that it would be great if metasearch tools were able to convey the search context to the remote databases and get contextual results back, using some kind of universal serach context API!
Last but not least, each search system should show the context of the search, and explain how it got to the results in the presented order. Something like: based on your personal preferences, the time of day and day of the week, and your location, the search was done in these databases, with this subject area, and the physical copies of the nearest location are shown on top.
This context area on the results screen could then be used as a kind of inverted faceted search, drilling “up” to a broader level or “sideways” to another context.
Posted on June 19th, 2009 7 commentsLinked Data and bibliographic metadata models
Some time after I wrote “UMR – Unified Metadata Resources“, I came across Chris Keene’s post “Linked data & RDF : draft notes for comment“, “just a list of links and notes” about Linked Data, RDF and the Semantic Web, put together to start collecting information about “a topic that will greatly impact on the Library / Information management world“.
While reading this post and working my way through the links on that page, I started realising that Linked Data is exactly what I tried to describe as One single web page as the single identifier of every book, author or subject. I did mention Semantic Web, URI’s and RDF, but the term “Linked Data” as a separate protocol had escaped me.
The concept of Linked Data was described by Tim Berners Lee, the inventor of the World Wide Web. Whereas the World Wide Web links documents (pages, files, images), which are basically resources about things, (“Information Resources” in Semantic Web terms), Linked Data (or the Semantic Web) links raw data and real life things (“Non-Information Resources”).
There are several definitions of Linked Data on the web, but here is my attempt to give a simple definition of it (loosely based on the definition in Structured Dynamics’ Linked Data FAQ):Linked Data is a methodology for providing relationships between things (data, concepts and documents) anywhere on the web, using URI’s for identifying, RDF for describing and HTTP for publishing these things and relationships, in a way that they can be interpreted and used by humans and software.
I will try to illustrate the different aspects using some examples from the library world. The article is rather long, because of the nature of the subject, then again the individual sections are a bit short. But I do supply a lot of links for further reading.
Data is relationships
The important thing is that “data is relationships“, as Tim Berners Lee says in his recent presentation for TED.
Before going into relationships between things, I have to point out the important distinction between abstract concepts and real life things, which are “manifestations” of the concepts. In Object modeling these are called “classes” (abstract concepts, types of things) and “objects” (real life things, or “instances” of “classes“).
- the class book can have the instances/objects “Cloud Atlas“, “Moby Dick“, etc.
- the class person can have the instances/objects “David Mitchell“, “Herman Melville“, etc.
In the Semantic Web/RDF model the concept of triples is used to describe a relationship between two things: subject – predicate – object, meaning: a thing has a relation to another thing, in the broadest sense:
- a book (subject) is written by (predicate) a person (object)
You can also reverse this relationship:
- a person (subject) is the author of (predicate) a book (object)
The person in question is only an author because of his or her relationship to the book. The same person can also be a mother of three children, an employee of a library, and a speaker at a conference.
Moreover, and this is important: there can be more than one relationship between the same two classes or types of things. A book (subject) can also be about (predicate) a person (object). In this case the person is a “subject” of the book, that can be described by a “keyword”, “subject heading”, or whatever term is used. A special case would be a book, written by someone about himself (an autobiography).
The problem with most legacy systems, and library catalogues as an example of these, is that a record for let’s say a book contains one or more fields for the author (or at best a link to an entry in an authority file or thesaurus), and separately one or more fields for subjects. This way it is not possible to see books written by an author and books about the same author in one view, without using all kinds of workarounds, link resolvers or mash-ups.
Using two different relationships that link to the same thing would provide for an actual view or representation of the real world situation.
Another important option of Linked Data/RDF: a certain thing can have as a property a link to a concept (or “class”) , describing the nature of the thing: “object Cloud Atlas” has type “book“; “object David Mitchell” has type “person“; “object Cloud Atlas” is written by “object David Mitchell“.
And of course, the property/relationship/predicate can also link to a concept describing the nature of the link.
Anywhere on the web
So far so good. But you may argue that this relationship theory is not very new. Absolutely right, but up until now this data-relationship concept has mainly been used with a view to the inside, focused on the area of the specific information system in question, because of the nature and the limitations of the available technology and infrastructure.
The “triple” model is of course exactly the same as the long standing methodology of Entity Relationship Diagrams (ERD), with which relationships between entities (=”classes“) are described. An ERD is typically used to generate a database that contains data in a specific information system. But ERD’s could just as well be used to describe Linked Data on the web.
Information systems, such as library catalogs, have been, and still are, for the greatest part closed containers of data, or “silos” without connections between them, as Tim Berners Lee also mentions in his TED presentation.
Lots of these silo systems are accessible with web interfaces, but this does not mean that items in these closed systems with dedicated web front ends can be linked to items in other databases or web pages. Of course these systems can have API‘s that allow system developers to create scripts to get related information from other systems and incorporate that external information in the search results of the calling system. This is what is being done in web 2.0 with so-called mash-ups.
But in this situation you need developers who know how to make scripts using specific scripting languages for all the different proprietary API’s that are being supported for all the individual systems.
If Linked Data was a global standard and all open and closed systems and websites supported RDF, then all these links would be available automatically to RDF enabled browser and client software, using SPARQL, the RDF Query Language.
- Linked Data/RDF can be regarded as a universal API.
The good thing about Linked Data is, that it is possible to use Linked Data mechanisms to link to legacy data in silo databases. You just need to provide an RDF wrapper for the legacy system, like has been done with the Library of Congress Subject Headings.
Some examples of available tools for exposing legacy data as RDF:
- Triplify – a web applications plugin that converts relational database structures into RDF triples
- D2R Server – a tool for publishing relational databases on the Semantic Web
- wp-RDFa – a wordpress plugin that adds some RDF information about Author and Title to WordPress blog posts
Of course, RDF that is generated like this will very probably only expose objects to link TO, not links to RDF objects external to the system.
Also, Linked Data can be used within legacy systems, for mixing legacy and RDF data, open and closed access data, etc. In this case we have RDF triples that have a subject URI from one data source and an object URI from another data source. In a situation with interlinked systems it would for instance be possible to see that the author of a specific book (data from a library catalog) is also speaking at a specific conference (data from a conference website). Objects linked together on the web using RDF triples are also known as an “RDF graph”. With RDF-aware client software it is possible to navigate through all the links to retrieve additional information about an object.
URI’s (“Uniform Resource Identifiers”) are necessary for uniquely identifying and linking to resources on the web. A URI is basically a string that identifies a thing or resource on the web. All “Information Resources”, or WWW pages, documents, etc. have a URI, which is commonly known as a URL (Uniform Resource Locator).
With Linked Data we are looking at identifying “Non-information Resources” or “real world objects” (people, concepts, things, even imaginary things), not web pages that contain information about these real world objects. But it is a little more complicated than that. In order to honour the requirement that a thing and its relations can be interpreted and used by humans and software, we need at least 3 different representations of one resource (see: How to publish Linked Data on the web):
- Resource identifier URI (identifies the real world object, the concept, as such)
- RDF document URI (a document readable for semantic web applications, containing the real world object’s RDF data and relationships with other objects)
- HTML document URI (a document readable for humans, with information about the real world object)
For instance, there could be a Resource Identifier URI for a book called “Cloud Atlas“. The web resource at that URI can redirect an RDF enabled browser to the RDF document URI, which contains RDF data describing the book and its properties and relationships. A normal HTML web browser would be redirected to the HTML document URI, for instance a web page about the book at the publisher’s website.
There are several methods of redirecting browsers and application to the required representation of the resource. See Cool URIs for the Semantic Web for technical details.
There are also RDF enabled browsers that transform RDF into web pages readable by humans, like the FireFox addon “Tabulator“, or the web based Disco and Marbles browsers, both hosted at the Free University Berlin.
RDF, vocabularies, ontologies
RDF or Resource Description Framework, is, like the name suggests, just a framework. It uses XML (or a simpler non-XML method N3) to describe resources by means of relationships. RDF can be implemented in vocabularies or ontologies, which are sets of RDF classes describing objects and relationships for a given field.
Basically, anybody can create an RDF vocabulary by publishing an RDF document defining the classes and properties of the vocabulary, at a URI on the web. The vocabulary can then be used in a resource by referring to the namespace (the URI) and the classes in that RDF document.
A nice and useful feature of RDF is that more than one vocabularies can be mixed and used in one resource.
Also, a vocabulary itself can reference other vocabularies and thereby inherit well established classes and properties from other RDF documents.
Another very useful feature of RDF is that objects can be linked to similar object resources describing the same real world thing. This way confusion about which object we are talking about, can be avoided.
A couple of existing and well used RDF vocabularies/ontologies:
- RDF – the base RDF vocabulary
- RDFS (for RDF Schema)
- DC (for Dublin Core)
- FOAF (for FOAF- Friend of a Friend) – online identities and social networks
- SKOS (for SKOS – Simple Knowledge Organisation System) – thesauri, classification schemes, subject heading systems and taxonomies
- OWL (for OWL -Ontology Web Language)
(By the way, the links in the first column (to the RDF files themselves) may act as an illustration of the redirection mechanism described before. Some of them may link to either the RDF file with the vocabulary definition itself, or to a page about the vocabulary, depending on the type of browser you use: rdf-aware or not.)
A special case is:
<?xml version=”1.0″ encoding=”UTF-8″ ?>
- RDFa – a sort of microformat without a vocabulary of its own, which relies on other vocabularies for turning XHTML page attributes into RDF
<dc:publisher>Random House Trade Paperbacks</dc:publisher>
<dc:title>Cloud Atlas: A Novel</dc:title>
<rdfs:label>Cloud Atlas: A Novel</rdfs:label>
<rdfs:label>RDF document about the book: Cloud Atlas: A Novel</rdfs:label>
<rdfs:label>Review number 1 about: Cloud Atlas: A Novel</rdfs:label>
<rdfs:label>RDF Book Mashup</rdfs:label>
A partial view on this RDF file with the Marbles browser:
It seems obvious that Linked Data can be very useful in providing a generic infrastructure for linking data, metadata and objects, available in numerous types of data stores, in the online library world. With such a networked online data structure, it would be fairly easy to create all kinds of discovery interfaces for bibliographic data and objects. Moreover, it would also be possible to link to non-bibliographic data that might interest the users of these interfaces.
A brief and incomplete list of some library related Linked Data projects, some of which already mentioned above:
- RDF BookMashup – Integration of Web 2.0 data sources like Amazon, Google or Yahoo into the Semantic Web.
- Library of Congress Authorities – Exposing LoC Autorities and Vocabularies to the web using URI’s
- DBPedia – Exposing structured data from WikiPedia to the web
- LIBRIS – Linked Data interface to Swedish LIBRIS Union catalog
- Scriblio+Wordpress+Triplify – “A social, semantic OPAC Union Catalogue”
And what about MARC, AACR2 and RDA? Is there a role for them in the Linked Data environment? RDA is supposed to be the successor of AACR2 as a content standard that can be used with MARC, but also with other encoding standards like MODS or Dublin Core.
The RDA Entity Relationship Diagram, that incorporates FRBR as well, can of course easily be implemented as an RDF vocabulary, that could be used to create a universal Linked Data library network. It really does not matter what kind of internal data format the connected systems use.
Posted on May 15th, 2009 22 comments
Why use a non-normalised metadata exchange format for suboptimal data storage?
This week I had a nice chat with André Keyzer of Groningen University library and Peter van Boheemen of Wageningen University Library who attended OCLC’s Amsterdam Mashathon 2009. As can be expected from library technology geeks, we got talking about bibliographic metadata formats, very exciting of course. The question came up: what on earth could be the reason for storing bibliographic metadata in exchange formats like MARC?
Exactly my idea! As a matter of fact I think I may have used the same words a couple of times in recent years, probably even at ELAG2008. The thing is: it really does not matter how you store bibliographic metadata in your database, as long as you can present and exchange the data in any format requested, be it MARC or Dublin Core or anything else.
Of course the importance of using internationally accepted standards is beyond doubt, but there clearly exists widespread misunderstanding of the functions of certain standards, like for instance MARC. MARC is NOT a data storage format. In my opinion MARC is not even an exchange format, but merely a presentation format.
With a background and experience in data modeling, database and systems design (among others), I was quite amazed about bibliographic metadata formats when I started working with library systems in libraries, not having a librarian training at all. Of course, MARC (“MAchine Readable Cataloging record“) was invented as a standard in order to facilitate exchange of library catalog records in a digital era.
But I think MARC was invented by old school cataloguers who did not have a clue about data normalisation at all. A MARC record, especially if it corresponds to an official set of cataloging rules like AARC2, is nothing more than a digitised printed catalog card.
In pre-computer times it made perfect sense to have a standardised uniform way of registering bibliographic metadata on a printed card in this way. The catalog card was simultaneously used as a medium for presenting AND storing metadata. This is where the confusion originates from!
But when the Library of Congress says “If a library were to develop a “home-grown” system that did not use MARC records, it would not be taking advantage of an industry-wide standard whose primary purpose is to foster communication of information” it is saying just plain nonsense.
Actually it is better NOT to use something like MARC for other purposes than exchanging, or better, presenting data. To illustrate this I will give two examples of MARC tags that have been annoying me since my first day as a library employee:
- 100 – Main Entry-Personal Name (NR) – subfield $a – Personal name (NR)
- 773 – Host Item Entry (R) – subfield $g – Relationship information (R)
100 – Main Entry-Personal Name
Besides storing an author’s name as a string in each individual bibliographic record instead of using a code, linking to a central authority table (“foreign key” in relational database terms), it is also a mistake to use a person’s name as one complete string in one field. Examples on the Library of Congress MARC website use forms like “Adams, Henry”, “Fowler, T. M.” and “Blackbeard, Author of”. To take only the simple first example, this author could also be registered as “Henry Adams”, “Adams, H.”, “H. Adams”. And don’t say that these forms are not according to the rules! They are out there! There is no way to match these variations as being actually one and the same.
In a normalised relational database, this subfield $a would be stored something like this (simplified!):
- First name=Henry
773 – Host Item Entry
Subfield $g of this MARC tag is used for storing citation information for a journal article, volume, issue, year, start page, end page, all in one string, like: “Vol. 2, no. 2 (Feb. 1976), p. 195-230“. Again I have seen this used in many different ways. In a normalised format this would look something like this, using only the actual values:
- Start page=195
- End page=230
In a presentation of this normalised data record extra text can be added like “Vol.” or “Volume“, “Issue” or “No.“, brackets, replacing codes by descriptions (Month 2 = Feb.) etc., according to the format required. So the stored values could be used to generate the text “Vol. 2, no. 2 (Feb. 1976), p. 195-230” on the fly, but also for instance “Volume 2, Issue 2, dated February 1976, pages 195-230“.
The strange thing with this bibliographic format aimed at exchanging metadata is that it actually makes metadata exchange terribly complicated, especially with these two tags Author and Host Item. I can illustrate this with describing the way this exchange is handled between two digital library tools we use at the Library of the University of Amsterdam, MetaLib and SFX , both from the same vendor, Ex Libris.
The metasearch tool MetaLib is using the described and preferred mechanism of on the fly conversion of received external metadata from any format to MARC for the purpose of presentation.
But if we want to use the retrieved record to link to for instance a full text article using the SFX link resolver, the generated MARC data is used as a source and the non-normalised data in the 100 and 773 MARC tags has to be converted to the OpenURL format, which is actually normalised (example in simple OpenUrl 0.1):
isbn=;issn=0927-3255;date=1976; volume=2;issue=2;spage=195;epage=230; aulast=Adams;aufirst=Henry;auinit=;
In order to do this all kinds of regular expressions and scripting functions are needed to extract the correct values from the MARC author and citation strings. Wouldn’t it be convenient, if the record in MetaLib would already have been in OpenURL or any other normalised format?
The point I am trying to make is of course that it does not matter how metadata is stored, as long as it is possible to get the data out of the database in any format appropriate for the occasion. The SRU/SRW protocol is particularly aimed at precisely this: getting data out of a database in the required format, like MARC, Dublin Core, or anything else. An SRU server is a piece of middleware that receives requests, gets the requested data, converts the data and then returns the data in the requested format.
Currently at the Library of the University of Amsterdam we are migrating our ILS which also involves converting our data from one bibliographic metadata format (PICA+) to another (MARC). This is extremely complicated, especially because of the non-normalised structure of both formats. And I must say that in my opinion PICA+ is even the better one.
Also all German and Austrian libraries are meant to migrate from the MAB format to MARC, which also seems to be a move away from a superior format.
All because of the need to adhere to international standards, but with the wrong solution.
Maybe the projected new standard for resource description and access RDA will be the solution, but that may take a while yet.
Posted on April 12th, 2009 7 comments
One single web page as the single identifier of every book, author or subject
I like the concept of “the web as common publication platform for libraries“, and “every book its own url“, as described by Owen Stephens in two blog posts:
“Its time to change library systems ”
I’d suggest what we really need to think about is a common ‘publication’ platform – a way of all of our systems outputting records in a way that can then be easily accessed by a variety of search products – whether our own local ones, remote union ones, or even ones run by individual users. I’d go further and argue that platform already exists – it is the web!
and “The Future is Analogue ”
If every book in your catalogue had it’s own URL – essentially it’s own address on your web, you would have, in a single step, enabled anyone in the world to add metadata to the book – without making any changes to the record in your catalogue.
This concept of identifying objects by URL:Unified Resource Locator (or maybe better URI: Unified Resource Identifier) is central to the Semantic Web, that uses RDF (resource Description Framework) as a metadata model.
As a matter of fact at ELAG 2008 I saw Jeroen Hoppenbrouwers (“Rethinking Subject Access “) explaining his idea of doing the same for Subject Headings using the Semantic Web concept of triplets. Every subject its own URL or web page. He said: “It is very easy. You can start doing this right away“.
To make the picture complete we only need the third essential component: every author his or her or its own URL!
This ideal situation would have to conform to the Open Access guidelines of course. One single web page serving as the single identifier of every book, author or subject, available for everyone to link their own holdings, subscriptions, local keywords and circulation data to.
In real life we see a number of current initiatives on the web by commercial organisations and non commercial groups, mainly in the area of “books” (or rather “publications”) and “authors”. “Subjects” apparently is a less appealing area to start something like this, because obviously stand-alone “subjects” without anything to link them to are nothing at all, whereas you always have “publications” and “authors”, even without “subjects”. The only project I know of is MACS (Multilingual Acces to Subjects), which is hosted on Jeroen Hoppenbrouwers’ domain.
For publications we have OCLC’s WorldCat, Librarything, Open Library, to name just a few. And of course these global initiatives have had their regional and local counterparts for many years already (Union Catalogues, Consortia models). But this is again a typical example of multiple parallel data stores of the same type of entities. The idea apparently is that you want to store everything in one single database aiming to be complete, instead of the ideal situation of single individual URI’s floating around anywhere on the web.
Ex Libris’ new Unified Resource Management development (URM, and yes: the title of this blog post is an ironic allusion to that acronym), although it promotes sharing of metadata, it does this within another separate system into which metadata from other systems can be copied.
Of course, the ideal picture sketched above is much too simple. We have to be sure which version of a publication, which author and which translation of a subject for instance we are dealing with. For publications this means that we need to implement FRBR (in short: an original publication/work and all of its manifestations), for authors we need author names thesauri, for subjects multilingual access.
I have tried to illustrate this in this simplified and incomplete diagram:
In this model libraries can use their local URI-objects representing holdings and copies for their acquisitions and circulation management, while the bibliographic metadata stay out there in the global, open area. Libraries (and individuals of course) can also attach local keywords to the global metadata, which in turn can become available globally (“social tagging”).
It is obvious that the current initiatives have dealt with these issues with various levels of success. Some examples to illustrate this:
- Work: Desiderius Erasmus – Encomium Moriae (Greek), Laus Stultitiae (Latin), Lof der Zotheid (Dutch), Praise of Folly (English)
- Author: David Mitchell
- Erasmus in WorldCat Identities (one ID, many forms)
- David Mitchell in WorldCat Identities (one id per author)
- David Mitchell in VIAF (one id per author)
- Erasmus in OpenLibrary (one id, one incomplete form)
- Erasmus in VIAF (one id, although from The Netherlands, preferred forms are Swedish, French and German)
- Erasmus in Librarything (no identifier, numerous forms and occurrences)
- David Mitchell in Librarything (one form, “David Mitchell is composed of at least 12 distinct authors“, no way to distinguish)
- David Mitchell in OpenLibrary (one id for multiple authors)
- Erasmus “Praise of folly” in Librarything (numerous entries for all different title variations)
- Erasmus “Praise of folly” in OpenLibrary (numerous entries for all different title variations)
These findings seem to indicate that some level of coordination (which the commercial initiatives apparently have implemented better than the non-commercial ones) is necessary in order to achieve the goal of “one URI for each object”.
Who wants to start?