Relevance in context
Posted on August 11th, 2009
If you do a search in a bibliographic database, you should find what you need, not just what you are looking for, or what the database “thinks” you are looking for. If you find only what you are looking for, then you will not be surprised and you will not discover anything new. And that’s not what you want, is it? But if you find things you did not look for but also do not need, you’re not just surprised, you are confused! And that’s not what you want either.
You want the results that are the most relevant for your search, with your specific objectives, at that specific point in time, for your specific circumstances, and you want them immediately.
So, how should search systems behave to make you find what you need? There are two conditions that need to be met:
- The search terms must be interpreted correctly
- The most relevant search results must be presented
The Problem
First of all, let’s take a look at current practice.
Search systems cannot cope with ambiguous search terms. My favorite example and test search term is “voc”. This can stand for a number of things in various disciplines: V.O.C. (Dutch: “Verenigde Oostindische Compagnie” or “Dutch United East Indies Company”) in historical databases; “vocals” in musical databases; “volatile organic compounds” in physics databases. So if you do a search for “voc” in a standard library catalogue, you get all kinds of results. Even more so if you use a metasearch or federated search tool for searching several databases simultaneously.
You are confused. You would like the system to “understand” which one of these concepts you are referring to instead of just using the literal string. You would like the system to take into account your context.
In most databases search results can be sorted or filtered by a number of fields, most commonly by year, title, author, and also by more specific fields in dedicated databases. But unless you are interested in a specific year, author or title, this will not do. Recently many systems have implemented “faceted” and “clustered” browsing of results, enabling “drilling down” on specific terms or subjects. This basically comes down to setting the context after the fact.
But after the system has interpreted your search terms, the results should also be ordered in a specific way: the ones you need most should be on top. This is where “relevance ranking” of search results comes in. Most catalogues and databases use a system-specific default relevance ranking algorithm. Search results are assigned a rank based on a number of criteria, which can differ between databases depending on the nature of the database.
Some databases just present the most recent results on top. For medical and physical sciences this may be right, but for history and literature databases this may just be wrong.
Sometimes the search terms are taken into account: the number of times the given search terms are present in the result fields is important, but also the specific fields in which search terms appear. The appearance of search terms in “Title” and “Subject” may rank higher than in “Abstract” or “Publisher”. Moreover, the search indexes used can have a major influence on rank: if you search for “Subject” = “flu”, then results with “flu” as subject will be ranked higher than results with “flu” in the title only.
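To make this kind of field-based ranking concrete, here is a minimal Python sketch. The field names and the weights are my own assumptions for illustration, not the algorithm of any particular catalogue: a hit in “Title” or “Subject” simply counts for more than a hit in “Abstract” or “Publisher”.

```python
# Hypothetical field weights: the numbers are assumptions chosen for illustration.
FIELD_WEIGHTS = {"title": 3.0, "subject": 3.0, "abstract": 1.0, "publisher": 0.5}


def score(record: dict[str, str], term: str) -> float:
    """Sum the weighted number of times the term occurs in each field."""
    term = term.lower()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = record.get(field, "").lower().split()
        total += weight * words.count(term)
    return total


def rank(records: list[dict[str, str]], term: str) -> list[dict[str, str]]:
    """Return the records ordered from highest to lowest score."""
    return sorted(records, key=lambda r: score(r, term), reverse=True)


records = [
    {"title": "Seasonal illness", "abstract": "Notes on flu and flu vaccines"},
    {"title": "Flu", "subject": "flu"},
]
ranked = rank(records, "flu")
```

Note that nothing in this score knows anything about the person who is searching, which is exactly the limitation discussed here.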
To come back to my example, with ambiguous search terms like “voc” this type of relevance ranking will definitely not be enough, because results from the three different conceptual areas will be completely mixed up.
When searching with a metasearch or federated search tool things get even more complicated. Each of the remote databases that you search in has its own default way of ranking. Usually the metasearch tool fetches the first 30 or so results from each remote database (one set sorted by date, the other by internal rank, the next by title), merges these into one list and then applies its own local ranking mechanism to this subset only. Confusion! And I did not even mention searching databases with metadata in multiple languages. Moreover, databases containing only metadata will produce different results and relevance than databases with full text articles. There is absolutely no way of telling if you actually have the most relevant results for your situation.
Again, with relevance ranking search systems do not take into account the context either. You could say it is an introverted, internally focused way of ranking, the confusing results of which are multiplied in case of metasearching.
Most metasearch tools give users the option of searching in sets of pre-selected databases, based on subject or type. This way you can limit your search to those databases that are known to have data about that specific subject. You more or less set the context in advance. But this mechanism only eliminates results from databases that probably do not have data on your subject at all, so they would not have shown up in the results anyway. Moreover, the same issues that were discussed above apply to this limited set of databases.
The metasearch tool that I know best (MetaLib) offers the option of setting a relative rank per database, so results from databases with a higher rank will have a higher relevance in merged result sets. But this is a system wide option, set by system administration, so it is not taking into account any context at all. It would be better if you could make the relative database rank dependent on the set or subject the search is done from (for instance: if a history database is searched in the context of a “History” set, the results get a higher rank than in a search from a “Music” set).
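The idea of a set-dependent database rank could look something like the Python sketch below. This is not how MetaLib works; the boost table, the database names and the scores are all hypothetical.

```python
# Hypothetical boost table: (search set, database) -> multiplier.
CONTEXT_BOOST = {
    ("History", "history_db"): 2.0,
    ("Music", "music_db"): 2.0,
}


def contextual_score(base_score: float, database: str, search_set: str) -> float:
    """Scale a result's score by the boost its database gets in this search set."""
    return base_score * CONTEXT_BOOST.get((search_set, database), 1.0)


def merge(results: list[tuple[str, str, float]], search_set: str) -> list[str]:
    """Merge (title, database, base_score) results into one ranked list of titles."""
    ordered = sorted(
        results,
        key=lambda r: contextual_score(r[2], r[1], search_set),
        reverse=True,
    )
    return [title for title, _, _ in ordered]


hits = [
    ("VOC shipping records", "history_db", 1.0),
    ("Lead voc. arrangements", "music_db", 1.2),
]
from_history_set = merge(hits, "History")
from_music_set = merge(hits, "Music")
```

The same two results come out in a different order depending on the set the search was started from.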
The best solution for this “internal” relevance problem regarding distributed databases is a central database of harvested indexes. In this case all harvested metadata is normalised and ranked in a uniform way, and users do not have to select databases in advance. But these systems still do not take into account “external” relevance: there is no context!
A very interesting and intelligent solution for the problem of pre-selecting databases is provided in PurpleSearch, the integrated front end to MetaLib (among other things), developed by the Library of the University of Groningen. The system records which databases actually produce results for specific search terms. As soon as the user enters search terms in the single search box, the system knows which databases will have results, and the search is automatically carried out in these databases, without asking the user to select the databases or subject area he or she wants to search in. Simultaneously a background search in all other databases is performed in order to check additional new results, and the information about results in databases is updated.
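I do not know the internals of PurpleSearch, but the mechanism described above can be sketched in a few lines of Python. The class and method names are hypothetical, and a real system would of course persist this information instead of keeping it in memory.

```python
from collections import defaultdict


class HitIndex:
    """Remembers which databases returned results for which search term."""

    def __init__(self) -> None:
        self._hits: dict[str, set[str]] = defaultdict(set)

    def record(self, term: str, database: str, result_count: int) -> None:
        """Update the index after a foreground or background search."""
        key = term.lower()
        if result_count > 0:
            self._hits[key].add(database)
        else:
            self._hits[key].discard(database)

    def databases_for(
        self, term: str, all_databases: list[str]
    ) -> tuple[list[str], list[str]]:
        """Split databases into 'search now' and 'check in the background'."""
        known = self._hits.get(term.lower(), set())
        foreground = [db for db in all_databases if db in known]
        background = [db for db in all_databases if db not in known]
        return foreground, background


index = HitIndex()
index.record("voc", "history_db", 12)
index.record("voc", "music_db", 0)
now, later = index.databases_for("voc", ["history_db", "music_db", "chemistry_db"])
```

Here `now` holds the databases known to have results for “voc”, and `later` the ones that would be searched in the background to update the index.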
Of course, all other usual options are available as well, like pre-selecting databases (setting context in advance) and faceted results drilling down (setting context after the fact). But again, no external contextual settings.
Search "voc" in PurpleSearch
- Conclusion: the only way to find what you need, is to make search systems take into account the context in which the search is done, both for searching and for relevance ranking.
Solutions
Now, let’s have a look at a couple of conditions that would make contextual searches possible.
Personal context: a system should “know” about your personal interests, field of study, job situation, age, etc. so it can “decide” which databases to search in and which results are the most relevant for you. Some systems, like university systems, have access to information about their users. Once you log in, the system potentially knows which subjects you are studying or teaching and could use this information for setting the context for searching and ranking.
But what if you are a student in Law AND Social Sciences, which subject area should the system choose? Or: if you are a History teacher, and you have a personal interest in Ecology, which the system does not know about, what then? Somehow you still need to set context yourself.
Some systems also offer the opportunity of setting personal preferences, like: area of interest, specific databases, type of material (only digital or print), only recent material, etc. Again: you must be able to deviate from these preferences, depending on your situation, which means setting context manually.
Different search systems will have different user profiles (user data and preferences). It would be nice if search systems could take advantage of universal external personal profiles (like Google Profiles for instance) using some kind of universal API.
Situational context: a system should also “know” about the situation you are in, both in a functional sense and in a physical sense.
Functional context means: which role are you playing? Are you in your law student role or in your social sciences student role? Are you in your professional role or in your private role? But also: to which resources do you have access?
An interesting idea: if you work Monday to Friday during office hours, study in the evenings and spend time on your personal interests on the weekends, it would be nice if you could link times of day and days of the week to your different roles, so search systems could use the correct context for your searches depending on time and date: “if it’s Tuesday evening then use study profile and search in ‘History’; if it’s Sunday, use private profile and search in ‘Ecology’“.
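As a rough Python sketch of such a schedule: the hours, the profile names and the mapping to subjects are assumptions I made up for the example, and a real system would let you override the outcome at any time.

```python
from datetime import datetime
from typing import Optional

# Hypothetical rules: (weekdays, start hour, end hour, profile).
# Weekdays follow datetime.weekday(): Monday is 0, Sunday is 6.
SCHEDULE = [
    ({0, 1, 2, 3, 4}, 9, 17, "work"),
    ({0, 1, 2, 3, 4}, 18, 24, "study"),
    ({5, 6}, 0, 24, "private"),
]

# Subject areas taken from the example in the text.
PROFILE_SUBJECT = {"study": "History", "private": "Ecology"}


def profile_for(moment: datetime) -> Optional[str]:
    """Return the first profile whose rule matches, or None if no rule applies."""
    for weekdays, start, end, profile in SCHEDULE:
        if moment.weekday() in weekdays and start <= moment.hour < end:
            return profile
    return None


tuesday_evening = datetime(2009, 8, 11, 20, 0)
sunday = datetime(2009, 8, 16, 11, 0)
```

With these rules `profile_for(tuesday_evening)` selects the study profile, and the Sunday search falls under the private profile with “Ecology” as subject area.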
This temporal context was also referred to by Till Kinstler in a (German) blog post about the new “Suchkiste” search system prototype of the German Union Library Network (GBV): ‘the search for “Charlie Brown” in October should result in “It’s the Great Pumpkin, Charlie Brown” at number 1, and in December in “A Charlie Brown Christmas“‘.
Physical context means: where are you? It would be nice if a library catalogue search system would take into account your actual location, so it could show you the records of the copies of the FRBR-ized results available in the library locations nearest to you (this idea came up in a recent Twitter discussion between @librarianbe and @gbierens). This is what Worldcat does when you supply it with your location manually. In Worldcat this is a static preference. But it would be nice if it would respond to your actual location, for instance by using the GPS coordinates of your mobile phone. Alternatively, search systems could derive your location from the IP address you are sending your search from.
This information could also be used to determine if records for digital or physical copies should be ranked the most relevant in this case. If you are inside the library building and you have a preference for physical books and journals, then records for available print copies should be on top of the results list. If you are at home, then records for digital copies that you have access to should come first.
Contextual searching and ranking should always be a combination of all possible conditions, personal, situational and internal system ones.
Of course it goes without saying that it would be great if metasearch tools were able to convey the search context to the remote databases and get contextual results back, using some kind of universal search context API!
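The print-versus-digital rule described above could be sketched like this in Python. The record structure (`format`, `available`) is hypothetical, and how the system learns that you are inside the library building is left out entirely.

```python
def order_copies(copies: list[dict], in_library: bool, prefers_print: bool) -> list[dict]:
    """Available copies first; within those, the format that suits the location."""
    first = "print" if (in_library and prefers_print) else "digital"
    # False sorts before True, so available copies in the favoured format lead.
    return sorted(copies, key=lambda c: (not c["available"], c["format"] != first))


copies = [
    {"id": "e1", "format": "digital", "available": True},
    {"id": "p1", "format": "print", "available": True},
    {"id": "p2", "format": "print", "available": False},
]
in_the_library = [c["id"] for c in order_copies(copies, True, True)]
at_home = [c["id"] for c in order_copies(copies, False, True)]
```

Python's sort is stable, so copies that tie on both criteria keep the order the catalogue delivered them in.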
Last but not least, each search system should show the context of the search, and explain how it got to the results in the presented order. Something like: based on your personal preferences, the time of day and day of the week, and your location, the search was done in these databases, with this subject area, and the physical copies of the nearest location are shown on top.
This context area on the results screen could then be used as a kind of inverted faceted search, drilling “up” to a broader level or “sideways” to another context.
Linked Data for Libraries
Posted on June 19th, 2009
Linked Data and bibliographic metadata models
Some time after I wrote “UMR – Unified Metadata Resources”, I came across Chris Keene’s post “Linked data & RDF : draft notes for comment”, “just a list of links and notes” about Linked Data, RDF and the Semantic Web, put together to start collecting information about “a topic that will greatly impact on the Library / Information management world”.
While reading this post and working my way through the links on that page, I started realising that Linked Data is exactly what I tried to describe as One single web page as the single identifier of every book, author or subject. I did mention Semantic Web, URI’s and RDF, but the term “Linked Data” as a separate protocol had escaped me.
The concept of Linked Data was described by Tim Berners-Lee, the inventor of the World Wide Web. Whereas the World Wide Web links documents (pages, files, images), which are basically resources about things (“Information Resources” in Semantic Web terms), Linked Data (or the Semantic Web) links raw data and real life things (“Non-Information Resources”).
There are several definitions of Linked Data on the web, but here is my attempt to give a simple definition of it (loosely based on the definition in Structured Dynamics’ Linked Data FAQ):
Linked Data is a methodology for providing relationships between things (data, concepts and documents) anywhere on the web, using URI’s for identifying, RDF for describing and HTTP for publishing these things and relationships, in a way that they can be interpreted and used by humans and software.
I will try to illustrate the different aspects using some examples from the library world. The article is rather long, because of the nature of the subject, then again the individual sections are a bit short. But I do supply a lot of links for further reading.
Data is relationships
The important thing is that “data is relationships”, as Tim Berners-Lee says in his recent presentation for TED.
Before going into relationships between things, I have to point out the important distinction between abstract concepts and real life things, which are “manifestations” of the concepts. In Object modeling these are called “classes” (abstract concepts, types of things) and “objects” (real life things, or “instances” of “classes”).
Examples:
- the class book can have the instances/objects “Cloud Atlas“, “Moby Dick“, etc.
- the class person can have the instances/objects “David Mitchell“, “Herman Melville“, etc.
In the Semantic Web/RDF model the concept of triples is used to describe a relationship between two things: subject – predicate – object, meaning: a thing has a relation to another thing, in the broadest sense:
- a book (subject) is written by (predicate) a person (object)
You can also reverse this relationship:
- a person (subject) is the author of (predicate) a book (object)

Triple
The person in question is only an author because of his or her relationship to the book. The same person can also be a mother of three children, an employee of a library, and a speaker at a conference.
Moreover, and this is important: there can be more than one relationship between the same two classes or types of things. A book (subject) can also be about (predicate) a person (object). In this case the person is a “subject” of the book, which can be described by a “keyword”, “subject heading”, or whatever term is used. A special case would be a book written by someone about himself or herself (an autobiography).
The problem with most legacy systems, and library catalogues as an example of these, is that a record for, let’s say, a book contains one or more fields for the author (or at best a link to an entry in an authority file or thesaurus), and separately one or more fields for subjects. This way it is not possible to see books written by an author and books about the same author in one view, without using all kinds of workarounds, link resolvers or mash-ups.
Using two different relationships that link to the same thing would provide for an actual view or representation of the real world situation.
Another important option of Linked Data/RDF: a certain thing can have as a property a link to a concept (or “class”), describing the nature of the thing: “object Cloud Atlas” has type “book”; “object David Mitchell” has type “person”; “object Cloud Atlas” is written by “object David Mitchell”.
And of course, the property/relationship/predicate can also link to a concept describing the nature of the link.
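To make the triple idea a bit more tangible, here is a small Python sketch that stores triples as plain tuples and then asks for everything written by or about the same person in one go. The predicate labels are informal strings, not terms from an existing vocabulary, and the “is about” title is a hypothetical book invented for the example.

```python
Triple = tuple[str, str, str]  # (subject, predicate, object)

triples: list[Triple] = [
    ("Cloud Atlas", "has type", "book"),
    ("David Mitchell", "has type", "person"),
    ("Cloud Atlas", "is written by", "David Mitchell"),
    # Hypothetical title, added only to show the "is about" relationship.
    ("A Study of David Mitchell", "is about", "David Mitchell"),
]


def subjects_linked_to(obj: str, predicates: set[str]) -> list[tuple[str, str]]:
    """Return (subject, predicate) pairs for all triples pointing at obj."""
    return [(s, p) for s, p, o in triples if o == obj and p in predicates]


related = subjects_linked_to("David Mitchell", {"is written by", "is about"})
```

Because both relationships point at the same thing, one lookup gives the “by” and the “about” books together, which is the view that separate author and subject fields make so hard to produce.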
Anywhere on the web

ERD
So far so good. But you may argue that this relationship theory is not very new. Absolutely right, but up until now this data-relationship concept has mainly been used with a view to the inside, focused on the area of the specific information system in question, because of the nature and the limitations of the available technology and infrastructure.
The “triple” model is of course exactly the same as the long standing methodology of Entity Relationship Diagrams (ERD), with which relationships between entities (=”classes“) are described. An ERD is typically used to generate a database that contains data in a specific information system. But ERD’s could just as well be used to describe Linked Data on the web.
Information systems, such as library catalogs, have been, and still are, for the greatest part closed containers of data, or “silos” without connections between them, as Tim Berners-Lee also mentions in his TED presentation.
Lots of these silo systems are accessible with web interfaces, but this does not mean that items in these closed systems with dedicated web front ends can be linked to items in other databases or web pages. Of course these systems can have API‘s that allow system developers to create scripts to get related information from other systems and incorporate that external information in the search results of the calling system. This is what is being done in web 2.0 with so-called mash-ups.
But in this situation you need developers who know how to make scripts using specific scripting languages for all the different proprietary API’s that are being supported for all the individual systems.
If Linked Data were a global standard and all open and closed systems and websites supported RDF, then all these links would be available automatically to RDF enabled browser and client software, using SPARQL, the RDF Query Language.
- Linked Data/RDF can be regarded as a universal API.
The good thing about Linked Data is that it is possible to use Linked Data mechanisms to link to legacy data in silo databases. You just need to provide an RDF wrapper for the legacy system, as has been done with the Library of Congress Subject Headings.
Some examples of available tools for exposing legacy data as RDF:
- Triplify – a web applications plugin that converts relational database structures into RDF triples
- D2R Server – a tool for publishing relational databases on the Semantic Web
- wp-RDFa – a wordpress plugin that adds some RDF information about Author and Title to WordPress blog posts
Of course, RDF that is generated like this will very probably only expose objects to link TO, not links to RDF objects external to the system.
Also, Linked Data can be used within legacy systems, for mixing legacy and RDF data, open and closed access data, etc. In this case we have RDF triples that have a subject URI from one data source and an object URI from another data source. In a situation with interlinked systems it would for instance be possible to see that the author of a specific book (data from a library catalog) is also speaking at a specific conference (data from a conference website). Objects linked together on the web using RDF triples are also known as an “RDF graph”. With RDF-aware client software it is possible to navigate through all the links to retrieve additional information about an object.
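A toy version of the catalog-plus-conference example, again in Python with triples as tuples. All URIs below are hypothetical (example.org and example.net), and the predicate names are not taken from a real vocabulary.

```python
# Hypothetical data from two independent sources that share one person URI.
catalog = [
    (
        "http://example.org/books/cloud-atlas",
        "writtenBy",
        "http://example.org/persons/david-mitchell",
    ),
]
conference = [
    (
        "http://example.net/talks/keynote",
        "speaker",
        "http://example.org/persons/david-mitchell",
    ),
]

# Merging two RDF graphs amounts to taking the union of their triples.
graph = catalog + conference


def talks_by_author_of(book_uri: str, triples: list[tuple[str, str, str]]) -> list[str]:
    """Follow book -> author -> talk links across both data sources."""
    authors = {o for s, p, o in triples if s == book_uri and p == "writtenBy"}
    return [s for s, p, o in triples if p == "speaker" and o in authors]


talks = talks_by_author_of("http://example.org/books/cloud-atlas", graph)
```

The link only works because both sources use the same URI for the person, which is why the identifiers discussed in the next section matter so much.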

Linked Data
URI’s
URI’s (“Uniform Resource Identifiers”) are necessary for uniquely identifying and linking to resources on the web. A URI is basically a string that identifies a thing or resource on the web. All “Information Resources”, or WWW pages, documents, etc. have a URI, which is commonly known as a URL (Uniform Resource Locator).
With Linked Data we are looking at identifying “Non-information Resources” or “real world objects” (people, concepts, things, even imaginary things), not web pages that contain information about these real world objects. But it is a little more complicated than that. In order to honour the requirement that a thing and its relations can be interpreted and used by humans and software, we need at least 3 different representations of one resource (see: How to publish Linked Data on the web):
- Resource identifier URI (identifies the real world object, the concept, as such)
- RDF document URI (a document readable for semantic web applications, containing the real world object’s RDF data and relationships with other objects)
- HTML document URI (a document readable for humans, with information about the real world object)

Redirection
For instance, there could be a Resource Identifier URI for a book called “Cloud Atlas“. The web resource at that URI can redirect an RDF enabled browser to the RDF document URI, which contains RDF data describing the book and its properties and relationships. A normal HTML web browser would be redirected to the HTML document URI, for instance a web page about the book at the publisher’s website.
There are several methods of redirecting browsers and applications to the required representation of the resource. See Cool URIs for the Semantic Web for technical details.
There are also RDF enabled browsers that transform RDF into web pages readable by humans, like the Firefox add-on “Tabulator”, or the web based Disco and Marbles browsers, both hosted at the Free University Berlin.
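One of these methods is an HTTP 303 (“See Other”) redirect combined with content negotiation on the Accept header. The Python sketch below only shows the decision itself, not a web server. The `/id/`, `/data/` and `/page/` URI layout on example.org is an assumption, and real content negotiation would also look at the quality values in the header.

```python
def redirect_target(resource_uri: str, accept_header: str) -> tuple[int, str]:
    """Choose the document to redirect to for a request on a resource URI."""
    name = resource_uri.rsplit("/", 1)[-1]
    wants_rdf = "application/rdf+xml" in accept_header.lower()
    document = "data" if wants_rdf else "page"
    # 303 tells the client: the thing itself is not a document, look over there.
    return 303, f"http://example.org/{document}/{name}"


rdf_client = redirect_target(
    "http://example.org/id/cloud-atlas", "application/rdf+xml"
)
html_browser = redirect_target(
    "http://example.org/id/cloud-atlas", "text/html,application/xhtml+xml"
)
```

An RDF enabled client ends up at the RDF document URI, a normal browser at the HTML document URI, and both started from the same resource identifier.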
RDF, vocabularies, ontologies
RDF or Resource Description Framework, is, like the name suggests, just a framework. It uses XML (or a simpler non-XML method N3) to describe resources by means of relationships. RDF can be implemented in vocabularies or ontologies, which are sets of RDF classes describing objects and relationships for a given field.
Basically, anybody can create an RDF vocabulary by publishing an RDF document defining the classes and properties of the vocabulary, at a URI on the web. The vocabulary can then be used in a resource by referring to the namespace (the URI) and the classes in that RDF document.
A nice and useful feature of RDF is that more than one vocabulary can be mixed and used in one resource.
Also, a vocabulary itself can reference other vocabularies and thereby inherit well established classes and properties from other RDF documents.
Another very useful feature of RDF is that objects can be linked to similar object resources describing the same real world thing. This way confusion about which object we are talking about can be avoided.
A couple of existing and well used RDF vocabularies/ontologies:
- RDF – the base RDF vocabulary
- RDFS (for RDF Schema)
- DC (for Dublin Core)
- FOAF (for FOAF – Friend of a Friend) – online identities and social networks
- SKOS (for SKOS – Simple Knowledge Organisation System) – thesauri, classification schemes, subject heading systems and taxonomies
- OWL (for OWL – Web Ontology Language)
(By the way, the links in the first column (to the RDF files themselves) may act as an illustration of the redirection mechanism described before. Some of them may link to either the RDF file with the vocabulary definition itself, or to a page about the vocabulary, depending on the type of browser you use: rdf-aware or not.)
A special case is:
- RDFa – a sort of microformat without a vocabulary of its own, which relies on other vocabularies for turning XHTML page attributes into RDF
Example
A shortened example for “Cloud Atlas” by David Mitchell from the RDF BookMashup at the Free University Berlin, which uses a number of different vocabularies:
<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  …
  xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <rdf:Description rdf:about="http://www4.wiwiss.fu-berlin.de/bookmashup/books/0375507256">
    <rev:hasReview rdf:resource="http://www4.wiwiss.fu-berlin.de/bookmashup/reviews/0375507256_EditorialReview1"/>
    <dc:creator rdf:resource="http://www4.wiwiss.fu-berlin.de/bookmashup/persons/David+Mitchell"/>
    <dc:format>Paperback</dc:format>
    <dc:identifier rdf:resource="urn:ISBN:0375507256"/>
    <dc:publisher>Random House Trade Paperbacks</dc:publisher>
    <dc:title>Cloud Atlas: A Novel</dc:title>
  </rdf:Description>
  <scom:Book rdf:about="http://www4.wiwiss.fu-berlin.de/bookmashup/books/0375507256">
    <rdfs:label>Cloud Atlas: A Novel</rdfs:label>
    <skos:subject rdf:resource="http://www4.wiwiss.fu-berlin.de/bookmashup/subject/Fantasy+fiction"/>
    <skos:subject rdf:resource="http://www4.wiwiss.fu-berlin.de/bookmashup/subject/Fate+and+fatalism"/>
    …
    <foaf:depiction rdf:resource="http://ecx.images-amazon.com/images/I/51MIVHgJP%2BL.jpg"/>
    <foaf:thumbnail rdf:resource="http://ecx.images-amazon.com/images/I/51MIVHgJP%2BL._SL75_.jpg"/>
  </scom:Book>
  <rdf:Description rdf:about="http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/0375507256">
    <dc:license rdf:resource="http://www.amazon.com/AWS-License-home-page-Money/b/ref=sc_fe_c_0_12738641_12/102-8791790-9885755?ie=UTF8&node=3440661&no=12738641&me=A36L942TSJ2AJA"/>
    <dc:license rdf:resource="http://www.google.com/terms_of_service.html"/>
  </rdf:Description>
  <foaf:Document rdf:about="http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/0375507256">
    <rdfs:label>RDF document about the book: Cloud Atlas: A Novel</rdfs:label>
    <foaf:maker rdf:resource="http://www4.wiwiss.fu-berlin.de/is-group/resource/projects/Project10"/>
    <foaf:primaryTopic rdf:resource="http://www4.wiwiss.fu-berlin.de/bookmashup/books/0375507256"/>
  </foaf:Document>
  <rdf:Description rdf:about="http://www4.wiwiss.fu-berlin.de/bookmashup/persons/David+Mitchell">
    <rdfs:label>David Mitchell</rdfs:label>
  </rdf:Description>
  <rdf:Description rdf:about="http://www4.wiwiss.fu-berlin.de/bookmashup/reviews/0375507256_EditorialReview1">
    <rdfs:label>Review number 1 about: Cloud Atlas: A Novel</rdfs:label>
  </rdf:Description>
  <rdf:Description rdf:about="http://www4.wiwiss.fu-berlin.de/is-group/resource/projects/Project10">
    <rdfs:label>RDF Book Mashup</rdfs:label>
  </rdf:Description>
</rdf:RDF>
A partial view on this RDF file with the Marbles browser:
See also the same example in the Disco RDF browser.
Library implementations
It seems obvious that Linked Data can be very useful in providing a generic infrastructure for linking data, metadata and objects, available in numerous types of data stores, in the online library world. With such a networked online data structure, it would be fairly easy to create all kinds of discovery interfaces for bibliographic data and objects. Moreover, it would also be possible to link to non-bibliographic data that might interest the users of these interfaces.
A brief and incomplete list of some library related Linked Data projects, some of which were already mentioned above:
- RDF BookMashup – Integration of Web 2.0 data sources like Amazon, Google or Yahoo into the Semantic Web.
- Library of Congress Authorities – Exposing LoC Authorities and Vocabularies to the web using URI’s
- DBpedia – Exposing structured data from Wikipedia to the web
- LIBRIS – Linked Data interface to Swedish LIBRIS Union catalog
- Scriblio+Wordpress+Triplify – “A social, semantic OPAC Union Catalogue”
And what about MARC, AACR2 and RDA? Is there a role for them in the Linked Data environment? RDA is supposed to be the successor of AACR2 as a content standard that can be used with MARC, but also with other encoding standards like MODS or Dublin Core.
The RDA Entity Relationship Diagram, which incorporates FRBR as well, can of course easily be implemented as an RDF vocabulary that could be used to create a universal Linked Data library network. It really does not matter what kind of internal data format the connected systems use.
Who needs MARC?
Posted on May 15th, 2009
Why use a non-normalised metadata exchange format for suboptimal data storage?
This week I had a nice chat with André Keyzer of Groningen University library and Peter van Boheemen of Wageningen University Library who attended OCLC’s Amsterdam Mashathon 2009. As can be expected from library technology geeks, we got talking about bibliographic metadata formats, very exciting of course. The question came up: what on earth could be the reason for storing bibliographic metadata in exchange formats like MARC?
Being asked once at an ELAG conference about the bibliographic format Wageningen University was using in their home grown catalog system, Peter answered: “WDC” … “we don’t care”.
Exactly my idea! As a matter of fact I think I may have used the same words a couple of times in recent years, probably even at ELAG2008. The thing is: it really does not matter how you store bibliographic metadata in your database, as long as you can present and exchange the data in any format requested, be it MARC or Dublin Core or anything else.
Of course the importance of using internationally accepted standards is beyond doubt, but there clearly exists widespread misunderstanding of the functions of certain standards, like for instance MARC. MARC is NOT a data storage format. In my opinion MARC is not even an exchange format, but merely a presentation format.

St. Marc Express
With a background and experience in data modeling, database and systems design (among others), I was quite amazed about bibliographic metadata formats when I started working with library systems in libraries, not having a librarian training at all. Of course, MARC (“MAchine Readable Cataloging record“) was invented as a standard in order to facilitate exchange of library catalog records in a digital era.
But I think MARC was invented by old school cataloguers who did not have a clue about data normalisation at all. A MARC record, especially if it corresponds to an official set of cataloging rules like AACR2, is nothing more than a digitised printed catalog card.
In pre-computer times it made perfect sense to have a standardised uniform way of registering bibliographic metadata on a printed card in this way. The catalog card was simultaneously used as a medium for presenting AND storing metadata. This is where the confusion originates!

MARC record
But when the Library of Congress says “If a library were to develop a “home-grown” system that did not use MARC records, it would not be taking advantage of an industry-wide standard whose primary purpose is to foster communication of information” it is saying just plain nonsense.
Actually it is better NOT to use something like MARC for other purposes than exchanging, or better, presenting data. To illustrate this I will give two examples of MARC tags that have been annoying me since my first day as a library employee:
- 100 – Main Entry-Personal Name (NR) – subfield $a – Personal name (NR)
- 773 – Host Item Entry (R) – subfield $g – Relationship information (R)
100 – Main Entry-Personal Name
Besides storing an author’s name as a string in each individual bibliographic record instead of using a code, linking to a central authority table (“foreign key” in relational database terms), it is also a mistake to use a person’s name as one complete string in one field. Examples on the Library of Congress MARC website use forms like “Adams, Henry”, “Fowler, T. M.” and “Blackbeard, Author of”. To take only the simple first example, this author could also be registered as “Henry Adams”, “Adams, H.”, “H. Adams”. And don’t say that these forms are not according to the rules! They are out there! There is no way to match these variations as being actually one and the same.
In a normalised relational database, this subfield $a would be stored something like this (simplified!):
- Person
- Surname=Adams
- First name=Henry
- Prefix=
- …
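As a small illustration in Python: once the name parts are stored separately, all the variants mentioned above can be generated from one and the same record instead of having to be matched afterwards. The class below is a hypothetical sketch, not an existing data model, and it leaves the prefix out of the generated forms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    surname: str
    first_name: str
    prefix: str = ""  # stored, but not used in the forms below

    def display_forms(self) -> list[str]:
        """Generate the presentation variants from the stored name parts."""
        initial = f"{self.first_name[0]}."
        return [
            f"{self.surname}, {self.first_name}",
            f"{self.first_name} {self.surname}",
            f"{self.surname}, {initial}",
            f"{initial} {self.surname}",
        ]


adams = Person(surname="Adams", first_name="Henry")
forms = adams.display_forms()
```

Bibliographic records would then refer to this one person record by its key, rather than each carrying their own copy of the name string.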
773 – Host Item Entry
Subfield $g of this MARC tag is used for storing citation information for a journal article (volume, issue, year, start page, end page), all in one string, like: “Vol. 2, no. 2 (Feb. 1976), p. 195-230“. Again I have seen this used in many different ways. In a normalised format this would look something like this, using only the actual values:
- Journal
  - Volume=2
  - Issue=2
  - Year=1976
  - Month=2
  - Day=
  - Start page=195
  - End page=230
In a presentation of this normalised data record extra text can be added like “Vol.” or “Volume“, “Issue” or “No.“, brackets, replacing codes by descriptions (Month 2 = Feb.) etc., according to the format required. So the stored values could be used to generate the text “Vol. 2, no. 2 (Feb. 1976), p. 195-230” on the fly, but also for instance “Volume 2, Issue 2, dated February 1976, pages 195-230“.
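The following Python sketch illustrates this on-the-fly presentation. The `Citation` class, its field names and the two formatting functions are assumptions made for the example only. The stored values are the bare numbers from the list above, and all labels, brackets and month names are added at presentation time.

```python
from dataclasses import dataclass
from typing import Optional

# Presentation-only lookup tables: month code -> description (index 0 unused).
MONTH_ABBR = ("", "Jan.", "Feb.", "Mar.", "Apr.", "May", "June",
              "July", "Aug.", "Sept.", "Oct.", "Nov.", "Dec.")
MONTH_NAME = ("", "January", "February", "March", "April", "May", "June",
              "July", "August", "September", "October", "November", "December")


@dataclass(frozen=True)
class Citation:
    """Hypothetical normalised host item record: values only, no labels."""
    volume: int
    issue: int
    year: int
    start_page: int
    end_page: int
    month: Optional[int] = None
    day: Optional[int] = None


def short_form(c: Citation) -> str:
    date = f"{MONTH_ABBR[c.month]} {c.year}" if c.month else str(c.year)
    return f"Vol. {c.volume}, no. {c.issue} ({date}), p. {c.start_page}-{c.end_page}"


def long_form(c: Citation) -> str:
    date = f"{MONTH_NAME[c.month]} {c.year}" if c.month else str(c.year)
    return (f"Volume {c.volume}, Issue {c.issue}, dated {date}, "
            f"pages {c.start_page}-{c.end_page}")


article = Citation(volume=2, issue=2, year=1976, month=2,
                   start_page=195, end_page=230)
print(short_form(article))  # Vol. 2, no. 2 (Feb. 1976), p. 195-230
print(long_form(article))   # Volume 2, Issue 2, dated February 1976, pages 195-230
```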
The strange thing with this bibliographic format aimed at exchanging metadata is that it actually makes metadata exchange terribly complicated, especially with these two tags, Author and Host Item. I can illustrate this by describing the way this exchange is handled between two digital library tools we use at the Library of the University of Amsterdam, MetaLib and SFX, both from the same vendor, Ex Libris.
The metasearch tool MetaLib uses the described and preferred mechanism of on the fly conversion of received external metadata from any format to MARC for the purpose of presentation.
But if we want to use the retrieved record to link to, for instance, a full text article using the SFX link resolver, the generated MARC data is used as a source, and the non-normalised data in the 100 and 773 MARC tags has to be converted to the OpenURL format, which is actually normalised (example in simple OpenURL 0.1):
isbn=;issn=0927-3255;date=1976;volume=2;issue=2;spage=195;epage=230;aulast=Adams;aufirst=Henry;auinit=;
In order to do this, all kinds of regular expressions and scripting functions are needed to extract the correct values from the MARC author and citation strings. Wouldn’t it be convenient if the record in MetaLib were already in OpenURL or any other normalised format?
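To give an idea of what such an extraction looks like, here is a small Python sketch. It is not the actual MetaLib or SFX code; the function names and the regular expression are assumptions written for this one citation pattern. The key names (issn, volume, issue, date, spage, epage, aulast, aufirst) are the ones from the OpenURL example above, joined here with the usual "&" query-string separator.

```python
import re
from urllib.parse import urlencode

# Matches only strings shaped like "Vol. 2, no. 2 (Feb. 1976), p. 195-230".
CITATION_RE = re.compile(
    r"Vol\.\s*(?P<volume>\d+),\s*no\.\s*(?P<issue>\d+)\s*"
    r"\((?:[^)]*?\s)?(?P<date>\d{4})\),\s*p\.\s*(?P<spage>\d+)-(?P<epage>\d+)"
)


def author_fields(name_100a: str) -> dict:
    """Split an inverted name such as 'Adams, Henry' on the first comma."""
    last, _, first = name_100a.partition(",")
    return {"aulast": last.strip(), "aufirst": first.strip()}


def marc_to_openurl(name_100a: str, host_773g: str, issn: str = "") -> str:
    """Build an OpenURL 0.1 style query string from two MARC strings."""
    match = CITATION_RE.search(host_773g)
    if match is None:
        # Every cataloguing variation needs its own pattern.
        raise ValueError(f"unrecognised citation string: {host_773g!r}")
    fields = {"issn": issn, **match.groupdict(), **author_fields(name_100a)}
    return urlencode(fields)


print(marc_to_openurl("Adams, Henry",
                      "Vol. 2, no. 2 (Feb. 1976), p. 195-230",
                      issn="0927-3255"))
```

The pattern only recognises this one way of writing a citation. A string like “Volume 2, Issue 2, dated February 1976, pages 195-230” raises an error, and a name like “Blackbeard, Author of” is split into nonsense, which is exactly the fragility described above.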
The point I am trying to make is of course that it does not matter how metadata is stored, as long as it is possible to get the data out of the database in any format appropriate for the occasion. The SRU/SRW protocol is particularly aimed at precisely this: getting data out of a database in the required format, like MARC, Dublin Core, or anything else. An SRU server is a piece of middleware that receives requests, gets the requested data, converts the data and then returns the data in the requested format.
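As a sketch of what this looks like from the client side, the Python snippet below builds two SRU searchRetrieve requests that differ only in the requested record schema. The parameter names (operation, version, query, recordSchema, maximumRecords) are standard SRU 1.1 request parameters. The base URL is a made-up placeholder, and whether a given server supports the schema names "marcxml" and "dc" or the CQL index dc.creator is an assumption that has to be checked against that server's explain record.

```python
from urllib.parse import urlencode

# Hypothetical SRU endpoint, for illustration only.
BASE_URL = "https://sru.example.org/catalogue"


def sru_search_url(base_url: str, cql_query: str, record_schema: str,
                   maximum_records: int = 10) -> str:
    """Build an SRU 1.1 searchRetrieve URL asking for a specific output format."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "recordSchema": record_schema,
        "maximumRecords": maximum_records,
    }
    return f"{base_url}?{urlencode(params)}"


# Same query, same stored data, two different output formats.
marc_url = sru_search_url(BASE_URL, 'dc.creator = "Adams"', "marcxml")
dc_url = sru_search_url(BASE_URL, 'dc.creator = "Adams"', "dc")
print(marc_url)
print(dc_url)
```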
Currently at the Library of the University of Amsterdam we are migrating our ILS, which also involves converting our data from one bibliographic metadata format (PICA+) to another (MARC). This is extremely complicated, especially because of the non-normalised structure of both formats. And I must say that in my opinion PICA+ is actually the better of the two.
Also all German and Austrian libraries are meant to migrate from the MAB format to MARC, which also seems to be a move away from a superior format.
All because of the need to adhere to international standards, but with the wrong solution. Maybe the projected new standard for resource description and access, RDA, will be the solution, but that may take a while yet.
-
Replacing our ILS, business as usual
Posted on April 24th, 2009 2 comments
As you may have noticed from some of my tweets, the Library of the University of Amsterdam, my place of work, is in the process of replacing its ILS (Integrated Library System). All in all this project, or better these two projects (one selecting a new ILS, the other one implementing it), will have taken 18 months or more from the decision to go ahead until STP (Switch to Production), planned for August 15 this year. My colleague Bert Zeeman blogged about this (in Dutch) recently.
One thing that has become absolutely clear to me is that replacing an ILS is not just about replacing one information system by another. It is about replacing an entire organisational structure of work processes, with its huge impact on all people involved. And in our case it affects two organisations: besides the Library of the University of Amsterdam also the Media Library of the Hogeschool van Amsterdam. We have been managing library systems for both organisations in a mini consortial structure for a couple of years. So the Media Library is facing a second ILS replacement within two years.
While the decision was made because of pressing technical reasons, also with an eye on preparing for future library 2.0 developments, it turned out to be of substantial consequence for the organisation.
This is the first time that I am participating in such a radical library system project. I have done a couple of projects implementing and upgrading metasearching and OpenURL link resolver tools in the last six years, but these are nothing compared to the current project. With these “add-on” tools, which started as a means of extending the library’s primary stream of information, only a relatively limited number of people were involved. But with an ILS you are talking about the core business of a library (still!) and about the day-to-day working life of everybody involved in acquisitions, cataloguing and circulation, as well as system administrators and system librarians.
To make it even more complicated, the University Library is also switching from the old system’s proprietary bibliographic format to MARC21, because that is what the new system is using. Personally I think that the old system’s format is better (just like our German colleagues think about their move from MAB to MARC), but of course the advantages of using an internationally accepted and used standard outweigh this, as always. Maybe food for another blog post later…
Last but not least, the Library is simultaneously doing a project for the implementation of RFID for self check machines. The initial idea was to implement RFID in the old system and then just migrate everything to the new one. However, for various reasons, recently it was decided to postpone RFID implementation to shortly after our ILS STP. Some initial tests have shown that this probably will work.
And while all this is going on, all normal work needs to be taken care of too: “business as usual”.
Now, looking at workflows: the way that our individual departments have organised their workflows is partly dictated by the way the old system is designed. The new system obviously dictates workflows too, but in other areas. Although this new system is very flexible and highly configurable, there are still some local requirements that cannot be met by the new system.
Of course this is NOT the way it should be! Systems should enable us to do what we want and how we want it! Hopefully new developments like Ex Libris’ URM and the very recently announced new OCLC WorldCat Web based ILS will take care of users better.
Talking about “very flexible and highly configurable”: although a very big advantage, this also makes it much more complicated and time consuming to implement the new system. Fortunately there are a lot of other libraries in The Netherlands and around the world using the new system that are willing to help us in every possible way. And this is highly appreciated!
Other issues that make this project complicated:
- unexpected issues, bottlenecks: these keep on coming
- migration of data from old system: conversion of old to new format
- implementing links with external systems like the student and staff databases, financial system, national union catalogue
I think we will make STP on the planned date, but I also think we need to postpone a number of issues until after that. There will still be a lot of work to be done for my department after the project has finished.
To end on a positive note: the new OPAC will be much nicer and more flexible than the old one. And in the end that is what we are doing this for: our patrons.
-
System librarians 2.0
Posted on October 17th, 2008 2 comments
It strikes me that training for and documentation about our new Aleph ILS are aimed at three types of staff: system administrators, system librarians and staff (expert) users. Basically system administrators are supposed to take care of “technical stuff” like installing, upgrading, monitoring, backups, general system configuration etc., while staff users are dealing with the “real stuff”, like cataloging, acquisition, circulation, etc. System librarians appear to be a kind of hybrid species, both technicians and librarians: information specialists with UNIX and vi experience.
At the Library of the University of Amsterdam we do not have these three staff types, we only have what we call system administrators and staff users. We as system administrators do both system administrator and system librarian tasks as defined in the Aleph documentation. Only hardware, operating system, network, server monitoring and system backups are taken care of by the University’s central ICT department.
There is no such job title as “system librarian”, in fact I would not even know how to translate this term into Dutch. However, we do have terms for three different types of tasks: technical system administration, application administration and functional administration, which may be equivalent to the above mentioned staff types, although the terms are used in different ways and boundaries between them are unclear. In The Netherlands we even have system administrators, application administrators and functional administrators, but these are all general terms, not limited to the library world.
Anyway, the need for three types of library system administration tasks and staff is typically related to the legacy systems of Library 1.0.
Library 0.0 (the catalog card era) had only one type: the expert staff user, also known as “librarian“.
Library 2.0 (also known as “next generation” library systems) will probably also have only one type of staff user that is needed in the libraries themselves, and I guess we will call these library staff users “system librarians“. These future system librarians will have knowledge of and experience in library and information issues, and will take care of configuration of the integrated library information systems at their disposal through sophisticated, intuitive and user friendly web admin interfaces.
The systems themselves will be hosted and monitored on remote servers, according to the SaaS model (Software as a Service), either by system vendors or by user consortia or in cooperation between both. Technical system administration will no longer be necessary at the local libraries.
Cataloging, tagging, indexing etc. will not be necessary at the local library level either, because metadata will be provided by publishers, or dynamically generated by harvesting and indexing systems, and enriched by our end users/clients via tagging. These metadata stores will also be hosted and administered on remote servers, either by publishers or again by cooperative user organisations.
Of course this will have a lot of consequences for the current organisation and staffing of our libraries, but there will be plenty of time to adapt.
System librarians of the world: unite!
-
How open are open systems?
Posted on October 12th, 2008 2 comments
In my post “LING – Library integration next generation” I mentioned Marshall Breeding’s presentation at TICER, “Library Automation Challenges for the next generation”.
Besides “Moving toward new generation of library automation” one of his other two topics was “A Mandate for Openness”, about Open Source, Open Systems, Open Content.
Marshall Breeding distinguishes five types of Open Systems, three of which in my view are the most important:
- Closed Systems: black boxes, only accessible via the user interfaces provided by the developer, no programmable access to the system
- Open Source Model: all aspects of the system available to inspection and modification
- Open API model: the core application is closed and accessible via the user interfaces provided by the developer, but third party developers can create code against the published API’s or database tables
(The other two types are intermediate or combined types: “standard RDBM systems” where third party developers can access the database schema, which in my view contains only part of the system’s data; and “Open Source/Open API”).
Especially the “Open API Model” is an interesting development for most libraries that work with commercial library systems. I have had some experience with two initiatives in this field: OCLC’s “WorldCat Grid“, and Ex Libris’ “Open Platform“. A big and important difference between these two is: WorldCat Grid is about access to a specific database already available to the public at large, Ex Libris’ Open Platform is about access to a number of commercial systems.
Interestingly, both initiatives consist of two parts: a set of open API’s and an open developers’ platform. These two parts make it possible to have a kind of marriage between commercial systems and an open source community. But how does this work in real life, how open is access to both the API’s and the Platform?
Some of OCLC’s WorldCat Grid Services are freely accessible, others are accessible for OCLC members only.
Membership of the WorldCat Grid Developers’ Network is available to “IT professionals from: OCLC member institutions, content providers, other software vendors and publishers, as well as bloggers and others in the library field who see value in a collaborative network related to the development of new functionality for the WorldCat Grid.”
“Software code, snippets and API’s developed within the network will be openly available for members, and the world-at-large, to use and re-use.”
With Ex Libris’ Open Platform, access to the Developers’ Platform is only open for Ex Libris customers.
Access to the existing API mechanisms (“X-Server” for the products Aleph, MetaLib, SFX, and Webservices for Primo) is also only available to Ex Libris customers. What will happen with newly developed API’s (conforming to new API standards like the DLF ILS-Discovery Interface protocol) for new products is still unclear.
In my view it does make sense to restrict availability of Open API’s to members or customers in the case of access to licensed metadata or resources. But availability of Open API’s that access public data should be free to all.
It does NOT make sense to restrict access to tools developed on top of the Open API’s to members or customers only.
Granting access to data should be the privilege of the owners of the data, granting access to tools that access data should be the privilege of the developers/owners of these tools.
In this respect the OCLC platform is more open than Ex Libris, but it still is not completely open.
Of course, this is all highly dependent on the motives of the companies for supporting Openness: is it commitment to openness, or fear of losing customers?
-
Library Systems and the world of hardware
Posted on October 8th, 2008 No comments
The project for implementation of Aleph as the new ILS for the Library of the University of Amsterdam started last week (October 2) with the official kick-off meeting. The Ex Libris project plan was presented to the members of the project team, bottlenecks were identified, and a lot of adjustments were made to the planning in order to be able to carry out more tasks simultaneously and thus earlier in time.
The first steps are installing Aleph 18 and giving on-site training to all people involved, using the new locally installed Aleph 18 system.
But of course, before everything can start, we need the hardware! The central ICT department of the University of Amsterdam (not part of the library) is responsible for configuring and providing the Aleph production server according to the official Ex Libris “Requirements for ALEPH 500 Installation”. And as always there is confusion about what is actually meant by the provider, and as always there are conflicts between the provider’s requirements and the ICT department’s security policy.
As head of the Library Systems Department of the library and as coordinator of the project’s System Administration team, I have been acting this week as an intermediary between our software and hardware providers, passing information about volumes, partitions, database and scratch files, root access, IP addresses and swap areas.
This makes you realise again that all these new web 2.0 systems and techniques are in the end completely dependent on the availability of correctly configured and constantly monitored machines, cables and electricity, and not least on all these technicians that know all about hardware and networks.
-
LING – Library integration next generation
Posted on October 5th, 2008 No comments
At the end of August I attended the Technological Developments: Threats and Opportunities for Libraries module of TICER – Digital Libraries à la Carte 2008 at the University of Tilburg, The Netherlands.
One of the speakers was Marshall Breeding. His presentation “Library Automation Challenges for the next generation” consisted of three topics, one of which was “Moving toward new generation of library automation”.
He discussed “rethinking the ILS”. The old I(ntegrated) L(ibrary) S(ystem) was about integration of acquisition, serials, cataloguing, circulation, OPAC and reporting of print material. Now we are moving towards a completely electronic information universe, so new means of integration (and also dis-integration!) are necessary.
Developments until now have been targeted at the front ends: new integrated web 2.0 user interfaces that can also be used in a “dis-integrated” way (by means of API’s that allow embedding portions of the user interface in other environments), such as Primo, Encore, WorldCat Local, AquaBrowser, VuFind, eXtensible Catalog, etc.
The keyword here is “decoupling” of the front end from the back end. But with these products that is not really the case: there is always a harvesting, indexing and enriching component integrated in them, which moves at least part of the content and also processing to this front end environment.
A new direction here is what Marshall Breeding calls “Comprehensive Resource Management”: the integration of all types of administration (acquisition, cataloging, OPAC, metasearching, linking, etc.) of all types of library resources (print and electronic, text and objects).
One and a half years ago (February 2007) I wrote an article “My Ideal Knowledge Base” about this in “SMUG 4 EU – Newsletter for MetaLib and SFX users” Issue 4 (page 14), targeted at Ex Libris tools Aleph, MetaLib, SFX, DigiTool. I ended this vision of an ideal situation with: “Is this ideal image only a dream, or will it come true some day?“.
According to Marshall Breeding it will take 2-3 years more to see library automation systems that follow this approach and 5-7 years for wider adoption. He also said that traditional ILS vendors were working on this, but that no public announcements had been made yet.
Exactly two weeks later, at IGeLU2008 in Madrid, Ex Libris announced and presented their plans for URM (Unified Resources Management) and URD2 (Unified Resource Discovery and Delivery, meaning Primo). Eventually all of their existing products will be integrated in this new next generation environment. The first release will focus on ERM (Electronic Resource Management).
Short term plans for existing tools are focused on preparing them for the new URM/URD2 environment. For instance SFX 4.0 will have a re-designed database ready for integration with URM 1.0.
MetaLib will see its final official version with minor release 4.3 in spring 2009. After that a “next generation metasearch tool” will be developed with a completely re-designed back end and metasearch engine, and Primo as front end. Existing customers will be able to upgrade to this NextGen MetaSearch without paying a license fee for Primo (remote search option only).
Interesting times ahead….















