-
Mobile library services
Posted on March 4th, 2010 2 commentsLocation aware services in a digital library world
This is the third post in a series of three
[1. Mainframe to mobile - 2. Mobile app or mobile web? - 3. Mobile library services]
While library systems technology and mobile apps architecture make up the technical and functional infrastructure of mobile web access, mobile library services are what it’s all about. What type of mobile services should libraries offer to their customers?
As stated before, the two main features that distinguish mobile, handheld devices from other devices are:
- web access any time anywhere
- location awareness
It seems obvious that libraries should take these two conditions into account when providing mobile services, not in the least the first one. I don’t think that mobile devices will completely replace other devices like pc’s and netbooks, like Google seems to think, but they will definitely be an important tool for lots of people, simply because they always carry a mobile phone with them. So in order to offer something extra, mobile applications should be focused on the situational circumstance of potential access to information any time anywhere, and make use of the location awareness of the device as much a possible. But does this also apply to services for library customers? That partly depends on the type of library (public, academic, special) and the physical and geographical structure of the library (one central location, branch locations).
As a starting point we can say that mobile library services should cover the total range of online library services already offered through traditional web interfaces. However, mobile users may not want to use certain library services on their mobile devices. For instance, from an analysis of usage statistics of EBSCO Mobile at the Library of Texas A&M University, generously provided by Bennett Ponsford, it appears that although the number of searches in EBSCO mobile is increasing, only 1% of mobile searches leads to a fulltext download, against 77% of regular EBSCO searches. These findings suggest that library customers, at least academic ones, are willing to search for books and articles on their mobile devices, but will postpone actually using them until they are in a more convenient environment. Apparently small screens and/or mobile PDF readers are not very reader friendly in academic settings. This may be different for public library customers and e-books.
So, libraries should concentrate on offering those mobile services that are wanted and will actually be used. In the beginning this may involve analysis of usage statistics and customer feedback to be able to determine the perfect mobile services suite for your library. Libraries should be prepared for “perpetual beta” and “agile development”.
There are two main areas of information in which libraries can offer mobile services:
- practical information
- bibliographical information
This is no different from other library information channels, like normal websites and printed guides and catalogues.
Practical information may consist of contact address, email and telephone information, opening hours, staff information, rules and regulations of any kind, etc. In most cases this is information that does not change very often, so static information pages will be sufficient. However, especially with mobile devices who’s owners are on the move, providing dynamic up to date information will give an advantage. For instance: today’s and tomorrow’s opening hours, number of currently available public workstations per location, etc.
The information provided will be even more precisely aimed at the user’s personal situation, if the “location awareness” feature is added to the “any time anywhere” feature, and up to date static and dynamic information for the locations in the immediate vicinity of the customer is shown first, using the device’s automatic geolocation properties. And all this gets better still if the library’s own information is mashed up with available online tools, like showing a location on Google Maps when selecting an address, and with the device’s tools, like making a phone call when clicking on a phone number.
Bibliographical information should be handled somewhat differently. Searching library catalogues or online databases is in essence not location dependent. Online digital bibliographical metadata is available “in the cloud” any time anywhere. It’s not the discovery but the delivery that makes the difference. We have already seen that mobile academic library customers do not download fulltext articles to their mobile devices. But mobile customers will definitely be interested in the possibility of requesting a print item to be delivered to them in the nearest location. WorldCat Mobile, like “normal” WorldCat, for instance offers the option to select a library manually from a list in order to find the nearest location to obtain an item from. It would of course be nice if the delivery location would be automatically determined by the mobile request service, using the device’s location awareness and the current opening hours of the library branches.
The funny thing here is that we have the paradoxical situation of state-of-the-art technology in a world of global online digital information being used to obtain “old fashioned” physical carriers of information (books) from the nearest physical location.
Augmented reality, as a link between the physical and virtual world, may be a valuable extension of mobile services. A frequently mentioned example is scanning a book cover or a barcode with the camera of a mobile phone and locating the item on Amazon. It would be helpful if your phone could automatically find and request the item in the nearest library branch. Personally I am not convinced that this is very valuable. Typing in ISBN or book title will do the job just as fast. Moreover, bookshop staff may not appreciate this behaviour.
A more common use of augmented reality would be to point the camera of your mobile device to a library building, after which a variety of information about the building is shown. The best known augmented reality app at the moment is Layar. This tool allows you to add a number of “layers”, with which you can for instance find the nearest ATM’s or museums, or Wikipedia information about physical objects or locations around you.

Layar - LibraryThing Local
There is also a LibraryThing Local layer for Layar, with which you can find
information about all libraries, bookshops and book related events in the neighbourhood. It may even be possible to find a specific book in an open stack using this technology.
All these extended mobile applications suggest that users of apps may not just be a specific group of people (like library customers), but that mobile users will be interested in all kinds of useful information about their current location. Library information may be only a part of that. Maybe mobile apps should be targeted at a more general audience and include related information from other sources, making use of the linked data concept.
A search in a library catalog in this case may result in a list of books with links to related objects in a museum nearby or a historic location related to the subject of the book. Alternatively, an item in a museum website might have links to related literature in catalogs of nearby libraries. Anything is possible.
The question that remains is: should libraries take care of providing these generic location based services, or will others do that?
-
Mobile app or mobile web?
Posted on February 21st, 2010 11 commentsTechnology, users and business models
This is the second post in a series of three
[1. Mainframe to mobile - 2. Mobile app or mobile web? - 3. Mobile library services]
Mobile access to information on the internet is the latest step in the development of information systems technology, as described in the previous post in this series. The two main features that distinguish mobile devices from other devices are:
- Access to the web literally any time, anywhere
- Location awareness using GPS or the mobile network
Let’s focus on web access first. There are two main ways in which information providers can provide access to their data: by a mobile web browser or by apps.
The easiest way to provide mobile access is: do nothing. Users of mobile internet devices can simply visit all existing websites with their mobile browser. However, in doing so they will experience a number of problems: performance is slow, pages are too large, navigation is difficult, certain parts of websites don’t work. These problems are caused by the very physical characteristics of mobile technology that make mobile internet access possible: the small size of devices and displays, the wireless network, the limited features of dedicated mobile operating systems and browsers.
Fortunately, technological development is an interactive, reciprocal, cyclic process. Technology continuously needs to find solutions to problems that were caused by new uses of existing technology.Many organisations have solved this problem by creating separate “dumbed down” mobile versions of their websites, containing mainly text only pages and textual links to their most important services and information. In the case of libraries for instance “locations and addresses“, “opening hours“, etc. See this list of examples (with thanks to Aaron Tay). Another example is LibraryThing Mobile, which also has a catalog search option. In these cases you have to manually point your browser to the dedicated mobile URL, unless the webserver is configured to automatically recognise mobile browsers and redirect them to the mobile site.
Of course this not the optimal solution for two reasons:
- On the front end: as an information provider you are complete ignoring all graphical, dynamic, interactive and web 2.0 functionality on the end user side. This means actually going back to the early days of the world wide web of static text pages.
- On the back end: duplicating system and content administration. In most cases it will come down to manually creating and editing HTML pages, because most website content management systems may not offer manual or automatic editing of pages for mobile access. Some systems offer automatic recognition of mobile browsers and display content in the appropriate format, like the WordPress plugin “WordPress Mobile Edition” that automatically shows a list of posts if mobile browsers are detected. This is what happens on this blog.
Because of this situation we are witnessing a re-enactment of the client-server alternative to static HTML that I described previously: mobile apps! “Apps” is short for “applications“, apparently everything needs to be short in the mobile online web 2.0 age. Apps are installed on mobile devices, they run locally making use of the hardware, operating system and user friendly interface of the device, and they only connect to the internet for retrieving data from a database system in the cloud (on a remote server).
A disadvantage of this solution obviously is that you have to multiply development and maintenance in order to support all mobile platforms that your customers are using, or just support the most used platform (iPhone) and ignore the rest of your end users. Alternatively you can support one mobile platform with an app, and the rest with a mobile web site. Organisations have the choice of developing apps themselves from scratch, or using one of the commercial parties that offer library apps, such as Boopsie, Blackboard or the recently announced LibraryThing Anywhere, that is meant to offer both mobile web and apps for iPhone, Blackberry and Android.Some examples:
- TU Delft mobile app for iPhone (powered by Blackboard). University wide, including library. Haven’t been able to test this because I have an Android phone. An Android version will be developed. For other devices they offer a mobile website.
- Duke University mobile app for iPhone. University wide, including library. For other devices they offer a mobile website.
- Santa Clara County Library mobile apps for iPhone and Android (powered by Boopsie).
- WorldCat Mobile (powered by Boopsie).
An alternative solution to the client-server and “dumbed down” models would be to use the new HTML5 and CSS3 options to create websites that can easily be handled by all PC and mobile webbrowsers alike. HTML 5 has geolocation options, and browsers are made location aware this way too. The iWebKit Framework is a free and easy package to create web apps compatible for all mobile platforms. See this demo on PC, iPhone, Android, etc.
Some say that HTML5/CSS3 will make apps disappear, but I suspect performance may still be a problem, due to slow connections. But it’s not only a technology issue. It’s also a matter of business models, as Owen Stephens and Till Kinstler pointed out.Apps can be distributed for free by organisations that want to draw traffic to their own data, ignoring the open web. This method fits their clasic business model, as Till remarked, mentioning the newspaper business as an example.
But there is also another side to this: apps can be created by anybody, making use of APIs to online systems and databases, and be shared with others for free or for a small fee, as is the case with the iPhone Apps Store, the Android Market, the Nokia Ovi Store, or the newly announced Wholesale Applications Community (WAC). This model will never be possible with web based apps (like HTML5), because nobody has access to a system’s web server other than the system administrators. It is also much too complicated for developers and consumers of apps to host web apps on a server that mobile device users can connect too.
And there is more: independent developers are more likely to look beyond the boundaries of the classic model of giving access to your own data only. Third party apps have the opportunity to connect data from a number of data sources in the cloud in order to satisfy mobile user needs better. To take the newspaper business example, I mentioned this in my post “Mobile reading“: general news apps vs dedicated newspaper apps. The rise of the open linked data movement will only boost the development and use of the mobile client server model.In my view there will be a hybrid situation: HTML5/CSS3 based web apps and local mobile apps will coexist, depending on developer, audience, and objectives.
What services library mobile apps should offer, including location awareness and linking data, is the topic of another post.
-
Mainframe to mobile
Posted on February 16th, 2010 9 commentsThe connection between information technology and library information systems
This is the first post in a series of three
[1. Mainframe to mobile - 2. Mobile app or mobile web? - 3. Mobile library services]
The functions, services and audience of library information systems, as is the case with all information systems, have always been dependent on and determined by the existing level of information technology. Mobile devices are the latest step in this development.
In the beginning there was a computer, a mainframe. The only way to communicate with it was to feed it punchcards with holes that represented characters.
If you made a typo (puncho?), you were not informed until a day later when you collected the printout, and you could start again. System and data files could be stored externally on large tape reels or small tape cassettes, identical to music tapes. Tapes were also used for sharing and copying data between systems by means of physical transportation.
Suddenly there was a human operable terminal, consisting of a monitor and keyboard, connected to the central computer. Now you could type in your code and save it as a file on the remote server (no local processing or storage at all). If you were lucky you had a full screen editor, if not there was the line editor. No graphics. Output and errors were shown on screen almost immediately, depending on the capacity of the CPU (central processing unit) and the number of other batch jobs in the queue. The computer was a multi-user time sharing device, a bit like the “cloud”, but every computer was a little cloud of its own.
There was no email. There were no end users other than systems administrators, programmers and some staff. Communication with customers was carried out by sending them printouts on paper by snail mail.I guess this was the first time that some libraries, probably mainly in academic and scientific institutions, started creating digital catalogs, for staff use only of course.
Then came the PC (Personal Computer). Terminal and keyboard were now connected to the computer (or system unit) on your desk. You had the thing entirely to yourself! Input and output consisted of lines of text only, one colour (green or white on black), and still no graphics. Files could be stored on floppy disks, 5¼-inch magnetic things that you could twist and bend, but if you did that you lost your data. There was no internal storage. File sharing was accomplished by moving the floppy from one PC to another and/or copy files from one floppy to another (on the same floppy drive).
Later we got smaller disks, 3½-inch, in protective cases. The PC was mainly used for early word processing (WordStar, WordPerfect) and games. Finally there was a hard disk (as opposed to “floppy” disk) inside the PC system unit, which held the operating system (mainly MS-DOS), and on which you could store your files, which became larger. Time for stand-alone database applications (dBase).
Then there was Windows, a mouse, and graphics. And of course the Internet! You could connect your PC to the Internet with a modem that occupied your telephone line and made phone calls impossible during your online session. At first there was Gopher, a kind of text based web.
Then came the World Wide Web (web 0.0), consisting of static web pages with links to other static web pages that you could read on your PC. Not suitable for interactive systems. Libraries could publish addresses and opening hours.
But fortunately we got client-server architecture, combining the best of both worlds. Powerful servers were good at processing, storing and sharing data. PC’s were good at presenting and collecting data in a “user friendly” graphical user interface (GUI), making use of local programming and scripting languages. So you had to install an application on the local PC which then connected to the remote server database engine. The only bad thing was that the application was tied to the specific PC, with local Windows configuration settings. And it was not possible to move the thing around.Now we had multi-user digital catalogs with a shared central database and remote access points with the client application installed, available to staff and customers.
Luckily dynamic creation of HTML pages came along, so we were able to move the client part of client-server applications to the web as well. With web applications we were able to use the same applications anywhere on any computer linked to the world wide web. You only needed a browser to display the server side pages on the local PC.
Now everybody could browse through the library catalog any time, anywhere (where there was a computer with an internet connection and a web browser). The library OPAC (Online Public Access Catalog) was born.
The only disadvantage was that every page change had to be generated by the server again, so performance was not optimal.
But that changed with browser based scripting technology like JavaScript, AJAX, Flash, etc. Application bits are sent to the local browser on the PC at runtime, to be executed there. So actually this is client server “on the fly”, without the need to install a specific application locally.In the meantime the portable PC appeared, system unit, monitor and keyboard all in one. At first you needed some physical power to move the thing around, but later we got laptops, notebooks, netbooks, getting smaller, lighter and more powerful all the time. And wifi of course, no need to plug the device in to the physical network anymore. And USB-sticks.
Access to OPAC and online databases became available anytime, anywhere (where you carried your computer).
The latest development of course is the rise of mobile phones with wireless web access, or rather mobile web devices which can also be used for making phone calls. Mobile devices are small and light enough to carry with you in your pocket all the time. It’s a tiny PC.
Finally you can access library related information literally any time, anywhere, even in your bedroom and bathroom.
It’s getting boring, but yes, there is a drawback. Web applications are not really accommodated for use in mobile browsers: pages are too large, browser technology is not really compatible, connections are too slow.
Available options are:
- creating a special “dumbed down” version of a website for use on mobile devices only: smaller text based pages with links
- creating a new HTML5/CSS3 website, targeted at mobile devices and “traditional” PC’s alike
- creating “apps”, to be installed on mobile devices and connect to a database system in the cloud; basically this is the old client-server model all over again.
A comparison of mobile apps and mobile web architecture is the topic of another post.
-
Just in time or just in case?
Posted on October 16th, 2009 7 commentsMetasearch vs. harvesting & indexing
The other day I gave a presentation for the Assembly of members of the local Amsterdam Libraries Association “Adamnet“, about the Amsterdam Digital Library search portal that we host at the Library of the University of Amsterdam. This portal is built with our MetaLib metasearch tool and offers simultaneous access to, at the moment, 20 local library catalogues.
A large part of this presentation was dedicated to all possible (and very real) technical bottlenecks of this set-up, with the objective of improving coordination and communication between the remote system administrators at the participating libraries and the central portal administration. All MetaLib database connectors/configurations are “home-made”, and the portal highly depends on the availability of the remote cataloging systems.
I took the opportunity to explain to my audience also the “issues” inherent in the concept of metasearch (or “federated search“, “distributed search“, etc.), and compare that to the harvesting & indexing scenario.
Because it was not the first (nor last) time that I had to explain the peculiarities of metasearch, I decided to take the Metasearch vs. Harvesting & Indexing part of the presentation and extend it to a dedicated slideshow. You can see it here, and you are free to use it. Examples/screenshots are taken from our MetaLib Amsterdam Digital Library portal. But everything said applies to other metasearch tools as well, like Webfeat, Muse Global, 360-Search, etc.
The slideshow is meant to be an objective comparison of the two search concepts. I am not saying that Metasearch is bad, and H&I is good, that would be too easy. Some five years ago Metasearch was the best we had, it was a tremendous progress beyond searching numerous individual databases separately. Since then we have seen the emergence of harvesting & indexing tools, combined with “uniform discovery interfaces”, such as Aquabrowser, Primo, Encore, and the OpenSource tools VuFind, SUMMA, Meresco, to name a few.
Anyway, we can compare the main difference between Metasearch and H&I to the concepts “Just in time” and “Just in case“, used in logistics and inventory management.
With Metasearch, records are fetched on request (Just in time), with the risk of running into logistics and delivery problems. With H&I, all available records are already there (Just in case), but maybe not the most recent ones.
Objectively of course, H&I can solve the problems inherent in Metasearch, and therefore is a superior solution. However, a number of institutions, mainly general academic libraries, will for some time depend on databases that can’t be harvested because of technical, legal or commercial reasons.
In other cases, H&I is the best option, for instance in the case of cooperating local or regional libraries, such as Adamnet, or dedicated academic or research libraries that only depend on a limited number of important databases and catalogs.
But I also believe that the real power of H&I can only be taken advantage of, if institutions cooperate and maintain shared central indexes, instead of building each their own redundant metadata stores. This already happens, for instance in Denmark, where the Royal Library uses Primo to access the national DADS database.
We also see commercial hosted H&I initiatives implemented as SaaS (Software as a Service) by both tool vendors and database suppliers, like Ex Libris’ PrimoCentral, SerialSolutions’ Summon and EBSCOhost Integrated Search.
The funny thing is, that if you want to take advantage of all these hosted harvested indexes, you are likely to end up with a hybrid kind of metasearch situation where you distribute searches to a number of remote H&I databases.
-
Roadmaps to uncertainty
Posted on October 6th, 2009 4 commentsWhat will library staff do 5 years from now?

© Lukas Koster
I attended the IGeLU 2009 annual conference in Helsinki September 6-9. IGeLU is the International Group of Ex Libris Users, an independent organisation that represents Ex Libris customers. Just to state my position clearly I would like to add that I am a member of the IGeLU Steering Committee.
These annual user group meetings typically have three types of sessions: internal organisational sessions (product working groups and steering committee business meetings, elections), Ex Libris sessions (product updates, Q&A, strategic visions), and customer sessions (presentations of local solutions, addons, developments).Not surprisingly, the main overall theme of this conference was the future of library systems and libraries. The word that characterises the conference best in my mind (besides “next generation“and “metaphor“) is “roadmap“. All Ex Libris products but also all attending libraries are on their way to something new, which strangely enough is still largely uncertain.

© Lukas Koster
Library paradise?
Ex Libris presented the latest state of design and development of their URM (Unified Resource Management) project, ‘A New Model for Next-generation Library Services’. In the final URM environment all back end functionality of all current Ex Libris products will be integrated into one big modular system, implemented in a SaaS (“Software as a Service“) architecture. In the Ex Libris vision the front end to this model will be their Primo Indexing and Discovery interface, but all URM modules will have open API’s to enable using them with other tools.
The goal of this roadmap apparently is efficiency in the areas of technical and functional system administration for libraries.Intermediate generation
In the mean time development of existing products is geared towards final inclusion in URM. All future upgrades will result in what I would like to call “intermediate” instead of “next generation” products . MetaLib, the metasearch or federated search tool, will be replaced by MetaLib Next Generation, with a re-designed metasearch engine and a Primo front end. The digital collection management tool DigiTool will be merged into its new and bigger nephew Rosetta, the digital preservation system. The database of the OpenUrl resolver SFX will be restructured to accommodate the URM datamodel. The next version of Verde (electronic resource management) will effectively be URM version 1, which will also be usable as an alternative for both ILS’es Voyager and Aleph.
Here we see a kind of “intermediate” roadmap to different “base camps” from where the travelers can try to reach their final destination.
© Lukas Koster
Holy cows!
From the perspective of library staff we see another panorama appearing.
In one of the customer presentations Janet Lute of Princeton University Library, one of the three (now four) URM development partners, mentioned a couple of “holy cows” or library tasks they might consider stopping doing while on their way to the new horizon:- managing prediction patterns for journal issues
- checking in print serials
- maintaining lots of circulation matrices and policies
- collecting fines
- cataloging over 80% of bibliographic records
I would like to add my own holy cow MARC to this list, about which I have written a previous post Who needs MARC?. (Some other developments in this area are self service, approval plans, shared cataloging, digitisation, etc.)
This roadmap is supposed to lead to more efficient work and less pressure for acquisitions, cataloging and circulation staff.Eldorado or Brave New World?
To summarise: we see a sketchy roadmap leading us via all kinds of optional intermediate stations to an as yet still vague and unclear Eldorado of scholarly information disclosure and discovery.
The majority of public and professional attention is focused on discovery: modern web 2.0 front ends to library collections, and the benefits for the libraries’ end users. But it is probably even more important to look at the other side, disclosure: the library back end, and the consequences of all these developments for library staff, both technically oriented system administrators and professionally oriented librarians.Future efficient integrated and modular library systems will no doubt eliminate a lot of tasks performed by library staff, but does this mean there will be no more library jobs?
Will the university library of the future be “sparsely staffed, highly decentralized, and have a physical plant consisting of little more than special collections and study areas“, as was stated recently in an article in “Inside Higher Education”? I mentioned similar options in “No future for libraries?“.Personally I expect that the two far ends of the library jobs spectrum will merge into a single generic job type which we can truly call “system librarian“, as I stated in my post “System librarians 2.0“. But what will these professionals do? Will they catalog? Will they configure systems? Will they serve the public? Will they develop system add-ons?
This largely depends on how the new integrated systems will be designed and implemented, how systems and databases from different vendors and providers will be able to interact, how much libraries/information management organisations will outsource and crowdsource, how much library staff is prepared to rethink existing workflows, how much libraries want to distinguish themselves from other organisations, how much end users are interested in differences between information management organisations; in brief: how much these new platforms will allow us to do ourselves.
We have come up with a realistic image of ourselves for the next couple of decades soon, otherwise our publishers and system vendors will be doing it for us.
-
Explicit and implicit metadata
Posted on August 20th, 2009 15 commentsOn August 17, after I tested a search in our new Aleph OPAC and mentioned my surprise on Twitter, the following discussion unfolded between me (lukask), Ed Summers of the Library of Congress and Till Kinstler of GBV (German Union Library Network):
- lukask: Just found out we only have one item about RDF in our catalogue: http://tinyurl.com/lz75c4
- edsu: @lukask broaden that search
http://is.gd/2l6vB - lukask: @edsu Ha! Thanks! But I’m sure that RDF will be mentioned in these 29 titles! A case for social tagging!
- edsu: @lukask or better cataloging
- edsu: @lukask i guess they both amount to the same thing eh?
- lukask: @edsu That’s an interesting position…”social tagging=better cataloging”. I will ask my cataloguing co-workers about this specific example
- edsu: @lukask make sure to wear body-armor
- lukask: @edsu Yes I know! I will bring it up at tomorrow’s party for the celebration of our ALEPH STP (after some drinks…)
- tillk: @edsu @lukask or fulltext search…
SCNR… - edsu: @tillk yeah, totally — with projects like @googlebooks and @hathitrust we may look back on the age of cataloging with different eyes …
- lukask: @tillk @edsu Fulltext search yes, or “implicit automatic metadata generation”?
What happened here was:
- A problem with findability of specific bibliographic items was observed: although it is highly unlikely that books about the Semantic Web will not cover RDF-Resource Description Framework, none of the 29 titles found with “Semantic Web” could be found with the search term “Resource Description Framework“; on the other side, the only item found with “Resource Description Framework” was NOT found with “Semantic Web“. I must add that the “Semantic web” search was an “All words” search. Only 20 of the results were indexed with the Dutch subject heading “Semantisch web” (which term is never used in real life as far as I know; the English term is an international concept). Some results were off topic, they just happened to have “semantic” and “web” somewhere in their metadata. A better search would have been a phrase search (adjacent) with “semantic web” in actual quotes, which gives 26 items. But of these, a small number were not indexed with subject heading “Semantisch web“. Another note: searching with “RDF” gets you all kinds of results. Read more on the issue of searching and relevance in my post Relevance in context.
- Four possible solutions were suggested:
- social tagging
- better cataloging
- fulltext searching
- automatic metadata generation
Social tagging
Clearly, the 26 items found with the search “Semantic web” are not indexed by the “Resource description framework” or “RDF” subject heading. There is not even a subject heading for “Resource description framework” or “RDF“. In my personal view, from my personal context, this is an omission. Mind you, this is not only an issue in the catalogue of the Library of the University of Amsterdam, it is quite common. I tried it in the British Library Integrated Catalogue with similar results. Try it in your own OPAC!
I presume that our professional cataloging colleagues can’t know everything about all subjects. That is completely understandable. I would not know how to catalog a book about a medical subject myself either! But this is exactly the point. If you allow end users to add their own tags to your bibliographic records, you enhance the findability of these records for specific groups of end users.
I am not saying that cataloguing and indexing by library specialists using controlled vocabularies should be replaced by social tagging! No, not at all. I am just saying that both types of tagging/indexing are complementary. Sure, some of the tags added by end users may not follow cataloging standards, but who cares? Very often the end users adding tags of their own will be professional experts in their field. In any case, items with social tags will be found more often because specific end user groups can find them searching with their own terms.Better cataloging
I suppose Ed Summers was trying to say the same thing as I just did above, when he commented “or better cataloging, I guess they both amount to the same thing eh?“, which I summarised as “social tagging=better cataloging“, but he can correct me if I’m wrong.
Anyway, I hope I made it clear that I would not say “social tagging=better cataloging“, but rather “controlled vocabularies+social tagging=better cataloging“.
Or alternatively, could we improve cataloging by professional library catalogers? I must admit I do not know enough about library training and practice to say something about that. I am not a trained librarian. Don’t hesitate to comment!Fulltext searching
Is fulltext searching the miracle cure for findability problems, as Till Kinstler seems to suggest? Maybe.
Suppose all our print material was completely digitised and available for fulltext search, I have no doubt that all 26 items mentioned above (the results of the “semantic web” all words search) would be found with the “resource description framework” or “rdf” search as well. But because fulltext search is by its very nature an “all words” search, the “rdf” fulltext search would also give a lot of “noise”, or items not having any relation to “semantic web” at all (author’s initials “R.D.F”, other acronyms “R.D.F.”, just see RDF in the BL catalogue). Again, see my post Relevance in context for an explanation of searching without context.
Also, there will be books or articles about a subject that will not contain the actual subject term at all. With fulltext search these items will not be found.
Moreover, fulltext searching actually limits the findable items to text, excluding other types, like images, maps, video, audio etc.
This brings me to the “final solution”:Automatic metadata generation
Of course this is mostly still wishful thinking. But there are a number of interesting implementations already.
What I mean when I say “(implicit) automatic metadata generation” is: metadata that is NOT created deliberately by humans, but either generated and assigned as static metadata, or generated on the fly, by software, applying intelligent analysis to objects, of all types (text, images, audio, video, etc.).
In the case of our “rdf” example, such a tool would analyse a text and assign “rdf” as a subject heading based on the content and context of this text, even if the term “rdf” does not appear in the text at all. It would also discard texts containing the string “rdf” that refer to something completely different. Of course for this to succeed there should be some kind of contextual environment with links to other records or even other systems to be able to determine if certain terminology is linked to frequently used terms not mentioned in the text itself (here the Linked Data developments could play a major role).
The same principle should also apply to non-textual objects, so that images, audio, video etc. about the same subject can be found in one go. Google has some interesting implementations in this field already: image search by colour and content type: see for example the search for “rdf” in Google Images with colour “red”and content type “clip art”.
But of course there still needs a lot to be done. -
Relevance in context
Posted on August 11th, 2009 9 commentsIf you do a search in a bibliographic database, you should find what you need, not just what you are looking for, or what the database “thinks” you are looking for. If you find what you are looking for, then you will not be surprised and you will not discover anything new. And that’s not what you want, is it? But if you find things you did not look for but also do not need, you’re not just surprised, you are confused! And that’s not what you want either.
You want the results that are the most relevant for your search, with your specific objectives, at that specific point in time time, for your specific circumstances, and you want them immediately.
So, how should search systems behave to make you find what you need? There are two conditions that need to be met:
- The search terms must be interpreted correctly
- The most relevant search results must be presented
The Problem
First of all, let’s take a look at current practice.Search systems cannot cope with ambiguous search terms. My favorite example and test search term is “voc“. This can stand for a number of things in various disciplines: V.O.C. (Dutch: “Verenigde Oostindische Compagnie” or “Dutch United East Indies Company”) in historical databases; “vocals” in musical databases; “volatile organic compounds” in physics databases. So if you do a search for “voc” in a standard library catalogue, you get all kinds of results. Even more so if you use a metasearch or federated search tool for searching several databases simultaneously.
You are confused. You would like the system to “understand” which one of these concepts you are referring to instead of just using the literal string. You would like the system to take into account your context.
In most databases search results can be sorted or filtered by a number of fields, most commonly by year, title, author, and also by more specific fields in dedicated databases. But unless you are interested in a specific year, author or title, this will not do. Recently many systems have implemented “faceted” and “clustered” browsing of results, enabling “drilling down” on specific terms or subjects. This basically comes down to setting the context after the fact.
But after the system has interpreted your search terms, the results should also be ordered in a specific way, the ones you need most should be on top. This is where “relevance ranking” of search results comes in. Most catalogues and databases use a system specific default relevance ranking algorithm. Search results are assigned a rank, based on a number of criteria, that can differ between databases, depending on the nature of the database.
Some databases just present the most recent results on top. For medical and physical sciences this may be right, but for history and literature databases this may just be wrong.
Sometimes the search terms are taken into account: the number of times the given search terms are present in the result fields is important, but also the specific fields in which search terms appear. The appearance of search terms in “Title” and “Subject” may rank higher than in “Abstract” or “Publisher”. Moreover, the search indexes used can have a major influence on rank: if you search for “Subject” = “flu”, then results with “flu” as subject will be ranked higher than results with “flu” in the title only.
To come back to my example, with ambiguous search terms like “voc” this type of relevance ranking will definitely not be enough, because results from the three different conceptual areas will be completely mixed up.When searching with a metasearch or federated search tool things get even more complicated. Each of the remote databases that you search in has its own default way of ranking. Usually the metasearch tool fetches the first 30 or so results from each remote database (one set sorted by date, the other by internal rank, the next by title), merges these into one list and then applies its own local ranking mechanism to this subset only. Confusion! And I did not even mention searching databases with metadata in multiple languages. Moreover, databases containing only metadata will produce different results and relevance than databases with full text articles. There is absolutely no way of telling if you actually have the most relevant results for your situation.
Again, with relevance ranking search systems do not take into account the context either. You could say it is an introverted, internally focused way of ranking, the confusing results of which are multiplied in case of metasearching.
Most metasearch tools give users the option of searching in sets of pre-selected databases, based on subject or type. This way you can limit your search to those databases that are known to have data about that specific subject. You more or less set the context in advance. But this mechanism only eliminates results from databases that probably do not have data on your subject at all, so they would not have shown up in the results anyway. Moreover, the same issues that were discussed above apply to this limited set of databases.
The metasearch tool that I know best (MetaLib) offers the option of setting a relative rank per database, so results from databases with a higher rank will have a higher relevance in merged result sets. But this is a system wide option, set by system administration, so it is not taking into account any context at all. It would be better if you could make the relative database rank dependent on the set or subject the search is done from (for instance: if a history database is searched in the context of a “History” set, the results get a higher rank than in a search from a “Music” set).
The best solution for this “internal” relevance problem regarding distributed databases is a central database of harvested indexes. In this case all harvested metadata is normalised and ranked in a uniform way, and users do not have to select databases in advance. But these systems still do not take into account “external” relevance: there is no context!
A very interesting and intelligent solution for the problem of pre-selecting databases is provided in PurpleSearch, the integrated front end to MetaLib (among other things), developed by the Library of the University of Groningen. The system records which databases actually produce results for specific search terms. As soon as the user enters search terms in the single search box, the system knows which databases will have results, and the search is automatically carried out in these databases, without asking the user to select the databases or subject area he or she wants to search in. Simultaneously a background search in all other databases is performed in order to check additional new results, and the information about results in databases is updated.
Of course, all other usual options are available as well, like pre-selecting databases (setting context in advance) and faceted results drilling down (setting context after the fact). But again, no external contextual settings.
Search "voc" in PurpleSearch
- Conclusion: the only way to find what you need, is to make search systems take into account the context in which the search is done, both for searching and for relevance ranking.
Solutions
Now, let’s have a look at a couple of conditions that would make contextual searches possible.Personal context: a system should “know” about your personal interests, field of study, job situation, age, etc. so it can “decide” which databases to search in and which results are the most relevant for you. Some systems, like university systems, have access to information about their users. Once you log in, the system potentially knows which subjects you are studying or teaching and could use this information for setting the context for searching and ranking.
But what if you are a student in Law AND Social Siences, which subject area should the system choose? Or: if you are a History teacher, and you have a personal interest in Ecology, which the system does not know about, what then? Somehow you still need to set context yourself.Some systems also offer the opportunity of setting personal preferences, like: area of interest, specific databases, type of material (only digital or print), only recent material, etc. Again: you must be able to deviate from these preferences, depending on your situation, which means setting context manually.
Different search systems will have different user profiles (user data and preferences). It would be nice if search systems could take advantage of universal external personal profiles (like Google Profiles for instance) using some kind of universal API.
Situational context: a system should also “know” about the situation you are in, both in a functional sense and in a physical sense.
Functional context means: wich role are you playing? Are you in your law student role or in your social sciences student role? Are you in your professional role or in your private role? But also: to which resources do you have access?
An interesting idea: if you work Monday to Friday during office hours, study in the evenings and spend time on your personal interests on the weekends, it would be nice if you could link times of day and days of the week to your different roles, so search systems could use the correct context for your searches depending on time and date: “if it’s Tuesday evening then use study profile and search in ‘History’; if it’s Sunday, use private profile and search in ‘Ecology’“.
This temporal context was also referred to by Till Kinstler in a (German) blog post about the new “Suchkiste” search system prototype of the German Union Library Network (GBV): ‘the search for “Charlie Brown” in October should result in “It’s the Great Pumpkin, Charlie Brown” at number 1, and in December in “A Charlie Brown Christmas“‘.
Physical context means: where are you? It would be nice if a library catalogue search system would take into account your actual location, so it could show you the records of the copies of the FRBR-ized results available in the library locations nearest to you (this idea came up in a recent Twitter discussion between @librarianbe and @gbierens). This is what Worldcat does when you supply it with your location manually. In Worldcat this is a static preference. But it would be nice if it would respond to your actual location, for instance by using the GPS coordinates of your mobile phone. Alternatively, search systems could derive your location from the IP address you are sending your search from.
This information could also be used to determine if records for digital or physical copies should be ranked the most relevant in this case. If you are inside the library building and you have a preference for physical books and journals, then records for available print copies should be on top of the results list. If you are at home, then records for digital copies that you have access to should come first.Contextual searching and ranking should always be a combination of all possible conditions, personal, situational and internal system ones.
Of course it goes without saying that it would be great if metasearch tools were able to convey the search context to the remote databases and get contextual results back, using some kind of universal serach context API!
Last but not least, each search system should show the context of the search, and explain how it got to the results in the presented order. Something like: based on your personal preferences, the time of day and day of the week, and your location, the search was done in these databases, with this subject area, and the physical copies of the nearest location are shown on top.
This context area on the results screen could then be used as a kind of inverted faceted search, drilling “up” to a broader level or “sideways” to another context. -
Linked Data for Libraries
Posted on June 19th, 2009 22 commentsLinked Data and bibliographic metadata modelsSome time after I wrote “UMR – Unified Metadata Resources“, I came across Chris Keene’s post “Linked data & RDF : draft notes for comment“, “just a list of links and notes” about Linked Data, RDF and the Semantic Web, put together to start collecting information about “a topic that will greatly impact on the Library / Information management world“.
While reading this post and working my way through the links on that page, I started realising that Linked Data is exactly what I tried to describe as One single web page as the single identifier of every book, author or subject. I did mention Semantic Web, URI’s and RDF, but the term “Linked Data” as a separate protocol had escaped me.
The concept of Linked Data was described by Tim Berners Lee, the inventor of the World Wide Web. Whereas the World Wide Web links documents (pages, files, images), which are basically resources about things, (“Information Resources” in Semantic Web terms), Linked Data (or the Semantic Web) links raw data and real life things (“Non-Information Resources”).
There are several definitions of Linked Data on the web, but here is my attempt to give a simple definition of it (loosely based on the definition in Structured Dynamics’ Linked Data FAQ):
Linked Data is a methodology for providing relationships between things (data, concepts and documents) anywhere on the web, using URI’s for identifying, RDF for describing and HTTP for publishing these things and relationships, in a way that they can be interpreted and used by humans and software.
I will try to illustrate the different aspects using some examples from the library world. The article is rather long, because of the nature of the subject, then again the individual sections are a bit short. But I do supply a lot of links for further reading.
Data is relationships
The important thing is that “data is relationships“, as Tim Berners Lee says in his recent presentation for TED.
Before going into relationships between things, I have to point out the important distinction between abstract concepts and real life things, which are “manifestations” of the concepts. In Object modeling these are called “classes” (abstract concepts, types of things) and “objects” (real life things, or “instances” of “classes“).Examples:
- the class book can have the instances/objects “Cloud Atlas“, “Moby Dick“, etc.
- the class person can have the instances/objects “David Mitchell“, “Herman Melville“, etc.
In the Semantic Web/RDF model the concept of triples is used to describe a relationship between two things: subject – predicate – object, meaning: a thing has a relation to another thing, in the broadest sense:
- a book (subject) is written by (predicate) a person (object)
You can also reverse this relationship:
- a person (subject) is the author of (predicate) a book (object)

Triple
The person in question is only an author because of his or her relationship to the book. The same person can also be a mother of three children, an employee of a library, and a speaker at a conference.
Moreover, and this is important: there can be more than one relationship between the same two classes or types of things. A book (subject) can also be about (predicate) a person (object). In this case the person is a “subject” of the book, that can be described by a “keyword”, “subject heading”, or whatever term is used. A special case would be a book, written by someone about himself (an autobiography).The problem with most legacy systems, and library catalogues as an example of these, is that a record for let’s say a book contains one or more fields for the author (or at best a link to an entry in an authority file or thesaurus), and separately one or more fields for subjects. This way it is not possible to see books written by an author and books about the same author in one view, without using all kinds of workarounds, link resolvers or mash-ups.
Using two different relationships that link to the same thing would provide for an actual view or representation of the real world situation.Another important option of Linked Data/RDF: a certain thing can have as a property a link to a concept (or “class”) , describing the nature of the thing: “object Cloud Atlas” has type “book“; “object David Mitchell” has type “person“; “object Cloud Atlas” is written by “object David Mitchell“.
And of course, the property/relationship/predicate can also link to a concept describing the nature of the link.
Anywhere on the web

ERD
So far so good. But you may argue that this relationship theory is not very new. Absolutely right, but up until now this data-relationship concept has mainly been used with a view to the inside, focused on the area of the specific information system in question, because of the nature and the limitations of the available technology and infrastructure.
The “triple” model is of course exactly the same as the long standing methodology of Entity Relationship Diagrams (ERD), with which relationships between entities (=”classes“) are described. An ERD is typically used to generate a database that contains data in a specific information system. But ERD’s could just as well be used to describe Linked Data on the web.
Information systems, such as library catalogs, have been, and still are, for the greatest part closed containers of data, or “silos” without connections between them, as Tim Berners Lee also mentions in his TED presentation.
Lots of these silo systems are accessible with web interfaces, but this does not mean that items in these closed systems with dedicated web front ends can be linked to items in other databases or web pages. Of course these systems can have API’s that allow system developers to create scripts to get related information from other systems and incorporate that external information in the search results of the calling system. This is what is being done in web 2.0 with so-called mash-ups.
But in this situation you need developers who know how to make scripts using specific scripting languages for all the different proprietary API’s that are being supported for all the individual systems.
If Linked Data was a global standard and all open and closed systems and websites supported RDF, then all these links would be available automatically to RDF enabled browser and client software, using SPARQL, the RDF Query Language.- Linked Data/RDF can be regarded as a universal API.
The good thing about Linked Data is, that it is possible to use Linked Data mechanisms to link to legacy data in silo databases. You just need to provide an RDF wrapper for the legacy system, like has been done with the Library of Congress Subject Headings.
Some examples of available tools for exposing legacy data as RDF:
- Triplify – a web applications plugin that converts relational database structures into RDF triples
- D2R Server – a tool for publishing relational databases on the Semantic Web
- wp-RDFa – a wordpress plugin that adds some RDF information about Author and Title to Wordpress blog posts
Of course, RDF that is generated like this will very probably only expose objects to link TO, not links to RDF objects external to the system.
Also, Linked Data can be used within legacy systems, for mixing legacy and RDF data, open and closed access data, etc. In this case we have RDF triples that have a subject URI from one data source and an object URI from another data source. In a situation with interlinked systems it would for instance be possible to see that the author of a specific book (data from a library catalog) is also speaking at a specific conference (data from a conference website). Objects linked together on the web using RDF triples are also known as an “RDF graph”. With RDF-aware client software it is possible to navigate through all the links to retrieve additional information about an object.

Linked Data
URI’s
URI’s (“Uniform Resource Identifiers”) are necessary for uniquely identifying and linking to resources on the web. A URI is basically a string that identifies a thing or resource on the web. All “Information Resources”, or WWW pages, documents, etc. have a URI, which is commonly known as a URL (Uniform Resource Locator).With Linked Data we are looking at identifying “Non-information Resources” or “real world objects” (people, concepts, things, even imaginary things), not web pages that contain information about these real world objects. But it is a little more complicated than that. In order to honour the requirement that a thing and its relations can be interpreted and used by humans and software, we need at least 3 different representations of one resource (see: How to publish Linked Data on the web):
- Resource identifier URI (identifies the real world object, the concept, as such)
- RDF document URI (a document readable for semantic web applications, containing the real world object’s RDF data and relationships with other objects)
- HTML document URI (a document readable for humans, with information about the real world object)

Redirection
For instance, there could be a Resource Identifier URI for a book called “Cloud Atlas“. The web resource at that URI can redirect an RDF enabled browser to the RDF document URI, which contains RDF data describing the book and its properties and relationships. A normal HTML web browser would be redirected to the HTML document URI, for instance a web page about the book at the publisher’s website.
There are several methods of redirecting browsers and application to the required representation of the resource. See Cool URIs for the Semantic Web for technical details.
There are also RDF enabled browsers that transform RDF into web pages readable by humans, like the FireFox addon “Tabulator“, or the web based Disco and Marbles browsers, both hosted at the Free University Berlin.
RDF, vocabularies, ontologies
RDF or Resource Description Framework, is, like the name suggests, just a framework. It uses XML (or a simpler non-XML method N3) to describe resources by means of relationships. RDF can be implemented in vocabularies or ontologies, which are sets of RDF classes describing objects and relationships for a given field.
Basically, anybody can create an RDF vocabulary by publishing an RDF document defining the classes and properties of the vocabulary, at a URI on the web. The vocabulary can then be used in a resource by referring to the namespace (the URI) and the classes in that RDF document.A nice and useful feature of RDF is that more than one vocabularies can be mixed and used in one resource.
Also, a vocabulary itself can reference other vocabularies and thereby inherit well established classes and properties from other RDF documents.
Another very useful feature of RDF is that objects can be linked to similar object resources describing the same real world thing. This way confusion about which object we are talking about, can be avoided.A couple of existing and well used RDF vocabularies/ontologies:
- RDF – the base RDF vocabulary
- RDFS (for RDF Schema)
- DC (for Dublin Core)
- FOAF (for FOAF- Friend of a Friend) – online identities and social networks
- SKOS (for SKOS – Simple Knowledge Organisation System) – thesauri, classification schemes, subject heading systems and taxonomies
- OWL (for OWL -Ontology Web Language)
(By the way, the links in the first column (to the RDF files themselves) may act as an illustration of the redirection mechanism described before. Some of them may link to either the RDF file with the vocabulary definition itself, or to a page about the vocabulary, depending on the type of browser you use: rdf-aware or not.)
A special case is:
- RDFa – a sort of microformat without a vocabulary of its own, which relies on other vocabularies for turning XHTML page attributes into RDF
Example
A shortened example for “Cloud Atlas” by David Mitchell from the RDF BookMashup at the Free University Berlin, which uses a number of different vocabularies:<?xml version=”1.0″ encoding=”UTF-8″ ?>
<rdf:RDF
xmlns:rdf=”http://www.w3.org/1999/02/22-rdf-syntax-ns#”
…
xmlns:skos=”http://www.w3.org/2004/02/skos/core#”>
<rdf:Description rdf:about=”http://www4.wiwiss.fu-berlin.de/bookmashup/books/0375507256″>
<rev:hasReview rdf:resource=”http://www4.wiwiss.fu-berlin.de/bookmashup/reviews/0375507256_EditorialReview1″/>
<dc:creator rdf:resource=”http://www4.wiwiss.fu-berlin.de/bookmashup/persons/David+Mitchell”/>
<dc:format>Paperback</dc:format>
<dc:identifier rdf:resource=”urn:ISBN:0375507256″/>
<dc:publisher>Random House Trade Paperbacks</dc:publisher>
<dc:title>Cloud Atlas: A Novel</dc:title>
</rdf:Description>
<scom:Book rdf:about=”http://www4.wiwiss.fu-berlin.de/bookmashup/books/0375507256″>
<rdfs:label>Cloud Atlas: A Novel</rdfs:label>
<skos:subject rdf:resource=”http://www4.wiwiss.fu-berlin.de/bookmashup/subject/Fantasy+fiction”/>
<skos:subject rdf:resource=”http://www4.wiwiss.fu-berlin.de/bookmashup/subject/Fate+and+fatalism”/>
…
<foaf:depiction rdf:resource=”http://ecx.images-amazon.com/images/I/51MIVHgJP%2BL.jpg”/>
<foaf:thumbnail rdf:resource=”http://ecx.images-amazon.com/images/I/51MIVHgJP%2BL._SL75_.jpg”/>
</scom:Book>
<rdf:Description rdf:about=”http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/0375507256″>
<dc:license rdf:resource=”http://www.amazon.com/AWS-License-home-page-Money/b/ref=sc_fe_c_0_12738641_12/102-8791790-9885755?ie=UTF8&node=3440661&no=12738641&me=A36L942TSJ2AJA”/>
<dc:license rdf:resource=”http://www.google.com/terms_of_service.html”/>
</rdf:Description>
<foaf:Document rdf:about=”http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/0375507256″>
<rdfs:label>RDF document about the book: Cloud Atlas: A Novel</rdfs:label>
<foaf:maker rdf:resource=”http://www4.wiwiss.fu-berlin.de/is-group/resource/projects/Project10″/>
<foaf:primaryTopic rdf:resource=”http://www4.wiwiss.fu-berlin.de/bookmashup/books/0375507256″/>
</foaf:Document>
<rdf:Description rdf:about=”http://www4.wiwiss.fu-berlin.de/bookmashup/persons/David+Mitchell”>
<rdfs:label>David Mitchell</rdfs:label>
</rdf:Description>
<rdf:Description rdf:about=”http://www4.wiwiss.fu-berlin.de/bookmashup/reviews/0375507256_EditorialReview1″>
<rdfs:label>Review number 1 about: Cloud Atlas: A Novel</rdfs:label>
</rdf:Description>
<rdf:Description rdf:about=”http://www4.wiwiss.fu-berlin.de/is-group/resource/projects/Project10″>
<rdfs:label>RDF Book Mashup</rdfs:label>
</rdf:Description>
</rdf:RDF>
A partial view on this RDF file with the Marbles browser:
See also the same example in the Disco RDF browser.
Library implementations
It seems obvious that Linked Data can be very useful in providing a generic infrastructure for linking data, metadata and objects, available in numerous types of data stores, in the online library world. With such a networked online data structure, it would be fairly easy to create all kinds of discovery interfaces for bibliographic data and objects. Moreover, it would also be possible to link to non-bibliographic data that might interest the users of these interfaces.A brief and incomplete list of some library related Linked Data projects, some of which already mentioned above:
- RDF BookMashup – Integration of Web 2.0 data sources like Amazon, Google or Yahoo into the Semantic Web.
- Library of Congress Authorities – Exposing LoC Autorities and Vocabularies to the web using URI’s
- DBPedia – Exposing structured data from WikiPedia to the web
- LIBRIS – Linked Data interface to Swedish LIBRIS Union catalog
- Scriblio+Wordpress+Triplify – “A social, semantic OPAC Union Catalogue”
And what about MARC, AACR2 and RDA? Is there a role for them in the Linked Data environment? RDA is supposed to be the successor of AACR2 as a content standard that can be used with MARC, but also with other encoding standards like MODS or Dublin Core.
The RDA Entity Relationship Diagram, that incorporates FRBR as well, can of course easily be implemented as an RDF vocabulary, that could be used to create a universal Linked Data library network. It really does not matter what kind of internal data format the connected systems use. -
Who needs MARC?
Posted on May 15th, 2009 26 commentsWhy use a non-normalised metadata exchange format for suboptimal data storage?
This week I had a nice chat with André Keyzer of Groningen University library and Peter van Boheemen of Wageningen University Library who attended OCLC’s Amsterdam Mashathon 2009. As can be expected from library technology geeks, we got talking about bibliographic metadata formats, very exciting of course. The question came up: what on earth could be the reason for storing bibliographic metadata in exchange formats like MARC?
Being asked once at an ELAG conference about the bibliographic format Wageningen University was using in their home grown catalog system, Peter answered: “WDC” ….”we don’t care“.
Exactly my idea! As a matter of fact I think I may have used the same words a couple of times in recent years, probably even at ELAG2008. The thing is: it really does not matter how you store bibliographic metadata in your database, as long as you can present and exchange the data in any format requested, be it MARC or Dublin Core or anything else.
Of course the importance of using internationally accepted standards is beyond doubt, but there clearly exists widespread misunderstanding of the functions of certain standards, like for instance MARC. MARC is NOT a data storage format. In my opinion MARC is not even an exchange format, but merely a presentation format.

St. Marc Express
With a background and experience in data modeling, database and systems design (among others), I was quite amazed about bibliographic metadata formats when I started working with library systems in libraries, not having a librarian training at all. Of course, MARC (“MAchine Readable Cataloging record“) was invented as a standard in order to facilitate exchange of library catalog records in a digital era.
But I think MARC was invented by old school cataloguers who did not have a clue about data normalisation at all. A MARC record, especially if it corresponds to an official set of cataloging rules like AARC2, is nothing more than a digitised printed catalog card.In pre-computer times it made perfect sense to have a standardised uniform way of registering bibliographic metadata on a printed card in this way. The catalog card was simultaneously used as a medium for presenting AND storing metadata. This is where the confusion originates from!

MARC record
But when the Library of Congress says “If a library were to develop a “home-grown” system that did not use MARC records, it would not be taking advantage of an industry-wide standard whose primary purpose is to foster communication of information” it is saying just plain nonsense.
Actually it is better NOT to use something like MARC for other purposes than exchanging, or better, presenting data. To illustrate this I will give two examples of MARC tags that have been annoying me since my first day as a library employee:- 100 – Main Entry-Personal Name (NR) – subfield $a – Personal name (NR)
- 773 – Host Item Entry (R) – subfield $g – Relationship information (R)
100 – Main Entry-Personal Name
Besides storing an author’s name as a string in each individual bibliographic record instead of using a code, linking to a central authority table (“foreign key” in relational database terms), it is also a mistake to use a person’s name as one complete string in one field. Examples on the Library of Congress MARC website use forms like “Adams, Henry”, “Fowler, T. M.” and “Blackbeard, Author of”. To take only the simple first example, this author could also be registered as “Henry Adams”, “Adams, H.”, “H. Adams”. And don’t say that these forms are not according to the rules! They are out there! There is no way to match these variations as being actually one and the same.
In a normalised relational database, this subfield $a would be stored something like this (simplified!):- Person
- Surname=Adams
- First name=Henry
- Prefix=
- …
773 – Host Item Entry
Subfield $g of this MARC tag is used for storing citation information for a journal article, volume, issue, year, start page, end page, all in one string, like: “Vol. 2, no. 2 (Feb. 1976), p. 195-230“. Again I have seen this used in many different ways. In a normalised format this would look something like this, using only the actual values:- Journal
- Volume=2
- Issue=2
- Year=1976
- Month=2
- Day=
- Start page=195
- End page=230
In a presentation of this normalised data record extra text can be added like “Vol.” or “Volume“, “Issue” or “No.“, brackets, replacing codes by descriptions (Month 2 = Feb.) etc., according to the format required. So the stored values could be used to generate the text “Vol. 2, no. 2 (Feb. 1976), p. 195-230” on the fly, but also for instance “Volume 2, Issue 2, dated February 1976, pages 195-230“.
The strange thing with this bibliographic format aimed at exchanging metadata is that it actually makes metadata exchange terribly complicated, especially with these two tags Author and Host Item. I can illustrate this with describing the way this exchange is handled between two digital library tools we use at the Library of the University of Amsterdam, MetaLib and SFX , both from the same vendor, Ex Libris.
The metasearch tool MetaLib is using the described and preferred mechanism of on the fly conversion of received external metadata from any format to MARC for the purpose of presentation.
But if we want to use the retrieved record to link to for instance a full text article using the SFX link resolver, the generated MARC data is used as a source and the non-normalised data in the 100 and 773 MARC tags has to be converted to the OpenURL format, which is actually normalised (example in simple OpenUrl 0.1):isbn=;issn=0927-3255;date=1976; volume=2;issue=2;spage=195;epage=230; aulast=Adams;aufirst=Henry;auinit=;
In order to do this all kinds of regular expressions and scripting functions are needed to extract the correct values from the MARC author and citation strings. Wouldn’t it be convenient, if the record in MetaLib would already have been in OpenURL or any other normalised format?
The point I am trying to make is of course that it does not matter how metadata is stored, as long as it is possible to get the data out of the database in any format appropriate for the occasion. The SRU/SRW protocol is particularly aimed at precisely this: getting data out of a database in the required format, like MARC, Dublin Core, or anything else. An SRU server is a piece of middleware that receives requests, gets the requested data, converts the data and then returns the data in the requested format.
Currently at the Library of the University of Amsterdam we are migrating our ILS which also involves converting our data from one bibliographic metadata format (PICA+) to another (MARC). This is extremely complicated, especially because of the non-normalised structure of both formats. And I must say that in my opinion PICA+ is even the better one.
Also all German and Austrian libraries are meant to migrate from the MAB format to MARC, which also seems to be a move away from a superior format.
All because of the need to adhere to international standards, but with the wrong solution.Maybe the projected new standard for resource description and access RDA will be the solution, but that may take a while yet.
-
Replacing our ILS, business as usual
Posted on April 24th, 2009 2 commentsAs you may have noticed from some of my tweets, the Library of the Unversity of Amsterdam, my place of work, is in the process of replacing its ILS (Integrated Library System). All in all this project, or better these two projects (one selecting a new ILS, the other one implementing it) will have taken 18 months or more from the decision to go ahead until STP (Switch to Production), planned for August 15 this year. My colleague Bert Zeeman blogged about this (in Dutch) recently.
One thing that has become absolutely clear to me is that replacing an ILS is not just about replacing one information system by another. It is about replacing an entire organisational structure of work processes, with its huge impact on all people involved. And in our case it affects two organisations: besides the Library of the University of Amsterdam also the Media Library of the Hogeschool van Amsterdam. We have been managing library systems for both organisations in a mini consortial structure since a couple of years. So the Media Library is facing a second ILS replacement within two years.
While the decision was made because of pressing technical reasons, also with an eye on preparing for future library 2.0 developments, it turned out to be of substantial consequence for the organisation.
This is the first time that I am participating in such a radical library system project. I have done a couple of projects implementing and upgrading metasearching and OpenURL link resolver tools in the last six years, but these are nothing compared to the current project. With these “add-on” tools, that started as a means of extending the library’s primary stream of information, only a relatively limited number of people were involved. But with an ILS you are talking about the core business of a library (still!) and about day to day working life of everybody involved in acquisitions, cataloguing, circulation as well as system administrators and system librarians.To make it even more complicated, the University Library is also switching from the old system’s proprietary bibliographic format to MARC21, because that is what the new system is using. Personally I think that the old system’s format is better (just like our German colleagues think about their move from MAB to MARC), but of course the advantages of using an internationally accepted and used standard outweighs this, as always. Maybe food for another blog post later…
Last but not least, the Library is simultaneously doing a project for the implementation of RFID for self check machines. The initial idea was to implement RFID in the old system and then just migrate everything to the new one. However, for various reasons, recently it was decided to postpone RFID implementation to shortly after our ILS STP. Some initial tests have shown that this probably will work.
And while all this is going on, all normal work needs to be taken care of too: ” business as usual” .
Now, looking at workflows: the way that our individual departments have organised their workflows, is partly dictated by the way the old system is designed. The new system obviously dictates workflows too, but in other areas. Although this new system is very flexible and highly configurable, there are still some local requirements that cannot be met by the new system.
Of course this is NOT the way it should be! Systems should enable us to do what we want and how we want it! Hopefully new developments like Ex Libris’ URM and the very recently announced new OCLS WorldCat Web based ILS will take care of users better.Talking about “very flexible and highly configurable”: although a very big advantage, this also makes it much more complicated and time consuming to implement the new system. Fortunately there are a lot of other libraries in The Netherlands and around the world using the new system that are willing to help us in every possible way. And this is highly appreciated!
Other isues that make this project complicated:
- unexpected issues, bottlenecks: these keep on coming
- migration of data from old system: conversion of old to new format
- implementing links with external systems like student’s and staff database, financial system, national union catalogue
I think we will make STP on the planned date, but I also think we need to postpone a number of issues until after that. There will still be a lot of work to be done for my department after the project has finished.
To end with a positive note: the new OPAC wil be much nicer and more flexible than the old one. And in the end that is what we are doing this for: our patrons.






























Recent Comments