Posted on October 16th, 2009 4 comments
Metasearch vs. harvesting & indexing
The other day I gave a presentation for the Assembly of members of the local Amsterdam Libraries Association “Adamnet“, about the Amsterdam Digital Library search portal that we host at the Library of the University of Amsterdam. This portal is built with our MetaLib metasearch tool and offers simultaneous access to, at the moment, 20 local library catalogues.
A large part of this presentation was dedicated to all possible (and very real) technical bottlenecks of this set-up, with the objective of improving coordination and communication between the remote system administrators at the participating libraries and the central portal administration. All MetaLib database connectors/configurations are “home-made”, and the portal highly depends on the availability of the remote cataloging systems.
I also took the opportunity to explain to my audience the “issues” inherent in the concept of metasearch (or “federated search”, “distributed search”, etc.), and to compare that to the harvesting & indexing scenario.
Because it was not the first (nor last) time that I had to explain the peculiarities of metasearch, I decided to take the Metasearch vs. Harvesting & Indexing part of the presentation and extend it to a dedicated slideshow. You can see it here, and you are free to use it. Examples/screenshots are taken from our MetaLib Amsterdam Digital Library portal. But everything said applies to other metasearch tools as well, like Webfeat, Muse Global, 360-Search, etc.
The slideshow is meant to be an objective comparison of the two search concepts. I am not saying that Metasearch is bad and H&I is good; that would be too easy. Some five years ago Metasearch was the best we had; it was tremendous progress beyond searching numerous individual databases separately. Since then we have seen the emergence of harvesting & indexing tools, combined with “uniform discovery interfaces”, such as Aquabrowser, Primo, Encore, and the open source tools VuFind, SUMMA and Meresco, to name a few.
Anyway, we can compare the main difference between Metasearch and H&I to the concepts “Just in time” and “Just in case“, used in logistics and inventory management.
With Metasearch, records are fetched on request (Just in time), with the risk of running into logistics and delivery problems. With H&I, all available records are already there (Just in case), but maybe not the most recent ones.
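To make the contrast concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the catalogue names, the `UNAVAILABLE` set and all function names are hypothetical, and the "remote" catalogues are simulated with in-memory lists rather than real connectors.

```python
# Hypothetical remote catalogues, simulated as in-memory lists of titles.
CATALOGUES = {
    "library_a": ["Amsterdam canals", "Dutch painting"],
    "library_b": ["Amsterdam history"],
}
UNAVAILABLE = set()  # names of catalogues that are currently down


def fetch_remote(name, term=""):
    """Simulate a live query against one remote catalogue."""
    if name in UNAVAILABLE:
        raise ConnectionError(f"{name} is unavailable")
    return [r for r in CATALOGUES[name] if term.lower() in r.lower()]


def metasearch(term):
    """Just in time: query every remote source at search time."""
    hits, failed = [], []
    for name in CATALOGUES:
        try:
            hits.extend(fetch_remote(name, term))
        except ConnectionError:
            failed.append(name)  # results from this source are simply missing
    return hits, failed


def harvest():
    """Just in case: copy all records into a local index beforehand."""
    index = []
    for name in CATALOGUES:
        index.extend(fetch_remote(name))
    return index


def search_index(index, term):
    """Search the local copy only; no remote system is contacted."""
    return [r for r in index if term.lower() in r.lower()]
```

If a catalogue goes down after the harvest, `metasearch` loses its records while `search_index` still finds them. A record added after the harvest shows the opposite: metasearch sees it, the index does not until the next harvest.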
Objectively of course, H&I can solve the problems inherent in Metasearch, and therefore is a superior solution. However, a number of institutions, mainly general academic libraries, will for some time depend on databases that can’t be harvested because of technical, legal or commercial reasons.
In other cases, H&I is the best option, for instance in the case of cooperating local or regional libraries, such as Adamnet, or dedicated academic or research libraries that only depend on a limited number of important databases and catalogs.
But I also believe that the real power of H&I can only be exploited if institutions cooperate and maintain shared central indexes, instead of each building their own redundant metadata stores. This already happens, for instance in Denmark, where the Royal Library uses Primo to access the national DADS database.
We also see commercial hosted H&I initiatives implemented as SaaS (Software as a Service) by both tool vendors and database suppliers, like Ex Libris’ PrimoCentral, SerialSolutions’ Summon and EBSCOhost Integrated Search.
The funny thing is that if you want to take advantage of all these hosted harvested indexes, you are likely to end up with a hybrid kind of metasearch situation where you distribute searches to a number of remote H&I databases.
Posted on October 6th, 2009 1 comment
What will library staff do 5 years from now?
I attended the IGeLU 2009 annual conference in Helsinki September 6-9. IGeLU is the International Group of Ex Libris Users, an independent organisation that represents Ex Libris customers. Just to state my position clearly I would like to add that I am a member of the IGeLU Steering Committee.
These annual user group meetings typically have three types of sessions: internal organisational sessions (product working groups and steering committee business meetings, elections), Ex Libris sessions (product updates, Q&A, strategic visions), and customer sessions (presentations of local solutions, addons, developments).
Not surprisingly, the main overall theme of this conference was the future of library systems and libraries. The word that characterises the conference best in my mind (besides “next generation” and “metaphor”) is “roadmap”. All Ex Libris products, but also all attending libraries, are on their way to something new, which strangely enough is still largely uncertain.
Ex Libris presented the latest state of design and development of their URM (Unified Resource Management) project, ‘A New Model for Next-generation Library Services’. In the final URM environment all back end functionality of all current Ex Libris products will be integrated into one big modular system, implemented in a SaaS (“Software as a Service“) architecture. In the Ex Libris vision the front end to this model will be their Primo Indexing and Discovery interface, but all URM modules will have open API’s to enable using them with other tools.
The goal of this roadmap apparently is efficiency in the areas of technical and functional system administration for libraries.
In the meantime, development of existing products is geared towards final inclusion in URM. All future upgrades will result in what I would like to call “intermediate” instead of “next generation” products. MetaLib, the metasearch or federated search tool, will be replaced by MetaLib Next Generation, with a re-designed metasearch engine and a Primo front end. The digital collection management tool DigiTool will be merged into its new and bigger nephew Rosetta, the digital preservation system. The database of the OpenURL resolver SFX will be restructured to accommodate the URM data model. The next version of Verde (electronic resource management) will effectively be URM version 1, which will also be usable as an alternative for both ILSes, Voyager and Aleph.
Here we see a kind of “intermediate” roadmap to different “base camps” from where the travelers can try to reach their final destination.
From the perspective of library staff we see another panorama appearing.
In one of the customer presentations Janet Lute of Princeton University Library, one of the three (now four) URM development partners, mentioned a couple of “holy cows” or library tasks they might consider stopping doing while on their way to the new horizon:
- managing prediction patterns for journal issues
- checking in print serials
- maintaining lots of circulation matrices and policies
- collecting fines
- cataloging over 80% of bibliographic records
I would like to add my own holy cow, MARC, to this list; I wrote about it in a previous post, “Who needs MARC?”. (Some other developments in this area are self service, approval plans, shared cataloging, digitisation, etc.)
This roadmap is supposed to lead to more efficient work and less pressure for acquisitions, cataloging and circulation staff.
Eldorado or Brave New World?
To summarise: we see a sketchy roadmap leading us via all kinds of optional intermediate stations to an as yet still vague and unclear Eldorado of scholarly information disclosure and discovery.
The majority of public and professional attention is focused on discovery: modern web 2.0 front ends to library collections, and the benefits for the libraries’ end users. But it is probably even more important to look at the other side, disclosure: the library back end, and the consequences of all these developments for library staff, both technically oriented system administrators and professionally oriented librarians.
Future efficient integrated and modular library systems will no doubt eliminate a lot of tasks performed by library staff, but does this mean there will be no more library jobs?
Will the university library of the future be “sparsely staffed, highly decentralized, and have a physical plant consisting of little more than special collections and study areas“, as was stated recently in an article in “Inside Higher Education”? I mentioned similar options in “No future for libraries?“.
Personally I expect that the two far ends of the library jobs spectrum will merge into a single generic job type which we can truly call “system librarian“, as I stated in my post “System librarians 2.0“. But what will these professionals do? Will they catalog? Will they configure systems? Will they serve the public? Will they develop system add-ons?
This largely depends on how the new integrated systems will be designed and implemented, how systems and databases from different vendors and providers will be able to interact, how much libraries/information management organisations will outsource and crowdsource, how much library staff is prepared to rethink existing workflows, how much libraries want to distinguish themselves from other organisations, how much end users are interested in differences between information management organisations; in brief: how much these new platforms will allow us to do ourselves.
We have to come up with a realistic image of ourselves for the next couple of decades soon, otherwise our publishers and system vendors will be doing it for us.
Posted on May 15th, 2009 22 comments
Why use a non-normalised metadata exchange format for suboptimal data storage?
This week I had a nice chat with André Keyzer of Groningen University library and Peter van Boheemen of Wageningen University Library who attended OCLC’s Amsterdam Mashathon 2009. As can be expected from library technology geeks, we got talking about bibliographic metadata formats, very exciting of course. The question came up: what on earth could be the reason for storing bibliographic metadata in exchange formats like MARC?
Exactly my idea! As a matter of fact I think I may have used the same words a couple of times in recent years, probably even at ELAG2008. The thing is: it really does not matter how you store bibliographic metadata in your database, as long as you can present and exchange the data in any format requested, be it MARC or Dublin Core or anything else.
Of course the importance of using internationally accepted standards is beyond doubt, but there clearly exists widespread misunderstanding of the functions of certain standards, like for instance MARC. MARC is NOT a data storage format. In my opinion MARC is not even an exchange format, but merely a presentation format.
With a background and experience in data modeling, database and systems design (among others), and no training as a librarian at all, I was quite amazed by bibliographic metadata formats when I started working with library systems in libraries. Of course, MARC (“MAchine Readable Cataloging record”) was invented as a standard in order to facilitate exchange of library catalog records in a digital era.
But I think MARC was invented by old school cataloguers who did not have a clue about data normalisation at all. A MARC record, especially if it corresponds to an official set of cataloging rules like AACR2, is nothing more than a digitised printed catalog card.
In pre-computer times it made perfect sense to have a standardised uniform way of registering bibliographic metadata on a printed card in this way. The catalog card was simultaneously used as a medium for presenting AND storing metadata. This is where the confusion originates from!
But when the Library of Congress says “If a library were to develop a “home-grown” system that did not use MARC records, it would not be taking advantage of an industry-wide standard whose primary purpose is to foster communication of information” it is saying just plain nonsense.
Actually it is better NOT to use something like MARC for other purposes than exchanging, or better, presenting data. To illustrate this I will give two examples of MARC tags that have been annoying me since my first day as a library employee:
- 100 – Main Entry-Personal Name (NR) – subfield $a – Personal name (NR)
- 773 – Host Item Entry (R) – subfield $g – Relationship information (R)
100 – Main Entry-Personal Name
Besides storing an author’s name as a string in each individual bibliographic record instead of using a code, linking to a central authority table (“foreign key” in relational database terms), it is also a mistake to use a person’s name as one complete string in one field. Examples on the Library of Congress MARC website use forms like “Adams, Henry”, “Fowler, T. M.” and “Blackbeard, Author of”. To take only the simple first example, this author could also be registered as “Henry Adams”, “Adams, H.”, “H. Adams”. And don’t say that these forms are not according to the rules! They are out there! There is no way to match these variations as being actually one and the same.
In a normalised relational database, this subfield $a would be stored something like this (simplified!):
- Last name=Adams
- First name=Henry
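As a rough sketch of this idea in Python (not a real library system schema: the `Person` class, the `AUTHORITIES` table and the `author_id` key are all hypothetical names chosen for illustration), the bibliographic record holds only a key into a central authority table, and every display form is generated from the separate name parts:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    last_name: str
    first_name: str


# Hypothetical authority table: one row per person, keyed by an identifier.
AUTHORITIES = {1: Person(last_name="Adams", first_name="Henry")}

# The bibliographic record stores only the key, not the name string.
record = {"title": "Example title", "author_id": 1}


def display_forms(person):
    """Generate the common presentation variants from the stored parts."""
    initial = person.first_name[0] + "."
    return [
        f"{person.last_name}, {person.first_name}",
        f"{person.first_name} {person.last_name}",
        f"{person.last_name}, {initial}",
        f"{initial} {person.last_name}",
    ]
```

Calling `display_forms(AUTHORITIES[record["author_id"]])` yields all four variants mentioned above from a single stored person, so there is nothing to match afterwards.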
773 – Host Item Entry
Subfield $g of this MARC tag is used for storing citation information for a journal article, volume, issue, year, start page, end page, all in one string, like: “Vol. 2, no. 2 (Feb. 1976), p. 195-230“. Again I have seen this used in many different ways. In a normalised format this would look something like this, using only the actual values:
- Volume=2
- Issue=2
- Year=1976
- Month=2
- Start page=195
- End page=230
In a presentation of this normalised data record extra text can be added like “Vol.” or “Volume“, “Issue” or “No.“, brackets, replacing codes by descriptions (Month 2 = Feb.) etc., according to the format required. So the stored values could be used to generate the text “Vol. 2, no. 2 (Feb. 1976), p. 195-230” on the fly, but also for instance “Volume 2, Issue 2, dated February 1976, pages 195-230“.
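A minimal Python sketch of such on-the-fly presentation might look like this. The dictionary keys and function names are my own assumptions for illustration; only the two output strings come from the example above.

```python
MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)

# The stored, normalised values only; no labels, brackets or punctuation.
citation = {
    "volume": 2, "issue": 2, "year": 1976, "month": 2,
    "start_page": 195, "end_page": 230,
}


def short_form(c):
    """Render the compact citation style, with an abbreviated month."""
    month = MONTH_NAMES[c["month"] - 1][:3] + "."
    return (f"Vol. {c['volume']}, no. {c['issue']} ({month} {c['year']}), "
            f"p. {c['start_page']}-{c['end_page']}")


def long_form(c):
    """Render the verbose citation style, with the full month name."""
    month = MONTH_NAMES[c["month"] - 1]
    return (f"Volume {c['volume']}, Issue {c['issue']}, dated {month} "
            f"{c['year']}, pages {c['start_page']}-{c['end_page']}")
```

Both texts come from the same stored values; adding a third presentation style means adding one more function, not touching the data.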
The strange thing about this bibliographic format aimed at exchanging metadata is that it actually makes metadata exchange terribly complicated, especially with these two tags, Author and Host Item. I can illustrate this by describing the way this exchange is handled between two digital library tools we use at the Library of the University of Amsterdam, MetaLib and SFX, both from the same vendor, Ex Libris.
The metasearch tool MetaLib is using the described and preferred mechanism of on the fly conversion of received external metadata from any format to MARC for the purpose of presentation.
But if we want to use the retrieved record to link to for instance a full text article using the SFX link resolver, the generated MARC data is used as a source and the non-normalised data in the 100 and 773 MARC tags has to be converted to the OpenURL format, which is actually normalised (example in simple OpenURL 0.1):
isbn=;issn=0927-3255;date=1976;volume=2;issue=2;spage=195;epage=230;aulast=Adams;aufirst=Henry;auinit=;
In order to do this all kinds of regular expressions and scripting functions are needed to extract the correct values from the MARC author and citation strings. Wouldn’t it be convenient if the record in MetaLib were already in OpenURL or any other normalised format?
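To give an impression of what such extraction involves, here is a small Python sketch. It is not the actual MetaLib or SFX code; the patterns and the `marc_to_openurl` function are my own assumptions, and they only handle the one citation style and the one name style used as examples in this post.

```python
import re

# Matches only strings shaped like "Vol. 2, no. 2 (Feb. 1976), p. 195-230".
CITATION_RE = re.compile(
    r"Vol\.\s*(?P<volume>\d+),\s*no\.\s*(?P<issue>\d+)\s*"
    r"\((?:[A-Za-z]+\.?\s*)?(?P<date>\d{4})\),\s*"
    r"p\.\s*(?P<spage>\d+)-(?P<epage>\d+)"
)
# Matches only the inverted form "Last, First".
AUTHOR_RE = re.compile(r"(?P<aulast>[^,]+),\s*(?P<aufirst>.+)")


def marc_to_openurl(author_100a, citation_773g, issn=""):
    """Extract OpenURL-style key/value pairs from two MARC strings."""
    fields = {"issn": issn}
    citation = CITATION_RE.search(citation_773g)
    if citation:
        fields.update(citation.groupdict())
    author = AUTHOR_RE.fullmatch(author_100a.strip())
    if author:
        fields["aulast"] = author.group("aulast").strip()
        fields["aufirst"] = author.group("aufirst").strip()
    return fields
```

With the strings "Adams, Henry" and "Vol. 2, no. 2 (Feb. 1976), p. 195-230" this recovers all the values. With a variant such as "Henry Adams" or "2(2), 1976, 195-230" nothing matches and only the ISSN survives, which is exactly why every variation in practice needs yet another pattern.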
The point I am trying to make is of course that it does not matter how metadata is stored, as long as it is possible to get the data out of the database in any format appropriate for the occasion. The SRU/SRW protocol is particularly aimed at precisely this: getting data out of a database in the required format, like MARC, Dublin Core, or anything else. An SRU server is a piece of middleware that receives requests, gets the requested data, converts the data and then returns the data in the requested format.
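The middleware idea can be sketched in a few lines of Python. This is not an SRU implementation: the stored record, the schema names "marc" and "dc", and all function names are hypothetical, and the outputs are simplified stand-ins for real MARC and Dublin Core serialisations.

```python
# Normalised record as it might be stored internally (hypothetical layout).
STORED = {"last_name": "Adams", "first_name": "Henry", "title": "Example title"}


def to_marc_like(rec):
    """Build a simplified MARC-style view: (tag, subfields) pairs."""
    return [
        ("100", {"a": f"{rec['last_name']}, {rec['first_name']}"}),
        ("245", {"a": rec["title"]}),
    ]


def to_dublin_core(rec):
    """Build a simplified Dublin Core-style view: element name to value."""
    return {
        "creator": f"{rec['last_name']}, {rec['first_name']}",
        "title": rec["title"],
    }


# One converter per supported output format.
CONVERTERS = {"marc": to_marc_like, "dc": to_dublin_core}


def handle_request(record_schema, rec=STORED):
    """Get the stored data and return it in the requested format."""
    try:
        convert = CONVERTERS[record_schema]
    except KeyError:
        raise ValueError(f"unsupported record schema: {record_schema}") from None
    return convert(rec)
```

The storage format never leaks out: supporting a new output format means registering one more converter, while the stored record stays as it is.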
Currently at the Library of the University of Amsterdam we are migrating our ILS which also involves converting our data from one bibliographic metadata format (PICA+) to another (MARC). This is extremely complicated, especially because of the non-normalised structure of both formats. And I must say that in my opinion PICA+ is even the better one.
Also all German and Austrian libraries are meant to migrate from the MAB format to MARC, which also seems to be a move away from a superior format.
All because of the need to adhere to international standards, but with the wrong solution.
Maybe the projected new standard for resource description and access RDA will be the solution, but that may take a while yet.
Posted on December 14th, 2008 No comments
Last month I had the opportunity to participate in the first official Ex Libris “Developers meet developers” meeting in Jerusalem, November 12-13, 2008. The meeting was dedicated to the new Open Platform strategy that Ex Libris has adopted. I already mentioned this development in my post “How open are open systems?”. Together with one of the other attendees, Mark Dehmlow, of Notre Dame University Library, I wrote a short report on this meeting in the IGeLU newsletter issue 2, 2008, pages 21-22.
The intention of this event was that representatives from Ex Libris customer institutions that use Ex Libris’ Digital Library tools Aleph, SFX, MetaLib, and Primo and are actively involved in developing plug-ins, add-ons and extensions to one or more of these products, and Ex Libris staff involved in development of these tools, had the chance to meet face to face and talk, discuss and exchange ideas from both sides.
The political, cultural and social circumstances of the location of the event (about which I blogged some personal thoughts here) are such that I can’t resist the temptation of using them as a metaphor, although I am fully aware that the actual situation in Jerusalem is of course much more complicated. I apologise in advance if I unintentionally offend anyone by using the serious real world situation in an inappropriate way.
So, let’s give it a try: in Jerusalem there are a number of separate areas for different population groups. In general there are the Jewish western part and the Arab eastern part. But there is also the old city right in the middle, with Jewish, Arab, Christian and Armenian quarters. Besides that you can also see separate neighbourhoods within the Jewish part with different Jewish groups. And last but not least, right in the middle of the Christian quarter there is the Church of the Holy Sepulchre, with corners for almost all christian religious groups. Very fascinating and intriguing.
Although there are no physical borders between these areas, the complicated and serious political, social and cultural circumstances prevent most people from visiting their neighbours in their own areas. Now here comes the metaphor! In the world of information systems you normally have a similar situation of “us and them”. Customers and users often think that providers of systems do not take them seriously and give them tools they can’t work with, and the other way around, system developers often see end users as nagging bores, never satisfied and complaining about everything.
Customers and providers inhabit the same space, like Jerusalem, but do not cross the imaginary border to really meet.
This is why it is so remarkable that the “crossing of the border” between Ex Libris customers and developers actually happened in Jerusalem. Of course I immediately must add that Ex Libris has always favoured open systems for customers to use in their own way, and supports the international user groups, but an actual face-to-face meeting on the level of developers is something different.
From personal experience I know that it is very easy for situations to get out of hand if there is no real communication and no willingness for mutual understanding. That is why I think that it is absolutely vital that meetings like this can continue to take place. From the customers’ side the user groups IGeLU and ELUNA are fully dedicated to this goal, and I really hope that Ex Libris is also serious about it.
In this month of Christmas, Chanuka and Eid Al-Adha, let me end with the wish for better understanding on the personal, professional and global level!
Posted on October 12th, 2008 2 comments
In my post “LING – Library integration next generation” I mentioned Marshall Breeding’s presentation at TICER, “Library Automation Challenges for the next generation”.
Besides “Moving toward new generation of library automation” one of his other two topics was “A Mandate for Openness”, about Open Source, Open Systems, Open Content.
Marshall Breeding distinguishes five types of Open Systems, three of which in my view are the most important:
- Closed Systems: black boxes, only accessible via the user interfaces provided by the developer, no programmable access to the system
- Open Source Model: all aspects of the system available to inspection and modification
- Open API model: the core application is closed and accessible via the user interfaces provided by the developer, but third party developers can create code against the published API’s or database tables
(The other two types are intermediate or combined types: “standard RDBM systems” where third party developers can access the database schema, which in my view contains only part of the system’s data; and “Open Source/Open API”).
Especially the “Open API Model” is an interesting development for most libraries that work with commercial library systems. I have had some experience with two initiatives in this field: OCLC’s “WorldCat Grid“, and Ex Libris’ “Open Platform“. A big and important difference between these two is: WorldCat Grid is about access to a specific database already available to the public at large, Ex Libris’ Open Platform is about access to a number of commercial systems.
Interestingly, both initiatives consist of two parts: a set of open API’s and an open developers’ platform. These two parts make it possible to have a kind of marriage between commercial systems and an open source community. But how does this work in real life, how open is access to both the API’s and the Platform?
Some of OCLC’s WorldCat Grid Services are freely accessible, others are accessible for OCLC members only.
Membership of the WorldCat Grid Developers’ Network is available to “IT professionals from: OCLC member institutions, content providers, other software vendors and publishers, as well as bloggers and others in the library field who see value in a collaborative network related to the development of new functionality for the WorldCat Grid.”
“Software code, snippets and API’s developed within the network will be openly available for members, and the world-at-large, to use and re-use.”
With Ex Libris’ Open Platform, access to the Developers’ Platform is only open for Ex Libris customers.
Access to the existing API mechanisms (“X-Server” for the products Aleph, MetaLib, SFX, and Webservices for Primo) is also only available to Ex Libris customers. What will happen with newly developed API’s (conforming to new API standards like DLF ILS-Discovery Interface protocol) for new products is still unclear.
In my view it does make sense to restrict availability of Open API’s to members or customers in the case of access to licensed metadata or resources. But availability of Open API’s that access public data should be free to all.
It does NOT make sense to restrict access to tools developed on top of the Open API’s to members or customers only.
Granting access to data should be the privilege of the owners of the data, granting access to tools that access data should be the privilege of the developers/owners of these tools.
In this respect the OCLC platform is more open than Ex Libris, but it still is not completely open.
Of course, this is all highly dependent on the motives of the companies for supporting Openness: is it commitment to openness, or fear of losing customers?
Posted on October 8th, 2008 No comments
The project for implementation of Aleph as the new ILS for the Library of the University of Amsterdam started last week (October 2) with the official kick-off meeting. The Ex Libris project plan was presented to the members of the project team, bottlenecks were identified, and a lot of adjustments were made to the planning in order to be able to carry out more tasks simultaneously and thus earlier in time.
The first steps are the installation of Aleph 18 and on-site training for all people involved, using the new locally installed Aleph 18 system.
But of course, before everything can start, we need the hardware! The central ICT department of the University of Amsterdam (not part of the library) is responsible for configuring and providing the Aleph production server according to the official Ex Libris “Requirements for ALEPH 500 Installation”. And as always there is confusion about what is actually meant by the provider, and as always there are conflicts between the provider’s requirements and the ICT department’s security policy.
As head of the Library Systems Department of the library and as coordinator of the project’s System Administration team, I have been acting this week as an intermediary between our software and hardware providers, passing information about volumes, partitions, database and scratch files, root access, IP addresses and swap areas.
This makes you realise again that all these new web 2.0 systems and techniques are in the end completely dependent on the availability of correctly configured and constantly monitored machines, cables and electricity, and not least on all the technicians who know all about hardware and networks.
Posted on October 3rd, 2008 1 comment
I have had this domain name for a long time, before I started working with digital library systems, even before I knew about them. It was January 2000, at the peak of web 1.0.
My main motive was that I wanted to have an email address that I would not have to change every so often because of disappearing free email providers (my first email address was something at crosswinds.net). But I also wanted to create some kind of bridge or virtual meeting place for the different fields I was interested in, art, history, IT, etc.
There were no blogs or blogging software or any modern web2.0 tools, I had to do everything with HTML and CSS.
A funny thing is that Pam’s Paper Pills blog (© photo) compares old “commonplace books” with “modern blogs”.
My first real project that attracted some attention was my “Short guide to free email” .
A couple of years later I found myself kind of “in between careers”, moving away from IT and system development into what I then expected to be arts and humanities. I actually found myself somewhere in the middle in the end (where I still am right now).
I started adding more “literary” and “historical” texts to my website. But I never really got it going.
Until web 2.0 came along. First I moved everything to a WordPress environment, but I still did not have real content. I played around with a couple of different approaches, and finally I decided to start a blog on digital libraries. It would be one of many, but it would automatically be part of the current “virtual community” of the blogosphere and the web at large.
It took some time to think of topics that are not really covered by other well known bloggers. Matters were complicated by the fact that I also have another site, that I had started using for a kind of “personal” blogging (http://lukask.blogspot.com).
But I think that in the next couple of years I may have a lot to blog about. I will be heavily involved in the implementation of Aleph at the Library of the University of Amsterdam, I have just been elected as a member of the Steering Committee of IGeLU (International Group of Ex Libris Users), we intend to get more involved in the new Ex Libris developers platform, and of course there is Ex Libris’ new URM/URD2 strategy to follow.
So, I hope this will be the first of many library2.0 blog posts.