Why use a non-normalised metadata exchange format for suboptimal data storage?
This week I had a nice chat with André Keyzer of Groningen University library and Peter van Boheemen of Wageningen University Library who attended OCLC’s Amsterdam Mashathon 2009. As can be expected from library technology geeks, we got talking about bibliographic metadata formats, very exciting of course. The question came up: what on earth could be the reason for storing bibliographic metadata in exchange formats like MARC?
Exactly my idea! As a matter of fact I think I may have used the same words a couple of times in recent years, probably even at ELAG2008. The thing is: it really does not matter how you store bibliographic metadata in your database, as long as you can present and exchange the data in any format requested, be it MARC or Dublin Core or anything else.
Of course the importance of using internationally accepted standards is beyond doubt, but there clearly exists widespread misunderstanding of the functions of certain standards, like for instance MARC. MARC is NOT a data storage format. In my opinion MARC is not even an exchange format, but merely a presentation format.
With a background in data modeling, database and systems design (among other things) and no librarian training at all, I was quite amazed by bibliographic metadata formats when I started working with library systems in libraries. Of course, MARC (“MAchine Readable Cataloging record”) was invented as a standard to facilitate the exchange of library catalog records in a digital era.
But I think MARC was invented by old school cataloguers who did not have a clue about data normalisation at all. A MARC record, especially if it corresponds to an official set of cataloging rules like AACR2, is nothing more than a digitised printed catalog card.
In pre-computer times it made perfect sense to have a standardised uniform way of registering bibliographic metadata on a printed card in this way. The catalog card was simultaneously used as a medium for presenting AND storing metadata. This is where the confusion originates from!
But when the Library of Congress says “If a library were to develop a ‘home-grown’ system that did not use MARC records, it would not be taking advantage of an industry-wide standard whose primary purpose is to foster communication of information”, it is talking plain nonsense.
Actually it is better NOT to use something like MARC for purposes other than exchanging, or better, presenting data. To illustrate this I will give two examples of MARC tags that have been annoying me since my first day as a library employee:
- 100 – Main Entry-Personal Name (NR) – subfield $a – Personal name (NR)
- 773 – Host Item Entry (R) – subfield $g – Relationship information (R)
100 – Main Entry-Personal Name
Storing an author’s name as a string in each individual bibliographic record, instead of using a code that links to a central authority table (a “foreign key” in relational database terms), is one mistake. Storing a person’s name as one complete string in one field is another. Examples on the Library of Congress MARC website use forms like “Adams, Henry”, “Fowler, T. M.” and “Blackbeard, Author of”. To take only the simple first example, this author could also be registered as “Henry Adams”, “Adams, H.” or “H. Adams”. And don’t say that these forms are not according to the rules! They are out there! There is no reliable way to match these variations as being one and the same person.
In a normalised relational database, this subfield $a would be stored something like this (simplified!):
- Last name=Adams
- First name=Henry
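As a rough illustration of the authority-table idea, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`author`, `bib_record`), the column names, the example titles and the three display styles are all assumptions made up for this example; they are not taken from any actual library system.

```python
import sqlite3

# In-memory database; the schema and names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (
        author_id  INTEGER PRIMARY KEY,
        last_name  TEXT NOT NULL,
        first_name TEXT
    );
    CREATE TABLE bib_record (
        record_id  INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        author_id  INTEGER REFERENCES author(author_id)  -- a code, not a name string
    );
""")

# The author is stored once; both records only point to it.
conn.execute("INSERT INTO author VALUES (1, 'Adams', 'Henry')")
conn.executemany(
    "INSERT INTO bib_record VALUES (?, ?, ?)",
    [(1, "Example title A", 1), (2, "Example title B", 1)],
)


def format_name(last_name, first_name, style):
    """Build a display form from the stored name parts."""
    if style == "inverted":
        return f"{last_name}, {first_name}"
    if style == "direct":
        return f"{first_name} {last_name}"
    if style == "initial":
        return f"{last_name}, {first_name[0]}."
    raise ValueError(f"unknown style: {style}")


def author_of(record_id, style="inverted"):
    """Look up a record's author through the foreign key and format the name."""
    row = conn.execute(
        "SELECT a.last_name, a.first_name "
        "FROM bib_record r JOIN author a ON a.author_id = r.author_id "
        "WHERE r.record_id = ?",
        (record_id,),
    ).fetchone()
    return format_name(row[0], row[1], style)


print(author_of(1))            # inverted form
print(author_of(2, "direct"))  # same author, different presentation
```

Because both records carry the same `author_id`, matching them is a comparison of codes, and “Adams, Henry” versus “Henry Adams” becomes purely a presentation choice.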
773 – Host Item Entry
Subfield $g of this MARC tag is used for storing the citation information for a journal article: volume, issue, year, start page and end page, all in one string, like “Vol. 2, no. 2 (Feb. 1976), p. 195-230”. Again I have seen this used in many different ways. In a normalised format this would look something like this, using only the actual values:
- Volume=2
- Issue=2
- Year=1976
- Month=2
- Start page=195
- End page=230
When this normalised data record is presented, extra text such as “Vol.” or “Volume”, “Issue” or “No.” and brackets can be added, and codes can be replaced by descriptions (Month 2 = Feb.), according to the format required. So the stored values could be used to generate the text “Vol. 2, no. 2 (Feb. 1976), p. 195-230” on the fly, but also for instance “Volume 2, Issue 2, dated February 1976, pages 195-230”.
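Generating both presentations from the same stored values could look roughly like the following sketch. The `Citation` class, its field names and the two render functions are invented for this illustration, and the month abbreviation list is just one possible convention.

```python
from dataclasses import dataclass

# Display texts for the month code; index 0 corresponds to month 1.
MONTH_ABBR = ["Jan.", "Feb.", "Mar.", "Apr.", "May", "June",
              "July", "Aug.", "Sept.", "Oct.", "Nov.", "Dec."]
MONTH_FULL = ["January", "February", "March", "April", "May", "June",
              "July", "August", "September", "October", "November", "December"]


@dataclass
class Citation:
    """Only the actual values are stored, no labels or punctuation."""
    volume: int
    issue: int
    year: int
    month: int
    start_page: int
    end_page: int


def render_short(c):
    """Compact form, in the style of the 773 $g example."""
    return (f"Vol. {c.volume}, no. {c.issue} "
            f"({MONTH_ABBR[c.month - 1]} {c.year}), "
            f"p. {c.start_page}-{c.end_page}")


def render_long(c):
    """Spelled-out form built from exactly the same values."""
    return (f"Volume {c.volume}, Issue {c.issue}, "
            f"dated {MONTH_FULL[c.month - 1]} {c.year}, "
            f"pages {c.start_page}-{c.end_page}")


citation = Citation(volume=2, issue=2, year=1976, month=2,
                    start_page=195, end_page=230)
print(render_short(citation))
print(render_long(citation))
```

Adding a third presentation means adding one more render function; the stored record does not change.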
The strange thing about this bibliographic format aimed at exchanging metadata is that it actually makes metadata exchange terribly complicated, especially with these two tags, Author and Host Item. I can illustrate this by describing the way this exchange is handled between two digital library tools we use at the Library of the University of Amsterdam, MetaLib and SFX, both from the same vendor, Ex Libris.
The metasearch tool MetaLib uses the mechanism described and preferred above: on-the-fly conversion of received external metadata from any format to MARC for the purpose of presentation.
But if we want to use the retrieved record to link to, for instance, a full-text article using the SFX link resolver, the generated MARC data is used as the source, and the non-normalised data in the 100 and 773 MARC tags has to be converted to the OpenURL format, which actually is normalised (example in simple OpenURL 0.1):
isbn=;issn=0927-3255;date=1976; volume=2;issue=2;spage=195;epage=230; aulast=Adams;aufirst=Henry;auinit=;
In order to do this, all kinds of regular expressions and scripting functions are needed to extract the correct values from the MARC author and citation strings. Wouldn’t it be convenient if the record in MetaLib had already been in OpenURL or any other normalised format?
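To give an impression of what such extraction involves, here is a minimal sketch in Python. It is not the actual MetaLib or SFX code; the function names and the regular expression are assumptions written for this post, and they only handle the exact string shapes used in the examples above.

```python
import re

# Matches only citations shaped like "Vol. 2, no. 2 (Feb. 1976), p. 195-230".
CITATION_RE = re.compile(
    r"Vol\.\s*(?P<volume>\d+),\s*no\.\s*(?P<issue>\d+)\s*"
    r"\((?:[A-Za-z]+\.?\s+)?(?P<date>\d{4})\),\s*"
    r"p\.\s*(?P<spage>\d+)-(?P<epage>\d+)"
)


def parse_citation(subfield_g):
    """Extract OpenURL-style keys from a 773 $g string, or {} if it does not match."""
    match = CITATION_RE.search(subfield_g)
    if match is None:
        return {}
    return match.groupdict()


def parse_author(subfield_a):
    """Split a 100 $a string of the form 'Last, First' or 'Last, I. I.'."""
    last, comma, rest = subfield_a.partition(",")
    if not comma:
        return {}  # e.g. "Henry Adams": would need yet another heuristic
    rest = rest.strip()
    if re.fullmatch(r"(?:[A-Z]\.\s*)+", rest):
        # Only initials were given, so fill auinit instead of aufirst.
        return {"aulast": last.strip(),
                "auinit": rest.replace(".", "").replace(" ", "")}
    return {"aulast": last.strip(), "aufirst": rest}


def marc_to_openurl_keys(subfield_a, subfield_g):
    """Combine both extractions into one set of key/value pairs."""
    keys = {}
    keys.update(parse_citation(subfield_g))
    keys.update(parse_author(subfield_a))
    return keys


print(marc_to_openurl_keys("Adams, Henry",
                           "Vol. 2, no. 2 (Feb. 1976), p. 195-230"))
```

Even this small sketch shows the fragility: “Blackbeard, Author of” would end up with “Author of” as a first name, and any citation written as “Volume 2” instead of “Vol. 2” would not be recognised at all.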
The point I am trying to make is of course that it does not matter how metadata is stored, as long as it is possible to get the data out of the database in any format appropriate for the occasion. The SRU/SRW protocol is particularly aimed at precisely this: getting data out of a database in the required format, like MARC, Dublin Core, or anything else. An SRU server is a piece of middleware that receives requests, gets the requested data, converts the data and then returns the data in the requested format.
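As a small sketch of what this looks like from the client side, the following Python builds two SRU 1.1 `searchRetrieve` request URLs that differ only in the `recordSchema` parameter. The base URL and the CQL query are made up, and the schema names `marcxml` and `dc` are assumptions: which schema identifiers a server accepts is defined by that server.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, record_schema, maximum_records=10):
    """Build an SRU 1.1 searchRetrieve URL asking for a specific record format."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "recordSchema": record_schema,  # the format the client wants back
        "maximumRecords": str(maximum_records),
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical server address and query, for illustration only.
BASE_URL = "https://sru.example.org/catalogue"
QUERY = 'dc.creator = "Adams"'

print(sru_search_url(BASE_URL, QUERY, "marcxml"))
print(sru_search_url(BASE_URL, QUERY, "dc"))
```

The same query is sent both times; only the requested output format changes, and the way the records are stored behind the server stays invisible to the client.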
Currently at the Library of the University of Amsterdam we are migrating our ILS, which also involves converting our data from one bibliographic metadata format (PICA+) to another (MARC). This is extremely complicated, especially because of the non-normalised structure of both formats. And I must say that in my opinion PICA+ is even the better one.
Likewise, all German and Austrian libraries are meant to migrate from the MAB format to MARC, which also seems to be a move away from a superior format.
All because of the need to adhere to international standards, but with the wrong solution.
Maybe the projected new standard for resource description and access, RDA, will be the solution, but that may take a while yet.