Permalink: https://purl.org/cpl/502
Why use a non-normalised metadata exchange format for suboptimal data storage?

This week I had a nice chat with André Keyzer of Groningen University Library and Peter van Boheemen of Wageningen University Library, who attended OCLC’s Amsterdam Mashathon 2009. As can be expected from library technology geeks, we got talking about bibliographic metadata formats, very exciting of course. The question came up: what on earth could be the reason for storing bibliographic metadata in exchange formats like MARC?
When asked once at an ELAG conference about the bibliographic format Wageningen University was using in their home-grown catalog system, Peter answered: “WDC” … “we don’t care”.
Exactly my idea! As a matter of fact I think I may have used the same words a couple of times in recent years, probably even at ELAG2008. The thing is: it really does not matter how you store bibliographic metadata in your database, as long as you can present and exchange the data in any format requested, be it MARC or Dublin Core or anything else.
Of course the importance of using internationally accepted standards is beyond doubt, but there clearly exists widespread misunderstanding of the functions of certain standards, like for instance MARC. MARC is NOT a data storage format. In my opinion MARC is not even an exchange format, but merely a presentation format.

With a background and experience in, among other things, data modeling, database and systems design, I was quite amazed by bibliographic metadata formats when I started working with library systems in libraries, without any librarian training at all. Of course, MARC (“MAchine Readable Cataloging record”) was invented as a standard in order to facilitate the exchange of library catalog records in a digital era.
But I think MARC was invented by old-school cataloguers who did not have a clue about data normalisation at all. A MARC record, especially if it corresponds to an official set of cataloging rules like AACR2, is nothing more than a digitised printed catalog card.
In pre-computer times it made perfect sense to have a standardised, uniform way of registering bibliographic metadata on a printed card in this way. The catalog card was simultaneously used as a medium for presenting AND storing metadata. This is where the confusion originates!

But when the Library of Congress says “If a library were to develop a “home-grown” system that did not use MARC records, it would not be taking advantage of an industry-wide standard whose primary purpose is to foster communication of information”, it is talking plain nonsense.
Actually it is better NOT to use something like MARC for purposes other than exchanging, or rather presenting, data. To illustrate this I will give two examples of MARC tags that have been annoying me since my first day as a library employee:
- 100 – Main Entry-Personal Name (NR) – subfield $a – Personal name (NR)
- 773 – Host Item Entry (R) – subfield $g – Relationship information (R)
100 – Main Entry-Personal Name
Besides the mistake of storing an author’s name as a string in each individual bibliographic record instead of using a code linking to a central authority table (a “foreign key” in relational database terms), it is also a mistake to store a person’s name as one complete string in one field. Examples on the Library of Congress MARC website use forms like “Adams, Henry”, “Fowler, T. M.” and “Blackbeard, Author of”. To take only the first, simple example: this author could also be registered as “Henry Adams”, “Adams, H.” or “H. Adams”. And don’t say that these forms are not according to the rules! They are out there! There is no way to match these variations as being actually one and the same person.
In a normalised relational database, this subfield $a would be stored something like this (simplified; a sketch in code follows below the list):
- Person
  - Surname=Adams
  - First name=Henry
  - Prefix=
  - …
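To make this concrete, here is a minimal sketch of such normalised storage, assuming a simple relational schema; the table and column names and the sample title are purely illustrative, not taken from any real system:

```python
import sqlite3

# Illustrative schema only: one authority table for persons, and a
# bibliographic record that stores a foreign key instead of a name string.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person (
        person_id  INTEGER PRIMARY KEY,
        surname    TEXT NOT NULL,
        first_name TEXT,
        prefix     TEXT
    );
    CREATE TABLE bib_record (
        record_id INTEGER PRIMARY KEY,
        title     TEXT,
        author_id INTEGER REFERENCES person(person_id)
    );
""")
con.execute("INSERT INTO person VALUES (1, 'Adams', 'Henry', NULL)")
con.execute("INSERT INTO bib_record VALUES (10, 'Some title', 1)")

# Any display form ("Adams, Henry", "H. Adams", ...) is generated from
# the single stored form; matching authors is now a key comparison.
row = con.execute("""
    SELECT p.surname || ', ' || p.first_name
    FROM bib_record b JOIN person p ON p.person_id = b.author_id
    WHERE b.record_id = 10
""").fetchone()
print(row[0])  # -> Adams, Henry
```

The point is that “Adams, Henry”, “Henry Adams” and “Adams, H.” all become mere presentation variants of one stored entity, instead of three unmatchable strings.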
773 – Host Item Entry
Subfield $g of this MARC tag is used for storing the citation information for a journal article (volume, issue, year, start page, end page) in one single string, like: “Vol. 2, no. 2 (Feb. 1976), p. 195-230”. Again, I have seen this used in many different ways. In a normalised format it would look something like this, using only the actual values:
- Journal
  - Volume=2
  - Issue=2
  - Year=1976
  - Month=2
  - Day=
  - Start page=195
  - End page=230
In a presentation of this normalised data record, extra text can be added, like “Vol.” or “Volume”, “Issue” or “No.”, brackets, and codes replaced by descriptions (month 2 = Feb.), according to the format required. So the stored values could be used to generate the text “Vol. 2, no. 2 (Feb. 1976), p. 195-230” on the fly, but also, for instance, “Volume 2, Issue 2, dated February 1976, pages 195-230”.
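As a sketch of what this on-the-fly rendering could look like, the fragment below generates both presentation forms from one set of stored values; the field names are my own invention for illustration:

```python
import calendar

# Hypothetical normalised citation record; the field names are illustrative.
citation = {"volume": 2, "issue": 2, "year": 1976, "month": 2,
            "start_page": 195, "end_page": 230}

def short_form(c):
    """Render the classic 'Vol. 2, no. 2 (Feb. 1976), p. 195-230' form."""
    month = calendar.month_abbr[c["month"]]  # 2 -> 'Feb'
    return (f"Vol. {c['volume']}, no. {c['issue']} ({month}. {c['year']}), "
            f"p. {c['start_page']}-{c['end_page']}")

def long_form(c):
    """Render the same stored values in a more verbose form."""
    month = calendar.month_name[c["month"]]  # 2 -> 'February'
    return (f"Volume {c['volume']}, Issue {c['issue']}, dated {month} "
            f"{c['year']}, pages {c['start_page']}-{c['end_page']}")

print(short_form(citation))  # Vol. 2, no. 2 (Feb. 1976), p. 195-230
print(long_form(citation))   # Volume 2, Issue 2, dated February 1976, pages 195-230
```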
The strange thing about this bibliographic format aimed at exchanging metadata is that it actually makes metadata exchange terribly complicated, especially with these two tags, Author and Host Item. I can illustrate this by describing the way this exchange is handled between two digital library tools we use at the Library of the University of Amsterdam, MetaLib and SFX, both from the same vendor, Ex Libris.
The metasearch tool MetaLib uses the described (and preferred) mechanism of on-the-fly conversion of received external metadata from any format to MARC, for the purpose of presentation.
But if we want to use the retrieved record to link to, for instance, the full text of an article using the SFX link resolver, the generated MARC data is used as the source, and the non-normalised data in the 100 and 773 MARC tags has to be converted to the OpenURL format, which actually is normalised (example in simple OpenURL 0.1):
isbn=;issn=0927-3255;date=1976;volume=2;issue=2;spage=195;epage=230;aulast=Adams;aufirst=Henry;auinit=;
In order to do this, all kinds of regular expressions and scripting functions are needed to extract the correct values from the MARC author and citation strings. Wouldn’t it be convenient if the record in MetaLib had already been in OpenURL, or any other normalised format?
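To give an impression of the kind of scraping this involves, here is a simplified sketch. The single pattern below only covers the one citation form shown above; a real conversion needs a whole battery of such patterns, one per variant encountered in the wild:

```python
import re

# One regular expression for one observed form of 773 $g; every other
# variant ("2(2):195-230", "vol. II, ...", ...) would need its own pattern.
g = "Vol. 2, no. 2 (Feb. 1976), p. 195-230"
pattern = re.compile(
    r"Vol\.\s*(?P<volume>\d+),\s*no\.\s*(?P<issue>\d+)\s*"
    r"\((?:\w+\.?\s*)?(?P<date>\d{4})\),\s*p\.\s*(?P<spage>\d+)-(?P<epage>\d+)"
)
m = pattern.search(g)
if m:
    # Assemble OpenURL 0.1-style key=value pairs from the captured groups.
    print(";".join(f"{k}={v}" for k, v in m.groupdict().items()))
    # -> volume=2;issue=2;date=1976;spage=195;epage=230
```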
The point I am trying to make is, of course, that it does not matter how metadata is stored, as long as it is possible to get the data out of the database in any format appropriate for the occasion. The SRU/SRW protocol is aimed at precisely this: getting data out of a database in the required format, like MARC, Dublin Core, or anything else. An SRU server is a piece of middleware that receives requests, fetches the requested data, converts it, and returns it in the requested format.
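For illustration, an SRU searchRetrieve request might look like the sketch below. The endpoint is hypothetical, and the available recordSchema values (marcxml, dc, …) depend on what a given server actually offers:

```python
from urllib.parse import urlencode

# Hypothetical SRU endpoint; the recordSchema names depend on the server.
BASE = "https://catalogue.example.org/sru"
params = {
    "operation": "searchRetrieve",           # standard SRU operation
    "version": "1.1",
    "query": 'dc.creator = "Adams, Henry"',  # CQL query
    "maximumRecords": "10",
    "recordSchema": "marcxml",               # or "dc" for Dublin Core
}
print(f"{BASE}?{urlencode(params)}")
```

The same stored data, one extra parameter, and the client decides whether it gets MARCXML or Dublin Core back.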
Currently, at the Library of the University of Amsterdam, we are migrating our ILS, which also involves converting our data from one bibliographic metadata format (PICA+) to another (MARC). This is extremely complicated, especially because of the non-normalised structure of both formats. And I must say that, in my opinion, PICA+ is even the better one.
All German and Austrian libraries are also meant to migrate from the MAB format to MARC, which likewise seems to be a move away from a superior format.
All because of the need to adhere to international standards, but with the wrong solution.
Maybe the projected new standard for resource description and access, RDA, will be the solution, but that may take a while yet.
22 thoughts on “Who needs MARC?”
Very interesting! As someone with a non-librarian background this helps me understand 😉
I guess that, as with AACR2, we have almost all capitulated to the American systems because they’re the American systems.
So what alternative do you suggest?
“WDC” is a nice Peter-like answer, but it does not help very much in the real world. Or am I missing the point?
RDA is certainly not the answer. It’s no more than a very complicated set of rules for librarians about how to describe resources, not about the format to describe the resource in. Better than AACR2 or Fobid, but not an answer to your question.
As far as I can tell there is no other internationally widely used standard format for describing bibliographic objects which could replace MARC at this moment. Marc has got its flaws; so does any other format.
I do support “WDC”: please make your own choices for internal storage, but we also need some workable standard. Marc supplies just that.
Bas, I think we actually agree! “We don’t care” what internal format we use, but we need a standard format for exchanging data. At the moment that standard is MARC, with its flaws. But it is the best we have now. But only for data exchange!
Of course one could make use of OAI-PMH conversion, like the XC project showed.
If MARC served a retrieval purpose, and only that purpose, it would make sense to normalise. Luckily for us, it doesn’t, which is why the unnormalised data is available for identification, e.g. when matching legacy records without a unified identifier. Of course this is only true if you look at the MARC Bib record alone, which is only one-fifth of the complete format; much of the normalisation takes place in the other parts.
As MARC relates to Z39.50, so MARCXML (or MODS, or DCMI) relates to SRU/SRW. To compare MARC to SRU borders on the absurd.
> MARC is NOT a data storage format. In my opinion MARC is not even an exchange format, but merely a presentation format.
MARC is anything but a presentation format; it is foremost about storage of structured bibliographic data, and a means of exchange. How much of the presentation layer (ISBD) is visible depends on the software you use.
> But I think MARC was invented by old school cataloguers who did not have a clue about data normalisation at all.
It was invented by Henriette Avram, who knew nothing about cataloguing and whose assignment it was to create a low-storage solution for the tons of data of the Library of Congress.
> Maybe the projected new standard for resource description and access RDA will be the solution, but that may take a while yet.
RDA is a set of description rules, exactly like the AACR you criticised earlier. The difference is that RDA gives the cataloguer context-sensitive advice in an online environment. RDA can be used without ISBD presentation, with MARC, MODS, EAD, VRA, CDWA or DCMI, depending on the specificity the institution requires. In other words, there are a lot of choices, and we make those choices because WDC (We Do Care).
Thank you for mentioning the XC project’s OAI-PMH implementation (OAI Toolkit). It is dear to me, as the creator of the first versions of that tool. The OAI Toolkit converts MARC to MARCXML and stores that format. While creating the tool, we converted and loaded millions of MARC records from different great US university and school libraries for testing purposes. I should say that a percentage of the records (somewhere above 10%) were somehow invalid: they did not fit the MARC standard’s simple formatting rules. The greater part of such errors were detected in the Leader. I don’t know the reason, and even the librarians did not know why those errors occur (like strange characters in the last two positions of the Leader, where the standardized value should be ’00’). My guess is that the ILSes use them in a creative way for their internal (and secret?) purposes.
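(For illustration, a minimal sanity check of the kind the toolkit performed might look like the sketch below. This is a reconstruction for this comment, not the actual XC code.)

```python
# In MARC21 the Leader is 24 characters, and its last two positions
# (22 and 23) must both be '0'; anything else marks the record invalid.
def leader_is_plausible(leader: str) -> bool:
    return len(leader) == 24 and leader[22:24] == "00"

print(leader_is_plausible("00714cam a2200205 a 4500"))  # True
print(leader_is_plausible("00714cam a2200205 a 45xx"))  # False
```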
Anyway: MARC is a good exchange standard, but the critiques regarding normalisation in the post are relevant. You are right that MARC Authority controls authority records and authority names, but inside the bib standard there is no reference to it, and the meanings of the subfields are not defined (e.g. “$t – Title of a work” – in which language? in which normalized form? etc.).
I can offer another critique: the evolution of regional/national MARC standards, which are not 100% compatible. I know HUNMARC, which is the Hungarian MARC, and I have run into compatibility problems with OCLC MARC (Marc4j has problems with the OCLC Leader). I don’t know the real reason, but I guess that MARC was not enough for these communities, and they wanted to extend it with new or modified features.
You can read Roy Tennant’s witty article about the main problems with MARC: “MARC Must Die”
http://roytennant.com/column/?fetch=data/58.xml
Best wishes!
Péter
@ Peter Schouten
A short reaction: I never compared MARC to SRU! It surprises me that you concluded that. I merely mentioned SRU as a useful protocol to PRESENT, for instance, MARC records (MARCXML if you want to be precise) after converting from whatever data storage format is available. Just like Z39.50, of course.
To my eyes MARC is NOT the best storage format there is, as data model experts will confirm. At best it is an exchange format, and that is fine with me, but even as an exchange format it is not perfect, as I have tried to show.
I agree that MARC is structured, only not structured enough for the present needs of linking to other online tools. I do not doubt that Henriette Avram did a very good job at the time she created MARC, and it may have served its purpose well for a number of years.
I did not actually criticise AACR2 at all. I just mentioned it as a set of cataloguing rules.
And I must object to the suggestion that I don’t care! Of course I care, otherwise I would not have written this.
What the post is about, if you read carefully, is that I don’t care in which format the data is stored, as long as it is stored in such a way that we can retrieve it in any format we need. And I do not think MARC is such a way.
@ Péter Király
Thanks for the link to Roy Tennant’s article. Someone else told me about it too. Had not seen that before….
I do agree with Peter Schouten about two things. Marc is not a presentation format. Marc was invented to create a low-storage solution. And it was a low-storage solution, especially in the ISO 2709 packing in which it used to be exchanged on tape. (I spent too much time in the past struggling with it.)
Marc is an exchange format, and not a very good one, as explained in Roy’s article. It can easily be created from a normalised data model. No problem.
(you can do it in many different ways however, which is a bit of a problem :))
The big problem is that a new, better-structured exchange format is hard to define, since many organizations cannot convert to it. They have Marc as their native storage format. It is easy to convert to Marc. It is impossible to convert from Marc to something better.
Now we have MARCXML, and anybody who knows a little bit about XML is Laughing Out Loud when looking at it. If you don’t know about the history of Marc, you wonder how anyone could even think of considering a schema like this.
Marc21 is widely accepted and that is the only good reason to use it.
The main role of MARC21 today is to allow commercial systems vendors to create a single system that can be used by any library. It has value because it is a standard — and to replace it, we will need another standard. I think that the new standard should focus on data elements, not record format. If we standardize the data elements, then any record format can be used. MARC standardizes both (and not necessarily very well by today’s technology) in a way that they cannot be easily separated.
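(As a sketch of what standardising the data elements rather than the record format would buy us: once the element names are agreed upon, any record format becomes just a rendering. The element names below are illustrative only.)

```python
import json
from xml.etree.ElementTree import Element, SubElement, tostring

# Hypothetical standardised data elements; the names are made up here.
elements = {"creator_surname": "Adams", "creator_first": "Henry",
            "title": "Democracy", "date": "1880"}

# One possible record format: JSON.
print(json.dumps(elements))

# Another: a flat XML rendering of the very same elements.
record = Element("record")
for name, value in elements.items():
    SubElement(record, name).text = value
print(tostring(record, encoding="unicode"))
```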
As a cataloger who deals with MARC on a daily basis (sometimes I even think in MARC code), I completely agree that it is outdated and needs to be replaced. But the problem lies in the fact that there is not yet any agreement about a replacement. Produce a replacement that is flexible, easy to use, fits the cataloging standards that we already have (so that legacy records are not lost), AND IS WIDELY ACCEPTED, and I’m sure you would find many takers.
Until that comes along, MARC is still the best we’ve got.
I understand the difference between MARC and RDA. As a cataloger, though, what I see in RDA is such minimal information as to be totally worthless for distinguishing authors with the same names or for distinguishing variant editions. To me RDA is a lot like the World Wide Web: it will offer much more information that will require a lot more time to get to what you want to see.
As someone who has been working with MARC data since 1974 (I started my career as an OCLC trainer for SOLINET in the days of one format, Books, and a LOT fewer tags) and who knew or met many of the key players in the development of MARC, I just have to say that it is very easy to judge MARC from today’s perspective. However, it isn’t fair to do that. If you weren’t working during that era, you might find it hard to imagine developing in an environment in which computing power was so low and storage costs were so high that we actually looked for ways to reclaim critical storage by deleting periods at the end of fields. 😉
MARC was succinct. Tags were short and fixed length. Fixed fields and subfields were encoded for maximum meaning in minimal space.
MARC is a communications format designed to support the communication of cataloging information between what were, at the time, a very, very small number of entities – mostly LC and other national libraries, a few universities that were doing local development, and a nascent bibliographic utility industry — all of which was geared to supporting printing catalog cards, since online systems were still decades away. MARC had to support the cataloging rules (pre-AACR2) and the realities of printed cards (thus names in inverted form and data to comply with LC filing rules). The uptake of MARC for internal storage by ILS vendors was largely driven by the customer libraries’ RFPs and by the reality that computing power to do data conversion wasn’t always available. It also vastly pre-dated any form of keyword indexing, thus not allowing for normalized data.
It was also driven by the realities that catalogers found that MARC-speak facilitated communication among themselves; communication that might not have been possible if all the local systems had stored bibliographic data in WDC. Everyone was learning this new stuff all at once and trying to learn from the originators and each other. If LC’s 100 had been my “Author” data element, your “Main entry-personal name” and OCLC’s “WDC-1”, we would have had a much harder time sharing what we were learning and helping each other create all this standardized data. Referencing the 100 field short-cut a lot of “what are you talking about?” in communications. And remember, most of this communication was in person and in print format that took a lot longer to reach an audience, since it pre-dated email, WWW, blogs, and Twitter by 30 years!
I fully agree that MARC has been very badly managed regarding the 773 field. I’ve felt this since the early 90s, when we started loading journal article citation data and tried to create a workable TOC index to all our data — parsing that field is just ridiculous and would have been so much easier with some standard encoding much earlier.
Having said all this, I’d fully support a new communications format and discussion of how we get there. I just wish we could do it without making MARC the enemy – it has been a reliable workhorse that has been absolutely key to our getting where we are in both systems and magnitude of bibliographic data. I do regret that the good features of other MARCs worldwide were not incorporated. And not just worldwide. The WLN system (long since absorbed by OCLC) had a 3rd indicator – which was wonderful for handling non-filing indicators for the subfield t of 7XX fields. A loss we constantly decry when trying to integrate 245 fields with titles in author added entries.
When I decided, triggered by a conversation with some colleagues, to write a blog post about something that has been bothering me off and on since I first started working with library systems in 2003, I did not expect it to be picked up so widely. It has been cited and linked to in many different places. But the most surprising part to me is that MARC generates so many diverse, even emotional reactions.
It looks like a classic case of “If you’re not with us, you’re against us”. But I would like to try to reconcile the opposing parties anyway.
I have received a couple of other reactions outside of this blog as well, both online (via Twitter among others) and offline, and I have had some conversations about the MARC and PICA+ formats with colleagues. All this has caused me to refine my opinion slightly.
First, a couple of experienced cataloger colleagues have convinced me that PICA+ is not always better than MARC. As always, both have their pros and cons, as will undoubtedly apply to the MARC-MAB relationship as well.
Second, I must thank my valued IGeLU colleague Michele Newberry for her very clear description of the historical circumstances of the birth of MARC and her experience of many years.
Michele says that it is not fair to judge MARC from today’s perspective only. And I have to admit that she is right. In the early days of MARC it was a very useful and efficient tool in the circumstances of those days.
She also emphasises that MARC was intended as a communications format, which was subsequently used as storage format by ILS vendors.
Michele concludes by saying that she would “fully support a new communications format and discussion of how we get there”. “I just wish we could do it without making MARC the enemy“.
I hereby apologise to everybody who I may have offended by my attack on MARC. I have no intention of discrediting its usefulness in the past and present.
However, I stick to my opinion that in a digital web environment MARC is no longer an efficient storage and exchange format for bibliographic metadata. We should aim at bringing about a new, general, efficient and flexible bibliographic format suitable for future developments. As far as I’m concerned this could be MARC22.
MARC can be adapted; this has been done before. As an example: in 2003 a proposal was made to replace the 773 subfield $g by either a new subfield 773 $q, or a new tag 363, with subfields for all the levels contained in the 773 $g free-text string. Both options are available now. The 363 tag appears to have been accepted as “Normalized Date and Sequential Designation (R)”. But as long as this new field is not used, it has no value. I expect that AACR2 still does not require using 363, but I am not an expert. OCLC recently stated that implementation of 363 is “under consideration”.
Someone pointed out to me recently that all the problems associated with a massive migration by libraries around the world from MARC to something else, whatever it may be, will be avoided if we all migrate to one of the new SaaS models of cataloguing (Ex Libris URM, OCLC WorldCat Web, etc.). We will see…
Sometimes we need to make bold steps to move ahead, such as Google is trying anno 2009 by moving the SMTP (e-mail) protocol forward with the introduction of Google Wave.
Since my “MARC Must Die” Library Journal column was mentioned here I want to point you to a much longer follow-up piece I wrote for Library Hi Tech. My author’s copy is available at http://roytennant.com/metadata.pdf and it describes the world I would really like to see, which is what Lukas talks about here — “We Don’t Care”.
Reading a 2009 blog (@lukask), and finding lots of gold, especially in the commentary #marcmustdie http://tinyurl.com/ofhbjq