Posted on September 2nd, 2011 10 comments
Shifting focus from information carriers back to information
Library catalogues have traditionally been used to describe and register books and journals and other physical objects that together constitute the holdings of a library. In an integrated library system (ILS), the public catalogue is combined with acquisition and circulation modules to administer the purchases of book copies and journal subscriptions on one side, and the loans to customers on the other side. The “I” for “Integrated” in ILS stands for an internal integration of traditional library workflows. Integration from a back end view, not from a customer perspective.
Because of the very nature of such a catalogue, namely the description of physical objects and the administration of processing them, there are no explicit relations between the different editions and translations of the same book, nor are there descriptions of individual journal articles. If you do a search on a specific person’s name, you may end up with a large number of result records, written by that person or someone with a similar name, or about that person, even with identical titles, without knowing if there is a relationship between them, and what that relationship might be. What’s certain is that you will not find journal articles written by or about that person. The same applies to a search on title. There is no way of telling if there is any relation between identical titles. A library catalogue user would have to look at specific metadata in the records (like MARC 76X-78X – Linking Entries, 534 – Original Version Note or 580 – Linking Entry Complexity Note), if available, to reach their own conclusions.
Most libraries nowadays also purchase electronic versions of books and journals (ebooks and ejournals) and have free or paid subscriptions to online databases. Sometimes these digital items (ebooks, ejournals and databases) are also entered into the traditional library catalogues, but they are sometimes also made available through other library systems, like federated search tools, integrated discovery tools, A-Z lists, etc. All kinds of combinations occur.
In traditional library catalogues digital items are treated exactly the same as their physical counterparts. They are all isolated individual items without relations. As Karen Coyle put it in November 2010 at the SWIB10 conference: “The main goal of cataloguing today is to keep things apart”.
Basically, integrated library systems and traditional catalogues are nothing more than inventory and logistics systems for physical objects, mainly focused on internal workflows. Unfortunately in newer end user interfaces like federated search and integrated discovery tools the user experience in this respect has in general been similar to that of traditional public catalogues.
At some point in time during the rise of electronic online catalogues, apparently the lack of relations between different versions of the same original work became a problem. I’m not sure if it was library customers or librarians who started feeling the need to see these implicit connections made explicit. The fact is that IFLA (International Federation of Library Associations and Institutions) published its FRBR final report in 1998.
FRBR (Functional Requirements for Bibliographic Records) is an attempt to provide a model for describing the relations between physical publications, editions, copies and their common denominator, the Work.
FRBR Group 1 describes publications in terms of the entities Work, Expression, Manifestation and Item (WEMI).
FRAD (Functional Requirements for Authority Data – ‘authors’) and FRSAD (Functional Requirements for Subject Authority Data – ‘subjects’) were developed later as alternatives for the FRBR Group 2 and 3 entities.
As an example let’s have a look at The Diary of Anne Frank. The original handwritten diary may be regarded as the Work. There are numerous adaptations and translations (Expressions) of the original unfinished and unedited Work. Each of these Expressions can be published in the form of one or more prints, editions, etc. These are the Manifestations, especially if they have different ISBNs. Finally a library can have one or more physical copies of a Manifestation, the Items.
Some might even say the actual physical diary is the only existing Item embodying one specific (the first) Expression of the Work (Anne’s thoughts) and/or the only Manifestation of that Expression.
Of course, this model, if implemented, would be an enormous improvement to the old public catalogue situation. It makes it possible for library customers to have an automatic overview of all editions, translations and adaptations of one specific original work through the mechanism of Expressions and Manifestations. RDA (Resource Description and Access) is designed to do exactly this.
However, there are some significant drawbacks, because the FRBR model is an old model, based on the traditional way of library cataloguing of physical items (books, journals, CDs, DVDs, etc.), as Karen Coyle noted at SWIB10.
- In the first place the FRBR model only shows the Works and related Manifestations and Expressions of physical copies (Items) that the library in question owns. Editions not in the possession of the library are ignored. This would be a bit different in a union catalogue of course, but then the model still only describes the holdings of the participating libraries.
- Secondly, the focus on physical copies is also the reason that the original FRBR model does not have a place for journal titles as such, only for journal issues. So there will be as many entries for one journal as the library has issues of it.
- Thirdly, it’s a hierarchical model, which incorporates only relations from the Work top down. There is no room for relations like: ‘similar works’, ‘other material on the same subject’, ‘influenced by’, etc.
- In the fourth place, FRBR still does not look at content. It is document centric, instead of information centric. It does however have the option for describing parts of a Work, if they are considered separate entities/works, like journal articles or volumes of a trilogy.
- Finally, the FRBR Item entity is only interesting in a storage and logistics environment for physical copies, such as the Circulation function in libraries, or the Sales function in bookstores. It has no relation to content whatsoever.
FRBR definitely is a positive and necessary development, but it is just not good enough. Basically it still focuses on information carriers instead of information (it’s a set of rules for managing Bibliographic Records, not for describing Information). It is an introverted view of the world. This was OK as long as it was dictated by the prevailing technological, economic and social conditions.
In a new networked digital information world libraries should shift their focus back to their original objective: being gateways to information as such. This entails replacing an introverted hierarchical model with an extroverted networked one, and moving away from describing static information aggregates in favour of units of content as primary objects.
The linked data concept provides the framework of such a networked model. In this model anything can be related to anything, with explicit declarations of the nature of the relationship. In the example of the Diary of Anne Frank one could identify relations with movies and theater plays that are based on the diary, with people connected to the diary or with the background of World War 2, antisemitism, Amsterdam, etc.
In traditional library catalogues defining relations with movies or theater plays is not possible from the description of the book. They could however be entered as a textual reference in the description of a movie, if for instance a DVD of that movie is catalogued. Relations to people, World War 2, antisemitism and Amsterdam would be described as textual or coded references to a short concept description, which in turn could provide lists of other catalogue items indexed with these subjects.
In a networked linked data model these links could connect to information entities in their own right outside the local catalogue, containing descriptions and other material about the subject, and providing links to other related information entities.
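To make the contrast a bit more concrete, here is a minimal sketch of the "anything can be related to anything" idea, with each relation stored as an explicit (subject, predicate, object) statement. All identifiers and predicate names are made up for illustration; a real implementation would use URIs from published vocabularies.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) triple.
# Every identifier and predicate name below is hypothetical.
TRIPLES = [
    ("work:DiaryOfAnneFrank", "writtenBy", "person:AnneFrank"),
    ("movie:DiaryFilm", "basedOn", "work:DiaryOfAnneFrank"),
    ("play:DiaryPlay", "basedOn", "work:DiaryOfAnneFrank"),
    ("work:DiaryOfAnneFrank", "hasSubject", "concept:WorldWar2"),
    ("work:DiaryOfAnneFrank", "hasSubject", "concept:Antisemitism"),
    ("work:DiaryOfAnneFrank", "hasSetting", "place:Amsterdam"),
]


def related(node, triples):
    """Group everything linked to `node` by the nature of the relation.

    Statements in which `node` is the object are reported under
    an "inverse:" key, so links can be followed in both directions.
    """
    links = defaultdict(list)
    for subject, predicate, obj in triples:
        if subject == node:
            links[predicate].append(obj)
        elif obj == node:
            links["inverse:" + predicate].append(subject)
    return dict(links)


diary_links = related("work:DiaryOfAnneFrank", TRIPLES)
```

Looking up `diary_links["inverse:basedOn"]` gives the movie and the play, which is exactly the kind of relation that cannot be expressed from the description of the book in a traditional catalogue.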
FRBR would still be a valuable part of such a universal networked model, as a subset for a specific purpose. In the context of physical information carriers it is a useful model, although with some missing features, as described above. It could be used in isolation, as originally designed, but if it’s an open model, it would also provide the missing links and options to describe and find related information.
Also, the FRBR model is essential as a minimal condition for enabling links from library catalogue items to other entity types through the Work common denominator.
In a completely digital information environment, the model could be simplified by getting rid of the Item entity. Nobody needs to keep track of available copies of online digital information, unless publishers want to enforce the old business models they have been using in order to keep making a profit. Ebooks for instance are essentially Expressions or Manifestations, depending on their nature, as I stated in my post ‘Is an e-book a book?’.
The FRBR model can be used and is used also in other subject areas, like music, theater performances, etc. The Work – Expression – Manifestation – Item hierarchy is applicable to a number of creative professions.
The networked model provides the option of describing all traditional library objects, but also other and new ones and even objects that currently don’t exist, because it is an open and adaptable model.
In the traditional library models it is for instance impossible, or at least very hard, to describe a story that continues through all volumes of a trilogy as a central thread, apart from and related to the descriptions of the three separate physical books and their own stories. In the Millennium trilogy by Stieg Larsson, Lisbeth Salander’s life story is the central thread, but it can’t be described as a separate “Work” in MARC/FRBR/RDA because it is not the main subject of one physical content carrier (unless we are dealing with an edition in one physical multi-part volume). The three volumes will be described with the subjects ‘Missing girl mystery’, ‘Sex trafficking’ and ‘Illegal secret service unit’ respectively.
In an open networked information model on the contrary it would be entirely possible to describe such a ‘roaming story’.
New forms of information objects could appear in the form of new types of aggregates, other than books or journal articles, for instance consisting of text, images, statistics and video, optionally of a flexible nature (dynamic instead of static information objects).
Existing library systems (ILS’s and Integrated Discovery tools alike), using bibliographic metadata formats and frameworks like MARC, FRBR and RDA, can’t easily deal with new developments without some sort of workaround. Obviously this means that if libraries want to continue playing a role in the information gateway world, they need completely different systems and technology. Library system vendors should take note of this.
Finally, instead of only describing information objects, libraries could take up a new role in creating new objects, in the form of subject based virtual information aggregates, like the Anne Frank Timeline, or Qwiki. This would put libraries back in the center of the information access business.
Posted on March 28th, 2011 1 comment
The challenges of generating linked data from legacy databases
Some time ago I wrote a blog post about the linked data proof of concept project I am involved in, connecting bibliographic metadata from the OPAC of the Library of the University of Amsterdam with the theatre performances database maintained by the Theatre Institute of The Netherlands.
I ended that post with a list of next steps to take:
- select/adapt/create a vocabulary for the Production/Performance subject area
- select/adapt/create vocabularies for Persons (FOAF?) and Subjects (SKOS?)
- add internal relationships with the other entities (Play, Production, etc.) in the JSON structure (implement RDF in JSON)
- add RDF/XML as output option, besides JSON
- add external relationships (to other linked data sources like DBpedia, etc.)
- extend the number of possible URI formats (for Play, Production, etc.)
- add content negotiation to serve both human and machine readable redirects
- extend the options on the OPAC side
- publish UBA bibliographic data as linked open data (probably an entirely new project)
So, what have we achieved so far? I can be brief about all the ‘real’ linked data stuff (RDF, vocabularies, external links, content negotiation): we are not there yet. This will be dealt with in the next phase.
Instead, we have focused on getting the simple JSON implementation right, both on the data publishing side and on the data using side. We have added more URIs and internal relationships, and we are using these in the OPAC user interface.
But we have also encountered a number of crucial problems that are in my view inherent to the type of legacy data models used in libraries and cultural heritage institutions.
First let me describe the improvements we have added so far.
The URI for ‘person’ <baseurl>/person/<personname> now also returns a link to all the ‘titles’ that person is connected to (not only with the ‘author’ role, but for all roles, like director, performer, etc.): <baseurl>/gettitles/<personname>. This link will return a set of URIs of the form <baseurl>/title/<personname>/<title>. The /<personname>/<title> bit is at the moment the only way that a more or less unique identifier can be constructed from the OPAC metadata for the ‘play’ in the TIN database. There are a number of really important problems related to this that I will discuss below.
returns among others:
/title/Beckett, Samuel/Waiting for Godot
/title/Beckett, Samuel/En attendant Godot
The URI for a ‘play’ <baseurl>/title/<personname>/<title> now returns a set of ‘production’ URIs of the form <baseurl>/production/<personname>/<title>/<openingdate>/<idnr>.
The ‘production’ URI returns information about ‘theatre company’, ‘venue’ and all persons connected to that production, including their URIs, and when available also a link to an image of a poster, and a video.
<baseurl>/title/Beckett, Samuel/Waiting for Godot
/production/Beckett, Samuel/Waiting for Godot/1988-07-28/5777
/production/Beckett, Samuel/Waiting for Godot/1988-11-22/6750
/production/Beckett, Samuel/Waiting for Godot/1992-04-16/10728
/production/Beckett, Samuel/Waiting for Godot/1981-02-18/43032
The last ‘production’ URI returns:
“title”:”Waiting For Godot”,
“description”:”Beckett, Samuel (auteur: toneelspel van)”,
“description”:”Hartnett, John (regie)”,
“description”:”Muller, Frans (decor: ontwerp)”,
“description”:”Newell, Kym (licht: ontwerp)”,
“description”:”Zaal, Kees (geluid)”,
“description”:”Tolstoj, Alexander (uitvoerende: Lucky)”,
“description”:”Weeks, David (uitvoerende: Estragon)”,
“description”:”Coburn, Grant (uitvoerende: Vladimir)”,
“description”:”Evans, Rhys (uitvoerende: Pozzo)”,
“description”:”Geiringer, Karl (uitvoerende: A Boy)”,
“description”:”Guidi, Peter (uitvoering muziek)”,
“description”:”Kimmorley, Roxanne (uitvoering muziek)”,
“description”:”Vries, Hessel de (uitvoering muziek)”,
“uri”:”/person/Vries, Hessel de”
“description”:”Phillips, Margot (uitvoering muziek)”,
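Because the only identifier we have is built from strings, a client has to take the URI apart again to get at the person, title, opening date and id number. A minimal sketch of that step, assuming the URI pattern described above; the function and type names are my own for this illustration and are not part of the project code.

```python
from typing import NamedTuple


class ProductionRef(NamedTuple):
    person: str
    title: str
    opening_date: str
    idnr: str


def parse_production_uri(path: str) -> ProductionRef:
    """Split /production/<personname>/<title>/<openingdate>/<idnr>."""
    prefix = "/production/"
    if not path.startswith(prefix):
        raise ValueError(f"not a production URI: {path!r}")
    # Date and id number are always the last two segments, so take them
    # from the right; whatever is left holds the person name and the title.
    rest, opening_date, idnr = path[len(prefix):].rsplit("/", 2)
    person, title = rest.split("/", 1)
    return ProductionRef(person, title, opening_date, idnr)


def title_uri(ref: ProductionRef) -> str:
    """Rebuild the 'play' URI that this production belongs to."""
    return f"/title/{ref.person}/{ref.title}"


ref = parse_production_uri(
    "/production/Beckett, Samuel/Waiting for Godot/1981-02-18/43032"
)
```

The sketch leaves the spaces and commas unescaped, as in the examples above; in real HTTP requests they would have to be percent-encoded. It also shows how fragile the scheme is: the "identifier" is only as stable as the spelling of the name and the title.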
Now, the problems (or challenges) that we are facing here are essential to the core concept of linked data:
- we don’t have actual matching unique identifiers (URIs)
- we don’t have explicit internal relations with a common entity in both sources
- part of the data consists of literal strings in a specific language
These three problems are interrelated; they are linked problems, so to speak.
To start with the identifiers. Of course we have internal system identifiers in our local Aleph catalogue database. Because we contribute to the Dutch Union Catalogue (originally a PICA system, now OCLC), our bibliographic records also have national Dutch PICA identifiers. And because the Dutch Union Catalogue records are copied to WorldCat, these records in WorldCat also have OCLC numbers.
The Theatre Institute also has internal system identifiers in their Adlib database. But at the moment we do not have a match between these separate internal identifier schemes. The Theatre Production database records are not in WorldCat because they’re not bibliographic records.
We are more or less forced to use the string values of the title and author fields to construct a usable URI, on both sides. Clearly this is the basis of lots of errors, because of the great number of possible variations in author and title descriptions.
But even if the Theatre Institute’s records were in the Union Catalogue or WorldCat as well, then we still would not have an automatic match without some kind of broker mechanism ascertaining that the library catalogue record describes the same thing as the theatre production database record. The same applies to the author, which of course should be a relation of the type “written by” between the play and a person record instead of string values. Both systems do have internal author or person authority files, but there is no direct matching. For authors this could theoretically be achieved by linking to an online person authority file like VIAF. But in the current situation this is not available.
This brings me to the second problem. The fact that we are using the string values of the title instead of unique identifiers means that we connect plays and productions with a specific title variant or language. In our current implementation this means that we are not able to link to all versions of one specific play.
For instance, from our OPAC the following URIs are constructed (two in English, one in French, one in Dutch):
/title/Beckett, Samuel/Waiting for Godot
/title/Beckett, Samuel/Waiting for Godot : a tragicomedy in two acts
/title/Beckett, Samuel/En attendant Godot : pièce en deux actes
/title/Beckett, Samuel/Wachten op Godot
In the Theatre Production database (two in English, four in Dutch, one in French, one in German):
/title/Beckett, Samuel/Waiting for Godot
/title/Beckett, Samuel/Waiting For Godot
/title/Beckett, Samuel/Wachten op Godot
/title/Beckett, Samuel/Wachtend op Godot
/title/Beckett, Samuel/Wachten op Godot (De favorieten)
/title/Beckett, Samuel/Wachten op Godot (eerste bedrijf)
/title/Beckett, Samuel/En attendant Godot
/title/Beckett, Samuel/Warten auf Godot
Only the first and fourth URIs from the OPAC will find corresponding titles in the Theatre Production database. The second and third ones, which include a subtitle within the main title, have no exact equivalents. And only two of the eight entries from the Theatre Production database have a match in the catalogue.
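The matching behaviour can be reproduced with the title strings from the two lists above. This is only a sketch: `exact_matches` mirrors what string-based URIs do, while `normalise` is a hypothetical clean-up rule of my own (drop a " : subtitle", drop a trailing qualifier in parentheses, ignore case), not something our implementation does.

```python
import re

OPAC_TITLES = [
    "Waiting for Godot",
    "Waiting for Godot : a tragicomedy in two acts",
    "En attendant Godot : pièce en deux actes",
    "Wachten op Godot",
]

THEATRE_TITLES = [
    "Waiting for Godot",
    "Waiting For Godot",
    "Wachten op Godot",
    "Wachtend op Godot",
    "Wachten op Godot (De favorieten)",
    "Wachten op Godot (eerste bedrijf)",
    "En attendant Godot",
    "Warten auf Godot",
]


def exact_matches(left, right):
    """Titles from `left` that occur literally in `right`."""
    right_set = set(right)
    return [title for title in left if title in right_set]


def normalise(title):
    """Hypothetical clean-up: main title only, no qualifier, case-insensitive."""
    main = title.split(" : ", 1)[0]
    main = re.sub(r"\s*\([^)]*\)\s*$", "", main)
    return main.casefold().strip()


def normalised_matches(left, right):
    """Titles from `left` whose normalised form occurs in `right`."""
    keys = {normalise(title) for title in right}
    return [title for title in left if normalise(title) in keys]


opac_exact = exact_matches(OPAC_TITLES, THEATRE_TITLES)
theatre_exact = exact_matches(THEATRE_TITLES, OPAC_TITLES)
opac_loose = normalised_matches(OPAC_TITLES, THEATRE_TITLES)
theatre_loose = normalised_matches(THEATRE_TITLES, OPAC_TITLES)
```

With exact matching, two OPAC titles and two Theatre Production titles find a partner. The clean-up rule raises that to four of four and six of eight, but it throws away information (such as "eerste bedrijf", first act) and it still treats the English, Dutch, French and German titles as unrelated. String manipulation cannot replace a link on the level of the play itself.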
In a library catalogue environment we are used to this problem, because catalogues are used for describing physical objects in the form of editions and copies. Unfortunately, the Theatre Production database also contains only records describing productions of a specific ‘edition’ or translation of a play, with only the opening performance information attached.
This is where I need to talk about FRBR. Basically in a library catalogue environment this means that we should describe the relations between the ‘work’ (original text), the ‘expression’ (the version or translation), the ‘manifestation’ (edition, format, etc.) and the ‘items’ (the physical copies). Via the relations with higher level expression and work, the physical copy could be linked to the unifying work level, and then ideally through some universally valid unique identifier to, in our case, the theatre plays.
Although FRBR is a publication centered schema used only in libraries, the same concepts can be applied to theatre performances: the original work (which is the same as the work in a bibliographical sense) has expressions (adaptations, translations, etc.), which have manifestations (productions), and in the end the individual items (actual performances on a specific date, time and location).
If both the library catalogue and the theatre production database were FRBRised, we could in theory link on the Work level and cover all individual versions. But we would still need a matching mechanism on that Work level of course.
In reality however we can only try to link on the Manifestation level in an imperfect way.
At the moment, in our project, on the catalogue side we extract the title and author from the generated OPAC HTML. It could be an option to get available linking information from the internal MARC records (like the 240, 246, 765, 767, 775 tags), but that is not easy for a number of reasons. Something similar could be done in the theatre production database, making implicit links explicit. But all this makes the effort to get something sensible out there much bigger.
The third problem, the literal strings in Dutch both in the library catalogue and in the theatre production database, prevents the effective use of the data in multilingual environments, equally in the traditional native interfaces and as linked data. Obviously for English speaking end users the Dutch terms mean nothing. And in a linked data environment the Dutch strings can’t easily be linked to other data, in Dutch, English, or any language, without unique identifiers.
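As an illustration of what it would take to move from literal strings to identifiers, here is a minimal sketch that splits the ‘description’ strings from the production example into a name, a role and an optional detail, and maps the Dutch role onto a language-neutral identifier. The regular expression is derived only from the handful of examples shown above, and the `role:` identifiers are invented for this sketch; a real solution would point to URIs in a shared vocabulary.

```python
import re

# Pattern "<name> (<role>)" or "<name> (<role>: <detail>)",
# inferred from the example strings only.
DESCRIPTION = re.compile(
    r"^(?P<name>.+?) \((?P<role>[^:()]+)(?:: (?P<detail>[^()]+))?\)$"
)

# Hypothetical identifiers for the Dutch role strings.
ROLE_IDS = {
    "auteur": "role:author",
    "regie": "role:director",
    "decor": "role:set",
    "licht": "role:lighting",
    "geluid": "role:sound",
    "uitvoerende": "role:performer",
    "uitvoering muziek": "role:music-performer",
}


def parse_description(text):
    """Turn a literal description string into separate, linkable parts."""
    match = DESCRIPTION.match(text)
    if match is None:
        return None  # string does not follow the assumed pattern
    role_literal = match.group("role")
    return {
        "name": match.group("name"),
        "role_literal": role_literal,
        "role_id": ROLE_IDS.get(role_literal),  # None if the role is unknown
        "detail": match.group("detail"),
    }


lucky = parse_description("Tolstoj, Alexander (uitvoerende: Lucky)")
director = parse_description("Hartnett, John (regie)")
```

Even then the person name is still a string. Only a link to a person authority file would turn "Tolstoj, Alexander" into something other data can connect to.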
Implicit to explicit
People calling on institutions to publish their data as linked open data tend to say it’s easy once you know how to do it. And of course it must be done. But if the published datasets have a flat internal structure designed to fulfill a specific local business objective, then they just don’t provide sufficient added value for third party use. In order to make your published open data useful for others, you have to make implicit relations explicit. And this requires more than just making the data available in RDF ‘as is’: it requires a lot of processing.
Posted on January 22nd, 2010 6 comments
New models, new formats
Recently I have been experimenting a bit with reading newspapers on my mobile phone (a G1 android device), or maybe I should say “reading news on my mobile”. I looked at two Dutch newspapers that adopt two completely different approaches.
“NRC Handelsblad” publishes its daily print newspaper as a daily “e-paper” in PDF, Mobi and ePub format, to be downloaded every day to the platform of your choice. In order to read the e-paper you need a physical device plus software (mobile phone, PC, e-reader, etc.) that can handle one of the available formats. On my G1 I use the Aldiko e-reader app for Android with the ePub format. The e-paper is treated as an e-book file, with touch screen operation for browsing tables of content, paging through chapters or articles, zooming, etc. Access to the e-paper files is on a subscription basis.
“Het Parool” on the other hand offers a free app to be downloaded from the Android Market that serves as a front end to all recent articles available from their news server on the web. There is no need for a daily download of a file in a specific format that has to be supported by the physical platform of your choice. There is also an iPhone app. The app and access to the news articles are free of charge.
Besides the difference in access (free vs paid), the most important contrast between these two mobile newspapers is the form in which the printed news is transformed to the digital and mobile environment. “NRC Handelsblad” takes the physical form the newspaper has had since its origin in the 17th century, dictated by physical, logistical and economic conditions, and transforms this 1 to 1 to the digital world: the e-paper still is one big monolithic bundle of articles that can’t be retrieved individually, completely ignoring the fact that the centuries old limitations don’t apply anymore. It is basically exactly the same as most manifestations of e-books.
“Het Parool” does completely the opposite. It treats individual news articles as units of content in their own right, “stories” as I call them in my post “Is an e-book a book?“. And this is how it should be in the digital mobile world. This is similar to the way that e-journals offer direct access to individual articles already.
Readers should be able to apply their own selection of “stories” to read in a specific, virtual, on the fly bundle, using the front end of their choice.
However, the “Parool” app functions as a predefined filter: it presents the reader with the most recent (24 hour max) articles from its own source of news. Of course this is fine as long as the readers choose to use the “Parool” app, but they may also choose to read news stories from different sources. This could be achieved with a different mobile, PC or web application that gathers content from a variety of sources.
Another drawback of the “Parool” implementation is that it does not offer a “save” option. There is no way to read old articles, other than to go to the official newspaper website, either through mobile browsing or by using a PC web browser. The “NRC Handelsblad” implementation on the other hand does offer this option, because it is based on a download model to begin with.
This brings me to the matter of mobile web browsing. Reading and navigating a web page designed for the PC screen on a mobile device is annoying, to say the least, not to mention the time it takes to load complete web pages into the mobile browser. Common practice is to create a simplified version of full fledged web pages for mobile use only. Of course this means doubling the website maintenance effort.
An alternative could be the adoption of HTML 5 and CSS 3, as was stated at a Top Tech Trends Panel session at ALA Midwinter 2010, where a university library official said: “2010 is the year that the app dies”, because “developers can leverage a single well-designed service to serve both browser-based and mobile users”. But this view completely misses the point: “Apps are not about technology, they are about a business model”, as Owen Stephens pointed out. This business model implies the separation of content and presentation in a much broader sense than that of database back end – website front end only. This was an innovative concept until a couple of years ago.
As I briefly described above, we need units of content being accessible by all kinds of platforms and applications through universal APIs. This model not only applies to reading texts, but also to finding these texts. Especially libraries should be aware of that.
Although the ALA Top Trends Panel stated that libraries’ focus should be on content rather than hardware, they did not touch upon the changing concept of what books are in the e-book era, as again Owen Stephens pointed out. New models and formats will have all kinds of consequences for the way we handle information. For instance: pages. A PDF file, which is a 1 to 1 translation of the print unit to a digital unit, as I explained, still has fixed pages and page numbers. An ePub file however has a flexible format that allows “pages” to be automatically adapted to the size of the device’s screen (thanks to @rsnijders and @Wowter for discussing this). There are no fixed pages or page numbers anymore. HTML pages containing full articles don’t have page numbers either, by the way. This will change the way we refer to texts online, without page numbers, which is one of the subjects of the Telstar project, again with Owen Stephens involved (watch that guy).
The flexible page is another reason to have a critical look at MARC. There is no use anymore for subfields like 300 $a “Extent” (number of physical pages, etc.) or 773 $g (“Vol. 2, no. 2 (Feb. 1976), p. 195-230”).
The inevitable conclusion of all this is that all innovative developments on the end user interface presentation front end need to be supported by corresponding developments on the content back end, and vice versa.
Posted on November 19th, 2009 9 comments
About cataloging physical items or units of content
2009 is the year of the e-book, or perhaps better: of the e-book reader. This is an important distinction that I will explain below. E-books are becoming more popular because of the increasing availability of various cheap e-book readers.
But what is an e-book? Is it the same as a book? Some people say yes, some people say no. This question shouldn’t be so hard to answer, should it? We just have to define what a book is first. So, what is a book?
When people think of a book, they picture something like the archetypal book: printed, medium sized, hardcover, no illustrations on the front. The thing that you can actually hold in your hands and read.
But if they say: “This book was written by that author”, they don’t think that the author actually wrote that particular item they are holding in their hands. Now we already have two different meanings of the concept “book”: one is a tangible object, the other is the content that is made available in this tangible object by means of printed text.
Besides these conceptual levels, there are more ways by which books can be described, as shown by this incomplete list of examples:
Physical form: Historically there have been clay tablets, inscribed stones, handwritten scrolls, handwritten bound pages, printed pages. We also know different formats targeted at specific uses or audiences: audio books, braille books, pop up books.
Content: A book can contain text only, or images only (for instance a children’s picture book, or a book of photographs), or a combination of both.
Units: A book can consist of one “story” (for instance a novel), optionally subdivided into chapters, or be made up of several stories, or articles (like a text book about a certain subject). Chapters and stories can be written by the same or by several authors. A book can also contain two or more other books by the same author (“collected works”), etc.
Content type: A book can contain fiction, aimed at entertaining readers. Books can be purely administrative, like accounting books. There are religious books to be used in religious ceremonies (sometimes these are referred to as “THE book“). Some books are for studying and learning (“text books”, which may also contain images by the way). There are scientific books and instructional books (travel guides, cook books, manuals).
First, we need to see how all this fits together before we can answer the question “Is an e-book a book?” or more precisely: “In which sense is an e-book a book?”. Fortunately there is already a conceptual model for bibliographic entities and the relationships between them that describes this: FRBR (Functional Requirements for Bibliographic Records), published by IFLA. The IFLA Final Report (2009 version) says it all, but there are also a couple of short summaries: Barbara Tillett’s (LoC) “What is FRBR?”, Jenn Riley’s “FRBR” blog post, and there is William Denton’s FRBR Blog for more information.
The FRBR model is targeted at libraries, maybe even at publishers and booksellers too.
I will not go into the FRBR “Group 2” (persons and corporate bodies) and “Group 3” (subjects) entities here, but focus on the “Group 1” entities.
The FRBR “Group 1 entities” consist of Work, Expression, Manifestation and Item (also referred to as WEMI). FRBR entities not only apply to books or textual works, but also to movies, theater plays, music, etc.
There are hierarchical relationships between the entities:
- Work – a distinct intellectual or artistic creation
- Expression – the intellectual or artistic realization of a work
- Manifestation – the physical embodiment of an expression of a work
- Item – a single exemplar (or copy) of a manifestation
- A work (for instance a book) can have (“is realized through“) one or more expressions (for instance the original English text and the Dutch translation).
- Each expression can have (“is embodied in“) one or more manifestations (for instance a specific edition with an ISBN, or one or more works/expressions in a “collected works” edition).
- Each manifestation has (“is exemplified by“) one or more items, the things you can actually hold in your hands.
- A manifestation can also consist of several expressions, as in the “collected works” example.
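The Group 1 hierarchy above can be sketched as a small data structure. This is only an illustration, not an official FRBR schema: the class and field names, the helper `items_of` and the sample editions and copies are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A single exemplar (copy) of a manifestation."""
    label: str


@dataclass
class Manifestation:
    """The physical embodiment of one or more expressions."""
    label: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """The intellectual or artistic realization of a work."""
    label: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """A distinct intellectual or artistic creation."""
    title: str
    expressions: List[Expression] = field(default_factory=list)


def items_of(work: Work) -> List[Item]:
    """Walk Work -> Expression -> Manifestation -> Item and collect the copies."""
    return [
        item
        for expression in work.expressions
        for manifestation in expression.manifestations
        for item in manifestation.items
    ]


# Hypothetical sample data: one work, two expressions, one edition each.
paperback = Manifestation("English paperback edition", [Item("copy 1"), Item("copy 2")])
dutch_edition = Manifestation("Dutch edition", [Item("copy 1")])
work = Work(
    "Cloud Atlas",
    [
        Expression("original English text", [paperback]),
        Expression("Dutch translation", [dutch_edition]),
    ],
)
print(len(items_of(work)))
```

Because a `Manifestation` object can be placed in the list of more than one `Expression`, the “collected works” case can be represented without changing the structure.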
Besides these hierarchical relationships between different entity types there are also recursive relationships between entities of the same type: hierarchical and other. Some examples:
- A work is part of another work (hierarchical), as in a series like Harry Potter
- A work is an adaptation of another work
- An expression is a sequel to another expression
- A manifestation is a facsimile of another manifestation
So far so good. The FRBR conceptual model describes (or aims to describe) real world things and relationships on an abstract level. The model can be implemented in actual systems (both computerised and manual!). In these systems you are free to refer to the conceptual model entities (“work”, “expression”, “manifestation”, “item”) by names that are actually used in daily life. This is what Rob Styles is trying to do when he talks about “stories” and “editions” in his recent blog post “Bringing FRBR Down to Earth…” I think. I will define the “story” concept in a different way below.
Until now, catalogers and library systems have been targeted at describing the thing they have in their hands (or better the items that make up the library’s collection). In FRBR terms this means that catalogs describe manifestations and items, not works and expressions (or implicitly at best). In short, a bottom up approach. This is understandable, because in the past there was nothing else to go by than the explicit manifestation information available on the physical item (author, title, ISBN, edition, publisher, etc.).
Of course, MARC21 provides some options to describe relationships with expressions and works and other manifestations, like the 250 – Edition Statement, the 490 – Series Statement and the 76X-78X – Linking Entries-General Information. But these fields can only be used if the information is known to the cataloger.
Also, in traditional catalogs, works that are distinct expressions in one manifestation (like articles, chapters, stories, poems) are not described separately, for the same reason: you only catalog the item you have before you. In the ideal world, or better in the new digital world, the unit to be cataloged or described should always be the work, which we may call “story”. In other words: we should catalog units of content (“stories”) instead of, or supplementary to, physical items.
Current library practice is that we catalog books and journals in the catalog and offer article descriptions through subscribed article metadata databases separately.
So, back to the e-book. Where does that fit in? An e-book could be considered nothing more than a manifestation and/or an item belonging to a certain work/expression, because an e-book can be everything a printed book is. As such it is equivalent to a braille or audio book. Some libraries treat e-books as something different, as works/expressions as such. They catalog e-books separately, just like all other items/manifestations are treated as separate works. There are even separate e-book overviews.
But there is more to it than that. The big difference with books until now is that an e-book is not inseparably linked to the physical carrier. A printed book can only be read if the reader has a physical copy (a FRBR item) consisting of bound paper pages containing the text printed on them with ink. The same applies to handwritten texts, scrolls, clay tablets, etc.
Even more so, the physical form, together with economic conditions and possibilities for distribution, often determines the actual manifestation of a book and a journal. A book (or volume) can only contain a certain number of pages in order to be manageable. There is also a cost consideration in the size and distribution of the items.
What we call an e-book is actually only a digital, abstract manifestation of a work/expression. In order to be able to read it you have to download it in a specific format (PDF, epub, etc.) onto a physical carrier (USB-stick, computer disk, etc.), and then you need a physical reading device with dedicated software (dedicated e-book readers like Kindle, a computer, a mobile phone, etc.).
Libraries do not have e-books as items, only as manifestations. These e-book manifestations can be available on an online server somewhere in whatever form, and can be made into an item on-the-fly, using a specific format on-the-fly, choosing a physical carrier on-the-fly. What’s more, the content of e-books can also be selected out of several works/expressions on-the-fly, this way creating manifestations or even expressions on demand.
Now, is the FRBR conceptual model suited for describing e-books? If we treat e-books as manifestations without items (like we handle e-journals in our catalogs), how do we proceed? The FRBR Manifestation entity has, among others, these attributes:
- form of carrier
- extent of the carrier
- physical medium
- system requirements (electronic resource)
- file characteristics (electronic resource)
- mode of access (remote access electronic resource)
- access address (remote access electronic resource)
But we have just seen that in the case of e-books these are features of the items generated on-the-fly, which are not known beforehand. Does this mean that we have to describe as manifestations all possible physical forms that one e-book can take? This would also mean that an e-book as such should be described on the level of a FRBR Expression. This may be correct in some cases (the creation of aggregated content on-the-fly), but not in all: where an e-book is similar to manifestations like braille, audio book, etc.
Does FRBR need an extra level? I am not sure. Let’s look briefly at how e-journals are handled. As far as I can see, journal and e-journal issues are described as separate manifestations of journals and e-journals (with a “part-of” relationship to the higher level). These issue manifestations are treated as aggregates that contain articles, which are also described as manifestations with a “part-of” relationship to the issue. In MARC21 this is handled by the 773 Host Item Entry tag.
I am not sure if and how different physical formats (PDF, HTML) for articles in e-journals are handled. The obvious difference with e-books is that the described unit is the article (or “story” as definition of unit of content), which can be downloaded as separate items. The e-journal articles are ideally also identified by unique identifiers (DOI‘s).
What does this mean for e-books? I think we can treat an e-book as either an expression or a manifestation, depending on the nature of the specific e-book in question. For the e-book manifestation we would only need to register the mode of access, access address and manifestation identifier attributes, preferably in the form of a URI.
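A minimal sketch of such a reduced e-book manifestation record might look like this. The class name, the field names, the helper `looks_like_http_uri` and all sample values are assumptions for illustration; FRBR itself does not prescribe any of them.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class EbookManifestation:
    identifier: str      # manifestation identifier, preferably a URI
    mode_of_access: str  # how the content is reached
    access_address: str  # where the content can be requested


def looks_like_http_uri(value: str) -> bool:
    """Rough check that a value is an absolute http(s) URI."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Hypothetical record: the URIs below do not point at a real service.
ebook = EbookManifestation(
    identifier="http://example.org/manifestation/cloud-atlas-ebook",
    mode_of_access="World Wide Web",
    access_address="http://example.org/ebooks/cloud-atlas",
)
print(looks_like_http_uri(ebook.identifier))
```

Carrier-related attributes such as form of carrier or file characteristics are deliberately left out here, since they only become known when an item is generated on-the-fly.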
I also think we should use the possibilities of the FRBR model to start describing, cataloging and identifying the “stories” (chapters, articles, etc.) that make up books and e-books separately, as units of content in their own right. People are interested in the content, the “stories”, not the physical items or artificial digital aggregate units like e-books or e-journals.
In this sense, the “e-journal” is an archaic concept, where the limitations of the physical journal are translated as such to the digital world. There is no real need to bundle articles in electronic form into one electronic issue of an e-journal that is published at regular intervals in time. Electronic articles can be published individually immediately after peer review and approval. Published articles can be aggregated in one or more virtual online serials.
Like ISBN’s and ISSN’s we need an identifier for the units of content other than journal articles. As a matter of fact, there already is one, the DOI:
“A DOI name can be used to identify any resource involved in an intellectual property transaction. Intellectual property includes both physical and digital manifestations, performances and abstract works. An entity can be identified at any arbitrary level of granularity.” (see http://www.doi.org/faq.html#2). Thanks to Owen Stephens for pointing this out to me in a twitter discussion with Inga Overkamp.
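To make the identifier idea concrete, here is a small sketch that turns a DOI name into a resolvable URI. It assumes the public proxy at https://doi.org/ and only performs a very rough syntax check; the function name and the sample DOI are chosen for illustration.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_uri(doi: str) -> str:
    """Build a resolver URI for a DOI name such as '10.1000/182'."""
    doi = doi.strip()
    # A DOI name has a prefix starting with "10." and a suffix after a slash.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI name: {doi!r}")
    # Percent-encode characters that are not safe in a URI path.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_uri("10.1000/182"))
```

The same function works regardless of whether the DOI identifies an article, a chapter or any other unit of content, which is exactly the granularity argument made in the quote above.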
I may be wrong about all this. I am open for comments and suggestions.
Posted on June 19th, 2009
Linked Data and bibliographic metadata models
Some time after I wrote “UMR – Unified Metadata Resources“, I came across Chris Keene’s post “Linked data & RDF : draft notes for comment“, “just a list of links and notes” about Linked Data, RDF and the Semantic Web, put together to start collecting information about “a topic that will greatly impact on the Library / Information management world“.
While reading this post and working my way through the links on that page, I started realising that Linked Data is exactly what I tried to describe as One single web page as the single identifier of every book, author or subject. I did mention Semantic Web, URI’s and RDF, but the term “Linked Data” as a separate protocol had escaped me.
The concept of Linked Data was described by Tim Berners-Lee, the inventor of the World Wide Web. Whereas the World Wide Web links documents (pages, files, images), which are basically resources about things (“Information Resources” in Semantic Web terms), Linked Data (or the Semantic Web) links raw data and real life things (“Non-Information Resources”).
There are several definitions of Linked Data on the web, but here is my attempt to give a simple definition of it (loosely based on the definition in Structured Dynamics’ Linked Data FAQ):
Linked Data is a methodology for providing relationships between things (data, concepts and documents) anywhere on the web, using URI’s for identifying, RDF for describing and HTTP for publishing these things and relationships, in a way that they can be interpreted and used by humans and software.
I will try to illustrate the different aspects using some examples from the library world. The article is rather long because of the nature of the subject, even though the individual sections are fairly short. I do supply a lot of links for further reading.
Data is relationships
The important thing is that “data is relationships“, as Tim Berners-Lee says in his recent presentation for TED.
Before going into relationships between things, I have to point out the important distinction between abstract concepts and real life things, which are “manifestations” of the concepts. In Object modeling these are called “classes” (abstract concepts, types of things) and “objects” (real life things, or “instances” of “classes“).
- the class book can have the instances/objects “Cloud Atlas“, “Moby Dick“, etc.
- the class person can have the instances/objects “David Mitchell“, “Herman Melville“, etc.
In the Semantic Web/RDF model the concept of triples is used to describe a relationship between two things: subject – predicate – object, meaning: a thing has a relation to another thing, in the broadest sense:
- a book (subject) is written by (predicate) a person (object)
You can also reverse this relationship:
- a person (subject) is the author of (predicate) a book (object)
The person in question is only an author because of his or her relationship to the book. The same person can also be a mother of three children, an employee of a library, and a speaker at a conference.
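The subject, predicate, object pattern and its reversal can be sketched in a few lines. Plain strings stand in for real URIs here, and the `Triple` type, the `INVERSE` table and the `reverse` function are names invented for this example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str


# Each predicate paired with the predicate that expresses the reverse relation.
INVERSE = {
    "is written by": "is the author of",
    "is the author of": "is written by",
}


def reverse(triple: Triple) -> Triple:
    """Swap subject and object and replace the predicate by its inverse."""
    return Triple(triple.object, INVERSE[triple.predicate], triple.subject)


written = Triple("Cloud Atlas", "is written by", "David Mitchell")
print(reverse(written))
```

Note that "author" never appears as a property of the person: it only exists as a relationship between the person and the book.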
Moreover, and this is important: there can be more than one relationship between the same two classes or types of things. A book (subject) can also be about (predicate) a person (object). In this case the person is a “subject” of the book, that can be described by a “keyword”, “subject heading”, or whatever term is used. A special case would be a book, written by someone about himself (an autobiography).
The problem with most legacy systems, and library catalogues as an example of these, is that a record for let’s say a book contains one or more fields for the author (or at best a link to an entry in an authority file or thesaurus), and separately one or more fields for subjects. This way it is not possible to see books written by an author and books about the same author in one view, without using all kinds of workarounds, link resolvers or mash-ups.
Using two different relationships that link to the same thing would provide for an actual view or representation of the real world situation.
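A rough sketch of that single view: when "written by" and "about" are both relationships pointing at the same person, one pass over the triples gives both lists. The data is illustrative (the biography is a made-up title) and `view_for` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

triples: List[Triple] = [
    ("Cloud Atlas", "is written by", "David Mitchell"),
    ("Moby Dick", "is written by", "Herman Melville"),
    # Made-up title, only here to show the second relationship.
    ("A hypothetical Melville biography", "is about", "Herman Melville"),
]


def view_for(person: str, data: List[Triple]) -> Dict[str, List[str]]:
    """Group everything that points at `person` by the kind of relationship."""
    view: Dict[str, List[str]] = defaultdict(list)
    for subject, predicate, obj in data:
        if obj == person:
            view[predicate].append(subject)
    return dict(view)


print(view_for("Herman Melville", triples))
```

In a record-based catalogue the same result would need two separate searches, one on the author field and one on the subject field.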
Another important option of Linked Data/RDF: a certain thing can have as a property a link to a concept (or “class”), describing the nature of the thing: “object Cloud Atlas” has type “book“; “object David Mitchell” has type “person“; “object Cloud Atlas” is written by “object David Mitchell“.
And of course, the property/relationship/predicate can also link to a concept describing the nature of the link.
Anywhere on the web
So far so good. But you may argue that this relationship theory is not very new. Absolutely right, but up until now this data-relationship concept has mainly been used with a view to the inside, focused on the area of the specific information system in question, because of the nature and the limitations of the available technology and infrastructure.
The “triple” model is of course exactly the same as the long standing methodology of Entity Relationship Diagrams (ERD), with which relationships between entities (=”classes“) are described. An ERD is typically used to generate a database that contains data in a specific information system. But ERD’s could just as well be used to describe Linked Data on the web.
Information systems, such as library catalogs, have been, and still are, for the greatest part closed containers of data, or “silos” without connections between them, as Tim Berners-Lee also mentions in his TED presentation.
Lots of these silo systems are accessible with web interfaces, but this does not mean that items in these closed systems with dedicated web front ends can be linked to items in other databases or web pages. Of course these systems can have API‘s that allow system developers to create scripts to get related information from other systems and incorporate that external information in the search results of the calling system. This is what is being done in web 2.0 with so-called mash-ups.
But in this situation you need developers who know how to make scripts using specific scripting languages for all the different proprietary API’s that are being supported for all the individual systems.
If Linked Data were a global standard and all open and closed systems and websites supported RDF, then all these links would be available automatically to RDF-enabled browsers and client software, using SPARQL, the RDF Query Language.
- Linked Data/RDF can be regarded as a universal API.
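To give an idea of what such a universal query looks like, the sketch below mimics a single SPARQL triple pattern in plain Python. It is a stand-in, not SPARQL itself: the `match` function and the example.org resource URIs are assumptions, and only the Dublin Core `creator` property URI is a real vocabulary term.

```python
from typing import Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]

DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
# Hypothetical resource URIs.
MITCHELL = "http://example.org/id/david-mitchell"
GRAPH = {
    ("http://example.org/id/cloud-atlas", DC_CREATOR, MITCHELL),
    ("http://example.org/id/ghostwritten", DC_CREATOR, MITCHELL),
    ("http://example.org/id/moby-dick", DC_CREATOR,
     "http://example.org/id/herman-melville"),
}


def match(graph: Iterable[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a variable."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Roughly equivalent to:
#   SELECT ?book WHERE { ?book dc:creator <http://example.org/id/david-mitchell> }
books = [subject for subject, _, _ in match(GRAPH, p=DC_CREATOR, o=MITCHELL)]
print(books)
```

The point is that the same pattern works against any source that exposes triples, without a dedicated script per proprietary API.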
The good thing about Linked Data is that it is possible to use Linked Data mechanisms to link to legacy data in silo databases. You just need to provide an RDF wrapper for the legacy system, as has been done with the Library of Congress Subject Headings.
Some examples of available tools for exposing legacy data as RDF:
- Triplify – a web application plugin that converts relational database structures into RDF triples
- D2R Server – a tool for publishing relational databases on the Semantic Web
- wp-RDFa – a wordpress plugin that adds some RDF information about Author and Title to WordPress blog posts
Of course, RDF that is generated like this will very probably only expose objects to link TO, not links to RDF objects external to the system.
Also, Linked Data can be used within legacy systems, for mixing legacy and RDF data, open and closed access data, etc. In this case we have RDF triples that have a subject URI from one data source and an object URI from another data source. In a situation with interlinked systems it would for instance be possible to see that the author of a specific book (data from a library catalog) is also speaking at a specific conference (data from a conference website). Objects linked together on the web using RDF triples are also known as an “RDF graph”. With RDF-aware client software it is possible to navigate through all the links to retrieve additional information about an object.
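The catalog-plus-conference example can be sketched as two small sets of triples that are merged into one graph. All URIs, predicate names and the speaking engagement itself are hypothetical; `reachable` is a helper written for this illustration.

```python
from collections import deque
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical URIs from two independent data sources.
BOOK = "http://catalog.example.org/book/cloud-atlas"
AUTHOR = "http://people.example.org/david-mitchell"
EVENT = "http://conference.example.org/2009/keynote"

catalog: Set[Triple] = {(BOOK, "writtenBy", AUTHOR)}
conference_site: Set[Triple] = {(AUTHOR, "speaksAt", EVENT)}


def reachable(graph: Set[Triple], start: str) -> Set[str]:
    """Follow subject -> object links breadth-first, starting at `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for subject, _, obj in graph:
            if subject == node and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return seen


# Merging graphs is a plain set union, because both sides use the same URI
# for the author.
merged = catalog | conference_site
print(EVENT in reachable(merged, BOOK))
```

The shared author URI is what makes the link possible: from the catalog alone the conference cannot be reached.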
URI’s (“Uniform Resource Identifiers”) are necessary for uniquely identifying and linking to resources on the web. A URI is basically a string that identifies a thing or resource on the web. All “Information Resources”, or WWW pages, documents, etc. have a URI, which is commonly known as a URL (Uniform Resource Locator).
With Linked Data we are looking at identifying “Non-information Resources” or “real world objects” (people, concepts, things, even imaginary things), not web pages that contain information about these real world objects. But it is a little more complicated than that. In order to honour the requirement that a thing and its relations can be interpreted and used by humans and software, we need at least 3 different representations of one resource (see: How to publish Linked Data on the web):
- Resource identifier URI (identifies the real world object, the concept, as such)
- RDF document URI (a document readable for semantic web applications, containing the real world object’s RDF data and relationships with other objects)
- HTML document URI (a document readable for humans, with information about the real world object)
For instance, there could be a Resource Identifier URI for a book called “Cloud Atlas“. The web resource at that URI can redirect an RDF enabled browser to the RDF document URI, which contains RDF data describing the book and its properties and relationships. A normal HTML web browser would be redirected to the HTML document URI, for instance a web page about the book at the publisher’s website.
There are several methods of redirecting browsers and applications to the required representation of the resource. See Cool URIs for the Semantic Web for technical details.
There are also RDF-enabled browsers that transform RDF into web pages readable by humans, like the Firefox add-on “Tabulator“, or the web based Disco and Marbles browsers, both hosted at the Free University Berlin.
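One of those methods is a redirect that depends on the Accept header sent by the client. The sketch below shows only the decision logic, with a deliberately naive header parser that ignores q-values and wildcards; the three URIs and the function name are hypothetical.

```python
from typing import Tuple

# Hypothetical URIs for the three representations of one resource.
RESOURCE_URI = "http://example.org/id/cloud-atlas"
RDF_DOCUMENT_URI = "http://example.org/data/cloud-atlas.rdf"
HTML_DOCUMENT_URI = "http://example.org/page/cloud-atlas.html"


def redirect_for(accept_header: str) -> Tuple[int, str]:
    """Return (status code, Location) for a request to RESOURCE_URI."""
    # Naive parsing: keep only the media types, drop parameters like q=0.9.
    media_types = [
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    ]
    if "application/rdf+xml" in media_types:
        return 303, RDF_DOCUMENT_URI
    # Anything else, including ordinary web browsers, gets the HTML page.
    return 303, HTML_DOCUMENT_URI


print(redirect_for("application/rdf+xml"))
print(redirect_for("text/html,application/xhtml+xml;q=0.9"))
```

The 303 (“See Other”) status tells the client that the resource identifier itself is not a document, but that a document about it can be found at the returned location.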
RDF, vocabularies, ontologies
RDF, or Resource Description Framework, is, as the name suggests, just a framework. It uses XML (or N3, a simpler non-XML notation) to describe resources by means of relationships. RDF can be implemented in vocabularies or ontologies, which are sets of RDF classes describing objects and relationships for a given field.
Basically, anybody can create an RDF vocabulary by publishing an RDF document defining the classes and properties of the vocabulary, at a URI on the web. The vocabulary can then be used in a resource by referring to the namespace (the URI) and the classes in that RDF document.
A nice and useful feature of RDF is that more than one vocabulary can be mixed and used in one resource.
Also, a vocabulary itself can reference other vocabularies and thereby inherit well established classes and properties from other RDF documents.
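Mixing vocabularies comes down to namespaces: a short prefixed name is expanded to a full URI by prepending the namespace URI of its vocabulary. A minimal sketch, where the namespace URIs are the published ones for these vocabularies and the `expand` helper and the sample description are assumptions:

```python
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(prefixed_name: str) -> str:
    """Turn 'dc:title' into the full property URI."""
    prefix, _, local_name = prefixed_name.partition(":")
    return NAMESPACES[prefix] + local_name


# One description that mixes Dublin Core and RDFS terms.
description = {
    expand("dc:title"): "Cloud Atlas: A Novel",
    expand("rdfs:label"): "Cloud Atlas: A Novel",
}
for property_uri in description:
    print(property_uri)
```

Because every term ends up as a globally unique URI, terms from different vocabularies cannot clash, even when their local names are identical.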
Another very useful feature of RDF is that objects can be linked to similar object resources describing the same real world thing. This way, confusion about which object we are talking about can be avoided.
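In practice such links are often expressed with the OWL property `owl:sameAs`. The sketch below groups URIs that are declared to denote the same thing; the URIs are hypothetical and `same_as_groups` is a simple helper for illustration, not part of any RDF library.

```python
from typing import Dict, Iterable, Set, Tuple


def same_as_groups(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map every URI to the full set of URIs known to denote the same thing."""
    groups: Dict[str, Set[str]] = {}
    for left, right in pairs:
        # Merge whatever is already known about both sides.
        merged = groups.get(left, {left}) | groups.get(right, {right})
        for uri in merged:
            groups[uri] = merged
    return groups


# Hypothetical sameAs statements from three different sources.
SAME_AS = [
    ("http://catalog.example.org/person/42", "http://people.example.org/david-mitchell"),
    ("http://people.example.org/david-mitchell", "http://encyclopedia.example.org/David_Mitchell"),
]

groups = same_as_groups(SAME_AS)
print(sorted(groups["http://catalog.example.org/person/42"]))
```

With these groups a client can treat statements made about any of the three URIs as statements about one and the same person.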
A couple of existing and well used RDF vocabularies/ontologies:
- RDF – the base RDF vocabulary
- RDFS (for RDF Schema)
- DC (for Dublin Core)
- FOAF (for Friend of a Friend) – online identities and social networks
- SKOS (for Simple Knowledge Organisation System) – thesauri, classification schemes, subject heading systems and taxonomies
- OWL (for Web Ontology Language)
(By the way, the links in the first column (to the RDF files themselves) may act as an illustration of the redirection mechanism described before. Some of them may link to either the RDF file with the vocabulary definition itself, or to a page about the vocabulary, depending on the type of browser you use: rdf-aware or not.)
A special case is:
- RDFa – a sort of microformat without a vocabulary of its own, which relies on other vocabularies for turning XHTML page attributes into RDF

<?xml version="1.0" encoding="UTF-8" ?>
<dc:publisher>Random House Trade Paperbacks</dc:publisher>
<dc:title>Cloud Atlas: A Novel</dc:title>
<rdfs:label>Cloud Atlas: A Novel</rdfs:label>
<rdfs:label>RDF document about the book: Cloud Atlas: A Novel</rdfs:label>
<rdfs:label>Review number 1 about: Cloud Atlas: A Novel</rdfs:label>
<rdfs:label>RDF Book Mashup</rdfs:label>
[Figure: a partial view of this RDF file in the Marbles browser]
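For readers without an RDF browser, here is a rough sketch of how software could read literal properties such as the `dc:title` and `dc:publisher` fragments shown above. Only those two values come from the fragments; the `rdf:RDF` root, the `rdf:Description` wrapper, the namespace declarations and the book URI are assumptions added to get a well-formed document. The parser handles just this small subset of RDF/XML.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumed wrapper around the two fragments; the subject URI is hypothetical.
RDF_XML = """<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/books/cloud-atlas">
    <dc:title>Cloud Atlas: A Novel</dc:title>
    <dc:publisher>Random House Trade Paperbacks</dc:publisher>
  </rdf:Description>
</rdf:RDF>
"""


def literal_triples(xml_text: str) -> List[Tuple[str, str, str]]:
    """Extract (subject, predicate, literal) triples from rdf:Description blocks."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree writes qualified names as "{namespace}local";
            # dropping the braces yields the full predicate URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            triples.append((subject, predicate, prop.text))
    return triples


for triple in literal_triples(RDF_XML):
    print(triple)
```

A real RDF library would also handle nested resources, typed nodes and blank nodes, which this sketch ignores.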
It seems obvious that Linked Data can be very useful in providing a generic infrastructure for linking data, metadata and objects, available in numerous types of data stores, in the online library world. With such a networked online data structure, it would be fairly easy to create all kinds of discovery interfaces for bibliographic data and objects. Moreover, it would also be possible to link to non-bibliographic data that might interest the users of these interfaces.
A brief and incomplete list of some library related Linked Data projects, some of which already mentioned above:
- RDF BookMashup – Integration of Web 2.0 data sources like Amazon, Google or Yahoo into the Semantic Web.
- Library of Congress Authorities – Exposing LoC Authorities and Vocabularies to the web using URI’s
- DBpedia – Exposing structured data from Wikipedia to the web
- LIBRIS – Linked Data interface to Swedish LIBRIS Union catalog
- Scriblio+Wordpress+Triplify – “A social, semantic OPAC Union Catalogue”
And what about MARC, AACR2 and RDA? Is there a role for them in the Linked Data environment? RDA is supposed to be the successor of AACR2 as a content standard that can be used with MARC, but also with other encoding standards like MODS or Dublin Core.
The RDA Entity Relationship Diagram, which incorporates FRBR as well, can of course easily be implemented as an RDF vocabulary that could be used to create a universal Linked Data library network. It really does not matter what kind of internal data format the connected systems use.