Library2.0 and beyond
  • Missing links

    Posted on March 28th, 2011 Lukas Koster 1 comment

    The challenges of generating linked data from legacy databases

    © extranoise

     

    Some time ago I wrote a blog post about the linked data proof of concept project I am involved in, connecting bibliographic metadata from the OPAC of the Library of the University of Amsterdam with the theatre performances database maintained by the Theatre Institute of The Netherlands.
    I ended that post with a list of next steps to take:

    • select/adapt/create a vocabulary for the Production/Performance subject area
    • select/adapt/create vocabularies for Persons (FOAF?) and Subjects (SKOS?)
    • add internal relationships with the other entities (Play, Production, etc.) in the JSON structure (implement RDF in JSON)
    • Add RDF/XML as output option, besides JSON
    • add external relationships (to other linked data sources like DBPedia, etc.)
    • extend the number of possible URI formats (for Play, Production, etc.)
    • add content negotiation to serve both human and machine readable redirects
    • extend the options on the OPAC side
    • publish UBA bibliographic data as linked open data (probably an entirely new project)

    So, what have we achieved so far? I can be brief about all the ‘real’ linked data stuff (RDF, vocabularies, external links, content negotiation): we are not there yet. This will be dealt with in the next phase.
    Instead, we have focused on getting the simple JSON implementation right, both on the data publishing side and on the data using side. We have added more URIs and internal relationships, and we are using these in the OPAC user interface.
    But we have also encountered a number of crucial problems that are in my view inherent to the type of legacy data models used in libraries and cultural heritage institutions.

    Theatre Production data in the Library Catalogue

     

    Progress

    First let me describe the improvements we have added so far.

    The URI for a ‘person’, <baseurl>/person/<personname>, now also returns a link to all the ‘titles’ that person is connected to (not only in the ‘author’ role, but in all roles, such as director or performer): <baseurl>/gettitles/<personname>. This link returns a set of URIs of the form <baseurl>/title/<personname>/<title>. At the moment, the /<personname>/<title> part is the only way to construct a more or less unique identifier for the ‘play’ in the TIN database from the OPAC metadata. This raises a number of serious problems, which I discuss below.

    The URI:

    <baseurl>/person/Beckett, Samuel

    returns among others:

    /title/Beckett, Samuel/Waiting for Godot
    /title/Beckett, Samuel/En attendant Godot
    /title/Beckett, Samuel/Endgame
    etc.

    The URI for a ‘play’, <baseurl>/title/<personname>/<title>, now returns a set of ‘production’ URIs of the form <baseurl>/production/<personname>/<title>/<openingdate>/<idnr>.
    The ‘production’ URI returns information about the ‘theatre company’, the ‘venue’ and all persons connected to that production, including their URIs, and, when available, links to an image of a poster and to a video.

    The URI

    <baseurl>/title/Beckett, Samuel/Waiting for Godot

    returns:

    /production/Beckett, Samuel/Waiting for Godot/1988-07-28/5777
    /production/Beckett, Samuel/Waiting for Godot/1988-11-22/6750
    /production/Beckett, Samuel/Waiting for Godot/1992-04-16/10728
    /production/Beckett, Samuel/Waiting for Godot/1981-02-18/43032
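    The URI patterns above can be sketched in a few lines of Python. This is only an illustration, not the project's actual script: the base URL is a placeholder for <baseurl>, the function names are made up, and percent-encoding the name and title segments is an assumption (the examples in this post show the raw strings).

```python
from urllib.parse import quote

# Placeholder for <baseurl>; the real address is not given in this post.
BASE_URL = "http://example.org/tin"


def _segment(value: str) -> str:
    """Percent-encode one path segment, including commas, spaces and slashes."""
    return quote(value, safe="")


def person_uri(person_name: str) -> str:
    """<baseurl>/person/<personname>"""
    return f"{BASE_URL}/person/{_segment(person_name)}"


def title_uri(person_name: str, title: str) -> str:
    """<baseurl>/title/<personname>/<title>"""
    return f"{BASE_URL}/title/{_segment(person_name)}/{_segment(title)}"


def production_uri(person_name: str, title: str, opening_date: str, idnr: int) -> str:
    """<baseurl>/production/<personname>/<title>/<openingdate>/<idnr>"""
    return (
        f"{BASE_URL}/production/{_segment(person_name)}/{_segment(title)}"
        f"/{opening_date}/{idnr}"
    )


if __name__ == "__main__":
    print(person_uri("Beckett, Samuel"))
    print(title_uri("Beckett, Samuel", "Waiting for Godot"))
    print(production_uri("Beckett, Samuel", "Waiting for Godot", "1981-02-18", 43032))
```

    Note that only the production URI ends in something that looks like a real identifier (the <idnr>); the person and title URIs depend entirely on the exact spelling of the strings.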

    The last ‘production’ URI returns:

    "name":"Beckett, Samuel",
    "title":"Waiting For Godot",
    "opening":"1981-02-18",
    "people":[
      {"description":"Beckett, Samuel (auteur: toneelspel van)",
       "uri":"/person/Beckett, Samuel"},
      {"description":"Hartnett, John (regie)",
       "uri":"/person/Hartnett, John"},
      {"description":"Muller, Frans (decor: ontwerp)",
       "uri":"/person/Muller, Frans"},
      {"description":"Newell, Kym (licht: ontwerp)",
       "uri":"/person/Newell, Kym"},
      {"description":"Zaal, Kees (geluid)",
       "uri":"/person/Zaal, Kees"},
      {"description":"Tolstoj, Alexander (uitvoerende: Lucky)",
       "uri":"/person/Tolstoj, Alexander"},
      {"description":"Weeks, David (uitvoerende: Estragon)",
       "uri":"/person/Weeks, David"},
      {"description":"Coburn, Grant (uitvoerende: Vladimir)",
       "uri":"/person/Coburn, Grant"},
      {"description":"Evans, Rhys (uitvoerende: Pozzo)",
       "uri":"/person/Evans, Rhys"},
      {"description":"Geiringer, Karl (uitvoerende: A Boy)",
       "uri":"/person/Geiringer, Karl"},
      {"description":"Guidi, Peter (uitvoering muziek)",
       "uri":"/person/Guidi, Peter"},
      {"description":"Kimmorley, Roxanne (uitvoering muziek)",
       "uri":"/person/Kimmorley, Roxanne"},
      {"description":"Vries, Hessel de (uitvoering muziek)",
       "uri":"/person/Vries, Hessel de"},
      {"description":"Phillips, Margot (uitvoering muziek)",
       "uri":"/person/Phillips, Margot"}
    ]

     

    Challenges/problems

    Now, the problems (or challenges) that we are facing here are essential to the core concept of linked data:

    • we don’t have actual matching unique identifiers (URIs)
    • we don’t have explicit internal relations with a common entity in both sources
    • part of the data consists of literal strings in a specific language

    These three problems are interrelated; they are linked problems, so to speak.

     

    Missing identifiers

    To start with the identifiers. Of course we have internal system identifiers in our local Aleph catalogue database. Because we contribute to the Dutch Union Catalogue (originally a PICA system, now OCLC), our bibliographic records also have national Dutch PICA identifiers. And because the Dutch Union Catalogue records are copied to WorldCat, these records in WorldCat also have OCLC numbers.
    Also the Theatre Institute has internal system identifiers in their Adlib database. But at the moment we do not have a match between these separate internal identifier schemes. The Theatre Production database records are not in WorldCat because they’re not bibliographic records.
    We are more or less forced to use the string values of the title and author fields to construct a usable URI, on both sides. Clearly this is a source of many errors, because of the great number of possible variations in author and title descriptions.
    But even if the Theatre Institute’s records were in the Union Catalogue or WorldCat as well, then we still would not have an automatic match without some kind of broker mechanism ascertaining that the library catalogue record describes the same thing as the theatre production database record. The same applies to the author, which of course should be a relation of the type “written by” between the play and a person record instead of string values. Both systems do have internal author or person authority files, but there is no direct matching. For authors this could theoretically be achieved by linking to an online person authority file like VIAF. But in the current situation this is not available.

     

    Missing relations

    This brings me to the second problem. Because we are using the string values of the title instead of unique identifiers, we connect plays and productions through one specific title variant or language. In our current implementation this means that we are not able to link to all versions of one specific play.
    For instance, from our OPAC the following URIs are constructed (two in English, one in French, one in Dutch):

    /title/Beckett, Samuel/Waiting for Godot
    /title/Beckett, Samuel/Waiting for Godot : a tragicomedy in two acts
    /title/Beckett, Samuel/En attendant Godot : pièce en deux actes
    /title/Beckett, Samuel/Wachten op Godot

    In the Theatre Production database (two in English, four in Dutch, one in French, one in German):

    /title/Beckett, Samuel/Waiting for Godot
    /title/Beckett, Samuel/Waiting For Godot
    /title/Beckett, Samuel/Wachten op Godot
    /title/Beckett, Samuel/Wachtend op Godot
    /title/Beckett, Samuel/Wachten op Godot (De favorieten)
    /title/Beckett, Samuel/Wachten op Godot (eerste bedrijf)
    /title/Beckett, Samuel/En attendant Godot
    /title/Beckett, Samuel/Warten auf Godot

    Only the first and fourth URI from the OPAC will find corresponding titles in the Theatre Production database. The second and third one, using a subtitle within the main title, don’t even have equivalents. And only two of the eight entries from the Theatre Production database have a match in the catalogue.
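    The matching problem is easy to reproduce with the two title lists above. The sketch below is an illustration only: the function names are made up, and the normalisation step (ignoring case and cutting off a subtitle after " : ") is a heuristic I am assuming here, not something implemented in the project.

```python
# Titles as they appear in the URIs listed above, for "Beckett, Samuel".
OPAC_TITLES = [
    "Waiting for Godot",
    "Waiting for Godot : a tragicomedy in two acts",
    "En attendant Godot : pièce en deux actes",
    "Wachten op Godot",
]

TIN_TITLES = [
    "Waiting for Godot",
    "Waiting For Godot",
    "Wachten op Godot",
    "Wachtend op Godot",
    "Wachten op Godot (De favorieten)",
    "Wachten op Godot (eerste bedrijf)",
    "En attendant Godot",
    "Warten auf Godot",
]


def exact_matches(left, right):
    """Titles from `left` that occur character for character in `right`."""
    right_set = set(right)
    return [title for title in left if title in right_set]


def normalise(title):
    """Heuristic key: drop a subtitle after ' : ' and ignore case."""
    main_title = title.split(" : ", 1)[0]
    return main_title.strip().casefold()


def normalised_matches(left, right):
    """Titles from `left` whose normalised key occurs in `right`."""
    right_keys = {normalise(title) for title in right}
    return [title for title in left if normalise(title) in right_keys]


if __name__ == "__main__":
    print(exact_matches(OPAC_TITLES, TIN_TITLES))       # 2 of the 4 OPAC titles
    print(exact_matches(TIN_TITLES, OPAC_TITLES))       # 2 of the 8 TIN titles
    print(normalised_matches(OPAC_TITLES, TIN_TITLES))  # all 4 OPAC titles
    print(normalised_matches(TIN_TITLES, OPAC_TITLES))  # 4 of the 8 TIN titles
```

    Even with this kind of string clean-up, the English, French, Dutch and German titles remain unconnected to each other, and a variant like ‘Wachtend op Godot’ still falls through. String normalisation reduces the damage; it does not replace a shared identifier.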
    In a library catalogue environment we are used to this problem, because catalogues are used for describing physical objects in the form of editions and copies. Unfortunately, also the Theatre Production database just contains records describing productions of a specific ‘edition’ or translation of a play, with only the opening performance information attached.

    This is where I need to talk about FRBR. Basically in a library catalogue environment this means that we should describe the relations between the ‘work’ (original text), the ‘expression’ (the version or translation), the ‘manifestation’ (edition, format, etc.) and the ‘items’ (the physical copies). Via the relations with higher level expression and work, the physical copy could be linked to the unifying work level, and then ideally through some universally valid unique identifier to, in our case, the theatre plays.
    Although FRBR is a publication centered schema used only in libraries, the same concepts can be applied to theatre performances: the original work (which is the same as the work in a bibliographical sense) has expressions (adaptations, translations, etc.), which have manifestations (productions), and in the end the individual items (actual performances on a specific date, time and location).

    Linking library and theatre in theory through FRBR

    If both the library catalogue and the theatre production database were FRBRised, we could in theory link on the Work level and cover all individual versions. But we would still need a matching mechanism on that Work level of course.
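    To make this a bit more concrete, here is a minimal sketch of what linking on the Work level could look like if both sources were FRBRised. Everything in it is hypothetical: the class names, the "work:godot" identifier and the grouping of titles under expressions are my own illustration, and the Item level (copies, individual performances) is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Manifestation:
    source: str  # "library" (an edition) or "theatre" (a production)
    label: str


@dataclass
class Expression:
    language: str
    title: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    work_id: str  # hypothetical shared identifier; this is what we are missing
    author: str
    expressions: List[Expression] = field(default_factory=list)

    def manifestations(self, source: Optional[str] = None) -> List[Manifestation]:
        """All manifestations of all expressions, optionally for one source."""
        found = [m for e in self.expressions for m in e.manifestations]
        if source is not None:
            found = [m for m in found if m.source == source]
        return found


godot = Work(
    work_id="work:godot",
    author="Beckett, Samuel",
    expressions=[
        Expression("en", "Waiting for Godot", [
            Manifestation("library", "Waiting for Godot : a tragicomedy in two acts"),
            Manifestation("theatre", "production 1981-02-18/43032"),
        ]),
        Expression("nl", "Wachten op Godot", [
            Manifestation("library", "Wachten op Godot"),
            Manifestation("theatre", "Wachten op Godot (eerste bedrijf)"),
        ]),
    ],
)

if __name__ == "__main__":
    # Starting from any edition, go up to the Work and down again to the productions.
    for manifestation in godot.manifestations("theatre"):
        print(manifestation.label)
```

    The point of the sketch is the direction of travel: from a book up to the Work, and from the Work down to every production in every language. The hard part, deciding that two records belong to the same Work, is exactly the matching mechanism that is still missing.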

    In reality however we can only try to link on the Manifestation level in an imperfect way.

    Linking library and theatre in reality

    At the moment, in our project, on the catalogue side we extract the title and author from the generated OPAC HTML. It could be an option to get the available linking information from the internal MARC records (like the 240, 246, 765, 767, 775 tags), but that is not easy, for a number of reasons. Something similar could be done in the theatre production database, making implicit links explicit. But all this makes the effort needed to get something sensible out there much bigger.

     

    Literal strings

    The third problem, the literal strings in Dutch both in the library catalogue and in the theatre production database, prevents the effective use of the data in multilingual environments, equally in the traditional native interfaces and as linked data. Obviously for English speaking end users the Dutch terms mean nothing. And in a linked data environment the Dutch strings can’t easily be linked to other data, in Dutch, English, or any language, without unique identifiers.
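    As an illustration of what ‘making the strings explicit’ could mean for the production data, the sketch below splits a description like "Weeks, David (uitvoerende: Estragon)" into a name, a role and a detail, and maps the Dutch role term to a language-independent identifier. The "role:..." identifiers and the function name are invented for this example; in a real implementation they would be URIs from a shared vocabulary that we have yet to select.

```python
import re

# Dutch role terms as they occur in the production output, mapped to
# made-up, language-independent identifiers (placeholders for real URIs).
ROLE_IDS = {
    "auteur": "role:author",
    "regie": "role:director",
    "decor": "role:set",
    "licht": "role:lighting",
    "geluid": "role:sound",
    "uitvoerende": "role:performer",
    "uitvoering muziek": "role:musician",
}

# "<name> (<role>)" or "<name> (<role>: <detail>)"
DESCRIPTION = re.compile(
    r"^(?P<name>.+?) \((?P<role>[^:()]+)(?:: (?P<detail>[^()]+))?\)$"
)


def parse_description(description):
    """Split a literal description string into explicit parts, or return None."""
    match = DESCRIPTION.match(description)
    if match is None:
        return None
    role_label = match.group("role").strip()
    return {
        "name": match.group("name"),
        "role_label": role_label,                # the original Dutch string
        "role_id": ROLE_IDS.get(role_label),     # None if the term is unknown
        "detail": match.group("detail"),         # e.g. the character played
    }


if __name__ == "__main__":
    for text in (
        "Weeks, David (uitvoerende: Estragon)",
        "Hartnett, John (regie)",
        "Guidi, Peter (uitvoering muziek)",
    ):
        print(parse_description(text))
```

    Once the role is an identifier instead of a Dutch word, an interface in any language can attach its own label to it, and other datasets can link to it.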

     

    Implicit to explicit

    People calling on institutions to publish their data as linked open data tend to say it’s easy once you know how to do it. And of course it must be done. But if the published datasets have a flat internal structure designed to fulfill a specific local business objective, then they just don’t provide sufficient added value for third party use. In order to make your published open data useful for others, you have to make implicit relations explicit. And this requires more than just making the data available in RDF ‘as is’: it requires a lot of processing.


  • Dutch Culture Link

    Posted on October 7th, 2010 Lukas Koster 6 comments

    Linking library and cultural heritage data

    Culture links © Scott Beale/Laughing Squid (http://laughingsquid.com/)

    “Interested to publishing a test collection as linked open data to help @StichtingDEN with practical guide for heritage institutions?” That’s what my former colleague at the Library of the University of Amsterdam, Marco Streefkerk, now project manager at DEN (Digital Heritage Foundation The Netherlands), asked me in April 2010.

    Was I interested? Of course I was. I had written a blog post “Linked data for libraries” almost a year before, and I had been very interested in the subject since then. Unfortunately in my day job at the Library of the University of Amsterdam (UBA) until very recently there was no opportunity to put my theoretical knowledge to practice. However, in the Library’s “Action plans 2010-2011” (January 2010), the Semantic Web is mentioned in the Innovation chapter as one of the areas with room for a small pilot involving linked data and RDF. I like to think it was me who managed to get it in there ;-)

    To come back to Marco’s question, I was at the time actually trying to think of a linked data/RDF test, and it so happened that I had talked to Ad Aerts of the Theater Institute of The Netherlands (TIN) about organising such a test the day before! So that’s what I told Marco. And things started from there.

    The first idea was to publish a small test set of records from one of the University Library’s own heritage collections. The goal from the point of view of DEN was to publish a short practical guide on how to publish heritage collections as linked data, targeted at heritage institutions.
    But after some email discussions and meetings we decided to incorporate TIN in this test and apply both sides of the linked data concept: publish linked data and use linked data.
    Apart from a library catalogue, TIN also has a large database containing metadata on theater performances and a large collection of audiovisual material related to these performances. The plan was to publish the performance metadata and related digital material as linked data.
    The UBA would then use this TIN linked data in their traditional MARC based OPAC to enrich the plain bibliographic metadata if the OPAC search results related to theater plays.

    We decided to name our little proof of concept project “Dutch Culture Link”. The people involved for DEN are Marco Streefkerk, Annelies van Nispen and Monika Lechner. For TIN it’s Ad Aerts. For UBA: Roxana Popistasu and myself. Of these five people I knew four already face to face and one (Monika) on Twitter. I think this helps.

    To start with, we described the data model of the TIN Productions and Performances database (in terms of relationships or triples) as follows:

    • a Play is written by one or more Persons (as author)
    • a Play can be ‘effectuated’ in one or more Productions
    • a Production can be ‘staged’ in one or more Performances
    • a Performance takes place in one Venue on a specific date and time
    • a Person can be producer of a Production
    • a Person can be director of a Production
    • a Person can play a character in a Production, or even in an individual Performance

    DCL data model

    Besides the metadata TIN also has links from the database to digital collections (sound and video recordings, photographs, reviews). The model is strikingly similar to the bibliographic FRBR model. The Play is a FRBR Work, the Production is a FRBR Expression and/or Manifestation, the Performance is a FRBR Item.
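    Written down as triples, the model could look something like the sketch below. The identifiers ("play:...", "production:...") and the predicate names are invented for the example; choosing real URIs and a real vocabulary is precisely one of the open questions that follow.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical identifiers and predicates, filled with a few facts about one
# production of "Waiting for Godot" from the TIN database.
TRIPLES: List[Triple] = [
    ("play:waiting-for-godot", "writtenBy", "person:beckett-samuel"),
    ("production:43032", "productionOf", "play:waiting-for-godot"),
    ("production:43032", "directedBy", "person:hartnett-john"),
    ("production:43032", "hasPerformer", "person:weeks-david"),
    ("performance:43032-opening", "performanceOf", "production:43032"),
    ("performance:43032-opening", "date", "1981-02-18"),
]


def objects(subject: str, predicate: str) -> List[str]:
    """All objects for a given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate: str, obj: str) -> List[str]:
    """All subjects that point to `obj` with the given predicate."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def people_for_play(play: str) -> List[str]:
    """Everyone connected to a play, directly or via its productions."""
    people = set(objects(play, "writtenBy"))
    for production in subjects("productionOf", play):
        for s, _, o in TRIPLES:
            if s == production and o.startswith("person:"):
                people.add(o)
    return sorted(people)


if __name__ == "__main__":
    print(subjects("productionOf", "play:waiting-for-godot"))
    print(people_for_play("play:waiting-for-godot"))
```

    Following the links from Play to Production to Person is a matter of walking the triples, which is the whole attraction of the approach.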

    Now we knew who and what, but not yet how. We needed to know how to actually apply the theoretical concepts of linked data to our subject area. Questions we had were:

    • which ontology/vocabulary (‘data model’) do we need for publishing the production data?
    • how do we format URIs (the linked data unique identifiers)?
    • how do we implement RDF?
    • which publication techniques and platforms do we use?
    • which scripting languages can we use?
    • how do we find and get the published linked data?
    • how do we process and present the retrieved linked data?

    We definitely needed some practical hands-on tutorials or training. We could not find an institution organising practical linked data training courses in The Netherlands at short notice. Via Twitter, Ian Davis referred us to the TALIS training options. Unfortunately, because we are only an informal proof of concept pilot project without any project funding, we were unable to proceed on this track.
    However, through a contact at The European Library we managed to enter two members of our project team as participants in the free Linked Data Workshop at DANS in The Hague, with Herbert Van De Sompel, Ivan Herman and Antoine Isaac as trainers. This workshop proved to be very useful. Unfortunately I could not attend myself.

    After the workshop we decided to adopt an “agile” approach: just start and proceed with small steps. For the short term this meant on the TIN side: implementing a script that accesses the XML gateway of the Adlib system underlying the Theater Production Database and produces results in JSON format. The script accepts as input URIs of the form <baseurl>/person/<name>, <baseurl>/play/<person>/<title>, etc. For now only the <baseurl>/person/<name> works, but there are more to come.

    An example: the request <baseurl>/person/joost van den vondel gives the JSON result:

    jsonTIN({
    "key":"vondel, joost van den",
    "name":"Vondel, Joost van den",
    "birth.country":"Duitsland",
    "birth.date":"17 november 1587*",
    "birth.place":"Keulen",
    "death.date":"5 februari 1679*",
    "death.place":"Amsterdam"
    })

    On the UBA side, if there is an author and/or title field in an individual OPAC result, a JavaScript addon to the Aleph OPAC HTML templates directs a query at the TIN linked data URL using one or both fields as input. The resulting JSON data from TIN is then processed and displayed. At the moment only the author field is used in the <baseurl>/person/<name> query. But there is more to come.
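    The result is wrapped in a jsonTIN(...) call so that the browser can load it with a script tag from another domain (the JSONP pattern). Outside a browser you have to strip that wrapper yourself. The following Python sketch shows one way to do that; it is not part of the project code, and the function name and the hard-coded sample response are only there for illustration.

```python
import json
import re

# "<callback>( <json> )" with optional whitespace and a trailing semicolon.
JSONP = re.compile(
    r"^\s*(?P<callback>[A-Za-z_$][\w$]*)\s*\((?P<payload>.*)\)\s*;?\s*$",
    re.DOTALL,
)

# The response from the example above, as it would arrive over HTTP.
SAMPLE_RESPONSE = """jsonTIN({
"key":"vondel, joost van den",
"name":"Vondel, Joost van den",
"birth.country":"Duitsland",
"birth.date":"17 november 1587*",
"birth.place":"Keulen",
"death.date":"5 februari 1679*",
"death.place":"Amsterdam"
})"""


def parse_jsonp(text, expected_callback="jsonTIN"):
    """Strip the callback wrapper and return the JSON payload as a dict."""
    match = JSONP.match(text)
    if match is None or match.group("callback") != expected_callback:
        raise ValueError("not a %s(...) response" % expected_callback)
    return json.loads(match.group("payload"))


if __name__ == "__main__":
    person = parse_jsonp(SAMPLE_RESPONSE)
    print(person["name"], "-", person["birth.place"], "/", person["death.place"])
```

    Note that the dates and the country come back as Dutch display strings ("17 november 1587*", "Duitsland"), so a consumer still has to interpret them before they can be linked to anything.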

    UBA test OPAC with TIN data

    Next steps in this project:

    • select/adapt/create a vocabulary for the Production/Performance subject area
    • select/adapt/create vocabularies for Persons (FOAF?) and Subjects (SKOS?)
    • add internal relationships with the other entities (Play, Production, etc.) in the JSON structure (implement RDF in JSON)
    • Add RDF/XML as output option, besides JSON
    • add external relationships (to other linked data sources like DBPedia, etc.)
    • extend the number of possible URI formats (for Play, Production, etc.)
    • add content negotiation to serve both human and machine readable redirects
    • extend the options on the OPAC side
    • publish UBA bibliographic data as linked open data (probably an entirely new project)

    The team will be blogging about project developments (in Dutch) on the DEN blog (addition July 7 2011: new DEN blog location).

    To be continued…


  • UMR – Unified Metadata Resources

    Posted on April 12th, 2009 Lukas Koster 7 comments

    One single web page as the single identifier of every book, author or subject


    I like the concept of “the web as common publication platform for libraries”, and “every book its own url”, as described by Owen Stephens in two blog posts:
    “It’s time to change library systems”

    I’d suggest what we really need to think about is a common ‘publication’ platform – a way of all of our systems outputting records in a way that can then be easily accessed by a variety of search products – whether our own local ones, remote union ones, or even ones run by individual users. I’d go further and argue that platform already exists – it is the web!

    and “The Future is Analogue”

    If every book in your catalogue had it’s own URL – essentially it’s own address on your web, you would have, in a single step, enabled anyone in the world to add metadata to the book – without making any changes to the record in your catalogue.

    This concept of identifying objects by URL (Uniform Resource Locator), or maybe better URI (Uniform Resource Identifier), is central to the Semantic Web, which uses RDF (Resource Description Framework) as a metadata model.

    As a matter of fact, at ELAG 2008 I saw Jeroen Hoppenbrouwers (“Rethinking Subject Access”) explain his idea of doing the same for Subject Headings, using the Semantic Web concept of triples. Every subject its own URL or web page. He said: “It is very easy. You can start doing this right away”.


    © Jeroen Hoppenbrouwers

    To make the picture complete we only need the third essential component: every author his or her or its own URL!

    This ideal situation would have to conform to the Open Access guidelines of course. One single web page serving as the single identifier of every book, author or subject, available for everyone to link their own holdings, subscriptions, local keywords and circulation data to.

    In real life we see a number of current initiatives on the web by commercial organisations and non commercial groups, mainly in the area of “books” (or rather “publications”) and “authors”. “Subjects” apparently is a less appealing area to start something like this, because obviously stand-alone “subjects” without anything to link them to are nothing at all, whereas you always have “publications” and “authors”, even without “subjects”. The only subject-oriented project I know of is MACS (Multilingual Access to Subjects), which is hosted on Jeroen Hoppenbrouwers’ domain.

    For publications we have OCLC’s WorldCat, LibraryThing, Open Library, to name just a few. And of course these global initiatives have had their regional and local counterparts for many years already (Union Catalogues, Consortia models). But this is again a typical example of multiple parallel data stores of the same type of entities. The idea apparently is that you want to store everything in one single database aiming to be complete, instead of the ideal situation of single individual URIs floating around anywhere on the web.
    Ex Libris’ new Unified Resource Management development (URM, and yes: the title of this blog post is an ironic allusion to that acronym) promotes sharing of metadata, but it does so within yet another separate system into which metadata from other systems can be copied.

    The same goes for authors. We have WorldCat Identities, VIAF, local authority schemes like DAI, etc. Again, we see parallel silos instead of free floating entities.

    Of course, the ideal picture sketched above is much too simple. We have to be sure which version of a publication, which author and which translation of a subject, for instance, we are dealing with. For publications this means that we need to implement FRBR (in short: an original publication/work and all of its manifestations), for authors we need author name thesauri, and for subjects multilingual access.

    I have tried to illustrate this in this simplified and incomplete diagram:

    © Lukas Koster


    In this model libraries can use their local URI-objects representing holdings and copies for their acquisitions and circulation management, while the bibliographic metadata stay out there in the global, open area. Libraries (and individuals of course) can also attach local keywords to the global metadata, which in turn can become available globally (“social tagging”).

    It is obvious that the current initiatives have dealt with these issues with various levels of success. Some examples to illustrate this:

    • Work: Desiderius Erasmus, Encomium Moriae (Greek), Laus Stultitiae (Latin), Lof der Zotheid (Dutch), Praise of Folly (English)
    • Author: David Mitchell

    Authors
    Good:

    Medium:

    Bad:

    Publications
    Good:

    Bad:

    These findings seem to indicate that some level of coordination (which the commercial initiatives apparently have implemented better than the non-commercial ones) is necessary in order to achieve the goal of “one URI for each object”.

    Who wants to start?


  • Unique authors

    Posted on February 4th, 2009 Lukas Koster No comments

    Jonathan Rochkind, in his post “How do name authorities work anyway?”, wonders if catalogers will confuse him with another writer of the same name that has an LC authority record, whereas he does not have one.

    I guess the relevance of this problem depends entirely on the question: do you think it’s important to know that an author of a specific work is the same as the author of another work? A former colleague of mine whom I respect very much, used to say that it does not matter, as long as the correct name appears with the work in question. This was only six years ago, before the emergence of web 2.0 and library 2.0 type services. It is just like looking at a printed book: you read the author’s name, and if there is no further information on the back cover, or a list of publications by the author inside, then that’s all there is to it. In normal life, if you read a book or an article for pleasure, or even for business, study or research, that is no problem. No need for author authority records at all.

    However, the picture is completely different from the point of view of the authors, especially in the case of professional scientific and research staff, where the exact number of publications and citations is crucial. For these authors it is vital that the correct authority record is used for their publications. Here we definitely need authority records with unique identifiers. But of course there are many different systems in use (LC authority records, WorldCat Identities, national systems, etc.), and they all use their own identifiers.

    There is the proposal to develop the UAI, Universal Author Identifier. This system depends on authors registering and maintaining their own personal information in a freely accessible web based database. There was a pilot system for a while, but it is not clear if any results were reached.

    In The Netherlands a similar project on a national scale has led to a live implementation: the DAI, Digital Author Identifier. The DAI is based on the identifier used for authors in the OCLC-PICA Dutch National Union Catalog/Common Catalog system “PPN”, and is assigned to every author who has been appointed to a position at a Dutch university or research institute or has some other relevant connection with one of these organisations. The DAI is used in the Dutch university repositories, the Dutch national Research Database and in the national integrated portal NARCIS.
    The difference with UAI is that DAI is assigned by catalogers in one of the participating organisations, whereas UAI depends on voluntary cooperation of the authors themselves.

    Of course a “universal author identifier” still does not solve Jonathan’s initial question: confusion is still possible if the authors do not have a clear interest in maintaining their information themselves.

    Another issue here, about which more can be said in a future post, is that for a truly universal system we should use URIs, as for unique works (see Owen Stephens’ post “The Future is Analogue”) and subject headings.
