Data. The final frontier.
  • Maps, dictionaries and guidebooks

    Posted on August 3rd, 2015 Lukas Koster 7 comments

    Interoperability in heterogeneous library data landscapes


    Libraries have to deal with a highly opaque landscape of heterogeneous data sources, data types, data formats, data flows, data transformations and data redundancies, which I have earlier characterized as a “data maze”. The level and magnitude of this opacity and heterogeneity varies with the number of content types and the number of services that the library is responsible for. Academic and national libraries are possibly dealing with more extensive mazes than small public or company libraries.

    In general, libraries curate collections of things and also provide discovery and delivery services for these collections to the public. To carry out these tasks successfully, they manage a lot of data. Data can be regarded as the signals between collections and services.

    These collections and services are administered using dedicated systems with dedicated datastores. The data formats in these datastores are tailored to the dedicated services that these systems are designed for. To use the data for delivering services they were not designed for, it is common practice to deploy dedicated transformation procedures, either manual ones or automated utilities. These transformation procedures function as translators of the signals in the form of data.

    © Ron Zack

    Here lies the origin of the data maze: an inextricably entangled mishmash of systems with explicit and implicit data redundancies, using a number of different data formats, with only some of these systems talking to each other in some way. This is confusing not only for end users but also for library system staff. End users do not know which user interface to use, and miss relevant results from other sources as well as possibly related information. Libraries need licenses and expertise for the ongoing administration, conversion and migration of multiple systems, and suffer unforeseen consequences of adjustments made elsewhere.

    To take the linguistic analogy further, systems make use of a specific language (data format) to code their signals in. This is all fine as long as they are only talking to themselves. But as soon as they want to talk to other systems that use a different language, translations are needed, as mentioned. Sometimes two systems use the same language (like MARC, DC, EAD), but this does not necessarily mean they can understand each other. There may be dialects (DANMARC, UNIMARC), local colloquialisms, differences in vocabularies and even alphabets (local fields, local codes, etc.). Some languages are only used by one system (like PNX for Primo). All languages describe things in their own vocabulary. In the systems and data universe there are not many loanwords or other mechanisms to make it clear that systems are talking about the same thing (no relations or linked data). And then there is syntax and grammar (such as subfields and cataloguing rules) that allow for lots of variations in formulations and formats.

    Translation requires not only applying a dictionary, but also interpreting the context, syntax, local variations and transcriptions. Consequently, much is lost in translation.

    The transformation utilities functioning as translators of the data signals suffer from a number of limitations. They translate between two specific languages or dialects only. And usually they are employed by only one system (proprietary utilities). So even if two systems speak the same language, they probably both need their own translator from a common source language. In many cases even two separate translators are needed if source and target system do not speak each other’s language or dialect: the source signals are translated into some common language, which in turn is translated into the target language. This export-import scenario, which entails data redundancy across systems, is referred to as ETL (Extract Transform Load). Moreover, most translators only know a subset of the source and target language, depending on the data signals needed by the services provided. In some cases “data mappings” are used as conversion guides. This term does not really cover what is actually needed, as I have tried to demonstrate. It is not enough to show the paths between source and target signals; it is essential to add the selections and transformations needed as well. In order to make sense of the data maze you need a map, a dictionary and a guidebook.
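    A minimal sketch of what such a translator actually has to do, beyond a plain field-to-field mapping. Everything here (record layout, field names, rules) is invented for illustration; the point is that selection and transformation logic is part of the “mapping” too.

    # Illustrative only: a toy "translator" between two invented record formats.
    # It shows that a usable mapping needs selections and transformations,
    # not just a list of source and target field names.

    def translate_record(source):
        """Translate one record from an imaginary source format to an imaginary target format."""
        # Selection: skip records the target service does not need.
        if source.get("type") != "book":
            return None

        # Transformation: reshape and normalise values, not just copy them.
        surname, _, forename = source.get("author", "").partition(", ")
        return {
            "title": source.get("title", "").rstrip(" /"),
            "creator": f"{forename} {surname}".strip(),
            "year": source.get("date", "")[:4],
        }

    if __name__ == "__main__":
        record = {"type": "book", "title": "Waiting for Godot /",
                  "author": "Beckett, Samuel", "date": "1988-07-28"}
        print(translate_record(record))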

    To make things even more complicated, sometimes reading data signals is only possible with a passport or visa (authentication for access to closed data). Or even worse, a system’s borders may be completely closed, with no access whatsoever possible, not even with a passport. Usually this last situation is referred to as “data silos”, but that is not the complete picture. If systems are fully open, but their data signals are coded by means of untranslatable languages or syntaxes, we are also dealing with silos.

    Anyway, a lot of attention and maintenance is required to keep this Tower of Babel functioning. This practice is extremely resource-intensive, costly and vulnerable. Are there any solutions available to diminish maintenance, costs and vulnerability? Yes there are.

    First of all, it is absolutely crucial to get acquainted with the maze. You need a map (or even an atlas) to be able to see which roads are there, which ones are inaccessible, what traffic is allowed, what shortcuts are possible, which systems can be pulled down and where new roads can be built. This role can be fulfilled by a Dataflow Repository, which presents an up-to-date overview of locations and flows of all content types and data elements in the landscape.

    Secondly it is vital to be able to understand the signals. You need a dictionary to be able to interpret all signals, languages, syntaxes, vocabularies, etc. A Data Dictionary describing data elements, datastores, dataflows and data formats is the designated tool for this.

    And finally it is essential to know which transformations are taking place en route. A guidebook should be incorporated in the repository, describing selections and transformations for every data flow.
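    As a rough sketch of how these three instruments might hang together in one registry (the structure and field names are my own, not those of an existing tool):

    from dataclasses import dataclass, field

    # A toy model of the "map, dictionary and guidebook" idea: datastores and
    # flows (the map), data elements and formats (the dictionary), and the
    # selections/transformations applied en route (the guidebook).

    @dataclass
    class Datastore:
        name: str
        data_format: str                               # e.g. "MARC21", "EAD", "PNX"
        elements: list = field(default_factory=list)   # data dictionary entries

    @dataclass
    class Dataflow:
        source: Datastore
        target: Datastore
        selection: str                                 # which records/elements are taken along
        transformations: list = field(default_factory=list)  # guidebook entries

    catalogue = Datastore("ILS catalogue", "MARC21", ["245$a title", "100$a author"])
    discovery = Datastore("Discovery index", "PNX", ["display/title", "display/creator"])

    flow = Dataflow(
        source=catalogue,
        target=discovery,
        selection="only records with holdings",
        transformations=["map 245$a -> display/title", "strip ISBD punctuation"],
    )
    print(f"{flow.source.name} -> {flow.target.name}: {flow.selection}")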

    You could leave it there and be satisfied with these guiding tools to help you get around the existing data maze more efficiently, with all its ETL utilities and data redundancies. But there are other solutions that focus on actually tackling or even eliminating the translation problem. Basically we are looking at some type of Service Oriented Architecture (SOA) implementation. SOA is a rather broad concept, but it refers to an environment where individual components (“systems”) communicate with each other in a technology and vendor agnostic way using interoperable building blocks (“services”). In this definition “services” refer to reusable dataflows between systems, rather than to useful results for end users. I would prefer a definition of SOA to mean “a data and utilities architecture focused on delivering optimal end user services no matter what”.

    Broadly speaking there are four main routes to establish a SOA-like condition, all of which can theoretically be implemented on a global, intermediate or local level.

    1. Single Store/Single Format: A single universal integrated datastore using a universal data format. No need for dataflows and translations. This would imply some sort of linked (open) data landscape with RDF as universal language and serving all systems and services. A solution like this would require all providers of relevant systems and databases to commit to a single universal storage format. Unrealistic in the short term indeed, but definitely something to aim for, starting at the local level.
    2. Multiple Stores/Shared Format: A heterogeneous system and datastore landscape with a universal communication language (a lingua franca, like English) for dataflows. No need for countless translators between individual systems. This universal format could be RDF in any serialization. A solution like this would require all providers of relevant systems and databases to commit to a universal exchange format. Already a bit less unrealistic.
    3. Shared Store/Shared Format: A heterogeneous system and datastore landscape with a central shared intermediate integrated datastore in a single shared format. Translations from different source formats to only one shared format. Dataflows run to and from the shared store only. For instance with RDF functioning as Esperanto, the artificial language which is actually sometimes used as “Interlingua” in machine translation. A solution like this does not require a universal exchange format, only a translator that understands and speaks all formats, which is the basis of all ETL tools. This is much more realistic, because system and vendor dependencies are minimized, except for variations in syntax and vocabularies. The platform itself can be completely independent. (A small sketch of this route follows the list below.)
    4. Multiple Stores/Single Translation Pool: or what is known as an Enterprise Service Bus (ESB). No translations are stored, no data is integrated. Simultaneous point to point translations between systems happen on the fly. Looks very much like the existing data maze, but with all translators sitting together in one cubicle. This solution is not a source of much relief, or as one large IT vendor puts it: “Using an ESB can become problematic if large volumes of data need to be sent via the bus as a large number of individual messages. ESBs should never replace traditional data integration like ETL tools. Data replication from one database to another can be resolved more efficiently using data integration, as it would only burden the ESB unnecessarily.”.
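    A minimal sketch of the Shared Store/Shared Format route (3), assuming the rdflib package is available; the record layouts and the example.org namespace are invented for illustration. Records from different source formats are translated once, into RDF, and land in one shared store:

    # Sketch: two invented source formats, one shared RDF store.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import DC

    EX = Namespace("http://example.org/")   # hypothetical local namespace

    def from_ils(record, graph):
        """Translate an (invented) ILS-style record into triples."""
        work = EX["work/" + record["id"]]
        graph.add((work, DC.title, Literal(record["title"])))
        graph.add((work, DC.creator, Literal(record["author"])))

    def from_archive(record, graph):
        """Translate an (invented) archival record into the same shared format."""
        work = EX["work/" + record["identifier"]]
        graph.add((work, DC.title, Literal(record["unittitle"])))

    shared_store = Graph()
    from_ils({"id": "b123", "title": "Waiting for Godot", "author": "Beckett, Samuel"}, shared_store)
    from_archive({"identifier": "a456", "unittitle": "Beckett correspondence"}, shared_store)

    print(shared_store.serialize(format="turtle"))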

    Surveying the possible routes out of the data maze, it seems that the first step should be employing the map, dictionary and guidebook concept of the dataflow repository, data dictionary and transformation descriptions. After that, the only feasible road in the short term is the intermediate integrated Shared Store/Shared Format solution.

  • Local library data in the new global framework

    Posted on January 5th, 2012 Lukas Koster 33 comments

    2011 has in a sense been the year of library linked data. Not that libraries of all kinds are now publishing and consuming linked data in great numbers. No. But we have witnessed the publication of the final report of the W3C Library Linked Data Incubator Group, the Library of Congress announcement of the new Bibliographic Framework for the Digital Age based on Linked Data and RDF, the release by a number of large libraries and library consortia of their bibliographic metadata, and many publications, sessions and presentations on the subject.

    All these events focus mainly on publishing library bibliographic metadata as linked open data. Personally I am not convinced that this is the most interesting type of data that libraries can provide. Bibliographic metadata as such describe publications, in the broadest sense, providing information about title, authors, subjects, editions, dates, URLs, but also physical attributes like dimensions, number of pages, formats, etc. This type of information, in FRBR terms: Work, Expression and Manifestation metadata, is typically shared among a large number of libraries, publishers, booksellers, etc. ‘Shared’ in this case means ‘multiplied and redundantly stored in many different local systems’. It doesn’t really make sense if all libraries in the world publish identical metadata side by side, does it?

    In essence only really unique data is worth publishing. You link to the rest.

    Currently, library data that is really unique and interesting is administrative information about holdings and circulation. After having found metadata about a potentially relevant publication it is very useful for someone to know how and where to get access to it, if it’s not freely available online. Do you need to go to a specific library location to get the physical item, or to have access to the online article? Do you have to be affiliated to a specific institution to be entitled to borrow or access it?

    Usage data about publications, both print and digital, can be very useful in establishing relevance and impact. This way information seekers can be supported in finding the best possible publications for their specific circumstances. There are some interesting projects dealing with circulation data already, such as the research project by Magnus Pfeffer and Kai Eckert as presented at the SWIB 11 conference, and the JISC funded Library Impact Data project at the University of Huddersfield. The Ex Libris bX service presents article recommendations based on SFX usage log analysis.

    The consequence of this assertion is that if libraries want to publish linked open data, they should focus on holdings and circulation data, and for the rest link to available bibliographic metadata as much as possible. It is to be expected that the Library of Congress’ New Bibliographic Framework will take care of that part one way or another.

    In order to achieve this libraries should join forces with each other and with publishers and aggregators to put their efforts into establishing shared global bibliographic metadata pools accessible through linked open data. We can think of already existing data sources like WorldCat, OpenLibrary, Summon, Primo Central and the like. We can only hope that commercial bibliographic metadata aggregators like OCLC, SerialsSolutions and Ex Libris will come to realise that it’s in everybody’s interest to contribute to the realisation of the new Bibliographic Framework. The recent disagreement between OCLC and the Swedish National Library seems to indicate that this may take some time. For a detailed analysis of this see the blog post ‘Can linked library data disrupt OCLC? Part one’.


    An interesting initiative in this respect is LibraryCloud, an open, multi-library data service that aggregates and delivers library metadata. And there is the HBZ LOBID project, which is targeted at ‘the conversion of existing bibliographic data and associated data to Linked Open Data‘.

    So what would the new bibliographic framework look like? If we take the FRBR model as a starting point, the new framework could look something like this. See also my slideshow “Linked Open Data for libraries”, slides 39-42.

    The basic metadata about a publication or a unit of content, on the FRBR Work level, would be an entry in a global datastore identified by a URI (Uniform Resource Identifier). This datastore could for instance be WorldCat, or OpenLibrary, or even a publisher’s datastore. It doesn’t really matter. We don’t even have to assume it’s only one central datastore that contains all Work entries.

    The thing identified by the URI would have a text string field associated with it containing the original title, let’s say “The Da Vinci Code” as an example of a book. But articles can and should be identified this way too. The basic information we need to know about the Work would be attached to it using URIs to other things in the linked data web. Two things linked by a property in this way form a ‘triple’. ‘Author’ could for instance be a link to the author’s entry in OCLC’s VIAF (in this example Dan Brown), which would then constitute a triple. If there are more authors, you simply add a URI for every person or institution. Subjects could be links to DBPedia/Wikipedia, Freebase, the Library of Congress Authority files, etc. There could be some more basic information, maybe a year, or a URI to a source describing the background of the work.

    At the Expression level, a Dutch translation would have its own URI, stored in the same or another datastore. I could imagine that the publisher who commissioned the translation would maintain a datastore with this information. Attached to the Expression there would be the URI of the original Work, a URI pointing to the language, a URI identifying the translator and a text string containing the Dutch title, among others.

    Every individual edition of the work could have its own Manifestation level URI, with a link to the Expression (in this case the Dutch translation), a publisher URI, a year, etc. For articles published according to the long standing tradition of peer reviewed journals, there would also be information about the journal. On this level there should also be URIs to the actual content when dealing with digital objects like articles, ebooks, etc., no matter if access is free or restricted.
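    A rough sketch of this Work / Expression / Manifestation chain as triples, assuming rdflib; all URIs and the “ex:” property names are hypothetical, and a real implementation would reuse published vocabularies rather than invent its own:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DC

    EX = Namespace("http://example.org/vocab/")          # hypothetical vocabulary

    work = URIRef("http://worldcat.example.org/work/da-vinci-code")               # hypothetical
    expression = URIRef("http://publisher.example.org/expression/nl-translation")  # hypothetical
    manifestation = URIRef("http://publisher.example.org/manifestation/2004-paperback")

    g = Graph()
    # Work level: original title plus a link to an author authority entry.
    g.add((work, DC.title, Literal("The Da Vinci Code")))
    g.add((work, DC.creator, URIRef("http://viaf.example.org/dan-brown")))         # hypothetical VIAF-style URI

    # Expression level: the Dutch translation points back to the Work.
    g.add((expression, EX.expressionOf, work))
    g.add((expression, DC.language, Literal("nl")))
    g.add((expression, DC.title, Literal("De Da Vinci Code")))

    # Manifestation level: a specific edition points back to the Expression.
    g.add((manifestation, EX.manifestationOf, expression))
    g.add((manifestation, DC.publisher, URIRef("http://publisher.example.org/")))
    g.add((manifestation, DC.date, Literal("2004")))

    print(g.serialize(format="turtle"))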

    So far we have everything we need to know about publications “in the cloud”, or better: in a number of datastores available on a number of servers connected to the world wide web. This is more or less the situation described by OCLC’s Lorcan Dempsey in his recent post ‘Linking not typing … knowledge organization at the network level’. The only thing we need now is software to present all linked information to the user.

    No libraries in sight yet. For accessing freely available digital content on the web you actually don’t need a library, unless you need professional assistance finding the correct and relevant information. Here we have identified a possible role of librarians in this new networked information model.

    Now we have reached the interesting part: how to link local library data to this global shared model? We immediately discover that the original FRBR model is inadequate in this networked environment, because it implies a specific local library situation. Individual copies of a work (the Items) are directly linked to the Manifestation, because FRBR refers to the old local catalogue which describes only the works/publications one library actually owns.

    In the global shared library linked data network we need an extra explicit level to link physical Items owned by the library or online subscriptions of the library to the appropriate shared network level. I suggest using the “Holding” level. A Holding would have its own URI and contain URIs of the Manifestation and of the Library. A specific Holding in this way would indicate that a specific library has one or more copies (Items) of a specific edition of a work (Manifestation), or offers access to an online digital article by way of a subscription.


    If a Holding refers to physical copies (print books or journal issues for instance) then we also need the Item level. An Item would have its own URI and the URI of the Holding. For each Item, extra information can be provided, for instance ‘availability’, ‘location’, etc. Local circulation administration data can be registered for all Holdings and Items. For online digital content we don’t need Items, only subscription information directly attached to the Holding.
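    Sketched as triples in the same style (rdflib assumed, all URIs and “ex:” property names hypothetical), the Holding and Item levels could look like this:

    from rdflib import Graph, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/vocab/")

    manifestation = URIRef("http://publisher.example.org/manifestation/2004-paperback")
    library = URIRef("http://library.example.org/")
    holding = URIRef("http://library.example.org/holding/12345")
    item = URIRef("http://library.example.org/item/12345-copy1")

    g = Graph()
    # The Holding links a specific library to a specific Manifestation.
    g.add((holding, EX.holdingOf, manifestation))
    g.add((holding, EX.heldBy, library))

    # Physical copies get Items; availability and location live here.
    g.add((item, EX.itemOf, holding))
    g.add((item, EX.location, Literal("Main library, stacks")))
    g.add((item, EX.availability, Literal("on loan")))

    print(g.serialize(format="turtle"))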

    Local Holding and Item information can reside on local servers within the library’s domain or just as well on some external server ‘in the cloud’.

    It’s on the level of the Holding that usage statistics per library can be collected and aggregated, both for physical items and for digital material.

    Now, this networked linked library data model still allows libraries to present a local traditional catalogue type interface, showing only information about the library’s own print and digital holdings. What’s needed is software to do this using the local Holdings as entry level.

    But the nice thing about the model is that there will also be a lot of other options. It will also be possible to start at the other end and search all bibliographic metadata available in the shared global network, and then find the most appropriate library to get access to a specific publication, much like WorldCat does, but on an even larger scale.

    Another nice thing about using triples, URIs and linked data is that it allows for adding all kinds of other, non-traditional bibliographic links to the old inward looking library world, making it into a flexible and open model, ready for future developments. It will for instance be possible for people to discover links to publications and library holdings from any other location on the web, for instance a Wikipedia page or a museum website. And the other way around, from an item in local library holdings to, let’s say, a recorded theatre performance on YouTube.

    When this new data and metadata framework is in place, there will be two important issues to be solved:

    • Getting new software, systems and tools for both back end administrative functions and front end information finding needs. For this we need efforts from traditional library systems vendors but also from developers in libraries.
    • Establishing future roles for libraries, librarians and information professionals in the new framework. This may turn out to be the most important issue.
  • Missing links

    Posted on March 28th, 2011 Lukas Koster 1 comment

    The challenges of generating linked data from legacy databases

    © extranoise


    Some time ago I wrote a blog post about the linked data proof of concept project I am involved in, connecting bibliographic metadata from the OPAC of the Library of the University of Amsterdam with the theatre performances database maintained by the Theatre Institute of The Netherlands.
    I ended that post with a list of next steps to take:

    • select/adapt/create a vocabulary for the Production/Performance subject area
    • select/adapt/create vocabularies for Persons (FOAF?) and Subjects (SKOS?)
    • add internal relationships with the other entities (Play, Production, etc.) in the JSON structure (implement RDF in JSON)
    • Add RDF/XML as output option, besides JSON
    • add external relationships (to other linked data sources like DBPedia, etc.)
    • extend the number of possible URI formats (for Play, Production, etc.)
    • add content negotiation to serve both human and machine readable redirects
    • extend the options on the OPAC side
    • publish UBA bibliographic data as linked open data (probably an entirely new project)

    So, what have we achieved so far? I can be brief about all the ‘real’ linked data stuff (RDF, vocabularies, external links, content negotiation): we are not there yet. This will be dealt with in the next phase.
    Instead, we have focused on getting the simple JSON implementation right, both on the data publishing side and on the data using side. We have added more URIs and internal relationships, and we are using these in the OPAC user interface.
    But we have also encountered a number of crucial problems that are in my view inherent to the type of legacy data models used in libraries and cultural heritage institutions.

    Theatre Production data in the Library Catalogue



    First let me describe the improvements we have added so far.

    The URI for a ‘person’, <baseurl>/person/<personname>, now also returns a link to all the ‘titles’ that person is connected to (not only with the ‘author’ role, but for all roles, like director, performer, etc.): <baseurl>/gettitles/<personname>. This link will return a set of URIs of the form <baseurl>/title/<personname>/<title>. The /<personname>/<title> bit is at the moment the only way that a more or less unique identifier can be constructed from the OPAC metadata for the ‘play’ in the TIN database. There are a number of really important problems related to this that I will discuss below.

    The URI:

    <baseurl>/person/Beckett, Samuel

    returns among others:

    /title/Beckett, Samuel/Waiting for Godot
    /title/Beckett, Samuel/En attendant Godot
    /title/Beckett, Samuel/Endgame

    The URI for a ‘play’, <baseurl>/title/<personname>/<title>, now returns a set of ‘production’ URIs of the form <baseurl>/production/<personname>/<title>/<openingdate>/<idnr>.
    The ‘production’ URI returns information about ‘theatre company’, ‘venue’ and all persons connected to that production, including their URIs, and when available also a link to an image of a poster, and a video.

    The URI

    <baseurl>/title/Beckett, Samuel/Waiting for Godot

    returns:


    /production/Beckett, Samuel/Waiting for Godot/1988-07-28/5777
    /production/Beckett, Samuel/Waiting for Godot/1988-11-22/6750
    /production/Beckett, Samuel/Waiting for Godot/1992-04-16/10728
    /production/Beckett, Samuel/Waiting for Godot/1981-02-18/43032

    The last ‘production’ URI returns:

    {
      "name": "Beckett, Samuel",
      "title": "Waiting For Godot",
      "description": "Beckett, Samuel (auteur: toneelspel van)",
      "uri": "/person/Beckett, Samuel"
    }

    {
      "description": "Hartnett, John (regie)",
      "uri": "/person/Hartnett, John"
    }

    {
      "description": "Muller, Frans (decor: ontwerp)",
      "uri": "/person/Muller, Frans"
    }

    {
      "description": "Newell, Kym (licht: ontwerp)",
      "uri": "/person/Newell, Kym"
    }

    {
      "description": "Zaal, Kees (geluid)",
      "uri": "/person/Zaal, Kees"
    }

    {
      "description": "Tolstoj, Alexander (uitvoerende: Lucky)",
      "uri": "/person/Tolstoj, Alexander"
    }

    {
      "description": "Weeks, David (uitvoerende: Estragon)",
      "uri": "/person/Weeks, David"
    }

    {
      "description": "Coburn, Grant (uitvoerende: Vladimir)",
      "uri": "/person/Coburn, Grant"
    }

    {
      "description": "Evans, Rhys (uitvoerende: Pozzo)",
      "uri": "/person/Evans, Rhys"
    }

    {
      "description": "Geiringer, Karl (uitvoerende: A Boy)",
      "uri": "/person/Geiringer, Karl"
    }

    {
      "description": "Guidi, Peter (uitvoering muziek)",
      "uri": "/person/Guidi, Peter"
    }

    {
      "description": "Kimmorley, Roxanne (uitvoering muziek)",
      "uri": "/person/Kimmorley, Roxanne"
    }

    {
      "description": "Vries, Hessel de (uitvoering muziek)",
      "uri": "/person/Vries, Hessel de"
    }

    {
      "description": "Phillips, Margot (uitvoering muziek)",
      "uri": "/person/Phillips, Margot"
    }



    Now, the problems (or challenges) that we are facing here touch the core concepts of linked data:

    • we don’t have actual matching unique identifiers (URIs)
    • we don’t have explicit internal relations with a common entity in both sources
    • part of the data consists of literal strings in a specific language

    These three problems are interrelated; they are linked problems, so to speak.


    Missing identifiers

    To start with the identifiers. Of course we have internal system identifiers in our local Aleph catalogue database. Because we contribute to the Dutch Union Catalogue (originally a PICA system, now OCLC), our bibliographic records also have national Dutch PICA identifiers. And because the Dutch Union Catalogue records are copied to WorldCat, these records in WorldCat also have OCLC numbers.
    Also the Theatre Institute has internal system identifiers in their Adlib database. But at the moment we do not have a match between these separate internal identifier schemes. The Theatre Production database records are not in WorldCat because they’re not bibliographic records.
    We are more or less forced to use the string values of the title and author fields to construct a usable URI, on both sides. Clearly this is a source of many errors, because of the great number of possible variations in author and title descriptions.
    But even if the Theatre Institute’s records were in the Union Catalogue or WorldCat as well, then we still would not have an automatic match without some kind of broker mechanism ascertaining that the library catalogue record describes the same thing as the theatre production database record. The same applies to the author, which of course should be a relation of the type “written by” between the play and a person record instead of string values. Both systems do have internal author or person authority files, but there is no direct matching. For authors this could theoretically be achieved by linking to an online person authority file like VIAF. But in the current situation this is not available.
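    To make the matching problem concrete, here is a minimal sketch of the kind of string-based URI construction we are forced into; the normalisation rules are invented for illustration, and real records differ in more ways (subtitles, punctuation, capitalisation, language) than such rules can repair:

    # Illustration of matching on author/title strings instead of identifiers.
    from urllib.parse import quote

    def title_uri(person, title):
        """Build a <baseurl>/title/<personname>/<title> style path from raw field values."""
        def norm(s):
            return " ".join(s.strip().rstrip(" .").split()).lower()
        return "/title/" + quote(norm(person)) + "/" + quote(norm(title))

    # Catalogue record vs. theatre production record for the same play:
    print(title_uri("Beckett, Samuel", "Waiting for Godot : a tragicomedy in two acts"))
    print(title_uri("Beckett, Samuel", "Waiting For Godot"))
    # The two paths differ, so the link is simply missed.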


    Missing relations

    This brings me to the second problem. The fact that we are using the string values of the title instead of unique identifiers means that we connect plays and productions with a specific title variety or language. In our current implementation this means that we are not able to link to all versions of one specific play.
    For instance, from our OPAC the following URIs are constructed (two in English, one in French, one in Dutch):

    /title/Beckett, Samuel/Waiting for Godot
    /title/Beckett, Samuel/Waiting for Godot : a tragicomedy in two acts
    /title/Beckett, Samuel/En attendant Godot : pièce en deux actes
    /title/Beckett, Samuel/Wachten op Godot

    In the Theatre Production database (two in English, four in Dutch, one in French, one in German):

    /title/Beckett, Samuel/Waiting for Godot
    /title/Beckett, Samuel/Waiting For Godot
    /title/Beckett, Samuel/Wachten op Godot
    /title/Beckett, Samuel/Wachtend op Godot
    /title/Beckett, Samuel/Wachten op Godot (De favorieten)
    /title/Beckett, Samuel/Wachten op Godot (eerste bedrijf)
    /title/Beckett, Samuel/En attendant Godot
    /title/Beckett, Samuel/Warten auf Godot

    Only the first and fourth URI from the OPAC will find corresponding titles in the Theatre Production database. The second and third one, using a subtitle within the main title, don’t even have equivalents. And only two of the eight entries from the Theatre Production database have a match in the catalogue.
    In a library catalogue environment we are used to this problem, because catalogues are used for describing physical objects in the form of editions and copies. Unfortunately, the Theatre Production database too just contains records describing productions of a specific ‘edition’ or translation of a play, with only the opening performance information attached.

    This is where I need to talk about FRBR. Basically in a library catalogue environment this means that we should describe the relations between the ‘work’ (original text), the ‘expression’ (the version or translation), the ‘manifestation’ (edition, format, etc.) and the ‘items’ (the physical copies). Via the relations with higher level expression and work, the physical copy could be linked to the unifying work level, and then ideally through some universally valid unique identifier to, in our case, the theatre plays.
    Although FRBR is a publication centered schema used only in libraries, the same concepts can be applied to theatre performances: the original work (which is the same as the work in a bibliographical sense) has expressions (adaptations, translations, etc.), which have manifestations (productions), and in the end the individual items (actual performances on a specific date, time and location).

    Linking library and theatre in theory through FRBR

    If both the library catalogue and the theatre production database were FRBRised, we could in theory link on the Work level and cover all individual versions. But we would still need a matching mechanism on that Work level of course.

    In reality however we can only try to link on the Manifestation level in an imperfect way.

    Linking library and theatre in reality

    At the moment, in our project, on the catalogue side we extract the title and author from the generated OPAC HTML. It could be an option to get available linking information from the internal MARC records (like the 240, 246, 765, 767, 775 tags), but that is not easy for a number of reasons. Something similar could be done in the theatre production database, making implicit links explicit. But all this makes the effort to get something sensible out there much bigger.
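    As a sketch of what that MARC option could look like, assuming the pymarc package and a hypothetical export file of catalogue records; the tag list simply follows the fields mentioned above (uniform titles, variant titles, translation and other-edition links):

    from pymarc import MARCReader

    LINK_TAGS = ("240", "246", "765", "767", "775")

    with open("records.mrc", "rb") as fh:            # hypothetical export file
        for record in MARCReader(fh):
            if record is None:                       # unreadable records can come back as None
                continue
            title_fields = record.get_fields("245")
            main_title = title_fields[0].get_subfields("a") if title_fields else []
            links = {tag: [f.get_subfields("a", "t") for f in record.get_fields(tag)]
                     for tag in LINK_TAGS if record.get_fields(tag)}
            if links:
                print(main_title, links)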


    Literal strings

    The third problem, the literal strings in Dutch both in the library catalogue and in the theatre production database, prevents the effective use of the data in multilingual environments, both in the traditional native interfaces and as linked data. Obviously for English speaking end users the Dutch terms mean nothing. And in a linked data environment the Dutch strings can’t easily be linked to other data, in Dutch, English, or any language, without unique identifiers.


    Implicit to explicit

    People calling on institutions to publish their data as linked open data tend to say it’s easy once you know how to do it. And of course it must be done. But if the published datasets have a flat internal structure designed to fulfill a specific local business objective, then they just don’t provide sufficient added value for third party use. In order to make your published open data useful for others, you have to make implicit relations explicit. And this requires something more than just making the data available in RDF ‘as is’; it requires a lot of processing.

  • Do we need mobile library services? Not really

    Posted on December 21st, 2010 Lukas Koster 43 comments

    Mobile services have to fulfill information needs here and now

    Any time anywhere © Simona K

    Like many other libraries, the Library of the University of Amsterdam released a mobile web app this year. For background information about why and how we did it, have a look at the slideshow my colleague Roxana Popistasu and I gave at the IGeLU 2010 conference.
    For now I want to have a closer look at the actual reception and use of our mobile library services and draw some conclusions for the future. I have expressed some expectations earlier about mobile library services in my post “Mobile library services”. In summary, I expected that the most valued mobile library services would be of a practical nature, directly tied to the circumstances of internet access ‘any time, anywhere’, and would not include reading and processing of electronic texts.

    Let me emphasise that I define mobile devices as smart phones and similar small devices that can be carried around literally any time anywhere, and that need dedicated apps to be used on a small touchscreen. So I am not talking about tablets like the iPad, which are large enough to be used with standard applications and websites, just like netbooks.

    As you can see, most, if not all of the services in the Library of the University of Amsterdam mobile app are of a practical nature: opening hours, locations, contact information, news. And of course there is a mobile catalogue. This is the general situation in mobile library land, as has been described by Aaron Tay in his blog post “What are mobile friendly library sites offering? A survey”.

    In my view these practical services are not really library services. They are learning or study centre services at best. There is no difference with practical services offered by other organisations like municipal authorities or supermarkets. Nothing wrong with that of course, they are very useful, but I don’t consider these services to be core library services, which would involve enabling access to content.
    Real mobile devices are simply too small to be used for reading and processing large bodies of scholarly text. This might be different for public libraries. Their customers may appreciate being able to read fiction on their smart phones, provided that publishers allow them to read ebooks via libraries at all.

    Even a mobile library catalogue can be considered a practical service intended to fulfill practical needs of a physical nature, like finding and requesting print books and journals to be delivered to a specific location and renewing loans to avoid paying fines. Let’s face it: an Integrated Library System is basically nothing more than an inventory and logistics management system for physical objects.

    Usage statistics of the Library of the University of Amsterdam mobile web app show that between the launch in April and November 2010 the number of unique visits hovered around 30 per day on average, with a couple of peaks (350) on two specific days in October. The full website shows around 6000 visits per day on normal weekdays.
    For the mobile catalogue this is between 30 and 50 visits per day. The full OPAC shows around 3000 visits on normal weekdays.

    In November we see a huge increase in usage. Our killer mobile app was introduced: an overview of currently available workstations per location. The number of unique visits rises to between 300 and 400 a day. The number of pageviews rises from under 100 per day to around 1000 on weekdays in November. The ‘available workstations’ service accounts for 80% of these. In December 2010, an exam period, these figures rise to around 2000 pageviews per day, with 90% for the ‘available workstations’ service.

    We can safely conclude that our students are mainly using our mobile library app on their smart phones to locate the nearest available desktop PC.

    Mobile users expect services that are useful to them here and now.

    What does this mean for core library services, aimed at giving access to content, on small mobile devices? I think that there is no future for providing mobile access on smart phones  to traditional library content in digital form: electronic articles and ebooks. I agree with Aaron Tay when he says “I don’t believe there is any reason to think that it will necessarily lead to high demand for library mobile services” in his post “A few heretical thoughts about library tech trends“.

    Rather, mobile services should provide information about specific subjects useful to people here and now.

    Anne Frank House AR example

    In the near future anybody interested in a specific physical object or location will have access via their location aware smart phones and augmented reality to information of all kinds (text, images, sound, video, maps, statistics, etc.) from a number of sources: museums, archives, government agencies, maybe even libraries. To make this possible it is essential that all these organisations publish their information as linked open data. This means: under an open license using a generic linked data protocol like RDF.

    I expect that consumers of this new type of mobile location based augmented linked information would appreciate some guidance in the possibly overwhelming information landscape, in the form of specific views, with preselection of information sources and their context taken into account.
    There may be an opportunity here for libraries, especially public libraries, taking on a new coordinating role as information brokers at the intersection of a large number of different information providers. Of course if libraries want to achieve that, they need to look beyond their traditional scope and invest more in new information technologies, services and expertise.

    The future of mobile information services lies in the combination of location awareness, augmented reality and linked open data. Maybe libraries can help.

  • Dutch Culture Link

    Posted on October 7th, 2010 Lukas Koster 6 comments

    Linking library and cultural heritage data

    Culture links © Scott Beale/Laughing Squid

    “Interested in publishing a test collection as linked open data to help @StichtingDEN with practical guide for heritage institutions?” That’s what my former colleague at the Library of the University of Amsterdam, now project manager at DEN (Digital Heritage Foundation The Netherlands), Marco Streefkerk asked me in April 2010.

    Was I interested? Of course I was. I had written a blog post “Linked data for libraries” almost a year before, and I had been very interested in the subject since then. Unfortunately in my day job at the Library of the University of Amsterdam (UBA) until very recently there was no opportunity to put my theoretical knowledge into practice. However, in the Library’s “Action plans 2010-2011” (January 2010), the Semantic Web is mentioned in the Innovation chapter as one of the areas with room for a small pilot involving linked data and RDF. I like to think it was me who managed to get it in there 😉

    To come back to Marco’s question, I was at the time actually trying to think of a linked data/RDF test, and it so happened that I had talked to Ad Aerts of the Theater Institute of The Netherlands (TIN) about organising such a test the day before! So that’s what I told Marco. And things started from there.

    The first idea was to publish a small test set of records from one of the University Library’s own heritage collections. The goal from the point of view of DEN was to publish a short practical guide on how to publish heritage collections as linked data, targeted at heritage institutions.
    But after some email discussions and meetings we decided to incorporate TIN in this test and apply both sides of the linked data concept: publish linked data and use linked data.
    Apart from a library catalogue, TIN also has a large database containing metadata on theater performances and a large collection of audiovisual material related to these performances. The plan was to publish the performance metadata and related digital material as linked data.
    The UBA would then use this TIN linked data in their traditional MARC based OPAC to enrich the plain bibliographic metadata when OPAC search results relate to theater plays.

    We decided to name our little proof of concept project “Dutch Culture Link”. The people involved for DEN are Marco Streefkerk, Annelies van Nispen and Monika Lechner. For TIN it’s Ad Aerts. For UBA: Roxana Popistasu and myself. Of these five people I knew four already face to face and one (Monika) on Twitter. I think this helps.

    To start with, we described the data model of the TIN Productions and Performances database (in terms of relationships or triples) as follows:

    • a Play is written by one or more Persons (as author)
    • a Play can be ‘effectuated’ in one or more Productions
    • a Production can be ‘staged’ in one or more Performances
    • a Performance takes place in one Venue on a specific date and time
    • a Person can be producer of a Production
    • a Person can be director of a Production
    • a Person can play a character in a Production, or even in an individual Performance

    Besides the metadata TIN also has links from the database to digital collections (sound and video recordings, photographs, reviews). The model is strikingly similar to the bibliographic FRBR model. The Play is a FRBR Work, the Production is a FRBR Expression and/or Manifestation, the Performance is a FRBR Item.
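    To make the model above concrete, here it is sketched as RDF-style triples with rdflib (assumed installed); the “dcl:” vocabulary and all URIs and example values are made up, since choosing a real vocabulary was exactly one of the open questions below:

    from rdflib import Graph, Literal, Namespace

    DCL = Namespace("http://example.org/dcl/")   # hypothetical vocabulary and URI base

    play = DCL["play/gijsbreght-van-aemstel"]
    author = DCL["person/vondel-joost-van-den"]
    production = DCL["production/1988-stadsschouwburg"]
    performance = DCL["performance/1988-01-01-avond"]
    venue = DCL["venue/stadsschouwburg-amsterdam"]

    g = Graph()
    g.add((play, DCL.writtenBy, author))             # a Play is written by a Person
    g.add((production, DCL.effectuationOf, play))    # a Production 'effectuates' a Play
    g.add((performance, DCL.stagingOf, production))  # a Performance 'stages' a Production
    g.add((performance, DCL.venue, venue))           # in one Venue
    g.add((performance, DCL.date, Literal("1988-01-01")))

    print(g.serialize(format="turtle"))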

    Now we knew who and what, but not yet how. We needed to know how to actually apply the theoretical concepts of linked data to our subject area. Questions we had were:

    • which ontology/vocabulary (‘data model’) do we need for publishing the production data?
    • how do we format URIs (the linked data unique identifiers)?
    • how do we implement RDF?
    • which publication techniques and platforms do we use?
    • which scripting languages can we use?
    • how do we find and get the published linked data?
    • how do we process and present the retrieved linked data?

    We definitely needed some practical hands-on tutorials or training. We could not find an institution organising practical linked data training courses in The Netherlands at short notice. Via Twitter Ian Davis referred us to the TALIS training options. Unfortunately, because we are only an informal proof of concept pilot project without any project funding, we were unable to proceed on this track.
    However, through a contact at The European Library we managed to enter two members of our project team as participants in the free Linked Data Workshop at DANS in The Hague, with Herbert Van De Sompel, Ivan Herman and Antoine Isaac as trainers. This workshop proved to be very useful. Unfortunately I could not attend myself.

    After the workshop we decided to adopt an “agile” approach: just start and proceed with small steps. For the short term this meant on the TIN side: implementing a script that accesses the XML gateway of the Adlib system underlying the Theater Production Database and produces results in JSON format. The script accepts as input URIs of the form <baseurl>/person/<name>, <baseurl>/play/<person>/<title>, etc. For now only the <baseurl>/person/<name> works, but there are more to come.

    An example: the request <baseurl>/person/joost van den vondel gives the JSON result:

    {
      "key": "vondel, joost van den",
      "name": "Vondel, Joost van den",
      "": "17 november 1587*",
      "": "5 februari 1679*"
    }

    On the UBA side, if there is an author and/or title field in an individual OPAC result, a JavaScript addon to the Aleph OPAC HTML templates directs a query at the TIN linked data URL using one or both fields as input. The resulting JSON data from TIN is then processed and displayed. At the moment only the author field is used in the <baseurl>/person/<name> query. But there is more to come.
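    The production addon is JavaScript embedded in the Aleph OPAC templates; purely for illustration, the same request-and-display flow sketched in Python (requests assumed installed, base URL hypothetical):

    import requests

    BASE_URL = "http://example.org/dcl"   # hypothetical TIN linked data base URL

    def enrich(author=None, title=None):
        """Query the TIN service for an OPAC result's author and show what comes back."""
        if not author:
            return None
        response = requests.get(f"{BASE_URL}/person/{author}", timeout=10)
        response.raise_for_status()
        person = response.json()
        print(person.get("name"), person.get("key"))
        return person

    enrich(author="joost van den vondel")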

    UBA test OPAC with TIN data

    Next steps in this project:

    • select/adapt/create a vocabulary for the Production/Performance subject area
    • select/adapt/create vocabularies for Persons (FOAF?) and Subjects (SKOS?)
    • add internal relationships with the other entities (Play, Production, etc.) in the JSON structure (implement RDF in JSON)
    • Add RDF/XML as output option, besides JSON
    • add external relationships (to other linked data sources like DBPedia, etc.)
    • extend the number of possible URI formats (for Play, Production, etc.)
    • add content negotiation to serve both human and machine readable redirects
    • extend the options on the  OPAC side
    • publish UBA bibliographic data as linked open data (probably an entirely new project)

    The team will be blogging about project developments (in Dutch) on the DEN blog (addition July 7 2011: new DEN blog location).

    To be continued…

  • Linked Data for Libraries

    Posted on June 19th, 2009 Lukas Koster 8 comments
    Linked Data and bibliographic metadata models


    © PhOtOnQuAnTiQuE

    Some time after I wrote “UMR – Unified Metadata Resources“, I came across Chris Keene’s post “Linked data & RDF : draft notes for comment“, “just a list of links and notes” about Linked Data, RDF and the Semantic Web, put together to start collecting information about “a topic that will greatly impact on the Library / Information management world“.

    While reading this post and working my way through the links on that page, I started realising that Linked Data is exactly what I tried to describe as “One single web page as the single identifier of every book, author or subject”. I did mention the Semantic Web, URIs and RDF, but the term “Linked Data” as a separate protocol had escaped me.

    The concept of Linked Data was described by Tim Berners-Lee, the inventor of the World Wide Web. Whereas the World Wide Web links documents (pages, files, images), which are basically resources about things (“Information Resources” in Semantic Web terms), Linked Data (or the Semantic Web) links raw data and real life things (“Non-Information Resources”).

    There are several definitions of Linked Data on the web, but here is my attempt to give a simple definition of it (loosely based on the definition in Structured Dynamics’ Linked Data FAQ):

    Linked Data is a methodology for providing relationships between things (data, concepts and documents) anywhere on the web, using URI’s for identifying, RDF for describing and HTTP for publishing these things and relationships, in a way that they can be interpreted and used by humans and software.

    I will try to illustrate the different aspects using some examples from the library world. The article is rather long, because of the nature of the subject, then again the individual sections are a bit short. But I do supply a lot of links for further reading.

    Data is relationships
    The important thing is that “data is relationships”, as Tim Berners-Lee says in his recent presentation for TED.
    Before going into relationships between things, I have to point out the important distinction between abstract concepts and real life things, which are “manifestations” of the concepts. In Object modeling these are called “classes” (abstract concepts, types of things) and “objects” (real life things, or “instances” of “classes”).


    • the class book can have the instances/objects “Cloud Atlas”, “Moby Dick”, etc.
    • the class person can have the instances/objects “David Mitchell”, “Herman Melville”, etc.

    In the Semantic Web/RDF model the concept of triples is used to describe a relationship between two things: subject – predicate – object, meaning: a thing has a relation to another thing, in the broadest sense:

    • a book (subject) is written by (predicate) a person (object)

    You can also reverse this relationship:

    • a person (subject) is the author of (predicate) a book (object)


    The person in question is only an author because of his or her relationship to the book. The same person can also be a mother of three children, an employee of a library, and a speaker at a conference.
    Moreover, and this is important: there can be more than one relationship between the same two classes or types of things. A book (subject) can also be about (predicate) a person (object). In this case the person is a “subject” of the book, which can be described by a “keyword”, “subject heading”, or whatever term is used. A special case would be a book written by someone about himself (an autobiography).

    The problem with most legacy systems, and library catalogues as an example of these, is that a record for let’s say a book contains one or more fields for the author (or at best a link to an entry in an authority file or thesaurus), and separately one or more fields for subjects. This way it is not possible to see books written by an author and books about the same author in one view, without using all kinds of workarounds, link resolvers or mash-ups.
    Using two different relationships that link to the same thing would provide for an actual view or representation of the real world situation.
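    The “written by” versus “about” distinction, expressed as two different predicates between the same book and the same person (rdflib assumed; the URIs and “ex:” properties are invented for illustration):

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")

    book = EX["book/some-autobiography"]     # an imaginary autobiography
    person = EX["person/its-author"]

    g = Graph()
    g.add((book, EX.writtenBy, person))   # the person as author
    g.add((book, EX.about, person))       # the same person as subject

    # One pass over the graph now shows both roles in a single view.
    for s, p, o in g.triples((book, None, person)):
        print(p)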

    Another important option of Linked Data/RDF: a certain thing can have as a property a link to a concept (or “class”), describing the nature of the thing: “object Cloud Atlas” has type “book”; “object David Mitchell” has type “person”; “object Cloud Atlas” is written by “object David Mitchell”.

    And of course, the property/relationship/predicate can also link to a concept describing the nature of the link.

    Anywhere on the web



    So far so good. But you may argue that this relationship theory is not very new. Absolutely right, but up until now this data-relationship concept has mainly been used with a view to the inside, focused on the area of the specific information system in question, because of the nature and the limitations of the available technology and infrastructure.

    The “triple” model is of course exactly the same as the long standing methodology of Entity Relationship Diagrams (ERD), with which relationships between entities (= “classes”) are described. An ERD is typically used to generate a database that contains data in a specific information system. But ERDs could just as well be used to describe Linked Data on the web.

    Information systems, such as library catalogs, have been, and still are, for the greatest part closed containers of data, or “silos” without connections between them, as Tim Berners-Lee also mentions in his TED presentation.

    Lots of these silo systems are accessible with web interfaces, but this does not mean that items in these closed systems with dedicated web front ends can be linked to items in other databases or web pages. Of course these systems can have APIs that allow system developers to create scripts to get related information from other systems and incorporate that external information in the search results of the calling system. This is what is being done in web 2.0 with so-called mash-ups.
    But in this situation you need developers who know how to write scripts in specific scripting languages for all the different proprietary APIs supported by the individual systems.
    If Linked Data were a global standard and all open and closed systems and websites supported RDF, then all these links would be available automatically to RDF-enabled browser and client software, using SPARQL, the RDF Query Language.

    • Linked Data/RDF can be regarded as a universal API.
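    As a sketch of that “universal API” idea: one query language (SPARQL) against a public RDF endpoint, here DBpedia via the SPARQLWrapper package (both assumed available; the exact property and resource names depend on the dataset being queried):

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?book ?title WHERE {
            ?book dbo:author <http://dbpedia.org/resource/David_Mitchell_(author)> ;
                  rdfs:label ?title .
            FILTER (lang(?title) = "en")
        } LIMIT 10
    """)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["title"]["value"])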

    The good thing about Linked Data is that it is possible to use Linked Data mechanisms to link to legacy data in silo databases. You just need to provide an RDF wrapper for the legacy system, as has been done with the Library of Congress Subject Headings.

    Some examples of available tools for exposing legacy data as RDF:

    • Triplify – a web applications plugin that converts relational database structures into RDF triples
    • D2R Server – a tool for publishing relational databases on the Semantic Web
    • wp-RDFa – a wordpress plugin that adds some RDF information about Author and Title to WordPress blog posts

    Of course, RDF that is generated like this will very probably only expose objects to link TO, not links to RDF objects external to the system.

    Also, Linked Data can be used within legacy systems, for mixing legacy and RDF data, open and closed access data, etc. In this case we have RDF triples that have a subject URI from one data source and an object URI from another data source. In a situation with interlinked systems it would for instance be possible to see that the author of a specific book (data from a library catalog) is also speaking at a specific conference (data from a conference website). Objects linked together on the web using RDF triples are also known as an “RDF graph”. With RDF-aware client software it is possible to navigate through all the links to retrieve additional information about an object.

    Linked Data

    URI’s (“Uniform Resource Identifiers”) are necessary for uniquely identifying and linking to resources on the web. A URI is basically a string that identifies a thing or resource on the web. All “Information Resources”, or WWW pages, documents, etc. have a URI, which is commonly known as a URL (Uniform Resource Locator).

    With Linked Data we are looking at identifying “Non-information Resources” or “real world objects” (people, concepts, things, even imaginary things), not web pages that contain information about these real world objects. But it is a little more complicated than that. In order to honour the requirement that a thing and its relations can be interpreted and used by humans and software, we need at least 3 different representations of one resource (see: How to publish Linked Data on the web):

    • Resource identifier URI (identifies the real world object, the concept, as such)
    • RDF document URI (a document readable for semantic web applications, containing the real world object’s RDF data and relationships with other objects)
    • HTML document URI (a document readable for humans, with information about the real world object)


    For instance, there could be a Resource Identifier URI for a book called “Cloud Atlas“. The web resource at that URI can redirect an RDF enabled browser to the RDF document URI, which contains RDF data describing the book and its properties and relationships. A normal HTML web browser would be redirected to the HTML document URI, for instance a web page about the book at the publisher’s website.

    There are several methods of redirecting browsers and applications to the required representation of the resource. See Cool URIs for the Semantic Web for technical details.
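    A minimal sketch of one such redirect mechanism, assuming Flask; the /id/, /data/, /page/ URI layout is a common convention, not a fixed standard:

    from flask import Flask, redirect, request

    app = Flask(__name__)

    @app.route("/id/book/<book_id>")
    def resource(book_id):
        """Resource identifier URI: 303-redirect to the RDF or HTML representation."""
        best = request.accept_mimetypes.best_match(
            ["application/rdf+xml", "text/html"], default="text/html"
        )
        if best == "application/rdf+xml":
            return redirect(f"/data/book/{book_id}.rdf", code=303)
        return redirect(f"/page/book/{book_id}", code=303)

    if __name__ == "__main__":
        app.run()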

    There are also RDF enabled browsers that transform RDF into web pages readable by humans, like the Firefox addon “Tabulator”, or the web based Disco and Marbles browsers, both hosted at the Free University Berlin.

    RDF, vocabularies, ontologies
    RDF, or Resource Description Framework, is, like the name suggests, just a framework. It uses XML (or a simpler non-XML method, N3) to describe resources by means of relationships. RDF can be implemented in vocabularies or ontologies, which are sets of RDF classes describing objects and relationships for a given field.
    Basically, anybody can create an RDF vocabulary by publishing an RDF document defining the classes and properties of the vocabulary, at a URI on the web. The vocabulary can then be used in a resource by referring to the namespace (the URI) and the classes in that RDF document.

    A nice and useful feature of RDF is that more than one vocabulary can be mixed and used in one resource.
    Also, a vocabulary itself can reference other vocabularies and thereby inherit well established classes and properties from other RDF documents.
    Another very useful feature of RDF is that objects can be linked to similar object resources describing the same real world thing. This way confusion about which object we are talking about, can be avoided.

    A couple of existing and well used RDF vocabularies/ontologies:

    (By the way,  the links in the first column (to the RDF files themselves) may act as an illustration of the redirection mechanism described before. Some of them may link to either the RDF file with the vocabulary definition itself, or to a page about the vocabulary, depending on the type of browser you use: rdf-aware or not.)

    A special case is:

    • RDFa – a sort of microformat without a vocabulary of its own, which relies on other vocabularies for turning XHTML page attributes into RDF

    A shortened example for “Cloud Atlas” by David Mitchell from the RDF BookMashup at the Free University Berlin, which uses a number of different vocabularies:

    <?xml version="1.0" encoding="UTF-8" ?>


    <rdf:Description rdf:about="">
    <rev:hasReview rdf:resource=""/>
    <dc:creator rdf:resource=""/>
    <dc:identifier rdf:resource="urn:ISBN:0375507256"/>
    <dc:publisher>Random House Trade Paperbacks</dc:publisher>
    <dc:title>Cloud Atlas: A Novel</dc:title>

    <scom:Book rdf:about="">
    <rdfs:label>Cloud Atlas: A Novel</rdfs:label>
    <skos:subject rdf:resource=""/>
    <skos:subject rdf:resource=""/>

    <foaf:depiction rdf:resource=""/>
    <foaf:thumbnail rdf:resource=""/>

    <rdf:Description rdf:about="">
    <dc:license rdf:resource=";node=3440661&amp;no=12738641&amp;me=A36L942TSJ2AJA"/>
    <dc:license rdf:resource=""/>

    <foaf:Document rdf:about="">
    <rdfs:label>RDF document about the book: Cloud Atlas: A Novel</rdfs:label>
    <foaf:maker rdf:resource=""/>
    <foaf:primaryTopic rdf:resource=""/>

    <rdf:Description rdf:about="">
    <rdfs:label>David Mitchell</rdfs:label>

    <rdf:Description rdf:about="">
    <rdfs:label>Review number 1 about: Cloud Atlas: A Novel</rdfs:label>

    <rdf:Description rdf:about="">
    <rdfs:label>RDF Book Mashup</rdfs:label>

    A partial view on this RDF file with the Marbles browser:

    RDF browser view

    See also the same example in the Disco RDF browser.

    Library implementations
    It seems obvious that Linked Data can be very useful in providing a generic infrastructure for linking data, metadata and objects, available in numerous types of data stores, in the online library world. With such a networked online data structure, it would be fairly easy to create all kinds of discovery interfaces for bibliographic data and objects. Moreover, it would also be possible to link to non-bibliographic data that might interest the users of these interfaces.

    A brief and incomplete list of some library related Linked Data projects, some of which already mentioned above:

    And what about MARC, AACR2 and RDA? Is there a role for them in the Linked Data environment? RDA is supposed to be the successor of AACR2 as a content standard that can be used with MARC, but also with other encoding standards like MODS or Dublin Core.
    The RDA Entity Relationship Diagram, which incorporates FRBR as well, can of course easily be implemented as an RDF vocabulary that could be used to create a universal Linked Data library network. It really does not matter what kind of internal data format the connected systems use.