Posted on December 1st, 2013 1 comment
Struggling towards usable linked data services at SWIB13
Paraphrasing some of the challenges proposed by keynote speaker Dorothea Salo, the unofficial theme of the SWIB13 conference in Hamburg might be described as “No more ontologies, we want out of the box linked data tools!”. This sounds like we are dealing with some serious confrontations in the linked open data world. Judging by Martin Malmsten’s LIBRIS battle cry “Linked data or die!” you might even think there’s an actual war going on.
Looking at the whole range of this year’s SWIB pre-conference workshops, plenary presentations and lightning talks, you may conclude that “linked data is a technology that is maturing” as Rurik Greenall rightly states in his conference report. “But it has quite a way to go before we can say this stuff is ready to roll out in libraries” as he continues. I completely agree with this. Personally I got the impression that we are in a paradoxical situation where on the one hand people speak of “we” and “community”, and on the other hand they take fundamentalist positions, unconditionally defending their own beliefs and slandering and ridiculing other options. In my view there are multiple, sometimes overlapping, sometimes irreconcilable “we’s” and “communities”. Sticking to your own point of view without willingness to reason with the other party really does not bring “us” further.
This all sounds a bit grim, but I again agree with Rurik Greenall when he says that he “enjoyed this conference immensely because of the people involved”. And of course on the whole the individual workshops and presentations were of a high quality.
Before proceeding to the positive aspects of the conference, let me first elaborate a bit on the opposing positions I observed during the conference, which I think we should try to overcome.
Developers disagree on a multitude of issues:
- Formats: developers hate MARC. Everybody seems to hate RDF/XML; JSON-LD seems to be the thing for RDF, but some say only Turtle should be used, or just JSON.
- Tools and languages: Perl users hate Java, Java users hate PHP, and there’s Python and Ruby bashing.
- Ontologies: create your own or reuse existing ones, upper ontologies yes or no, or no ontologies but usable tools.
- Operating systems: Windows/UNIX/Linux/Apple… it’s either/or.
- Open source vs. commercial software: need I say more?
- Beer: Belgians hate German beer, or any foreign beer for that matter.
(Not to mention PDF).
OK, I hope I made myself clear. The point is that I have no problem at all with having diverse opinions, but I dislike it when people are convinced that their own opinion is the only right one and refuse to have a conversation with those who think otherwise, or even respect their choices in silence. The developer “community” definitely has quite a way to go.
Apart from these internal developer disagreements I noticed, there is the more fundamental gap between developers and users of linked open data. By “users” I do not mean “end users” in this case, but the intermediary deployers of systems. Let’s call them “libraries”.
Linked Data developers talk about tools and programming languages, metadata formats, open source, ontologies, technology stacks. Librarians want to offer useful services to their end users, right now. They may not always agree on what kind of services and what kind of end users, and they may have an opinion on metadata formats in systems, but their outlook is slightly different from the developers’ horizon. It’s all about expectations and expectation management. That is basically the point of Dorothea Salo’s keynote. Of course theoretical, scientific and technical papers and projects are needed to take linked data further, but libraries need linked data tools, focused on providing new services to their end users/customers in the world of the web, that can easily be implemented and maintained.
In this respect OCLC’s efforts to add linked data features to WorldCat are praiseworthy. OCLC’s Technology Evangelist Richard Wallis presented his view on the benefits of linked open data for libraries, using Google’s Knowledge Graph as an example. His talk was mainly aimed at a librarian audience. At SWIB, where the majority of attendees are developers or technology staff, this seemed somewhat misplaced. By chance I had been present at Richard’s talk at the Dutch National Information Professional annual meeting two weeks earlier, where he delivered almost the same presentation for a large room full of librarians. There and then that was completely on target. For the SWIB audience this all may have been old news, except for the heads-up about OCLC’s work on FRBR “Works” BIBFRAME type linked data, which will result in published URIs for Works in WorldCat.
An important point here is that OCLC is a company with many library customers worldwide, so developments like this benefit all of these libraries. The same applies to customers of one of the other big library system vendors, Ex Libris. They have been working on developing linked data features for their so-called “next generation” tools for some time now, in close cooperation with the international user groups’ Linked Open Data Special Interest Working Group, as I explained in the lightning talk I gave. Also open source library systems like Koha are working on adding linked open data features to their tools. It’s with tools like these, that reach a large number of libraries, that linked open data for libraries can spread relatively quickly.
In contrast to this linked data broadcasting, the majority of the SWIB presentations showed local proprietary development or research projects, mostly of high quality notwithstanding. In the case of systems or tools that were built, all the code and ontologies are available on GitHub, making them open source. However, while it is commendable, open source on GitHub doesn’t mean that these potentially groundbreaking systems and ontologies can and will be adopted as de facto standards in the wider library community. Most libraries, both public and academic, are dependent on commercial system and content providers and can’t afford large scale local system development. This also applies up to a point to libraries that deploy large open source tools like Koha, I presume.
It would be great if some of these many great open source projects could evolve into commonly used standard tools, like Koha, Fedora and Drupal, just to name a few. Vivo is another example of an open source project rapidly moving towards an accepted standard. It is a framework for connecting and publishing research information of different nature and origin, based on linked data concepts. At SWIB there was a pre-conference “VivoCamp”, organised by Lambert Heller, Valeria Pesce and myself. Research information is an area rapidly gaining importance in the academic world. The Library of the University of Amsterdam, where I work, is in the process of starting a Vivo pilot, in which I am involved. (Yes, the Library of the University of Amsterdam uses both commercial providers like OCLC and Ex Libris, and many open source tools). The VivoCamp was a good opportunity to have a practical introduction to and discussion about the framework, not least thanks to the presence of John Fereira of Cornell University, one of the driving forces behind Vivo. All attendees (26) expressed their interest in a follow-up.
Vivo, although it may be imperfect, represents the type of infrastructure that may be needed for large scale adoption of linked open data in libraries. PUB, the repository based linked data research information project at Bielefeld University presented by Vitali Peil, is aimed at exactly the same domain as Vivo, but it again is a locally developed system, using another smaller scale open source framework (LibreCat/Catmandu of Bielefeld, Ghent and Lund universities) and a number of different ontologies, of which Vivo is just one. My guess is that, although PUB/LibreCat might be superior, Vivo will become the de facto standard in linked data based research information systems.
Instead of focusing on systems, maybe the library linked data world would be better served by a common user-friendly metadata+services infrastructure. Of course, the web and the semantic web are supposed to be that infrastructure, but in reality we all move around and process metadata all the time, from one system and database to another, in order to be able to offer new legacy and linked data services. At SWIB there was mention of a number of tools for ETL, which is developer jargon for Extract, Transform, Load. By the way, jargon is a very good way to widen the gap between developers and libraries.
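To make the jargon a bit less opaque, the three ETL steps can be sketched in a few lines of Python. This is only an illustration and not how Catmandu or Metafacture work: the source records, the tag-to-property mapping (`FIELD_MAP`) and the `http://example.org/` namespace are all made up for the example, and the "store" is just a dictionary.

```python
import json

# Hypothetical flat source records, loosely modelled on MARC-style tags.
SOURCE_RECORDS = [
    {"001": "rec-1", "245a": "Linked data for libraries", "100a": "Doe, Jane"},
    {"001": "rec-2", "245a": "Cataloguing without formats", "100a": "Roe, Richard"},
]

# Assumed mapping from source tags to Dublin Core terms (illustrative only).
FIELD_MAP = {"245a": "dcterms:title", "100a": "dcterms:creator"}

BASE_URI = "http://example.org/record/"  # placeholder namespace


def extract(records):
    """Extract: yield raw records from the source (here an in-memory list)."""
    yield from records


def transform(record):
    """Transform: map source tags to a JSON-LD style dictionary."""
    doc = {
        "@context": {"dcterms": "http://purl.org/dc/terms/"},
        "@id": BASE_URI + record["001"],
    }
    for tag, prop in FIELD_MAP.items():
        if tag in record:
            doc[prop] = record[tag]
    return doc


def load(docs, store):
    """Load: put each transformed document into the target store, keyed by URI."""
    for doc in docs:
        store[doc["@id"]] = doc
    return store


def run_etl(records):
    """Chain the three steps and return the filled store."""
    return load((transform(r) for r in extract(records)), {})


if __name__ == "__main__":
    print(json.dumps(run_etl(SOURCE_RECORDS), indent=2))
```

The point of the separation is that the source format only appears in the extract and transform steps; whatever sits behind `load` never needs to know about it.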
There were pre-conference workshops for the ETL tools Catmandu and Metafacture, and in a lightning talk SLUB Dresden, in collaboration with Avantgarde Labs, presented a new project focused on using ETL for a separate multi-purpose data management platform, serving as a unified layer between external data sources and services. This looks like a very interesting concept, similar to the ideas of a data services hub I described in an earlier post “(Discover AND deliver) OR else”. The ResourceSync project, presented by Simeon Warner, is trying to address the same issue by a different method: distributed synchronisation of web resources.
One can say that the BIBFRAME project is also focused on data infrastructure, albeit at the moment limited to the internal library cataloguing workflow, aimed at replacing MARC. An overview of the current state of the project was presented by Lars Svensson of the German National Library.
The same can be said for the National Library of Sweden’s new LIBRIS linked data based cataloguing system, presented by Martin Malmsten (Decentralisation, Distribution, Disintegration – towards Linked Data as a First Class Citizen in Libraryland). The big difference is that they’re actually doing what BIBFRAME is trying to plan. The war cry “Linked data or die!” refers to the fact that it is better to start from scratch with a domain and format independent data infrastructure, like linked data, than to try and build linking around existing rigid formats like MARC. Martin Malmsten rightly stated that we should keep formats outside our systems, as is also the core statement of the MARC-MUST-DIE movement. Proprietary formats can be dynamically imported and exported at will, as was demonstrated by the “MARC” button in the LIBRIS user interface. New library linked data developments will have to coexist with the existing wider library metadata and systems environment for some time.
Like all other local projects, the LIBRIS source code and ontology descriptions are available on GitHub. In this case the sheer scope of the National Library of Sweden and of the project makes it a bit more plausible that this may actually be reused on a larger scale. At least the library cataloguing ontology in JSON-LD there is worth having a look at.
To return to our starting point, the LIBRIS project acknowledges the fact that we need actual tools besides the ontologies. As Martin Malmsten quoted: “Trying to sell the idea of linked data without interfaces is like trying to sell a fax without the invention of paper”.
The central question in all this: what is the role of libraries in linked data? Developers or implementers, individually or in a community? There is obviously not one answer. Maybe we will know more at SWIB14. Paraphrasing Fabian Steeg and Pascal Christoph of hbz and Dorothea Salo, next year’s theme might be “Out of the box data knitting for great justice”.
Posted on June 10th, 2013 17 comments
The inside-out library at ELAG 2013
This year marked my fifth ELAG conference since 2008 (I skipped 2009), which is not much if you take into account that ELAG2013 was the 37th one. I really enjoyed the 2013 conference, not least because of the wonderful people of the local organising committee at the Ghent University Library, who made ELAG2013 a very pleasant event. This year’s theme was “the inside-out library”, a concept coined by Lorcan Dempsey, which in brief emphasises the need for libraries to shift focus 180 degrees.
In my personal overall conference experience major emphasis was on research support in libraries. This was partly due to my attendance at the pre-conference Joint OpenAIRE/LIBER Workshop ‘Dealing with Data – what’s the role for the library?’ on May 28. It was good to have sessions focusing on different perspectives: data management, data publication, the researchers’ needs, library support and training. I was honoured to be invited to participate in the closing round table panel discussion together with two library directors, Wilma van Wezenbeek (TU Delft Library) and Wolfram Horstmann (Bodleian Library), under the excellent supervision of Kevin Ashley (DCC). An important central concept in the workshop was the research life cycle, which consists of many different tasks of a very diverse nature. Academic and research libraries should focus on those tasks for which they are or can easily become qualified.
Looking from another angle we can distinguish two main perspectives in integrating research: the research ecosystem itself, which can be seen as the main topic of the OpenAIRE/LIBER workshop, and the research content, the actual focus of researchers and research projects. I will try to address both perspectives here.
On the first day of the actual conference Herbert Van de Sompel gave the keynote speech with the title “A clean slate”. Rurik Greenall aptly describes the scope and meaning of Herbert’s argument. Herbert has been involved in a number of important and relevant projects in the domain of scholarly communication. My impression this time was: now he’s bringing it all together around the fairly new concept of the “research object”, integrating a number of projects and protocols, like ORE, Memento, OpenAnnotation, Provenance, ResourceSync. It’s all about connections between all components related to research on the web in all dimensions.
This linking of input, output, procedures and actors of research projects in various temporal and contextual dimensions in a machine readable way is extremely important in order to be able to process all relevant information by means of computer systems and present it to the human consumer. In this respect I think it is essential that data citations in scholarly articles should not only be made available in the article text, but also as machine readable metadata that can be indexed by external aggregators.
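As a rough sketch of what such a machine readable data citation could look like, the Python below builds a small JSON-LD description of an article that cites a dataset, and shows how an aggregator might pull the dataset links out again. The use of the schema.org vocabulary (`ScholarlyArticle`, `Dataset`, `citation`) is my assumption for the sake of the example, the URIs are placeholders, and the function names are hypothetical.

```python
import json


def article_with_data_citation(article_uri, title, dataset_uri, dataset_name):
    """Describe an article and the dataset it cites as a JSON-LD style dict."""
    return {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",
        "@id": article_uri,
        "name": title,
        # The citation is a typed link, not just a string in the article text.
        "citation": {"@type": "Dataset", "@id": dataset_uri, "name": dataset_name},
    }


def cited_datasets(doc):
    """What an aggregator might do: collect the URIs of all cited datasets."""
    citations = doc.get("citation", [])
    if isinstance(citations, dict):  # a single citation may not be wrapped in a list
        citations = [citations]
    return [c["@id"] for c in citations if c.get("@type") == "Dataset"]


if __name__ == "__main__":
    doc = article_with_data_citation(
        "http://example.org/article/1",   # placeholder URIs
        "An example article",
        "http://example.org/dataset/1",
        "An example dataset",
    )
    print(json.dumps(doc, indent=2))
    print(cited_datasets(doc))
```

Because the link carries a type and an identifier, an external index can follow it without parsing the article text.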
Moreover, it would be even better if it were possible to provide links to research projects that would serve as central hubs for linking to all associated entities, not only datasets. This is the role that the research object can fulfill. During the OpenAIRE/LIBER workshop I tried to address this issue a number of times, because I am a bit surprised that both researchers and publishers appear to be satisfied with having text-only clickable dataset citations. That is even the case the other way around with links to articles in dataset repositories like Dryad. I think there is a role here for information professionals and metadata experts in libraries. This is exactly the point that Peter van Boheemen made in his talk about producing better metadata for research output. Similarly Jing Wang stressed the importance of investigating the role of metadata specialists and data librarians for interoperability and authority control in her presentation on the open source linked data based research discovery tool Vivo.
Again there are two perspectives here. Even if we have machine readable metadata on research projects and datasets, most systems are not adequately equipped with functionality to process or present this information. It is not so easy to update complex systems with new functionality. Planned update cycles, including extensive testing, are necessary in order to adhere to the system’s design and architecture and to avoid breaking things. This equally applies to commercial, open source and home grown systems. Joachim Neubert’s presentation of the use of the open source CMS Drupal for linked data enhanced publishing for special collections illustrated this. Some very specialist custom extensions to the essentially quite flexible system were needed to make this a success. (On a different note, it was nice to see that Joachim used a simple triple diagram from my first library linked data blog post to illustrate the use of different types of predicates between similar subjects and objects.)
Anyway, a similar point can be made about systems and identifiers for people (authors, researchers, etc.). I participated in the workshop on ISNI, ORCID and VIAF: Examining the fundamentals and application of contributor identifiers led by Anila Angjeli and Thom Hickey, one of six ELAG workshops this year. Thom and Anila presented a very complete and detailed overview of the similarities and differences of these three identifier schemes. One of the discussion topics was the difference in adoption of these schemes by the community on the one hand and as machine readable metadata and their application in library systems on the other.
This is where “resilience” comes into play, a concept introduced by Beate Rusch in her talk on the changing roles of the German regional library consortia and service centres in the world of cloud computing and SaaS. Rurik Greenall captures the essence of her talk when he says “… homogenous, generic solutions will not work in practice because they are at odds with how things are done …” and that “messy, imperfect systems… are smart and long lived”. Since Beate’s presentation the term “resilience” has popped up in a number of discussions with colleagues, during and after the conference, mainly in the sense that most systems, communities, infrastructures are NOT resilient. Resilience is a concept mainly used in psychology and physics, meaning the ability of someone or something to return to its original state after being subjected to a severe disturbance. Beate’s idea with resilience is that we can adapt better to changing circumstances and needs in the world around us if we are less perfect and rigid than we usually are. In this sense I think resilience can also mean that a structure could permanently change instead of returning to its original state.
In the library world resilience can be applied to librarians, libraries, library infrastructure and library systems alike. In my view “resilience” might apply to the alternative architecture I have described in a recent blog post, where I argue that we should stop thinking systems and start thinking data. In order to be resilient we need an open, connected infrastructure, that is of the web (not on the web). The SCAPE infrastructure for processing large datasets for long term preservation, presented by Sven Schlarb, might fit this description.
A number of presentations focused on infrastructure and architecture. The new version of the Swedish union catalogue LIBRIS could be described as a resilient system. Martin Malmsten, Markus Sköld and Niklas Lindström showed their new linked open data based integrated library framework, which was built from the ground up, from “a clean slate” so to speak. I can only echo Rurik’s verdict: “With this, Libris really are showing the world how things are done”. This is in contrast to the Library of Congress BibFrame development, which started out very promising but now seems to be evolving into an inward-looking, rigid New Marc. This was illustrated by Martin Malmsten when he revealed to us that Marc is undead, and by Becky Yoose, who wrote a very pertinent parable telling the tale of the resurrection of Marc.
Rurik Greenall described the direction taken at his own institution NTNU Library: getting rid of old legacy library and webpage formats and moving towards being part of the web, providing information for the web, being data driven. It’s a slow and uphill struggle, but better than the alternative. A clean slate again!
Dave Pattern presented a different approach in connecting data from a number of existing systems and databases by means of APIs, and combining these into a new and well received reading list service at the University of Huddersfield.
Back to research. In our presentation, or rather performance, Jane Stevenson and I tried to present the conflicting perspectives of collection managers and researchers in a theatrical way, showing parallel developments in the music industry. Afterwards we tried to analyse the different perspectives, argued that researchers need connected information of all types and from all sources and concluded that information professionals should try and learn to take the researcher’s perspective in order to avoid becoming irrelevant in that area.
The relationship between libraries and researchers was also the subject of the talk “Partners in research. Outside the library, inside the infrastructure”, by Sally Chambers and Saskia Scheltjens. Here the focus was on providing comprehensive infrastructures for research support, especially in the digital humanities. Central question: large top-down institutionalised structures, or bottom-up connected networks? Bottom line is: the researcher’s needs have to be met in the best possible way.
A very interesting example of an actual digital humanities research and teaching project in collaboration between researchers and the library is the Annotated Books Online project that was presented by Utrecht University staff. The collection of rare books is made available online in order to crowdsource the interpretation of handwritten annotations present in these books.
Besides research support there were presentations on other “inside out library” topics: publishing, teaching, data analysis and GLAM.
Anders Söderbäck presented the Stockholm University Press, a new publishing house for open access digital and print on demand books. I was pleasantly surprised that Anders included two quotes from my aforementioned blog post in his talk: “...in the near future we will see the end of the academic library as we know it” and “According to some people university libraries are very suitable and qualified to become scholarly publishers … I am not sure that this is actually the case. Publishing as it currently exists requires a number of specific skills that have nothing to do with librarian expertise”. But of course Anders’ most important achievement was winning the Library Automation Bingo by including all required terms in one slide in a coherent and meaningful way.
Merrilee Proffitt presented an overview of MOOCs and libraries, Sarah Brown described the way that learning materials at the Open University in the UK are successfully connected and integrated in the linked data based STELLAR project. Looking at these developments the question arises if there are already efforts to come to a Teaching Object model, similar to the Research Object?
Andrew Nagy described the importance of analysing huge amounts of usage data in order to improve the usability and end user front end of the Summon discovery tool. Dan Chudnov presented the Social Media Manager prototype, used for collecting data from Twitter for use in social science research.
Valentine Charles described the activities carried out by Europeana to contribute large amounts of digitised library heritage resources to Wikimedia Commons by means of the GLAMwiki toolset in order to improve visibility of these resources the Open Access way. The GLAMwiki toolset currently appears to offer a number of challenges for the interoperability and integration of metadata standards between the library and the Wikimedia world. Another plea for resilience.
Then there were the workshops. The combination of these parallel hands-on and engaging group activities and the plenary sessions makes ELAG a unique experience. Although I only participated in one, obviously, I have heard good reports from all other workshops. I would like to give a special mention to Ade and Jane Stevenson’s “Very Gentle Linked Data” workshop, where they managed to teach even non-tech people not only the basic principles of linked data, but also how to create their own triple store and query it with SPARQL.
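For readers who were not in that workshop, the basic idea of a triple store fits in a few lines. The sketch below is a toy in plain Python, not a real triple store and not SPARQL: the `TripleStore` class, the `ex:` identifiers and the sample data are invented for illustration, and the `match` method only mimics a single SPARQL triple pattern, with `None` playing the role of a variable.

```python
class TripleStore:
    """A toy in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        """Add one triple; adding the same triple twice has no effect."""
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple
            for triple in self._triples
            if all(p is None or p == v for p, v in zip(pattern, triple))
        )


store = TripleStore()
store.add("ex:book1", "dcterms:title", "An example book")
store.add("ex:book1", "dcterms:creator", "ex:author1")
store.add("ex:book2", "dcterms:creator", "ex:author1")
store.add("ex:author1", "foaf:name", "Jane Doe")

# Roughly what SELECT ?s WHERE { ?s dcterms:creator ex:author1 } would ask for.
books_by_author = [s for s, _, _ in store.match(predicate="dcterms:creator", obj="ex:author1")]

if __name__ == "__main__":
    print(books_by_author)
```

A real store adds URIs, typed literals and a query language on top, but the matching of patterns against triples is the core of it.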
Summarising: looking at the ELAG2013 presentations, are we ready for the inside out library? Sometimes we can start with a clean slate, but that is not always possible. Resilience seems to be a requirement if we want to cope with the dramatic changes we are facing. But you can’t simply decide to be resilient; either something is resilient or it isn’t. A clean slate might be the only option. In any case it seems obvious that connections are key. The information profession needs to invest in new connections on every level, creating new forms of knowledge, in order to stay relevant.
Posted on March 22nd, 2013 46 comments
The BeyondThePDF2 conference, organised by FORCE11, was held in Amsterdam, March 19-20. From the website: “...we aim to bring about a change in modern scholarly communications through the effective use of information technology”. Basically the conference participants discussed new models of content creation, content dissemination, content consumption, funding and research evaluation.
Because I work for an academic library in Amsterdam, dealing with online scholarly information systems and currently trying to connect traditional library information to related research information, I decided to attend.
Academic libraries are supposed to support university students, teaching and research staff by providing access to scholarly information. They should be somewhere in the middle between researchers, authors, publishers, content providers, students and teachers. Consequently, any big changes in the way that scholarly communication is being carried out in the near and far future definitely affect the role of academic libraries. For instance, if the scholarly publication model were to change overnight from the current static document centered model to a dynamic linked data model, the academic library discovery and delivery systems infrastructure would grind to a halt.
So I was surprised to see that the library representation at the conference was so low compared to researchers, publishers, students and tech/tools people (thanks to Paul Groth for the opening slides). No Dutch university library directors were present. Maybe that’s because they all attended the Research Data Alliance launch in Gothenburg which was held at the same time. I know of at least one Dutch university library director who was there. Maybe an official international association is more appealing to managers than an informal hands-on bunch like FORCE11.
A number of questions arise from this observation:
Are academic libraries talking to researchers?
Probably (or maybe even apparently) not enough. Besides traditional library services like providing access to publications and collections, academic libraries are more and more asked to provide support for the research process as such, research data management, preservation and reuse, scholarly output repositories and research information systems. In order to perform these new tasks in an efficient way for both the library and the researcher, they need to communicate about needs and solutions.
I took the opportunity and talked to a couple of scholars/researchers at BeyondThePDF2, asking among other things: “When looking for information relevant to your research topic, do you use (our) library search tools?” Answer: “No. Google.” or similar. Which brings me to the next question.
Do researchers know what academic libraries are doing?
Probably (or maybe even apparently) not enough. Same answer indeed. It struck me that of the few times libraries were mentioned in talks and presentations, it was almost always in the form of the old stereotype of the stack of books. Books? I always say: “Forget books, it’s about information!”. One of the presenters whose visionary talk I liked very much even told me that they hoped the new Amsterdam University Library Director would know something about books. That really left me speechless.
Fortunately the keynote speaker on the second day, Carol Tenopir, had lots of positive things to say about libraries. One remark was made (not sure who said it) that has been made before: “if libraries do their work properly, they are invisible”. This specifically referred to academic libraries’ role in selecting, acquiring, paying for and providing technical access to scholarly publications from publishers and other content providers.
Another illustration of this invisibility is the in itself great initiative that was started during the conference: “An open alternative to Google Scholar”, which could just as well have been called “An open alternative to Google Scholar, Primo Central, WorldCatLocal, Summon, EDS”. These last four are the best known commercial global scholarly metadata indexes that lots of academic libraries offer.
Anyway, my impression that academic libraries need to pay attention to their changing role in a changing environment was once again confirmed.
Publishers and researchers talk to each other!
(Yes I know that’s not a question). In the light of the recent war between open access advocates and commercial publishers it was good to see so many representatives of Elsevier, Springer etc. actively engaged in discussions with representatives of the scholarly community about new forms of content creation and dissemination. Some of the commercial content providers/aggregators are also vendors of the above mentioned Google Scholar alternatives (OCLC-WorldCatLocal, Proquest/SerialsSolutions-Summon, EBSCO-EDS). All of these are very reluctant to contribute their own metadata to their competitors’ indexes. Academic libraries are caught in the middle here. They pay lots of money for content that apparently they can only access through the provider’s own channels. And in this case the publishers/providers do not listen to the libraries.
Why so many tools/tech people?
Frankly I don’t know. However, I talked to a tools/tech person who worked for one of the publishers. So there obviously is some overlap in the attendee provenance information. Speaking about myself, working for a library, I am not a librarian, but rather a tools/tech person (with an academic degree even). Tools/tech people work for publishers, universities, university libraries and other types of organisations.
There is a lot of interesting innovative technical work being done in libraries by tools/tech people. We even have our own conferences and unconferences that have the same spirit as BeyondThePDF. If you want to talk to us, come for instance to ELAG2013 in Ghent in May, where the conference theme will be “The inside-out library”. Or have a look at Code4Lib, or the Library Linked Data movement.
Besides the good presentations, discussions and sessions, the most striking result of BeyondThePDF2 was the start of no fewer than three bottom-up revolutionary initiatives that drew immediate attention on the web:
- The Scholarly Revolution – Peter Murray-Rust
- The Open Alternative to Google Scholar – Stian Håklev
- The Amsterdam Manifesto on Data Citation Principles – Merce Crosas, Todd Carpenter, Jodi Schneider
We can make it work.
Posted on October 10th, 2012 28 comments
Or: Think “different” or paint yourself in a corner
EMTACL12 – Emerging Technologies in Academic Libraries 2012
I attended the EMTACL12 conference in Trondheim October 1-3, 2012, organised by the Library of NTNU Norwegian University of Science and Technology, both as a member of the international programme committee and as a speaker. EMTACL stands for “emerging technologies in academic libraries”. Looking back, my impression was that the conference was not so much about emerging technologies, as about emerging tasks using existing technologies. One of the keynote speakers, Rudolf Mumenthaler, expressed similar thoughts in his blog post “No new technologies in libraries”, but some of the other participants disagreed, saying that “being emerging” has more to do with the context of technology than with the technology itself (see the comments on that blog post). Some technologies can be established, but may still be emerging in certain domains. There is something to be said for that. Anyway, whatever you say, we all mean the same thing.
EMTACL12 was the second EMTACL conference. The first one was organised in 2010. One of the presentations that caused a great stir amongst librarians on twitter in the 2010 edition was the one entitled “I’ve got Google, why do I need you? A student’s expectations of academic libraries” by Ida Aalen. Let’s look at this year’s conference with that perspective in mind: is there a future for academic libraries in supporting students and researchers other than just giving access to publications?
The word “change” best describes the overall impression I got from all EMTACL12 presentations. And “data”. Both concepts involve “support and services for research and education”. Technologies that were mentioned: linked data, APIs, mobile computing, visualisation, infrastructure, communication.
The EMTACL12 programme consisted of 8 plenary keynote presentations by invited speakers, and a number of presentations in two parallel tracks. Let me report on the things that struck me most.
The title of the opening keynote presentation by Herbert Van de Sompel, “Paint-yourself-in-the-corner Infrastructure” aptly describes the current situation of academic libraries. “Paint yourself in a corner” means something like: “To put yourself in a situation with no visible solution or alternative”. Herbert Van de Sompel talked about the changing nature of the scholarly record: from “fixity” and “boundary” to dynamic and interdependent on the web. Online publications and related information, like research project information, references and data, change over time, so it becomes increasingly difficult to recreate a scholarly record. These are the challenges that academic libraries need to address. Van de Sompel mentioned a couple of new tools and protocols that can help: Memento, DURI (Durable URI), SiteStory. See also the excellent report of this session by Jane Stevenson on the Archives Hub blog.
‘Think “different”’ is what Karen Coyle told us, using the famous Steve Jobs quote. And yes, the quotes around “different” are there for a reason, it’s not the grammatically correct “think differently”, because that’s too easy. What is meant here is: you have to have the term “different” in your mind all the time. Karen Coyle confronted us with a number of ingrained obsolete practices in libraries. Like the ineradicable need for alphabetic ordering, which only makes sense in physical catalogue card systems. “Alphabetical order is not generally meaningful and an accident of language” she said. Same with page numbers and ebooks: “…it is literally impossible to get everyone ‘on the same page’”. Before printing we already had a perfect reference system for texts, independent of physical appearance: paragraph or verse numbers (like in the Bible).
Libraries put things on shelves, forcing the user to see individual items, and ignoring the connections there are between them. “Library classification is a knowledge-prevention system, not a knowledge-organisation system”. The focus is still too much on physical items: “The FRBR user tasks drive me insane, as they end with obtain”. According to Karen Coyle, libraries are two-dimensional linear things. We need to add a third (links), fourth (time) and fifth dimension (the users).
Is linked data the answer? Not as such: “ISBD in RDF is like putting a turbo engine on a dinosaur”. The world is not waiting for libraries’ bibliographic data as Linked Open Data. The web is awash with bibliographic data. But we have holdings information, and that is unique and adds value. We should try and get that information into Google search results rich snippets.
Karen’s message, which I wholeheartedly support, was: “The mission of the library is not to gather physical things into an inventory, but to organize human knowledge that has been very inconveniently packaged.”
Rurik Greenall’s keynote “Defining/Defying reality: the struggle towards relevance in bibliographic data” also focused on the imminent irrelevancy of libraries, from another perspective. “Outsourcing library business is better called ‘outscarcing’. Libraries are losing skills.” “You can tell a lot about an organization from the way it treats its data.” “We see metadata as good and data as bad. The terms are the same.” “Ideas change, so should your data.” Buying shelf-ready data means being static. “Data should age like wine, not like fish.” In this changing environment bibliographic data needs to be enhanced. There is a role for experts, for the library. Final quote: “The semantic web doesn’t exist anymore, it’s been absorbed by the web”.
Rudolf Mumenthaler spoke about “Innovation management in and for libraries”. During and after his talk the big question was: can innovation be promoted by management, or does it need to grow of itself in freedom, by allowing staff to play the Google way? It appears that there may be cultural differences. Main thing is: innovation has to be facilitated in one way or another. See the comments on his blog post.
Astrophysicist Eirik Newth entertained the audience with his slideless “Forecast for the academic library of 2025: Cloudy with a chance of user participation and content lock-in”.
Jens Vigen, Head Librarian at CERN, delivered a very entertaining and compelling argument for open access with his talk “Connecting people and information: how open access supports research in High Energy Physics. Since 50 years!” The CERN convention of 1953 already effectively contains an Open Access Manifesto. CERN supports SCOAP3, the Sponsoring Consortium for Open Access Publishing in Particle Physics. CERN uses subscription funds for open access. “You librarians today spend money on subscriptions, tomorrow you will spend it on open access.”
A couple of very interesting remarks by Jens Vigen that are of direct interest to online library discovery layers:
“A researcher would never go to an institutional repository, they find their colleagues in subject repositories.”
“A successful digital library: one size does not fit all.”
OCLC’s new “Technology evangelist” Richard Wallis’ talk “OCLC WorldShare and Linked Data” actually was not about WorldShare and Linked Data, but consisted of two parts: a WorldShare commercial, and a presentation of WorldCat and linked data, mainly the embedding of additional schema.org markup in WorldCat search results. Richard Wallis also mentioned the WorldCat Linked Data Facebook app, which almost nobody seemed to know. Maybe Facebook isn’t the right platform for things like this after all?
In his closing keynote “What Next for Libraries? Making Sense of the Future” Brian Kelly, UKOLN, University of Bath, in the UK, made it clear that it is very hard to foresee the future, with Star Trek, monorails and paper planes as evidence.
Obviously I could only attend half of the parallel tracks sessions. Moreover, I chaired two sessions of two presentations each, in the “Semantic Web” and “Supporting Research” tracks, and I gave one presentation myself.
In “The winner takes it all? – APIs and Linked Data battle it out” Jane and Adrian Stevenson (yes, they’re married, and work together) of the MIMAS National Data Centre at the University of Manchester in the UK, performed an actual battle defending the use of the generic linked data protocol versus the more dedicated API approach in making data available for reuse and mashups. Two interesting projects served as an example: the World War 1 Discovery Project (Adrian for APIs) and Linking Lives (Jane for Linked Data). Conclusion: too close to call.
Norwegian Black Metal was the intriguing topic of Kim Tallerås’ talk “Using Linked data to harmonize heterogeneous metadata – Modeling the birth of Norwegian black metal”. He and three others combined complicated metadata from two heterogeneous data sources about early Norwegian black metal bands, performances and recordings using linked data ontologies and graph matching techniques. We saw some very interesting slides containing MARC records and some typical Black Metal band and song names.
Afterwards we had the opportunity to experience the real thing in the Black Metal Room in the Norwegian Rock and Pop Museum Rockheim during our conference excursion.
“Mubil: a digital laboratory” is a project (NTNU Trondheim, PERCRO, Pisa, Italy) aimed at augmenting and enriching rare old books in a digital 3D architecture, ready for all kinds of platforms and devices. Results are touch ebooks, with options for retrieving extra textual information and virtual 3D objects. A very interesting presentation by Alexandra Angeletaki, Marcello Carrozzino and Chiara Evangelista.
In her talk “Libraries, research infrastructures and the digital humanities: are we ready for the challenge?”, Sally Chambers (DARIAH Göttingen) gave us a very thorough and complete overview of what “Digital Humanities” means and of all organisations and infrastructures currently available to libraries that are charged with supporting digital humanities research.
The History Engine project was the subject of the presentation “Driving history forward: The History Engine as a vehicle for engaging undergraduate research” by Paulina Rousseau, Whitney Kemble and Christine Berkowitz (University of Toronto Scarborough), as a real example of how libraries can support undergraduate students in their efforts to master research.
Sharon Favaro, Digital Services Librarian at Seton Hall University in South Orange, USA, showed us the landscape of disconnected tools used in the different stages of research projects: catalogues, databases, writing tools, drawing tools, reference managers, task managers, email; on the web, on internet file sharing tools, on desktop, on flash drives. The topic of her talk “Designing tools for the 21st century workflow of research and how it changes what libraries must do” was: how can research libraries support scholars within the entire lifecycle of the research process? The goal being to identify areas where library tools could be better integrated to support library resource use throughout the lifecycle of research. It was a pity that there was no real view yet on the best way to solve this problem: create a new library based infrastructure platform, use existing linking features, or other options. This will hopefully be the objective of a follow-up project at Seton Hall University Library.
“Publication profiles – presenting research in a new way”: Urban Andersson and Stina Johansson presented the Chalmers University (Gothenburg, Sweden) Publication Profiles Platform, in which all kinds of information related to Chalmers University researchers and publications are linked together. The main objective is to increase the visibility of Chalmers University research. A good example of how university libraries should take care of their own research and publications domain. A very interesting visualisation feature was shown: Chalmers Geography, or geographical relations between researchers and projects on Google Maps. A question I should have asked (but didn’t) is: how does this project relate to the VIVO project?
In my own presentation “Primo at the University of Amsterdam – Technology vs Real Life” I tried to show the discrepancies between the theoretically unlimited possibilities of the technology used in library discovery layers and the limitations in the actual implementation of these tools, focusing on content, indexing and user interface configuration. One of my conclusions was already expressed earlier by Jens Vigen: “A successful digital library: one size does not fit all.”
Let’s not forget Rune Martin Andersen’s talk on the Bartebuss (Moustache Bus) Trondheim public transport open data app project. This is yet more proof that public transport apps are the killer apps of open data.
Last but not least: the food (delicious and lots of it), the photos, Patrick Hochstenbach’s doodles and the music: the excursion to Rockheim Museum, the conference dinner entertainment by Skrømt, and the afterparty at Ramp bar, resulting in an interesting playlist afterwards.
Posted on January 5th, 2012 33 comments
2011 has in a sense been the year of library linked data. Not that libraries of all kinds are now publishing and consuming linked data in great numbers. No. But we have witnessed the publication of the final report of the W3C Library Linked Data Incubator Group, the Library of Congress announcement of the new Bibliographic Framework for the Digital Age based on Linked Data and RDF, the release by a number of large libraries and library consortia of their bibliographic metadata, many publications, sessions and presentations on the subject.
All these events focus mainly on publishing library bibliographic metadata as linked open data. Personally I am not convinced that this is the most interesting type of data that libraries can provide. Bibliographic metadata as such describe publications, in the broadest sense, providing information about title, authors, subjects, editions, dates, urls, but also physical attributes like dimensions, number of pages, formats, etc. This type of information, in FRBR terms: Work, Expression and Manifestation metadata, is typically shared among a large number of libraries, publishers, booksellers, etc. ‘Shared’ in this case means ‘multiplied and redundantly stored in many different local systems’. It doesn’t really make sense if all libraries in the world publish identical metadata side by side, does it?
In essence only really unique data is worth publishing. You link to the rest.
Currently, library data that is really unique and interesting is administrative information about holdings and circulation. After having found metadata about a potentially relevant publication it is very useful for someone to know how and where to get access to it, if it’s not freely available online. Do you need to go to a specific library location to get the physical item, or to have access to the online article? Do you have to be affiliated to a specific institution to be entitled to borrow or access it?
Usage data about publications, both print and digital, can be very useful in establishing relevance and impact. This way information seekers can be supported in finding the best possible publications for their specific circumstances. There are some interesting projects dealing with circulation data already, such as the research project by Magnus Pfeffer and Kai Eckert as presented at the SWIB 11 conference, and the JISC funded Library Impact Data project at the University of Huddersfield. The Ex Libris bX service presents article recommendations based on SFX usage log analysis.
The consequence of this assertion is that if libraries want to publish linked open data, they should focus on holdings and circulation data, and for the rest link to available bibliographic metadata as much as possible. It is to be expected that the Library of Congress’ New Bibliographic Framework will take care of that part one way or another.
In order to achieve this libraries should join forces with each other and with publishers and aggregators to put their efforts into establishing shared global bibliographic metadata pools accessible through linked open data. We can think of already existing data sources like WorldCat, OpenLibrary, Summon, Primo Central and the like. We can only hope that commercial bibliographic metadata aggregators like OCLC, SerialsSolutions and Ex Libris will come to realise that it’s in everybody’s interest to contribute to the realisation of the new Bibliographic Framework. The recent disagreement between OCLC and the Swedish National Library seems to indicate that this may take some time. For a detailed analysis of this see the blog post ‘Can linked library data disrupt OCLC? Part one’.
An interesting initiative in this respect is LibraryCloud, an open, multi-library data service that aggregates and delivers library metadata. And there is the HBZ LOBID project, which is targeted at ‘the conversion of existing bibliographic data and associated data to Linked Open Data’.
So what would the new bibliographic framework look like? If we take the FRBR model as a starting point, the new framework could look something like this. See also my slideshow “Linked Open Data for libraries”, slides 39-42.
The basic metadata about a publication or a unit of content, on the FRBR Work level, would be an entry in a global datastore identified by a URI (Uniform Resource Identifier). This datastore could for instance be WorldCat, or OpenLibrary, or even a publisher’s datastore. It doesn’t really matter. We don’t even have to assume it’s only one central datastore that contains all Work entries.
The thing identified by the URI would have a text string field associated with it containing the original title, let’s say “The Da Vinci Code” as an example of a book. But articles too can and should be identified this way. The basic information we need to know about the Work would be attached to it using URIs to other things in the linked data web. A statement that links two things by means of a named relationship (which is itself identified by a URI) is called a ‘triple’. ‘Author’ could for instance be a link to OCLC’s VIAF (http://viaf.org/viaf/102403515 = Dan Brown), which would then constitute a triple. If there are more authors, you simply add a URI for every person or institution. Subjects could be links to DBPedia/Wikipedia, Freebase, the Library of Congress Authority files, etc. There could be some more basic information, maybe a year, or a URI to a source describing the background of the work.
At the Expression level, a Dutch translation would have its own URI, stored in the same or another datastore. I could imagine that the publisher who commissioned the translation would maintain a datastore with this information. Attached to the Expression there would be the URI of the original Work, a URI pointing to the language, a URI identifying the translator and a text string containing the Dutch title, among others.
Every individual edition of the work could have its own Manifestation level URI, with a link to the Expression (in this case the Dutch translation), a publisher URI, a year, etc. For articles published according to the long-standing tradition of peer-reviewed journals, there would also be information about the journal. On this level there should also be URIs to the actual content when dealing with digital objects like articles, ebooks, etc., no matter if access is free or restricted.
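To make the three levels a bit more tangible, here is a minimal sketch in Python that stores the example as plain (subject, predicate, object) tuples. Only the VIAF URI comes from the text above. Every `example.org` URI and every predicate name is a hypothetical placeholder that I made up for illustration, not an existing vocabulary or datastore.

```python
# Hypothetical namespace for this sketch; only the VIAF URI is taken from the post.
EX = "http://example.org/"
WORK = EX + "work/da-vinci-code"
EXPRESSION = EX + "expression/da-vinci-code-nl"
MANIFESTATION = EX + "manifestation/da-vinci-code-nl-ed1"

# Each triple is (subject, predicate, object). Predicate names are invented.
TRIPLES = [
    (WORK, "originalTitle", "The Da Vinci Code"),
    (WORK, "author", "http://viaf.org/viaf/102403515"),
    (EXPRESSION, "expressionOf", WORK),
    (EXPRESSION, "language", EX + "language/nl"),
    (EXPRESSION, "translator", EX + "person/translator-1"),
    (MANIFESTATION, "manifestationOf", EXPRESSION),
    (MANIFESTATION, "publisher", EX + "org/publisher-1"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object attached to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def work_of(manifestation, triples=TRIPLES):
    """Follow the links Manifestation -> Expression -> Work."""
    (expression,) = objects(manifestation, "manifestationOf", triples)
    (work,) = objects(expression, "expressionOf", triples)
    return work


# Starting from one edition, look up the author of the underlying Work.
print(objects(work_of(MANIFESTATION), "author"))
```

The point of the sketch is that the edition itself stores no author at all: that information is found by following links up to the Work, wherever that Work entry happens to live.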
So far we have everything we need to know about publications “in the cloud”, or better: in a number of datastores available on a number of servers connected to the world wide web. This is more or less the situation described by OCLC’s Lorcan Dempsey in his recent post ‘Linking not typing … knowledge organization at the network level’. The only thing we need now is software to present all linked information to the user.
No libraries in sight yet. For accessing freely available digital content on the web you actually don’t need a library, unless you need professional assistance finding the correct and relevant information. Here we have identified a possible role of librarians in this new networked information model.
Now we have reached the interesting part: how to link local library data to this global shared model? We immediately discover that the original FRBR model is inadequate in this networked environment, because it implies a specific local library situation. Individual copies of a work (the Items) are directly linked to the Manifestation, because FRBR refers to the old local catalogue which describes only the works/publications one library actually owns.
In the global shared library linked data network we need an extra explicit level to link physical Items owned by the library or online subscriptions of the library to the appropriate shared network level. I suggest using the “Holding” level. A Holding would have its own URI and contain URIs of the Manifestation and of the Library. A specific Holding in this way would indicate that a specific library has one or more copies (Items) of a specific edition of a work (Manifestation), or offers access to an online digital article by way of a subscription.
If a Holding refers to physical copies (print books or journal issues for instance) then we also need the Item level. An Item would have its own URI and the URI of the Holding. For each Item, extra information can be provided, for instance ‘availability’, ‘location’, etc. Local circulation administration data can be registered for all Holdings and Items. For online digital content we don’t need Items, only subscription information directly attached to the Holding.
Local Holding and Item information can reside on local servers within the library’s domain or just as well on some external server ‘in the cloud’.
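The Holding and Item levels described above could be sketched as two small Python data classes. This is only an illustration under my own assumptions: the field names, the `available_items` helper and all URIs are hypothetical, and a real implementation would of course live in a linked data store rather than in memory.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """One physical copy; it points back to its Holding."""
    uri: str
    holding_uri: str
    location: str
    available: bool


@dataclass
class Holding:
    """Links one Library to one Manifestation."""
    uri: str
    manifestation_uri: str
    library_uri: str
    items: List[Item] = field(default_factory=list)  # stays empty for online content
    subscription: Optional[str] = None  # only filled in for online content


def available_items(holding: Holding) -> List[Item]:
    """Physical copies of this Holding that can be borrowed right now."""
    return [item for item in holding.items if item.available]


# All URIs below are hypothetical placeholders.
print_holding = Holding(
    uri="http://example.org/holding/1",
    manifestation_uri="http://example.org/manifestation/da-vinci-code-nl-ed1",
    library_uri="http://example.org/library/a",
)
print_holding.items = [
    Item("http://example.org/item/1", print_holding.uri, "Main building", True),
    Item("http://example.org/item/2", print_holding.uri, "Depot", False),
]
online_holding = Holding(
    uri="http://example.org/holding/2",
    manifestation_uri="http://example.org/manifestation/some-article",
    library_uri="http://example.org/library/a",
    subscription="licence for affiliated users",
)

print([item.location for item in available_items(print_holding)])
print(available_items(online_holding))
```

Note that the online Holding has no Items at all, only subscription information, which is exactly the distinction made above.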
It’s on the level of the Holding that usage statistics per library can be collected and aggregated, both for physical items and for digital material.
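As a rough sketch of what collecting and aggregating usage at the Holding level could mean, the Python fragment below counts usage events per Holding and then rolls them up to the shared Manifestation. The identifiers and the event list are invented sample data, not figures from any real library.

```python
from collections import Counter

# Hypothetical data: which Manifestation each Holding refers to.
HOLDING_TO_MANIFESTATION = {
    "holding/lib-a/1": "manifestation/m-1",
    "holding/lib-b/7": "manifestation/m-1",
    "holding/lib-b/8": "manifestation/m-2",
}

# One entry per loan or download, recorded against the Holding (invented data).
USAGE_EVENTS = [
    "holding/lib-a/1",
    "holding/lib-a/1",
    "holding/lib-b/7",
    "holding/lib-b/8",
]


def usage_per_holding(events):
    """Usage per library: every Holding belongs to exactly one library."""
    return Counter(events)


def usage_per_manifestation(events, holding_to_manifestation):
    """Aggregate usage across libraries by following Holding -> Manifestation."""
    totals = Counter()
    for holding in events:
        totals[holding_to_manifestation[holding]] += 1
    return totals


print(usage_per_holding(USAGE_EVENTS))
print(usage_per_manifestation(USAGE_EVENTS, HOLDING_TO_MANIFESTATION))
```

Because every Holding carries the URI of its Manifestation, the same local counts can be summed across libraries without any extra matching step.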
Now, this networked linked library data model still allows libraries to present a local traditional catalogue type interface, showing only information about the library’s own print and digital holdings. What’s needed is software to do this using the local Holdings as entry level.
But the nice thing about the model is that there will also be a lot of other options. It will also be possible to start at the other end and search all bibliographic metadata available in the shared global network, and then find the most appropriate library to get access to a specific publication, much like WorldCat does, but on an even larger scale.
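The two entry points just described, starting from the local Holdings or starting from the shared bibliographic level, can be sketched over the same small data set. Everything here is a simplifying assumption on my part: the identifiers are made up, and the Expression level is skipped so that a Manifestation points straight to its Work.

```python
# Hypothetical shared bibliographic data (Expression level omitted for brevity).
WORKS = {
    "work/1": "The Da Vinci Code",
    "work/2": "The Diary of Anne Frank",
}
MANIFESTATION_TO_WORK = {"m/1": "work/1", "m/2": "work/1", "m/3": "work/2"}

# Hypothetical local data: (library, manifestation) pairs, one per Holding.
HOLDINGS = [
    ("library/a", "m/1"),
    ("library/b", "m/2"),
    ("library/b", "m/3"),
]


def local_catalogue(library):
    """Traditional view: start at one library's Holdings, walk up to the titles."""
    return sorted(
        {WORKS[MANIFESTATION_TO_WORK[m]] for lib, m in HOLDINGS if lib == library}
    )


def libraries_offering(title_query):
    """Global view: start at the Work level, walk down to the libraries."""
    matching_works = {
        work for work, title in WORKS.items() if title_query.lower() in title.lower()
    }
    return sorted(
        {lib for lib, m in HOLDINGS if MANIFESTATION_TO_WORK[m] in matching_works}
    )


print(local_catalogue("library/b"))
print(libraries_offering("da vinci"))
```

Both functions read exactly the same links, only in opposite directions, which is why the model does not force a choice between a local catalogue and a global search.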
Another nice thing of using triples, URIs and linked data, is that it allows for adding all kinds of other, non-traditional bibliographic links to the old inward looking library world, making it into a flexible and open model, ready for future developments. It will for instance be possible for people to discover links to publications and library holdings from any other location on the web, for instance a Wikipedia page or a museum website. And the other way around, from an item in local library holdings to let’s say a recorded theatre performance on YouTube.
When this new data and metadata framework is in place, there will be two important issues to be solved:
- Getting new software, systems and tools for both back end administrative functions and front end information finding needs. For this we need efforts from traditional library systems vendors but also from developers in libraries.
- Establishing future roles for libraries, librarians and information professionals in the new framework. This may turn out to be the most important issue.
Posted on September 2nd, 2011 10 comments
Shifting focus from information carriers back to information
Library catalogues have traditionally been used to describe and register books and journals and other physical objects that together constitute the holdings of a library. In an integrated library system (ILS), the public catalogue is combined with acquisition and circulation modules to administer the purchases of book copies and journal subscriptions on one side, and the loans to customers on the other side. The “I” for “Integrated” in ILS stands for an internal integration of traditional library workflows. Integration from a back end view, not from a customer perspective.
Because of the very nature of such a catalogue, namely the description of physical objects and the administration of processing them, there are no explicit relations between the different editions and translations of the same book, nor are there descriptions of individual journal articles. If you do a search on a specific person’s name, you may end up with a large number of result records, written by that person or someone with a similar name, or about that person, even with identical titles, without knowing if there is a relationship between them, and what that relationship might be. What’s certain is that you will not find journal articles written by or about that person. The same applies to a search on title. There is no way of telling if there is any relation between identical titles. A library catalogue user would have to look at specific metadata in the records (like MARC 76X-78X – Linking Entries, 534 – Original Version Note or 580 – Linking Entry Complexity Note), if available, to reach their own conclusions.
Most libraries nowadays also purchase electronic versions of books and journals (ebooks and ejournals) and have free or paid subscriptions to online databases. Sometimes these digital items (ebooks, ejournals and databases) are also entered into the traditional library catalogues, but they are sometimes also made available through other library systems, like federated search tools, integrated discovery tools, A-Z lists, etc. All kinds of combinations occur.
In traditional library catalogues digital items are treated exactly the same as their physical counterparts. They are all isolated individual items without relations. As Karen Coyle put it in November 2010 at the SWIB10 conference: “The main goal of cataloguing today is to keep things apart”.
Basically, integrated library systems and traditional catalogues are nothing more than inventory and logistics systems for physical objects, mainly focused on internal workflows. Unfortunately in newer end user interfaces like federated search and integrated discovery tools the user experience in this respect has in general been similar to that of traditional public catalogues.
At some point in time during the rise of electronic online catalogues, apparently the lack of relations between different versions of the same original work became a problem. I’m not sure if it was library customers or librarians who started feeling the need to see these implicit connections made explicit. The fact is that IFLA (International Federation of Library Associations) started developing FRBR in 1998.
FRBR (Functional Requirements for Bibliographic Records) is an attempt to provide a model for describing the relations between physical publications, editions, copies and their common denominator, the Work.
FRBR Group 1 describes publications in terms of the entities Work, Expression, Manifestation and Item (WEMI).
FRAD (Functional Requirements for Authority Data – ‘authors’) and FRSAD (Functional Requirements for Subject Authority Data – ‘subjects’) were developed later as alternatives to the FRBR Group 2 and 3 entities.
As an example let’s have a look at The Diary of Anne Frank. The original handwritten diary may be regarded as the Work. There are numerous adaptations and translations (Expressions) of the original unfinished and unedited Work. Each of these Expressions can be published in the form of one or more prints, editions, etc. These are the Manifestations, especially if they have different ISBNs. Finally a library can have one or more physical copies of a Manifestation, the Items.
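The strictly top-down nature of the Work, Expression, Manifestation, Item chain can be sketched as nested Python data classes. The expression labels, edition labels and copy identifiers below are invented placeholders, not real editions or ISBNs of the diary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    copy_id: str  # one physical copy on a shelf


@dataclass
class Manifestation:
    label: str  # one edition, typically with its own ISBN
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    label: str  # one adaptation or translation
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str
    expressions: List[Expression] = field(default_factory=list)


def count_items(work: Work) -> int:
    """Walk the hierarchy top-down and count all physical copies."""
    return sum(
        len(manifestation.items)
        for expression in work.expressions
        for manifestation in expression.manifestations
    )


# Placeholder data: labels and copy identifiers are made up for this sketch.
diary = Work(
    "The Diary of Anne Frank",
    [
        Expression("edited Dutch text", [
            Manifestation("edition A", [Item("copy-1"), Item("copy-2")]),
        ]),
        Expression("English translation", [
            Manifestation("edition B", [Item("copy-3")]),
            Manifestation("edition C", [Item("copy-4")]),
        ]),
    ],
)

print(count_items(diary))
```

Every object here can only be reached from the Work downwards, and nothing in the structure can point sideways to another Work.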
Some might even say the actual physical diary is the only existing Item embodying one specific (the first) Expression of the Work (Anne’s thoughts) and/or the only Manifestation of that Expression.
Of course, this model, if implemented, would be an enormous improvement on the old public catalogue situation. It makes it possible for library customers to have an automatic overview of all editions, translations and adaptations of one specific original work through the mechanism of Expressions and Manifestations. RDA (Resource Description and Access) does exactly this.
However there are some significant drawbacks, because the FRBR model is an old model, based on the traditional way of library cataloguing of physical items (books, journals, CDs, DVDs, etc.) (Karen Coyle at SWIB10).
- In the first place the FRBR model only shows the Works and related Manifestations and Expressions of physical copies (Items) that the library in question owns. Editions not in the possession of the library are ignored. This would be a bit different in a union catalogue of course, but then the model still only describes the holdings of the participating libraries.
- Secondly, the focus on physical copies is also the reason that the original FRBR model does not have a place for journal titles as such, only for journal issues. So there will be as many entries for one journal as the library has issues of it.
- Thirdly, it’s a hierarchical model, which incorporates only relations from the Work top down. There is no room for relations like: ‘similar works’, ‘other material on the same subject’, ‘influenced by’, etc.
- In the fourth place, FRBR still does not look at content. It is document centric, instead of information centric. It does however have the option for describing parts of a Work, if they are considered separate entities/works, like journal articles or volumes of a trilogy.
- Finally, the FRBR Item entity is only interesting in a storage and logistics environment for physical copies, such as the Circulation function in libraries, or the Sales function in bookstores. It has no relation to content whatsoever.
FRBR definitely is a positive and necessary development, but it is just not good enough. Basically it still focuses on information carriers instead of information (it’s a set of rules for managing Bibliographic Records, not for describing Information). It is an introverted view of the world. This was OK as long as it was dictated by the prevailing technological, economic and social conditions.
In a new networked digital information world libraries should shift their focus back to their original objective: being gateways to information as such. This entails replacing an introverted hierarchical model with an extroverted networked one, and moving away from describing static information aggregates in favour of units of content as primary objects.
The linked data concept provides the framework of such a networked model. In this model anything can be related to anything, with explicit declarations of the nature of the relationship. In the example of the Diary of Anne Frank one could identify relations with movies and theater plays that are based on the diary, with people connected to the diary or with the background of World War 2, antisemitism, Amsterdam, etc.
In traditional library catalogues defining relations with movies or theater plays is not possible from the description of the book. They could however be entered as a textual reference in the description of a movie, if for instance a DVD of that movie is catalogued. Relations to people, World War 2, antisemitism and Amsterdam would be described as textual or coded references to a short concept description, which in turn could provide lists of other catalogue items indexed with these subjects.
In a networked linked data model these links could connect to information entities in their own right outside the local catalogue, containing descriptions and other material about the subject, and providing links to other related information entities.
FRBR would still be a valuable part of such a universal networked model, as a subset for a specific purpose. In the context of physical information carriers it is a useful model, although with some missing features, as described above. It could be used in isolation, as originally designed, but if it’s an open model, it would also provide the missing links and options to describe and find related information.
Also, the FRBR model is essential as a minimal condition for enabling links from library catalogue items to other entity types through the Work common denominator.
In a completely digital information environment, the model could be simplified by getting rid of the Item entity. Nobody needs to keep track of available copies of online digital information, unless publishers want to enforce the old business models they have been using in order to keep making a profit. Ebooks for instance are essentially Expressions or Manifestations, depending on their nature, as I stated in my post ‘Is an e-book a book?’.
The FRBR model can be used and is used also in other subject areas, like music, theater performances, etc. The Work – Expression – Manifestation – Item hierarchy is applicable to a number of creative professions.
The networked model provides the option of describing all traditional library objects, but also other and new ones and even objects that currently don’t exist, because it is an open and adaptable model.
In the traditional library models it is for instance impossible, or at least very hard, to describe a story that continues through all volumes of a trilogy as a central thread, apart from and related to the descriptions of the three separate physical books and their own stories. In the Millennium trilogy by Stieg Larsson, Lisbeth Salander’s life story is the central thread, but it can’t be described as a separate “Work” in MARC/FRBR/RDA because it is not the main subject of one physical content carrier (unless we are dealing with an edition in one physical multi part volume). The three volumes will be described with the subjects ‘Missing girl mystery‘, ‘Sex trafficking‘ and ‘Illegal secret service unit‘ respectively.
In an open networked information model on the contrary it would be entirely possible to describe such a ‘roaming story’.
New forms of information objects could appear in the form of new types of aggregates, other than books or journal articles, for instance consisting of text, images, statistics and video, optionally of a flexible nature (dynamic instead of static information objects).
Existing library systems (ILS’s and Integrated Discovery tools alike), using bibliographic metadata formats and frameworks like MARC, FRBR and RDA, can’t easily deal with new developments without some sort of workaround. Obviously this means that if libraries want to continue playing a role in the information gateway world, they need completely different systems and technology. Library system vendors should take note of this.
Finally, instead of only describing information objects, libraries could take up a new role in creating new objects, in the form of subject based virtual information aggregates, like for instance the Anne Frank Timeline, or Qwiki. This would put libraries back in the center of the information access business.
Posted on March 28th, 2011 1 comment
The challenges of generating linked data from legacy databases
Some time ago I wrote a blog post about the linked data proof of concept project I am involved in, connecting bibliographic metadata from the OPAC of the Library of the University of Amsterdam with the theatre performances database maintained by the Theatre Institute of The Netherlands.
I ended that post with a list of next steps to take:
- select/adapt/create a vocabulary for the Production/Performance subject area
- select/adapt/create vocabularies for Persons (FOAF?) and Subjects (SKOS?)
- add internal relationships with the other entities (Play, Production, etc.) in the JSON structure (implement RDF in JSON)
- add RDF/XML as output option, besides JSON
- add external relationships (to other linked data sources like DBPedia, etc.)
- extend the number of possible URI formats (for Play, Production, etc.)
- add content negotiation to serve both human and machine readable redirects
- extend the options on the OPAC side
- publish UBA bibliographic data as linked open data (probably an entirely new project)
So, what have we achieved so far? I can be brief about all the ‘real’ linked data stuff (RDF, vocabularies, external links, content negotiation): we are not there yet. This will be dealt with in the next phase.
Instead, we have focused on getting the simple JSON implementation right, both on the data publishing side and on the data using side. We have added more URIs and internal relationships, and we are using these in the OPAC user interface.
But we have also encountered a number of crucial problems that are in my view inherent to the type of legacy data models used in libraries and cultural heritage institutions.
First let me describe the improvements we have added so far.
The URI for ‘person’ <baseurl>/person/<personname> now also returns a link to all the ‘titles’ that person is connected to (not only with the ‘author’ role, but for all roles, like director, performer, etc.): <baseurl>/gettitles/<personname>. This link will return a set of URIs of the form <baseurl>/title/<personname>/<title>. The /<personname>/<title> bit is at the moment the only way that a more or less unique identifier can be constructed from the OPAC metadata for the ‘play’ in the TIN database. There are a number of really important problems related to this that I will discuss below.
returns among others:
/title/Beckett, Samuel/Waiting for Godot
/title/Beckett, Samuel/En attendant Godot
The URI for a ‘play’ <baseurl>/title/<personname>/<title> now returns a set of ‘production’ URIs of the form <baseurl>/production/<personname>/<title>/<openingdate>/<idnr>.
The ‘production’ URI returns information about ‘theatre company’, ‘venue‘ and all persons connected to that production, including their URIs, and when available also a link to an image of a poster, and a video.
<baseurl>/title/Beckett, Samuel/Waiting for Godot
/production/Beckett, Samuel/Waiting for Godot/1988-07-28/5777
/production/Beckett, Samuel/Waiting for Godot/1988-11-22/6750
/production/Beckett, Samuel/Waiting for Godot/1992-04-16/10728
/production/Beckett, Samuel/Waiting for Godot/1981-02-18/43032
The last ‘production’ URI returns:
"title":"Waiting For Godot",
"description":"Beckett, Samuel (auteur: toneelspel van)",
"description":"Hartnett, John (regie)",
"description":"Muller, Frans (decor: ontwerp)",
"description":"Newell, Kym (licht: ontwerp)",
"description":"Zaal, Kees (geluid)",
"description":"Tolstoj, Alexander (uitvoerende: Lucky)",
"description":"Weeks, David (uitvoerende: Estragon)",
"description":"Coburn, Grant (uitvoerende: Vladimir)",
"description":"Evans, Rhys (uitvoerende: Pozzo)",
"description":"Geiringer, Karl (uitvoerende: A Boy)",
"description":"Guidi, Peter (uitvoering muziek)",
"description":"Kimmorley, Roxanne (uitvoering muziek)",
"description":"Vries, Hessel de (uitvoering muziek)",
"uri":"/person/Vries, Hessel de"
"description":"Phillips, Margot (uitvoering muziek)",
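The URI patterns described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual script: the base URL is a hypothetical placeholder for <baseurl>, the function names are invented, and the percent-encoding of names and titles is an assumption (the examples in this post show the raw strings).

```python
from urllib.parse import quote

BASE_URL = "http://example.org"  # hypothetical placeholder for <baseurl>


def title_uri(person_name: str, title: str) -> str:
    """Build <baseurl>/title/<personname>/<title> from plain metadata strings."""
    # safe="" also encodes "/" so a slash inside a title cannot add a path segment
    return f"{BASE_URL}/title/{quote(person_name, safe='')}/{quote(title, safe='')}"


def production_uri(person_name: str, title: str, opening_date: str, idnr: int) -> str:
    """Build <baseurl>/production/<personname>/<title>/<openingdate>/<idnr>."""
    return (
        f"{BASE_URL}/production/{quote(person_name, safe='')}/"
        f"{quote(title, safe='')}/{opening_date}/{idnr}"
    )


print(title_uri("Beckett, Samuel", "Waiting for Godot"))
print(production_uri("Beckett, Samuel", "Waiting for Godot", "1988-07-28", 5777))
```

Because the identifier is derived from literal strings, "Waiting for Godot" and "Waiting For Godot" produce two different URIs, which is exactly the weakness discussed below.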
Now, the problems (or challenges) that we are facing here are essential to the core concept of linked data:
- we don’t have actual matching unique identifiers (URIs)
- we don’t have explicit internal relations with a common entity in both sources
- part of the data consists of literal strings in a specific language
These three problems are interrelated, they are linked problems, so to speak.
To start with the identifiers. Of course we have internal system identifiers in our local Aleph catalogue database. Because we contribute to the Dutch Union Catalogue (originally a PICA system, now OCLC), our bibliographic records also have national Dutch PICA identifiers. And because the Dutch Union Catalogue records are copied to WorldCat, these records in WorldCat also have OCLC numbers.
The Theatre Institute also has internal system identifiers in their Adlib database. But at the moment we do not have a match between these separate internal identifier schemes. The Theatre Production database records are not in WorldCat because they’re not bibliographic records.
We are more or less forced to use the string values of the title and author fields to construct a usable URI, on both sides. Clearly this is the basis of lots of errors, because of the great number of possible variations in author and title descriptions.
But even if the Theatre Institute’s records were in the Union Catalogue or WorldCat as well, then we still would not have an automatic match without some kind of broker mechanism ascertaining that the library catalogue record describes the same thing as the theatre production database record. The same applies to the author, which of course should be a relation of the type “written by” between the play and a person record instead of string values. Both systems do have internal author or person authority files, but there is no direct matching. For authors this could theoretically be achieved by linking to an online person authority file like VIAF. But in the current situation this is not available.
This brings me to the second problem. The fact that we are using the string values of title instead of unique identifiers, means that we connect plays and productions with a specific title variety or language. In our current implementation this means that we are not able to link to all versions of one specific play.
For instance, from our OPAC the following URIs are constructed (two in English, one in French, one in Dutch):
/title/Beckett, Samuel/Waiting for Godot
/title/Beckett, Samuel/Waiting for Godot : a tragicomedy in two acts
/title/Beckett, Samuel/En attendant Godot : pièce en deux actes
/title/Beckett, Samuel/Wachten op Godot
In the Theatre Production database (two in English, four in Dutch, one in French, one in German):
/title/Beckett, Samuel/Waiting for Godot
/title/Beckett, Samuel/Waiting For Godot
/title/Beckett, Samuel/Wachten op Godot
/title/Beckett, Samuel/Wachtend op Godot
/title/Beckett, Samuel/Wachten op Godot (De favorieten)
/title/Beckett, Samuel/Wachten op Godot (eerste bedrijf)
/title/Beckett, Samuel/En attendant Godot
/title/Beckett, Samuel/Warten auf Godot
Only the first and fourth URI from the OPAC will find corresponding titles in the Theatre Production database. The second and third one, using a subtitle within the main title, don’t even have equivalents. And only two of the eight entries from the Theatre Production database have a match in the catalogue.
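The mismatch can be reproduced with the two title lists above. The sketch below first compares the strings exactly, as our current implementation effectively does, and then applies a naive normalisation. That normalisation rule (ignore case, drop a subtitle after " : ") is purely an assumption for illustration and is not something we have implemented.

```python
opac_titles = [
    "Waiting for Godot",
    "Waiting for Godot : a tragicomedy in two acts",
    "En attendant Godot : pièce en deux actes",
    "Wachten op Godot",
]

tin_titles = [
    "Waiting for Godot",
    "Waiting For Godot",
    "Wachten op Godot",
    "Wachtend op Godot",
    "Wachten op Godot (De favorieten)",
    "Wachten op Godot (eerste bedrijf)",
    "En attendant Godot",
    "Warten auf Godot",
]

# Exact string comparison: only identical titles link up
exact_matches = set(opac_titles) & set(tin_titles)


def normalise(title: str) -> str:
    """Hypothetical rule: drop a subtitle after ' : ' and ignore case."""
    return title.split(" : ")[0].casefold()


# Normalised comparison: more matches, but still one match per language version
normalised_matches = {normalise(t) for t in opac_titles} & {normalise(t) for t in tin_titles}

print(sorted(exact_matches))
print(sorted(normalised_matches))
```

Even with normalisation, the English, French and Dutch versions remain separate entries. String clean-up reduces the noise, but it does not tell us that all these titles are the same play.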
In a library catalogue environment we are used to this problem, because catalogues are used for describing physical objects in the form of editions and copies. Unfortunately, the Theatre Production database likewise contains only records describing productions of a specific ‘edition’ or translation of a play, with only the opening performance information attached.
This is where I need to talk about FRBR. Basically in a library catalogue environment this means that we should describe the relations between the ‘work’ (original text), the ‘expression’ (the version or translation), the ‘manifestation’ (edition, format, etc.) and the ‘items’ (the physical copies). Via the relations with higher level expression and work, the physical copy could be linked to the unifying work level, and then ideally through some universally valid unique identifier to, in our case, the theatre plays.
Although FRBR is a publication centered schema used only in libraries, the same concepts can be applied to theatre performances: the original work (which is the same as the work in a bibliographical sense) has expressions (adaptations, translations, etc.), which have manifestations (productions), and in the end the individual items (actual performances on a specific date, time and location).
If both the library catalogue and the theatre production database were FRBRised, we could in theory link on the Work level and cover all individual versions. But we would still need a matching mechanism on that Work level of course.
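As a thought experiment, linking on the Work level could look like the sketch below. It assumes that both databases were FRBRised and that every version carries a shared Work identifier. The identifier "work:godot", the class names and the language codes are all hypothetical; neither system has anything like this today.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    work_id: str  # hypothetical shared identifier, not present in either system
    label: str


@dataclass(frozen=True)
class Expression:
    work: Work
    language: str
    title: str
    source: str  # "catalogue" or "theatre"


godot = Work("work:godot", "Waiting for Godot (Beckett)")

expressions = [
    Expression(godot, "en", "Waiting for Godot : a tragicomedy in two acts", "catalogue"),
    Expression(godot, "fr", "En attendant Godot : pièce en deux actes", "catalogue"),
    Expression(godot, "nl", "Wachtend op Godot", "theatre"),
    Expression(godot, "de", "Warten auf Godot", "theatre"),
]


def link_on_work_level(items):
    """Group titles from both sources under the Work they belong to."""
    linked = defaultdict(lambda: {"catalogue": [], "theatre": []})
    for expression in items:
        linked[expression.work.work_id][expression.source].append(expression.title)
    return dict(linked)


links = link_on_work_level(expressions)
print(links["work:godot"])
```

None of these four titles match as strings, yet they all end up under the same Work. The hard part, assigning that shared identifier in the first place, is of course left out of the sketch.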
In reality however we can only try to link on the Manifestation level in an imperfect way.
At the moment, in our project, on the catalogue side we extract the title and author from the generated OPAC HTML. It could be an option to get available linking information from the internal MARC records (like the 240, 246, 765, 767, 775 tags), but that is not easy for a number of reasons. Something similar could be done in the theatre production database, making implicit links explicit. But all this makes the effort to get something sensible out there much bigger.
The third problem, the literal strings in Dutch both in the library catalogue and in the theatre production database, prevents the effective use of the data in multilingual environments, equally in the traditional native interfaces and as linked data. Obviously for English speaking end users the Dutch terms mean nothing. And in a linked data environment the Dutch strings can’t easily be linked to other data, in Dutch, English, or any language, without unique identifiers.
Implicit to explicit
People calling on institutions to publish their data as linked open data tend to say it’s easy once you know how to do it. And of course it must be done. But if the published datasets have a flat internal structure designed to fulfill a specific local business objective, then they just don’t provide sufficient added value for third party use. In order to make your published open data useful for others, you have to make implicit relations explicit. And this requires something more than just making the data available in RDF ‘as is’: it requires a lot of processing.
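To give an idea of the kind of processing involved, here is a minimal sketch that takes the Dutch ‘description’ strings from the production example above and splits them into a name, a role and an optional detail. The regular expression is guessed from the handful of examples shown in this post, and the role identifiers in ROLE_IDS are hypothetical; a real solution would need an agreed vocabulary.

```python
import re

# Pattern guessed from the examples: "Name (role)" or "Name (role: detail)"
DESCRIPTION = re.compile(r"^(?P<name>.+?) \((?P<role>[^:()]+)(?:: (?P<detail>[^()]+))?\)$")

# Hypothetical language-neutral identifiers for a few of the Dutch role strings
ROLE_IDS = {
    "auteur": "role:author",
    "regie": "role:director",
    "uitvoerende": "role:performer",
}


def parse_description(text: str) -> dict:
    """Turn one literal description string into an explicit relation."""
    match = DESCRIPTION.match(text)
    if match is None:
        raise ValueError(f"unrecognised description: {text!r}")
    role = match.group("role")
    return {
        "name": match.group("name"),
        "role": role,
        "role_id": ROLE_IDS.get(role),  # None when the role is not mapped yet
        "detail": match.group("detail"),
    }


print(parse_description("Weeks, David (uitvoerende: Estragon)"))
print(parse_description("Hartnett, John (regie)"))
```

Even this small step only works as long as the strings follow the same pattern, and the person name is still a string, not an identifier.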
Posted on December 21st, 2010 43 comments
Mobile services have to fulfill information needs here and now
Like many other libraries, the Library of the University of Amsterdam released a mobile web app this year. For background information about why and how we did it, have a look at the slideshow my colleague Roxana Popistasu and I gave at the IGeLU 2010 conference.
For now I want to have a closer look at the actual reception and use of our mobile library services and draw some conclusions for the future. I have expressed some expectations earlier about mobile library services in my post “Mobile library services”. In summary, I expected that the most valued mobile library services would be of a practical nature, directly tied to the circumstances of internet access ‘any time, anywhere’, and would not include reading and processing of electronic texts.
Let me emphasise that I define mobile devices as smart phones and similar small devices that can be carried around literally any time anywhere, and that need dedicated apps to be used on a small touchscreen. So I am not talking about tablets like the iPad, which are large enough to be used with standard applications and websites, just like netbooks.
As you can see, most, if not all of the services in the Library of the University of Amsterdam mobile app are of a practical nature: opening hours, locations, contact information, news. And of course there is a mobile catalogue. This is the general situation in mobile library land, as has been described by Aaron Tay in his blog post “What are mobile friendly library sites offering? A survey”.
In my view these practical services are not really library services. They are learning or study centre services at best. There is no difference with practical services offered by other organisations like municipal authorities or supermarkets. Nothing wrong with that of course, they are very useful, but I don’t consider these services to be core library services, which would involve enabling access to content.
Real mobile devices are simply too small to be used for reading and processing large bodies of scholarly text. This might be different for public libraries. Their customers may appreciate being able to read fiction on their smart phones, provided that publishers allow them to read ebooks via libraries at all.
Even a mobile library catalogue can be considered a practical service intended to fulfill practical needs of a physical nature, like finding and requesting print books and journals to be delivered to a specific location and renewing loans to avoid paying fines. Let’s face it: an Integrated Library System is basically nothing more than an inventory and logistics management system for physical objects.
Usage statistics of the Library of the University of Amsterdam mobile web app show that between the launch in April and November 2010 the number of unique visits hovers around 30 per day on average, with a couple of peaks (350) on two specific days in October. The full website shows around 6000 visits per day on normal weekdays.
For the mobile catalogue this is between 30 and 50 visits per day. The full OPAC shows around 3000 visits on normal weekdays.
In November we see a huge increase in usage. Our killer mobile app was introduced: an overview of currently available workstations per location. The number of unique visits rises to between 300 and 400 a day. The number of pageviews rises from under 100 per day to around 1000 on weekdays in November. The ‘available workstations’ service accounts for 80% of these. In December 2010, an exam period, these figures rise to around 2000 pageviews per day, with 90% for the ‘available workstations’ service.
We can safely conclude that our students are mainly using our mobile library app on their smart phones to locate the nearest available desktop PC.
Mobile users expect services that are useful to them here and now.
What does this mean for core library services, aimed at giving access to content, on small mobile devices? I think that there is no future for providing mobile access on smart phones to traditional library content in digital form: electronic articles and ebooks. I agree with Aaron Tay when he says “I don’t believe there is any reason to think that it will necessarily lead to high demand for library mobile services” in his post “A few heretical thoughts about library tech trends“.
Rather, mobile services should provide information about specific subjects useful to people here and now.
In the near future anybody interested in a specific physical object or location will have access via their location aware smart phones and augmented reality to information of all kinds (text, images, sound, video, maps, statistics, etc.) from a number of sources: museums, archives, government agencies, maybe even libraries. To make this possible it is essential that all these organisations publish their information as linked open data. This means: under an open license using a generic linked data standard like RDF.
I expect that consumers of this new type of mobile location based augmented linked information would appreciate some guidance in the possibly overwhelming information landscape, in the form of specific views, with preselection of information sources and their context taken into account.
There may be an opportunity here for libraries, especially public libraries, taking on a new coordinating role as information brokers on the intersection of a large number of different information providers. Of course if libraries want to achieve that, they need to look beyond their traditional scope and invest more in new information technologies, services and expertise.
The future of mobile information services lies in the combination of location awareness, augmented reality and linked open data. Maybe libraries can help.
Posted on October 7th, 2010 6 comments
Linking library and cultural heritage data
“Interested to publishing a test collection as linked open data to help @StichtingDEN with practical guide for heritage institutions?” That’s what my former colleague at the Library of the University of Amsterdam, now project manager at DEN (Digital Heritage Foundation The Netherlands), Marco Streefkerk asked me in April 2010.
Was I interested? Of course I was. I had written a blog post “Linked data for libraries” almost a year before, and I had been very interested in the subject since then. Unfortunately in my day job at the Library of the University of Amsterdam (UBA) until very recently there was no opportunity to put my theoretical knowledge to practice. However, in the Library’s “Action plans 2010-2011” (January 2010), the Semantic Web is mentioned in the Innovation chapter as one of the areas with room for a small pilot involving linked data and RDF. I like to think it was me who managed to get it in there.
To come back to Marco’s question, I was at the time actually trying to think of a linked data/RDF test, and it so happened that I had talked to Ad Aerts of the Theater Institute of The Netherlands (TIN) about organising such a test the day before! So that’s what I told Marco. And things started from there.
The first idea was to publish a small test set of records from one of the University Library’s own heritage collections. The goal from the point of view of DEN was to publish a short practical guide on how to publish heritage collections as linked data, targeted at heritage institutions.
But after some email discussions and meetings we decided to incorporate TIN in this test and apply both sides of the linked data concept: publish linked data and use linked data.
Apart from a library catalogue, TIN also has a large database containing metadata on theater performances and a large collection of audiovisual material related to these performances. The plan was to publish the performance metadata and related digital material as linked data.
The UBA would then use this TIN linked data in their traditional MARC based OPAC to enrich the plain bibliographic metadata if the OPAC search results related to theater plays.
We decided to name our little proof of concept project “Dutch Culture Link”. The people involved for DEN are Marco Streefkerk, Annelies van Nispen and Monika Lechner. For TIN it’s Ad Aerts. For UBA: Roxana Popistasu and myself. Of these five people I knew four already face to face and one (Monika) on Twitter. I think this helps.
To start with, we described the data model of the TIN Productions and Performances database (in terms of relationships or triples) as follows:
- a Play is written by one or more Persons (as author)
- a Play can be ‘effectuated’ in one or more Productions
- a Production can be ‘staged’ in one or more Performances
- a Performance takes place in one Venue on a specific date and time
- a Person can be producer of a Production
- a Person can be director of a Production
- a Person can play a character in a Production, or even in an individual Performance
Besides the metadata TIN also has links from the database to digital collections (sound and video recordings, photographs, reviews). The model is strikingly similar to the bibliographic FRBR model. The Play is a FRBR Work, the Production is a FRBR Expression and/or Manifestation, the Performance is a FRBR Item.
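To make the ‘relationships or triples’ idea a bit more concrete, the model above can be written down as a list of subject, predicate, object statements. This is just a sketch: the identifiers (play:1, person:1, etc.) and the predicate names are made up for illustration, since choosing a real vocabulary is one of our open questions.

```python
# Each statement is a (subject, predicate, object) triple.
# All identifiers and predicate names are hypothetical placeholders.
TRIPLES = [
    ("play:1", "writtenBy", "person:1"),
    ("play:1", "effectuatedIn", "production:1"),
    ("production:1", "stagedIn", "performance:1"),
    ("performance:1", "takesPlaceIn", "venue:1"),
    ("person:2", "directorOf", "production:1"),
]


def objects(subject: str, predicate: str) -> list:
    """Follow a relationship forwards: what does this subject point to?"""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate: str, obj: str) -> list:
    """Follow a relationship backwards: what points to this object?"""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


print(objects("play:1", "effectuatedIn"))        # productions of the play
print(subjects("directorOf", "production:1"))    # directors of the production
```

The point of the triple form is that every relationship is explicit and can be followed in both directions, from Play to Production to Performance and back.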
Now we knew who and what, but not yet how. We needed to know how to actually apply the theoretical concepts of linked data to our subject area. Questions we had were:
- which ontology/vocabulary (‘data model’) do we need for publishing the production data?
- how to format URIs (the linked data unique identifiers)
- how do we implement RDF?
- which publication techniques and platforms do we use?
- which scripting languages can we use?
- how do we find and get the published linked data?
- how do we process and present the retrieved linked data?
We definitely needed some practical hands-on tutorials or training. We could not find an institution organising practical linked data training courses in The Netherlands at short notice. Via Twitter Ian Davis referred us to their TALIS training options. Unfortunately, because we are only an informal proof of concept pilot project without any project funding, we were unable to proceed on this track.
However, through a contact at The European Library we managed to enter two members of our project team as participants in the free Linked Data Workshop at DANS in The Hague, with Herbert Van De Sompel, Ivan Herman and Antoine Isaac as trainers. This workshop proved to be very useful. Unfortunately I could not attend myself.
After the workshop we decided to adopt an “agile” approach: just start and proceed with small steps. For the short term this meant on the TIN side: implementing a script that accesses the XML gateway of the Adlib system underlying the Theater Production Database and produces results in JSON format. The script accepts as input URIs of the form <baseurl>/person/<name>, <baseurl>/play/<person>/<title>, etc. For now only the <baseurl>/person/<name> works, but there are more to come.
An example: the request <baseurl>/person/joost van den vondel gives the JSON result:
"key":"vondel, joost van den",
"name":"Vondel, Joost van den",
"birth.date":"17 november 1587*",
"death.date":"5 februari 1679*",
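How a script might recognise these URI forms can be sketched as follows. This is not the TIN script itself, which talks to the Adlib XML gateway; it only shows the path handling, and the function name and the returned dictionary layout are assumptions.

```python
from urllib.parse import unquote


def parse_request_path(path: str) -> dict:
    """Split a request path like /person/<name> or /play/<person>/<title>."""
    # Decode percent-encoding so "joost%20van%20den%20vondel" also works
    parts = [unquote(part) for part in path.strip("/").split("/")]
    if len(parts) == 2 and parts[0] == "person":
        return {"type": "person", "name": parts[1]}
    if len(parts) == 3 and parts[0] == "play":
        return {"type": "play", "person": parts[1], "title": parts[2]}
    raise ValueError(f"unsupported URI form: {path!r}")


print(parse_request_path("/person/joost van den vondel"))
print(parse_request_path("/play/Beckett, Samuel/Waiting for Godot"))
```

The parsed request would then be turned into a query on the underlying database, and the result serialised as JSON like the example above.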
Next steps in this project:
- select/adapt/create a vocabulary for the Production/Performance subject area
- select/adapt/create vocabularies for Persons (FOAF?) and Subjects (SKOS?)
- add internal relationships with the other entities (Play, Production, etc.) in the JSON structure (implement RDF in JSON)
- add RDF/XML as output option, besides JSON
- add external relationships (to other linked data sources like DBPedia, etc.)
- extend the number of possible URI formats (for Play, Production, etc.)
- add content negotiation to serve both human and machine readable redirects
- extend the options on the OPAC side
- publish UBA bibliographic data as linked open data (probably an entirely new project)
To be continued…
Posted on July 8th, 2010 2 comments
Meeting new user expectations at ELAG 2010
In the near future libraries and librarians will be very different from what they are now. That’s the overall impression I took away from the ELAG 2010 conference in Helsinki, June 8-11, 2010. ELAG stands for “European Library Automation Group”, which is an indication of its age (34 years): “automation” was then what is now “ICT”. The meetings are characterised by a combination of plenary presentations and parallel workshops.
This year’s theme was “Meeting new users’ expectations”, where the term “users” refers to “end users”, “customers” or “patrons”, as library customers are also called. When you hear the phrase “end user expectations” in relation to library technology you first of all think of front end functionality (user interfaces and services) and the changing experiences there. A number of presentations and workshops were indeed focused on user experience and user studies.
Keywords: discovery, guidance, knowing/engaging users, relevance ranking, context.
But a considerable number of sessions, maybe even the majority, were dedicated to backend technology and systems development.
Keywords: webservices, API, REST, JSON, XML, Xpath, SOLR, data wells, aggregation, identifiers, FRBR, linked data, RDF.
It is becoming ever more obvious that improving libraries’ digital user experience cannot be accomplished without proper data infrastructures and information systems and services. This is directly related to the shift of existing library traditions to the new web experience, which was the leading topic of the presentation given by Rosemie Callewaert and myself: “Discovering the library collections”. We are experiencing a move from closed local physical collections to open networked digital information.
First of all, library collections will be digital. If you don’t believe that, look at the music industry. The recording of stories started 5000 years ago already. The first music recordings only date from the 19th century.
Next, collections will be networked, interlinked and virtual. Data, metadata, and digital objects will be fetched from all kinds of databases on the web, not only traditional bibliographic metadata from library catalogues, and mixed into new result sets, using mashup or linked data techniques.
In this open digital environment, existing and new library systems and discovery tools simply cannot incorporate all possible data services available now and in the future. That is why libraries (or maybe we should start saying ‘information brokers’) MUST have ‘developer skills’ in one form or another. This can range from building your own data wells and discovery tools on one end to using existing online service builders for enriching third party frontends on the other, and everything in between, with different levels of skills required.
Another inevitable development in this open information environment is “cooperation” in all kinds of areas with all kinds of partners in all kinds of forms. Cooperation in development, procurement, hosting and sharing of software (systems, services) and aggregation of data, with libraries, museums, archives, educational institutions, commercial partners, etc.
Last but not least there is the question of the value of the physical library building in the digital age. A number of people stress the importance of libraries as places where students like to come to study. But being a learning center in my view is not part of the core business of a library, which is providing access to information. In pre-digital times it was obviously a natural and necessary thing to study information at the location of the physical collection. But this direct physical link between access to and processing of information does not exist anymore in an open digital information environment.
Back to the ELAG 2010 theme “Meeting new users’ expectations”. In the last slide of our presentation we asked the question “Can LIBRARIES meet new user expectations?” Because we did not have time to discuss it then and there, I will answer it here: “No, not libraries as they are now!”.
New users don’t expect libraries, they expect information services. Libraries were once the best way of providing access to information. Instead of taking the defensive position of trying to secure their survival as organisation (as is the natural aspiration of organisations) libraries should focus on finding new ways of achieving their original mission. This may even lead to the disappearance of libraries, or rather the replacement of the library organisation by other organisational structures. This may of course vary between types of libraries (public, academic, special, etc.).
We may need to redefine the concept of library from “the location of a physical collection” to “a set of information services administered by a group of specialists”.
To summarise: the new digital and networked nature of collections of information leads to a focus on new information services, supported by library staff with information and technology skills, in new organisational structures and in cooperation with other organisations.