Data. The final frontier.
  • Ten years linked open data

    Posted on June 4th, 2016 Lukas Koster 22 comments

    This post is the English translation of my original article in Dutch, published in META (2016-3), the Flemish journal for information professionals.


    Ten years after the term “linked data” was introduced by Tim Berners-Lee it appears to be time to take stock of the impact of linked data for libraries and other heritage institutions in the past and in the future. I will do this from a personal historical perspective, as a library technology professional, systems and database designer, data infrastructure specialist, social scientist, internet citizen and information consumer.

    Linked data is a set of universal methods for connecting information from multiple web sources in order to generate enriched or new information and prevent information redundancy and ambiguity. This is achieved by describing information as “triples” (relationships between two objects) in RDF (Resource Description Framework), in which both objects and relationships are represented as URIs (Uniform Resource Identifiers), which point to definitions of these on the web. The object’s type and attributes can also be represented as triples.
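
    As a minimal sketch of what such triples look like in practice, here is an illustration using the Python rdflib library with made-up example.org URIs (hypothetical identifiers, purely for illustration, not real ones):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, FOAF

EX = Namespace("http://example.org/")    # hypothetical namespace, for illustration only

g = Graph()
book = EX["work/godot"]                  # made-up URIs, not real identifiers
person = EX["person/beckett"]

# The object's type, an attribute, and a relationship to another object, all as triples
g.add((book, RDF.type, EX.TheatreText))
g.add((book, EX.title, Literal("Waiting for Godot", lang="en")))
g.add((book, EX.author, person))
g.add((person, FOAF.name, Literal("Samuel Beckett")))

print(g.serialize(format="turtle"))
```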

    “Open data” means that the information concerned actually can and may be used.

    In hindsight, the concept of “linked data” came too early for the library and heritage world in general. The majority of libraries, particularly public libraries, at that time simply did not possess the context and expertise to do something meaningful with it. Only larger institutions with sufficient expertise, technical staff and funding were capable of executing linked data pilot projects and implementing linked data services, such as national libraries, scientific institutes, library consortia and renowned heritage institutions. Furthermore, many institutions are dependent on external system, database and content providers. It is only in the last couple of years (roughly since 2014) that influential organisations in the international library and heritage world have seriously begun exploring linked data. These include large commercial system vendors like OCLC and Ex Libris, and national and regional umbrella organisations like national libraries and library consortia.

    The first time I used the term “linked data” myself is documented on the web, in a blog post dated June 19, 2009, with the title ‘Linked Data for Libraries’, already in reference to libraries. The main assertion of my argument was “data is relationships”, which still holds in full. The gist of my story was rather optimistic, focusing on a couple of technical and modelling aspects (URIs, RDF, ontologies, content negotiation, etc.) for which there simply seemed to be a number of solutions at hand. In practice, however, these technical and modelling aspects turned out to be the subject of much discussion among linked data theorists and evangelists. Because of theoretical discussions like these, however necessary, consensus on standards and best practices is usually not reached very swiftly, which in turn also delays the development of universal and practical applications.

    At that time I already worked at the Library of the University of Amsterdam (UvA), in charge of a number of library systems. I had, however, already applied the concepts underlying linked data years before that, even before the term “linked data” existed, to be precise in the period 2000-2002 at the former NIWI (Netherlands Institute for Scientific Information Services), in collaboration with my colleague Niek van Baalen. Essentially we are dealing here with nothing more than very elementary and universal principles that can make life a lot easier for system and database designers. Our basic premise was that everything to be described was a thing or an object with a unique ID, to which a type or concept was assigned, such as ‘person’, ‘publication’, ‘organisation’ etc. Depending on the type, the object could have a number of attributes (such as ‘name’, ‘start date’, etc.) and relationships with other objects. The objects could be denoted with various textual labels in specific languages. All of this was implemented in an independent relational database, with a fully decoupled web frontend based on object oriented software as a middle layer. This approach was a logical answer to the problem of integrating the various databases and information systems of the six former institutes of the KNAW (Dutch Royal Academy of Science) that constituted NIWI [See: http://www.slideshare.net/lukask/concepts-and-relations-2595603 and http://www.niekvanbaalen.net/swiftbox/].
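
    A rough reconstruction of that generic model, sketched here in Python with SQLite purely for illustration (the original NIWI implementation used its own relational schema with an object oriented middle layer, not this code):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE concept   (id INTEGER PRIMARY KEY, name TEXT);                 -- 'person', 'publication', ...
CREATE TABLE object    (id INTEGER PRIMARY KEY, concept_id INTEGER REFERENCES concept(id));
CREATE TABLE attribute (object_id INTEGER, name TEXT, value TEXT);          -- 'name', 'start date', ...
CREATE TABLE relation  (subject_id INTEGER, name TEXT, object_id INTEGER);  -- links between objects
CREATE TABLE label     (object_id INTEGER, language TEXT, text TEXT);       -- multilingual labels
""")

# A person and a publication, connected by an 'author' relationship
con.execute("INSERT INTO concept VALUES (1, 'person'), (2, 'publication')")
con.execute("INSERT INTO object VALUES (10, 1), (20, 2)")
con.execute("INSERT INTO attribute VALUES (10, 'name', 'Samuel Beckett')")
con.execute("INSERT INTO label VALUES (20, 'en', 'Waiting for Godot'), (20, 'fr', 'En attendant Godot')")
con.execute("INSERT INTO relation VALUES (20, 'author', 10)")
```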

    Unfortunately both our concept-relational approach and NIWI were premature. The ideas on system independent concepts and relationships did not fall on fertile ground, and the time was not yet right for an interdisciplinary scientific institute. Out of the former NIWI the current Dutch Data Archiving Institute DANS arose, which continues the activities of the former Steinmetz Institute and the Dutch Historical Data Archive. One of the main areas of research for DANS nowadays is linked data.

    Anyway, when I first learned about the concept of linked data in 2009, I was immediately converted. In 2010 I had the opportunity to carry out a linked data pilot in collaboration with Ad Aerts of the former Theatre Institute of the Netherlands (TIN) and my UvA colleague Roxana Popistasu, in which theatre texts in the UvA Aleph OPAC were enriched with related information about performances of the play in question from the TIN Adlib Theatre Productions Database. The objective of this pilot was to show the added value of enriching search results via linked data with relevant information from other databases, while at the same time exposing bottlenecks in the data used. In particular the lack of universally used identifiers for objects, people and subjects appeared at that time to be a barrier to successfully implementing linked data.

    Example theatre linked data pilot: Waiting for Godot

    2010 was also the year that I first attended the SWIB conference (Semantic Web In Libraries). As it was only the second time the conference was organised, SWIB was still a largely German language meeting for a predominantly German audience. In the meantime SWIB has developed into one of the most important international linked open data conferences, held completely in English. Attending linked data conferences like SWIB often generates mixed feelings. On the one hand the discussions and the projects presented are a source of motivation, on the other hand they also give rise to frustration, because after returning to your own place of work it becomes clear once more that what large institutions can do in projects is not possible in everyday life. It is particularly the dependence on system providers that makes it difficult for libraries to implement linked data. In the theatre play pilot with the Ex Libris Aleph library system mentioned before, it was only possible to use JavaScript add-ons in the user interface HTML pages, not to adjust the internal system architecture or the international bibliographic MARC standard.

    This vendor dependence was the immediate motive for establishing the Linked Open Data Special Interest Working Group (LOD SIWG) within IGeLU, the International Group of Ex Libris Users. This group’s objective was and is to convince the global library systems provider Ex Libris to implement linked data options in their systems. Some effort was needed to make Ex Libris appreciate the value of this, but after five years the company has officially initiated a “Linked Data Collaboration Program”, in which the Library of the University of Amsterdam is a partner. Besides the LOD SIWG activities, parallel developments in the library world have of course contributed to this as well, such as the Library of Congress BIBFRAME project and the linked data activities of competitor OCLC.

    The BIBFRAME project is concerned with storing bibliographic data as linked data in RDF, replacing the international bibliographic MARC format. OCLC is primarily focused on publishing WorldCat and authority information as linked data through URIs and enhancing its findability in search engines like Google through schema.org. Storing linked data should in principle utilize information published as linked data elsewhere, especially authority files such as VIAF and the LoC vocabularies.

    BIBFRAME basic schema

    Consuming data published elsewhere is of course the actual goal of implementing linked data, in particular for the purpose of presenting end users with additional relevant information about topics they are interested in, without the need to execute similar searches in other systems. Academic libraries for example are increasingly developing an interest in presenting research output not only in the form of scholarly publications, but also in the form of related information about research projects, research data, procedures, networks, etc.
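
    As an illustration of what such consumption can look like, the sketch below dereferences a single external URI with the Python rdflib library and prints a few of the statements it returns. It assumes the remote service (DBpedia is used here only as an example) still serves RDF via content negotiation; the URI is merely illustrative.

```python
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

# Example resource; any URI that serves RDF via content negotiation would do
uri = URIRef("http://dbpedia.org/resource/Samuel_Beckett")

g = Graph()
g.parse(str(uri))  # fetches and parses the RDF served for this URI

# An English label plus a handful of statements, e.g. for display next to a search result
for label in g.objects(uri, RDFS.label):
    if getattr(label, "language", None) == "en":
        print("Label:", label)

for predicate, obj in list(g.predicate_objects(uri))[:10]:
    print(predicate, "->", obj)
```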

    In 2012-2013, in this context, I carried out a pilot linking scholarly publications, harvested from the UvA institutional repository and loaded into the UvA Primo discovery index, to related information in the Dutch national research information repository NARCIS, which has for a number of years been managed by the previously mentioned DANS. In NARCIS a limited subset of “Enhanced Publications” is available, in which all available research information is connected. These publications can also be retrieved as linked data/RDF. Unfortunately the only workable result of this test was adding an external link to author information in NARCIS. Processing of URIs and linked data was and is not yet available in Primo. But this is going to change now with the aforementioned Ex Libris Linked Data Collaboration Program.

    Example of NARCIS Enhanced Publications

    However, even if one has access to software that is targeted at storing and processing linked data and RDF, that does not suffice to actually tie together information from multiple sources. This was the outcome of another UvA pilot in the area of linked data and research information, using the open source linked data research information tool VIVO. This pilot showed that the data available in the internal university research information system was not of sufficient quality and completeness. The objective of registering research information had always been limited to monitoring and publishing research output in an optimal way, mainly in the form of scholarly publications.

    In 2016 the odds appear to be steadily turning in favour of a broader application of linked data in libraries and other heritage institutions, in any case in my own experience. The Library of the University of Amsterdam is a partner in the Ex Libris Linked Data Collaboration Program Discovery Track. And the term “linked data” appears more and more in official library policy documents.

    Looking back on ten years of linked data and libraries one can conclude that successful implementation depends on the state of affairs in the full heritage information processing ecosystem. In this respect five preconditions within individual organisations are of importance: business case, tools, data, workflow and lifecycle.

    Business case: an organisation always requires a business case for applying linked data. It is not a goal in its own right. For instance plans may exist for providing new services or improving efficiency in existing tasks for which linked data can be employed. For example presenting integrated research information, providing background information about the creation of works of art, or simply eliminating redundant information in multiple databases.

    Tools: the software used must be suited for linked data: publishing RDF, maintaining a SPARQL endpoint, processing external linked data through URIs, and storing data in a triple store. In the case of homegrown software, specialised expertise is required; for third party software it must be provided by the vendors.
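
    To give an idea of what consuming data through a SPARQL endpoint involves, here is a minimal sketch in Python using the SPARQLWrapper library against the public Wikidata endpoint. The endpoint and query are illustrative assumptions only and have no relation to any particular library system.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Public endpoint, used here purely as an example of a SPARQL service
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setReturnFormat(JSON)

# Find works whose author (Wikidata property wdt:P50) is labelled "Samuel Beckett"
sparql.setQuery("""
    SELECT ?work ?workLabel WHERE {
      ?author rdfs:label "Samuel Beckett"@en .
      ?work wdt:P50 ?author .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    LIMIT 10
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["work"]["value"], "-", row["workLabel"]["value"])
```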

    Data: internal and external data must be available and suitable for publishing and consuming as linked data. The local information architecture and interoperability require profound attention. Excessive focus on individual systems with closed databases prohibits this.

    Workflow: working procedures must be adapted to the processing of linked data. Existing working procedures are targeted at existing objectives, functionality and systems. Because all that changes with implementing linked data, procedures, jobs and the division of tasks will have to be adapted too. Particularly the use, continuity and reliability of internal and external linked data sources will have to be taken into account.

    Lifecycle: new tools, data infrastructures and workflows will have to be secured in the organisation for the long term. It is important to adhere to existing standards and best practices, and to participate in collaboratives like open source communities, library consortia and user groups, if possible.

    For the coming years I expect a number of standards and initiatives in the realm of linked data to reach maturity, which will enable individual libraries, archives and museums to get involved when they have practical implementations in mind, such as the aforementioned new services or efficiency improvements.

  • Analysing library data flows for efficient innovation

    Posted on November 27th, 2014 Lukas Koster 1 comment

    In my work at the Library of the University of Amsterdam I am currently taking a step forward by actually taking a step back from a number of forefront activities in discovery, linked open data and integrated research information towards a more hidden, but also more fundamental enterprise in the area of data infrastructure and information architecture. All for a good cause, for in the end a good data infrastructure is essential for delivering high quality services in discovery, linked open data and integrated research information.
    In my role as library systems coordinator I have become more and more frustrated with the huge amounts of time and effort spent on moving data from one system to another and shoehorning one record format into the next, only to fulfill the necessary everyday services of the university library. Not only is it not possible to invest this time and effort productively in innovative developments, but this fragmented system and data infrastructure is also completely unsuitable for fundamental innovation. Moreover, information provided by current end user services is fragmented as well. Systems are holding data hostage. I have mentioned this problem before in a SWIB presentation. The issue was also recently touched upon in an OCLC Hanging Together blog post: “Synchronizing metadata among different databases” .

    Fragmented data (SWIB12)

    In order to avoid confusion in advance: when using the term “data” here, I am explicitly not referring to research data or any other specific type of data. I am using the term in a general sense, including what is known in the library world as “metadata”. In fact this is in line with the usage of the term “data” in information analysis and system design practice, where data modelling is one of the main activities. Research datasets as such are to be treated as content types like books, articles, audio and people.

    It is my firm opinion that libraries have to focus on making their data infrastructure more efficient if they want to keep up with the ever changing needs of their audience and invest in sustainable service development. For a more detailed analysis of this opinion see my post “(Discover AND deliver) OR else – The future of the academic library as a data services hub”. There are a number of different options to tackle this challenge, such as starting completely from scratch, which would require huge investments in resources for a long time, or implementing some kind of additional intermediary data warehouse layer while leaving the current data source systems and workflows in place. But for all options to be feasible and realistic, a thorough analysis of a library’s current information infrastructure is required. This is exactly what the new Dataflow Inventory project is about.

    The project is being carried out within the context of the short term Action Plans of the Digital Services Division of the Library of the University of Amsterdam, and specifically the “Development and improvement of information architecture and dataflows” program. The goal of the project is to describe the nature and content of all internal and external datastores and dataflows between internal and external systems in terms of object types (such as books, articles, datasets, etc.) and data formats, thereby identifying overlap, redundancy and bottlenecks that stand in the way of efficient data and service management. We will be looking at dataflows in both front and back end services for all main areas of the University Library: bibliographic, heritage and research information. Results will be a logical map of the library data landscape and recommendations for possible follow up improvements. Ideally it will be the first step in the Cleaning-Reconciling-Enriching-Publishing data chain as described by Seth van Hooland and Ruben Verborgh in their book “Linked Data for Libraries, Archives and Museums”.

    The first phase of this project is to decide how to describe and record the information infrastructure in such a form that the data map can be presented to various audiences in a number of ways, and at the same time can be reused in other contexts in the long run, for instance for designing new services. For this we need a methodology and a tool.

    At the university library we do not have any thorough experience with describing an information infrastructure on an enterprise level, so in this case we had to start with a clean slate. I am not at all sure that we came up with the right approach in the end. I hope this post will trigger some useful feedback from institutions with relevant experience.

    Since the initial and primary goal of this project is to describe the existing infrastructure instead of a desired new situation, the first methodological area to investigate appears to be Enterprise Architecture (interesting to see that Wikipedia states “This article appears to contain a large number of buzzwords”). Because it is always better to learn from other people’s experiences than to reinvent all four wheels, we went looking for similar projects in the library, archive and museum universe. This proved to be rather problematic. There was only one project we could find that addresses a similar objective, and I happened to know one of the project team members. The Belgian “Digital library system’s architecture study” (English language report here) was carried out for the Flemish Public Library network Bibnet, by Rosemie Callewaert among others. Rosemie was kind enough to talk to me and explain the project objectives, approaches, methods and tools used. For me, two outcomes of this talk stand out: the main methodology used in the project is Archimate, which is an Enterprise Architecture methodology, and their approach is the complete opposite of ours: it starts from the functional perspective, whereas we start from an overview of the actually implemented infrastructure. This last point meant we were still looking at a predominantly clean slate.
    Archimate also turned out to be the method of choice used by the University of Amsterdam central enterprise architecture group, which we also contacted. It became clear that in order to use Archimate efficiently, it is necessary to spend a considerable amount of time on mastering the methodology. We looked for some accessible introductory information to get started. However, the official Open Group Archimate website is not as accessible as desired in more than one way. We managed to find some documentation anyway, for instance the direct link to the Archimate specification and the free document “Archimate made practical”. After studying this material we found that Archimate is a comprehensive methodology for describing business, application and technical infrastructure components, but we also came to the conclusion that for our current short term project presentation goals we needed something that could be implemented fairly soon. We will keep Archimate in mind for the future. If anybody is interested, there is a good free open source modelling tool available, Archi. Other Enterprise Architecture methodologies like Business Process Modelling focus more on workflows than on existing data infrastructures. Turning to system design methods like UML (Unified Modelling Language) we see similar drawbacks.

    An obvious alternative technique to consider is Dataflow Diagramming (DFD) (what’s in a name?), part of the Structured Design and Structured Analysis methodology, which I had used in previous jobs as systems designer and developer. Although DFD’s are normally used for describing functional requirements on a conceptual level, with some tweaking they can also be used for describing actual system and data infrastructures, similar to the Archimate Application and Infrastructure layers. The advantage of the DFD technique is that it is quite simple. Four elements are used to describe the flow of information (dataflows) between external entities, processes and datastores. The content of dataflows and datastores can be specified in more detail using a data dictionary. The resulting diagrams are relatively easy to comprehend. We decided to start with using DFD’s in the project. All we had left to do was find a good and not too expensive tool for it.

    Basic DFD structure

    There are basically two types of tools for describing business processes and infrastructures: drawing tools, focusing on creating diagrams, and repository based modelling tools, focused on reusing the described elements. The best known drawing tool must be Microsoft Visio, because it is part of their widely used Office suite. There are a number of other commercial and free tools, among which the free Google Drive extension Draw.io. Although most drawing tools cover a wide range of methods and techniques, they don’t usually support reuse of elements with consistent characteristics in other diagrams. Also, diagrams are just drawings: they cannot be used to generate data definition scripts or basic software modules, nor for reverse engineering or flexible reporting. Repository based tools can do all these things. Reuse, reporting, generating, reverse engineering and import and export features are exactly the features we need. We also wanted a tool that supports a number of other methods and techniques for employing in other areas of modelling, design and development. There are some interesting free or open source tools, like OpenModelSphere (which supports UML, ERD Data modelling and DFD), and a range of commercial tools. To cut a long story short we selected the commercial design and management tool Visual-Paradigm because it supports a large number of methodologies with an extensive feature set in a number of editions for reasonable fees. An additional advantage is the online shared teamwork repository.

    After acquiring the tool we had to configure it the way we wanted to use it. We decided to try and align the available DFD model elements to the Archimate elements so it would in time be possible to move to Archimate if that would prove to be a better method for future goals. Archimate has Business Service and Business Process elements on the conceptual business level, and Application Component (a “system”), Application Function (a “module”) and Application Service (a “function”) elements on the implementation level.

    Basic Archimate Structure

    In our project we will mainly focus on the application layer, but with relations to the business layer. Fortunately, the DFD method supports a hierarchical process structure by means of the decomposition mechanism, so the two hierarchical structures Business Service – Business Process and Application Component – Application Function – Application Service can be modeled using DFD. There is an additional direct logical link between a Business Process and the Application Service that implements it. By adding the “stereotypes” feature from the UML toolset to the DFD method in Visual Paradigm, we can effectively distinguish between the five process types (for instance by colour and attributes) in the DFD.

    Archimate DFD alignment

    So in our case, a DFD process with a “system” stereotype represents a top level Business Service (“Catalogue”, “Discover”, etc.) and a “process” process within “Cataloguing” represents an activity like “Describe item”, “Remove item”, etc. On the application level a “system” DFD process (Application Component) represents an actual system, like Aleph or Primo, a “module” (Application Function) a subsystem like Aleph CAT or Primo Harvesting, and a “function” (Application Service) an actual software function like “Create item record”.
    A DFD datastore is used to describe the physical permanent and temporary files or databases used for storing data. In Archimate terms this would probably correspond with a type of “Artifact” in the Technical Infrastructure layer, but that might be subject for interpretation.
    Finally an actual dataflow describes the data elements that are transferred between external entities and processes, between processes, and between processes and datastores, in both directions. In DFD, the data elements are defined in the data dictionary in the form of terms in a specific syntax that also supports optionality, selection and iteration, for instance:

    • book = title + (subtitle) + {author} + publisher + date
    • author = name + birthdate + (death date)

    etc.
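
    For illustration, the same two data dictionary terms could be expressed in, say, Python type hints, where the “( )” optionality maps to an optional field and the “{ }” iteration to a repeating one (my own mapping, not part of the DFD notation):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Author:
    name: str
    birthdate: str
    death_date: Optional[str] = None           # "( )": optional element

@dataclass
class Book:
    title: str
    publisher: str
    date: str
    subtitle: Optional[str] = None              # "( )": optional element
    authors: List[Author] = field(default_factory=list)  # "{ }": repeating element (iteration)
```
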
    In Archimate there is a difference in flows in the Business and Application layers. In the Business layer a flow can be specified by a Business Object, which indicates the object types that we want to describe, like “book”, “person”, “dataset”, “holding”, etc. The Business Object is realised as one or more Data Objects in the Application Layer, thereby describing actual data records representing the objects transferred between Application Services and Artifacts. In DFD there is no such distinction between business flows and dataflows. In our project we particularly want to describe business objects in dataflows and datastores to be able to identify overlap and redundancies. But besides that we are also interested in differences in the data structures used for similar business objects. So we do have to distinguish between business and data objects in the DFD model. In Visual-Paradigm this can be done in a number of ways. It is possible to add elements from other methodologies to a DFD with links between dataflows or datastores and the added external elements. Data structures like this can also be described in Entity Relationship Diagrams, UML Class Diagrams or even RDF Ontologies.
    We haven’t decided on this issue yet. For the time being we will employ the Visual Paradigm Glossary tool to implement business and data object specifications using Data Dictionary terms. A specific business object (“book”) will be linked to a number of different dataflows and datastores, but the actual data objects for that one business object can be different, both in content and in format, depending on the individual dataflows and datastores. For instance a “book” Business Object can be represented in one datastore as an extensive MARC record, and in another as a simple Dublin Core record.
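
    To make that last point concrete, the same “book” business object might surface as two quite different data objects (hypothetical, heavily simplified records):

```python
# Simplified MARC-style representation (field tags and subfield codes)
book_marc = {
    "100": {"a": "Beckett, Samuel, 1906-1989"},           # main entry - personal name
    "245": {"a": "Waiting for Godot", "c": "Samuel Beckett"},
    "260": {"b": "Faber and Faber", "c": "1956"},          # publication details
}

# Simple Dublin Core representation of the same business object
book_dc = {
    "dc:title": "Waiting for Godot",
    "dc:creator": "Samuel Beckett",
    "dc:publisher": "Faber and Faber",
    "dc:date": "1956",
}
```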

    Example bibliographic dataflows

    After having determined method, tool and configuration, the next step is to start gathering information about all relevant systems, datastores and dataflows and describing this in Visual Paradigm. This will be done by invoking our own internal Digital Services Division expertise, reviewing applicable documentation, and most importantly interviewing internal and external domain experts and stakeholders.
    Hopefully the resulting data map will provide so much insight that it will lead to real efficiency improvements and really innovative services.

  • Looking for data tricks in Libraryland

    Posted on September 5th, 2014 Lukas Koster No comments

    IFLA 2014 Annual World Library and Information Congress Lyon – Libraries, Citizens, Societies: Confluence for Knowledge


    After attending the IFLA 2014 Library Linked Data Satellite Meeting in Paris I travelled to Lyon for the first three days (August 17-19) of the IFLA 2014 Annual World Library and Information Congress. This year’s theme “Libraries, Citizens, Societies: Confluence for Knowledge” was named after the confluence or convergence of the rivers Rhône and Saône where the city of Lyon was built.

    This was the first time I attended an IFLA annual meeting and it was very much unlike all conferences I have ever attended. Most of them are small and focused. The IFLA annual meeting is very big (but not as big as ALA) and covers a lot of domains and interests. The main conference lasts a week, including all kinds of committee meetings, and has more than 4000 participants and a lot of parallel tracks and very specialized Special Interest Group sessions. Separate Satellite Meetings are organized before the actual conference in different locations. This year there were more than 20 of them. These Satellite Meetings actually resemble the smaller and more focused conferences that I am used to.

    A conference like this requires a lot of preparation and organization. Many people are involved, but I especially want to mention the hundreds of volunteers who were present not only in the conference centre but also at the airport, the railway stations, on the road to the location of the cultural evening, etc. They were all very friendly and helpful.

    Another feature of such a large global conference is that presentations are held in a number of official languages, not only English. A team of translators is available for simultaneous translations. I attended a couple of talks in French, without translation headset, but I managed to understand most of what was presented, mainly because the presenters provided their slides in English.

    It is clear that you have to prepare for the IFLA annual meeting and select in advance a number of sessions and tracks that you want to attend. With a large multi-track conference like this it is not always possible to attend all interesting sessions. In the light of a new data infrastructure project I recently started at the Library of the University of Amsterdam I decided to focus on tracks and sessions related to aspects of data in libraries in the broadest sense: “Cloud services for libraries – safety, security and flexibility” on Sunday afternoon, the all day track “Universal Bibliographic Control in the Digital Age: Golden Opportunity or Paradise Lost?” on Monday and “Research in the big data era: legal, social and technical approaches to large text and data sets” on Tuesday morning.

    Cloud Services for Libraries

    It is clear that the term “cloud” is a very ambiguous term and consequently a rather unclear concept. Which is good, because clouds are elusive objects anyway.

    In the Cloud Services for Libraries session there were five talks in total. Kee Siang Lee of the National Library Board of Singapore (NLB) described the cloud based NLB IT infrastructure consisting of three parts: a private, a public and a hybrid cloud. The private (restricted access) cloud is used for virtualization, an extensive service layer for discovery, content, personalization, and “Analytics as a service”, which is used for pushing and recommending related content from different sources and of various formats to end users. This “contextual discovery” is based on text analytics technologies across multiple sources, using a Hadoop cluster on virtual servers. The public cloud is used for the Web Archive Singapore project which is aimed at archiving a large number of Singapore websites. The hybrid cloud is used for what is called the Enquiry Management System (EMS), where “sensitive data is processed in-house while the non-sensitive data resides in the cloud”. It seems that in Singapore “cloud” is just another word for a group of real or virtual servers.

    In the talk given by Beate Rusch of the German Library Network Service Centre for Berlin and Brandenburg KOBV the term “cloud” meant: the shared management of data on servers located somewhere in Germany. KOBV is one of the German regional Library Networks involved in the CIB project targeted at developing a unified national library data infrastructure. This infrastructure may consist of a number of individual clouds. Beate Rusch described three possible outcomes: one cloud serving as a master for the others, a data roundabout linking the other clouds, and a cross cloud dataspace where there is an overlapping shared environment between the individual clouds. An interesting aspect of the CIB project is that cooperation with two large commercial library system vendors, OCLC and Ex Libris, is part of the official agreement. This is of interest for other countries that have vested interests in these two companies, like The Netherlands.

    Universal Bibliographic Control in the Digital Age

    The Universal Bibliographic Control (UBC) session was an all day track with twelve very diverse presentations. Ted Fons of OCLC gave a good talk explaining the importance of the transition from the description of records to the modeling of entities. My personal impression lately is that OCLC all in all has been doing a good job with linked data PR, explaining the importance and the inevitability of the semantic web for libraries to a librarian audience without using technical jargon like URI, ontology, dereferencing and the like. Richard Wallis of OCLC, who was at the IFLA 2014 Linked Data Satellite Meeting and in Lyon, is spreading the word all over the globe.

    Of the rest of the talks the most interesting ones were given in the afternoon. Anila Angjeli of the National Library of France (BnF) and Andrew MacEwan of the British Library explained the importance, similarities and differences of ISNI and VIAF, both authority files with identifiers used for people (both real and virtual). Gildas Illien (also one of the organizers of the Linked Data Satellite Meeting in Paris) and Françoise Bourdon, both BnF, described the future of Universal Bibliographic Control in the web of data, which is a development closely related to the topic of the talks by Ted Fons, Anila Angjeli and Andrew MacEwan.

    The ONKI project, presented by the National Library of Finland, is a very good example of how bibliographic control can be moved into the digital age. The project entails the transfer of the general national library thesaurus YSA to the new YSO ontology, from libraries to the whole public sector and from closed to open data. The new ontology is based on concepts (identified by URIs) instead of monolingual text strings, with multilingual labels and machine readable relationships. Moreover the management and development of the ontology is now a distributed process. On top of the ontology the new public online Finto service has been made available.

    The final talk of the day “The local in the global: universal bibliographic control from the bottom up” by Gordon Dunsire applied the “Think globally, act locally” aphorism to the Universal Bibliographic Control in the semantic web era. The universal top down control should make place for local bottom up control. There are so many old and new formats for describing information that we are facing a new biblical confusion of tongues: RDA, FRBR, MARC, BIBO, BIBFRAME, DC, ISBD, etc. What is needed are a number of translators between local and global data structures. On a logical level: Schema Translator, Term Translator, Statement Maker, Statement Breaker, Record Maker, Record Breaker. These black boxes are a challenge to developers. Indeed, mapping and matching of data of various types, formats and origins are vital in the new web of information age.


    Research in the big data era

    The Research in the big data era session had five presentations on essentially two different topics: data and text mining (four talks) and research data management (one talk). Peter Leonard of Yale University Library started the day with a very interesting presentation of how advanced text mining techniques can be used for digital humanities research. Using the digitized archive of Vogue magazine he demonstrated how the long term analysis of statistical distribution of related terms, like “pants”, “skirts”, “frocks”, or “women”, “girls”, can help visualise social trends and identify research questions. To do this there are a number of free tools available, like Google Books N-Gram Search and Bookworm. To make this type of analysis possible, researchers need full access to all data and text. However, rights issues come into play here, as Christoph Bruch of the Helmholtz Association, Germany, explained. What is needed is “intelligent openness” as defined by the Royal Society: data must be accessible, assessable, intelligible and usable. Unfortunately European copyright law stands in the way of the idea of fair use. Many European researchers are forced to perform their data analysis projects outside Europe, in the USA. The plea for openness was also supported by LIBER’s Susan Reilly. Data and text mining should be regarded as just another form of reading, one that doesn’t need additional licenses.

    IdeasBox packed

    A very impressive and sympathetic library project that deserves everybody’s support was not an official programme item, but a bunch of crates, seats, tables and cushions spread across the central conference venue square. The whole set of furniture and equipment, which comes on two industrial pallets, constitutes a self-supporting mobile library/information centre to be deployed in emergency areas, refugee camps, etc. It is called IdeasBox, provided by Libraries without Borders. It contains mobile internet, servers, power supplies, ereaders, laptops, board games, books, etc., based on the circumstances, culture and needs of the target users and regions. The first IdeasBoxes are now used in Burundi in camps for refugees from Congo. Others will soon go to Lebanon for Syrian refugees. If librarians can make a difference, it’s here. You can support Libraries without Borders and IdeasBox in all kinds of ways: http://www.ideas-box.org/en/support-us.html.

    IdeasBox unpacked

    Conclusion

    The questions about data management in libraries that I brought with me to the conference were only partly addressed, and actual practical answers and solutions were very rare. The management and mapping of heterogeneous and redundant types of data from all types of sources across all domains that libraries cover, in a flexible, efficient and system independent way apparently is not a mainstream topic yet. For things like that you have to attend Satellite Meetings. Legal issues, privacy, copyright, text and data mining, cloud based data sharing and management on the other hand are topics that were discussed. It turns out that attending an IFLA meeting is a good way to find out what is discussed, and more importantly what is NOT discussed, among librarians, library managers and vendors.

    The quality and content of the talks vary a lot. As always, the value of informal contacts and meetings cannot be overestimated. All in all, looking back I can say that my first IFLA has been a positive experience, not least because of the positive spirit and enthusiasm of all organizers, volunteers and delegates.

    (Special thanks to Beate Rusch for sharing IFLA experiences)

  • Library Linked Data Happening

    Posted on August 26th, 2014 Lukas Koster 2 comments

    LOD happening

    On August 14 the IFLA 2014 Satellite Meeting ‘Linked Data in Libraries: Let’s make it happen!’ took place at the National Library of France in Paris. Rurik Greenall (who also wrote a very readable conference report) and I had the opportunity to present our paper ‘An unbroken chain: approaches to implementing Linked Open Data in libraries; comparing local, open-source, collaborative and commercial systems’. In this paper we do not go into reasons for libraries to implement linked open data, nor into detailed technical implementation options. Instead we focus on the strategies that libraries can adopt for the three objectives of linked open data: original cataloguing/creation of linked data, exposing legacy data as linked open data, and consuming external linked open data. Possible approaches are: local development, using Free and Open Source Software, participating in consortia or service centres, and relying on commercial vendors, or any combination of these. Our main conclusions and recommendations are: identify your business case; if you’re not big enough, be part of some community; and take lifecycle planning seriously.

    The other morning presentations provided some interesting examples of a number of approaches we described in our talk. Valentine Charles presented the work in the area of aggregating library and heritage data from a large number of heterogeneous sources in different languages by two European institutions that de facto function as large consortia or service centres for exposing and enriching data, Europeana and The European Library. Both platforms not only expose their aggregated content in web pages for human consumption but also as linked open data, besides other so-called machine readable formats. Moreover they enrich their aggregated content by consuming data from their own network of providers and from external sources, for instance multilingual “value vocabularies” like thesauri, authority lists and classifications. The idea is to use concepts/URIs together with display labels in multiple languages. For Europeana these sources currently are GeoNames, DBPedia and GEMET. Work is being done on including the Getty Art and Architecture Thesaurus (AAT), which was recently published as Linked Open Data. Besides using VIAF for person authorities, The European Library has started adding multilingual subject headings by integrating the Common European Research Classification Scheme, part of the CERIF format. The use of MACS (Multilingual Access to Subjects) as Linked Open Data is being investigated. This topic was also discussed during the informal networking breaks. Questions that were asked: is MACS valuable for libraries, who should be responsible for MACS, and how can administering MACS in a Linked Open Data environment best be organized? Personally I believe that a multilingual concept based subject authority file for libraries, archives, museums and related institutions is long overdue and will be extremely valuable, not only in Linked Open Data environments.

    The importance of multilingual issues and the advantages that Linked Open Data can offer in this area were also demonstrated in the presentation about the Linked Open Authority Data project at the National Diet Library of Japan. The Web NDL Authorities are strongly connected to VIAF and LCSH among others.

    The presentation of the Linked Open Data environment of the National Library of France BnF (http://data.bnf.fr) highlighted a very interesting collaboration between, on the one hand, a large library with considerable resources in expertise, people and funding and, on the other, the non-library commercial IT company Logilab. The result of this project is a very sophisticated local environment consisting of the aggregated data sources of the National Library and a dedicated application based on the free software tool CubicWeb. An interesting situation arose when the company Logilab itself asked if the developed applications could be released as Open Source by the National Library. The BnF representative Gildas Illien (also one of the organizers of the meeting, together with Emmanuelle Bermes) replied with considerations about planning, support and scalability, which is completely understandable from the perspective of lifecycle planning.

    With all these success stories about exposing and publishing Linked Open Data, the question always remains whether the data is actually used by others. It is impossible to incorporate this in project planning and results evaluation. Regarding the BnF data this question was answered in the presentation about Linked Open Data in the book industry. The Electre and Antidot project uses linked open data from, among others, data.bnf.fr.

    The afternoon presentations were focused on creating, maintaining and using various data models, controlled vocabularies and knowledge organisation systems (KOS) as Linked Open Data: the Europeana Data Model (EDM), UNIMARC and MODS. An interesting perspective was presented by Gordon Dunsire on versioning vocabularies in a linked data world. Vocabularies change over time, so the assignment of a URI of a certain vocabulary concept should always contain version information (like timestamps and/or version numbers) in order to be able to identify the intended meaning at the time of assigning.

    The meeting was concluded with a panel with representatives of three commercial companies involved in library systems and Linked Open Data developments: Ex Libris, OCLC and the aforementioned Logilab. The fact that this panel with commercial companies on library linked data took place was significant and important in itself, regardless of the statements that were made about the value and importance of Linked Open Data in library systems. After years of dedicated, temporarily funded proof-of-concept projects this may be an indication that Linked Open Data in libraries is slowly becoming mainstream.

  • Roadmaps, roadblocks and data finding users

    Posted on June 19th, 2014 Lukas Koster 2 comments

    Lingering gold at ELAG 2014

    Locks in Bath

    Libraries tend to see themselves as intermediaries between information and the public, between creators and consumers of information. Looking back at the ELAG 2014 conference at the University of Bath however, I can’t get the image out of my head of libraries standing in the way between information and consumers. We’ve been talking about “inside out libraries”, “libraries everywhere”, “rethinking the library” and similar soundbites for some years now, but it looks like it’s been only talk and nothing more. A number of speakers at ELAG 2014 reported that researchers, students and other potential library visitors wanted the library to get out of their way and give them direct access to all data, files and objects. A couple of quotes:

    • “We hide great objects behind search forms” (Peter Mayr, “EuropeanaBot”)
    • “Give us everything” (Ben O’Steen, “The Mechanical Curator”).

    [Lingering gold: data, objects]
    In a cynical way this observation tightly fits this year’s conference theme “Lingering Gold”, which refers to the valuable information and objects hidden and locked away somewhere in physical and virtual local stores, waiting to be dug up and put to use. In her keynote talk, Stella Wisdom, digital curator at the British Library, gave an extensive overview of the digital content available there, and the tools and services employed to present it to the public. However, besides options for success, there are all kinds of pitfalls in attempting to bring local content to the world. In our performance “The Lord of the Strings”, Karen Coyle, Rurik Greenall, Martin Malmsten, Anders Söderbäck and myself tried to illustrate that in an allegorical way, resulting in a ROADMAP containing guidelines for bringing local gold to the world.
    In recent years it has become quite clear that data, dispersed and locked away in countless systems and silos, once liberated and connected can be a very valuable source of new information. This was very pertinently demonstrated by Stina Johansson in her presentation of visualization of research and related networks at Chalmers University using available data from a number of their information systems. Similar network visualizations are available in the VIVO open source linked data based research information tool, which was the topic of a preconference bootcamp which I helped organize (many thanks especially to Violeta Ilik, Gabriel Birke and Ted Lawless who did most of the work).

    [Systems, apis, technology trap]
    The point made here also implies that information systems actually function as roadblocks to full data access instead of as finding aids. I came to realize this some time ago, and my perception was definitely confirmed during ELAG 2014. In his lightning talk Rurik Greenall emphasized the fact that what we do in libraries and other institutions is actually technology driven. Systems define the way we work and what we publish. This should be the other way around. Even APIs, intended for access to data in systems without having to use end user system functions, are actually sub-systems, giving non-transparent views on the data. When Steve Meyer in his talk “Building useful and usable web services” said “data is the API” he was right in theory, yet in practice the reverse is not necessarily true. Also, APIs are meant to be used by developers in new systems. Non-tech end users have no use for them, as is illustrated by one of the main general reactions from researchers to the British Library Labs surveys, as reported by Ben O’Steen: “API? What’s that? I don’t care. Just give me the files.”

    Old technologies in new clothes

    [Commercial vs open source]
    This technology critique essentially applies to both commercial/proprietary and open source systems alike. However, it could be that open source environments are more favorable to open and findable data than proprietary ones. Felix Ostrowski talked about the reasons for and outcomes of the Regal project, moving the electronic objects repository of the State Library of Rheinland-Pfalz from an environment based on commercial software to one based on open source tools and linked data concepts. One of the side effects of this move was that complaints were received from researchers about their output being publicly available on the web. This shows that the new approach worked, that the old approach was effectively hiding information and that certain stakeholders are completely satisfied with that.
    On the side: one of the open source components of the new Regal environment is Fedora , only used for digital objects, not any metadata, which is exactly what is currently happening in the new repository project at the Library of the University of Amsterdam. A legitimate question asked by Felix: why use Fedora and not just the file system in this case?

    [Alternative ways]
    All these observations also imply that, if libraries really want to disseminate and share their lingering gold with the world, alternative ways of exposing content are needed, instead of or besides the existing ones. Fortunately some libraries and individuals have been working on providing better direct access and even unguided and unsolicited publication of data and objects that might be available but not really findable with traditional library search tools. The above mentioned EuropeanaBot (and other twitter bots) and the British Library Labs’ Mechanical Curator are a case in point. Every hour EuropeanaBot sends a tweet about a random digital object, enriching it with extra information from Wikipedia and other sources.
    In the case of the British Library Labs Ben O’Steen described an experiment with free access to large amounts of data that by chance led to the observation that randomly excavated images from that vast amount of content drew people’s attention. As all content was in the public domain anyway, they asked themselves “what’s the harm in making it a bit more accessible?”. So the Mechanical Curator was born, with channels on tumblr, twitter and flickr.
    Another alternative way to expose and share library content, a game, was presented by Ciaran Talbot and Kay Munro: LibraryGame. In brief, students are encouraged to use and visit the library and share library content with others by awarding them points and badges as members of an online community. The only two things students didn’t like about the name LibraryGame were “library” and “game”, so the name was changed to “BookedIn”.
    No matter if you like bots and games or not, the important message here is that it is worthwhile exploring alternative ways by which people can find the content that libraries consider so valuable.

    [People]
    In the end, it’s people that libraries work for. At Utrecht University Library they realised that they needed simpler ways to make it possible for people to use their content, not only APIs. Marina Muilwijk described how they are experimenting with the Lean Startup method. In a continuous cycle of building, measuring and learning, simple applications are released to end users in order to test if they use them and how they react to them.
    “Focus on the user” was also the theme of the workshop given by Ken Chad around the Jobs-to-be-done methodology.
    Interestingly, “How people find” instead of “How people search” was one of the perspectives of the Jisc “Spotlight on the Digital” project, presented by Owen Stephens in his lightning talk.

    [Collections and findability]
    Another perspective of that Jisc project was how to make collections discoverable. It turns out that collections as such are represented on the web quite well, whereas items in these collections aren’t.
    Valentine Charles of The European Library demonstrated the benefits of collection level metadata for the discoverability of hidden content, using the CENDARI project as example.

    [Linking data]
    What’s a library technology conference without linked data? Implicitly and explicitly the instrument of connecting data from different sources relates quite well to most of the topics presented around the theme of lingering gold, with or without the application of the official linked data rules. I have already mentioned most cases, so I will only go into a couple of specific sessions here.
    Niklas Lindström and Lina Westerling presented the developments with the new linked data based cataloguing system for the Swedish LIBRIS union catalogue. This approach is not simply a matter of exposing and consuming linked data, but in essence the reconstruction of existing workflows using a completely new architecture.
    The data management and integration platform d:swarm, a joint open source project of SLUB State and University Library Dresden and the commercial company AvantgardeLabs was presented in a lightning talk by Jan Polowinski. This tool aims at harvesting and normalising data from various existing systems and datastores into an intermediate platform that in turn can be used for all kinds of existing and new front end systems and services. The concept looks very useful for library environments with a multitude of legacy systems. Some time ago I visited the d:swarm team in Dresden together with a group of developers from the KOBV library consortium in Berlin, two of whom (Julia Goltz and Viktoria Schubert) presented their own new K2 portal solution for the data integration challenge in a lightning talk.

    Linked data is all about unique identifiers on the web. ORCiD, the recently popular global identifier for researchers and the topic of one of the workshops at last year’s ELAG, was explained by Tom Demeranville. As it happened, right after the conference it became clear that ORCiD implemented the Turtle linked data format.
    The problem of matching string based personal names from various data sources without matching identifiers was tackled in the workshop “Linking Data with sameAs” which I attended. Jane and Adrian Stevenson of the ArchivesHub UK showed us hands-on how to use tools like LOD-Refine and Silk for reconciling string value data fields and producing “sameAs” relationships/triples to be used in your local triple store. They have had substantial experience with this challenge in their Linking Lives project. I found the workshop very useful. One of the take-aways was that matching string data is hard work.

    [Excavations]
    Hard work also goes on in the caves and basements of the library world, as was demonstrated by Toke Eskildsen in his war stories of the Danish State Library with scanning companies, and by Eva Dahlbäck and Theodor Tolstoy in their account of using smartphones and RFID technology in fetching books from the stacks.

    [PS]
    Once again I have to say that a number of unofficial sessions, at breakfast, dinner, in pubs and hotel bars, were much more informative than the official presentations. These open discussions in small groups, fostering free exchange of ideas without fear of embarrassment, while being triggered by the talks in the official programme, can simply not take place within a tight conference schedule. Nevertheless, ELAG is a conference small and informal enough to attract people inclined to these extracurricular activities. I thank everybody who engaged in this. You know who you are. Or check Rurik Greenall’s conference report, which is a very structured yet personal account of the event.

    Pub talk

    [PPS]
    Lots of thanks to the dedicated and very helpful local organisation team of the Library of the University of Bath, who have done a wonderful job doing something completely new to them: organising an international conference.

  • Linked data or die!

    Posted on December 1st, 2013 Lukas Koster 1 comment

    Struggling towards usable linked data services at SWIB13


    Paraphrasing some of the challenges proposed by keynote speaker Dorothea Salo, the unofficial theme of the SWIB13 conference in Hamburg might be described as “No more ontologies, we want out of the box linked data tools!”. This sounds like we are dealing with some serious confrontations in the linked open data world. Judging by Martin Malmsten’s LIBRIS battle cry “Linked data or die!” you might even think there’s an actual war going on.

    Looking at the whole range of this year’s SWIB pre-conference workshops, plenary presentations and lightning talks, you may conclude that “linked data is a technology that is maturing”, as Rurik Greenall rightly states in his conference report. “But it has quite a way to go before we can say this stuff is ready to roll out in libraries”, as he continues. I completely agree with this. Personally I got the impression that we are in a paradoxical situation where on the one hand people speak of “we” and “community”, and on the other hand they take fundamentalist positions, unconditionally defending their own beliefs and disparaging and ridiculing the alternatives. In my view there are multiple, sometimes overlapping, sometimes irreconcilable “we’s” and “communities”. Sticking to your own point of view without willingness to reason with the other party really does not bring “us” any further.

    This all sounds a bit grim, but I again agree with Rurik Greenall when he says that he “enjoyed this conference immensely because of the people involved”. And of course on the whole the individual workshops and presentations were of a high quality.

    Before proceeding to the positive aspects of the conference, let me first elaborate a bit on the opposing positions I observed during the conference, which I think we should try to overcome.

    Developers disagree on a multitude of issues:
    Formats
    Developers hate MARC. Everybody seems to hate RDF/XML; JSON-LD seems to be the thing for RDF, but some say only Turtle should be used, or just plain JSON (see the serialisation sketch after this list: either way it is the same data underneath).
    Tools and languages
    Perl users hate Java, Java users hate PHP, and there’s Python and Ruby bashing.
    Ontologies
    Create your own or reuse existing ones; upper ontologies or not; or no ontologies at all, just usable tools.
    Operating systems
    Windows/UNIX/Linux/Apple… it’s either/or.
    Open source vs. commercial software
    Need I say more?
    Beer
    Belgians hate German beer, or any foreign beer for that matter.
    (Not to mention PDF).
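    For what it’s worth, most of the format fight is about different ways of writing down the same triples. A minimal rdflib sketch (assuming a reasonably recent rdflib, which bundles the JSON-LD serialiser; the URI is invented):

```python
# One and the same triple, serialised as Turtle, RDF/XML and JSON-LD.
# Requires rdflib >= 6, which ships the JSON-LD serialiser; the URI is
# illustrative only.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
g.bind("dcterms", DCTERMS)
work = URIRef("http://example.org/work/1")
g.add((work, DCTERMS.title, Literal("The Da Vinci Code", lang="en")))

for fmt in ("turtle", "xml", "json-ld"):
    print(f"--- {fmt} ---")
    print(g.serialize(format=fmt))
```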

    OK, I hope I made myself clear. The point is that I have no problem at all with having diverse opinions, but I dislike it when people are convinced that their own opinion is the only right one and refuse to have a conversation with those who think otherwise, or even respect their choices in silence. The developer “community” definitely has quite a way to go.

    Apart from these internal developer disagreements I noticed, there is the more fundamental gap between developers and users of linked open data. By “users” I do not mean “end users” in this case, but the intermediary deployers of systems. Let’s call them “libraries”.
    Linked Data developers talk about tools and programming languages, metadata formats, open source, ontologies, technology stacks. Librarians want to offer useful services to their end users, right now. They may not always agree on what kind of services and what kind of end users, and they may have an opinion on metadata formats in systems, but their outlook is slightly different from the developers’ horizon. It’s all about expectations and expectation management. That is basically the point of Dorothea Salo’s keynote. Of course theoretical, scientific and technical papers and projects are needed to take linked data further, but libraries need linked data tools that can easily be implemented and maintained, focused on providing new services to their end users and customers in the world of the web.
    In this respect OCLC’s efforts to add linked data features to WorldCat are praiseworthy. OCLC’s Technology Evangelist Richard Wallis presented his view on the benefits of linked open data for libraries, using Google’s Knowledge Graph as an example. His talk was mainly aimed at a librarian audience. At SWIB, where the majority of attendees are developers or technology staff, this seemed somewhat misplaced. By chance I had been present at Richard’s talk at the Dutch National Information Professional annual meeting two weeks earlier, where he delivered almost the same presentation for a large room full of librarians. There and then it was completely on target. For the SWIB audience this may all have been old news, except for the heads-up about OCLC’s work on FRBR “Works” BIBFRAME-type linked data, which will result in published URIs for Works in WorldCat.
    An important point here is that OCLC is a company with many library customers worldwide, so developments like this benefit all of these libraries. The same applies to customers of one of the other big library system vendors, Ex Libris. They have been working on developing linked data features for their so-called “next generation” tools for some time now, in close cooperation with the international user groups’ Linked Open Data Special Interest Working Group, as I explained in the lightning talk I gave. Open source library systems like Koha are also working on adding linked open data features to their tools. It’s with tools like these, which reach a large number of libraries, that linked open data for libraries can spread relatively quickly.

    In contrast to this linked data broadcasting, the majority of the SWIB presentations showed local proprietary development or research projects, though mostly of high quality. Where systems or tools were built, all the code and ontologies are available on GitHub, making them open source. Commendable as that is, open source on GitHub doesn’t mean that these potentially ground-breaking systems and ontologies can and will be adopted as de facto standards in the wider library community. Most libraries, both public and academic, are dependent on commercial system and content providers and can’t afford large scale local system development. This also applies up to a point, I presume, to libraries that deploy large open source tools like Koha.
    It would be great if some of these open source projects could evolve into commonly used standard tools, like Koha, Fedora and Drupal, just to name a few. Vivo is another example of an open source project rapidly moving towards an accepted standard. It is a framework for connecting and publishing research information of different nature and origin, based on linked data concepts. At SWIB there was a pre-conference “VivoCamp”, organised by Lambert Heller, Valeria Pesce and myself. Research information is an area rapidly gaining importance in the academic world. The Library of the University of Amsterdam, where I work, is in the process of starting a Vivo pilot, in which I am involved. (Yes, the Library of the University of Amsterdam uses both commercial providers like OCLC and Ex Libris, and many open source tools.) The VivoCamp was a good opportunity to have a practical introduction to and discussion about the framework, not least because of the presence of John Fereira of Cornell University, one of the driving forces behind Vivo. All 26 attendees expressed their interest in a follow-up.
    Vivo, although it may be imperfect, represents the type of infrastructure that may be needed for large scale adoption of linked open data in libraries. PUB, the repository based linked data research information project at Bielefeld University presented by Vitali Peil, is aimed at exactly the same domain as Vivo, but it again is a locally developed system, using another smaller scale open source framework (LibreCat/Catmandu of Bielefeld, Ghent and Lund universities) and a number of different ontologies, of which Vivo is just one. My guess is that, although PUB/LibreCat might be superior, Vivo will become the de facto standard in linked data based research information systems.

    Instead of focusing on systems, maybe the library linked data world would be better served by a common user-friendly metadata+services infrastructure. Of course, the web and the semantic web are supposed to be that infrastructure, but in reality we all move around and process metadata all the time, from one system and database to another, in order to be able to offer both legacy and new linked data services. At SWIB a number of tools for ETL were mentioned; ETL is developer jargon for Extract, Transform, Load. By the way, jargon is a very good way to widen the gap between developers and libraries.
    There were pre-conference workshops for the ETL tools Catmandu and Metafacture, and in a lightning talk SLUB Dresden, in collaboration with Avantgarde Labs, presented a new project focused on using ETL for a separate multi-purpose data management platform, serving as a unified layer between external data sources and services. This looks like a very interesting concept, similar to the ideas of a data services hub I described in an earlier post “(Discover AND deliver) OR else”. The ResourceSync project, presented by Simeon Warner, is trying to address the same issue with a different method: distributed synchronisation of web resources.
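    To illustrate what ETL means in this context, here is a toy Python sketch (not how Catmandu, Metafacture or the SLUB/Avantgarde Labs platform actually work): extract records from an imaginary legacy source, transform them into triples, and load them into a store, here an in-memory graph. Field names and URIs are invented for the example.

```python
# A toy Extract-Transform-Load pipeline: records from an imaginary legacy
# system are turned into RDF triples and loaded into an rdflib Graph.
# Field names and URIs are invented; real ETL tools such as Catmandu or
# Metafacture handle far more complex mappings.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

EX = Namespace("http://example.org/")

def extract():
    """Pretend to read records from a legacy database or MARC export."""
    return [
        {"id": "123", "title": "The Da Vinci Code", "author_viaf": "102403515"},
        {"id": "456", "title": "Angels & Demons", "author_viaf": "102403515"},
    ]

def transform(record):
    """Map one flat record onto a handful of triples."""
    work = EX[f"work/{record['id']}"]
    yield (work, RDF.type, EX.Work)
    yield (work, DCTERMS.title, Literal(record["title"]))
    yield (work, DCTERMS.creator, URIRef(f"http://viaf.org/viaf/{record['author_viaf']}"))

def load(triples, graph):
    """Add the triples to the target store (here an in-memory graph)."""
    for triple in triples:
        graph.add(triple)

g = Graph()
g.bind("dcterms", DCTERMS)
for rec in extract():
    load(transform(rec), g)

print(g.serialize(format="turtle"))
```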

    One can say that the BIBFRAME project is also focused on data infrastructure, albeit at the moment limited to the internal library cataloguing workflow, aimed at replacing MARC. An overview of the current state of the project was presented by Lars Svensson of the German National Library.
    The same can be said for the National Library of Sweden’s new LIBRIS linked data based cataloguing system, presented by Martin Malmsten (Decentralisation, Distribution, Disintegration – towards Linked Data as a First Class Citizen in Libraryland). The big difference is that they’re actually doing what BIBFRAME is trying to plan. The war cry “Linked data or die!” refers to the fact that it is better to start from scratch with a domain and format independent data infrastructure, like linked data, than to try and build linking around existing rigid formats like MARC. Martin Malmsten rightly stated that we should keep formats outside our systems, as is also the core statement of the MARC-MUST-DIE movement. Proprietary formats can be dynamically imported and exported at will, as was demonstrated by the “MARC” button in the LIBRIS user interface. New library linked data developments will have to coexist with the existing wider library metadata and systems environment for some time.
    Like all other local projects, the LIBRIS source code and ontology descriptions are available on GitHub. In this case the sheer scope of the National Library of Sweden and of the project makes it a bit more plausible that this may actually be reused on a larger scale. At least the library cataloguing ontology in JSON-LD there is worth having a look at.
    To return to our starting point, the LIBRIS project acknowledges the fact that we need actual tools besides the ontologies. As Martin Malmsten quoted: “Trying to sell the idea of linked data without interfaces is like trying to sell a fax without the invention of paper”.


    The central question in all this: what is the role of libraries in linked data? Developers or implementers, individually or in a community? There is obviously not one answer. Maybe we will know more at SWIB14. Paraphrasing Fabian Steeg and Pascal Christoph of hbz and Dorothea Salo, next year’s theme might be “Out of the box data knitting for great justice”.

  • Resilience, connections and a clean slate

    Posted on June 10th, 2013 Lukas Koster 17 comments

    The inside-out library at ELAG 2013
    This year marked my fifth ELAG conference since 2008 (I skipped 2009), which is not much if you take into account that ELAG2013 was the 37th edition. I really enjoyed the 2013 conference, not least because of the wonderful people of the local organising committee at the Ghent University Library, who made ELAG2013 a very pleasant event. This year’s theme was “the inside-out library”, a concept coined by Lorcan Dempsey, which in brief emphasises the need for libraries to shift focus 180 degrees.


    Sylvia Van Peteghem opening speech

    Before you read any further I strongly suggest you read Rurik Greenall’s post on ELAG 2013 first. He covered most of the programme in his usual thorough and analytical way.

    In my personal overall conference experience the major emphasis was on research support in libraries. This was partly due to my attendance of the pre-conference Joint OpenAIRE/LIBER Workshop ‘Dealing with Data – what’s the role for the library?’ on May 28. It was good to have sessions focusing on different perspectives: data management, data publication, the researchers’ needs, library support and training. I was honoured to be invited to participate in the closing round table panel discussion together with two library directors, Wilma van Wezenbeek (TU Delft Library) and Wolfram Horstmann (Bodleian Library), under the excellent supervision of Kevin Ashley (DCC). An important central concept in the workshop was the research life cycle, which consists of many different tasks of a very diverse nature. Academic and research libraries should focus on those tasks for which they are or can easily become qualified.

    Looking from another angle we can distinguish two main perspectives in integrating research: the research ecosystem itself, which can be seen as the main topic of the OpenAIRE/LIBER workshop, and the research content, the actual focus of researchers and research projects. I will try to address both perspectives here.

    On the first day of the actual conference Herbert Van de Sompel gave the keynote speech with the title “A clean slate”. Rurik Greenall aptly describes the scope and meaning of Herbert’s argument. Herbert has been involved in a number of important and relevant projects in the domain of scholarly communication. My impression this time was: now he’s bringing it all together around the fairly new concept of the “research object”, integrating a number of projects and protocols, like ORE, Memento, OpenAnnotation, Provenance, ResourceSync. It’s all about connections between all components related to research on the web in all dimensions.

    This linking of input, output, procedures and actors of research projects in various temporal and contextual dimensions in a machine readable way is extremely important in order to be able to process all relevant information by means of computer systems and present it to the human consumer. In this respect I think it is essential that data citations in scholarly articles should not only be made available in the article text, but also as machine readable metadata that can be indexed by external aggregators.
    Moreover, it would be even better if it were possible to provide links to research projects that would serve as central hubs for linking to all associated entities, not only datasets. This is the role that the research object can fulfill. During the OpenAIRE/LIBER workshop I tried to address this issue a number of times, because I am a bit surprised that both researchers and publishers appear to be satisfied with text-only clickable dataset citations. The same is true the other way around for links to articles in dataset repositories like Dryad. I think there is a role here for information professionals and metadata experts in libraries. This is exactly the point that Peter van Boheemen made in his talk about producing better metadata for research output. Similarly Jing Wang stressed the importance of investigating the role of metadata specialists and data librarians for interoperability and authority control in her presentation on the open source linked data based research discovery tool Vivo.

    Again there are two perspectives here. Even if we have machine readable metadata on research projects and datasets, most systems are not adequately equipped with functionality to process or present this information. It is not so easy to update complex systems with new functionality. Planned update cycles, including extensive testing, are necessary in order to adhere to the system’s design and architecture and to avoid breaking things. This equally applies to commercial, open source and home grown systems. Joachim Neubert’s presentation of the use of the open source CMS Drupal for linked data enhanced publishing for special collections illustrated this. Some very specialist custom extensions to the essentially quite flexible system were needed to make this a success. (On a different note, it was nice to see that Joachim used a simple triple diagram from my first library linked data blog post to illustrate the use of different types of predicates between similar subjects and objects.)
    Anyway, a similar point can be made about systems and identifiers for people (authors, researchers, etc.). I participated in the workshop on ISNI, ORCID and VIAF: Examining the fundamentals and application of contributor identifiers, led by Anila Angjeli and Thom Hickey, one of six ELAG workshops this year. Thom and Anila presented a very complete and detailed overview of the similarities and differences of these three identifier schemes. One of the discussion topics was the difference between the adoption of these schemes by the community on the one hand, and their application as machine readable metadata in library systems on the other.

    Here “resilience” comes into play, a concept introduced by Beate Rusch in her talk on the changing roles of the German regional library consortia and service centres in the world of cloud computing and SaaS. Rurik Greenall captures the essence of her talk when he says “… homogenous, generic solutions will not work in practice because they are at odds with how things are done …” and that “messy, imperfect systems… are smart and long lived”. Since Beate’s presentation the term “resilience” has popped up in a number of discussions with colleagues, during and after the conference, mainly in the sense that most systems, communities and infrastructures are NOT resilient. Resilience is a concept mainly used in psychology and physics, meaning the ability of someone or something to return to its original state after being subjected to a severe disturbance. Beate’s idea is that we can adapt better to changing circumstances and needs in the world around us if we are less perfect and rigid than we usually are. In this sense I think resilience can also mean that a structure may permanently change instead of returning to its original state.
    In the library world resilience can be applied to librarians, libraries, library infrastructure and library systems alike. In my view “resilience” might apply to the alternative architecture I have described in a recent blog post, where I argue that we should stop thinking systems and start thinking data. In order to be resilient we need an open, connected infrastructure, that is of the web (not on the web). The SCAPE infrastructure for processing large datasets for long term preservation, presented by Sven Schlarb, might fit this description.

    A number of presentations focused on infrastructure and architecture. The new version of the Swedish union catalogue LIBRIS could be described as a resilient system. Martin Malmsten, Markus Sköld and Niklas Lindström showed their new linked open data based integrated library framework which was built from the ground up, from “a clean slate” so to speak. I can only echo Rurik’s verdict: “With this, Libris really are showing the world how things are done”. This is in contrast to the Library of Congress BibFrame development, which started out very promisingly but now seems to be evolving into an inward-looking, rigid New Marc. This was illustrated by Martin Malmsten when he revealed to us that Marc is undead, and by Becky Yoose, who wrote a very pertinent parable telling the tale of the resurrection of Marc.
    Rurik Greenall described the direction taken at his own institution NTNU Library: getting rid of old legacy library and webpage formats and moving towards being part of the web, providing information for the web, being data driven. It’s a slow and uphill struggle, but better than the alternative. A clean slate again!
    Dave Pattern presented a different approach in connecting data from a number of existing systems and databases by means of APIs, and combining these into a new and well received reading list service at the University of Huddersfield.

    Back to research. In our presentation, or rather performance, Jane Stevenson and I tried to present the conflicting perspectives of collection managers and researchers in a theatrical way, showing parallel developments in the music industry. Afterwards we tried to analyse the different perspectives, argued that researchers need connected information of all types and from all sources and concluded that information professionals should try and learn to take the researcher’s perspective in order to avoid becoming irrelevant in that area.
    The relationship between libraries and researchers was also the subject of the talk “Partners in research. Outside the library, inside the infrastructure“, by Sally Chambers and Saskia Scheltjens. Here the focus was on providing comprehensive infrastructures for research support, especially in the digital humanities. Central question: large top-down institutionalised structures, or bottom-up connected networks? Bottom line is: the researcher’s needs have to be met in the best possible way.
    A very interesting example of an actual digital humanities research and teaching project in collaboration between researchers and the library is the Annotated Books Online project that was presented by Utrecht University staff. The collection of rare books is made available online in order to crowdsource the interpretation of handwritten annotations present in these books.

    Besides research support there were presentations on other “inside out library” topics: publishing, teaching, data analysis and GLAM.
    Anders Söderbäck presented the Stockholm University Press, a new publishing house for open access digital and print-on-demand books. I was pleasantly surprised that Anders included two quotes from my aforementioned blog post in his talk: “...in the near future we will see the end of the academic library as we know it” and “According to some people university libraries are very suitable and qualified to become scholarly publishers … I am not sure that this is actually the case. Publishing as it currently exists requires a number of specific skills that have nothing to do with librarian expertise“. But of course Anders’ most important achievement was winning the Library Automation Bingo by including all required terms in one slide in a coherent and meaningful way.


    Merrilee Proffitt presented an overview of MOOCs and libraries, and Sarah Brown described the way that learning materials at the Open University in the UK are successfully connected and integrated in the linked data based STELLAR project. Looking at these developments, the question arises whether there are already efforts to come to a Teaching Object model, similar to the Research Object.
    Andrew Nagy described the importance of analysing huge amounts of usage data in order to improve the usability and end user front end of the Summon discovery tool. Dan Chudnov presented the Social Media Manager prototype, used for collecting data from Twitter for use in social science research.
    Valentine Charles described the activities carried out by Europeana to contribute large amounts of digitised library heritage resources to Wikimedia Commons by means of the GLAMwiki toolset in order to improve visibility of these resources the Open Access way. The GLAMwiki toolset currently appears to offer a number of challenges for the interoperability and integration of metadata standards between the library and the Wikimedia world. Another plea for resilience.

    Then there were the workshops. The combination of these parallel hands-on and engaging group activities and the plenary sessions makes ELAG a unique experience. Although I only participated in one, obviously, I have heard good reports from all other workshops. I would like to give a special mention to Ade and Jane Stevenson’s “Very Gentle Linked Data” workshop, where they managed to teach even non-tech people not only the basic principles of linked data, but also how to create their own triple store and query it with SPARQL.

    Summarising: looking at the ELAG2013 presentations, are we ready for the inside out library? Sometimes we can start with a clean slate, but that is not always possible. Resilience seems to be a requirement if we want to cope with the dramatic changes we are facing. But you can’t simply decide to be resilient, either something is resilient or it isn’t. A clean slate might be the only option. In any case it seems obvious that connections are key. The information profession needs to invest in new connections on every level, creating new forms of knowledge, in order to stay relevant.

  • Beyond The Library

    Posted on March 22nd, 2013 Lukas Koster 46 comments

    The BeyondThePDF2 conference, organised by FORCE11, was held in Amsterdam, March 19-20. From the website: “...we aim to bring about a change in modern scholarly communications through the effective use of information technology”. Basically the conference participants discussed new models of content creation, content dissemination, content consumption, funding and research evaluation.
    Because I work for an academic library in Amsterdam, dealing with online scholarly information systems and currently trying to connect traditional library information to related research information, I decided to attend.
    Academic libraries are supposed to support university students, teaching and research staff by providing access to scholarly information. They should be somewhere in the middle between researchers, authors, publishers, content providers, students and teachers. Consequently, any big change in the way that scholarly communication is carried out in the near and far future definitely affects the role of academic libraries. For instance, if the scholarly publication model were to change overnight from the current static document centered model to a dynamic linked data model, the academic library discovery and delivery systems infrastructure would grind to a halt.

    © Paul Groth

    So I was surprised to see that the library representation at the conference was so low compared to researchers, publishers, students and tech/tools people (thanks to Paul Groth for the opening slides). No Dutch university library directors were present. Maybe that’s because they all attended the Research Data Alliance launch in Gothenburg which was held at the same time. I know of at least one Dutch university library director who was there. Maybe an official international association is more appealing to managers than an informal hands on bunch like FORCE11.

    A number of questions arise from this observation:

    Are academic libraries talking to researchers?

    Probably (or maybe even apparently) not enough. Besides traditional library services like providing access to publications and collections, academic libraries are more and more asked to provide support for the research process as such, research data management, preservation and reuse, scholarly output repositories and research information systems. In order to perform these new tasks in an efficient way for both the library and the researcher, they need to communicate about needs and solutions.
    I took the opportunity and talked to a couple of scholars/researchers at BeyondThePDF2, asking among other things: “When looking for information relevant to your research topic, do you use (our) library search tools?” Answer: “No. Google.” or similar. Which brings me to the next question.

    Do researchers know what academic libraries are doing?

    Probably (or maybe even apparently) not enough. Same answer indeed. It struck me that of the few times libraries were mentioned in talks and presentations, it was almost always in the form of the old stereotype of the stack of books. Books? I always say: “Forget books, it’s about information!”. One of the presenters whose visionary talk I liked very much even told me that they hoped the new Amsterdam University Library Director would know something about books. That really left me speechless.
    Fortunately the keynote speaker on the second day, Carol Tenopir, had lots of positive things to say about libraries. One remark was made (not sure who said it) that has been made before: “if libraries do their work properly, they are invisible”. This specifically referred to academic libraries’ role in selecting, acquiring, paying for and providing technical access to scholarly publications from publishers and other content providers.
    Another illustration of this invisibility is the in itself great initiative that was started during the conference: “An open alternative to Google Scholar”, which could just as well have been called “An open alternative to Google Scholar, Primo Central, WorldCat Local, Summon, EDS”. These last four are the best known commercial global scholarly metadata indexes that lots of academic libraries offer.
    Anyway, my impression that academic libraries need to pay attention to their changing role in a changing environment was once again confirmed.

    Publishers and researchers talk to each other!

    (Yes I know that’s not a question). In the light of the recent war between open access advocates and commercial publishers it was good to see so many representatives of Elsevier, Springer etc. actively engaged in discussions with representatives of the scholarly community about new forms of content creation and dissemination. Some of the commercial content providers/aggregators are also vendors of the above mentioned Google Scholar alternatives (OCLC WorldCat Local, ProQuest/SerialsSolutions Summon, EBSCO EDS). All of these are very reluctant to contribute their own metadata to their competitors’ indexes. Academic libraries are caught in the middle here. They pay lots of money for content that apparently they can only access through the provider’s own channels. And in this case the publishers/providers do not listen to the libraries.

    Why so many tools/tech people?

    Frankly I don’t know. However, I talked to a tools/tech person who worked for one of the publishers. So there obviously is some overlap in the attendee provenance information. Speaking about myself, working for a library, I am not a librarian, but rather a tools/tech person (with an academic degree even). Tools/tech people work for publishers, universities, university libraries and other types of organisations.
    There is a lot of interesting innovative technical work being done in libraries by tools/tech people. We even have our own conferences and unconferences that have the same spirit as BeyondThePDF. If you want to talk to us, come for instance to ELAG2013 in Ghent in May, where the conference theme will be “The inside-out library”. Or have a look at Code4Lib, or the Library Linked Data movement.

    Positive action

    Besides the good presentations, discussions and sessions, the most striking result of BeyondThePDF2 was the start of no less than three bottom-up revolutionary initiatives that draw immediate attention on the web:
    The Scholarly Revolution – Peter Murray-Rust
    The Open Alternative to Google Scholar – Stian Håklev
    The Amsterdam Manifesto on Data Citation Principles – Merce Crosas, Todd Carpenter, Jody Schneider

    We can make it work.

  • Change or be irrelevant

    Posted on October 10th, 2012 Lukas Koster 28 comments

    Or: Think “different” or paint yourself in a corner

    EMTACL12 – Emerging Technologies in Academic Libraries 2012

    I attended the EMTACL12 conference in Trondheim October 1-3, 2012, organised by the Library of NTNU Norwegian University of Science and Technology, both as a member of the international programme committee and as a speaker. EMTACL stands for “emerging technologies in academic libraries”. Looking back, my impression was that the conference was not so much about emerging technologies, as about emerging tasks using existing technologies. One of the keynote speakers, Rudolf Mumenthaler, expressed similar thoughts in his blog post “No new technologies in libraries”, but some of the other participants disagreed, saying that “being emerging” has more to do with the context of technology than with the technology itself (see the comments on that blog post). Some technologies can be established, but may still be emerging in certain domains. There is something to say for that. Anyway, whatever you say, we all mean the same thing.

    EMTACL12 was the second EMTACL conference. The first one was organised in 2010. One of the presentations that caused a great stir amongst librarians on twitter in the 2010 edition was the one entitled “I’ve got Google, why do I need you? A student’s expectations of academic libraries” by Ida Aalen. Let’s look at this year’s conference with that perspective in mind: is there a future for academic libraries in supporting students and researchers other than just giving access to publications?

    The word “change” best describes the overall impression I got from all EMTACL12 presentations. And “data”. Both concepts involving “support and services for research and education”. Technologies that were mentioned: linked data, apis, mobile computing, visualisation, infrastructure, communication.

    The EMTACL12 programme consisted of 8 plenary keynote presentations by invited speakers, and a number of presentations in two parallel tracks. Let me report on the things that struck me most.

    Keynotes

     

    The title of the opening keynote presentation by Herbert Van de Sompel, “Paint-yourself-in-the-corner Infrastructure” aptly describes the current situation of academic libraries. “Paint yourself in a corner” means something like: “To put yourself in a situation with no visible solution or alternative”. Herbert Van de Sompel talked about the changing nature of the scholarly record: from “fixity” and “boundary” to dynamic and interdependent on the web. Online publications and related information, like research project information, references and data, change over time, so it becomes increasingly difficult to recreate a scholarly record. These are the challenges that academic libraries need to address. Van de Sompel mentioned a couple of new tools and protocols that can help: Memento, DURI (Durable URI), SiteStory. See also the excellent report of this session by Jane Stevenson on the Archives Hub blog.

     

    5 dimensions

    “Think ‘different’” is what Karen Coyle told us, using the famous Steve Jobs quote. And yes, the quotes around “different” are there for a reason: it’s not the grammatically correct “think differently”, because that’s too easy. What is meant here is: you have to have the term “different” in your mind all the time. Karen Coyle confronted us with a number of ingrained obsolete practices in libraries. Like the ineradicable need for alphabetic ordering, which only makes sense in physical catalogue card systems. “Alphabetical order is not generally meaningful and an accident of language”, she said. Same with page numbers and ebooks: “…it is literally impossible to get everyone ‘on the same page’”. Before printing we already had a perfect reference system for texts, independent of physical appearance: paragraph or verse numbers (like in the Bible).
    Libraries put things on shelves, forcing the user to see individual items, and ignoring the connections there are between them. “Library classification is a knowledge-prevention system, not a knowledge-organisation system”. The focus is still too much on physical items: “The FRBR user tasks drive me insane, as they end with obtain”.  According to Karen Coyle, libraries are two-dimensional linear things. We need to add a third (links), fourth (time) and fifth dimension (the users).

    © Patrick Hochstenbach

    Is linked data the answer? Not as such: “ISBD in RDF is like putting a turbo engine on a dinosaur”. The world is not waiting for libraries’ bibliographic data as Linked Open Data. The web is awash with bibliographic data. But we have holdings information, and that is unique and adds value. We should try and get that information into Google search results rich snippets.
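    To make the rich snippets point a bit more concrete: something like the following schema.org JSON-LD, embedded in a catalogue page, is the kind of holdings markup search engines can pick up. This is my own sketch, not Karen Coyle’s; the property choices and values are illustrative and should be checked against the current schema.org vocabulary.

```python
# A rough sketch of schema.org JSON-LD describing a book together with local
# holdings information, of the kind that could be embedded in a catalogue
# page for search engines to pick up. Names and values are illustrative.
import json

holding_markup = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "The Da Vinci Code",
    "author": {"@type": "Person", "name": "Dan Brown"},
    "workExample": {
        "@type": "Book",
        "isbn": "9780307474278",
        "offers": {
            "@type": "Offer",
            "availability": "https://schema.org/InStock",
            "offeredBy": {
                "@type": "Library",
                "name": "Example University Library",
            },
        },
    },
}

# Embed this inside <script type="application/ld+json"> ... </script> in the page.
print(json.dumps(holding_markup, indent=2))
```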

    Karen’s message, which I wholeheartedly support, was: “The mission of the library is not to gather physical things into an inventory, but to organize human knowledge that has been very inconveniently packaged.”

    Rurik Greenall’s keynote “Defining/Defying reality: the struggle towards relevance in bibliographic data” also focused on the imminent irrelevancy of libraries, from another perspective. “Outsourcing library business is better called ‘outscarcing’. Libraries are losing skills.” “You can tell a lot about an organization from the way it treats its data.” “We see metadata as good and data as bad. The terms are the same.” “Ideas change, so should your data.” Buying shelf-ready data means being static. “Data should age like wine, not like fish.” In this changing environment bibliographic data needs to be enhanced. There is a role for experts, for the library. Final quote: “The semantic web doesn’t exist anymore, it’s been absorbed by the web”.

    Rurik’s world

     

    Rudolf Mumenthaler spoke about “Innovation management in and for libraries”. During and after his talk the big question was: can innovation be promoted by management, or does it need to grow of its own accord, by allowing staff to play, the Google way? It appears that there may be cultural differences. The main thing is: innovation has to be facilitated in one way or another. See the comments on his blog post.

    Astrophysicist Eirik Newth entertained the audience with his slideless “Forecast for the academic library of 2025: Cloudy with a chance of user participation and content lock-in”.

    Jens Vigen, Head Librarian at CERN, delivered a very entertaining and compelling argument for open access with his talk “Connecting people and information: how open access supports research in High Energy Physics. Since 50 years!” The CERN convention of 1953 already effectively contains an Open Access Manifesto. CERN supports SCOAP3, the Sponsoring Consortium for Open Access Publishing in Particle Physics, and uses subscription funds for open access. “You librarians today spend money on subscriptions, tomorrow you will spend it on open access.”
    A couple of very interesting remarks by Jens Vigen that are of direct interest to online library discovery layers:
    “A researcher would never go to an institutional repository; they find their colleagues in subject repositories.”
    “A successful digital library: one size does not fit all.”

    OCLC’s new “Technology Evangelist” Richard Wallis’ talk “OCLC WorldShare and Linked Data” was actually not about WorldShare and linked data, but consisted of two parts: a WorldShare commercial, and a presentation of WorldCat and linked data, mainly the embedding of additional schema.org markup in WorldCat search results. Richard Wallis also mentioned the WorldCat Linked Data Facebook app, which almost nobody seemed to know. Maybe Facebook isn’t the right platform for things like this after all?

    In his closing keynote “What Next for Libraries? Making Sense of the Future”, Brian Kelly (UKOLN, University of Bath, UK) made it clear that it is very hard to foresee the future, with Star Trek, monorails and paper planes as evidence.

    Parallel tracks

    Obviously I could only attend half of the parallel tracks sessions. Moreover, I chaired two sessions of two presentations each, in the “Semantic Web” and “Supporting Research” tracks, and I gave one presentation myself.

    In “The winner takes it all? – APIs and Linked Data battle it out”, Jane and Adrian Stevenson (yes, they’re married, and work together) of the MIMAS National Data Centre at the University of Manchester in the UK performed an actual battle, defending the use of the generic linked data protocol versus the more dedicated API approach in making data available for reuse and mashups. Two interesting projects served as examples: the World War 1 Discovery Project (Adrian for APIs) and Linking Lives (Jane for Linked Data). Conclusion: too close to call.

    Black Metal MARC

    Norwegian Black Metal was the intriguing topic of Kim Tallerås’ talk “Using Linked data to harmonize heterogeneous metadata – Modeling the birth of Norwegian black metal”. He and three others combined complicated metadata from two heterogeneous data sources about early Norwegian black metal bands, performances and recordings using linked data ontologies and graph matching techniques. We saw some very interesting slides containing MARC records and some typical Black Metal band and song names.
    Afterwards we had the opportunity to experience the real thing in the Black Metal Room in the Norwegian Rock and Pop Museum Rockheim during our conference excursion.

    Black Metal Room at Rockheim

    “Mubil: a digital laboratory” is a project (NTNU Trondheim, PERCRO, Pisa, Italy) aimed at augmenting and enriching rare old books in a digital 3D architecture, ready for all kinds of platforms and devices. Results are touch ebooks, with options for retrieving extra textual information and virtual 3D objects. A very interesting presentation by Alexandra Angeletaki, Marcello Carrozzino and Chiara Evangelista.

    In her talk “Libraries, research infrastructures and the digital humanities: are we ready for the challenge?”, Sally Chambers (DARIAH Göttingen) gave us a very thorough and complete overview of what “Digital Humanities” means and of all organisations and infrastructures currently available to libraries that are charged with supporting digital humanities research.

    The History Engine project was the subject of the presentation “Driving history forward: The History Engine as a vehicle for engaging undergraduate research” by Paulina Rousseau, Whitney Kemble and Christine Berkowitz (University of Toronto Scarborough), as a real example of how libraries can support undergraduate students in their efforts to master research.

    Sharon Favaro, Digital Services Librarian at Seton Hall University in South Orange, USA, showed us the landscape of disconnected tools used in the different stages of research projects: catalogues, databases, writing tools, drawing tools, reference managers, task managers, email; on the web, on internet file sharing tools, on desktop, on flash drives. The topic of her talk “Designing tools for the 21st century workflow of research and how it changes what libraries must do” was: how can research libraries support scholars within the entire lifecycle of the research process? The goal being to identify areas where library tools could be better integrated to support library resource use throughout the lifecycle of research. It was a pity that there was no real view yet on the best way to solve this problem: create a new library based infrastructure platform, use existing linking features, or other options. This will hopefully be the objective of a follow-up project at Seton Hall University Library.

    “Publication profiles – presenting research in a new way”: Urban Andersson and Stina Johansson presented the Chalmers University (Gothenburg, Sweden) Publication Profiles Platform, in which all kinds of information related to Chalmers University researchers and publications are linked together. The main objective is to increase the visibility of Chalmers University research. A good example of how university libraries should take care of their own research and publications domain. A very interesting visualisation feature was shown: Chalmers Geography, or geographical relations between researchers and projects on Google Maps. A question I should have asked (but didn’t) is: how does this project relate to the VIVO project?

    In my own presentation “Primo at the University of Amsterdam – Technology vs Real Life” I tried to show the discrepancies between the in theory unlimited possibilities of the technology used in library discovery layers and the limitations in the actual implementation of these tools, focused on content, indexing and user interface configuration. One of my conclusions was already expressed earlier by Jens Vigen: “A successful digital library: one size does not fit all.”.

    Of the other parallel track sessions I heard good reports about Andrew Whitworth’s “The triadic model: A holistic view of how digital and information literacy must support each other” and Shun Nagaya and Keizo Itabashi’s “Covo.js: A JavaScript Library to Utilize Subject Headings and Thesauri on the Web”. This doesn’t mean that the other talks were bad; I just didn’t manage to talk to people about them. One presentation worth mentioning is Krista Godfrey’s “The QR Question: QR Codes in Academic Libraries”, because it featured QR cows and my own photo of the University of Amsterdam Library’s QR cards.

    Let’s not forget Rune Martin Andersen’s talk on Bartebuss (Moustache Bus), the Trondheim public transport open data app project. This is yet another proof that public transport apps are the killer apps of open data.

    Trondheim moustache men

    Last but not least: the food (delicious and lots of it), the photos, Patrick Hochstenbach’s doodles and the music: the excursion to Rockheim Museum, the conference dinner entertainment by Skrømt, and the afterparty at Ramp bar, resulting in an interesting playlist afterwards.

  • Local library data in the new global framework

    Posted on January 5th, 2012 Lukas Koster 33 comments

    2011 has in a sense been the year of library linked data. Not that libraries of all kinds are now publishing and consuming linked data in great numbers. No. But we have witnessed the publication of the final report of the W3C Library Linked Data Incubator Group, the Library of Congress announcement of the new Bibliographic Framework for the Digital Age based on Linked Data and RDF, the release by a number of large libraries and library consortia of their bibliographic metadata, many publications, sessions and presentations on the subject.

    All these events focus mainly on publishing library bibliographic metadata as linked open data. Personally I am not convinced that this is the most interesting type of data that libraries can provide. Bibliographic metadata as such describe publications, in the broadest sense, providing information about title, authors, subjects, editions, dates, urls, but also physical attributes like dimensions, number of pages, formats, etc. This type of information, in FRBR terms: Work, Expression and Manifestation metadata, is typically shared among a large number of libraries, publishers, booksellers, etc. ‘Shared’ in this case means ‘multiplied and redundantly stored in many different local systems‘. It doesn’t really make sense if all libraries in the world publish identical metadata side by side, does it?

    In essence only really unique data is worth publishing. You link to the rest.

    Currently, library data that is really unique and interesting is administrative information about holdings and circulation. After having found metadata about a potentially relevant publication it is very useful for someone to know how and where to get access to it, if it’s not freely available online. Do you need to go to a specific library location to get the physical item, or to have access to the online article? Do you have to be affiliated to a specific institution to be entitled to borrow or access it?

    Usage data about publications, both print and digital, can be very useful in establishing relevance and impact. This way information seekers can be supported in finding the best possible publications for their specific circumstances. There are some interesting projects dealing with circulation data already, such as the research project by Magnus Pfeffer and Kai Eckert as presented at the SWIB 11 conference, and the JISC funded Library Impact Data project at the University of Huddersfield. The Ex Libris bX service presents article recommendations based on SFX usage log analysis.

    The consequence of this assertion is that if libraries want to publish linked open data, they should focus on holdings and circulation data, and for the rest link to available bibliographic metadata as much as possible. It is to be expected that the Library of Congress’ New Bibliographic Framework will take care of that part one way or another.

    In order to achieve this libraries should join forces with each other and with publishers and aggregators to put their efforts into establishing shared global bibliographic metadata pools accessible through linked open data. We can think of already existing data sources like WorldCat, OpenLibrary, Summon, Primo Central and the like. We can only hope that commercial bibliographic metadata aggregators like OCLC, SerialsSolutions and Ex Libris will come to realise that it’s in everybody’s interest to contribute to the realisation of the new Bibliographic Framework. The recent disagreement between OCLC and the Swedish National Library seems to indicate that this may take some time. For a detailed analysis of this see the blog post ‘Can linked library data disrupt OCLC? Part one’.

     

    An interesting initiative in this respect is LibraryCloud, an open, multi-library data service that aggregates and delivers library metadata. And there is the HBZ LOBID project, which is targeted at ‘the conversion of existing bibliographic data and associated data to Linked Open Data‘.

    So what would the new bibliographic framework look like? If we take the FRBR model as a starting point, the new framework could look something like this. See also my slideshow “Linked Open Data for libraries”, slides 39-42.

    The basic metadata about a publication or a unit of content, on the FRBR Work level, would be an entry in a global datastore identified by a URI (Uniform Resource Identifier). This datastore could for instance be WorldCat, or OpenLibrary, or even a publisher’s datastore. It doesn’t really matter. We don’t even have to assume it’s only one central datastore that contains all Work entries.

    The thing identified by the URI would have a text string field associated with it containing the original title, let’s say “The Da Vinci Code” as an example of a book. But articles can and should be identified this way too. The basic information we need to know about the Work would be attached to it using URIs to other things in the linked data web. Two things linked by a predicate (itself a URI) constitute a ‘triple’. ‘Author’ could for instance be a link to OCLC’s VIAF (http://viaf.org/viaf/102403515 = Dan Brown), which would then constitute a triple. If there are more authors, you simply add a URI for every person or institution. Subjects could be links to DBPedia/Wikipedia, Freebase, the Library of Congress Authority files, etc. There could be some more basic information, maybe a year, or a URI to a source describing the background of the work.
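    As a minimal sketch of what such a Work description could look like in practice, using rdflib: the Work URI, the subject link and the choice of Dublin Core properties are my own assumptions for the example; the VIAF URI is the one mentioned above.

```python
# Sketch of a Work-level description as a handful of triples. The Work URI
# and the subject link are illustrative; the VIAF URI is the one mentioned
# in the text. Property choice (Dublin Core) is an assumption.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

EX = Namespace("http://example.org/")          # stand-in for a global datastore
work = EX["work/the-da-vinci-code"]

g = Graph()
g.bind("dcterms", DCTERMS)
g.add((work, RDF.type, EX.Work))
g.add((work, DCTERMS.title, Literal("The Da Vinci Code", lang="en")))
g.add((work, DCTERMS.creator, URIRef("http://viaf.org/viaf/102403515")))   # Dan Brown
g.add((work, DCTERMS.subject, URIRef("http://dbpedia.org/resource/Holy_Grail")))
g.add((work, DCTERMS.date, Literal("2003")))

print(g.serialize(format="turtle"))
```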

    At the Expression level, a Dutch translation would have its own URI, stored in the same or another datastore. I could imagine that the publisher who commissioned the translation would maintain a datastore with this information. Attached to the Expression there would be the URI of the original Work, a URI pointing to the language, a URI identifying the translator and a text string containing the Dutch title, among others.

    Every individual edition of the work could have its own Manifestation-level URI, with a link to the Expression (in this case the Dutch translation), a publisher URI, a year, etc. For articles published according to the long standing tradition of peer reviewed journals, there would also be information about the journal. On this level there should also be URIs to the actual content when dealing with digital objects like articles, ebooks, etc., whether access is free or restricted.
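    Continuing the sketch for the Expression and Manifestation levels; the URIs and the linking properties (expressionOf, manifestationOf) are invented for illustration and not taken from any agreed ontology.

```python
# Sketch of Expression and Manifestation descriptions linked back to the Work.
# All URIs and the linking properties (expressionOf, manifestationOf) are
# invented for illustration; a real implementation would use an agreed ontology.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

EX = Namespace("http://example.org/")

work = EX["work/the-da-vinci-code"]
expression = EX["expression/de-da-vinci-code-nl"]                  # Dutch translation
manifestation = EX["manifestation/de-da-vinci-code-nl-2004-pbk"]   # one edition of it

g = Graph()
g.bind("dcterms", DCTERMS)

# Expression: the Dutch translation of the Work.
g.add((expression, RDF.type, EX.Expression))
g.add((expression, EX.expressionOf, work))
g.add((expression, DCTERMS.title, Literal("De Da Vinci Code", lang="nl")))
g.add((expression, DCTERMS.language, URIRef("http://id.loc.gov/vocabulary/iso639-2/dut")))  # illustrative language URI
g.add((expression, DCTERMS.contributor, EX["person/translator-1"]))  # the translator

# Manifestation: one particular edition of that translation.
g.add((manifestation, RDF.type, EX.Manifestation))
g.add((manifestation, EX.manifestationOf, expression))
g.add((manifestation, DCTERMS.publisher, EX["org/publisher-1"]))
g.add((manifestation, DCTERMS.date, Literal("2004")))

print(g.serialize(format="turtle"))
```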

    So far we have everything we need to know about publications “in the cloud”, or better: in a number of datastores available on a number of servers connected to the world wide web. This is more or less the situation described by OCLC’s Lorcan Dempsey in his recent post ‘Linking not typing … knowledge organization at the network level’. The only thing we need now is software to present all linked information to the user.

    No libraries in sight yet. For accessing freely available digital content on the web you actually don’t need a library, unless you need professional assistance finding the correct and relevant information. Here we have identified a possible role of librarians in this new networked information model.

    Now we have reached the interesting part: how to link local library data to this global shared model? We immediately discover that the original FRBR model is inadequate in this networked environment, because it implies a specific local library situation. Individual copies of a work (the Items) are directly linked to the Manifestation, because FRBR refers to the old local catalogue which describes only the works/publications one library actually owns.

    In the global shared library linked data network we need an extra explicit level to link physical Items owned by the library, or online subscriptions of the library, to the appropriate shared network level. I suggest using the “Holding” level. A Holding would have its own URI and contain the URIs of the Manifestation and of the Library. A specific Holding in this way would indicate that a specific library has one or more copies (Items) of a specific edition of a work (Manifestation), or offers access to an online digital article by way of a subscription.

     

    If a Holding refers to physical copies (print books or journal issues for instance) then we also need the Item level. An Item would have its own URI and the URI of its Holding. For each Item, extra information can be provided, for instance ‘availability’, ‘location’, etc. Local circulation administration data can be registered for all Holdings and Items. For online digital content we don’t need Items, only subscription information directly attached to the Holding.
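    And a sketch of the Holding and Item levels in the same spirit, again with invented URIs and property names.

```python
# Sketch of Holding and Item descriptions. A Holding ties a library to a
# Manifestation; Items are the physical copies. All URIs and property names
# are invented for the example.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")

manifestation = EX["manifestation/de-da-vinci-code-nl-2004-pbk"]
library = EX["library/uva"]
holding = EX["holding/uva-da-vinci-nl-2004"]
item = EX["item/uva-da-vinci-nl-2004-copy1"]

g = Graph()

# Holding: this library holds (copies of) this edition.
g.add((holding, RDF.type, EX.Holding))
g.add((holding, EX.holdingOf, manifestation))
g.add((holding, EX.heldBy, library))

# Item: one physical copy, with local availability and location data.
g.add((item, RDF.type, EX.Item))
g.add((item, EX.itemOf, holding))
g.add((item, EX.availability, Literal("available")))
g.add((item, EX.location, Literal("Main library, level 2")))

print(g.serialize(format="turtle"))
```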

    Local Holding and Item information can reside on local servers within the library’s domain or just as well on some external server ‘in the cloud’.

    It’s on the level of the Holding that usage statistics per library can be collected and aggregated, both for physical items and for digital material.

    Now, this networked linked library data model still allows libraries to present a local traditional catalogue type interface, showing only information about the library’s own print and digital holdings. What’s needed is software to do this using the local Holdings as entry level.

    But the nice thing about the model is that there will also be a lot of other options. It will also be possible to start at the other end and search all bibliographic metadata available in the shared global network, and then find the most appropriate library to get access to a specific publication, much like WorldCat does, but on an even larger scale.
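    As a sketch of what “starting at the other end” could look like, the query below walks from a Work down to the libraries that hold an available copy, over a tiny in-memory graph that uses the same invented vocabulary as the sketches above.

```python
# Sketch: given a Work, find libraries holding an available copy, by walking
# Work -> Expression -> Manifestation -> Holding -> Item. The data and the
# vocabulary are the invented examples from the earlier sketches.
from rdflib import Graph

DATA = """
@prefix ex: <http://example.org/> .

ex:work1          a ex:Work .
ex:expression1    a ex:Expression ;    ex:expressionOf ex:work1 .
ex:manifestation1 a ex:Manifestation ; ex:manifestationOf ex:expression1 .
ex:holding1       a ex:Holding ;       ex:holdingOf ex:manifestation1 ;
                  ex:heldBy ex:libraryA .
ex:item1          a ex:Item ;          ex:itemOf ex:holding1 ;
                  ex:availability "available" .
"""

g = Graph()
g.parse(data=DATA, format="turtle")

QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?library WHERE {
    ?expression    ex:expressionOf    ex:work1 .
    ?manifestation ex:manifestationOf ?expression .
    ?holding       ex:holdingOf       ?manifestation ;
                   ex:heldBy          ?library .
    ?item          ex:itemOf          ?holding ;
                   ex:availability    "available" .
}
"""

for row in g.query(QUERY):
    print(row.library)   # -> http://example.org/libraryA
```

    In a real setting the Holdings and Items would of course live in different stores, federated or synchronised, but the traversal logic stays the same.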

    Another nice thing of using triples, URIs and linked data, is that it allows for adding all kinds of other, non-traditional bibliographic links to the old inward looking library world, making it into a flexible and open model, ready for future developments. It will for instance be possible for people to discover links to publications and library holdings from any other location on the web, for instance a Wikipedia page or a museum website. And the other way around, from an item in local library holdings to let’s say a recorded theatre performance on YouTube.

    When this new data and metadata framework will be in place, there will be two important issues to be solved:

    • Getting new software, systems and tools for both back end administrative functions and front end information finding needs. For this we need efforts from traditional library systems vendors but also from developers in libraries.
    • Establishing future roles for libraries, librarians and information professionals in the new framework. This may turn out to be the most important issue.