Data. The final frontier.
  • Maps, dictionaries and guidebooks

    Posted on August 3rd, 2015 Lukas Koster 6 comments

    Interoperability in heterogeneous library data landscapes

    Maps, dictionaries, guidebooks

    Libraries have to deal with a highly opaque landscape of heterogeneous data sources, data types, data formats, data flows, data transformations and data redundancies, which I have earlier characterized as a “data maze”. The level and magnitude of this opacity and heterogeneity varies with the number of content types and the number of services that the library is responsible for. Academic and national libraries are possibly dealing with more extensive mazes than small public or company libraries.

    In general, libraries curate collections of things and also provide discovery and delivery services for these collections to the public. In order to successfully carry out these tasks they manage a lot of data. Data can be regarded as the signals between collections and services.

    These collections and services are administered using dedicated systems with dedicated datastores. The data formats in these dedicated datastores are tailored to perform the dedicated services that these dedicated systems are designed for. In order to use the data for delivering services they were not designed for, it is common practice to deploy dedicated transformation procedures, either manual ones or as automated utilities. These transformation procedures function as translators of the signals in the form of data.

    Here lies the origin of the data maze: an inextricably entangled mishmash of systems with explicit and implicit data redundancies using a number of different data formats, with some of these systems talking to each other in some way. This is not only confusing for end users but also for library system staff. End users lack clarity about which user interfaces to use, and are missing relevant results from other sources and possible related information. Libraries need licenses and expertise for ongoing administration, conversion and migration of multiple systems, and suffer unforeseen consequences of adjustments elsewhere.

    © Ron Zack

    To take the linguistic analogy further, systems make use of a specific language (data format) to code their signals in. This is all fine as long as they are only talking to themselves. But as soon as they want to talk to other systems that use a different language, translations are needed, as mentioned. Sometimes two systems use the same language (like MARC, DC, EAD), but this does not necessarily mean they can understand each other. There may be dialects (DANMARC, UNIMARC), local colloquialisms, differences in vocabularies and even alphabets (local fields, local codes, etc.). Some languages are only used by one system (like PNX for Primo). All languages describe things in their own vocabulary. In the systems and data universe there are not many loanwords or other mechanisms to make it clear that systems are talking about the same thing (no relations or linked data). And then there is syntax and grammar (such as subfields and cataloguing rules) that allow for lots of variations in formulations and formats.

    Translation does not only require applying a dictionary, but also interpretation of the context, syntax, local variations and transcriptions. Consequently much is lost in translation.

    The transformation utilities functioning as translators of the data signals suffer from a number of limitations. They translate between two specific languages or dialects only. And usually they are employed by only one system (proprietary utilities). So even if two systems speak the same language, they probably both need their own translator from a common source language. In many cases even two separate translators are needed if source and target system do not speak each other’s language or dialect. The source signals are translated to some common language which in turn is translated into the target language. This export-import scenario, which entails data redundancy across systems, is referred to as ETL (Extract Transform Load). Moreover, most translators only know a subset of the source and target language dependent on the data signals needed by the provided services. In some cases “data mappings” are used as conversion guides. This term does not really cover what is actually needed, as I have tried to demonstrate. It is not enough to show the paths between source and target signals. It is essential to add the selections and transformations needed as well. In order to make sense of the data maze you need a map, a dictionary and a guidebook.
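
    The difference between a bare “data mapping” and a description that also records selections and transformations can be sketched in a few lines of Python. This is only an illustration under assumptions: the class names (FieldRule, FlowDescription), the record layout and the example flow are hypothetical, and the MARC-style tags and Dublin Core names are used loosely as labels.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Record = dict[str, str]


def keep(value: str) -> str:
    """Default transformation: copy the value unchanged."""
    return value


@dataclass
class FieldRule:
    source: str                              # field in the source record
    target: str                              # field in the target record
    transform: Callable[[str], str] = keep   # how the value is rewritten en route


@dataclass
class FlowDescription:
    select: Callable[[Record], bool]         # which records enter the flow at all
    rules: list[FieldRule]                   # the paths plus their transformations

    def translate(self, record: Record) -> Optional[Record]:
        """Return the translated record, or None if the record is not selected."""
        if not self.select(record):
            return None
        return {
            rule.target: rule.transform(record[rule.source])
            for rule in self.rules
            if rule.source in record         # fields without a rule are lost
        }


# Hypothetical flow: books only, title punctuation stripped, year digits kept.
books_to_dc = FlowDescription(
    select=lambda rec: rec.get("type") == "book",
    rules=[
        FieldRule("245$a", "dc:title", lambda v: v.rstrip(" /:")),
        FieldRule("260$c", "dc:date", lambda v: "".join(ch for ch in v if ch.isdigit())),
    ],
)

source = {"type": "book", "245$a": "Data mazes /", "260$c": "c2015.", "590$a": "local note"}
print(books_to_dc.translate(source))
```

    In this sketch the local note field has no rule, so it silently disappears from the target record, which is the “lost in translation” effect in miniature.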

    To make things even more complicated, sometimes reading data signals is only possible with a passport or visa (authentication for access to closed data). Or even worse, when systems’ borders are completely closed and no access whatsoever is possible, not even with a passport. Usually, this last situation is referred to with the term “data silos”, but that is not the complete picture. If systems are fully open, but their data signals are coded by means of untranslatable languages or syntaxes, we are also dealing with silos.

    Anyway, a lot of attention and maintenance is required to keep this Tower of Babel functioning. This practice is extremely resource-intensive, costly and vulnerable. Are there any solutions available to diminish maintenance, costs and vulnerability? Yes there are.

    First of all, it is absolutely crucial to get acquainted with the maze. You need a map (or even an atlas) to be able to see which roads are there, which ones are inaccessible, what traffic is allowed, what shortcuts are possible, which systems can be pulled down and where new roads can be built. This role can be fulfilled by a Dataflow Repository, which presents an up-to-date overview of locations and flows of all content types and data elements in the landscape.

    Secondly it is vital to be able to understand the signals. You need a dictionary to be able to interpret all signals, languages, syntaxes, vocabularies, etc. A Data Dictionary describing data elements, datastores, dataflows and data formats is the designated tool for this.

    And finally it is essential to know which transformations are taking place en route. A guidebook should be incorporated in the repository, describing selections and transformations for every data flow.
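
    As a rough illustration of how map, dictionary and guidebook could live together in one repository, here is a minimal Python sketch. It is not the actual Dataflow Repository: the class names, attributes and the three example datastores are assumptions made up for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Datastore:
    name: str
    data_format: str                 # dictionary entry: the "language" spoken
    content_types: frozenset[str]    # what kinds of things are described here


@dataclass(frozen=True)
class Dataflow:
    source: str
    target: str
    content_type: str
    transformation: str              # guidebook entry: selections and transformations


@dataclass
class DataflowRepository:
    datastores: dict[str, Datastore] = field(default_factory=dict)
    flows: list[Dataflow] = field(default_factory=list)

    def add_store(self, store: Datastore) -> None:
        self.datastores[store.name] = store

    def add_flow(self, flow: Dataflow) -> None:
        # A flow may only connect datastores that are already on the map.
        for end in (flow.source, flow.target):
            if end not in self.datastores:
                raise ValueError(f"unknown datastore: {end}")
        self.flows.append(flow)

    def flows_into(self, target: str) -> list[Dataflow]:
        return [flow for flow in self.flows if flow.target == target]

    def redundant_content(self) -> dict[str, list[str]]:
        """Content types that are held in more than one datastore."""
        holders: dict[str, list[str]] = defaultdict(list)
        for store in self.datastores.values():
            for content_type in store.content_types:
                holders[content_type].append(store.name)
        return {ct: sorted(names) for ct, names in holders.items() if len(names) > 1}


# Hypothetical landscape with three datastores and two flows.
repo = DataflowRepository()
repo.add_store(Datastore("ILS", "MARC", frozenset({"books"})))
repo.add_store(Datastore("Repository", "DC", frozenset({"articles"})))
repo.add_store(Datastore("Discovery index", "PNX", frozenset({"books", "articles"})))
repo.add_flow(Dataflow("ILS", "Discovery index", "books", "select unsuppressed records; MARC to PNX"))
repo.add_flow(Dataflow("Repository", "Discovery index", "articles", "harvest all; DC to PNX"))

print(repo.redundant_content())
```

    Even a toy repository like this can answer the questions a map is for: where the same content type is stored more than once, and which flows and transformations feed a given system.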

    You could leave it there and be satisfied with these guiding tools to help you get around the existing data maze more efficiently, with all its ETL utilities and data redundancies. But there are other solutions that focus on actually tackling or even eliminating the translation problem. Basically we are looking at some type of Service Oriented Architecture (SOA) implementation. SOA is a rather broad concept, but it refers to an environment where individual components (“systems”) communicate with each other in a technology and vendor agnostic way using interoperable building blocks (“services”). In this definition “services” refer to reusable dataflows between systems, rather than to useful results for end users. I would prefer a definition of SOA to mean “a data and utilities architecture focused on delivering optimal end user services no matter what”.

    Broadly speaking there are four main routes to establish a SOA-like condition, all of which can theoretically be implemented on a global, intermediate or local level.

    1. Single Store/Single Format: A single universal integrated datastore using a universal data format. No need for dataflows and translations. This would imply some sort of linked (open) data landscape with RDF as universal language and serving all systems and services. A solution like this would require all providers of relevant systems and databases to commit to a single universal storage format. Unrealistic in the short term indeed, but definitely something to aim for, starting at the local level.
    2. Multiple Stores/Shared Format: A heterogeneous system and datastore landscape with a universal communication language (a lingua franca, like English) for dataflows. No need for countless translators between individual systems. This universal format could be RDF in any serialization. A solution like this would require all providers of relevant systems and databases to commit to a universal exchange format. Already a bit less unrealistic.
    3. Shared Store/Shared Format: A heterogeneous system and datastore landscape with a central shared intermediate integrated datastore in a single shared format. Translations from different source formats to only one shared format. Dataflows run to and from the shared store only. For instance with RDF functioning as Esperanto, the artificial language which is actually sometimes used as “Interlingua” in machine translation. A solution like this does not require a universal exchange format, only a translator that understands and speaks all formats, which is the basis of all ETL tools. This is much more realistic, because system and vendor dependencies are minimized, except for variations in syntax and vocabularies. The platform itself can be completely independent.
    4. Multiple Stores/Single Translation Pool: or what is known as an Enterprise Service Bus (ESB). No translations are stored, no data is integrated. Simultaneous point to point translations between systems happen on the fly. Looks very much like the existing data maze, but with all translators sitting together in one cubicle. This solution is not a source of much relief, or as one large IT vendor puts it: “Using an ESB can become problematic if large volumes of data need to be sent via the bus as a large number of individual messages. ESBs should never replace traditional data integration like ETL tools. Data replication from one database to another can be resolved more efficiently using data integration, as it would only burden the ESB unnecessarily.”
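
    Why a shared format (routes 2 and 3) brings relief while pooling point to point translators (route 4) does not can be made concrete with a little arithmetic. The sketch below is a worst-case simplification under stated assumptions: every system uses its own format, every system has to exchange data with every other one, and each direction needs its own translator. The function names are made up for this example.

```python
def point_to_point_translators(n_formats: int) -> int:
    """One translator per ordered pair of formats (A to B and B to A)."""
    return n_formats * (n_formats - 1)


def shared_format_translators(n_formats: int) -> int:
    """One translator into and one out of the shared format per source format."""
    return 2 * n_formats


for n in (3, 6, 12):
    print(n, point_to_point_translators(n), shared_format_translators(n))
```

    Under these assumptions three formats break even at six translators either way, but six formats already need 30 point to point translators against 12 with a shared format, and the gap widens quadratically from there.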

    Looking over the possible routes out of the data maze, it seems that the first step should be employing the map, dictionary and guidebook concept of the dataflow repository, data dictionary and transformation descriptions. After that the only feasible road in the short term is the intermediate integrated Shared Store/Shared Format solution.

  • Standard deviations in data modeling, mapping and manipulation

    Posted on June 16th, 2015 Lukas Koster 15 comments

    Or: Anything goes. What are we thinking? An impression of ELAG 2015


    Mapping pathways in Stockholm

    This year’s ELAG conference in Stockholm was one of many questions. Not only the usual questions following each presentation (always elicited in the form of yet another question: “Any questions?”). But also philosophical ones (Why? What?). And practical ones (What time? Where? How? How much?). And there were some answers too, fortunately. This is my rather personal impression of the event. For a detailed report on all sessions see Christina Harlow’s conference notes.

    The theme of the ELAG 2015 conference was: “DATA”. This immediately leads to the first question: “What is data?”. Or rather: “What do we mean by data?”. And of course: “Who is ‘we’?”.

    In the current information professional and library perception ‘we’ typically distinguish data created and used for describing stuff (usually referred to as ‘metadata’), data originating from institutions, processes and events (known as ‘usage data’, ‘big data’), and a special case of the latter: data resulting from scholarly research (indeed: ‘research data’). All of these three types were discussed at ELAG.

    It is safe to say however, that the majority of the presentations, bootcamps and workshops focused on the ‘descriptive data’ type. I try to avoid the use of the term ‘metadata’, because it is confusing, and superfluous. Just use ‘data’, meaning ‘artificial elements of information about stuff’. To be perfectly clear, ‘metadata’ is NOT ‘data about data’ as many people argue. It’s information about virtual entities, physical objects, information contained in these objects (or ‘content’), events, concepts, people, etc. We could only rightfully speak of ‘data about data’ in the special case of data describing (research) datasets. For this case ‘we’ have invented the job of ‘data librarian’, which is a completely nonsensical term, because this job is concerned with the storage, discoverability and obtainability of only one single object or entity type: research datasets. Maybe we should start using the job title ‘dataset librarian’ for this activity. But this seems a bit odd, right? On the other hand, should we replace the term ‘metadata librarian’ with ‘data librarian’? Also a bit odd. Data is at this moment in time what libraries and other information and knowledge institutions use to make their content findable and usable to the public. Let’s leave it at that.

    This brings us to the two fundamental questions of our library ecosystem: “What are we describing?” and the mother of all data questions: “Why are we describing?”, which were at the core of what in my eyes was this year’s key presentation (not keynote!) by Niklas Lindström of the Swedish Royal Library/LIBRIS. I needed some time to digest the core assertions of Niklas’ philosophical talk, but I am convinced that ‘we’ should all be aware of the essential ‘truths’ of his exposition.

    First of all: “Why are we describing?”. The objective of having a library in the first place is to provide access in any way possible to the objects in our collections, which in turn may provide access to information and knowledge. So in general we should be describing in order to enable our intended audience to obtain what they need in terms of the collection. Or should that be in terms of knowledge? In real life ‘we’ are describing for a number of reasons: because we follow our profession, because we have always done this, because we are instructed to do so, because we need guidance in our workflows, because the library is indispensable, because of financial and political reasons. In any case we should be clear about what our purposes are, because the purpose influences what we’re describing and how we do that.

    Secondly: “What are we describing?”. Physical objects? Semi-tangible objects, like digital publications? Only outputs of processes, or also the processes themselves? Entities? Concepts? Representations? Relationships? Abstractions? Events? Again, we should be clear about this.

    Thirdly (a Monty Python Spanish Inquisition moment ;-): “How are we describing?”. We use models, standards, formats, syntax, vocabularies in order to make maps (simplified representations of real world things) for reconciling differences between perceptions, bridging gaps between abstractions and guiding people to obtain the stuff they need. In doing so, Niklas says, we must adhere to Postel’s law, or the Robustness Principle, which states: “Be liberal in what you accept; be conservative in what you send”.
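
    Applied to descriptive data, Postel’s law could look something like the Python sketch below. It was not part of the talk: the function names and the choice of publication years as the example are assumptions, meant only to show “liberal in, conservative out”.

```python
import re

# A run of exactly four digits that is not part of a longer number.
_YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def accept_year(raw: str) -> int:
    """Liberal input: tolerate brackets, copyright marks, full dates, stray dots."""
    match = _YEAR.search(raw)
    if match is None:
        raise ValueError(f"no four-digit year found in {raw!r}")
    return int(match.group(1))


def send_year(year: int) -> str:
    """Conservative output: always exactly four digits, nothing else."""
    if not 0 <= year <= 9999:
        raise ValueError(f"year out of range: {year}")
    return f"{year:04d}"


for raw in ("2015", "c2015.", "[2015]", "2015-08-03"):
    print(repr(raw), "->", send_year(accept_year(raw)))
```

    Being liberal does not mean accepting anything: input without a recognisable year still raises an error instead of being passed on as a guess.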

    Back to the technology, and the day to day implementation of all this. ‘We’ use data to describe entities and relationships of whatever nature. We use systems to collect and store the data in domain, service and time dependent record formats in system dependent datastores. And we create flows and transformations of data between these systems in order to fulfill our goals. For all this there are multiple standards.

    Basically, my own presentation “Datamazed – Analysing library dataflows, data manipulations and data redundancies” targeted this fragmented data environment, describing the Library of the University of Amsterdam Dataflow Inventory project, leading to a Dataflow Repository, effectively functioning as a map of mappings. “System of Systems (SoS)” was also the topic of the workshop I participated in, “What is metadata management in net-centric systems?” by Jing Wang.

    Making sense of entities and relationships was the focus of a number of talks, especially the one by Thom Hickey about extending work, expression and author entities by way of data mining in Worldcat and VIAF, and the presentation by Jane Stevenson on the Jisc/Archives Hub project “exploring British Design”, which entailed shifting focus from documents to people and organizations as connected entities. Some interesting observations about the latter project: the project team started with identifying what the target audience actually wanted and how they would go about getting the desired information (“Why are we describing?”) in order to get to the entity based model (“What are we describing?”). This means that any entity identified in the system can become a focus, or starting point for pathways. A problem that became apparent is that the usual format standards for collection descriptions didn’t allow for events to be described.

    Here we arrive at the critique of standards that was formulated by Rurik Greenall in his talk about the Oslo Public Library ILS migration project, where they are migrating from a traditional ILS to RDF based cataloguing. Starting point here is: know what you need in order to support your actual users, not some idealised standard user, and work with a number of simple use cases (“Why are we describing?”). Use standards appropriate for your users and use cases. Don’t be rigid, and adapt. Use enough RDF to support use cases. Use just a part of the open source ILS Koha to support specific use cases that it can do well (users and holdings). Users and holdings are a closed world, which can be dealt with using a part of an existing system. Bibliographic information is an open world which can be taken care of with RDF. The data model corresponds to the use cases that are identified. It grows organically as needed. Standards are only needed for communicating with the outside world, but we must not let the standards infect our data model (here we see Postel’s Law again).

    A striking parallel can be distinguished with the Stockholm University Library project for integration of the Open Source ILS Koha, the Swedish LIBRIS Union Catalogue and a locally developed logistics and ILL system. Again, only one part of Koha is used for specific functions, mainly because with commercial ILSes it is not possible to purchase only individual modules. Integrated library systems, which seemed a good idea in the 1980’s, just cannot cope with fragmented open world data environments.

    Dedicated systems, like ILSes, either commercial or open source, tend to force certain standards upon you. These standards not only apply to data storage (record formats etc.) but also to system structure (integrated systems, data silos), etc. This became quite clear in the presentation about the CERN Open Data Portal, where the standard digital library system Invenio imposed the MARC bibliographic format for describing research datasets for high energy physics, which turned out to be difficult if not impossible. Currently they are moving towards using JSON (yet another data standard) because the system apparently supports that too.

    With Open Source systems it is easier to adapt the standards to your needs than with proprietary commercial vendor systems. An example of this was given by the University of Groningen Library project where the Open Source Publication Repository software EPrints was tweaked to enable storage and description of research datasets focused on archeological findings, which require very specific information.

    As was already demonstrated in the two ILS migration projects, deviation from standards of any kind can very easily be implemented. This is obviously not always the case. The locally developed Swedish Royal Library system for the legal deposit of electronic documents supports available suitable metadata standards like OAI, METS, MODS, PREMIS.

    For the OER World Map project, presented by Felix Ostrowski, we can safely say that no standards were followed whatsoever, except using the discovery data format for storing data, which is basically also an adaptation of a standard. Furthermore the original objective of the project was organically modified by using the data hub for all kinds of other end user services and visualisations than just the original world map of the location of Open Educational Resources.

    It should be clear that every adaptation of standards generates the need for additional mappings and transformations besides the ones already needed in a fragmented systems and data infrastructure for moving around data to various places for different services. Mapping and transformation of data can be done in two ways: manually, in the case of explicit, known items, and by mining, in the case of implicit, unknown items.

    Manual mapping and transformation is of course done by dedicated software. The manual part consists of people selecting specific source data elements to be transformed into target data elements. This procedure is known as ETL (Extract Transform Load), and implies the copying of data between systems and datastores, which always entails some form of data redundancy. Tuesday afternoon was dedicated to this practice with three presentations: Catmandu and Linked Data Fragments by Ruben Verborgh and Patrick Hochstenbach; COMSODE by Jindrich Mynarz; d:swarm by Thomas Gängler. Of these three the first one focused more on efficiently exposing data as Linked Open Data by using the Linked Data Fragments protocol. An important aspect of tools like this is that we can move our accumulated knowledge and investment in data transformation away from proprietary formats and systems to system and vendor independent platforms and formats.

    Data mining and text mining are used in the case of non explicit data about entities and relationships, where bodies of data and text are analyzed using dedicated algorithms in order to find implicit entities and relationships and make them explicit. This was already mentioned in Thom Hickey’s Worldcat Entity Extension project. It is also used in the InFoLiS2 project, where data and text mining is used to find relationships between research projects and scholarly publications.

    Another example was provided by Rob Koopman and Shenghui Wang of OCLC Research, who analyzed keywords, authors and journal titles in the ArticleFirst database to generate proximity visualizations for greater serendipity.

    As long as ‘we’ don’t or can’t describe these types of relationships explicitly, we have to use techniques like mining to extract meaningful entities and relationships and generate data. Even if we do create and maintain explicit descriptions, we will remain a closed world if we don’t connect our data to the rest of the world. Connections can be made by targeted mapping, as in the case of the Finnish FINTO Library Ontology service for the public sector, or by adopting an open world strategy making use of semantic web and linked open data instruments.

    Furthermore, ‘we’ as libraries should continuously ask ourselves the questions “Why and what are we describing?”, but also “Why are we here?”. Should we stick to managing descriptive data, or should we venture out into making sense of big data and research data, and provide new information services to our audience?

    Finally, I thank the local organizers, the presenters and all other participants for making ELAG2015 a smooth, sociable and stimulating experience.

  • Analysing library data flows for efficient innovation

    Posted on November 27th, 2014 Lukas Koster No comments

    In my work at the Library of the University of Amsterdam I am currently taking a step forward by actually taking a step back from a number of forefront activities in discovery, linked open data and integrated research information towards a more hidden, but also more fundamental enterprise in the area of data infrastructure and information architecture. All for a good cause, for in the end a good data infrastructure is essential for delivering high quality services in discovery, linked open data and integrated research information.
    In my role as library systems coordinator I have become more and more frustrated with the huge amounts of time and effort spent on moving data from one system to another and shoehorning one record format into the next, only to fulfill the necessary everyday services of the university library. Not only is it not possible to invest this time and effort productively in innovative developments, but this fragmented system and data infrastructure is also completely unsuitable for fundamental innovation. Moreover, information provided by current end user services is fragmented as well. Systems are holding data hostage. I have mentioned this problem before in a SWIB presentation. The issue was also recently touched upon in an OCLC Hanging Together blog post: “Synchronizing metadata among different databases”.

    Fragmented data (SWIB12)

    In order to avoid confusion in advance: when using the term “data” here, I am explicitly not referring to research data or any other specific type of data. I am using the term in a general sense, including what is known in the library world as “metadata”. In fact this is in line with the usage of the term “data” in information analysis and system design practice, where data modelling is one of the main activities. Research datasets as such are to be treated as content types like books, articles, audio and people.

    It is my firm opinion that libraries have to focus on making their data infrastructure more efficient if they want to keep up with the ever changing needs of their audience and invest in sustainable service development. For a more detailed analysis of this opinion see my post “(Discover AND deliver) OR else – The future of the academic library as a data services hub”. There are a number of different options to tackle this challenge, such as starting completely from scratch, which would require huge investments in resources for a long time, or implementing some kind of additional intermediary data warehouse layer while leaving the current data source systems and workflows in place. But for all options to be feasible and realistic, a thorough analysis of a library’s current information infrastructure is required. This is exactly what the new Dataflow Inventory project is about.

    The project is being carried out within the context of the short term Action Plans of the Digital Services Division of the Library of the University of Amsterdam, and specifically the “Development and improvement of information architecture and dataflows” program. The goal of the project is to describe the nature and content of all internal and external datastores and dataflows between internal and external systems in terms of object types (such as books, articles, datasets, etc.) and data formats, thereby identifying overlap, redundancy and bottlenecks that stand in the way of efficient data and service management. We will be looking at dataflows in both front and back end services for all main areas of the University Library: bibliographic, heritage and research information. Results will be a logical map of the library data landscape and recommendations for possible follow up improvements. Ideally it will be the first step in the Cleaning-Reconciling-Enriching-Publishing data chain as described by Seth van Hooland and Ruben Verborgh in their book “Linked Data for Libraries, Archives and Museums”.

    The first phase of this project is to decide how to describe and record the information infrastructure in such a form that the data map can be presented to various audiences in a number of ways, and at the same time can be reused in other contexts in the long run, for instance designing new services. For this we need a methodology and a tool.

    At the university library we do not have any thorough experience with describing an information infrastructure on an enterprise level, so in this case we had to start with a clean slate. I am not at all sure that we came up with the right approach in the end. I hope this post will trigger some useful feedback from institutions with relevant experience.

    Since the initial and primary goal of this project is to describe the existing infrastructure instead of a desired new situation, the first methodological area to investigate appears to be Enterprise Architecture (interesting to see that Wikipedia states “This article appears to contain a large number of buzzwords”). Because it is always better to learn from other people’s experiences than to reinvent all four wheels, we went looking for similar projects in the library, archive and museum universe. This proved to be rather problematic. There was only one project we could find that addresses a similar objective, and I happened to know one of the project team members. The Belgian “Digital library system’s architecture study” (English language report here) was carried out for the Flemish Public Library network Bibnet, by Rosemie Callewaert among others. Rosemie was kind enough to talk to me and explain the project objectives, approaches, methods and tools used. For me, two outcomes of this talk stand out: the main methodology used in the project is Archimate, which is an Enterprise Architecture methodology, and the approach is completely counter to our own approach: starting from the functional perspective as opposed to our overview of the actual implemented infrastructure. This last point meant we were still looking at a predominantly clean slate.
    Archimate also turned out to be the method of choice used by the University of Amsterdam central enterprise architecture group, whom we also contacted. It became clear that in order to use Archimate efficiently, it is necessary to spend a considerable amount of time on mastering the methodology. We looked for some accessible introductory information to get started. However the official Open Group Archimate website is not as accessible as desired in more than one way. We managed to find some documentation anyway, for instance the direct link to the Archimate specification and the free document “Archimate made practical”. After studying this material we found that Archimate is a comprehensive methodology for describing business, application and technical infrastructure components, but we also came to the conclusion that for our current short term project presentation goals we needed something that could be implemented fairly soon. We will keep Archimate in mind for the intermediate future. If anybody is interested, there is a good free open source modelling tool available, Archi. Other Enterprise Architecture methodologies like Business Process Modelling focus more on workflows than on existing data infrastructures. Turning to system design methods like UML (Unified Modelling Language) we see similar drawbacks.

    An obvious alternative technique to consider is Dataflow Diagramming (DFD) (what’s in a name?), part of the Structured Design and Structured Analysis methodology, which I had used in previous jobs as systems designer and developer. Although DFD’s are normally used for describing functional requirements on a conceptual level, with some tweaking they can also be used for describing actual system and data infrastructures, similar to the Archimate Application and Infrastructure layers. The advantage of the DFD technique is that it is quite simple. Four elements are used to describe the flow of information (dataflows) between external entities, processes and datastores. The content of dataflows and datastores can be specified in more detail using a data dictionary. The resulting diagrams are relatively easy to comprehend. We decided to start with using DFD’s in the project. All we had left to do was find a good and not too expensive tool for it.
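
    To give an idea of how little is needed, the four DFD elements fit in a handful of Python definitions. This is a sketch, not the tool or model used in the project: all names are invented for the example, and the check applies the conventional DFD rule that every dataflow starts or ends at a process, which a tweaked, infrastructure-oriented use of DFDs may well want to relax.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    EXTERNAL_ENTITY = "external entity"
    PROCESS = "process"
    DATASTORE = "datastore"


@dataclass(frozen=True)
class Node:
    name: str
    kind: Kind


@dataclass(frozen=True)
class Flow:                 # the fourth element: the dataflow itself
    source: Node
    target: Node
    content: str            # would be detailed further in the data dictionary


def invalid_flows(flows: list[Flow]) -> list[Flow]:
    """Flows that have no process at either end (conventional DFD rule)."""
    return [
        flow for flow in flows
        if Kind.PROCESS not in (flow.source.kind, flow.target.kind)
    ]


# Hypothetical fragment of a library landscape.
publisher = Node("Publisher", Kind.EXTERNAL_ENTITY)
load = Node("Load records", Kind.PROCESS)
catalogue = Node("Catalogue", Kind.DATASTORE)
index = Node("Discovery index", Kind.DATASTORE)

flows = [
    Flow(publisher, load, "bibliographic records"),
    Flow(load, catalogue, "bibliographic records"),
    Flow(catalogue, index, "bibliographic records"),   # store to store, no process
]

for flow in invalid_flows(flows):
    print(f"{flow.source.name} -> {flow.target.name}: no process involved")
```

    A store to store flow flagged like this usually points at an undocumented export, import or transformation utility that deserves its own process element in the diagram.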

    Basic DFD structure


    There are basically two types of tools for describing business processes and infrastructures: drawing tools, focusing on creating diagrams, and repository based modelling tools, focused on reusing the described elements. The best known drawing tool must be Microsoft Visio, because it is part of their widely used Office Suite. There are a number of other commercial and free tools, among them a free Google Drive extension. Although most drawing tools cover a wide range of methods and techniques, they don’t usually support reuse of elements with consistent characteristics in other diagrams. Also, diagrams are just drawings; they can’t be used for generating data definition scripts or basic software modules, for reverse engineering or for flexible reporting. Repository based tools can do all these things. Reuse, reporting, generating, reverse engineering and import and export features are exactly the features we need. We also wanted a tool that supports a number of other methods and techniques for employing in other areas of modelling, design and development. There are some interesting free or open source tools, like OpenModelSphere (which supports UML, ERD Data modelling and DFD), and a range of commercial tools. To cut a long story short we selected the commercial design and management tool Visual Paradigm because it supports a large number of methodologies with an extensive feature set in a number of editions for reasonable fees. An additional advantage is the online shared teamwork repository.

    After acquiring the tool we had to configure it the way we wanted to use it. We decided to try and align the available DFD model elements to the Archimate elements, so that in time it would be possible to move to Archimate if that proves to be a better method for future goals. Archimate has Business Service and Business Process elements on the conceptual business level, and Application Component (a “system”), Application Function (a “module”) and Application Service (a “function”) elements on the implementation level.

    Basic Archimate Structure


    In our project we will mainly focus on the application layer, but with relations to the business layer. Fortunately, the DFD method supports a hierarchical process structure by means of the decomposition mechanism, so the two hierarchical structures Business Service – Business Process and Application Component – Application Function – Application Service can be modeled using DFD. There is an additional direct logical link between a Business Process and the Application Service that implements it. By adding the “stereotypes” feature from the UML toolset to the DFD method in Visual Paradigm, we can effectively distinguish between the five process types (for instance by colour and attributes) in the DFD.
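
    The two hierarchies and the cross link between them can be sketched as a small tree structure. This is only an illustration of the idea, not how Visual Paradigm stores its model: the class and field names are my own assumptions, the stereotype labels simply reuse the Archimate element names, and the link from “Create item record” to “Describe item” is an assumed example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Process:
    """A DFD process tagged with a stereotype naming its Archimate element type."""
    name: str
    stereotype: str
    children: list[Process] = field(default_factory=list)
    # Assumed cross link: an Application Service points at the Business Process it implements.
    implements: Optional[Process] = None

    def add(self, child: Process) -> Process:
        """Attach a child process (DFD decomposition) and return it."""
        self.children.append(child)
        return child

    def walk(self, depth: int = 0):
        """Yield (depth, process) pairs, parents before children."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Business layer: Business Service -> Business Process
catalogue = Process("Catalogue", "Business Service")
describe = catalogue.add(Process("Describe item", "Business Process"))

# Application layer: Application Component -> Application Function -> Application Service
aleph = Process("Aleph", "Application Component")
aleph_cat = aleph.add(Process("Aleph CAT", "Application Function"))
create = aleph_cat.add(
    Process("Create item record", "Application Service", implements=describe)
)

outline = [(depth, p.stereotype, p.name) for depth, p in aleph.walk()]
```

    Walking the tree from `aleph` gives the three application levels in order, and `create.implements` leads back to the business process it supports.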

    Archimate DFD alignment


    So in our case, a DFD process with a “system” stereotype represents a top level Business Service (“Catalogue”, “Discover”, etc.) and a “process” process within “Cataloguing” represents an activity like “Describe item”, “Remove item”, etc. On the application level a “system” DFD process (Application Component) represents an actual system, like Aleph or Primo, a “module” (Application Function) a subsystem like Aleph CAT or Primo Harvesting, and a “function” (Application Service) an actual software function like “Create item record”.
    A DFD datastore is used to describe the physical permanent and temporary files or databases used for storing data. In Archimate terms this would probably correspond with a type of “Artifact” in the Technical Infrastructure layer, but that might be subject to interpretation.
    Finally an actual dataflow describes the data elements that are transferred between external entities and processes, between processes, and between processes and datastores, in both directions. In DFD, the data elements are defined in the data dictionary in the form of terms in a specific syntax that also supports optionality, selection and iteration, for instance:

    • book = title + (subtitle) + {author} + publisher + date
    • author = name + birthdate + (death date)
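
    To show how such terms can be read by a program, here is a minimal sketch that parses one definition into required, optional and repeated components. It is my own illustration, not a Visual Paradigm feature; the names `Component`, `parse_definition` and `missing_required` are hypothetical, and the selection construct is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    kind: str  # "required", "optional" (parentheses) or "repeated" (braces)


def parse_definition(line: str) -> tuple[str, list[Component]]:
    """Split 'term = a + (b) + {c}' into the term and its components."""
    term, _, body = line.partition("=")
    components = []
    for part in body.split("+"):
        part = part.strip()
        if part.startswith("(") and part.endswith(")"):
            components.append(Component(part[1:-1].strip(), "optional"))
        elif part.startswith("{") and part.endswith("}"):
            components.append(Component(part[1:-1].strip(), "repeated"))
        else:
            components.append(Component(part, "required"))
    return term.strip(), components


def missing_required(record: dict, components: list[Component]) -> list[str]:
    """Names of required components that are absent from a record."""
    return [c.name for c in components if c.kind == "required" and c.name not in record]


term, parts = parse_definition("book = title + (subtitle) + {author} + publisher + date")
gaps = missing_required({"title": "Data maze", "date": "2015"}, parts)
```

    Here `parts` marks “subtitle” as optional and “author” as repeated, and `gaps` reports that the sample record lacks a publisher.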

    In Archimate there is a difference in flows in the Business and Application layers. In the Business layer a flow can be specified by a Business Object, which indicates the object types that we want to describe, like “book”, “person”, “dataset”, “holding”, etc. The Business Object is realised as one or more Data Objects in the Application Layer, thereby describing actual data records representing the objects transferred between Application Services and Artifacts. In DFD there is no difference between a business flow and a data flow. In our project we particularly want to describe business objects in dataflows and datastores to be able to identify overlap and redundancies. But besides that we are also interested in differences in data structure used for similar business objects. So we do have to distinguish between business and data objects in the DFD model. In Visual Paradigm this can be done in a number of ways. It is possible to add elements from other methodologies to a DFD with links between dataflows or datastores and the added external elements. Data structures like this can also be described in Entity Relationship Diagrams, UML Class Diagrams or even RDF Ontologies.
    We haven’t decided on this issue yet. For the time being we will employ the Visual Paradigm Glossary tool to implement business and data object specifications using Data Dictionary terms. A specific business object (“book”) will be linked to a number of different dataflows and datastores, but the actual data objects for that one business object can be different, both in content and in format, depending on the individual dataflows and datastores. For instance a “book” Business Object can be represented in one datastore as an extensive MARC record, and in another as a simple Dublin Core record.
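
    The way one business object can fan out into different data objects can be sketched in a few lines. This is a toy illustration under my own assumptions: the datastore names are invented, and the real inventory will live in the modelling tool, not in code.

```python
from collections import defaultdict

# (datastore, business object, data object format); datastore names are hypothetical
inventory = [
    ("catalogue database", "book", "MARC"),
    ("discovery index", "book", "Dublin Core"),
    ("discovery index", "person", "Dublin Core"),
]


def datastores_per_business_object(rows):
    """Group the places where each business object is stored, with its format there."""
    index = defaultdict(list)
    for datastore, business_object, data_format in rows:
        index[business_object].append((datastore, data_format))
    return dict(index)


def redundant_objects(rows):
    """Business objects that are held in more than one datastore."""
    grouped = datastores_per_business_object(rows)
    return {obj: places for obj, places in grouped.items() if len(places) > 1}


overlap = redundant_objects(inventory)
```

    In this sample only “book” is flagged, once as a MARC record and once as a Dublin Core record, which is exactly the kind of redundancy the data map is meant to expose.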

    Example bibliographic dataflows


    After having determined method, tool and configuration, the next step is to start gathering information about all relevant systems, datastores and dataflows and describing this in Visual Paradigm. This will be done by invoking our own internal Digital Services Division expertise, reviewing applicable documentation, and most importantly interviewing internal and external domain experts and stakeholders.
    Hopefully the resulting data map will provide so much insight that it will lead to real efficiency improvements and really innovative services.

  • Looking for data tricks in Libraryland

    Posted on September 5th, 2014 Lukas Koster No comments

    IFLA 2014 Annual World Library and Information Congress Lyon – Libraries, Citizens, Societies: Confluence for Knowledge


    After attending the IFLA 2014 Library Linked Data Satellite Meeting in Paris I travelled to Lyon for the first three days (August 17-19) of the IFLA 2014 Annual World Library and Information Congress. This year’s theme “Libraries, Citizens, Societies: Confluence for Knowledge” was named after the confluence or convergence of the rivers Rhône and Saône where the city of Lyon was built.

    This was the first time I attended an IFLA annual meeting and it was very much unlike any conference I had attended before. Most of those are small and focused. The IFLA annual meeting is very big (but not as big as ALA) and covers a lot of domains and interests. The main conference lasts a week, including all kinds of committee meetings, and has more than 4000 participants and a lot of parallel tracks and very specialized Special Interest Group sessions. Separate Satellite Meetings are organized before the actual conference in different locations. This year there were more than 20 of them. These Satellite Meetings actually resemble the smaller and more focused conferences that I am used to.

    A conference like this requires a lot of preparation and organization. Many people are involved, but I especially want to mention the hundreds of volunteers who were present not only in the conference centre but also at the airport, the railway stations, on the road to the location of the cultural evening, etc. They were all very friendly and helpful.

    Another feature of such a large global conference is that presentations are held in a number of official languages, not only English. A team of translators is available for simultaneous translations. I attended a couple of talks in French, without translation headset, but I managed to understand most of what was presented, mainly because the presenters provided their slides in English.

    It is clear that you have to prepare for the IFLA annual meeting and select in advance a number of sessions and tracks that you want to attend. With a large multi-track conference like this it is not always possible to attend all interesting sessions. In the light of a new data infrastructure project I recently started at the Library of the University of Amsterdam I decided to focus on tracks and sessions related to aspects of data in libraries in the broadest sense: “Cloud services for libraries – safety, security and flexibility” on Sunday afternoon, the all day track “Universal Bibliographic Control in the Digital Age: Golden Opportunity or Paradise Lost?” on Monday and “Research in the big data era: legal, social and technical approaches to large text and data sets” on Tuesday morning.

    Cloud Services for Libraries

    It is clear that the term “cloud” is a very ambiguous term and consequently a rather unclear concept. Which is good, because clouds are elusive objects anyway.

    In the Cloud Services for Libraries session there were five talks in total. Kee Siang Lee of the National Library Board of Singapore (NLB) described the cloud based NLB IT infrastructure consisting of three parts: a private, a public and a hybrid cloud. The private (restricted access) cloud is used for virtualization, an extensive service layer for discovery, content, personalization, and “Analytics as a service”, which is used for pushing and recommending related content from different sources and of various formats to end users. This “contextual discovery” is based on text analytics technologies across multiple sources, using a Hadoop cluster on virtual servers. The public cloud is used for the Web Archive Singapore project which is aimed at archiving a large number of Singapore websites. The hybrid cloud is used for what is called the Enquiry Management System (EMS), where “sensitive data is processed in-house while the non-sensitive data resides in the cloud”. It seems that in Singapore “cloud” is just another word for a group of real or virtual servers.

    In the talk given by Beate Rusch of the German Library Network Service Centre for Berlin and Brandenburg KOBV the term “cloud” meant: the shared management of data on servers located somewhere in Germany. KOBV is one of the German regional Library Networks involved in the CIB project targeted at developing a unified national library data infrastructure. This infrastructure may consist of a number of individual clouds. Beate Rusch described three possible outcomes: one cloud serving as a master for the others, a data roundabout linking the other clouds, and a cross cloud dataspace where there is an overlapping shared environment between the individual clouds. An interesting aspect of the CIB project is that cooperation with two large commercial library system vendors, OCLC and Ex Libris, is part of the official agreement. This is of interest for other countries that have vested interests in these two companies, like The Netherlands.

    Universal Bibliographic Control in the Digital Age

    The Universal Bibliographic Control (UBC) session was an all day track with twelve very diverse presentations. Ted Fons of OCLC gave a good talk explaining the importance of the transition from the description of records to the modeling of entities. My personal impression lately is that OCLC all in all has been doing a good job with linked data PR, explaining the importance and the inevitability of the semantic web for libraries to a librarian audience without using technical jargon like URI, ontology, dereferencing and the like. Richard Wallis of OCLC, who was at the IFLA 2014 Linked Data Satellite Meeting and in Lyon, is spreading the word all over the globe.

    Of the rest of the talks the most interesting ones were given in the afternoon. Anila Angjeli of the National Library of France (BnF) and Andrew MacEwan of the British Library explained the importance, similarities and differences of ISNI and VIAF, both authority files with identifiers used for people (both real and virtual). Gildas Illien (also one of the organizers of the Linked Data Satellite Meeting in Paris) and Françoise Bourdon, both BnF, described the future of Universal Bibliographic Control in the web of data, which is a development closely related to the topic of the talks by Ted Fons, Anila Angjeli and Andrew MacEwan.

    The ONKI project, presented by the National Library of Finland, is a very good example of how bibliographic control can be moved into the digital age. The project entails the transfer of the general national library thesaurus YSA to the new YSO ontology, from libraries to the whole public sector and from closed to open data. The new ontology is based on concepts (identified by URIs) instead of monolingual text strings, with multilingual labels and machine readable relationships. Moreover the management and development of the ontology is now a distributed process. On top of the ontology the new public online Finto service has been made available.
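
    The difference between a text string and a concept can be made concrete with a small sketch. It is only an illustration: the URI is a made-up example.org address, not a real YSO identifier, the labels are sample values I chose myself, and the `Concept` class is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A concept identified by a URI, carrying labels in several languages."""
    uri: str
    labels: dict[str, str]                             # language code -> display label
    broader: list[str] = field(default_factory=list)   # URIs of broader concepts

    def label(self, lang: str, fallback: str = "en") -> str:
        """Return the label in the requested language, or the fallback language."""
        return self.labels.get(lang, self.labels[fallback])


libraries = Concept(
    uri="http://example.org/onto/c0001",  # hypothetical identifier
    labels={"fi": "kirjastot", "sv": "bibliotek", "en": "libraries"},
)

finnish = libraries.label("fi")
german = libraries.label("de")  # no German label, so the English one is used
```

    Records refer to the URI only, so a label can be corrected or a language added without touching any of the records that use the concept.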

    The final talk of the day “The local in the global: universal bibliographic control from the bottom up” by Gordon Dunsire applied the “Think globally, act locally” aphorism to Universal Bibliographic Control in the semantic web era. The universal top down control should make way for local bottom up control. There are so many old and new formats for describing information that we are facing a new biblical confusion of tongues: RDA, FRBR, MARC, BIBO, BIBFRAME, DC, ISBD, etc. What is needed are a number of translators between local and global data structures. On a logical level: Schema Translator, Term Translator, Statement Maker, Statement Breaker, Record Maker, Record Breaker. These black boxes are a challenge to developers. Indeed, mapping and matching of data of various types, formats and origins are vital in the new web of information age.


    Research in the big data era

    The Research in the big data era session had five presentations on essentially two different topics: data and text mining (four talks) and research data management (one talk). Peter Leonard of Yale University Library started the day with a very interesting presentation of how advanced text mining techniques can be used for digital humanities research. Using the digitized archive of Vogue magazine he demonstrated how the long term analysis of statistical distribution of related terms, like “pants”, “skirts”, “frocks”, or “women”, “girls”, can help visualise social trends and identify research questions. To do this there are a number of free tools available, like Google Books N-Gram Search and Bookworm. To make this type of analysis possible, researchers need full access to all data and text. However, rights issues come into play here, as Christoph Bruch of the Helmholtz Association, Germany, explained. What is needed is “intelligent openness” as defined by the Royal Society: data must be accessible, assessable, intelligible and usable. Unfortunately European copyright law stands in the way of the idea of fair use. Many European researchers are forced to perform their data analysis projects outside Europe, in the USA. The plea for openness was also supported by LIBER’s Susan Reilly. Data and text mining should be regarded as just another form of reading, that doesn’t need additional licenses.
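
    The long term term-distribution analysis mentioned above boils down to counting how often terms occur per period, relative to all words in that period. A toy sketch, assuming invented sample sentences (not Vogue data) and a hypothetical function name:

```python
import re
from collections import Counter


def relative_frequencies(texts_by_year, terms):
    """For each year, the share of all words taken up by each tracked term."""
    result = {}
    for year, text in sorted(texts_by_year.items()):
        words = re.findall(r"[a-z]+", text.lower())
        counts = Counter(words)
        total = len(words) or 1  # avoid division by zero for empty texts
        result[year] = {term: counts[term] / total for term in terms}
    return result


# Invented sample text, for illustration only
sample = {
    1950: "skirts and frocks and more skirts",
    1990: "pants and skirts and pants and pants",
}

trend = relative_frequencies(sample, ["pants", "skirts"])
```

    Plotting such shares over many decades is what makes a trend visible, and it only works when the full text is available for processing.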



    IdeasBox packed

    A very impressive and sympathetic library project that deserves everybody’s support was not an official programme item, but a bunch of crates, seats, tables and cushions spread across the central conference venue square. The whole set of furniture and equipment, which comes on two industrial pallets, constitutes a self supporting mobile library/information centre to be deployed in emergency areas, refugee camps etc. It is called IdeasBox, provided by Libraries without Borders. It contains mobile internet, servers, power supplies, ereaders, laptops, board games, books, etc., based on the circumstances, culture and needs of the target users and regions. The first IdeasBoxes are now used in Burundi in camps for refugees from Congo. Others will soon go to Lebanon for Syrian refugees. If librarians can make a difference, it’s here. You can support Libraries without Borders and IdeasBox in all kinds of ways.


    IdeasBox unpacked


    The questions about data management in libraries that I brought with me to the conference were only partly addressed, and actual practical answers and solutions were very rare. The management and mapping of heterogeneous and redundant types of data from all types of sources across all domains that libraries cover, in a flexible, efficient and system independent way apparently is not a mainstream topic yet. For things like that you have to attend Satellite Meetings. Legal issues, privacy, copyright, text and data mining, cloud based data sharing and management on the other hand are topics that were discussed. It turns out that attending an IFLA meeting is a good way to find out what is discussed, and more importantly what is NOT discussed, among librarians, library managers and vendors.

    The quality and content of the talks vary a lot. As always the value of informal contacts and meetings cannot be overrated. All in all, looking back I can say that my first IFLA has been a positive experience, not least because of the positive spirit and enthusiasm of all organizers, volunteers and delegates.

    (Special thanks to Beate Rusch for sharing IFLA experiences)

  • Library Linked Data Happening

    Posted on August 26th, 2014 Lukas Koster 2 comments

    LOD happening

    On August 14 the IFLA 2014 Satellite Meeting ‘Linked Data in Libraries: Let’s make it happen!’ took place at the National Library of France in Paris. Rurik Greenall (who also wrote a very readable conference report) and I had the opportunity to present our paper ‘An unbroken chain: approaches to implementing Linked Open Data in libraries; comparing local, open-source, collaborative and commercial systems’. In this paper we do not go into reasons for libraries to implement linked open data, nor into detailed technical implementation options. Instead we focus on the strategies that libraries can adopt for the three objectives of linked open data: original cataloguing/creating of linked data, exposing legacy data as linked open data and consuming external linked open data. Possible approaches are: local development, using Free and Open Source Software, participating in consortia or service centres, and relying on commercial vendors, or any combination of these. Our main conclusions and recommendations are: identify your business case, if you’re not big enough be part of some community, and take lifecycle planning seriously.

    The other morning presentations provided some interesting examples of a number of approaches we described in our talk. Valentine Charles presented the work in the area of aggregating library and heritage data from a large number of heterogeneous sources in different languages by two European institutions that de facto function as large consortia or service centres for exposing and enriching data, Europeana and The European Library. Both platforms not only expose their aggregated content in web pages for human consumption but also as linked open data, besides other so called machine readable formats. Moreover they enrich their aggregated content by consuming data from their own network of providers and from external sources, for instance multilingual “value vocabularies” like thesauri, authority lists, classifications. The idea is to use concepts/URIs together with display labels in multiple languages. For Europeana these sources currently are GeoNames, DBpedia and GEMET. Work is being done on including the Getty Art and Architecture Thesaurus (AAT) which was recently published as Linked Open Data. Besides using VIAF for person authorities, The European Library has started adding multilingual subject headings by integrating the Common European Research Classification Scheme, part of the CERIF format. The use of MACS (Multilingual Access to Subjects) as Linked Open Data is being investigated. This topic was also discussed during the informal networking breaks. Questions that were asked: is MACS valuable for libraries, who should be responsible for MACS and how can administering MACS in a Linked Open Data environment best be organized? Personally I believe that a multilingual concept based subject authority file for libraries, archives, museums and related institutions is long overdue and will be extremely valuable, not only in Linked Open Data environments.

    The importance of multilingual issues and the advantages that Linked Open Data can offer in this area were also demonstrated in the presentation about the Linked Open Authority Data project at the National Diet Library of Japan. The Web NDL Authorities are strongly connected to VIAF and LCSH among others.

    The presentation of the Linked Open Data environment of the National Library of France BnF highlighted a very interesting collaboration between a large library with considerable resources in expertise, people and funding on the one hand, and the non-library commercial IT company Logilab on the other. The result of this project is a very sophisticated local environment consisting of the aggregated data sources of the National Library and a dedicated application based on the free software tool Cubicweb. An interesting situation arose when the company Logilab itself asked if the developed applications could be released as Open Source by the National Library. The BnF representative Gildas Illien (also one of the organizers of the meeting together with Emmanuelle Bermes) replied with considerations about planning, support and scalability, which is completely understandable from the perspective of lifecycle planning.

    With all these success stories about exposing and publishing Linked Open Data, the question always remains if the data is actually used by others. It is impossible to incorporate this in project planning and results evaluation. Regarding the BnF data this question was answered in the presentation about Linked Open Data in the book industry. The Electre and Antidot project uses linked open data from among others

    The afternoon presentations were focused on creating, maintaining and using various data models, controlled vocabularies and knowledge organisation systems (KOS) as Linked Open Data: the EDM Europeana Data Model, UNIMARC, MODS. An interesting perspective was presented by Gordon Dunsire on versioning vocabularies in a linked data world. Vocabularies change over time, so an assignment of a URI of a certain vocabulary concept should always contain version information (like timestamps and/or version numbers) in order to be able to identify the intended meaning at the time of assigning.
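
    What recording version information with an assignment could look like is sketched below. This is my own illustration, not the mechanism proposed in the talk: the class and function names are hypothetical, and the URI, version numbers and release dates are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ConceptAssignment:
    """A concept URI assigned to a record, with the vocabulary version in force."""
    record_id: str
    concept_uri: str
    vocabulary_version: str
    assigned_on: date


def version_in_force(releases: dict[str, date], on: date) -> Optional[str]:
    """Latest vocabulary version released on or before the given date, if any."""
    candidates = [(released, version) for version, released in releases.items() if released <= on]
    return max(candidates)[1] if candidates else None


# Hypothetical release history of a vocabulary
releases = {"1.0": date(2012, 1, 1), "2.0": date(2014, 6, 1)}

assigned_on = date(2013, 5, 5)
assignment = ConceptAssignment(
    record_id="rec-42",                             # hypothetical record identifier
    concept_uri="http://example.org/vocab/c0001",   # hypothetical concept URI
    vocabulary_version=version_in_force(releases, assigned_on),
    assigned_on=assigned_on,
)
```

    With the version stored next to the URI, a later reader can look up what the concept meant in version 1.0, even if its definition changed in 2.0.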

    The meeting was concluded with a panel with representatives of three commercial companies involved in library systems and Linked Open Data developments: Ex Libris, OCLC and the afore-mentioned Logilab. The fact that this panel with commercial companies on library linked data took place was significant and important in itself, regardless of the statements that were made about the value and importance of Linked Open Data in library systems. After years of dedicated temporarily funded proof of concept projects this may be an indication that Linked Open Data in libraries is slowly becoming mainstream.

  • Roadmaps, roadblocks and data finding users

    Posted on June 19th, 2014 Lukas Koster 1 comment

    Lingering gold at ELAG 2014

    Locks in Bath


    Libraries tend to see themselves as intermediaries between information and the public, between creators and consumers of information. Looking back at the ELAG 2014 conference at the University of Bath however, I can’t get the image out of my head of libraries standing in the way between information and consumers. We’ve been talking about “inside out libraries”, “libraries everywhere”, “rethinking the library” and similar soundbites for some years now, but it looks like it’s been only talk and nothing more. A number of speakers at ELAG 2014 reported that researchers, students and other potential library visitors wanted the library to get out of their way and give them direct access to all data, files and objects. A couple of quotes:

    • “We hide great objects behind search forms” (Peter Mayr, “EuropeanaBot”)
    • “Give us everything” (Ben O’Steen, “The Mechanical Curator”).

    [Lingering gold: data, objects]
    In a cynical way this observation tightly fits this year’s conference theme “Lingering Gold”, which refers to the valuable information and objects hidden and locked away somewhere in physical and virtual local stores, waiting to be dug up and put to use. In her keynote talk, Stella Wisdom, digital curator at the British Library, gave an extensive overview of the digital content available there, and the tools and services employed to present it to the public. However, besides options for success, there are all kinds of pitfalls in attempting to bring local content to the world. In our performance “The Lord of the Strings”, Karen Coyle, Rurik Greenall, Martin Malmsten, Anders Söderbäck and I tried to illustrate that in an allegorical way, resulting in a ROADMAP containing guidelines for bringing local gold to the world.
    In recent years it has become quite clear that data, dispersed and locked away in countless systems and silos, once liberated and connected can be a very valuable source of new information. This was very pertinently demonstrated by Stina Johansson in her presentation of visualization of research and related networks at Chalmers University using available data from a number of their information systems. Similar network visualizations are available in the VIVO open source linked data based research information tool, which was the topic of a preconference bootcamp which I helped organize (many thanks especially to Violeta Ilik, Gabriel Birke and Ted Lawless who did most of the work).

    [Systems, apis, technology trap]
    The point made here also implies that information systems actually function as roadblocks to full data access instead of as finding aids. I came to realize this some time ago, and my perception was definitely confirmed during ELAG 2014. In his lightning talk Rurik Greenall emphasized the fact that what we do in libraries and other institutions is actually technology driven. Systems define the way we work and what we publish. This should be the other way around. Even APIs, intended for access to data in systems without having to use end user system functions, are actually sub-systems, giving non transparent views on the data. When Steve Meyer in his talk “Building useful and usable web services” said “data is the API” he was right in theory, yet in practice the reverse is not necessarily true. Also, APIs are meant to be used by developers in new systems. Non-tech end users have no use for them, as is illustrated by one of the main general reactions from researchers to the British Library Labs surveys, as reported by Ben O’Steen: “API? What’s that? I don’t care. Just give me the files.”

    Old technologies in new clothes


    [Commercial vs open source]
    This technology critique essentially applies to both commercial/proprietary and open source systems alike. However, it could be that open source environments are more favorable to open and findable data than proprietary ones. Felix Ostrowski talked about the reasons for and outcomes of the Regal project, moving the electronic objects repository of the State Library of Rheinland-Pfalz from an environment based on commercial software to one based on open source tools and linked data concepts. One of the side effects of this move was that complaints were received from researchers about their output being publicly available on the web. This shows that the new approach worked, that the old approach was effectively hiding information and that certain stakeholders are completely satisfied with that.
    On the side: one of the open source components of the new Regal environment is Fedora, only used for digital objects, not any metadata, which is exactly what is currently happening in the new repository project at the Library of the University of Amsterdam. A legitimate question asked by Felix: why use Fedora and not just the file system in this case?

    [Alternative ways]
    All these observations also imply that, if libraries really want to disseminate and share their lingering gold with the world, alternative ways of exposing content are needed, instead of or besides the existing ones. Fortunately some libraries and individuals have been working on providing better direct access and even unguided and unsolicited publication of data and objects that might be available but not really findable with traditional library search tools. The above mentioned EuropeanaBot (and other twitter bots) and the British Library Labs’ Mechanical Curator are a case in point. Every hour EuropeanaBot sends a tweet about a random digital object, enriching it with extra information from Wikipedia and other sources.
    In the case of the British Library Labs Ben O’Steen described an experiment with free access to large amounts of data that by chance led to the observation that randomly excavated images from that vast amount of content drew people’s attention. As all content was in the public domain anyway, they asked themselves “what’s the harm in making it a bit more accessible?”. So the Mechanical Curator was born, with channels on tumblr, twitter and flickr.
    Another alternative way to expose and share library content, a game, was presented by Ciaran Talbot and Kay Munro: LibraryGame. In brief, students are encouraged to use and visit the library and share library content with others by awarding them points and badges as members of an online community. The only two things students didn’t like about the name LibraryGame were “library” and “game”, so the name was changed to “BookedIn”.
    No matter if you like bots and games or not, the important message here is that it is worthwhile exploring alternative ways by which people can find the content that libraries consider so valuable.

    In the end, it’s people that libraries work for. At Utrecht University Library they realised that they needed simpler ways to make it possible for people to use their content, not only APIs. Marina Muilwijk described how they are experimenting with the Lean Startup method. In a continuous cycle of building, measuring and learning, simple applications are released to end users in order to test if they use them and how they react to them.
    “Focus on the user” was also the theme of the workshop given by Ken Chad around the Jobs-to-be-done methodology.
    Interestingly, “How people find” instead of: “How people search” was one of the perspectives of the Jisc “Spotlight on the Digital” project, presented by Owen Stephens in his lightning talk.

    [Collections and findability]
    Another perspective of that Jisc project was how to make collections discoverable. It turns out that collections as such are represented on the web quite well, whereas items in these collections aren’t.
    Valentine Charles of The European Library demonstrated the benefits of collection level metadata for the discoverability of hidden content, using the CENDARI project as example.

    [Linking data]
    What’s a library technology conference without linked data? Implicitly and explicitly the instrument of connecting data from different sources relates quite well to most of the topics presented around the theme of lingering gold, with or without the application of the official linked data rules. I have already mentioned most cases, so I will only go into a couple of specific sessions here.
    Niklas Lindström and Lina Westerling presented the developments with the new linked data based cataloguing system for the Swedish LIBRIS union catalogue. This approach is not simply a matter of exposing and consuming linked data, but in essence the reconstruction of existing workflows using a completely new architecture.
    The data management and integration platform d:swarm, a joint open source project of SLUB State and University Library Dresden and the commercial company AvantgardeLabs was presented in a lightning talk by Jan Polowinski. This tool aims at harvesting and normalising data from various existing systems and datastores into an intermediate platform that in turn can be used for all kinds of existing and new front end systems and services. The concept looks very useful for library environments with a multitude of legacy systems. Some time ago I visited the d:swarm team in Dresden together with a group of developers from the KOBV library consortium in Berlin, two of whom (Julia Goltz and Viktoria Schubert) presented their own new K2 portal solution for the data integration challenge in a lightning talk.

    Linked data is all about unique identifiers on the web. ORCiD, the recently popular global identifier for researchers, which was the topic of one of the workshops at last year’s ELAG, was explained by Tom Demeranville. As it happened, right after the conference it became clear that ORCiD had implemented the Turtle linked data format.
    The problem of matching string based personal names from various data sources without matching identifiers was tackled in the workshop “Linking Data with sameAs” which I attended. Jane and Adrian Stevenson of the ArchivesHub UK showed us hands-on how to use tools like LOD-Refine and Silk for reconciling string value data fields and producing “sameAs” relationships/triples to be used in your local triple store. They have had substantial experience with this challenge in their Linking Lives project. I found the workshop very useful. One of the take-aways was that matching string data is hard work.
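    To give an impression of what such reconciliation involves, here is a minimal Python sketch. It is not how LOD-Refine or Silk work internally; it only illustrates the idea of normalising name strings, scoring their similarity and emitting “sameAs” triples in N-Triples syntax. The example.org URIs, the sample names and the 0.9 threshold are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher
import unicodedata

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalise(name):
    """Lower-case, strip accents and punctuation, and sort the name parts,
    so that 'Austen, Jane' and 'Jane Austen' normalise to the same string."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in without_accents.lower())
    return " ".join(sorted(cleaned.split()))


def similarity(a, b):
    """Similarity between two names after normalisation, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def same_as_triples(local, external, threshold=0.9):
    """For each local URI, link to the best scoring external URI,
    but only if the score reaches the (assumed) threshold."""
    triples = []
    for local_uri, local_name in local.items():
        best_uri, best_score = None, 0.0
        for external_uri, external_name in external.items():
            score = similarity(local_name, external_name)
            if score > best_score:
                best_uri, best_score = external_uri, score
        if best_uri is not None and best_score >= threshold:
            triples.append(f"<{local_uri}> <{OWL_SAME_AS}> <{best_uri}> .")
    return triples


# Hypothetical data: local name strings and an external authority file.
local_names = {
    "http://example.org/local/person/1": "Austen, Jane",
    "http://example.org/local/person/2": "Brontë, Charlotte",
}
external_names = {
    "http://example.org/authority/a1": "Jane Austen",
    "http://example.org/authority/b2": "Charlotte Bronte",
    "http://example.org/authority/c3": "Charles Dickens",
}

triples = same_as_triples(local_names, external_names)
for triple in triples:
    print(triple)
```

    Even this toy version shows where the hard work is: initials, pseudonyms, missing dates and namesakes are all ignored here, and every match above the threshold would still need human review before going into a triple store.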

    Hard work also goes on in the caves and basements of the library world, as was demonstrated by Toke Eskildsen in his war stories of the Danish State Library with scanning companies, and by Eva Dahlbäck and Theodor Tolstoy in their account of using smartphones and RFID technology in fetching books from the stacks.

    Once again I have to say that a number of unofficial sessions, at breakfast, dinner, in pubs and hotel bars, were much more informative than the official presentations. These open discussions in small groups, fostering free exchange of ideas without fear of embarrassment, while being triggered by the talks in the official programme, can simply not take place within a tight conference schedule. Nevertheless, ELAG is a conference small and informal enough to attract people inclined to these extracurricular activities. I thank everybody who engaged in this. You know who you are. Or check Rurik Greenall’s conference report, which is a very structured yet personal account of the event.

    Pub talk

    Lots of thanks to the dedicated and very helpful local organisation team of the Library of the University of Bath, who have done a wonderful job doing something completely new to them: organising an international conference.

  • Linked data or die!

    Posted on December 1st, 2013 Lukas Koster 1 comment

    Struggling towards usable linked data services at SWIB13


    Paraphrasing some of the challenges proposed by keynote speaker Dorothea Salo, the unofficial theme of the SWIB13 conference in Hamburg might be described as “No more ontologies, we want out of the box linked data tools!”. This sounds like we are dealing with some serious confrontations in the linked open data world. Judging by Martin Malmsten’s LIBRIS battle cry “Linked data or die!” you might even think there’s an actual war going on.

    Looking at the whole range of this year’s SWIB pre-conference workshops, plenary presentations and lightning talks, you may conclude that “linked data is a technology that is maturing” as Rurik Greenall rightly states in his conference report. “But it has quite a way to go before we can say this stuff is ready to roll out in libraries” as he continues. I completely agree with this. Personally I got the impression that we are in a paradoxical situation where on the one hand people speak of “we” and “community”, and on the other hand they take fundamentalist positions, unconditionally defending their own beliefs and slandering and ridiculing other options. In my view there are multiple, sometimes overlapping, sometimes irreconcilable “we’s” and “communities”. Sticking to your own point of view without willingness to reason with the other party really does not bring “us” further.

    This all sounds a bit grim, but I again agree with Rurik Greenall when he says that he “enjoyed this conference immensely because of the people involved”. And of course on the whole the individual workshops and presentations were of a high quality.

    Before proceeding to the positive aspects of the conference, let me first elaborate a bit on the opposing positions I observed during the conference, which I think we should try to overcome.

    Developers disagree on a multitude of issues:
    Formats
    Developers hate MARC. Everybody seems to hate RDF/XML, JSON-LD seems to be the thing for RDF, but some say only Turtle should be used, or just JSON.
    Tools and languages
    Perl users hate Java, Java users hate PHP, there’s Python and Ruby bashing.
    Ontologies
    Create your own, reuse existing ones, yes or no upper ontologies, no ontologies but usable tools.
    Operating systems
    Windows/UNIX/Linux/Apple… it’s either/or.
    Open source vs. commercial software
    Need I say more?
    Beer
    Belgians hate German beer, or any foreign beer for that matter.
    (Not to mention PDF).

    OK, I hope I made myself clear. The point is that I have no problem at all with having diverse opinions, but I dislike it when people are convinced that their own opinion is the only right one and refuse to have a conversation with those who think otherwise, or even respect their choices in silence. The developer “community” definitely has quite a way to go.

    Apart from these internal developer disagreements I noticed, there is the more fundamental gap between developers and users of linked open data. By “users” I do not mean “end users” in this case, but the intermediary deployers of systems. Let’s call them “libraries”.
    Linked Data developers talk about tools and programming languages, metadata formats, open source, ontologies, technology stacks. Librarians want to offer useful services to their end users, right now. They may not always agree on what kind of services and what kind of end users, and they may have an opinion on metadata formats in systems, but their outlook is slightly different from the developers’ horizon. It’s all about expectations and expectation management. That is basically Dorothea Salo’s keynote’s point. Of course theoretical, scientific and technical papers and projects are needed to take linked data further, but libraries need linked data tools, focused on providing new services to their end users/customers in the world of the web, that can easily be implemented and maintained.
    In this respect OCLC’s efforts to add linked data features to WorldCat are praiseworthy. OCLC’s Technology Evangelist Richard Wallis presented his view on the benefits of linked open data for libraries, using Google’s Knowledge Graph as an example. His talk was mainly aimed at a librarian audience. At SWIB, where the majority of attendees are developers or technology staff, this seemed somewhat misplaced. By chance I had been present at Richard’s talk at the Dutch National Information Professional annual meeting two weeks earlier, where he delivered almost the same presentation to a large room full of librarians. There and then it was completely on target. For the SWIB audience this all may have been old news, except for the heads-up about OCLC’s work on FRBR “Works” BIBFRAME type linked data which will result in published URIs for Works in WorldCat.
    An important point here is that OCLC is a company with many library customers worldwide, so developments like this benefit all of these libraries. The same applies to customers of one of the other big library system vendors, Ex Libris. They have been working on developing linked data features for their so called “next generation” tools for some time now, in close cooperation with the international user groups’ Linked Open Data Special Interest Working Group, as I explained in the lightning talk I gave. Also open source library systems like Koha are working on adding linked open data features to their tools. It’s with tools like these, that reach a large number of libraries, that linked open data for libraries can spread relatively quickly.

    In contrast to this linked data broadcasting, the majority of the SWIB presentations showed local proprietary development or research projects, mostly of high quality notwithstanding. In the case of systems or tools that were built all the code and ontologies are available on GitHub, making them open source. However, while it is commendable, open source on GitHub doesn’t mean that these potentially ground breaking systems and ontologies can and will be adopted as de facto standards in the wider library community. Most libraries, both public and academic, are dependent on commercial system and content providers and can’t afford large scale local system development. This also applies up to a point to libraries that deploy large open source tools like Koha, I presume.
    It would be great if some of these many great open source projects could evolve into commonly used standard tools, like Koha, Fedora and Drupal, just to name a few. Vivo is another example of an open source project rapidly moving towards an accepted standard. It is a framework for connecting and publishing research information of different nature and origin, based on linked data concepts. At SWIB there was a pre-conference “VivoCamp”, organised by Lambert Heller, Valeria Pesce and myself. Research information is an area rapidly gaining importance in the academic world. The Library of the University of Amsterdam, where I work, is in the process of starting a Vivo pilot, in which I am involved. (Yes, the Library of the University of Amsterdam uses both commercial providers like OCLC and Ex Libris, and many open source tools). The VivoCamp was a good opportunity to have a practical introduction to and discussion about the framework, not least because of the presence of John Fereira of Cornell University, one of the driving forces behind Vivo. All attendees (26) expressed their interest in a follow-up.
    Vivo, although it may be imperfect, represents the type of infrastructure that may be needed for large scale adoption of linked open data in libraries. PUB, the repository based linked data research information project at Bielefeld University presented by Vitali Peil, is aimed at exactly the same domain as Vivo, but it again is a locally developed system, using another smaller scale open source framework (LibreCat/Catmandu of Bielefeld, Ghent and Lund universities) and a number of different ontologies, of which Vivo is just one. My guess is that, although PUB/LibreCat might be superior, Vivo will become the de facto standard in linked data based research information systems.

    Instead of focusing on systems, maybe the library linked data world would be better served by a common user-friendly metadata+services infrastructure. Of course, the web and the semantic web are supposed to be that infrastructure, but in reality we all move around and process metadata all the time, from one system and database to another, in order to be able to offer new legacy and linked data services. At SWIB there was mention of a number of tools for ETL, which is developer jargon for Extract, Transform, Load. By the way, jargon is a very good way to widen the gap between developers and libraries.
    There were pre-conference workshops for the ETL tools Catmandu and Metafacture, and in a lightning talk SLUB Dresden, in collaboration with Avantgarde Labs, presented a new project focused on using ETL for a separate multi-purpose data management platform, serving as a unified layer between external data sources and services. This looks like a very interesting concept, similar to the ideas of a data services hub I described in an earlier post “(Discover AND deliver) OR else”. The ResourceSync project, presented by Simeon Warner, is trying to address the same issue by a different method, distributed synchronisation of web resources.
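    For readers on the library side of the jargon gap, the three ETL steps can be sketched in a few lines of Python. This is only an illustration of the pattern, not of how Catmandu, Metafacture or the SLUB platform are built. The pipe-delimited export, the field names and the sample records are hypothetical.

```python
import csv
import io

# Hypothetical export from a legacy system (made-up records).
LEGACY_EXPORT = """id|title|author|year
001|An example book |Doe, Jane|2001
002|Another example book|Roe, Richard|2005
"""


def extract(raw):
    """Extract: read the rows from the source system's export."""
    return list(csv.DictReader(io.StringIO(raw), delimiter="|"))


def transform(row):
    """Transform: map a source row onto the target record structure."""
    family, _, given = row["author"].partition(",")
    return {
        "identifier": "legacy:" + row["id"],
        "title": row["title"].strip(),
        "creator": f"{given.strip()} {family.strip()}",
        "issued": int(row["year"]),
    }


def load(records, store):
    """Load: write the transformed records into the target store
    (here simply a dictionary keyed by identifier)."""
    for record in records:
        store[record["identifier"]] = record
    return store


store = load((transform(row) for row in extract(LEGACY_EXPORT)), {})
for identifier, record in store.items():
    print(identifier, record)
```

    In practice each source system needs its own extract and transform step, which is exactly why a shared intermediate platform is attractive: the load side and everything after it only have to be built once.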

    One can say that the BIBFRAME project is also focused on data infrastructure, albeit at the moment limited to the internal library cataloguing workflow, aimed at replacing MARC. An overview of the current state of the project was presented by Lars Svensson of the German National Library.
    The same can be said for the National Library of Sweden’s new LIBRIS linked data based cataloguing system, presented by Martin Malmsten (Decentralisation, Distribution, Disintegration – towards Linked Data as a First Class Citizen in Libraryland). The big difference is that they’re actually doing what BIBFRAME is trying to plan. The war cry “Linked data or die!” refers to the fact that it is better to start from scratch with a domain and format independent data infrastructure, like linked data, than to try and build linking around existing rigid formats like MARC. Martin Malmsten rightly stated that we should keep formats outside our systems, as is also the core statement of the MARC-MUST-DIE movement. Proprietary formats can be dynamically imported and exported at will, as was demonstrated by the “MARC” button in the LIBRIS user interface. New library linked data developments will have to coexist with the existing wider library metadata and systems environment for some time.
    Like all other local projects, the LIBRIS source code and ontology descriptions are available on GitHub. In this case the sheer scope of the National Library of Sweden and of the project makes it a bit more plausible that this may actually be reused on a larger scale. At least the library cataloguing ontology in JSON-LD there is worth having a look at.
    To return to our starting point, the LIBRIS project acknowledges the fact that we need actual tools besides the ontologies. As Martin Malmsten quoted: “Trying to sell the idea of linked data without interfaces is like trying to sell a fax without the invention of paper”.


    The central question in all this: what is the role of libraries in linked data? Developers or implementers, individually or in a community? There is obviously not one answer. Maybe we will know more at SWIB14. Paraphrasing Fabian Steeg and Pascal Christoph of hbz and Dorothea Salo, next year’s theme might be “Out of the box data knitting for great justice”.

  • The poor person’s linked open data workbench

    Posted on November 11th, 2013 Lukas Koster 4 comments

    Using discovery tools for presenting integrated information

    There has been a lot of discussion in recent years about library discovery tools. Basically, a library discovery tool provides a centrally maintained shared scholarly material metadata index, a system for searching and an option for adding a local metadata index. Academic libraries use it for providing a unified access platform to subscribed and open access databases and ejournals as well as their own local print and digital holdings.


    © vlashton

    I would like to put forward that, despite their shortcomings, library discovery tools can also be used for finding and presenting other scholarly information in the broadest sense. Libraries should look beyond the narrow focus on limitations and turn imperfection into benefits.

    The two main points of discussion regarding discovery tools are the coverage of the central shared index and relevance ranking. For a number of reasons of a practical, technical and competitive nature, none of the commercial central indexes cover all the content that academic libraries may subscribe to. Relevance ranking of search results depends on so many factors that it is a science in itself to satisfy each and every end user with their own specific background and context. Discovery tool vendors spend a lot of energy in improving coverage and relevance ranking.

    These two problems are the reason that not many academic libraries have been able to achieve the one-stop unified scholarly information portals for their staff and students that discovery tool providers promised them. In most cases the institutional discovery portal is just one of the solutions for finding scholarly publications that are offered by the library. A number of libraries are reconsidering their attitude towards discovery tools, or have even decided to renounce these tools altogether and focus on delivery instead, leaving discovery to external parties like Google Scholar.


    © derletzteschrei

    I fully support the idea that libraries should reconsider their attitude towards discovery tools, but I would like to stress that they should do so with a much broader perspective than just the traditional library responsibility of providing access to scholarly publications. Libraries must not throw away the baby with the bathwater. They should realise that a discovery tool can be used as a platform for presenting connected scholarly information, for instance publications with related research project information and research datasets, based on linked open data principles. You could call this the “poor person’s linked open data platform”, because the library has already paid the license fee for the discovery platform, and it does not have to spend a lot of extra money on additional linked open data tools and facilities.

    Of course this presupposes a number of things: the content to be connected should have identifiers, preferably in the form of URIs, and should be openly available for reuse, preferably via RDF. The discovery tools should be able to process URIs and RDF and present the resolved content in their user interfaces. We all know that this is not the case yet. Long term strategies are needed.

    Content providers must be convinced of the added value of adding identifiers and URIs to their metadata and providing RDF entry points. In the case of publishers of scholarly publications this means identifiers/URIs for the publications themselves, but also for authors, contributors, organisations, related research projects and datasets. A number of international associations and initiatives are already active in lobbying for these developments: OpenAIRE, Research Data Alliance, DataCite, the W3C Research Object for Scholarly Communication Community Group, etc. Universities themselves can contribute by adding URIs and RDF to their own institutional repositories and research information systems. Some universities are implementing special tools for providing integrated views on research information based on linked data, such as VIVO.
    There are also many other interesting data sources that can be used to integrate information in discovery tools, for instance in the government and cultural heritage domain. Many institutions in these areas already provide linked open data entry points. And then there is Wikipedia with its linked open data interface DBpedia.

    On the other side of the scale discovery tool providers must be convinced of the added value of providing procedures for resolving URIs and processing RDF in order to integrate information from internal and external data sources into new knowledge. I don’t know of any plans for implementing linked open data features in any of the main commercial or open source discovery tools, except for Ex Libris Primo. OCLC provides a linked data section for each WorldCat search result, but that is mainly focused on publishing their own bibliographic metadata in linked data format, using links to external subject and author authority files. This is a positive development, but it’s not consumption and reuse of external information in order to create new integrated knowledge beyond the bibliographic domain.

    With the joint IGeLU/ELUNA Linked Open Data Special Interest Working Group the independent Ex Libris user groups have been communicating with Ex Libris strategy and technology management on the best ways to implement much needed linked open data features in their products. The Primo discovery tool (with the Primo Central shared metadata index) is one of the main platforms in focus. Ex Libris is very keen on getting actual use cases and scenarios in order to identify priorities in going forward. We have been providing these for some time now through publications, presentations at user group conferences, monthly calls and face to face meetings. Ex Libris is also exploring best practices for the technical infrastructure to be used and is planning pilots with selected customers.

    While this may take some time to mature, in the meantime libraries that have access to their discovery tool’s back office and user interface HTML files can start experimenting with and implementing add-ons for integration of the tool’s metadata index with external information. This should be possible with open source discovery tools like VuFind and local or hosted installations with back office access of commercial products. The only commercial product that offers that option, as far as I know, is Primo. Creating local linked open data add-ons can be done by applying a combination of manipulation of local index metadata fields, JavaScript/jQuery in the front end HTML and the use of any open APIs available for the tool.
    The Austrian national library service OBVSG for instance has integrated Wikipedia/DBpedia information about authors in their Primo results.
    The Saxon State and University Library Dresden (SLUB) has implemented a multilingual semantic search tool for subjects based on DBpedia in their Primo installation.
    At the University of Amsterdam I have been experimenting myself with linking publications from our Institutional Repository (UvA DARE) in Primo with related research project information. This has for now resulted in adding extra external links to that information in the Dutch National Research portal NARCIS, because NARCIS doesn’t provide RDF yet. We are communicating with DANS, the NARCIS provider, about extending their linked open data features for this purpose.
    Of course all these local implementations can serve as use cases for discovery tool providers.
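    As an illustration of the kind of lookup such an add-on performs, here is a minimal Python sketch of fetching an author description from DBpedia. A real Primo or VuFind add-on would do this in JavaScript in the browser, and I do not know how the OBVSG or SLUB implementations actually work. The sketch assumes that the author’s DBpedia URI is already available in a local index field, and that the public DBpedia SPARQL endpoint accepts the query and format parameters as shown. The sample response is hand-written in the standard SPARQL JSON results layout, not real DBpedia output.

```python
import json
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint (parameter names assumed, check before use).
DBPEDIA_SPARQL = "https://dbpedia.org/sparql"


def author_abstract_url(author_uri, lang="en"):
    """Build the request URL for the abstract of one DBpedia resource."""
    query = (
        "SELECT ?abstract WHERE { "
        f"<{author_uri}> <http://dbpedia.org/ontology/abstract> ?abstract . "
        f'FILTER (lang(?abstract) = "{lang}") '
        "} LIMIT 1"
    )
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_SPARQL + "?" + urlencode(params)


def first_abstract(sparql_json):
    """Pick the first abstract from a SPARQL JSON result, or None if empty."""
    bindings = json.loads(sparql_json)["results"]["bindings"]
    return bindings[0]["abstract"]["value"] if bindings else None


# Hand-written sample response, for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["abstract"]},
    "results": {"bindings": [
        {"abstract": {"type": "literal", "xml:lang": "en",
                      "value": "Example abstract text."}}
    ]},
})

url = author_abstract_url("http://dbpedia.org/resource/Niklas_Luhmann")
print(url)
print(first_abstract(SAMPLE_RESPONSE))
```

    The add-on would then insert the returned text next to the author name in the result display. The weak spot is the first assumption: getting reliable author URIs into the index is the real work.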

    I have only talked about the options of using discovery tools as a platform for consuming, reusing and presenting external linked open data, but I can imagine that a discovery tool can also be used as a platform for publishing linked open data. It shouldn’t be too hard to add extra RDF options besides the existing HTML and internal record format output formats. That way libraries could have a full linked open data consumption and publishing workbench at their disposal at minimal cost. Library discovery tools would from then on be known as information discovery tools.
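    To make the publishing side a bit more concrete, here is a minimal Python sketch of an extra RDF output option: a flat record from the index is written out as N-Triples, using Dublin Core terms as predicates. This is an assumption about how it could be done, not a feature of any existing discovery tool. The record, its field names and the example.org URIs are hypothetical, and the sketch assumes every field name is a valid Dublin Core term.

```python
DCTERMS = "http://purl.org/dc/terms/"


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def record_to_ntriples(record_uri, record):
    """Serialise a flat record as N-Triples, one triple per field.
    Values that look like HTTP URIs become resources, the rest literals."""
    lines = []
    for field, value in record.items():
        predicate = f"<{DCTERMS}{field}>"
        if value.startswith(("http://", "https://")):
            obj = f"<{value}>"
        else:
            obj = f'"{escape_literal(value)}"'
        lines.append(f"<{record_uri}> {predicate} {obj} .")
    return "\n".join(lines)


# Hypothetical record as it might come out of a local index.
record = {
    "title": 'A "quoted" title',
    "creator": "http://example.org/person/1",
    "issued": "2013",
}

ntriples = record_to_ntriples("http://example.org/record/1", record)
print(ntriples)
```

    The creator field shows why identifiers matter: only when the index holds a URI instead of a name string does the output actually link to anything.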

  • PhD, the final frontier – Part two: Reconnaissance

    Posted on November 9th, 2013 Lukas Koster 2 comments

    Tools and methods for PhD research and writing

    After my decision was made to attempt a scholarly career and aim for a PhD (see my first post in this series), I started writing a one page proposal. When I finished I had a working title, a working subtitle, a table of contents with 6 chapter titles, and a very general and broad text about some contradictions between the state of technology and the integration of information. I can’t really go into any details, because the first thing you learn when you aim for a scholarly publication is to keep your subject to yourself as much as possible in order to prevent someone else from hijacking it and getting there first.


    I used Google Docs/Drive for writing this proposal. I have been using Google Docs for a couple of years for all my writing, both personal and professional, because that way I have access to my documents wherever I go on any device I want, I can easily share and collaborate, and it keeps previous versions.

    I shared my proposal with my advisor Frank Huysmans, using the Google Docs sharing options. We initially communicated about it through Twitter DMs, Google Docs comments and email. Frank’s first reaction was that it looked like a good starting point for further exploration, and he saw two related perspectives to explore: Friedrich Kittler’s media ontology and Niklas Luhmann’s system theory. I had studied a book by Luhmann on the sociology of legal trials during my years in university, but he only became one of the major theoretical sociologists after that. I hadn’t heard of Kittler at all. He is a controversial major media and communications philosopher. Both are (or rather were) Germans.

    Next step: I needed to get hold of literature, books and articles, both by and about Luhmann and Kittler, preferably in English, although I read German very well. Some background information about the scholars and their work would also be useful.
    An important decision I had to make was whether I was going to use print, digital or both formats for my literature collection. I didn’t have to think long. I decided to try and get everything in digital format, either as online material, or downloadable PDFs, EPUBs etc. The main reason is that this way I can store the publications in an online storage facility like Dropbox, and have access to all my literature wherever I am, at work, at home and on the road (either on an ereader or my smartphone).

    Frank, who is a Luhmann expert, gave me some pointers on publications by and about Luhmann. Kittler I had to find on my own. Of course, working for the Library of the University of Amsterdam and being responsible for our Primo discovery tool, I tried finding relevant Luhmann and Kittler publications there. I also tried Google and Google Scholar. I used my own new position as a library consumer to perform a basic comparison test between library discovery and Google. My initial conclusions were that I got better results using Google than our own Primo. This was in March 2013. But in the meantime both Ex Libris and the Library have made some important adjustments to the Primo indexing procedures and relevance ranking algorithm. Repeating similar searches in November 2013 provided much better results.
    Anyway, I mostly ignored print publications. As a staff member of the University of Amsterdam I have access to all subscription online content, whether I find it through our own discovery tools or via Google. What I can’t find in our library discovery tools are ‘unofficial’ digital versions of print books. Here Google can help. For instance I found a complete PDF version of Luhmann’s main work “Die Gesellschaft der Gesellschaft” (“Society of society”, in German). Frank was also kind enough to digitise/copy some chapters of relevant print books about Luhmann for me.

    I discovered that I needed some sort of mechanism to categorise or catalogue my literature in such a way that this is available to me wherever I am, other than just putting files in a Dropbox folder. After some research I decided on Mendeley, which was originally presented as a reference manager, but is now more a collaboration, publication sharing and annotation environment. I am not using Mendeley as a reference manager for now, only for categorising digital local and online publications and making annotations attached to the publications.
    An important feature that I need in research is to have access to my notes everywhere. With Mendeley I can attach notes to PDFs in the Mendeley PDF reader which are synchronised with my other Mendeley installations on other workstations, if the ‘synchronise files’ option is checked everywhere. Actually Mendeley distinguishes between “notes” and “annotations”. Notes are made and attached on the level of the publication, annotations are linked to specific text sections in PDFs. Annotations don’t work with EPUB files, because there is no embedded Mendeley EPUB reader, or with online URLs. I can add notes to specific EPUB text sections in my ereader, which are synchronised between my ereader software instances. So I still need an independent annotation tool that can link my notes to all my research objects and synchronises them between my devices independent of format or software.

    A final note about digital formats. PDFs are the norm in digital scholarly dissemination of publications. PDFs are fine for reading on a PC and for attaching notes in Mendeley, but they’re horrible to read on my ereader. There I prefer EPUB. I would really like to have more standard options for downloading publications to select from, at least PDF and EPUB.

    Summarising, the functions I need for my PhD research and writing in this stage, and the tools and methods I am currently using:


    • Find publications, information: library discovery tools, Google Scholar, the web
    • Acquire digital copies of publications: access/authorisation; PDF, EPUB
    • Store material independent of location and device: Dropbox, Mendeley
    • Save references: Mendeley
    • Notes/annotations: Mendeley, Kobo ereader software
    • Write and share documents independent of location and device: Google Docs
    • Communicate: Twitter, email, Google Docs comments
  • PhD, the final frontier – Part one: To boldly go

    Posted on August 18th, 2013 Lukas Koster 1 comment

    My final career? Facing the challenge

    This is the first in hopefully a series of posts about a new phase in my professional life, in which I will try to pursue a scholarly career with a PhD as my first goal. My intention with this series is to document in detail the steps and activities needed to reach that goal. In this introduction I will describe the circumstances and considerations that finally led to my decision to take the plunge.


    First some personal background. I graduated in Sociology at the University of Amsterdam in 1987. I received the traditional Dutch academic title of “Doctorandus” (“Drs.”) which in the current international higher education Bachelor/Masters system is equivalent to a Master of Science (MSc). I specialised in sociology of organisations, labour and industrial relations, with minors in economics and social science “informatics”. I wrote my thesis “Arbeid onder druk” (“Labour under pressure”) on automation, the quality of labour and workers’ influence in the changing printing industry in the 20th century.
    The job market for sociologists in the 1980s and 1990s was virtually non-existent, so I had to find an alternative career. One of the components of the social science informatics minor was computer programming. I learned to program in Pascal on a mainframe using character based monochrome terminals. I actually liked doing that, and I decided to join the PION IT Retraining programme for unemployed (or underemployed) academics organised in the 1980s by the Dutch State to overcome the growing shortage of IT professionals. After a test I was accepted, and in 1988 I successfully finished the course, during which I learned to program in COBOL. However, it took me another two years to finally find a job. From 1990 I worked as a systems designer and developer (programming in PL/1, Fortran and Java, among others) for a number of institutions in the area of higher education and scholarly information until 2002. Then I was sent away with a year’s salary from the prematurely born Institute for Scholarly Information Services NIWI, which was terminated in 2005. From its ruins the now successful Data Archiving and Networking Services institute (DANS) emerged.
    It was then that I tried to take up a new academic career for the first time, and I enrolled in Cultural Studies at the Open University. I enjoyed that very much: I took a lot of courses and even managed to pass a number of exams.
    By the end of that year, after a chance meeting with a former colleague on the tram on my way to take an exam, I found myself working at the Dutch Royal Library in The Hague as a temporary project staff member for implementing the new federated search and OpenURL tools MetaLib and SFX from Ex Libris. This marked the start of my career in library technology. It was then that I first learned about bibliographic metadata formats, cataloguing rules and MARC madness.
    I soon discovered that it’s very hard to combine work and studying, and because I had started to like working for libraries, networking and exchanging knowledge, I quietly dropped out of the Open University.
    After three years on temporary contracts I moved to the Library of the University of Amsterdam to do the same work as I did at the Royal Library. I got involved in the International Ex Libris User Group IGeLU and started trying to make a difference in library systems, working together with an enthusiastic bunch of people all over the world. I made it to head of department for a while, until an internal reorganisation gave me the opportunity to say goodbye to that world of meetings, bureaucracy and conflicts. Now I am Library Systems Coordinator, which doesn’t mean that I am coordinating library systems, by the way. My main responsibility at the moment is our Ex Libris Primo discovery tool. The most important task there is coordinating data and metadata streams. A lot of time, money and effort is spent on streamlining the moving around of metadata between a large number of internal and external systems. For a couple of years now I have been looking into, reading about, writing and presenting on linked open data and metadata infrastructures, with now and then a small pilot project. But academic libraries are so slow in realising that they should move away from thinking in terms of new systems for new challenges, and invest in metadata and data infrastructures instead, that I am not overly enthusiastic about working in libraries anymore. It was about a year ago that I suddenly realised that with Primo I was still doing exactly the same detailed configuration stuff and vendor communications that I was doing ten years earlier with MetaLib, and nothing had changed.
    It was then that I said to myself: I need to do something new and challenging. If this doesn’t happen at work soon, then I have to find something satisfying to do besides that.

    I can’t reproduce the actual moment anymore, but I think it happened when I was working on getting scholarly publications (dissertations among others) from our institutional repository into Primo. Somehow I thought: I can do that too! Write a dissertation, get a PhD! My topic could be something in the field of data, information and knowledge integration, from a sociological perspective.
    So I started looking into the official PhD rules and regulations at the University of Amsterdam, to find out what possibilities there are for people with a job to get a PhD. It turns out there are options, even with time/money compensation for University staff. But still not everything was clear to me. So I decided to ask Frank Huysmans, part-time Library Science professor at the University of Amsterdam, and also active on Twitter and in the open data and public libraries movement, if he could help me and explain the options and the pros and cons of writing a dissertation. He agreed and we met in a pub in Amsterdam to discuss my ideas over a couple of nice beers.

    The good thing was that Frank thought that I should be able to pull this off, looking at my background, work experience and writing. The bad thing was that he asked me “Are you sure that you don’t just want to write a nice popular science book?”. Apparently scholarly writing is subject to a large number of rules and formats, and is not meant to be a pleasant read.
    An encouraging thing that Frank told me is that it is possible to compile a dissertation from a number of previously published, peer-reviewed scholarly articles. Now this really appealed to me, because it means that I can attempt to write a scholarly article and try to get it published first, and get the feel of the art of scholarly research, writing and publication, in a relatively short period. This way I can keep my options open: either leave it at that, or continue with the PhD procedure later. I agreed to write a short dissertation proposal, send it to Frank and discuss it in a next meeting.

    My decision was made. Although in the meantime a couple of interesting prospects involving research information and linked data had appeared on the horizon at work, I was going to try and start a scholarly career.

    Next time: the first steps – reading, thinking, writing a draft proposal and how to keep track of everything.