Standard deviations in data modeling, mapping and manipulation

Or: Anything goes. What are we thinking? An impression of ELAG 2015

[Image: T-Bana, captioned “Mapping pathways in Stockholm”]

This year’s ELAG conference in Stockholm was one of many questions. Not only the usual questions following each presentation (always elicited in the form of yet another question: “Any questions?”). But also philosophical ones (Why? What?). And practical ones (What time? Where? How? How much?). And there were some answers too, fortunately. This is my rather personal impression of the event. For a detailed report on all sessions see Christina Harlow’s conference notes.

The theme of the ELAG 2015 conference was: “DATA”. This immediately leads to the first question: “What is data?”. Or rather: “What do we mean by data?”. And of course: “Who is ‘we’?”.

In the current information professional and library perception ‘we’ typically distinguish data created and used for describing stuff (usually referred to as ‘metadata’), data originating from institutions, processes and events (known as ‘usage data’, ‘big data’), and a special case of the latter: data resulting from scholarly research (indeed: ‘research data’). All of these three types were discussed at ELAG.

It is safe to say, however, that the majority of the presentations, bootcamps and workshops focused on the ‘descriptive data’ type. I try to avoid the use of the term ‘metadata’, because it is confusing and superfluous. Just use ‘data’, meaning ‘artificial elements of information about stuff’. To be perfectly clear, ‘metadata’ is NOT ‘data about data’ as many people argue. It’s information about virtual entities, physical objects, information contained in these objects (or ‘content’), events, concepts, people, etc. We could only rightfully speak of ‘data about data’ in the special case of data describing (research) datasets. For this case ‘we’ have invented the job of ‘data librarian’, which is a completely nonsensical term, because this job is concerned with the storage, discoverability and obtainability of only one single object or entity type: research datasets. Maybe we should start using the job title ‘dataset librarian’ for this activity. But this seems a bit odd, right? On the other hand, should we replace the term ‘metadata librarian’ with ‘data librarian’? Also a bit odd. Data is at this moment in time what libraries and other information and knowledge institutions use to make their content findable and usable to the public. Let’s leave it at that.

This brings us to the two fundamental questions of our library ecosystem: “What are we describing?” and the mother of all data questions: “Why are we describing?”, which were at the core of what in my eyes was this year’s key presentation (not keynote!) by Niklas Lindström of the Swedish Royal Library/LIBRIS. I needed some time to digest the core assertions of Niklas’ philosophical talk, but I am convinced that ‘we’ should all be aware of the essential ‘truths’ of his exposition.

First of all: “Why are we describing?”. The objective of having a library in the first place is to provide access in any way possible to the objects in our collections, which in turn may provide access to information and knowledge. So in general we should be describing in order to enable our intended audience to obtain what they need in terms of the collection. Or should that be in terms of knowledge? In real life ‘we’ are describing for a number of reasons: because we follow our profession, because we have always done this, because we are instructed to do so, because we need guidance in our workflows, because the library is indispensable, because of financial and political reasons. In any case we should be clear about what our purposes are, because the purpose influences what we’re describing and how we do that.

Secondly: “What are we describing?”. Physical objects? Semi-tangible objects, like digital publications? Only outputs of processes, or also the processes themselves? Entities? Concepts? Representations? Relationships? Abstractions? Events? Again, we should be clear about this.

Thirdly (a Monty Python Spanish Inquisition moment ;-): “How are we describing?”. We use models, standards, formats, syntax, vocabularies in order to make maps (simplified representations of real world things) for reconciling differences between perceptions, bridging gaps between abstractions and guiding people to obtain the stuff they need. In doing so, Niklas says, we must adhere to Postel’s law, or the Robustness Principle, which states: “Be liberal in what you accept; be conservative in what you send”.
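To make Postel’s law a bit more tangible, here is a minimal Python sketch of my own (not something shown in Niklas’ talk). It assumes a hypothetical date element that arrives in several notations and is always sent out in one strict form. The function names `accept_date` and `send_date` and the list of accepted formats are illustrative assumptions.

```python
from datetime import datetime

# Assumption: the input notations we are willing to accept (illustrative only).
ACCEPTED_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y")


def accept_date(raw):
    """Be liberal in what you accept: try each known notation in turn."""
    text = raw.strip()
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {raw!r}")


def send_date(value):
    """Be conservative in what you send: always emit YYYY-MM-DD."""
    return value.strftime("%Y-%m-%d")


# Three different incoming notations, one outgoing notation.
normalised = [send_date(accept_date(d)) for d in (" 08/06/2015 ", "2015-06-08", "2015")]
```

The point of the sketch is only the asymmetry: the accepting side tolerates variation, the sending side does not.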

Back to the technology, and the day to day implementation of all this. ‘We’ use data to describe entities and relationships of whatever nature. We use systems to collect and store the data in domain, service and time dependent record formats in system dependent datastores. And we create flows and transformations of data between these systems in order to fulfill our goals. For all this there are multiple standards.

Basically, my own presentation “Datamazed – Analysing library dataflows, data manipulations and data redundancies” targeted this fragmented data environment, describing the Dataflow Inventory project of the Library of the University of Amsterdam, leading to a Dataflow Repository, effectively functioning as a map of mappings. “System of Systems (SoS)” was also the topic of the workshop I participated in, “What is metadata management in net-centric systems?” by Jing Wang.

Making sense of entities and relationships was the focus of a number of talks, especially the one by Thom Hickey about extending work, expression and author entities by way of data mining in Worldcat and VIAF, and the presentation by Jane Stevenson on the Jisc/Archives Hub project “exploring British Design”, which entailed shifting focus from documents to people and organizations as connected entities. Some interesting observations about the latter project: the project team started with identifying what the target audience actually wanted and how they would go about getting the desired information (“Why are we describing?”) in order to get to the entity based model (“What are we describing?”). This means that any entity identified in the system can become a focus, or starting point for pathways. A problem that became apparent is that the usual format standards for collection descriptions didn’t allow for events to be described.

Here we arrive at the critique of standards that was formulated by Rurik Greenall in his talk about the Oslo Public Library ILS migration project, where they are migrating from a traditional ILS to RDF based cataloguing. The starting point here is: know what you need in order to support your actual users, not some idealised standard user, and work with a number of simple use cases (“Why are we describing?”). Use standards appropriate for your users and use cases. Don’t be rigid, and adapt. Use enough RDF to support use cases. Use just a part of the open source ILS Koha to support specific use cases that it can do well (users and holdings). Users and holdings are a closed world, which can be dealt with using a part of an existing system. Bibliographic information is an open world which can be taken care of with RDF. The data model corresponds to the use cases that are identified. It grows organically as needed. Standards are only needed for communicating with the outside world, but we must not let the standards infect our data model (here we see Postel’s Law again).

A striking parallel can be seen with the Stockholm University Library project for integration of the Open Source ILS Koha, the Swedish LIBRIS Union Catalogue and a locally developed logistics and ILL system. Again, only one part of Koha is used for specific functions, mainly because with commercial ILSes it is not possible to purchase only individual modules. Integrated library systems, which seemed a good idea in the 1980s, just cannot cope with fragmented open world data environments.

Dedicated systems, like ILSes, either commercial or open source, tend to force certain standards upon you. These standards not only apply to data storage (record formats etc.) but also to system structure (integrated systems, data silos), etc. This became quite clear in the presentation about the CERN Open Data Portal, where the standard digital library system Invenio imposed the MARC bibliographic format for describing research datasets for high energy physics, which turned out to be difficult if not impossible. Currently they are moving towards using JSON (yet another data standard) because the system apparently supports that too.

With Open Source systems it is easier to adapt the standards to your needs than with proprietary commercial vendor systems. An example of this was given by the University of Groningen Library project where the Open Source Publication Repository software EPrints was tweaked to enable storage and description of research datasets focused on archeological findings, which require very specific information.

As was already demonstrated in the two ILS migration projects, deviation from standards of any kind can very easily be implemented. This is obviously not always the case. The locally developed Swedish Royal Library system for the legal deposit of electronic documents supports available suitable metadata standards like OAI, METS, MODS, PREMIS.

For the OER World Map project, presented by Felix Ostrowski, we can safely say that no standards were followed whatsoever, except using the discovery data format schema.org for storing data, which is basically also an adaptation of a standard. Furthermore, the original objective of the project was organically modified by using the data hub for all kinds of end user services and visualisations beyond the original world map of the location of Open Educational Resources.

It should be clear that every adaptation of standards generates the need for additional mappings and transformations besides the ones already needed in a fragmented systems and data infrastructure for moving around data to various places for different services. Mapping and transformation of data can be done in two ways: manually, in the case of explicit, known items, and by mining, in the case of implicit, unknown items.

Manual mapping and transformation is of course done by dedicated software. The manual part consists of people selecting specific source data elements to be transformed into target data elements. This procedure is known as ETL (Extract Transform Load), and implies the copying of data between systems and datastores, which always entails some form of data redundancy. Tuesday afternoon was dedicated to this practice with three presentations: Catmandu and Linked Data Fragments by Ruben Verborgh and Patrick Hochstenbach; COMSODE by Jindrich Mynarz; d:swarm by Thomas Gängler. Of these three the first one focused more on efficiently exposing data as Linked Open Data by using the Linked Data Fragments protocol. An important aspect of tools like this is that we can move our accumulated knowledge and investment in data transformation away from proprietary formats and systems to system and vendor independent platforms and formats.
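The manual mapping step described above can be sketched in a few lines of Python. This is my own minimal illustration, not code from Catmandu, COMSODE or d:swarm. The field map, the record keys (loosely modelled on MARC tags) and the target element names are all assumptions chosen for the example.

```python
# Assumption: a hand-made mapping from source elements to target elements.
FIELD_MAP = {
    "245a": "title",
    "100a": "creator",
    "260c": "date",
}


def extract(records):
    """Extract: yield source records one at a time (here from a plain list)."""
    yield from records


def transform(record, field_map=FIELD_MAP):
    """Transform: copy only the mapped source elements into target elements."""
    return {
        target: record[source].strip()
        for source, target in field_map.items()
        if source in record
    }


def load(records, store):
    """Load: append the transformed records to the target datastore (a list)."""
    store.extend(records)
    return store


# Hypothetical source records; "999x" is not in the map and is dropped.
source = [
    {"245a": " Example title ", "100a": "Doe, Jane", "999x": "local note"},
    {"245a": "Another title", "260c": "2015"},
]
target = load((transform(r) for r in extract(source)), [])
```

Note that `target` now holds a second copy of the title, creator and date values, which is exactly the data redundancy that ETL implies.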

Data mining and text mining are used in the case of non explicit data about entities and relationships, where bodies of data and text are analyzed using dedicated algorithms in order to find implicit entities and relationships and make them explicit. This was already mentioned in Thom Hickey’s Worldcat Entity Extension project. It is also used in the InFoLiS2 project, where data and text mining is used to find relationships between research projects and scholarly publications.

Another example was provided by Rob Koopman and Shenghui Wang of OCLC Research, who analyzed keywords, authors and journal titles in the ArticleFirst database to generate proximity visualizations for greater serendipity.
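As a toy illustration of what “making implicit relationships explicit” can mean, the following Python sketch counts how often two keywords occur in the same record. This is a deliberately simple co-occurrence count of my own, not the algorithm used by OCLC Research or InFoLiS2, and the sample records are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(records):
    """Count, for every keyword pair, the number of records containing both."""
    counts = Counter()
    for keywords in records:
        # sorted(set(...)) makes each pair unique and order-independent
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return counts


# Hypothetical keyword lists, one per record.
records = [
    ["linked data", "rdf", "libraries"],
    ["rdf", "linked data"],
    ["libraries", "marc"],
]
pairs = cooccurrence(records)
strongest = pairs.most_common(1)
```

Pairs with high counts are candidates for an explicit relationship. Real projects use far more refined measures, but the principle of deriving relationships from the data itself is the same.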

As long as ‘we’ don’t or can’t describe these types of relationships explicitly, we have to use techniques like mining to extract meaningful entities and relationships and generate data. Even if we do create and maintain explicit descriptions, we will remain a closed world if we don’t connect our data to the rest of the world. Connections can be made by targeted mapping, as in the case of the Finnish FINTO Library Ontology service for the public sector, or by adopting an open world strategy making use of semantic web and linked open data instruments.

Furthermore, ‘we’ as libraries should continuously ask ourselves the questions “Why and what are we describing?”, but also “Why are we here?”. Should we stick to managing descriptive data, or should we venture out into making sense of big data and research data, and provide new information services to our audience?

Finally, I thank the local organizers, the presenters and all other participants for making ELAG 2015 a smooth, sociable and stimulating experience.
