Interoperability in heterogeneous library data landscapes
Libraries have to deal with a highly opaque landscape of heterogeneous data sources, data types, data formats, data flows, data transformations and data redundancies, which I have earlier characterized as a “data maze”. The degree of this opacity and heterogeneity varies with the number of content types and the number of services that the library is responsible for. Academic and national libraries probably deal with more extensive mazes than small public or company libraries.
In general, libraries curate collections of things and also provide discovery and delivery services for these collections to the public. In order to successfully carry out these tasks they manage a lot of data. Data can be regarded as the signals between collections and services.
These collections and services are administered using dedicated systems with dedicated datastores. The data formats in these dedicated datastores are tailored to perform the dedicated services that these dedicated systems are designed for. In order to use the data for delivering services they were not designed for, it is common practice to deploy dedicated transformation procedures, either manual ones or automated utilities. These transformation procedures function as translators of the signals in the form of data.
Here lies the origin of the data maze: an inextricably entangled mishmash of systems with explicit and implicit data redundancies using a number of different data formats, only some of which are talking to each other in some way. This is not only confusing for end users but also for library system staff. End users lack clarity about which user interfaces to use, and are missing relevant results from other sources and possible related information. Libraries need licenses and expertise for ongoing administration, conversion and migration of multiple systems, and suffer unforeseen consequences of adjustments elsewhere.
To take the linguistic analogy further, systems make use of a specific language (data format) to code their signals in. This is all fine as long as they are only talking to themselves. But as soon as they want to talk to other systems that use a different language, translations are needed, as mentioned. Sometimes two systems use the same language (like MARC, DC, EAD), but this does not necessarily mean they can understand each other. There may be dialects (DANMARC, UNIMARC), local colloquialisms, differences in vocabularies and even alphabets (local fields, local codes, etc.). Some languages are only used by one system (like PNX for Primo). All languages describe things in their own vocabulary. In the systems and data universe there are not many loanwords or other mechanisms to make it clear that systems are talking about the same thing (no relations or linked data). And then there is syntax and grammar (such as subfields and cataloguing rules) that allow for lots of variations in formulations and formats.
The transformation utilities functioning as translators of the data signals suffer from a number of limitations. They translate between two specific languages or dialects only. And usually they are employed by only one system (proprietary utilities). So even if two systems speak the same language, they probably both need their own translator from a common source language. In many cases even two separate translators are needed if source and target system do not speak each other’s language or dialect. The source signals are translated to some common language, which in turn is translated into the target language. This export-import scenario, which entails data redundancy across systems, is referred to as ETL (Extract, Transform, Load). Moreover, most translators only know a subset of the source and target language, depending on the data signals needed by the provided services. In some cases “data mappings” are used as conversion guides. This term does not really cover what is actually needed, as I have tried to demonstrate. It is not enough to show the paths between source and target signals. It is essential to add the selections and transformations needed as well. In order to make sense of the data maze you need a map, a dictionary and a guidebook.
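To illustrate why a bare mapping falls short, here is a minimal sketch in Python of a translation rule that carries all three parts: the path (source and target field), a selection and a transformation. Everything in it is an assumption made for illustration: the `FieldRule` and `translate` names, the flat dictionary records, and the MARC-like field labels do not come from any particular system.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict[str, str]  # hypothetical flat record: field label -> value


@dataclass
class FieldRule:
    """One line of the 'guidebook': path, selection and transformation."""
    source: str                                          # field in the source record
    target: str                                          # field in the target record
    select: Callable[[str], bool] = lambda value: True   # keep this value?
    transform: Callable[[str], str] = lambda value: value  # how to rewrite it


def translate(record: Record, rules: list[FieldRule]) -> Record:
    """Apply every rule; fields without a rule are silently dropped."""
    out: Record = {}
    for rule in rules:
        value = record.get(rule.source)
        if value is None or not rule.select(value):
            continue
        out[rule.target] = rule.transform(value)
    return out


# Illustrative rules only; the field labels are MARC-like, not a real profile.
RULES = [
    # Path plus transformation: strip trailing cataloguing punctuation.
    FieldRule("245a", "title", transform=lambda v: v.rstrip(" /:")),
    # Path plus selection plus transformation: keep only the four-digit year.
    FieldRule(
        "260c",
        "date",
        select=lambda v: any(ch.isdigit() for ch in v),
        transform=lambda v: "".join(ch for ch in v if ch.isdigit())[:4],
    ),
    # Path plus selection: staff-only notes never leave the source system.
    FieldRule("local_note", "note", select=lambda v: not v.startswith("STAFF:")),
]

source = {"245a": "Data mazes /", "260c": "c2014.", "local_note": "STAFF: reorder"}
result = translate(source, RULES)
print(result)
```

A plain mapping would only record the `source` and `target` columns. The `select` and `transform` parts are exactly what gets lost when the conversion guide stops at the paths, and the silently dropped fields show how a translator ends up knowing only a subset of the source language.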
To make things even more complicated, sometimes reading data signals is only possible with a passport or visa (authentication for access to closed data). Even worse, a system’s borders may be completely closed, with no access whatsoever possible, not even with a passport. Usually, this last situation is referred to with the term “data silos”, but that is not the complete picture. If systems are fully open, but their data signals are coded by means of untranslatable languages or syntaxes, we are also dealing with silos.
Anyway, a lot of attention and maintenance is required to keep this Tower of Babel functioning. This practice is extremely resource-intensive, costly and vulnerable. Are there any solutions available to diminish maintenance, costs and vulnerability? Yes there are.
First of all, it is absolutely crucial to get acquainted with the maze. You need a map (or even an atlas) to be able to see which roads are there, which ones are inaccessible, what traffic is allowed, what shortcuts are possible, which systems can be pulled down and where new roads can be built. This role can be fulfilled by a Dataflow Repository, which presents an up-to-date overview of locations and flows of all content types and data elements in the landscape.
Secondly it is vital to be able to understand the signals. You need a dictionary to be able to interpret all signals, languages, syntaxes, vocabularies, etc. A Data Dictionary describing data elements, datastores, dataflows and data formats is the designated tool for this.
And finally it is essential to know which transformations are taking place en route. A guidebook should be incorporated in the repository, describing selections and transformations for every data flow.
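As a rough sketch of how map, dictionary and guidebook could live together, the Python below models a tiny repository. It is only one possible shape, under stated assumptions: the class names, the system names (`catalogue`, `discovery`, `website`) and the free-text transformation notes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataElement:
    """Dictionary entry: what a data element means."""
    name: str
    definition: str


@dataclass(frozen=True)
class Dataflow:
    """Map entry (source -> target) plus its guidebook notes."""
    source: str
    target: str
    data_format: str
    elements: tuple[str, ...]
    transformations: tuple[str, ...] = ()


@dataclass
class DataflowRepository:
    dictionary: dict[str, DataElement] = field(default_factory=dict)
    flows: list[Dataflow] = field(default_factory=list)

    def define(self, name: str, definition: str) -> None:
        self.dictionary[name] = DataElement(name, definition)

    def add_flow(self, flow: Dataflow) -> None:
        # Refuse flows that carry elements the dictionary does not describe.
        unknown = [e for e in flow.elements if e not in self.dictionary]
        if unknown:
            raise KeyError(f"undefined data elements: {unknown}")
        self.flows.append(flow)

    def downstream(self, system: str) -> set[str]:
        """All systems reachable from `system` by following dataflows."""
        seen: set[str] = set()
        stack = [system]
        while stack:
            current = stack.pop()
            for flow in self.flows:
                if flow.source == current and flow.target not in seen:
                    seen.add(flow.target)
                    stack.append(flow.target)
        return seen

    def locations_of(self, element: str) -> set[str]:
        """Every system that sends or receives the element (redundancy check)."""
        return {
            system
            for flow in self.flows
            if element in flow.elements
            for system in (flow.source, flow.target)
        }


repo = DataflowRepository()
repo.define("title", "Main title of the described resource")
repo.define("holding", "Physical location and availability of a copy")
repo.add_flow(Dataflow("catalogue", "discovery", "MARC", ("title", "holding"),
                       ("strip trailing punctuation from title",)))
repo.add_flow(Dataflow("discovery", "website", "JSON", ("title",)))

print(repo.downstream("catalogue"))
print(repo.locations_of("title"))
```

With a structure like this, a question such as "what breaks if we change the catalogue?" becomes a lookup (`downstream`), and implicit redundancy becomes visible (`locations_of`).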
You could leave it there and be satisfied with these guiding tools to help you get around the existing data maze more efficiently, with all its ETL utilities and data redundancies. But there are other solutions that focus on actually tackling or even eliminating the translation problem. Basically we are looking at some type of Service Oriented Architecture (SOA) implementation. SOA is a rather broad concept, but it refers to an environment where individual components (“systems”) communicate with each other in a technology- and vendor-agnostic way using interoperable building blocks (“services”). In this definition “services” refer to reusable dataflows between systems, rather than to useful results for end users. I would prefer a definition of SOA to mean “a data and utilities architecture focused on delivering optimal end user services no matter what”.
Broadly speaking there are four main routes to establish a SOA-like condition, all of which can theoretically be implemented on a global, intermediate or local level.
- Single Store/Single Format: A single universal integrated datastore using a universal data format. No need for dataflows and translations. This would imply some sort of linked (open) data landscape with RDF as universal language and serving all systems and services. A solution like this would require all providers of relevant systems and databases to commit to a single universal storage format. Unrealistic in the short term indeed, but definitely something to aim for, starting at the local level.
- Multiple Stores/Shared Format: A heterogeneous system and datastore landscape with a universal communication language (a lingua franca, like English) for dataflows. No need for countless translators between individual systems. This universal format could be RDF in any serialization. A solution like this would require all providers of relevant systems and databases to commit to a universal exchange format. Already a bit less unrealistic.
- Shared Store/Shared Format: A heterogeneous system and datastore landscape with a central shared intermediate integrated datastore in a single shared format. Translations from different source formats to only one shared format. Dataflows run to and from the shared store only. For instance with RDF functioning as Esperanto, the artificial language which is actually sometimes used as “Interlingua” in machine translation. A solution like this does not require a universal exchange format, only a translator that understands and speaks all formats, which is the basis of all ETL tools. This is much more realistic, because system and vendor dependencies are minimized, except for variations in syntax and vocabularies. The platform itself can be completely independent.
- Multiple Stores/Single Translation Pool: what is known as an Enterprise Service Bus (ESB). No translations are stored, no data is integrated. Simultaneous point-to-point translations between systems happen on the fly. Looks very much like the existing data maze, but with all translators sitting together in one cubicle. This solution is not a source of much relief, or as one large IT vendor puts it: “Using an ESB can become problematic if large volumes of data need to be sent via the bus as a large number of individual messages. ESBs should never replace traditional data integration like ETL tools. Data replication from one database to another can be resolved more efficiently using data integration, as it would only burden the ESB unnecessarily.”
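The difference between translating through a shared format and translating point to point can be sketched in a few lines of Python. This is a toy model under loud assumptions: a plain dictionary with the keys `title` and `creator` stands in for the shared format (the text above suggests RDF), and the format names and field labels are made up. In the worst case, n formats that all need to talk to each other directly require n × (n − 1) one-way translators, while a shared format needs only 2 × n.

```python
from typing import Callable

Record = dict[str, str]
Translator = Callable[[Record], Record]


def rename(mapping: dict[str, str]) -> Translator:
    """Build a translator that renames fields and drops unmapped ones."""
    def apply(record: Record) -> Record:
        return {mapping[k]: v for k, v in record.items() if k in mapping}
    return apply


def invert(mapping: dict[str, str]) -> dict[str, str]:
    return {v: k for k, v in mapping.items()}


class SharedFormatHub:
    """Every format has one translator into and one out of the shared format."""

    def __init__(self) -> None:
        self._to_shared: dict[str, Translator] = {}
        self._from_shared: dict[str, Translator] = {}

    def register(self, fmt: str, to_shared: dict[str, str]) -> None:
        self._to_shared[fmt] = rename(to_shared)
        self._from_shared[fmt] = rename(invert(to_shared))

    def translate(self, record: Record, source: str, target: str) -> Record:
        shared = self._to_shared[source](record)   # source -> shared format
        return self._from_shared[target](shared)   # shared format -> target

    def translator_count(self) -> int:
        return len(self._to_shared) + len(self._from_shared)


def point_to_point_count(n_formats: int) -> int:
    """One-way translators needed when every format talks to every other."""
    return n_formats * (n_formats - 1)


hub = SharedFormatHub()
# Hypothetical formats and field labels, for illustration only.
hub.register("marc_like", {"245a": "title", "100a": "creator"})
hub.register("dc_like", {"dc:title": "title", "dc:creator": "creator"})
hub.register("local", {"TI": "title", "AU": "creator"})

converted = hub.translate(
    {"245a": "Data mazes", "100a": "Doe, J."}, "marc_like", "dc_like"
)
print(converted)
print(hub.translator_count(), point_to_point_count(10))
```

With three formats both approaches need six translators, so the benefit only shows as the landscape grows: at ten formats the shared format needs 20 translators, against 90 for full point-to-point coverage. The sketch says nothing about the remaining hard part, namely the variations in syntax and vocabularies inside each format.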
Surveying the possible routes out of the data maze, it seems that the first step should be employing the map, dictionary and guidebook concept of the dataflow repository, data dictionary and transformation descriptions. After that, the only feasible road in the short term is the intermediate integrated Shared Store/Shared Format solution.