Data. The final frontier.
RSS icon Home icon
  • Relevance in context

    Posted on August 11th, 2009 Lukas Koster 5 comments

    Search! © Jeffrey Beall

    If you do a search in a bibliographic database, you should find what you need, not just what you are looking for, or what the database “thinks” you are looking for. If you find what you are looking for, then you will not be surprised and you will not discover anything new. And that’s not what you want, is it? But if you find things you did not look for but also do not need, you’re not just surprised, you are confused! And that’s not what you want either.

    You want the results that are the most relevant for your search, with your specific objectives, at that specific point in time time, for your specific circumstances, and you want them immediately.

    So, how should search systems behave to make you find what you need? There are two conditions that need to be met:

    • The search terms must be interpreted correctly
    • The most relevant search results must be presented

    The Problem
    First of all, let’s take a look at current practice.

    Search systems cannot cope with ambiguous search terms. My favorite example and test search term is “voc“. This can stand for a number of things in various disciplines: V.O.C. (Dutch: “Verenigde Oostindische Compagnie”  or “Dutch United East Indies Company”) in historical databases; “vocals” in musical databases; “volatile organic compounds” in physics databases. So if you do a search for “voc” in a standard library catalogue, you get all kinds of results. Even more so if you use a metasearch or federated search tool for searching several databases simultaneously.

    Search for "voc" in British Library catalogue

    Search for "voc" in British Library Integrated Catalogue

    You are confused. You would like the system to “understand” which one of these concepts you are referring to instead of just using the literal string. You would like the system to take into account your context.

    In most databases search results can be sorted or filtered by a number of fields, most commonly by year, title, author, and also by more specific fields in dedicated databases.  But unless you are interested in a specific year, author or title, this will not do. Recently many systems have implemented “faceted” and “clustered” browsing of results, enabling “drilling down” on specific terms or subjects. This basically comes down to setting the context after the fact.

    But after the system has interpreted your search terms, the  results should also be ordered in a specific way, the ones you need most should be on top. This is where “relevance ranking” of search results comes in. Most catalogues and databases use a system specific default relevance ranking algorithm. Search results are assigned a rank, based on a number of criteria, that can differ between databases, depending on the nature of the database.
    Some databases just present the most recent results on top. For medical and physical sciences this may be right, but for history and literature databases this may just be wrong.
    Sometimes the search terms are taken into account: the number of times the given search terms are present in the result fields is important, but also the specific fields in which search terms appear. The appearance of search terms in “Title” and “Subject” may rank higher than in “Abstract” or “Publisher”. Moreover, the search indexes used can have a major influence on rank: if you search for “Subject” = “flu”, then results with “flu” as subject will be ranked higher than results with “flu” in the title only.
    To come back to my example, with ambiguous search terms like “voc” this type of relevance ranking will definitely not be enough, because results from the three different conceptual areas will be completely mixed up.

    Faceted/clustered search results in Amsterdam Univerity Digital Library

    Clustered search results in Amsterdam University Library MetaLib portal

    When searching with a metasearch or federated search tool things get even more complicated. Each of the remote databases that you search in has its own default way of ranking. Usually the metasearch tool fetches the first 30 or so results from each remote database (one set sorted by date, the other by internal rank, the next by title), merges these into one list and then applies its own local ranking mechanism to this subset only. Confusion! And I did not even mention searching databases with metadata in multiple languages. Moreover, databases containing only metadata will produce different results and relevance than databases with full text articles. There is absolutely no way of telling if you actually have the most relevant results for your situation.

    Again, with relevance ranking search systems do not take into account the context either. You could say it is an introverted, internally focused way of ranking, the confusing results of which are multiplied in case of metasearching.

    Most metasearch tools give users the option of searching in sets of pre-selected databases, based on subject or type. This way you can limit your search to those databases that are known to have data about that specific subject. You more or less set the context in advance. But this mechanism only eliminates results from databases that probably do not have data on your subject at all, so they would not have shown up in the results anyway. Moreover, the same issues that were discussed above apply to this limited set of databases.

    The metasearch tool that I know best (MetaLib) offers the option of setting a relative rank per database, so results from databases with a higher rank will have a higher relevance in merged result sets. But this is a system wide option, set by system administration, so it is not taking into account any context at all. It would be better if you could make the relative database rank dependent on the set or subject the search is done from (for instance: if a history database is searched in the context of a “History” set, the results get a higher rank than in a search from a “Music” set).

    The best solution for this “internal” relevance problem regarding distributed databases is a central database of harvested indexes. In this case all harvested metadata is normalised and ranked in a uniform way, and users do not have to select databases in advance. But these systems still do not take into account “external” relevance: there is no context!

    A very interesting and intelligent solution for the problem of pre-selecting databases is provided in PurpleSearch, the integrated front end to MetaLib (among other things), developed by the Library of the University of Groningen. The system records which databases actually produce results for specific search terms. As soon as the user enters search terms in the single search box, the system knows which databases will have results, and the search is automatically carried out in these databases, without asking the user to select the databases or subject area he or she wants to search in. Simultaneously a background search in all other databases is performed in order to check additional new results, and the information about results in databases is updated.
    Of course, all other usual options are available as well, like pre-selecting databases (setting context in advance) and faceted results drilling down (setting context after the fact). But again, no external contextual settings.

    Search "voc" in PurpleSearch

    Search "voc" in PurpleSearch

    • Conclusion: the only way to find what you need, is to make search systems take into account the context in which the search is done, both for searching and for relevance ranking.

    Now, let’s have a look at a couple of conditions that would make contextual searches possible.

    Personal context: a system should “know” about your personal interests, field of study, job situation, age, etc. so it can “decide” which databases to search in and which results are the most relevant for you. Some systems, like university systems, have access to information about their users. Once you log in, the system potentially knows which subjects you are studying or teaching and could use this information for setting the context for searching and ranking.
    But what if you are a student in Law AND Social Siences, which subject area should the system choose? Or: if you are a History teacher, and you have a personal interest in Ecology, which the system does not know about, what then? Somehow you still need to set context yourself.

    Some systems also offer the opportunity of setting personal preferences, like: area of interest, specific databases, type of material (only digital or print), only recent material, etc. Again: you must be able to deviate from these preferences, depending on your situation, which means setting context manually.

    Different search systems will have different user profiles (user data and preferences). It would be nice if search systems could take advantage of universal external personal profiles (like Google Profiles for instance) using some kind of universal API.

    Situational context: a system should also “know” about the situation you are in, both in a functional sense and in a physical sense.

    Functional context means: wich role are you playing? Are you in your law student role or in your social sciences student role? Are you in your professional role or in your private role? But also: to which resources do you have access?
    An interesting idea: if you work Monday to Friday during office hours, study in the evenings and spend time on your personal interests on the weekends, it would be nice if you could link times of day and days of the week to your different roles, so search systems could use the correct context for your searches depending on time and date: “if it’s Tuesday evening then use study profile and search in ‘History’; if it’s Sunday, use private profile and search in ‘Ecology’“.

    It's the Great Pumpkin Charlie Brown

    This temporal context was also referred to by Till Kinstler in a (German) blog post about the new “Suchkiste” search system prototype of the German Union Library Network (GBV): ‘the search for “Charlie Brown” in October should result in “It’s the Great Pumpkin, Charlie Brown” at number 1, and in December in “A Charlie Brown Christmas“‘.

    Physical context means: where are you? It would be nice if a library catalogue search system would take into account your actual location, so it could show you the records of the copies of the FRBR-ized results available in the library locations nearest to you (this idea came up in a recent Twitter discussion between @librarianbe and @gbierens). This is what Worldcat does when you supply it with your location manually. In Worldcat this is a static preference. But it would be nice if it would respond to your actual location, for instance by using the GPS coordinates of your mobile phone. Alternatively, search systems could derive your location from the IP address you are sending your search from.
    This information could also be used to determine if records for digital or physical copies should be ranked the most relevant in this case. If you are inside the library building and you have a preference for physical books and journals, then records for available print copies should be on top of the results list. If you are at home, then records for digital copies that you have access to should come first.

    Contextual searching and ranking should always be a combination of all possible conditions, personal, situational and internal system ones.

    Of course it goes without saying that it would be great if metasearch tools were able to convey the search context to the remote databases and get contextual results back, using some kind of universal serach context API!

    Last but not least, each search system should show the context of the search, and explain how it got to the results in the presented order. Something like: based on your personal preferences, the time of day and day of the week, and your location, the search was done in these databases, with this subject area, and the physical copies of the nearest location are shown on top.
    This context area on the results screen could then be used as a kind of inverted faceted search, drilling “up” to a broader level or “sideways” to another context.


    4 responses to “Relevance in contextRSS icon

    • Well written, good ideas! Frustrating in all of this is that technology to do all of this is available, but no party seems to be interested (or capable) to actually build a library-system with these kind of features.

    • Etienne Posthumus

      Great summary Lukas!

      In response to the comment by Gerard, the technology is available but the old problem is ‘budget’. Putting these great ideas into practice requires a scary amount of effort on many fronts, technology and content-wrangling.

    • “The system records which databases actually produce results for specific search terms. ”

      I’m very interested in learning more about this. Who is behind PurpleSearch, are you involved in that Lukas?

      If whoever is involved in it wants to write a simple article for Code4Lib Journal on this specific feature and how it’s done, I think we’d be interested.

    1 Trackbacks / Pingbacks

    Leave a reply