Data. The final frontier.
  • Explicit and implicit metadata

    Posted on August 20th, 2009 Lukas Koster 12 comments

    Tagged © funkandjazz

    On August 17, after I tested a search in our new Aleph OPAC and mentioned my surprise on Twitter, the following discussion unfolded between me (lukask), Ed Summers of the Library of Congress and Till Kinstler of GBV (German Union Library Network):

    • lukask: Just found out we only have one item about RDF in our catalogue:
    • edsu: @lukask broaden that search 🙂
    • lukask: @edsu Ha! Thanks! But I’m sure that RDF will be mentioned in these 29 titles! A case for social tagging!
    • edsu: @lukask or better cataloging 🙂
    • edsu: @lukask i guess they both amount to the same thing eh?
    • lukask: @edsu That’s an interesting position…”social tagging=better cataloging”. I will ask my cataloguing co-workers about this specific example
    • edsu: @lukask make sure to wear body-armor
    • lukask: @edsu Yes I know! I will bring it up at tomorrow’s party for the celebration of our ALEPH STP (after some drinks…)
    • tillk: @edsu @lukask or fulltext search… 🙂 SCNR…
    • edsu: @tillk yeah, totally — with projects like @googlebooks and @hathitrust we may look back on the age of cataloging with different eyes …
    • lukask: @tillk @edsu Fulltext search yes, or “implicit automatic metadata generation”?

    What happened here was:

    • A problem with the findability of specific bibliographic items was observed. Although it is highly unlikely that books about the Semantic Web will not cover RDF (Resource Description Framework), none of the 29 titles found with “Semantic Web” could be found with the search term “Resource Description Framework”; conversely, the only item found with “Resource Description Framework” was NOT found with “Semantic Web”. I must add that the “Semantic web” search was an “All words” search. Only 20 of the results were indexed with the Dutch subject heading “Semantisch web” (a term that is never used in real life as far as I know; the English term is used internationally). Some results were off topic: they just happened to have “semantic” and “web” somewhere in their metadata. A better search would have been a phrase search (adjacent) with “semantic web” in actual quotes, which gives 26 items. But of these, a small number were not indexed with the subject heading “Semantisch web”. Another note: searching with “RDF” gets you all kinds of results. Read more on the issue of searching and relevance in my post Relevance in context.
    • Four possible solutions were suggested:
    1. social tagging
    2. better cataloging
    3. fulltext searching
    4. automatic metadata generation
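
    As a side note on the search behaviour observed above: the difference between an “All words” search and a phrase (adjacent) search can be sketched in a few lines. This is only an illustration, not how Aleph works internally; the function names and the two titles are invented for the example.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def all_words_match(query, text):
    """True if every query word occurs somewhere in the text, in any order."""
    words = set(tokenize(text))
    return all(term in words for term in tokenize(query))

def phrase_match(query, text):
    """True if the query words occur adjacent and in the same order."""
    q, t = tokenize(query), tokenize(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))

# Hypothetical titles, invented for illustration only.
titles = [
    "A semantic web primer",
    "Semantic analysis of web server logs",
]

for title in titles:
    print(title, all_words_match("semantic web", title), phrase_match("semantic web", title))
```

    The second title is the kind of off-topic result mentioned above: both words are present, but not as a phrase.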

    Social tagging
    Clearly, the 26 items found with the search “Semantic web” are not indexed by the “Resource description framework” or “RDF” subject heading. There is not even a subject heading for “Resource description framework” or “RDF“. In my personal view, from my personal context, this is an omission. Mind you, this is not only an issue in the catalogue of the Library of the University of Amsterdam, it is quite common. I tried it in the British Library Integrated Catalogue with similar results. Try it in your own OPAC!
    I presume that our professional cataloging colleagues can’t know everything about all subjects. That is completely understandable. I would not know how to catalog a book about a medical subject myself either! But this is exactly the point. If you allow end users to add their own tags to your bibliographic records, you enhance the findability of these records for specific groups of end users.
    I am not saying that cataloguing and indexing by library specialists using controlled vocabularies should be replaced by social tagging! No, not at all. I am just saying that both types of tagging/indexing are complementary. Sure, some of the tags added by end users may not follow cataloging standards, but who cares? Very often the end users adding tags of their own will be professional experts in their field. In any case, items with social tags will be found more often because specific end user groups can find them searching with their own terms.
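
    To make the “complementary” idea concrete, here is a minimal sketch of a search index that is fed by both controlled subject headings and end user tags. The record structure, the field names “subjects” and “tags” and the sample record are assumptions made up for this example; no real catalogue is modelled.

```python
from collections import defaultdict

def build_index(records):
    """Map each lowercased heading or tag to the ids of the records carrying it."""
    index = defaultdict(set)
    for record in records:
        for term in record["subjects"] + record["tags"]:
            index[term.lower()].add(record["id"])
    return index

def search(index, term):
    """Return the ids of all records indexed under the term."""
    return index.get(term.lower(), set())

# Hypothetical record: one controlled heading, two tags added by end users.
records = [
    {"id": 1, "subjects": ["Semantisch web"], "tags": ["RDF", "semantic web"]},
    {"id": 2, "subjects": ["Semantisch web"], "tags": []},
]

index = build_index(records)
print(search(index, "rdf"))             # only the tagged record
print(search(index, "semantisch web"))  # both records, via the controlled heading
```

    The controlled heading still finds both records; the tags only add extra entry points for the users who think in those terms.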

    Better cataloging
    I suppose Ed Summers was trying to say the same thing as I just did above, when he commented “or better cataloging, I guess they both amount to the same thing eh?“, which I summarised as “social tagging=better cataloging“, but he can correct me if I’m wrong.
    Anyway, I hope I made it clear that I would not say “social tagging=better cataloging“, but rather “controlled vocabularies+social tagging=better cataloging“.
    Or alternatively, could we improve cataloging by professional library catalogers? I must admit I do not know enough about library training and practice to say something about that. I am not a trained librarian. Don’t hesitate to comment!

    Fulltext searching
    Is fulltext searching the miracle cure for findability problems, as Till Kinstler seems to suggest? Maybe.
    Suppose all our print material were completely digitised and available for fulltext search: I have no doubt that all 26 items mentioned above (the results of the “semantic web” phrase search) would be found with the “resource description framework” or “rdf” search as well. But because fulltext search is by its very nature an “all words” search, the “rdf” fulltext search would also give a lot of “noise”, or items not having any relation to “semantic web” at all (author’s initials “R.D.F”, other acronyms “R.D.F.”, just see RDF in the BL catalogue). Again, see my post Relevance in context for an explanation of searching without context.
    Also, there will be books or articles about a subject that will not contain the actual subject term at all. With fulltext search these items will not be found.
    Moreover, fulltext searching actually limits the findable items to text, excluding other types, like images, maps, video, audio etc.
    This brings me to the “final solution”:

    Automatic metadata generation
    Of course this is mostly still wishful thinking. But there are a number of interesting implementations already.
    What I mean when I say “(implicit) automatic metadata generation” is: metadata that is NOT created deliberately by humans, but either generated and assigned as static metadata, or generated on the fly, by software, applying intelligent analysis to objects, of all types (text, images, audio, video, etc.).
    In the case of our “rdf” example, such a tool would analyse a text and assign “rdf” as a subject heading based on the content and context of this text, even if the term “rdf” does not appear in the text at all. It would also discard texts containing the string “rdf” that refer to something completely different. Of course for this to succeed there should be some kind of contextual environment with links to other records or even other systems to be able to determine if certain terminology is linked to frequently used terms not mentioned in the text itself (here the Linked Data developments could play a major role).
    The same principle should also apply to non-textual objects, so that images, audio, video etc. about the same subject can be found in one go. Google has some interesting implementations in this field already: image search by colour and content type: see for example the search for “rdf” in Google Images with colour “red” and content type “clip art”.
    But of course a lot still needs to be done.

  • Relevance in context

    Posted on August 11th, 2009 Lukas Koster 5 comments

    Search! © Jeffrey Beall

    If you do a search in a bibliographic database, you should find what you need, not just what you are looking for, or what the database “thinks” you are looking for. If you find what you are looking for, then you will not be surprised and you will not discover anything new. And that’s not what you want, is it? But if you find things you did not look for but also do not need, you’re not just surprised, you are confused! And that’s not what you want either.

    You want the results that are the most relevant for your search, with your specific objectives, at that specific point in time, for your specific circumstances, and you want them immediately.

    So, how should search systems behave to make you find what you need? There are two conditions that need to be met:

    • The search terms must be interpreted correctly
    • The most relevant search results must be presented

    The Problem
    First of all, let’s take a look at current practice.

    Search systems cannot cope with ambiguous search terms. My favorite example and test search term is “voc“. This can stand for a number of things in various disciplines: V.O.C. (Dutch: “Verenigde Oostindische Compagnie”  or “Dutch United East Indies Company”) in historical databases; “vocals” in musical databases; “volatile organic compounds” in physics databases. So if you do a search for “voc” in a standard library catalogue, you get all kinds of results. Even more so if you use a metasearch or federated search tool for searching several databases simultaneously.

    Search for "voc" in British Library Integrated Catalogue

    You are confused. You would like the system to “understand” which one of these concepts you are referring to instead of just using the literal string. You would like the system to take into account your context.

    In most databases search results can be sorted or filtered by a number of fields, most commonly by year, title, author, and also by more specific fields in dedicated databases.  But unless you are interested in a specific year, author or title, this will not do. Recently many systems have implemented “faceted” and “clustered” browsing of results, enabling “drilling down” on specific terms or subjects. This basically comes down to setting the context after the fact.

    But after the system has interpreted your search terms, the results should also be ordered in a specific way: the ones you need most should be on top. This is where “relevance ranking” of search results comes in. Most catalogues and databases use a system-specific default relevance ranking algorithm. Search results are assigned a rank based on a number of criteria that can differ between databases, depending on the nature of the database.
    Some databases just present the most recent results on top. For medical and physical sciences this may be right, but for history and literature databases this may just be wrong.
    Sometimes the search terms are taken into account: the number of times the given search terms are present in the result fields is important, but also the specific fields in which search terms appear. The appearance of search terms in “Title” and “Subject” may rank higher than in “Abstract” or “Publisher”. Moreover, the search indexes used can have a major influence on rank: if you search for “Subject” = “flu”, then results with “flu” as subject will be ranked higher than results with “flu” in the title only.
    To come back to my example, with ambiguous search terms like “voc” this type of relevance ranking will definitely not be enough, because results from the three different conceptual areas will be completely mixed up.
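
    A rough sketch of this kind of field-based ranking, applied to the “voc” example, could look like this. The weights, field names and records are invented for illustration; real systems use their own criteria and values.

```python
import re

# Hypothetical weights: a hit in "title" or "subject" counts for more than
# a hit in "abstract" or "publisher".
FIELD_WEIGHTS = {"title": 3.0, "subject": 3.0, "abstract": 1.0, "publisher": 0.5}

def score(record, term):
    """Sum the weighted number of times the term occurs in each field."""
    term = term.lower()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = re.findall(r"[a-z0-9]+", record.get(field, "").lower())
        total += weight * words.count(term)
    return total

def rank(records, term):
    """Return the matching records, highest score first."""
    scored = [(score(record, term), record) for record in records]
    hits = [(s, record) for s, record in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in hits]

# Hypothetical records from three unrelated conceptual areas.
records = [
    {"id": "history", "title": "The VOC in Asia", "subject": "history"},
    {"id": "music", "title": "Choral works", "abstract": "Scores for voc and piano"},
    {"id": "chemistry", "title": "Indoor air", "subject": "VOC emissions"},
]

print([record["id"] for record in rank(records, "voc")])
```

    The ranking only looks at where the literal string occurs, so the history and chemistry records end up with the same score and the three concepts stay mixed.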

    Faceted/clustered search results in Amsterdam University Digital Library

    Clustered search results in Amsterdam University Library MetaLib portal

    When searching with a metasearch or federated search tool things get even more complicated. Each of the remote databases that you search in has its own default way of ranking. Usually the metasearch tool fetches the first 30 or so results from each remote database (one set sorted by date, the other by internal rank, the next by title), merges these into one list and then applies its own local ranking mechanism to this subset only. Confusion! And I did not even mention searching databases with metadata in multiple languages. Moreover, databases containing only metadata will produce different results and relevance than databases with full text articles. There is absolutely no way of telling if you actually have the most relevant results for your situation.
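
    The merging problem can be shown with a small sketch. It assumes each remote database returns its results in its own order and that the metasearch tool only ever sees the first few of each; the function name, the “relevance” field and the data are made up, and this is not MetaLib’s actual algorithm.

```python
def metasearch(databases, local_score, per_db=30):
    """Merge the first per_db results of each database, then rerank locally."""
    merged = []
    for results in databases.values():
        merged.extend(results[:per_db])
    return sorted(merged, key=local_score, reverse=True)

# Hypothetical result lists, each already in its own remote order.
databases = {
    "sorted_by_date": [
        {"id": "d1", "relevance": 1},
        {"id": "d2", "relevance": 2},
        {"id": "d3", "relevance": 9},  # the best result, but third in line
    ],
    "sorted_by_title": [
        {"id": "t1", "relevance": 5},
        {"id": "t2", "relevance": 3},
    ],
}

top = metasearch(databases, local_score=lambda r: r["relevance"], per_db=2)
print([r["id"] for r in top])
```

    With only two results fetched per database, the record “d3” never reaches the local ranking step, however relevant it may be.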

    Again, with relevance ranking search systems do not take into account the context either. You could say it is an introverted, internally focused way of ranking, the confusing results of which are multiplied in case of metasearching.

    Most metasearch tools give users the option of searching in sets of pre-selected databases, based on subject or type. This way you can limit your search to those databases that are known to have data about that specific subject. You more or less set the context in advance. But this mechanism only eliminates results from databases that probably do not have data on your subject at all, so they would not have shown up in the results anyway. Moreover, the same issues that were discussed above apply to this limited set of databases.

    The metasearch tool that I know best (MetaLib) offers the option of setting a relative rank per database, so results from databases with a higher rank will have a higher relevance in merged result sets. But this is a system wide option, set by system administration, so it is not taking into account any context at all. It would be better if you could make the relative database rank dependent on the set or subject the search is done from (for instance: if a history database is searched in the context of a “History” set, the results get a higher rank than in a search from a “Music” set).
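
    What I have in mind is something like the following sketch, in which the relative database rank depends on the set the search is done from. To be clear: this is not an existing MetaLib option; the table, the database names and the multipliers are all hypothetical.

```python
# Hypothetical relative ranks per search set; 1.0 means "no adjustment".
DB_RANK_BY_SET = {
    "History": {"history_db": 2.0, "music_db": 0.5},
    "Music": {"history_db": 0.5, "music_db": 2.0},
}

def contextual_score(base_score, database, search_set):
    """Scale a result's score by the rank of its database within the search set."""
    factor = DB_RANK_BY_SET.get(search_set, {}).get(database, 1.0)
    return base_score * factor

print(contextual_score(10.0, "history_db", "History"))  # boosted
print(contextual_score(10.0, "history_db", "Music"))    # lowered
```

    Unknown sets or databases fall back to a factor of 1.0, so the system-wide behaviour stays the same where no context is defined.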

    The best solution for this “internal” relevance problem regarding distributed databases is a central database of harvested indexes. In this case all harvested metadata is normalised and ranked in a uniform way, and users do not have to select databases in advance. But these systems still do not take into account “external” relevance: there is no context!

    A very interesting and intelligent solution for the problem of pre-selecting databases is provided in PurpleSearch, the integrated front end to MetaLib (among other things), developed by the Library of the University of Groningen. The system records which databases actually produce results for specific search terms. As soon as the user enters search terms in the single search box, the system knows which databases will have results, and the search is automatically carried out in these databases, without asking the user to select the databases or subject area he or she wants to search in. Simultaneously a background search in all other databases is performed in order to check additional new results, and the information about results in databases is updated.
    Of course, all other usual options are available as well, like pre-selecting databases (setting context in advance) and faceted results drilling down (setting context after the fact). But again, no external contextual settings.
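
    I do not know how PurpleSearch implements this internally, but the principle of remembering which databases produce results for which search terms can be sketched roughly as follows. The class and method names are my own invention for this example.

```python
from collections import defaultdict

class HitMemory:
    """Remembers how many results each database returned for each search term."""

    def __init__(self):
        self._hits = defaultdict(dict)  # term -> {database: hit count}

    def record(self, term, database, hit_count):
        """Store or update the hit count observed for a term in a database."""
        self._hits[term.lower()][database] = hit_count

    def databases_for(self, term, all_databases):
        """Return the databases known to have results, or all if none are known."""
        known = self._hits.get(term.lower(), {})
        productive = [db for db in all_databases if known.get(db, 0) > 0]
        return productive or list(all_databases)

all_databases = ["history_db", "music_db", "chemistry_db"]
memory = HitMemory()
memory.record("voc", "history_db", 120)
memory.record("voc", "music_db", 0)

print(memory.databases_for("voc", all_databases))  # only the productive database
print(memory.databases_for("rdf", all_databases))  # unknown term: search everything
```

    The background search in the remaining databases would then simply call record() again to keep the memory up to date.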

    Search "voc" in PurpleSearch

    • Conclusion: the only way to find what you need, is to make search systems take into account the context in which the search is done, both for searching and for relevance ranking.

    Now, let’s have a look at a couple of conditions that would make contextual searches possible.

    Personal context: a system should “know” about your personal interests, field of study, job situation, age, etc. so it can “decide” which databases to search in and which results are the most relevant for you. Some systems, like university systems, have access to information about their users. Once you log in, the system potentially knows which subjects you are studying or teaching and could use this information for setting the context for searching and ranking.
    But what if you are a student in Law AND Social Sciences, which subject area should the system choose? Or: if you are a History teacher, and you have a personal interest in Ecology, which the system does not know about, what then? Somehow you still need to set context yourself.

    Some systems also offer the opportunity of setting personal preferences, like: area of interest, specific databases, type of material (only digital or print), only recent material, etc. Again: you must be able to deviate from these preferences, depending on your situation, which means setting context manually.

    Different search systems will have different user profiles (user data and preferences). It would be nice if search systems could take advantage of universal external personal profiles (like Google Profiles for instance) using some kind of universal API.

    Situational context: a system should also “know” about the situation you are in, both in a functional sense and in a physical sense.

    Functional context means: which role are you playing? Are you in your law student role or in your social sciences student role? Are you in your professional role or in your private role? But also: to which resources do you have access?
    An interesting idea: if you work Monday to Friday during office hours, study in the evenings and spend time on your personal interests on the weekends, it would be nice if you could link times of day and days of the week to your different roles, so search systems could use the correct context for your searches depending on time and date: “if it’s Tuesday evening then use study profile and search in ‘History’; if it’s Sunday, use private profile and search in ‘Ecology’“.
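
    Such a rule could be as simple as the sketch below. The office hours (9:00 to 17:00), the profile names and the subjects are assumptions for the example; a real system would let you configure them yourself.

```python
from datetime import datetime

def profile_for(moment):
    """Pick a search profile from the day of the week and the time of day."""
    if moment.weekday() >= 5:        # Saturday (5) or Sunday (6)
        return {"profile": "private", "subject": "Ecology"}
    if 9 <= moment.hour < 17:        # assumed office hours
        return {"profile": "work", "subject": None}
    if moment.hour >= 17:            # weekday evening
        return {"profile": "study", "subject": "History"}
    return {"profile": "default", "subject": None}

print(profile_for(datetime(2009, 8, 18, 20, 0)))  # a Tuesday evening
print(profile_for(datetime(2009, 8, 23, 11, 0)))  # a Sunday
```

    You would of course still need a way to override the chosen profile by hand.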

    It's the Great Pumpkin Charlie Brown

    This temporal context was also referred to by Till Kinstler in a (German) blog post about the new “Suchkiste” search system prototype of the German Union Library Network (GBV): ‘the search for “Charlie Brown” in October should result in “It’s the Great Pumpkin, Charlie Brown” at number 1, and in December in “A Charlie Brown Christmas“‘.
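
    As a toy illustration of this temporal context (not of how “Suchkiste” actually works), a search system could move a seasonal title to the top depending on the month. The lookup table and function name are invented for the example.

```python
# Hypothetical table linking a month number to the title that should rank first.
SEASONAL_TITLES = {
    10: "It's the Great Pumpkin, Charlie Brown",
    12: "A Charlie Brown Christmas",
}

def rank_for_month(titles, month):
    """Move the seasonal title for this month to the top, keep the rest in order."""
    boosted = SEASONAL_TITLES.get(month)
    return sorted(titles, key=lambda title: title != boosted)

titles = [
    "A Boy Named Charlie Brown",
    "A Charlie Brown Christmas",
    "It's the Great Pumpkin, Charlie Brown",
]

print(rank_for_month(titles, 10)[0])
print(rank_for_month(titles, 12)[0])
```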

    Physical context means: where are you? It would be nice if a library catalogue search system took your actual location into account, so it could show you the records of the copies of the FRBR-ized results available in the library locations nearest to you (this idea came up in a recent Twitter discussion between @librarianbe and @gbierens). This is what Worldcat does when you supply it with your location manually. In Worldcat this is a static preference. But it would be nice if it responded to your actual location, for instance by using the GPS coordinates of your mobile phone. Alternatively, search systems could derive your location from the IP address you are sending your search from.
    This information could also be used to determine if records for digital or physical copies should be ranked the most relevant in this case. If you are inside the library building and you have a preference for physical books and journals, then records for available print copies should be on top of the results list. If you are at home, then records for digital copies that you have access to should come first.
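
    A minimal sketch of this rule, assuming the system already knows whether you are inside the library building: the field names “format” and “available” and the sample copies are hypothetical.

```python
def rank_copies(copies, inside_library):
    """Put copies in the preferred format first, available ones before the rest."""
    preferred = "print" if inside_library else "digital"
    return sorted(
        copies,
        key=lambda copy: (copy["format"] != preferred, not copy["available"]),
    )

# Hypothetical copies of one and the same work.
copies = [
    {"id": "ebook", "format": "digital", "available": True},
    {"id": "on_loan", "format": "print", "available": False},
    {"id": "on_shelf", "format": "print", "available": True},
]

print([c["id"] for c in rank_copies(copies, inside_library=True)])
print([c["id"] for c in rank_copies(copies, inside_library=False)])
```

    Inside the library the print copy on the shelf comes first; at home the digital copy does.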

    Contextual searching and ranking should always be a combination of all possible conditions, personal, situational and internal system ones.

    Of course it goes without saying that it would be great if metasearch tools were able to convey the search context to the remote databases and get contextual results back, using some kind of universal search context API!

    Last but not least, each search system should show the context of the search, and explain how it got to the results in the presented order. Something like: based on your personal preferences, the time of day and day of the week, and your location, the search was done in these databases, with this subject area, and the physical copies of the nearest location are shown on top.
    This context area on the results screen could then be used as a kind of inverted faceted search, drilling “up” to a broader level or “sideways” to another context.