Data. The final frontier.
  • Explicit and implicit metadata

    Posted on August 20th, 2009 Lukas Koster 12 comments

    Tagged © funkandjazz

    On August 17, after I tested a search in our new Aleph OPAC and mentioned my surprise on Twitter, the following discussion unfolded between me (lukask), Ed Summers of the Library of Congress and Till Kinstler of GBV (German Union Library Network):

    • lukask: Just found out we only have one item about RDF in our catalogue:
    • edsu: @lukask broaden that search 🙂
    • lukask: @edsu Ha! Thanks! But I’m sure that RDF will be mentioned in these 29 titles! A case for social tagging!
    • edsu: @lukask or better cataloging 🙂
    • edsu: @lukask i guess they both amount to the same thing eh?
    • lukask: @edsu That’s an interesting position…”social tagging=better cataloging”. I will ask my cataloguing co-workers about this specific example
    • edsu: @lukask make sure to wear body-armor
    • lukask: @edsu Yes I know! I will bring it up at tomorrow’s party for the celebration of our ALEPH STP (after some drinks…)
    • tillk: @edsu @lukask or fulltext search… 🙂 SCNR…
    • edsu: @tillk yeah, totally — with projects like @googlebooks and @hathitrust we may look back on the age of cataloging with different eyes …
    • lukask: @tillk @edsu Fulltext search yes, or “implicit automatic metadata generation”?

    What happened here was:

    • A problem with the findability of specific bibliographic items was observed: although it is highly unlikely that books about the Semantic Web would not cover RDF (Resource Description Framework), none of the 29 titles found with “Semantic Web” could be found with the search term “Resource Description Framework”; conversely, the only item found with “Resource Description Framework” was NOT found with “Semantic Web”. I must add that the “Semantic web” search was an “All words” search. Only 20 of the results were indexed with the Dutch subject heading “Semantisch web” (a term that is never used in real life as far as I know; the English term is the international one). Some results were off topic: they just happened to have “semantic” and “web” somewhere in their metadata. A better search would have been a phrase search (adjacent) with “semantic web” in actual quotes, which gives 26 items. But of these, a small number were not indexed with the subject heading “Semantisch web”. Another note: searching with “RDF” gets you all kinds of results. Read more on the issue of searching and relevance in my post Relevance in context.
    • Four possible solutions were suggested:
    1. social tagging
    2. better cataloging
    3. fulltext searching
    4. automatic metadata generation
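
    The difference between the “All words” search and the phrase search described above can be sketched in a few lines of Python. This is only an illustration, not how Aleph works internally: the three records, the function names and the tokenisation are all my own assumptions.

```python
import re

# Hypothetical catalogue records (title, subject headings), invented for illustration.
RECORDS = [
    ("A semantic web primer", ["Semantisch web"]),
    ("Semantic analysis of web server logs", []),
    ("Practical RDF", ["Resource Description Framework"]),
]

def tokens(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def fields(record):
    """Return each metadata field of a record as its own token list."""
    title, subjects = record
    return [tokens(title)] + [tokens(s) for s in subjects]

def all_words(query, record):
    """True if every query word occurs somewhere in the record's metadata."""
    wanted = set(tokens(query))
    present = {t for field in fields(record) for t in field}
    return wanted <= present

def phrase(query, record):
    """True if the query words occur adjacently, in order, within one field."""
    q = tokens(query)
    n = len(q)
    return any(
        field[i:i + n] == q
        for field in fields(record)
        for i in range(len(field) - n + 1)
    )

all_hits = [r[0] for r in RECORDS if all_words("semantic web", r)]
phrase_hits = [r[0] for r in RECORDS if phrase("semantic web", r)]
print(all_hits)
print(phrase_hits)
```

    In this toy data the “All words” search also returns the off-topic log-analysis title, while the phrase search does not. Neither search finds the RDF book, which is exactly the findability problem.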

    Social tagging
    Clearly, the 26 items found with the search “Semantic web” are not indexed by the “Resource description framework” or “RDF” subject heading. There is not even a subject heading for “Resource description framework” or “RDF“. In my personal view, from my personal context, this is an omission. Mind you, this is not only an issue in the catalogue of the Library of the University of Amsterdam, it is quite common. I tried it in the British Library Integrated Catalogue with similar results. Try it in your own OPAC!
    I presume that our professional cataloging colleagues can’t know everything about all subjects. That is completely understandable. I would not know how to catalog a book about a medical subject myself either! But this is exactly the point. If you allow end users to add their own tags to your bibliographic records, you enhance the findability of these records for specific groups of end users.
    I am not saying that cataloguing and indexing by library specialists using controlled vocabularies should be replaced by social tagging! No, not at all. I am just saying that both types of tagging/indexing are complementary. Sure, some of the tags added by end users may not follow cataloging standards, but who cares? Very often the end users adding tags of their own will be professional experts in their field. In any case, items with social tags will be found more often because specific end user groups can find them searching with their own terms.
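
    To make the “complementary” idea concrete, here is a minimal sketch of a catalogue that keeps controlled subject headings and end-user tags side by side and searches both. The class, its method names and the record ids are hypothetical; no real OPAC API is implied.

```python
from collections import defaultdict

class Catalogue:
    """Toy catalogue holding controlled headings and social tags separately."""

    def __init__(self):
        self.subjects = defaultdict(set)  # record id -> controlled subject headings
        self.tags = defaultdict(set)      # record id -> tags added by end users

    def add_subject(self, record_id, heading):
        self.subjects[record_id].add(heading.lower())

    def add_tag(self, record_id, tag):
        self.tags[record_id].add(tag.lower())

    def search(self, term, include_tags=True):
        """Return ids of records whose heading (or, optionally, tag) equals the term."""
        term = term.lower()
        hits = {rid for rid, headings in self.subjects.items() if term in headings}
        if include_tags:
            hits |= {rid for rid, tags in self.tags.items() if term in tags}
        return sorted(hits)

catalogue = Catalogue()
catalogue.add_subject("book-1", "Semantisch web")
catalogue.add_subject("book-2", "Semantisch web")
catalogue.add_tag("book-1", "RDF")  # a single end user adds a single tag

print(catalogue.search("rdf", include_tags=False))
print(catalogue.search("rdf"))
print(catalogue.search("semantisch web"))
```

    The controlled headings stay untouched; the tag only adds an extra way in for users who search with their own terms.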

    Better cataloging
    I suppose Ed Summers was trying to say the same thing as I just did above, when he commented “or better cataloging, I guess they both amount to the same thing eh?“, which I summarised as “social tagging=better cataloging“, but he can correct me if I’m wrong.
    Anyway, I hope I made it clear that I would not say “social tagging=better cataloging“, but rather “controlled vocabularies+social tagging=better cataloging“.
    Or alternatively, could we improve cataloging by professional library catalogers? I must admit I do not know enough about library training and practice to say something about that. I am not a trained librarian. Don’t hesitate to comment!

    Fulltext searching
    Is fulltext searching the miracle cure for findability problems, as Till Kinstler seems to suggest? Maybe.
    If all our print material were completely digitised and available for fulltext search, I have no doubt that all 26 items mentioned above (the results of the “semantic web” phrase search) would be found with the “resource description framework” or “rdf” search as well. But because fulltext search is by its very nature an “all words” search, the “rdf” fulltext search would also produce a lot of “noise”: items with no relation to “semantic web” at all (author’s initials “R.D.F.”, other acronyms “R.D.F.”; just see RDF in the BL catalogue). Again, see my post Relevance in context for an explanation of searching without context.
    Also, there will be books or articles about a subject that will not contain the actual subject term at all. With fulltext search these items will not be found.
    Moreover, fulltext searching actually limits the findable items to text, excluding other types, like images, maps, video, audio etc.
    This brings me to the “final solution”:

    Automatic metadata generation
    Of course this is mostly still wishful thinking. But there are a number of interesting implementations already.
    What I mean when I say “(implicit) automatic metadata generation” is: metadata that is NOT created deliberately by humans, but either generated and assigned as static metadata, or generated on the fly, by software, applying intelligent analysis to objects, of all types (text, images, audio, video, etc.).
    In the case of our “rdf” example, such a tool would analyse a text and assign “rdf” as a subject heading based on the content and context of this text, even if the term “rdf” does not appear in the text at all. It would also discard texts containing the string “rdf” that refer to something completely different. Of course for this to succeed there should be some kind of contextual environment with links to other records or even other systems to be able to determine if certain terminology is linked to frequently used terms not mentioned in the text itself (here the Linked Data developments could play a major role).
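
    A very crude sketch of this idea follows, assuming a hand-made list of context terms in place of a real contextual environment such as Linked Data. The term list, the threshold and the function name are invented for illustration, and real software would need far more intelligent analysis than counting words.

```python
import re

# Hypothetical context vocabulary: terms that tend to surround a subject.
# In practice this would come from links to other records or systems.
CONTEXT_TERMS = {
    "rdf": {"triple", "triples", "ontology", "sparql", "uri", "semantic", "linked"},
}
MIN_EVIDENCE = 3  # arbitrary threshold, chosen only for this example

def suggest_subjects(text, context_terms=CONTEXT_TERMS, min_evidence=MIN_EVIDENCE):
    """Suggest subjects whose context terms occur often enough in the text.

    The subject string itself is deliberately not counted as evidence, so a
    text can get the subject without mentioning it, and a text that merely
    contains the string gets nothing.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        subject
        for subject, related in context_terms.items()
        if len(words & related) >= min_evidence
    )

about_rdf = "Publishing linked data: triples, ontology design and SPARQL queries"
initials = "Memoirs of R.D.F. Jones, a country doctor"
acronym = "RDF annual report of the Rural Development Fund"

print(suggest_subjects(about_rdf))
print(suggest_subjects(initials))
print(suggest_subjects(acronym))
```

    The first text is assigned “rdf” although the term never appears in it, and the two texts that contain “R.D.F.” or “RDF” in another sense get no suggestion.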
    The same principle should also apply to non-textual objects, so that images, audio, video etc. about the same subject can be found in one go. Google has some interesting implementations in this field already: image search by colour and content type: see for example the search for “rdf” in Google Images with colour “red” and content type “clip art”.
    But of course a lot still needs to be done.


    11 responses to “Explicit and implicit metadata”

    • “but rather “controlled vocabularies+social tagging=better cataloging“.”

      So another possible solution is “better controlled vocabulary”, which is a bit different than “better cataloging” because while both are done by ‘catalogers’, CV maintenance happens at a different point in the process.

      If you’re doing social tagging from a CV, but there’s no term in the CV for “RDF”, then… where are we?

      Perhaps there should be a term for RDF. And/or perhaps the concept “Semantic Web” should have a lead-in term “RDF”, or a related term, or something to say that items on “Semantic Web” are likely to be on RDF too.

      Perhaps the controlled vocabulary itself should be maintained by a more ‘social’ method (i.e., more open to non-certified volunteers). So someone could add a term for RDF, and/or add those relationships and lead-in terms. How do we keep the CV somewhat sensible while opening up its maintenance to more non-certified volunteers, or even the general public at large?

      (Incidentally, I’m still interested in the idea of considering the list of Wikipedia article topics to BE a controlled vocabulary of concepts, something I thought of before. What if you could tag an item in the library not with free text, but with any concept that exists in Wikipedia?)

      And of course software should get better at using these controlled vocabularies. If “Semantic Web” DID have a lead in term “RDF”, then when you enter RDF in the catalog, the catalog should (automatically or offering you the option to) expand your search to include “Semantic Web” as well.
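
      The search expansion described here could look roughly like the following sketch. The lead-in mapping and the function name are hypothetical, and a real catalogue would read such relationships from its authority file rather than from a hard-coded dictionary.

```python
# Hypothetical lead-in terms pointing to their preferred heading.
LEAD_IN = {
    "rdf": "semantic web",
    "resource description framework": "semantic web",
}

def expand_query(term, lead_in=LEAD_IN):
    """Return the search term plus the preferred heading it leads in to, if any."""
    term = term.strip().lower()
    expanded = [term]
    preferred = lead_in.get(term)
    if preferred and preferred not in expanded:
        expanded.append(preferred)
    return expanded

print(expand_query("RDF"))
print(expand_query("maps"))
```

      Whether the catalog applies the expansion automatically or only offers it as an option is a design choice outside this sketch.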

      • Jonathan, you are right again, with “better controlled vocabulary”. But you could also consider this a part of “better cataloging”. But, yes, maybe the controlled vocabularies aren’t flexible enough?

        Your point about “tagging an item in the library not with free text, but with any concept that exists in Wikipedia”, well, that is Linked Data!

    • I assume that like social tagging, you are thinking of automatic metadata generation as a complement to cataloging by human specialists? Especially if the goal is “applying intelligent analysis to objects, of all types.”

      • Shannon, good question. Well, yes, that would be an option, especially as long as the analysing software is not intelligent enough. But suppose the software was perfect, or let’s say better than human catalogers, we wouldn’t need humans anymore, would we?

    • LCSH offers RDF (Document markup language), although should the cross reference start with “Resources” or “Resource”? I have only seen the singular in print. Anyway, it’s something to build on.

      • Bryan, thanks! So there are actually differences in controlled vocabularies worldwide. Apparently the LoC Subject Headings Authority file is better in this respect than the Dutch National Authority file.
        Note that your link points to the recent Linked Data/RDF interface to the LoC Authority files!

    • What about catalogue enrichment with tables of contents? We scan the ToCs of all new books and build search indexes over the OCR. ToCs give more relevant information than a full text search over all the text of the book.

      • Hi Anette, Interesting idea! A kind of “concise” full text index. Can we try that? What is the URL of your catalogue?
        Works only if you have ToCs of course 😉

    • As a cataloger myself, I appreciate that you regard controlled vocabulary subject headings and social tagging as complementary. Too many want to replace one with the other — and I completely agree that social tagging could add to a record, but never truly replace controlled subject headings.

      I do think that there are greater problems with the use of social tagging than many people realize. The optimism is based on the assumption that people are interested in everything — yet what you need is a large enough group, interested in a specific topic, enough of whom are interested in tagging, to reach a large enough grouping of tags in order to actually add value. ONE person adding a tag to a record is not necessarily as valuable as ten persons adding the same or similar tags. Yet many items/records will have only that ONE tag.

      I actually attended a presentation at LC a few years ago (dates blur) by a librarian from the World Bank. She was cataloging a lot of electronic documents for their collection, and as a result they had developed a piece of software that would search the text and suggest subject headings or keywords. That sounds a lot like your dream idea there at the end. And it takes care of the fulltext-is-not-keyword problem, as explained here:

      I do know that LCSH is working toward the possibility of social tagging enriching LCSH subject headings. But I think the technicalities of it, as well as the level of human involvement, are still being discussed. My awareness of this is based on a recent chat, not thorough knowledge.

      • Thanks mpol. I agree that, if “social tagging” is enabled in a catalog, you will probably see groups of experts (virtual communities, “tribes”) emerging who will take care of this within their specific subject area. This could solve the problem of “lagging” controlled vocabularies and of librarians unable to keep up with rapid developments.

        But I also think that one person’s contribution may be as valuable as a large group’s. In my example of “semantic web”, if you searched for the subject heading “semantic web” in a Dutch catalog, you would not find anything, because the official controlled vocabulary term is “Semantisch web”. If just one end user then added the “semantic web” tag, this would benefit everybody.

        Also, if we want to attract more young people to our catalogs and libraries, then it definitely is time to allow social tagging on our systems. A lot of terms in the Dutch National Subject Headings controlled vocabulary appear very archaic, even to me! Next generations of end users would not think of using words like that to search for information, because they are using completely different (modern) terminology. So it is either: allow social tagging or completely revolutionise the way libraries are using controlled vocabularies, if we do not want our catalogs to lose out to Google.

    • Automatic metadata generation: great, but already very hard to do with text. It does enable context-relevant indexing, for example when you use a specialized thesaurus to do this.
      With other material this does not work very well, unless you are interested in any picture containing a lot of red. (That could be interesting if you are looking for horror movies.)
