Library2.0 and beyond
RSS icon Home icon
  • Explicit and implicit metadata

    Posted on August 20th, 2009 Lukas Koster 12 comments
    tagged

    Tagged © funkandjazz

    On August 17, after I tested a search in our new Aleph OPAC and mentioned my surprise on Twitter, the following discussion unfolded between me (lukask), Ed Summers of the Library of Congress and Till Kinstler of GBV (German Union Library Network):

    • lukask: Just found out we only have one item about RDF in our catalogue: http://tinyurl.com/lz75c4
    • edsu: @lukask broaden that search :-) http://is.gd/2l6vB
    • lukask: @edsu Ha! Thanks! But I’m sure that RDF will be mentioned in these 29 titles! A case for social tagging!
    • edsu: @lukask or better cataloging :-)
    • edsu: @lukask i guess they both amount to the same thing eh?
    • lukask: @edsu That’s an interesting position…”social tagging=better cataloging”. I will ask my cataloguing co-workers about this specific example
    • edsu: @lukask make sure to wear body-armor
    • lukask: @edsu Yes I know! I will bring it up at tomorrow’s party for the celebration of our ALEPH STP (after some drinks…)
    • tillk: @edsu @lukask or fulltext search… :-) SCNR…
    • edsu: @tillk yeah, totally — with projects like @googlebooks and @hathitrust we may look back on the age of cataloging with different eyes …
    • lukask: @tillk @edsu Fulltext search yes, or “implicit automatic metadata generation”?

    What happened here was:

    • A problem with findability of specific bibliographic items was observed: although it is highly unlikely that books about the Semantic Web will not cover RDF-Resource Description Framework, none of the 29 titles found with “Semantic Web” could be found with the search term “Resource Description Framework“; on the other side, the only item found with “Resource Description Framework” was NOT found with “Semantic Web“. I must add that the “Semantic web” search was an “All words” search. Only 20 of the results were indexed with the Dutch subject heading “Semantisch web” (which term is never used in real life as far as I know; the English term is an international concept). Some results were off topic, they just happened to have “semantic” and “web” somewhere in their metadata. A better search would have been a phrase search (adjacent) with “semantic web” in actual quotes, which gives 26 items. But of these, a small number were not indexed with subject heading “Semantisch web“. Another note: searching with “RDF” gets you all kinds of results. Read more on the issue of searching and relevance in my post Relevance in context.
    • Four possible solutions were suggested:
    1. social tagging
    2. better cataloging
    3. fulltext searching
    4. automatic metadata generation

    Social tagging
    Clearly, the 26 items found with the search “Semantic web” are not indexed by the “Resource description framework” or “RDF” subject heading. There is not even a subject heading for “Resource description framework” or “RDF“. In my personal view, from my personal context, this is an omission. Mind you, this is not only an issue in the catalogue of the Library of the University of Amsterdam, it is quite common. I tried it in the British Library Integrated Catalogue with similar results. Try it in your own OPAC!
    I presume that our professional cataloging colleagues can’t know everything about all subjects. That is completely understandable. I would not know how to catalog a book about a medical subject myself either! But this is exactly the point. If you allow end users to add their own tags to your bibliographic records, you enhance the findability of these records for specific groups of end users.
    I am not saying that cataloguing and indexing by library specialists using controlled vocabularies should be replaced by social tagging! No, not at all. I am just saying that both types of tagging/indexing are complementary. Sure, some of the tags added by end users may not follow cataloging standards, but who cares? Very often the end users adding tags of their own will be professional experts in their field. In any case, items with social tags will be found more often because specific end user groups can find them searching with their own terms.

    Better cataloging
    I suppose Ed Summers was trying to say the same thing as I just did above, when he commented “or better cataloging, I guess they both amount to the same thing eh?“, which I summarised as “social tagging=better cataloging“, but he can correct me if I’m wrong.
    Anyway, I hope I made it clear that I would not say “social tagging=better cataloging“, but rather “controlled vocabularies+social tagging=better cataloging“.
    Or alternatively, could we improve cataloging by professional library catalogers? I must admit I do not know enough about library training and practice to say something about that. I am not a trained librarian. Don’t hesitate to comment!

    Fulltext searching
    Is fulltext searching the miracle cure for findability problems, as Till Kinstler seems to suggest? Maybe.
    Suppose all our print material was completely digitised and available for fulltext search, I have no doubt that all 26 items mentioned above (the results of the “semantic web” all words search) would be found with the “resource description framework” or “rdf” search as well. But because fulltext search is by its very nature an “all words” search, the “rdf” fulltext search would also give a lot of “noise”, or items not having any relation to “semantic web” at all (author’s initials “R.D.F”, other acronyms “R.D.F.”, just see RDF in the BL catalogue). Again, see my post Relevance in context for an explanation of searching without context.
    Also, there will be books or articles about a subject that will not contain the actual subject term at all. With fulltext search these items will not be found.
    Moreover, fulltext searching actually limits the findable items to text, excluding other types, like images, maps, video, audio etc.
    This brings me to the “final solution”:

    Automatic metadata generation
    Of course this is mostly still wishful thinking. But there are a number of interesting implementations already.
    What I mean when I say “(implicit) automatic metadata generation” is: metadata that is NOT created deliberately by humans, but either generated and assigned as static metadata, or generated on the fly, by software, applying intelligent analysis to objects, of all types (text, images, audio, video, etc.).
    In the case of our “rdf” example, such a tool would analyse a text and assign “rdf” as a subject heading based on the content and context of this text, even if the term “rdf” does not appear in the text at all. It would also discard texts containing the string “rdf” that refer to something completely different. Of course for this to succeed there should be some kind of contextual environment with links to other records or even other systems to be able to determine if certain terminology is linked to frequently used terms not mentioned in the text itself (here the Linked Data developments could play a major role).
    The same principle should also apply to non-textual objects, so that images, audio, video etc. about the same subject can be found in one go. Google has some interesting implementations in this field already: image search by colour and content type: see for example the search for “rdf” in Google Images with colour “red”and content type “clip art”.
    But of course there still needs a lot to be done.

    Share