Why good subject tagging is hard
An introductory guide to tagging: Part 3
It has been my experience that many organisations struggle with high-quality tagging. In this post I am going to reflect on some of the causes of this problem.
As discussed in earlier posts, the value of tagging is often not properly understood. Most often it is considered to be there to support keyword search alone. Even the taggers themselves are not aware of how the tags are going to be used, or why they need to be of good quality to add value.
The next issue is around the guidelines given to taggers. Typically they are extremely poor, and go along the lines of “tag what the content is about”. The question “what is it about?” is harder than it seems at first glance and not very helpful: it usually results in a different answer from each tagger. To get good-quality tagging you need to ask good-quality questions.
As the tagging quality goes down, the ability to drive useful features from the tags also diminishes. This in turn does nothing to reassure people of the importance of good tags, and you end up in a vicious cycle of ever-decreasing quality and value.
Guidance on what good tagging looks like should be the point where Library Science comes to the rescue. But the challenge of how to identify the subject of a work is a surprisingly under-represented topic in Library Science literature.
Back in the 1970s, Hutchins observed:
THE LITERATURE OF INDEXING AND CLASSIFICATION contains remarkably little discussion of the processes of indexing and classifying...we find very little about how indexers and classifiers decide what the subject of a document is, how they decide what it is 'about'.
The concept of 'aboutness' in subject indexing: W. J. Hutchins
Little has changed. Taggers are often told to imagine "how might the user search for this piece of content" or "take the author's view and characterise what they are trying to say". Neither is very useful as a framing question.
Hutchins, in his paper, tackles this problem and introduces a distinction between summarisation and aboutness. Summarisation is the representation of the total subject content of a document and the statements made therein. Aboutness, by contrast, captures only the thing that has been written about.
A very simple example might be a sports story about a player and whether they will move clubs, their struggles with injury and their performance this year. Many arguments are developed, making up the narrative of the story. Summarisation would capture not only the player as a tag but also something of the claims made about their fitness, performance and likely future at the club. Aboutness, on the other hand, only attempts to capture the thing being talked about and not the claims made about it. In this case, that is just the player.
This distinction is rarely made in tagging guidelines and the tagger is left to balance the level of summarisation and aboutness on gut feel.
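To make the distinction concrete, here is a minimal sketch of how the two kinds of tagging could be kept apart in data. The class names, the field names and the player "Alex Example" are all hypothetical, chosen only for illustration; nothing here comes from Hutchins's paper or from any particular tagging tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Claim:
    """Something the content says about a subject (summarisation)."""
    subject: str
    aspect: str


@dataclass
class Tagging:
    """Tags for one piece of content, split along Hutchins's distinction."""
    aboutness: list[str] = field(default_factory=list)        # the things written about
    summarisation: list[Claim] = field(default_factory=list)  # what is said about them


player = "Alex Example"  # hypothetical player, for illustration only

# Aboutness alone: just the thing being talked about.
aboutness_only = Tagging(aboutness=[player])

# Summarisation on top: also record the claims the story makes.
with_summarisation = Tagging(
    aboutness=[player],
    summarisation=[
        Claim(player, "struggles with injury"),
        Claim(player, "performance this year"),
        Claim(player, "possible move to another club"),
    ],
)
```

Keeping the two lists separate means the guidelines, and the tagger, have to say explicitly which one is being filled in, rather than blending them on gut feel.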
Hutchins goes on to characterise this distinction in another interesting way. Aboutness represents those things that are expected to already be part of the reader's knowledge. Every piece of content assumes a level of knowledge in the reader, on which the narrative then builds. Summarisation, on top of this, attempts to capture what is new: what the narrative adds to that existing knowledge. The readers are familiar with the player, but not with the subtle changes in his fitness and future prospects.
With this in mind, the questions we might ask a tagger to answer are: What is the thing being talked about? What claims or facts are made about a thing in the content? What might the reader be presumed to know before reading this?
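As a rough sketch, these three questions could be put to the tagger as explicit prompts, with a check for any that have been left unanswered. The keys `thing`, `claims` and `presumed` and the function `unanswered` are hypothetical names invented for this example.

```python
# The three framing questions, keyed by hypothetical short names.
QUESTIONS = {
    "thing": "What is the thing being talked about?",
    "claims": "What claims or facts are made about a thing in the content?",
    "presumed": "What might the reader be presumed to know before reading this?",
}


def unanswered(answers: dict[str, list[str]]) -> list[str]:
    """Return the text of every question that has no answer yet."""
    return [text for key, text in QUESTIONS.items() if not answers.get(key)]


# A tagger who has only identified the player so far.
draft = {"thing": ["Alex Example"], "claims": []}
remaining = unanswered(draft)
```

Here `remaining` holds the second and third questions, since the draft gives an empty answer to one and no answer to the other.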
Whether we choose to do summarisation or capture aboutness is a decision related to the desired outcome of the system. But we should be clear about which one it is.
A more recent analysis by Birger Hjørland (The concept of subject in Information Science) revisits the problem. He cites many examples of guidelines that expect taggers to do little more than perform "mind reading" on prospective future consumers of the content.
Instead, he argues, we must think about tags as a tool in enabling the full potential of the document within the context of the overarching system. The desired outcomes of this system should then dictate the tagging strategies and guidelines.
It is this domain and system that must be crystal clear in taggers' minds, as opposed to the idea of aligning the subject to a theoretical searcher. They also need to understand the value that they are creating, and how.
Sadly, this is the opposite of what we see in many tagging tools and guidelines today.
I would argue that tagging should be a contract between taggers and the system, much in the way we see in the design of data services. Tagging guidance needs to be managed with a vision of the desired outcomes, combined with asking clear and simple questions of the taggers. This is as true for humans as it is for machines.
In summary, tagging outside of libraries and archives is often of poor quality. Its value is rarely understood, and content creators are often asked to tag in ways that are a chore and cognitively awkward. Worst of all, the question of what good tagging looks like has rarely been asked.
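One possible reading of the contract idea, sketched in code: the system states the outcome it wants, the single question it is asking, and the terms it can accept, and a tagger's submission is checked against that. The `TagContract` class, its fields, the `violations` function and the example vocabulary are all assumptions made for this sketch, not a description of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagContract:
    """What the system asks of a tagger, and why (hypothetical structure)."""
    outcome: str                 # the feature the tags are meant to drive
    question: str                # the one clear question put to the tagger
    vocabulary: frozenset[str]   # the terms the system understands
    max_tags: int                # upper bound on tags per piece of content


def violations(contract: TagContract, tags: list[str]) -> list[str]:
    """List the ways a set of tags breaks the contract (empty if none)."""
    problems = []
    if not tags:
        problems.append("no tags supplied")
    if len(tags) > contract.max_tags:
        problems.append(f"too many tags: {len(tags)} > {contract.max_tags}")
    for tag in tags:
        if tag not in contract.vocabulary:
            problems.append(f"unknown tag: {tag}")
    return problems


contract = TagContract(
    outcome="player topic pages",
    question="Which player is this story about?",
    vocabulary=frozenset({"Alex Example", "Sam Sample"}),  # hypothetical terms
    max_tags=1,
)

ok = violations(contract, ["Alex Example"])
bad = violations(contract, ["Alex Example", "transfer rumours"])
```

The point of the design is that the outcome and the question travel together with the validation rules, so a human tagger and an automated one are held to the same stated expectations.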
In the next post I will look at some potential solutions to the challenge of better tagging.
Part 1: Why the web turned its back on the librarians. And why we need them back.
Part 2: The importance of tagging when publishing on the web
Part 4: How to use domain modelling to improve subject tagging