
Benchmarking Tagmatic

October 10, 2019
3 mins
Tagmatic goes head to head with a number of standard multi-label classification implementations. The result: don't build, buy.

It may be argued that customer satisfaction is the one KPI that ultimately matters when judging the success of a product. By that measure Tagmatic passes with flying colours, given the fantastic feedback we are receiving from clients.

Remarkably enough, we find that raw inference performance is often at the edge of our clients' considerations. This is partly due to the intrinsically tailored nature of Tagmatic's models, but mostly attributable to a set of invaluable qualities that go beyond the naked F1 scores: being able to quickly spin up a model on your own content and vocabulary and then interact with it via a minimalistic HTTP interface, zero maintenance, on-the-fly learning of new classes, low latency, and security, scalability and high availability out of the box. These are crucial factors in enterprise-scale deliveries.

Your Intellectual Property lies in your content, your data, your data models, and how you use the classifications. It is not in commoditised compute solutions.
Your resources are thus more effectively spent developing and exploiting your market-differentiating IP, rather than on replicating the engineering complexities of off-the-shelf MLaaS solutions, which may well fall outside your technical team's core competence.

This is where Tagmatic comes in. The hard work and complex engineering have been done - just bring your content!

The datasets

RCV1 is arguably the most popular dataset for multi-label classification, comprising 800,000 Reuters newswire stories tagged with 103 classes. Oddly enough for such a totemic dataset, the breadth of its label vocabulary is anachronistic compared to some of our clients: modern global publishers routinely maintain taxonomies with thousands of topics.
The dataset is available here, although unfortunately not in its original XML format with full HTML markup, which we would normally leverage as part of the classification task. The corpus also comes pre-tokenised, with train and test sets already carved out. Besides taking away part of the fun, this has negative implications for the vectorisation logic Tagmatic implements, rendering it largely ineffective. For the purpose of this benchmark, we have used the full training set (23,000 documents) and a reduced test set (100,000 documents).
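For readers who want to poke at the same pre-tokenised release, scikit-learn ships it directly. The sketch below simply loads that version; it is not the pipeline behind our figures, and the train/test slicing we actually used is the one described above rather than reproduced here.

```python
# Minimal sketch: loading the pre-vectorised RCV1 release via scikit-learn.
# This is the LYRL2004 token/TF-IDF representation, not the original XML with
# markup - exactly the limitation noted above.
from sklearn.datasets import fetch_rcv1

rcv1 = fetch_rcv1()                # downloads on first call
X, y = rcv1.data, rcv1.target      # sparse TF-IDF matrix, sparse 103-column label matrix

print(X.shape)                     # (804414, 47236)
print(y.shape)                     # (804414, 103)
print(rcv1.target_names[:5])       # first few of the 103 topic codes
```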
AAPD (Arxiv Academic Paper Dataset) is a collection of research papers covering a range of subjects and annotated with several classification schemes of varying quality. We have carved out a subset in the Computer Science domain, comprising 32,000 papers and 39 subjects in total.
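As with any multi-label corpus, the subject annotations need to be turned into a binary indicator matrix before most classifiers will accept them. The sketch below shows one conventional way to do that with scikit-learn; the record layout, field names and subject codes are illustrative, not the actual AAPD format.

```python
# Minimal sketch: building a multi-label indicator matrix for a corpus like the
# AAPD subset above. The record structure and subject codes here are made up.
from sklearn.preprocessing import MultiLabelBinarizer

papers = [
    {"abstract": "A study of convolutional networks ...", "subjects": ["cs.CV", "cs.LG"]},
    {"abstract": "Byzantine fault tolerance revisited ...", "subjects": ["cs.DC"]},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform([p["subjects"] for p in papers])   # one column per subject

print(mlb.classes_)   # ['cs.CV' 'cs.DC' 'cs.LG']
print(Y)              # [[1 0 1]
                      #  [0 1 0]]
```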

The benchmark

We have considered a number of models amongst the most widely employed for multi-label classification tasks. The reference evaluation metric is weighted-F1.
Model                            RCV1    AAPD
One-vs-Rest - LinearSVC          0.78    0.71
One-vs-Rest - SGD                0.74    0.62
One-vs-Rest - XGBoost            0.75    0.64
Classifier Chain - LinearSVC     0.79    0.71
Label Powerset - MultinomialNB   0.55    0.55
Label Powerset - RandomForest    0.60    0.55
Tagmatic                         0.80    0.74
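To make the comparison concrete, here is a sketch of the kind of baseline sitting in the first row of the table: One-vs-Rest over LinearSVC, scored with weighted F1. Hyperparameters, preprocessing and the exact train/test slicing are illustrative assumptions, so the number it prints will not match the table exactly.

```python
# Minimal sketch of a One-vs-Rest / LinearSVC baseline scored with weighted F1.
# Uses the pre-vectorised RCV1 release; all settings are illustrative only.
from sklearn.datasets import fetch_rcv1
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

rcv1 = fetch_rcv1()

# LYRL2004 convention: the first ~23k documents form the training split.
# We evaluate on a reduced slice of the remainder, as in the benchmark above.
X_train, y_train = rcv1.data[:23149], rcv1.target[:23149].toarray()
X_test, y_test = rcv1.data[23149:123149], rcv1.target[23149:123149].toarray()

clf = OneVsRestClassifier(LinearSVC(C=1.0), n_jobs=-1)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("weighted F1:", f1_score(y_test, pred, average="weighted"))
```

The other rows can be sketched in much the same way, for example by swapping in sklearn.multioutput.ClassifierChain or scikit-multilearn's LabelPowerset around the same base estimators.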

Tagmatic offers out-of-the-box improvement over a number of basic implementations. Granted, a team of Data Scientists might well come up with better-optimised flavours of such models, pushing performance up. But at what cost? By how much? And what then? There is still the much bigger task of bringing that model to fruition in a real-world environment, which requires non-trivial development effort. Is that really worth it?
Unless text classification is either your USP or a strong market differentiator, then the answer ought to be a resounding no.
Tagmatic is production-ready, enterprise-grade, and available to you right now - see our Text AI Services product.

Lastly - Using Tagmatic (Text AI Services) on Hansard

We have also benchmarked Tagmatic against Hansard data: a corpus of ~200,000 parliamentary interventions from the 2005/6 period, classified with ~7,200 categories, scoring a whopping 0.83. More on this in the next blog post.
Make sure you sign up for our newsletter and follow @datalanguage on Twitter for all the latest.