Preparing a Training Corpus

February 22, 2022
Good training corpora are essential for ensuring AI automation services perform to their full potential. Here's how to prepare your 'Golden Corpus'.

To train AI automation services to work optimally in your own specialized domain, you need a training corpus. Generic solutions trained on generic datasets will get you part of the way, but each business has its own specializations and core structures, and these should be addressed.
Our AI Services are optimized to work with small training sets, but a training set is still required.

  • Some customers have not assembled their corpus yet
  • We can help with your corpus preparation - however much or little you need

Preparing your Golden Corpus

A Golden Corpus (GC) represents what you perceive to be the ground truth for a model to learn from. It is also used as a test to judge the extent to which a model has learned to produce the expected output.
By going through the work of assembling a GC, you are forced both to reflect on and to convey in explicit terms what your expected output is. This is a great exercise to go through, as often there is no single “right answer”: there may be numerous equally valid opinions, edge cases, and so on.
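The second role of a GC, as a test of what a model has learned, can be sketched in a few lines. This is a minimal illustration for a classification task, not a description of how our services score models: the function name `score_against_golden`, the document ids, and the labels are all hypothetical.

```python
from collections import Counter


def score_against_golden(golden, predicted):
    """Compare predicted labels with golden labels, document by document.

    Both arguments map a document id to a single label (an assumed format).
    Returns overall accuracy and the per-label recall.
    """
    if not golden:
        raise ValueError("the golden corpus is empty")
    if golden.keys() != predicted.keys():
        raise ValueError("golden and predicted must cover the same document ids")

    per_label = Counter()  # how many golden documents carry each label
    correct = Counter()    # how many of those the model labeled correctly
    for doc_id, gold_label in golden.items():
        per_label[gold_label] += 1
        if predicted[doc_id] == gold_label:
            correct[gold_label] += 1

    accuracy = sum(correct.values()) / len(golden)
    recall = {label: correct[label] / n for label, n in per_label.items()}
    return accuracy, recall


# Hypothetical golden labels and model output for four documents.
golden = {"doc1": "finance", "doc2": "legal", "doc3": "finance", "doc4": "legal"}
predicted = {"doc1": "finance", "doc2": "finance", "doc3": "finance", "doc4": "legal"}

accuracy, recall = score_against_golden(golden, predicted)
print(accuracy, recall)
```

A per-label breakdown is included because a single accuracy figure can hide a label the model rarely gets right.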
The size of your Golden Corpus depends on the task at hand.
The Text Analytics use cases we work on for our customers are:

  1. Text Classification using true "aboutness".
  2. Named Entity Recognition ("NER") including inline entity extraction.
  3. Relationship Extraction ("RE"), optimized for knowledge graphs.
  4. Content Recommendations for unstructured content in your domain.
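To make the first three use cases concrete, here is one way a single Golden Corpus record could be represented and sanity-checked. The format is an assumption made for illustration (the class names `Entity` and `GoldenRecord`, the field names, and the example sentence are all hypothetical), not a required schema.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    start: int    # character offset where the entity begins (inclusive)
    end: int      # character offset where the entity ends (exclusive)
    label: str    # entity type, e.g. "ORG"
    surface: str  # the text the annotator highlighted


@dataclass
class GoldenRecord:
    text: str
    category: str = ""                             # classification label
    entities: list = field(default_factory=list)   # NER annotations
    relations: list = field(default_factory=list)  # (head index, relation, tail index)


def validate(record):
    """Return a list of problems; an empty list means the record is consistent."""
    problems = []
    for i, ent in enumerate(record.entities):
        if not (0 <= ent.start < ent.end <= len(record.text)):
            problems.append(f"entity {i}: offsets out of range")
        elif record.text[ent.start:ent.end] != ent.surface:
            problems.append(f"entity {i}: offsets do not match surface text")
    for head, rel, tail in record.relations:
        n = len(record.entities)
        if not (0 <= head < n and 0 <= tail < n):
            problems.append(f"relation {rel!r}: refers to a missing entity")
    return problems


# Hypothetical annotated sentence covering all three annotation types.
record = GoldenRecord(
    text="Acme Corp acquired Widget Ltd in 2021.",
    category="mergers",
    entities=[
        Entity(0, 9, "ORG", "Acme Corp"),
        Entity(19, 29, "ORG", "Widget Ltd"),
    ],
    relations=[(0, "acquired", 1)],
)
print(validate(record))
```

Checking that offsets match the highlighted text catches a common annotation slip before it reaches training.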

We can help you size the Golden Corpus you need for your specific use case.

Rapid launch using a 'Silver Corpus'

Depending on the type of service required (classification/NER/RE), we might be able to help bootstrap a 'Silver Corpus' - a machine-generated, simplified version of a GC. Whilst this doesn’t remove the need for a proper Golden Corpus (and the benefits of going through the process discussed above), it can be useful to jump-start a simple implementation of a model for demonstration and/or integration purposes.
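As a rough illustration of what "machine-generated" can mean, the sketch below labels documents with simple keyword rules. This is only one possible bootstrapping approach, assumed here for the example; it is not a description of how our Silver Corpora are actually produced. The keyword lists, the `silver_label` function, and the sample documents are hypothetical.

```python
import re

# Hypothetical keyword lists; in practice these would come from domain experts.
KEYWORDS = {
    "finance": ["invoice", "dividend", "interest rate"],
    "legal": ["contract", "liability", "plaintiff"],
}


def silver_label(text):
    """Return the label whose keywords occur most often in the text.

    Returns None when no keyword matches or when two labels tie, so that
    uncertain documents are left for human review rather than guessed.
    """
    lowered = text.lower()
    counts = {
        label: sum(
            len(re.findall(r"\b" + re.escape(kw) + r"\b", lowered)) for kw in kws
        )
        for label, kws in KEYWORDS.items()
    }
    best = max(counts.values())
    winners = [label for label, count in counts.items() if count == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


documents = [
    "The invoice lists the dividend paid last quarter.",
    "The plaintiff claims the contract limits liability.",
    "Lunch is served at noon.",
]
silver_corpus = [(doc, silver_label(doc)) for doc in documents]
for doc, label in silver_corpus:
    print(label, "|", doc)
```

Labels produced this way are noisy by design, which is why a Silver Corpus suits demonstrations and integration work rather than replacing a reviewed Golden Corpus.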

Bespoke Corpora

Pre-canned generalist solutions can help bootstrap new predictive capabilities in your organization, but every business is different. If you're looking for a more bespoke service that offers models tailored to your domain, we can help.