The constant growth of unstructured content remains a challenge for information management programs, which often lack the tools and time to determine which documents contain important content and require retention, curation, disposal or other actions. A key step in addressing this challenge is enriching content by classifying, tagging or categorizing it with a controlled set of domain-specific terms. This improves indexing, search, access and navigation, and is an important step toward delivering the right content to the right people in the right context and integrating it with other enterprise information. TopBraid EVN Tagger with AutoClassifier implements an automated solution to classifying content, with “human in the loop” mediation to ensure quality results. It was introduced earlier in the blog post Deep Dives into TopBraid EVN — Part 1: Automated tagging with the New AutoClassifier.

EVN’s AutoClassifier employs machine learning to analyze and classify natural language documents, using a user-provided vocabulary as its source of tags. Since machine learning requires training data, a common question is: “how does one create a training set?” In this blog we address this question by describing the steps we took to automate the creation of training data for a set of documents extracted from a public website.

A training set doesn’t need to be large; quite often 100 or even fewer documents will suffice. The best training sets are created by people who have a good understanding of the content. Chances are some of the content is already tagged, possibly using free-form tagging techniques that let authors define new tags instead of merely selecting existing ones. Any such information may be used to create training data for auto-classification.

We successfully experimented with auto-tagging a large collection of Investopedia documents using our own vocabulary, which we call Finance Data Governance. All Investopedia articles are already free-form tagged by their authors. We automated the creation of the training data by lexically matching these pre-existing tags with concept labels from our controlled vocabulary and, where a match was found, replacing the free tag with the corresponding concept. The success rate of this operation, i.e. the proportion of documents that ended up tagged with vocabulary concepts rather than mere labels, was high enough to spare us from manually creating any training data. The sketch below illustrates the idea.
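Here is a minimal sketch of this matching step, assuming the vocabulary is available as a SKOS file (the filename is hypothetical) and that free tags are plain strings. A real implementation might normalize more aggressively (stemming, punctuation stripping, etc.).

```python
# Sketch: map free-form tags to vocabulary concepts by lexical label match.
# The filename below is hypothetical; matching is simple case-insensitive
# string comparison against skos:prefLabel and skos:altLabel values.
from rdflib import Graph
from rdflib.namespace import SKOS

vocab = Graph().parse("finance_data_governance.ttl")  # hypothetical filename

# Build a lookup from normalized label text (preferred and alternative)
# to the concept that carries it.
label_to_concept = {}
for predicate in (SKOS.prefLabel, SKOS.altLabel):
    for concept, label in vocab.subject_objects(predicate):
        label_to_concept[str(label).strip().lower()] = concept

def match_tags(free_tags):
    """Return the vocabulary concepts whose labels match the given free tags."""
    return [label_to_concept[t.strip().lower()]
            for t in free_tags if t.strip().lower() in label_to_concept]

# A document whose matched-tag list is non-empty can join the training set.
print(match_tags(["Derivatives", "hedge fund", "some unknown tag"]))
```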

The informal sketch above illustrates the steps we took. This approach can be adapted to simplify and speed up the creation of training sets for various sources of natural language content, such as corporate document management systems (SharePoint farms, etc.).

Classify Through RDF

Our tagging vocabulary, Finance Data Governance, is a taxonomy consisting of 96 classes and 1,131 concepts across 6 concept schemes, covering most aspects of professional finance. These include pure financial entities (such as transactions, assets, instruments, measures), financial governance subjects (events, processes, policies, strategies), government subjects (regulations, programs, taxes), information technology subjects and systems (limited to the finance vertical), legal subjects (liabilities, obligations, contracts, codes), and organization subjects (most levels of public and private organizations involved in finance). About 10% of these concepts have several labels: in addition to a preferred label, they have one or more alternative labels that extend the set of matchable terms with acronyms or synonyms. We are currently working on increasing the number of available alternative labels. The overall taxonomy was built by integrating and eventually structuring publicly available vocabularies, glossaries and terminologies from the Dodd-Frank Act, the International Swaps and Derivatives Association, the Basel III international regulatory framework for banks, and the EU Regulation on markets and financial instruments.
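To make the role of alternative labels concrete, here is an illustrative SKOS concept built with RDFLib. The namespace, concept URI and labels are made up for the example and are not taken from the actual Finance Data Governance taxonomy.

```python
# Sketch: a SKOS concept whose altLabel (an acronym) extends the set of
# matchable terms beyond its preferred label. All URIs here are invented.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

FDG = Namespace("http://example.org/fdg#")  # hypothetical namespace

g = Graph()
g.bind("skos", SKOS)
concept = FDG.CreditDefaultSwap
g.add((concept, RDF.type, SKOS.Concept))
g.add((concept, SKOS.prefLabel, Literal("credit default swap", lang="en")))
g.add((concept, SKOS.altLabel, Literal("CDS", lang="en")))  # acronym synonym

print(g.serialize(format="turtle"))
```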

Our first step was to make the text content of the documents available to AutoClassifier. This is done by creating an EVN Content Tag Set, which can differentiate, as desired, between the different types of documents to be tagged:

[Figure: Investopedia document types]

We derived the lightweight RDFS schema describing these document types from their source system, the Investopedia website, which mainly hosts articles, tutorials, definitions and frequently asked questions. For this exercise, we retrieved the natural language content of over 13,000 articles and 7,000 Q&A documents, so our Content Tag Set was limited to two classes of documents. Extracting the document bodies, the text to be tagged, from Investopedia pages was done by scraping the website and creating an RDF graph according to the schema defined earlier. These two tasks were performed by a Python script using the Scrapy and RDFLib libraries respectively; a sketch of this step follows below. All target pages are retrieved as HTTP responses stored in transient objects before being transposed into RDF. In addition to the documents’ bodies and titles, we also included their URLs and preassigned free tags in the resulting RDF resources. Storing the document URL gives us a link back to the original document in the final auto-classification report, as explained in the product’s documentation. The preassigned tags are needed to create the training data.
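The following is a minimal sketch of the scrape-and-transpose step, combining Scrapy and RDFLib as the text describes. The schema namespace, class and property names, CSS selectors, and seed URL are all assumptions for illustration; the real schema and selectors depend on the site’s markup.

```python
# Sketch: crawl pages with Scrapy, transpose each into RDF with RDFLib.
# The ex: schema (Article, url, title, body, freeTag) is hypothetical.
import scrapy
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/schema#")

class InvestopediaSpider(scrapy.Spider):
    name = "investopedia"
    start_urls = ["https://www.investopedia.com/"]  # illustrative seed page

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.graph = Graph()
        self.graph.bind("ex", EX)

    def parse(self, response):
        # Each crawled page becomes one RDF resource typed by the schema.
        doc = URIRef(response.url)
        self.graph.add((doc, RDF.type, EX.Article))
        self.graph.add((doc, EX.url, Literal(response.url)))
        title = response.css("h1::text").get() or ""
        self.graph.add((doc, EX.title, Literal(title)))
        body = " ".join(response.css("article p::text").getall())
        self.graph.add((doc, EX.body, Literal(body)))
        # Pre-assigned free tags are kept: they drive training-data creation.
        for tag in response.css("a.tag::text").getall():  # selector is assumed
            self.graph.add((doc, EX.freeTag, Literal(tag)))

    def closed(self, reason):
        # Serialize the accumulated graph once the crawl finishes.
        self.graph.serialize("investopedia.ttl", format="turtle")
```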

We were then able to match the pre-existing tags of 3,558 documents to concepts from our vocabulary. We used only some of these matched documents as our training set: in general, we do not recommend having more than 1,000 training documents, as a larger number will have little impact on the quality of the results. We then ran TopBraid EVN Tagger’s AutoClassifier on all documents. Here are some examples of the generated tags:

TopBraid EVN offers RESTful APIs that make it possible to integrate AutoClassifier with other applications, triggering auto-classification for a collection of documents or for a single document at a time. For example, API calls can be used to present content authors with a suggested list of controlled tags directly within their authoring environment, increasing the consistency of the keywords they select. Alternatively, auto-tagging can fully automate keyword creation without involving authors at all, or add auto-generated keywords to those the authors provide.
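As a rough illustration of such an integration, here is a hedged sketch of calling an auto-classification service over HTTP. The endpoint path, parameters and response shape shown here are assumptions for illustration, not EVN’s documented API; consult the TopBraid EVN documentation for the actual service names and payloads.

```python
# Sketch: request suggested tags for a document from an (assumed) REST
# endpoint. The base URL, path, and JSON shape below are hypothetical.
import requests

EVN_BASE = "https://evn.example.com"  # hypothetical server

def suggest_tags(document_text):
    """Ask the (assumed) auto-classifier service for suggested concepts."""
    response = requests.post(
        f"{EVN_BASE}/autoclassifier/classify",  # hypothetical endpoint
        data={"text": document_text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # assumed to be a list of suggested concept URIs

# An authoring tool could call this as the author edits, then display the
# returned concepts as tag suggestions.
suggestions = suggest_tags("An interest rate swap is a derivative contract.")
```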

The approach we outlined here demonstrates that if you already have some tagged content, you are likely to be able to use it for training, even if the tags don’t fully correspond to those in the vocabulary you will use for auto-tagging. It is worth noting that this approach benefited from the vocabulary’s terms being richly labelled, as it depends on a significant overlap between the vocabulary concepts’ labels and the pre-existing tags. This blog has illustrated how to use a corpus of Web documents that already carry authors’ manual tag annotations to build a training dataset for TopBraid EVN Tagger’s AutoClassifier, making the overall process more practical.