Deep Dives into TopBraid EVN -- Part 1: Automated tagging with the New AutoClassifier

TopQuadrant’s TopBraid EVN Tagger has a great new feature: the AutoClassifier.

See a short, informative webinar featuring a presentation and demo of EVN’s AutoClassifier!
Access webinar recording >>

TopBraid EVN is a web-based taxonomy and ontology editor, built on the SKOS and OWL standards. EVN Tagger is an add-on that lets EVN users annotate documents and other content resources with topics chosen from EVN-managed taxonomies. The new AutoClassifier feature automates, or at least semi-automates, this process of topic annotation. It analyzes content resources and recommends topics for them, using a mix of natural language processing and machine learning techniques. By helping to add structure to large document collections, AutoClassifier enables better search and navigation.

To use auto-tagging in EVN, you will typically have a taxonomy and some manually tagged training documents. After some initial setup, you can select documents that have not been tagged yet, and receive recommended concepts from the taxonomy that are likely to be those documents’ main topics. You can review these recommendations either in a report generated by AutoClassifier or in Tagger’s regular interface, before creating a production version of these tags.

Bringing Maui to EVN

The actual magic in AutoClassifier is done using an implementation called Maui. Maui is the product of the PhD research of Alyona Medelyan at the University of Waikato in New Zealand. The university’s machine learning group is perhaps best known for the Weka framework.

Maui is usually accessed through the command line. We’ve built something called Maui Server that wraps Maui into RESTful HTTP APIs, and extended EVN Tagger so that it can communicate with Maui Server. This is all nicely integrated with Tagger’s user experience and workflows, making Maui’s abilities available to EVN users. After the initial setup in EVN, Tagger sends the taxonomy (represented in SKOS) and training documents (represented in JSON) to Maui Server. Maui analyzes the training documents and their relationship to the taxonomy. Once this has been done, Maui Server provides a web service that can generate recommendations for additional documents. Tagger uses results from that service to produce the various reports and recommendations.

The auto-tagging problem

How does Maui work its magic? The problem at hand is this: given a document and a collection of concepts arranged in a hierarchical structure, we want to determine automatically, by looking only at the text of the document, which of those concepts are most likely to be its topics.

The most straightforward approach to this problem is to train a model for each concept. The machine learning algorithm looks at all the training documents that were tagged with that concept and learns what words or phrases are typical for these documents. Now, when we analyze a new document, we only need to find the concepts whose training document collections statistically most resemble the new document.

The problem with this approach is that it requires a lot of training data. There has to be enough training data to build a statistical model of each concept in the taxonomy. If there are no documents for a given concept, that concept will never be proposed as a topic. On the other hand, this approach works well on shallow classifications with a small number of high-level categories.
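To make the per-concept approach concrete, here is a minimal sketch. It is not how any particular product does it: the word-count profiles, the cosine similarity measure, the tiny stopword list, and the sample documents are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "at", "for", "in", "is", "the", "to"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split into words, drop stopwords (stand-in for real preprocessing)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def train_concept_profiles(tagged_docs):
    """Build one word-count profile per concept from its tagged training documents."""
    profiles = {}
    for text, concepts in tagged_docs:
        for concept in concepts:
            profiles.setdefault(concept, Counter()).update(tokenize(text))
    return profiles

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(profiles, text, top_n=2):
    """Return the concepts whose training profiles most resemble the new document."""
    doc = Counter(tokenize(text))
    scored = sorted(((cosine(doc, p), c) for c, p in profiles.items()), reverse=True)
    return [c for score, c in scored[:top_n] if score > 0]

# Hypothetical training data; the taxonomy also contains "Economy",
# but no training document was tagged with it.
training = [
    ("The senate passed the election reform bill", ["Politics"]),
    ("Voters went to the polls for the election", ["Politics"]),
    ("The striker scored a late goal in the match", ["Sports"]),
]
profiles = train_concept_profiles(training)
suggested = recommend(profiles, "A close election is expected at the polls")
print(suggested)  # ['Politics']
```

Note that "Economy" has no profile at all, so it can never be suggested, no matter what the new document says. That is the training data problem described above.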

The Maui approach

Maui uses another approach that is rooted in a different NLP problem: keyphrase extraction. With Maui, the following process predicts the main topics of a document:

1. Statistically interesting phrases are extracted from the document.
2. Those phrases that might refer to a taxonomy concept are identified. This is done by comparing the phrase to the labels (including alternate and hidden labels) of concepts in the taxonomy, after stemming, stopword removal, and so on. The result of this is a set of concepts that are mentioned in the document. For a long document, this can be a large number of concepts.
3. The remaining challenge is to identify the most important among these candidate concepts, that is, the main topics. This is done by computing a number of features for each candidate, including things like tf-idf scores, position of the phrase in the text (for example, whether it occurs near the beginning), length of the phrase, depth of the concept in the taxonomy, and so on.
4. The training data is used to train a decision model that assigns weights to these different factors. This model can then predict whether any of the candidate concepts is likely to be a main topic. The result is a confidence score for each candidate. We use this score to order the candidates, and might apply a cutoff. The result is an ordered set of recommended topics.
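The candidate extraction and feature steps above can be sketched roughly as follows. This is a simplified illustration, not Maui's actual code: the crude plural-stripping stemmer, the stopword list, the taxonomy layout (a dict with "labels" and "depth"), and the choice of just three features are all assumptions. A real system would use a proper stemmer and would add tf-idf, which needs corpus-wide statistics.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "of", "the", "in", "for", "to"}

def stem(word):
    """Very crude stemmer: strips a trailing plural 's'."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def normalize(phrase):
    """Lowercase, drop stopwords, and stem, so labels and text can be compared."""
    words = re.findall(r"[a-z]+", phrase.lower())
    return " ".join(stem(w) for w in words if w not in STOPWORDS)

def build_label_index(taxonomy):
    """Map every normalized label (preferred, alternate, hidden) to its concept."""
    return {normalize(label): concept
            for concept, info in taxonomy.items()
            for label in info["labels"]}

def candidate_features(text, taxonomy, max_words=3):
    """Find concepts mentioned in the text and compute simple features for each."""
    index = build_label_index(taxonomy)
    words = re.findall(r"[a-z]+", text.lower())
    hits = defaultdict(list)  # concept -> word positions where it is mentioned
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Skip phrases that start or end with a stopword to avoid double counting.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            concept = index.get(normalize(" ".join(gram)))
            if concept:
                hits[concept].append(i)
    return {
        concept: {
            "tf": len(positions) / len(words),           # how often it is mentioned
            "first_pos": min(positions) / len(words),    # 0.0 means the very beginning
            "depth": taxonomy[concept]["depth"],         # depth in the hierarchy
        }
        for concept, positions in hits.items()
    }

# Hypothetical taxonomy fragment and document.
taxonomy = {
    "Primary elections": {"labels": ["Primary elections", "primaries"], "depth": 2},
    "Opinion polls": {"labels": ["Opinion polls", "polls"], "depth": 2},
    "Budget": {"labels": ["Budget", "budgets"], "depth": 1},
}
text = "Polls tighten ahead of the primaries. New polls show the candidates tied."
features = candidate_features(text, taxonomy)
for concept, values in features.items():
    print(concept, values)
```

Here "Opinion polls" and "Primary elections" become candidates because one of their labels appears in the text, while "Budget" does not. The feature values are what a trained model would then weigh to decide which candidates are main topics.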

This approach has the advantage that it works with modest amounts of training data, as the training data is only used to rank the candidates. The disadvantage is that it relies directly on the labels in the taxonomy, rather than on statistical text similarity, for identifying topics. In other words, it works best on rich domain-specific taxonomies that are generously equipped with synonyms and alternate labels, and that go deep enough to contain the concepts that are actually discussed in the text.
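To illustrate why ranking needs relatively little training data, here is a minimal sketch of learning weights over candidate features. Logistic regression is used purely as a simple stand-in; this post does not specify which kind of decision model Maui trains. The two features (term frequency and relative first position), the numbers, and the cutoff are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def confidence(weights, x):
    """Confidence that a candidate with feature vector x is a main topic."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + weights[-1])

def train(examples, epochs=500, lr=0.5):
    """Fit weights by stochastic gradient descent; the last weight is the bias."""
    n = len(examples[0][0])
    weights = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, y in examples:
            err = y - confidence(weights, x)
            for j in range(n):
                weights[j] += lr * err * x[j]
            weights[-1] += lr * err
    return weights

def rank(weights, candidates, cutoff=0.5):
    """Order candidates by confidence and drop those below the cutoff."""
    scored = sorted(((confidence(weights, x), c) for c, x in candidates.items()), reverse=True)
    return [(c, round(s, 2)) for s, c in scored if s >= cutoff]

# Hypothetical training examples: ([tf, first_pos], 1 if a human chose it as a topic).
# The same few examples serve every concept, since the model learns about
# features of candidates rather than about individual concepts.
examples = [
    ([0.20, 0.05], 1), ([0.15, 0.10], 1), ([0.10, 0.00], 1),
    ([0.02, 0.80], 0), ([0.03, 0.60], 0), ([0.01, 0.90], 0),
]
weights = train(examples)

# Candidates found in a new document, with their features.
candidates = {
    "Opinion polls": [0.17, 0.00],
    "Primary elections": [0.08, 0.42],
    "Budget": [0.01, 0.85],
}
print(rank(weights, candidates))
```

Because the model scores feature patterns rather than concepts, it can rank a concept that never appeared in the training documents, as long as one of its labels is found in the text.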

For example, a document about politics may never mention the term “politics”, but may mention “primaries”, “polls” and “candidates”. A taxonomy that only contains the concept “Politics”, but not sub-topics like “Primary elections”, would be too shallow to work well with the Maui approach.
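The effect of taxonomy depth is easy to see in a toy label matcher. The sketch below uses plain case-insensitive whole-word matching, which is much simpler than what Maui does, and the two taxonomies and the sentence are invented for the example.

```python
import re

def mentioned_concepts(text, taxonomy):
    """Return concepts that have at least one label appearing as whole words in the text."""
    lowered = text.lower()
    found = []
    for concept, labels in taxonomy.items():
        if any(re.search(r"\b" + re.escape(label.lower()) + r"\b", lowered) for label in labels):
            found.append(concept)
    return found

# A shallow taxonomy with only the top-level concept.
shallow = {"Politics": ["politics"]}

# A deeper taxonomy with sub-topics and alternate labels.
deep = {
    "Politics": ["politics"],
    "Primary elections": ["primary elections", "primaries"],
    "Opinion polls": ["opinion polls", "polls"],
    "Candidates": ["candidates"],
}

text = "After the primaries, polls show both candidates in a dead heat."

print(mentioned_concepts(text, shallow))  # []
print(mentioned_concepts(text, deep))     # ['Primary elections', 'Opinion polls', 'Candidates']
```

With the shallow taxonomy there are no candidates at all, so no amount of ranking can recover the topic. With the deeper one, the sub-topics give the ranking step something to work with.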

Leveraging structured data to make sense of unstructured data

Unstructured information is abundant. But the problem with unstructured information, such as large text collections, is that finding anything in there is hard.

Structured data makes finding things easier. But the problem with structured data, such as taxonomies, is that it is expensive to create.

Mining and machine learning are the critical ingredients to solve this conundrum. They leverage structured data to make sense of the unstructured. Automatic tagging, as implemented in EVN Tagger with AutoClassifier, is an example of this process in action. It allows our customers to make the most of their investment in structured data.

