Using AutoClassifier

See Running the AutoClassifier.

Tagger’s AutoClassifier feature can tag a set of documents with concepts from a specified taxonomy after you train it with an appropriate set of tagged sample documents. It stores the automatically added tags in a working copy of a tag set where you can review them before committing them to production.

This section assumes that AutoClassifier’s indexing server is installed. If a Content Tag Set’s Manage > Configure AutoClassifier option is missing, contact your EDG administrator regarding both (1) the licensing and installation of the Maui Indexer server and (2) the EDG administrative configuration of the AutoClassifier parameters for the Maui Server.

Setting up a Content Tag Set for Automated Tagging

The steps of creating a Content Tag Set for automated tagging are the same as for manual tagging: create a Content Tag Set, specify the content graph and tagging vocabulary, and then select a default tag property as described above. The default tag property is the one that AutoClassifier will use when tagging content items with vocabulary terms.

If instances in the content graph include dc:source values with URLs pointing to the actual documents, Tagger uses these URLs to turn content titles in AutoClassifier activity reports into hypertext links.
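For illustration, here is a minimal Turtle sketch of a content item whose dc:source points to the actual document; the ex: namespace, item URI, title property, and file URL are hypothetical placeholders:

@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/content/> .

# Hypothetical content item; the dc:source URL is what Tagger turns into a hyperlink in reports.
ex:doc42
    dc:title  "Quarterly maintenance report" ;
    dc:source <http://example.org/files/doc42.pdf> .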

There are a few things to keep in mind about the vocabulary that you use for automated tagging:

  • Large taxonomies that cover the domain of interest in detail are good, as they give AutoClassifier more terms to choose from.

  • Smaller, shallow taxonomies with general concepts work less well, as the concepts are less likely to directly appear as keywords in the text.

  • Taxonomies that have rich skos:altLabel and skos:hiddenLabel values will work best.

  • Taxonomies can be improved to work better with AutoClassifier by adding more specific sub-concepts and by adding alternate labels to existing concepts (see the sketch after this list). The goal is to have keywords or phrases that occur in the document corpus appear as labels in the taxonomy.

  • Some special considerations apply when using EDG-managed ontologies (as opposed to EDG-managed taxonomies) as concept vocabularies; see Preparing Ontologies and Taxonomies for use as Tagging Vocabularies with AutoClassifier below.
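For example, here is a minimal Turtle sketch (with hypothetical ex: URIs) of enriching an existing concept with alternate labels and a more specific sub-concept, so that phrases occurring in the documents appear as labels:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/taxonomy/> .

ex:MyocardialInfarction a skos:Concept ;
    skos:prefLabel   "Myocardial infarction"@en ;
    skos:altLabel    "Heart attack"@en ;     # phrase that actually occurs in the corpus
    skos:hiddenLabel "MI"@en ;               # abbreviation, hidden from display
    skos:narrower    ex:STEMI .              # more specific sub-concept

ex:STEMI a skos:Concept ;
    skos:prefLabel "ST-elevation myocardial infarction"@en ;
    skos:broader   ex:MyocardialInfarction .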

Training the AutoClassifier

After manually tagging some of the content with the configured default tag property, or importing a set of tags from a file, the next step is to have the AutoClassifier analyze this set of tags to identify patterns that it can use when tagging additional content. To configure this, first ensure (per above) that the AutoClassifier server (Maui) is both installed and configured, then select Configure AutoClassifier from the content tag set’s Manage tab.

TopBraid EDG Configure AutoClassifier Example

As shown above, the configuration screen includes descriptions of the properties that you can set.

The Content language setting offers a choice of English, French, German, or Spanish parsing of the content. This setting also affects the handling of language-tagged literals in the content graph and the concept vocabulary graph: if a language is chosen here, language-tagged literals with a non-matching language will be ignored during auto-classification. Leaving the setting at “(default)” will use the system language (as configured in Maui Server) and is not recommended for multi-lingual content graphs.

Amount of training data: As a guideline, we recommend having at least 100 tagged content resources as training data. AutoClassifier will work with lower numbers, but the quality of recommended tags will be lower. Adding more training data helps, but with diminishing returns: if 1000 tagged content resources are already available as training data, adding more may make little difference to the quality of results while still increasing the computing resources required for training. If you wish to limit the training set to a random sample, specify a Training sample size by entering the desired number of content resources.

Only taggings that use the default tag property will be used as training data.

After choosing appropriate options, click Save Changes and then the Start Training button. Training may take a while. After training is complete, the AutoClassifier is ready to recommend tags, as described in the next chapter.

Re-training: The training process can be repeated at any time. Re-training is recommended in the following situations:

  1. After the taxonomy has changed.

  2. When the nature of the content resources changes significantly, e.g., many new content resources that differ in length or in the properties they use have been added since the last training.

  3. If the amount of available training data has increased significantly since the last training, but is still below the point of diminishing returns described above.

Note

Re-training is not necessary when only a few content resources have been added, or when incremental changes have been made to content resources.

Evaluation: AutoClassifier works by using the training data to train a machine learning model. It analyzes each document in the training data together with its associated training tags. If the learning process works well, AutoClassifier is then able to predict similar tags for any other input document. AutoClassifier has an evaluation function that quantifies how well this training worked by computing precision and recall scores: precision is the fraction of recommended tags that are correct, and recall is the fraction of the correct tags that AutoClassifier actually recommends. These scores can be helpful for understanding the effect of activities such as adding more training data, adding more skos:altLabel values, or adding more concepts to the taxonomy.

Click Calculate to compute precision and recall. This calculation uses 10-fold cross-validation: the collection of training documents is split into 10 parts, 9 parts are used for training, and AutoClassifier is run with the resulting training model on the documents in the 10th part (the test set). This is repeated ten times, with a different part serving as the test set each time, and the results are averaged. This ensures that the training process never sees the documents used for calculating the scores.

Keeping training data separate from AutoClassifier results: The recommended workflow for AutoClassifier is to keep training data in one content tag set (the training tag set), and AutoClassifier results in a different content tag set (the target tag set). This ensures that manually created training tags are not mixed with lower-quality auto-generated tags. This setup also makes it possible to do training on one set of documents, and auto-classification on a different (larger) set of documents.

In this scenario, the AutoClassifier configuration and training described above would only be done within the training tag set. After this has been done, navigate to the AutoClassifier configuration screen of the target tag set, and select the training tag set from the AutoClassifier Training Model dropdown. This will enable AutoClassifier on the target tag set, using the training result from the training tag set.

It is recommended that the training tag set and the target tag set share the same concept vocabulary. If they use different concept vocabularies, only terms shared between the two vocabularies will be found by the AutoClassifier; if they share no terms, AutoClassifier will be unable to generate any recommendations.

The Manage tab also has an option to run AutoClassifier automatically whenever content changes on the Content Tag Set. Check the box as shown here:

TopBraid EDG Run AutoClassifier Automatically on Content Changes Option

Running the AutoClassifier

Once you’ve done your training, you can return to the AutoClassifier tab and click the Run AutoClassifier button. You will see a new row added to the table on that page, showing that your job is running and giving you the opportunity to cancel the job if you wish, as shown in the “Running” row of the table below:

TopBraid EDG AutoClassifier Jobs

If you refresh your browser before the job is completed, you will see the Duration value for the running job updated in the table. The next time that you refresh your browser after the job is finished, you will see that the status has changed from “Running” to “Completed,” and a “Results” link will appear in the final column. This link leads to a report similar to the following:

TopBraid EDG AutoClassifier Results

Clicking the little triangle to the left of any tag (as has been done with the “Refibrillation managed by” entry above) expands, in the second column, the list of tags proposed for assignment to those documents, along with the tags already in the production copy. (If this is the first time AutoClassifier has run on this Content Tag Set, the tags already in the production copy are likely part of the training data.)

Note

As the header of this report indicates, the new tags have not been applied to the production tag set; they are instead stored in a working copy named “AutoClassifier Results” plus a date and time stamp.

In the last two columns of the report, each proposed tag can be approved, which commits it to the production tag set, or rejected, which removes it from the working copy. Links in the lower right apply these actions to all of the generated tags.

You can also edit the tags added by the AutoClassifier in the regular Tagger interface. You can delete, add, and otherwise edit these tags as you normally would in a tag set working copy.

TopBraid EDG Regular Tagger Interface

AutoClassifier Integration Points

While Tagger provides several entry points for using AutoClassifier at different scopes, further integration with external systems that extend EDG is possible through two options: a web service and a SPARQL property function.

AutoClassifier Web Services

AutoClassifier can be controlled and invoked through a collection of web APIs. This includes an API function that recommends tags for a submitted text snippet and returns JSON results. Documentation for the AutoClassifier web services is available in the EDG Server Administration area, under Available Web Services.

Auto-Classification SPARQL Property Function

Auto-classification can also be triggered from within a SPARQL query with the property function autotagger:autoClassify, which enables further customization of its behaviour. The following example demonstrates its use:

PREFIX autotagger: <http://evn.topbraidlive.org/autotagger#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT * WHERE {
    GRAPH ?contentGraph {
        # Classify ?doc against the tag set ?cts; returns ?concept and ?probability
        ?cts autotagger:autoClassify (?doc ?concept ?probability)
    }
    GRAPH ?taxon {
        # Look up a human-readable label for each recommended concept
        OPTIONAL { ?concept skos:prefLabel ?label }
    }
}

Values must be bound (either hardcoded or elsewhere in the query) for the following variables; a bound-values sketch follows the list:

  • ?contentGraph is the content graph in which the unstructured text of the documents resides;

  • ?cts is the EDG Content Tag Set project URI, e.g. <urn:x-evn-master:DocsSet>;

  • ?doc is a document instance, e.g. <http://example.org/trec/doc/87049087>. These instances must have triples for the properties configured as content properties in the Content Tag Set’s AutoClassifier configuration;

  • ?taxon is the EDG Taxonomy project URI, e.g. <urn:x-evn-master:MyTaxonomy>.
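As an illustration only, the query below binds these variables with VALUES; the graph, tag set, and document URIs are placeholders that would need to match your own installation:

PREFIX autotagger: <http://evn.topbraidlive.org/autotagger#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?doc ?concept ?probability ?label WHERE {
    # Placeholder URIs for the content graph, Content Tag Set, document, and taxonomy
    VALUES (?contentGraph ?cts ?doc ?taxon) {
        ( <urn:x-evn-master:DocsGraph>
          <urn:x-evn-master:DocsSet>
          <http://example.org/trec/doc/87049087>
          <urn:x-evn-master:MyTaxonomy> )
    }
    GRAPH ?contentGraph {
        ?cts autotagger:autoClassify (?doc ?concept ?probability)
    }
    GRAPH ?taxon {
        OPTIONAL { ?concept skos:prefLabel ?label }
    }
}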

Preparing Ontologies and Taxonomies for use as Tagging Vocabularies with AutoClassifier

AutoClassifier automatically assigns tags to documents. The tags are chosen from the pool of concepts in a tagging vocabulary. The tagging vocabulary can be a taxonomy expressed in SKOS; in this case, instances of skos:Concept form the pool of possible concepts. A taxonomy can also be based on a model that extends SKOS with subclasses of skos:Concept. Alternatively, the tagging vocabulary can be an ontology. In such cases, the modeler has some control over the pool of resources that are used as tags.

The rules used by AutoClassifier to determine the pool of resources to be used as tags are as follows:

  1. If a tagging vocabulary contains any instances of skos:Concept or any of its subclasses, then these concept instances become the pool of potential tags.

  2. If the ontology’s root class is owl:Thing (this is the default), then the classes defined in the ontology will become the pool of potential tags. Otherwise, any subclasses of the root class will become the pool of potential tags.

You may have a candidate tagging vocabulary that uses neither direct instances of skos:Concept nor a class hierarchy to capture the concepts to be used as tags by AutoClassifier. Instead, it may contain instances of some other class, such as Topic or Location. In this case, you should make this other class a subclass of skos:Concept, as sketched below. This tells AutoClassifier to include the instances of that class in the pool of tags.
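For example, a minimal Turtle sketch (using a hypothetical ex:Topic class) of such a subclass declaration:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/ontology/> .

# Making Topic a subclass of skos:Concept adds its instances to AutoClassifier's pool of tags.
ex:Topic rdfs:subClassOf skos:Concept .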

AutoClassifier makes use of preferred and alternate labels as well as concept relationships in the tagging vocabulary. It will consider the following properties:

  • Preferred label: skos:prefLabel and its sub-properties or, if that property is not defined, rdfs:label and its sub-properties

  • Alternate label: skos:altLabel and skos:hiddenLabel

  • Hierarchy: skos:broader, skos:narrower, rdfs:subClassOf

  • General relationships: skos:related, skos:hasTopConcept, owl:sameAs, owl:equivalentClass

Other properties in the ontology can be added to these lists of considered properties by declaring them as sub-properties of skos:prefLabel, skos:altLabel, skos:broader, or skos:related, as sketched below. For more information, see Working with Ontologies.
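For example, a minimal Turtle sketch with hypothetical ex: properties declared as sub-properties of the SKOS properties that AutoClassifier considers:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/ontology/> .

# Treat ex:synonym values as alternate labels and ex:parentTopic as a broader relationship.
ex:synonym     rdfs:subPropertyOf skos:altLabel .
ex:parentTopic rdfs:subPropertyOf skos:broader .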

Additionally, follow the advice provided in the Setting up a Content Tag Set for Automated Tagging section of this document.

Providing Training Tag Set Data

To train AutoClassifier, you will need a set of sample tags. You can create them by using the Content Tag Set editor and manually adding tags. Alternatively, you may already have some tags established for a curated body of content. If so, you can prepare a file for import.

You can import and export tag set RDF data in standard serialization formats. When you use Tagger to tag content with a concept, the tag is stored in the tag set as actual triples, which are statements expressed using the W3C standard RDF. RDF uses URIs to represent resources such as content resources, tag properties, and concepts.

For example, if the URI associated with the news story “‘Gangnam Style’ becomes most watched YouTube video ever” is http://en.wikinews.org/w/index.php?&oldid=1711859, and you use Tagger to tag it as having a Dublin Core subject of “dance” from the IPTC set of news codes, the triple created by Tagger is:

<http://en.wikinews.org/w/index.php?&oldid=1711859>
  <http://purl.org/dc/elements/1.1/subject>
  <http://cv.iptc.org/newscodes/subjectcode/01006000> .

The RDF file that you prepare for importing tags should follow this format, using any standard RDF serialization such as Turtle or N-Triples.
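For instance, a minimal Turtle sketch of such an import file containing two tags; the second document URI and its concept URI are hypothetical placeholders:

@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://en.wikinews.org/w/index.php?&oldid=1711859>
    dc:subject <http://cv.iptc.org/newscodes/subjectcode/01006000> .

# Hypothetical additional document and concept, for illustration only
<http://example.org/docs/doc2>
    dc:subject <http://example.org/taxonomy/music> .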