The first webinar in our Deep Dives into TopBraid EVN series featured the recently released AutoClassifier module. We demonstrated how AutoClassifier uses machine learning to automatically tag content with terms from your controlled vocabularies. Automating tag assignment lets you apply your vocabulary assets to far more content, which increases the value they add.
The webinar was well attended, with people from industries as varied as financial services, healthcare, publishing, and manufacturing. A 38-minute recording is now available.
During the webinar, attendees submitted several interesting questions. We could only partially answer a few of them at the time, so we’d like to address a few more here.
Are training sets necessary and why? How large should they be?
In the words of Pedro Domingos, author of the best-selling book “The Master Algorithm,” “The fundamental goal of machine learning is to generalize beyond the examples in the training set.” Most machine learning algorithms look for patterns in a set of data known as the training set and then try to reproduce those patterns in a new set of data, called the test or target set. To train EVN’s AutoClassifier, you point it to some documents that are already tagged so that it can analyze which terms from a given vocabulary were used to tag which content in that set. Then, based on what it “learned,” it can assign terms from the same vocabulary to similar new documents, and you’re off and running with auto-classification.
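To make the train-then-classify idea concrete, here is a deliberately tiny sketch in Python. It is not AutoClassifier's algorithm, which this post does not describe; the functions `train` and `suggest_tags` and the `min_overlap` threshold are hypothetical names invented for illustration. The sketch simply records which words appeared alongside each tag in the training set and suggests a tag for a new document when enough of those words reappear.

```python
from collections import Counter, defaultdict


def train(tagged_docs):
    """Count, per tag, the words seen in training documents carrying that tag.

    tagged_docs is a list of (text, tags) pairs. Illustrative only; this is
    not how AutoClassifier itself learns.
    """
    word_counts = defaultdict(Counter)
    for text, tags in tagged_docs:
        words = set(text.lower().split())
        for tag in tags:
            word_counts[tag].update(words)
    return word_counts


def suggest_tags(model, text, min_overlap=2):
    """Suggest tags whose training words overlap the new document enough."""
    words = set(text.lower().split())
    scores = {
        tag: sum(1 for word in words if word in counts)
        for tag, counts in model.items()
    }
    return sorted(tag for tag, score in scores.items() if score >= min_overlap)


training_set = [
    ("interest rates rise at the central bank", ["Banking"]),
    ("new vaccine trial shows promise", ["Healthcare"]),
]
model = train(training_set)
suggestions = suggest_tags(model, "central bank holds interest rates")
```

With this toy training set, the new document shares four words with the Banking example and none with the Healthcare one, so only Banking is suggested.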
The size of the training set is less important than its quality. The documents should represent your content well, and the tagging vocabulary needs to be representative of the document content. If you tag a document with a term, but neither the term name in any form nor any of its synonyms ever appears in the document, the AutoClassifier has no input to learn from.
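One way to spot such tags before training is a simple coverage check. The sketch below is an illustration only, not part of EVN: `label_appears` and `unsupported_tags` are hypothetical helpers, and the vocabulary is assumed to be available as a plain mapping from each tag to its labels (name plus synonyms). It matches exact words and phrases, so unlike a real classifier it does not recognize inflected forms of a term.

```python
import re


def label_appears(text, labels):
    """Return True if any label occurs in the text as a whole word or phrase."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(label.lower()) + r"\b", lowered)
        for label in labels
    )


def unsupported_tags(text, tags):
    """List the tags for which no label occurs anywhere in the text.

    tags maps each tag name to its labels (preferred name and synonyms).
    """
    return [name for name, labels in tags.items() if not label_appears(text, labels)]


document = "The central bank raised interest rates."
assigned = {
    "Monetary policy": ["monetary policy", "interest rates"],
    "Insurance": ["insurance", "underwriting"],
}
missing = unsupported_tags(document, assigned)
```

Here the Insurance tag would be flagged, because neither of its labels occurs in the document, while Monetary policy is supported through its synonym "interest rates".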
How do you create training sets?
You can create training set documents by tagging them with EVN’s manual Tagger. Some content may also already have keywords assigned within the CMS or other system where it is stored. A training set can contain as few as 5 or 10 documents, but it will work better with more. One approach is to create a small training set manually and then use it to auto-classify a somewhat larger set. Look through the results and make corrections as needed by approving or rejecting the suggested tags and adding any you think are missing. You can then use that corrected set as the training set for your full corpus, repeating the process if needed. We recommend limiting the training set to a maximum of 1,000 documents; larger training sets consume more computing resources without significantly improving quality.
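The iterative approach can be sketched as a short loop. This is an outline under stated assumptions rather than anything EVN provides: `grow_training_set` is a hypothetical function, `classify` stands in for the auto-classifier, and `review` stands in for a person approving, rejecting, and adding tags. Keeping the hand-tagged seed documents in the growing set is also an assumption of this sketch.

```python
def grow_training_set(seed, rounds, classify, review, max_docs=1000):
    """Grow a training set round by round, stopping at max_docs documents.

    seed     -- list of (doc, tags) pairs tagged by hand
    rounds   -- iterable of lists of untagged docs, one list per round
    classify -- callable(training, doc) returning suggested tags
                (stand-in for the auto-classifier)
    review   -- callable(doc, suggested) returning corrected tags
                (stand-in for a human reviewer)
    """
    training = list(seed)
    for docs in rounds:
        # Each round is classified with the training set as it stood
        # when the round began.
        snapshot = list(training)
        for doc in docs:
            if len(training) >= max_docs:
                return training
            suggested = classify(snapshot, doc)
            training.append((doc, review(doc, suggested)))
    return training


# Stub stand-ins so the loop can be exercised without a real classifier.
def stub_classify(training, doc):
    return ["A"]


def stub_review(doc, suggested):
    return suggested + ["B"] if doc == "d3" else suggested


seed = [("d1", ["A"])]
result = grow_training_set(seed, [["d2", "d3"]], stub_classify, stub_review)
```

The default cap of 1,000 documents mirrors the recommendation above.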
Should the documents in the training set be the same as the documents in the content being auto-classified?
The current release of AutoClassifier makes it particularly easy if the training documents are a subset of the target documents. This is not strictly required, because the contents of the corpus can be switched after training. The next release (5.1, due by year end) will make this kind of switch easier.
That being said, to get quality results, the general characteristics of the target documents (for example, the length, tone, and metadata fields) should be similar to those in the training documents. A content tag set trained on news articles wouldn’t work well if applied to reader comments.
Can AutoClassifier support different languages?
The algorithm used requires a language-specific stemmer and stopword list. AutoClassifier currently includes these for English, French, German and Spanish, and more can be added. The rest of the algorithm is language-agnostic. The quality of results should therefore be fairly independent of the language.
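To show where the language-specific pieces fit, here is a minimal preprocessing sketch. The stopword lists and suffix rules are tiny illustrative samples chosen for this example, not the resources AutoClassifier ships with, and the suffix stripper is far cruder than a real stemmer. The names `STOPWORDS`, `SUFFIXES`, `stem`, and `normalize` are hypothetical.

```python
import re

# Tiny illustrative samples; real stopword lists are much longer.
STOPWORDS = {
    "en": {"the", "a", "of", "and", "is", "in"},
    "fr": {"le", "la", "les", "de", "et", "est"},
}

# Naive suffix rules, longest first; a real stemmer is far more careful.
SUFFIXES = {
    "en": ("ing", "ed", "es", "s"),
    "fr": ("ements", "ement", "es", "s"),
}


def stem(token, lang):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES[lang]:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text, lang):
    """Lowercase, tokenize, drop stopwords, and stem what remains."""
    tokens = re.findall(r"\w+", text.lower())
    return [stem(tok, lang) for tok in tokens if tok not in STOPWORDS[lang]]


english_tokens = normalize("Lending in the banks", "en")
```

Only the two lookup tables depend on the language. Everything downstream of `normalize` would see plain token lists, which is the sense in which the rest of a pipeline can stay language-agnostic.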
Is there an auto-classification web service available?
Yes, an installation of EVN can be configured to expose an auto-classification web service. Tags assigned by the AutoClassifier can also be queried via web services. You can use the built-in SPARQL endpoint, or configure your own web services to return the data you need in the formats that work best for the systems consuming it.
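As a rough illustration of querying tags over SPARQL, the sketch below builds (but does not send) a standard SPARQL Protocol GET request. The endpoint URL, the document IRI, and the property that links a document to its tags are all assumptions made up for this example; check your own EVN installation for the actual endpoint address and the property your content tag set uses.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values; substitute those of your own installation.
ENDPOINT = "https://evn.example.org/sparql"
TAG_PROPERTY = "http://purl.org/dc/terms/subject"  # assumed tagging property


def build_tag_query(document_iri):
    """Build a SPARQL SELECT query for the tags attached to one document."""
    return f"SELECT ?tag WHERE {{ <{document_iri}> <{TAG_PROPERTY}> ?tag }}"


def build_request(document_iri):
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = ENDPOINT + "?" + urlencode({"query": build_tag_query(document_iri)})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request("http://example.org/doc/1")
```

Passing the resulting request to `urllib.request.urlopen` would execute the query against a live endpoint; authentication, if your server requires it, is left out of this sketch.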
If you have more questions, or would like to arrange a demo of TopBraid EVN Tagger AutoClassifier, let us know at evn_demo@topquadrant.com.