Content Classification with TopBraid EDG

Licensing and Enablement

The availability of any asset collection is determined by what is (a) licensed and (b) configured under Server Administration. To install a license or to view the currently licensed features, see Product Registration. To configure which licensed collection types are currently enabled or disabled, see EDG Configuration Parameters. For general licensing information and available asset collections and packages, see TopQuadrant.com.

TopQuadrant Data Governance Packages

Introduction

This document describes TopBraid Tagger and AutoClassifier, an add-on TopBraid EDG module designed to help you analyze your information assets (web pages and documents in many formats) to identify the context and meaning of each asset. This module can catalog your content, automatically extract available metadata, and help you add new metadata.

TopBraid Tagger and AutoClassifier uses state-of-the-art AI to automatically enrich content by assigning to it the most relevant tags from your controlled vocabularies (Taxonomies and Ontologies). Auto-classifying your content resources by connecting them to relevant vocabulary terms improves search and navigation and automates lifecycle management of content.

In addition to automatically assigning relevant terms to content, users can also manually tag or annotate content resources. In both cases, the result of tagging is a set of connections between the content and a vocabulary. For example, a resource representing a news story can be linked to vocabulary terms of Election and Weather through a property named “has subject”, stating that a given news story has topics of Election and Weather.
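In RDF terms, each tag is a single triple made up of a content resource, a tag property, and a vocabulary term. The following minimal sketch, in Python with plain tuples, illustrates the news story example. All URIs here (the example.org namespaces and the hasSubject property name) are hypothetical placeholders, not identifiers defined by EDG.

```python
# Hypothetical namespaces, for illustration only; EDG does not define these.
CONTENT = "http://example.org/content/"
VOCAB = "http://example.org/vocabulary/"
HAS_SUBJECT = "http://example.org/schema/hasSubject"  # stands in for "has subject"


def tag(resource, prop, term):
    """Return one tag as an RDF-style (subject, predicate, object) triple."""
    return (resource, prop, term)


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


story = CONTENT + "news-story-1"
tags = [
    tag(story, HAS_SUBJECT, VOCAB + "Election"),
    tag(story, HAS_SUBJECT, VOCAB + "Weather"),
]

print(to_ntriples(tags))
```

Each printed line is one tag. A Content Tag Set can be thought of as a collection of such triples kept separately from both the content graph and the vocabulary.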

These relationships, called tags, can be used to enrich search, browsing, and other applications by managing metadata on concept-to-vocabulary relationships. The role of Tagger is to make it easy to manage and create these relationships. Asset collections that store these relationships are called Content Tag Sets. This guide focuses on working with Content Tag Sets. Corpora are the second type of asset collection enabled when you add the TopBraid Tagger and AutoClassifier module.

Create New Content Tag Set

When a new Content Tag Set is created, EDG requires the user to select a Content Graph and a Tagging Vocabulary.

The drop-down for the selection of the content graph will show all corpora, data graphs, taxonomies and ontologies. Typically, you would select a Corpus.

Because taxonomies and ontologies appear in this drop-down, you could use a Content Tag Set to connect two taxonomies or two ontologies. However, Crosswalks are a better option for this. The ability to select a taxonomy or an ontology remains available for compatibility with versions that did not offer Crosswalks. This may change in future versions.

Additionally, the role of a content graph may be played by a file in EDG’s workspace. EDG Administrators can identify files that are to be used as content graphs. They would then also show up in the drop-down selection for a content graph.

The tagging vocabulary is either a taxonomy or an ontology. The drop-down selection will show the available ontologies and taxonomies.

The user will also be asked to:

Select a Default Tag Property – a property that will, by default, be used to connect assets in the content graph with the vocabulary terms. Auto-tagging will always use the default property. When tagging manually, users can select among other properties. After creation, the Manage tab will let you select other properties that may be used for tagging.
Optionally, select a Root Class for the content types tree – this is similar to selecting a Main Entity for other asset collections. By default, if your content graph is a Corpus, the root will be Document.

See Working with Asset Collections for the general features of asset collections, such as import/export, user permissions, reports and settings. This page covers only the information specific to Content Tag Sets.

Managing Tag Sets

The Manage tab for Content Tag Sets contains an additional feature:

Select Tag Properties lets you choose, by checking or unchecking checkboxes, which properties from the property graphs are available in the Tagger property drop-down list.

The Export tab for Content Tag Sets contains this additional feature:

Normalized Concepts (Troubleshooting) generates a normalized version of the Tagging Vocabulary used in this Content Tag Set, as it would be seen by AutoClassifier, and returns it in Turtle format. This can be used to check, for example, whether the desired language-specific labels are selected over those in other languages when dealing with multilingual vocabularies.

Tagging Documents

To manually tag documents, click the Taggings tab at the top of the page. This takes you to the editor application. This guide describes the new editor released as part of TopBraid EDG 6.3. For the “old” editors, refer to doc.topquadrant.com.

The Tagging editor offers much the same searching, panel and layout features as the other collection editors.

Configuring Content Graphs

Typically, you will use a Corpus as a content graph and will either use a connector or enter, upload, or import documents. Corpora asset collections in EDG are based on a model called TopBraid Simple Corpus and Document Schema.

If you need to configure this model, you can create an ontology and, using Settings > Includes, include TopBraid Simple Corpus and Document Schema. You can then extend it if needed. If you plan to use a Data Graph as your content graph, you need to create an ontology for it. This ontology may be, but does not have to be, an extension of the TopBraid Simple Corpus and Document Schema.

If you are creating a file that will be used as a content graph (e.g., a file obtained from some third party), there are a few notes to keep in mind:

For the Content Types hierarchy to display properly in a Content Tag Set, the root classes must be subclasses of rdfs:Resource.
The Content Tag Set editor displays titles of resources in the content graph using the rdfs:label property, or any subproperty of rdfs:label. If the title (label) uses a property other than rdfs:label, such as dc:title, define dc:title as a subproperty of rdfs:label. The title will then display properly. Additional properties about each content resource will be displayed on the form when the resource is selected. The form (and AutoClassifier result reports) will display any dc:source properties of those graphs as hypertext links to the URLs provided as values, so these are useful for providing easy access to such a resource for the users tagging them.
The documents should all be typed with some class (or one of its subclasses), and that class should then later be used as the Root Content Type when creating a Content Tag Set.
All properties that occur on any document resource (for example, date or author) should have an rdfs:label, or a subproperty of rdfs:label, so that they can be displayed better. This can be achieved by importing an ontology that defines such labels into the content graph.
Content graphs and tag property graphs should have rdfs:labels for the graph URI so that they are displayed nicely in the drop-downs when creating new Content Tag Sets.
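Two of the notes above (titles resolvable through rdfs:label or a subproperty, and labels on every property used on a document) can be checked mechanically before a file is registered as a content graph. The following is a minimal sketch in plain Python, assuming the graph has already been parsed into (subject, predicate, object) tuples with prefixed names. The ex: names, the sample data, and the check_content_graph function are hypothetical and are not part of EDG.

```python
RDF_TYPE = "rdf:type"
RDFS_LABEL = "rdfs:label"
RDFS_SUBPROP = "rdfs:subPropertyOf"
RDFS_SUBCLASS = "rdfs:subClassOf"


def closure(triples, predicate, root):
    """Return root plus everything linked to it, directly or transitively, via predicate."""
    found = {root}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p == predicate and o in found and s not in found:
                found.add(s)
                changed = True
    return found


def check_content_graph(triples, root_class):
    """Return a list of problems; an empty list means none were detected."""
    label_props = closure(triples, RDFS_SUBPROP, RDFS_LABEL)
    doc_classes = closure(triples, RDFS_SUBCLASS, root_class)
    labeled = {s for s, p, o in triples if p in label_props}
    documents = {s for s, p, o in triples if p == RDF_TYPE and o in doc_classes}

    problems = []
    # Every document needs a title via rdfs:label or one of its subproperties.
    for doc in sorted(documents):
        if doc not in labeled:
            problems.append(f"{doc}: no title via rdfs:label or a subproperty")
    # Every property used on a document should itself carry a label.
    used = {p for s, p, o in triples if s in documents} - {RDF_TYPE, RDFS_LABEL}
    for prop in sorted(used):
        if prop not in labeled:
            problems.append(f"{prop}: property has no label")
    return problems


# Hypothetical sample content graph.
graph = [
    ("dc:title", RDFS_SUBPROP, RDFS_LABEL),
    ("dc:title", RDFS_LABEL, "title"),
    ("ex:Report", RDFS_SUBCLASS, "ex:Document"),
    ("ex:doc1", RDF_TYPE, "ex:Report"),
    ("ex:doc1", "dc:title", "Quarterly report"),
    ("ex:doc1", "ex:author", "A. Writer"),
    ("ex:doc2", RDF_TYPE, "ex:Document"),
]

problems = check_content_graph(graph, "ex:Document")
for problem in problems:
    print(problem)
```

With this sample data, the check reports that ex:doc2 has no title and that ex:author has no label. The sketch does not detect documents that lack a type altogether, because nothing in the triples marks an untyped resource as a document.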

Meeting these conditions may require slight customization of graphs obtained from third parties. This is usually easiest to achieve by creating a new graph, importing the third-party graph into it, adding the customizations to the new graph, and then selecting the new graph on this configuration screen.

For content graphs that will be used by Tagger’s AutoClassifier feature, also keep in mind that the actual text content of the documents should be kept in the content graph, in a property of the document resource such as “fullText” or “content”. The text in this property, possibly along with other text-containing properties such as title or abstract, will be analyzed by the AutoClassifier, both in training and when generating tag recommendations for documents.

If you are planning to auto-create files, the TopBraid platform used to develop EDG provides tools such as SPARQLMotion for automating conversion of the documents. However, this can also be done using a programming language of the user’s choice. Typically, a conversion script runs on a scheduled basis (for example, nightly) or on demand. After the initial run, a conversion script can process only newly added or changed documents. Once the documents are tagged, information about tags can be provided to the content management and enterprise search systems through export or, in real time, through web services and queries.
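One way a conversion script can limit later runs to newly added or changed documents is to record a content hash per document and compare it on the next run. The following is a minimal sketch under stated assumptions: source documents are .txt files in a single directory, the state is kept in a JSON file, and convert is a caller-supplied function that produces whatever content graph representation you use. None of these names come from EDG or SPARQLMotion.

```python
import hashlib
import json
from pathlib import Path


def content_hash(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_changed(doc_dir, state_file):
    """Return (changed_paths, new_state) by comparing hashes with the saved state."""
    state_path = Path(state_file)
    old_state = json.loads(state_path.read_text()) if state_path.exists() else {}
    new_state = {}
    changed = []
    for path in sorted(Path(doc_dir).glob("*.txt")):
        digest = content_hash(path)
        new_state[path.name] = digest
        # A document is processed if it is new or its content differs.
        if old_state.get(path.name) != digest:
            changed.append(path)
    return changed, new_state


def run_conversion(doc_dir, state_file, convert):
    """Convert new or changed documents, then save the updated state.

    `convert` is a hypothetical callable taking one Path. If it raises,
    the state file is left untouched, so the next run retries.
    """
    changed, new_state = find_changed(doc_dir, state_file)
    for path in changed:
        convert(path)
    Path(state_file).write_text(json.dumps(new_state, indent=2))
    return [path.name for path in changed]
```

The first run converts every document because no state file exists yet. A scheduler such as cron could call run_conversion nightly, and the same function can be invoked on demand.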