Working with Corpora

Licensing and Enablement

The availability of any asset collection is determined by what is (a) licensed and (b) configured under Server Administration. To install a license or view the currently licensed features, see the Administrator Guide section on Product Registration. To configure which licensed collection types are enabled or disabled, see the Administrator Guide section on EDG Configuration Parameters.

For general licensing information and available asset collections and packages, see the TopQuadrant website.

TopQuadrant Data Governance Packages

Overview of Corpora

A Corpus is a collection of textual assets such as documents, excerpts, and web pages. The original items are typically imported from external sources, such as content management systems or web sites, and are typically neither created nor edited within EDG. The textual content of Corpus assets provides the foundation for manual or automated tagging and annotation using Content Tag Sets. Corpora serve as the content graphs for Content Tag Sets.

Selecting the Corpora link in the left-navigation pane of TopBraid EDG lists all of the Corpus collections currently available to the user and allows authorized users to create new ones.

When working with Corpus collections, users have access to the same functionality as for other asset collections, e.g., search, import, and export. If a Corpus is created without a connector to a content repository, users can use the EDG editor to create and edit documents in the Corpus. Otherwise, the documents come from an external repository and users cannot modify them.

See also

See Working with Asset Collections for the general features of asset collections, such as import/export, editing, user permissions, reports, and settings.

Create New Corpus

When a new Corpus is created, EDG requires the user to select its data source, as Corpora can be configured to connect to an external source. EDG then harvests content from that source and stores it in the project graph. A Corpus does not synchronize automatically with its external source: harvesting must be triggered manually in the Corpora management UI. It can be repeated on demand, and each run picks up changes to the source’s documents.

Seven connector types are currently offered. Each has its own creation wizard page, whose form depends on the parameters that must be specified for the connector to operate. These parameters can be adjusted later under Manage Tab -> Corpus-Connector-Type Configuration.

  • No connector If content documents are already available as RDF, they can be imported into the Corpus with the usual RDF import function. Similarly, raw documents can be imported one at a time from local files, as described in Import Single Document. No external source is configured with this connector type. Users can also use the editor app to create new documents.

  • sitemap.xml If a website supports the sitemaps protocol, a configured sitemap.xml connector will harvest its content accordingly.

  • URL list This connector will simply fetch content from all of the URLs listed in its configuration.

Note

Sites must not block crawlers. URLs on sites that block crawlers will be skipped.

  • CMIS If a website is an interface to a Content Management System and offers a Content Management Interoperability Services (CMIS) service endpoint as defined by the standard, a configured CMIS connector will harvest its content accordingly.

  • Amazon S3 Creates a new Corpus and imports documents from one or more S3 buckets. (Make sure you have pre-configured the S3 integration under External System Integration Management.)

  • SharePoint Creates a new Corpus and imports documents from SharePoint Document Libraries. The configuration page displays a tree reflecting all sites and document libraries defined in your SharePoint server. Check the document libraries you would like to include in this Corpus. (Make sure you have configured your Microsoft 365 tenant under Microsoft 365 Authentication.)

  • Local directory Creates a new Corpus from a directory of files in the local file system. Can only be created by system administrators.

Depending on the connector type, you will be asked to provide different settings as required to connect to the source.
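To make the sitemap.xml and URL list options more concrete, the following is a minimal Python sketch of what harvesting from a sitemap conceptually involves: reading the URLs a sitemap lists and dropping those that a site's robots.txt disallows. It is not EDG's implementation. The function names, the sample data, and the `example-harvester` user agent are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> list:
    """Return the <loc> URLs listed in a sitemap document (hypothetical helper)."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def allowed_urls(urls, robots_txt: str, user_agent: str = "example-harvester") -> list:
    """Keep only the URLs that the given robots.txt rules allow (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Sample data, invented for this sketch.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/docs/a.html</loc></url>
  <url><loc>https://example.org/private/b.html</loc></url>
</urlset>"""

SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    urls = sitemap_urls(SAMPLE_SITEMAP)
    print(allowed_urls(urls, SAMPLE_ROBOTS))
```

A check like this can help predict, before configuring a connector, which URLs are likely to be skipped because the site blocks crawlers.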

Using the Manage tab of a Corpus, you can later modify the connector parameters specified during creation. This matters because external data sources may sit on remote networks outside the creator’s control, and the connector should reflect any changes to them. The connector-specific configuration panel is not available for Corpora configured with No Connector. While a connector’s parameters can be edited after creation, the type of data source (one of the seven options above) cannot be changed.

If you will be using this Corpus for Auto Classification, you will need to keep the box checked for Store copy of all documents in EDG. The documents are stored in a “Corpora” folder in the workspace for EDG. Be sure your server has enough disk space for this storage.

Once you have finished configuring your new Corpus, it will appear on the page listing all Corpora. This is the page displayed when you click on the Corpora link in the blue left-navigation pane in EDG. Clicking on an individual Corpus will display its content in the Corpus editor. The first tab for Corpus (the tab that gives you access to the editor) is called Documents.

Importing for a Corpus

Import Single Document is available under the Imports tab only for Corpora collections. It supports manually importing an external file into the corpus, rather than going through a connector or importing an existing RDF representation of the corpus.

After selecting this option, use the Browse… button to select a source file. Its text and metadata are parsed by the Apache Tika content analysis toolkit, which determines the supported file formats. The Show Imported data button on the next screen lets you review the retrieved information. Most supported file formats will present three sections:

  1. common Metadata Properties such as file name, media type, title, creator;

  2. Content, which is the actual document’s text (where applicable);

  3. Other Properties, which the importer was unable to label and which are therefore referred to by their URIs.
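The three sections above can be pictured as a simple partition of the properties extracted from a file. The Python sketch below is illustrative only and does not reflect how EDG or Apache Tika implement the import. The key names in `KNOWN_LABELS`, the `content` key, and the sample property URI are assumptions.

```python
# Hypothetical mapping from raw property keys to human-readable labels.
KNOWN_LABELS = {
    "resourceName": "file name",
    "Content-Type": "media type",
    "dc:title": "title",
    "dc:creator": "creator",
}

# Hypothetical key under which the extracted text is stored.
CONTENT_KEY = "content"


def split_sections(extracted: dict) -> dict:
    """Partition extracted properties into the three sections shown after import."""
    metadata = {}
    other = {}
    for key, value in extracted.items():
        if key == CONTENT_KEY:
            continue  # the document text gets its own section
        if key in KNOWN_LABELS:
            metadata[KNOWN_LABELS[key]] = value  # shown with a readable label
        else:
            other[key] = value  # no label known, so the raw key (a URI) is kept
    return {
        "Metadata Properties": metadata,
        "Content": extracted.get(CONTENT_KEY, ""),
        "Other Properties": other,
    }


if __name__ == "__main__":
    sample = {
        "resourceName": "report.pdf",
        "Content-Type": "application/pdf",
        "content": "Quarterly results ...",
        "http://example.org/prop/pageCount": "12",
    }
    for section, value in split_sections(sample).items():
        print(section, "->", value)
```

For formats without extractable text, the Content section in this sketch is simply an empty string, which mirrors the "where applicable" qualification above.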

Corpus Manage Tab

In addition to the settings available for other collection types, a Corpus collection has a few more options:

TopBraid EDG Corpus Collection Additional Options

Note

These are not available for Corpora configured with No Connector or local directory.

To pick up new or changed files for a Local directory Corpus after its initial creation, clear the Corpus on the Manage tab.

Corpus contents Report

In addition to the normal reports for an asset collection, for any Corpus with a connected data source (production or workflow), the Reports tab > Corpus contents action lists all documents that were either manually imported or retrieved from a remote location with a connector.

TopBraid EDG External Content Table

Each row in the table represents a single document and shows the URL of the original document (whether a web page or a downloadable file), its media type, the date it was last downloaded from its remote location to the EDG cache, and a page icon that links to a download of the cached copy.

Note

This report is not available for Corpora configured with No Connector. The Corpus editor can be used instead.