Create New Corpus

When a new Corpus is created, EDG requires the user to select its data source, because a Corpus can be configured to connect to an external source. EDG then harvests content from that external source and stores it in the project graph. Harvesting can be repeated later on demand, at which point changes to the external source’s documents are picked up. A Corpus does not synchronize automatically with its external source: harvesting must be triggered manually in the Corpora management UI.

Seven types of connectors to such sources are currently offered. The creation wizard page for each type shows a different form, depending on the parameters that must be specified for the connector to operate. These parameters can be adjusted later on by accessing the Manage Tab -> Corpus-Connector-Type Configuration.

  • No connector If content documents are already available as RDF, they can be imported into the Corpus with the usual RDF import function. Similarly, raw documents can be imported one at a time from local files, as described in Import Single Document. No external source is configured with this connector type. Users can also use the editor app to create new documents.

  • sitemap.xml If a website supports the sitemaps protocol, a configured sitemap.xml connector will harvest its content accordingly.

  • URL list This connector will simply fetch content from all of the URLs listed in its configuration.

Note

The site must not block crawlers; otherwise its pages will be skipped.

  • CMIS If a website is an interface to a Content Management System and offers a Content Management Interoperability Services (CMIS) service endpoint as defined by the standard, a configured CMIS connector will harvest its content accordingly.

  • Amazon S3 Creates a new Corpus and imports documents from one or more S3 buckets. (Make sure you have pre-configured the S3 integration as described in External Systems Integration.)

  • SharePoint Creates a new Corpus and imports documents from SharePoint Document Libraries. The configuration page will display a tree reflecting all sites and document libraries defined in your SharePoint server. Check the document libraries you would like to include in this Corpus. (Make sure you have configured your Microsoft 365 tenant as described in Microsoft 365 Authentication Parameters.)

  • Local directory Creates a new Corpus from a directory of files in the local file system. This connector type can only be created by system administrators.
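To make the sitemap.xml connector and the Note about blocked crawlers more concrete, the following is a minimal Python sketch of what a sitemap-based harvest conceptually has to do: read the page URLs from a sitemap and drop those that the site's robots.txt disallows. It is an illustration only and does not show how EDG implements harvesting. The function names, the `example-harvester` user agent, and the sample sitemap and robots.txt content are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample data, standing in for files fetched from a website.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/docs/intro.html</loc></url>
  <url><loc>https://www.example.com/private/draft.html</loc></url>
</urlset>"""

EXAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return the page URLs listed in the <loc> elements of a sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def allowed_urls(urls: list[str], robots_txt: str,
                 user_agent: str = "example-harvester") -> list[str]:
    """Keep only the URLs that robots.txt permits for the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    urls = sitemap_urls(EXAMPLE_SITEMAP)
    for url in allowed_urls(urls, EXAMPLE_ROBOTS):
        print("would harvest:", url)
```

Running a check like this against a site before configuring a sitemap.xml or URL list connector can help explain why some expected pages are missing from a Corpus after harvesting.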

Depending on the connector type, you will be asked to provide different settings as required to connect to the source.

On the Manage tab of a Corpus, you can later modify the connector parameters specified during the creation process. This matters because external data sources can be on remote networks not necessarily under the creator’s control, and the connector should reflect any changes there. The connector-specific configuration panel is not available for Corpora configured with No connector. While the connector’s parameters can be edited after creation, the type of data source (one of the seven options above) cannot be changed.

If you will be using this Corpus for Auto Classification, keep the Store copy of all documents in EDG box checked. The documents are stored in a “Corpora” folder in the EDG workspace. Make sure your server has enough disk space for this storage.
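As a rough way to check the disk space requirement before a large harvest, an administrator could run something like the Python sketch below on the server. It is not part of EDG. The function name, the safety factor, and the example path and size are assumptions made for illustration; the actual workspace location depends on your installation.

```python
import shutil


def has_room_for_copies(storage_path: str, expected_bytes: int,
                        safety_factor: float = 1.5) -> bool:
    """Return True if the volume holding storage_path has enough free space.

    expected_bytes is your own estimate of the total size of the source
    documents; safety_factor adds headroom on top of that estimate.
    """
    free_bytes = shutil.disk_usage(storage_path).free
    return free_bytes >= expected_bytes * safety_factor


if __name__ == "__main__":
    # "." is a placeholder; point this at the volume holding the EDG workspace.
    estimated_size = 5 * 1024**3  # hypothetical estimate: 5 GiB of documents
    print("enough space:", has_room_for_copies(".", estimated_size))
```

The safety factor is a design choice rather than a requirement: stored copies may not be the only thing growing on that volume, so leaving headroom is usually prudent.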

Once you have finished configuring your new Corpus, it will appear on the page listing all Corpora. This is the page displayed when you click on the Corpora link in the blue left-navigation pane in EDG. Clicking on an individual Corpus will display its content in the Corpus editor. The first tab for Corpus (the tab that gives you access to the editor) is called Documents.