We fielded several questions as part of our recent webinar (recording and slides available here): Guided ML: Intelligent Autoclassification of content using managed vocabularies.
Questions from the webinar included:
Q1: Can the EDG environment return documents ordered by either or both of: (1) the number of matching tags; (2) tag weights?
Not at the moment, but we have all of the information available. We are looking at alternative approaches for presenting the suggested tags to the user, and we will look at integrating this information.
Q2: You spoke of training on the “parsed content”. How is the parsed content of a scientific paper generated?
It is generated automatically. EDG connects to a content repository (e.g., file system, content management system, etc.) and parses documents. Hundreds of document formats are supported – MS Office formats, PDFs, etc.
Q3: If a corpus has existing tags (as was said a few minutes ago), what is the (corpus) format in which the existing tags have to be provided?
If existing tags are stored as document metadata, then when EDG connects to the corpus it will pick them up, just like it gets the author, creation date, etc.
If the tags are available separately, then any structured format will work, for example a spreadsheet. In the end, the source tags are captured as a fully managed controlled vocabulary in EDG.
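As a rough illustration of this path, a separately maintained spreadsheet of tags could be turned into SKOS concepts, the RDF vocabulary commonly used for controlled vocabularies. This is only a sketch of the idea: the CSV contents and the namespace are hypothetical, and EDG performs this capture itself.

```python
import csv
import io

# Hypothetical CSV export of existing tags, one preferred label per row.
csv_data = "label\nMortgage\nInterest Rate\n"

# Hypothetical namespace for the resulting vocabulary.
base = "http://example.org/vocab/"

lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."]
for row in csv.DictReader(io.StringIO(csv_data)):
    uri = base + row["label"].replace(" ", "_")
    # Each source tag becomes a SKOS concept with a preferred label.
    lines.append(f'<{uri}> a skos:Concept ; skos:prefLabel "{row["label"]}"@en .')

turtle = "\n".join(lines)
print(turtle)
```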
Q4: Let’s assume that in the process we see, we have in the “consumption” set the input system (e.g. a CMS) as the target. That is, we want to enrich content that is in a CMS with tags. To do that, we need interfaces to store the tagged content back into the CMS. Are there off-the-shelf APIs for this storage step for widely available CMS, e.g. Drupal, Sharepoint, data bases like MarkLogic, …?
All extracted metadata, and any new metadata added during tagging, is stored in knowledge graphs managed by TopBraid EDG. This information can be accessed using GraphQL or SPARQL, the industry-standard query language for RDF graphs; TopBraid EDG provides both types of endpoints. It also supports RESTful APIs.
One option is to use these APIs to read the data from EDG and load it into your CMS or database of choice. Another option is to export from EDG in a format that your system already accepts. For example, MarkLogic will load RDF, which is directly available. EDG can also export data in a tabular format, e.g., CSV or Excel; JSON export is also available. TopQuadrant can also implement custom connections to specific CMSs (using the CMS's native API) as customizations for specific clients.
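As a minimal sketch of the first option, a SPARQL query can be sent to a SPARQL endpoint over standard HTTP. The endpoint URL and the dcterms:subject tagging predicate below are placeholder assumptions; consult your EDG installation for the actual endpoint path and property names.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint URL -- check your EDG installation for the real path.
endpoint = "https://edg.example.org/tbl/sparql"

# List every document/tag pair; dcterms:subject is an illustrative predicate.
query = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?doc ?tag WHERE {
  ?doc dcterms:subject ?tag .
}
"""

# Standard SPARQL protocol: POST the query form-encoded, ask for JSON results.
request = Request(
    endpoint,
    data=urlencode({"query": query}).encode("utf-8"),
    headers={"Accept": "application/sparql-results+json"},
)

# urllib.request.urlopen(request) would execute it; here we only build it.
print(request.get_method(), request.full_url)
```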
Q5: After autoclassification, is it possible to have the complementary view, i.e. for each term/topic, which documents were tagged with that topic?
For now, you can use the search available in the Content Tag Set to filter on a specific topic; this will show you the list of documents that have that topic assigned.
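Since tags are stored in an RDF graph, the complementary view can also be expressed as a SPARQL query built per topic. The dcterms:subject predicate and the topic URI are assumptions for illustration; your EDG content graph may use a different tagging property.

```python
# Sketch of the complementary view: all documents tagged with a given topic.
def documents_for_topic(topic_uri: str) -> str:
    # dcterms:subject is an assumed tagging predicate, not necessarily EDG's.
    return f"""
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?doc WHERE {{
  ?doc dcterms:subject <{topic_uri}> .
}}
"""

# Hypothetical topic URI for illustration.
query = documents_for_topic("http://example.org/topics/Mortgage")
print(query)
```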
Q6: Can documents be annotated according to different properties beyond “topic”?
Yes, although we didn’t show it in the demo, the slides provide this example. Users can make different properties available to the Content Tag Set by creating an ontology that defines the desired properties and using it in the Content Tag Set.
Q7: Can you show a taxonomy, ontology, glossary in EDG, please?
We did show a taxonomy during the demo. If you want to see an ontology and how to work with it, take a look at one of the product videos e.g., https://www.topquadrant.com/project/ontology_modeling_overview/
Q8: Why is “Mortgage” listed twice on the topic list?
This is because the term is present twice in the MAECO taxonomy. Since we combined several available taxonomies to create MAECO, some terms are repeated.
Q9: Do you have any plans to use something like Graph2vec, node2vec or semantic vectors to discover knowledge based upon the structure of the RDF graph?
For auto-classification, the content of a document is not stored as a graph; it is stored as the text value of a content property and is passed as such to the machine learning algorithms. In general, our strategy for integrating ML algorithms has been to pass the data over in the format required by each algorithm.
We can definitely discover and connect more based on the structure of the RDF graph. We would do this using SHACL; the SHACL engine can be backed by custom logic, which is well suited to plugging in new or different algorithms.
Q10: Can you only tag per document, or also, e.g., per section of the document?
You can only tag a full document. We currently represent the document as a whole; if a use case requires tagging specific parts of a document, this representation would have to be revised.
Q11: Would it be useful to use a data lake (HDFS…) instead of a CMIS repository to store the corpora?
We would not be able to use HDFS to store the document metadata, as we need it in our internal database. Reading documents from HDFS could be implemented as a custom connector.
Q12: So the AutoClassifier is a way to create a tag set, as well as document sets, automatically. But it is still far from a taxonomy/ontology? Or does the new tag set rely on an existing, manually designed taxonomy/ontology?
Correct, it will not build a taxonomy. Organizing terms requires human decisions, and these decisions are very context-specific. The goal of the AutoClassifier is to enrich unstructured data (documents) with some structured information.
Q13: What techniques does the AutoClassifier rely on?
It uses a bagged decision tree model that evaluates candidate terms, obtained from the documents and the taxonomy after the usual text preprocessing (stemming, stop-word removal, etc.), using various features based on the taxonomy structure and the document structure.
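To make this concrete, the sketch below computes the kind of document-based features such a model might use to score a candidate term: occurrence frequency, relative position of first occurrence, and term length. The toy stemmer and stop-word list are stand-ins for real components (e.g., a Porter stemmer), and the actual AutoClassifier feature set differs.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative subset

def stem(token: str) -> str:
    # Toy suffix-stripping stemmer standing in for a real one.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def candidate_features(document: str, term: str) -> dict:
    # Tokenize, drop stop words, and stem -- the "usual preprocessing".
    tokens = [stem(t) for t in re.findall(r"[a-z]+", document.lower())
              if t not in STOP_WORDS]
    counts = Counter(tokens)
    term_tokens = [stem(t) for t in term.lower().split()]
    freq = min(counts[t] for t in term_tokens) if term_tokens else 0
    first_pos = min((tokens.index(t) for t in term_tokens if t in tokens),
                    default=len(tokens))
    return {
        "frequency": freq,                                    # how often the term occurs
        "first_occurrence": first_pos / max(len(tokens), 1),  # relative position
        "length": len(term_tokens),                           # single vs. multi-word term
    }

doc = "Mortgage rates and mortgage lending in the housing market."
print(candidate_features(doc, "mortgage"))
```

Feature vectors like these, computed for every candidate term, are what the bagged decision trees would then score to accept or reject a tag.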
Q14: Is the AutoClassifier based on Maui?
Maui is one of the underlying open source technologies used in the process.
Q15: When you reject suggested tags, does this feed back into the model, or only affect that particular document?
No, rejected suggestions do not currently feed back into the model.
Q16: What languages are supported in AutoClassifier?
By default, out of the box: English, French, Spanish, and German. The machine learning algorithm used is language-independent, so other languages can be added with minimal effort. Adding support for another language involves adding a language-specific parser (many are available as open source) and, optionally, a language-specific stop-word list.
Q17: If you train a model but it gives a lot of bad suggestions, what can you do to improve results?
Strategies to improve the quality of AutoClassifier results include: using a larger or higher-quality training set; ensuring the taxonomy is sufficiently deep and detailed; adding more alternate or hidden labels to the taxonomy to improve match with the terminology used in documents.
Q18: Do you have to use the Tag Manager to view how content has been tagged or can you use another application via an API?
You can use another application.
As noted in the answer to Q4, all extracted metadata, and any new metadata added during tagging, is stored in knowledge graphs managed by TopBraid EDG and can be accessed through its GraphQL, SPARQL, and RESTful API endpoints.
Q19: For content that is tagged with a specific label, if the label is changed from A to B, will all content that was tagged with A then be shown as B?
A tag is a link to a resource, identified by its URI. Typically, these resources are terms or concepts in some taxonomy, ontology, or glossary. When auto-tagging happens, the labels of the resources, including all available synonyms, along with other information about them, are used to choose the most likely tags.
Once a document is tagged, the link is shown in faceted search and other displays based on the resource's label. If the display label of the resource changes, then what is shown in the facets changes as well.
For example, let's say there is a resource with the URI www.example.org/United_States and the label “United States”. If a document was tagged with this resource, we will see the name of the tag as “United States”. If the label is changed to “USA”, the tag will now show as “USA”.
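This behavior can be mimicked in a few lines: because a document stores only the tag's URI while labels live on the resource, a rename is reflected everywhere at once. The dictionaries below are purely illustrative stand-ins for the knowledge graph.

```python
# Labels live on the resource; documents store only the tag's URI.
labels = {"http://www.example.org/United_States": "United States"}
doc_tags = {"doc1": ["http://www.example.org/United_States"]}

def display_tags(doc_id: str) -> list:
    # Displays resolve each URI to its current label at render time.
    return [labels[uri] for uri in doc_tags[doc_id]]

print(display_tags("doc1"))  # ['United States']

# Renaming the resource changes what every tagged document shows.
labels["http://www.example.org/United_States"] = "USA"
print(display_tags("doc1"))  # ['USA']
```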
Q20: In the corpus, is the creator and title metadata extracted automatically from the document, or did you have to enter that data manually?
All metadata is extracted automatically.
Q21: Human intervention is required if we know what the content of the document is; if we do not know the content, is this step necessary?
This step is only useful if there is a human curator who can make a good judgment on the applicability of the tags. If such a person is not available, then there is no reason to have the review-and-approval step. As shown during the demo, the review is optional and can be skipped.
It is important to invest effort early on, during training, to make sure the tags are relevant and desired, but you can then switch to “automatic” mode, in which all generated tags enter the system.
Q22: What other open source components are used?
The Supported Platforms page provides a list of other systems utilized with EDG.