We fielded several questions as part of our recent webinar (recording and slides available here): Automating the Mapping of Data Elements to Business Terms.
The list below includes questions we were able to answer during the webinar as well as questions we did not get to in the webinar.
1. What TopQuadrant modules do I need to work with data catalogues?
Data catalogs are supported by the TopBraid EDG Metadata Management (EDG-MM) package.
2. There are various standards for data catalogues: DCAT, DCAT2, schema.org. Which of these do you recommend to use?
TopBraid EDG-MM comes with pre-built EDG ontologies describing data sources. For example, a segment of the model for a dataset is shown below:
Data cataloging capabilities in EDG use this model. You can extend it with additional properties as desired, including DCAT properties. You could also deactivate some of these properties if you do not need them and/or replace them with your own equivalents. However, if a property is auto-populated by the EDG cataloging process, it must be an EDG-provided property, for example size (bytes), format, or name.
Additionally, you may want to take a look at our recent blogs about DCAT data catalogs: https://www.topquadrant.com/exploring-eu-open-data-portal/, https://www.topquadrant.com/eu-open-data-in-topbraid-edg/ and https://www.topquadrant.com/dcat-data-catalogs-with-topbraid-edg/
3. Which standards are supported by TopBraid?
TopBraid EDG supports RDF and SHACL for data representation, and SPARQL and GraphQL for querying. As described in the answer to Q2, TopBraid EDG provides ontologies defined using SHACL for capturing catalog information. These ontologies are open to extension.
4. I was unclear about “Crosswalk”. Could you say/show a little more on that?
A Crosswalk is an asset collection in TopBraid EDG containing links between assets stored in two other asset collections. For example, links between two taxonomies or between two reference datasets. These links are mappings that identify equivalent or similar assets.
As a concrete example:
- An enterprise may have two alternative sets of codes for product categories. Some data sources may use one set of codes while other data sources use another set of codes.
- Each of these code sets would be captured in EDG as a reference dataset.
- Additionally, a crosswalk would be established to map the codes in one set to the codes in the other.
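The product category example can be sketched in a few lines of Python. This is only an illustration of the idea, not how TopBraid EDG stores crosswalks: the code values, the names `LEGACY_CODES`, `RETAIL_CODES`, `CROSSWALK`, and the helper functions are all hypothetical.

```python
# Hypothetical example data: two alternative product-category code sets.
# In EDG, each of these would be captured as its own reference dataset.
LEGACY_CODES = {"A10": "Laptops", "A20": "Monitors", "A30": "Keyboards"}
RETAIL_CODES = {"ELEC-01": "Portable computers", "ELEC-02": "Displays"}

# The crosswalk holds only the links between the two code sets.
# Each link is (source code, target code, kind of match).
CROSSWALK = [
    ("A10", "ELEC-01", "exact"),
    ("A20", "ELEC-02", "close"),
]


def translate(code, crosswalk=CROSSWALK):
    """Return the target codes linked to a source code (empty list if unmapped)."""
    return [target for source, target, _kind in crosswalk if source == code]


def unmapped(source_codes=LEGACY_CODES, crosswalk=CROSSWALK):
    """Return the source codes that have no link in the crosswalk yet."""
    mapped = {source for source, _target, _kind in crosswalk}
    return sorted(code for code in source_codes if code not in mapped)
```

Keeping the links separate from the two code sets means either set can change without touching the other, and `unmapped()` shows which codes still need a steward's attention.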
For more information on Crosswalks, see this video.
5. Is there a process to connect the technical glossary to business glossary?
There is no specific predefined process, but TopBraid EDG supports multiple glossaries, and you can combine and connect them as needed.
6. Would you include a definition for each column in a data catalog/dictionary?
There are cases where a common shared definition from a glossary term is sufficient, i.e., a general definition that comes from the “mapped to” term. There are also cases where you want to supplement or clarify it with a column-specific definition. TopBraid EDG lets columns “inherit” definitions from the terms they are mapped to, and it also supports providing additional definitions directly on a column.
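As a rough sketch of how an inherited definition and a column-specific definition can coexist, consider the following Python. The classes `Term` and `Column` and the function `effective_definition` are hypothetical names chosen for illustration; they are not part of the EDG model or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Term:
    name: str
    definition: str


@dataclass
class Column:
    name: str
    mapped_to: Optional[Term] = None      # glossary term the column is mapped to
    own_definition: Optional[str] = None  # optional column-specific clarification


def effective_definition(column):
    """Combine the inherited term definition with the column's own note, if any."""
    parts = []
    if column.mapped_to is not None:
        parts.append(column.mapped_to.definition)
    if column.own_definition:
        parts.append(column.own_definition)
    return " ".join(parts)
```

A column mapped to a “Date of Birth” term would show the shared definition, followed by any note that applies only to that column, such as its storage format.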
7. Is the glossary a logical view, and the catalogue physical?
The catalog is a physical view. The glossary is more of a business concept view.
8. Is there a standard metadata model for glossaries, catalogues etc or do tools like TopBraid each have their own proprietary model and metadata terminology?
There is more than one related standard. TopBraid EDG uses W3C standards RDF and SHACL for describing metadata models. This makes models used in TopBraid EDG fully open for extension, modification and query.
Additionally, please see answer to Q2 above.
9. Can the catalog and glossary be exported to pdf format?
You can export them to Excel and then, if desired, save the Excel file as a PDF.
10. Is there versioning on the information entered in the catalog / glossary?
All information in TopBraid EDG has an audit trail that records the identity of the user who made a change, the nature of the change, and when it was made. To learn more, take a look at this short video about change history.
11. Can we define custom tags representing an application component on each data asset or data element?
Yes, certainly.
12. Can you demo, define / create a new glossary and enter terms manually and link terms from a data source?
Yes, you can definitely create terms manually. Terms in a glossary are typically created by the stewards. And you can manually link data sources to terms. See this short video about creating and editing a business glossary.
13. Can ETL tools connect to the catalog / glossary for helping to define mappings?
Information from the ETL tools can assist in providing the lineage of data elements e.g., the fact that data in field A of one data source is derived from field B in another data source. Having this information in a catalog can be quite useful.
14. Can you map a source column to more than 1 column?
Sure, there is no limit to the number of mappings. You can also include more information for a mapping, e.g., any transformation that may have occurred between the source and the target.
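To make the one-to-many case concrete, here is a minimal Python sketch. The column names, the transformation notes, and the function `targets_by_source` are invented for illustration and do not reflect how EDG represents mappings internally.

```python
from collections import defaultdict

# Hypothetical mappings: (source column, target column, optional transformation note).
MAPPINGS = [
    ("crm.customer.full_name", "dw.dim_customer.first_name", "text before first space"),
    ("crm.customer.full_name", "dw.dim_customer.last_name", "text after first space"),
    ("crm.customer.email", "dw.dim_customer.email", None),
]


def targets_by_source(mappings):
    """Group mappings by source column; one source may have many targets."""
    grouped = defaultdict(list)
    for source, target, note in mappings:
        grouped[source].append((target, note))
    return dict(grouped)
```

Here `crm.customer.full_name` maps to two target columns, and each mapping carries its own note about the transformation applied.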
15. Can you map a hyperledger catalog dataset to a relational table?
Yes, TopBraid EDG can be used to catalog different types of data sources and they can all be connected together.
16. What tool do you use for Glossary and Catalog?
17. Can you please clarify what a successful match looks like based on data sampling? I saw the data profiling tab, what is that compared to?
Data Profiling section (tab) has statistical and other types of information gathered from data in a data source. For example, what type of values are contained in a column (strings, dates, numbers), how many distinct values are contained in a column, what are the min and max values, etc.
This information is compared to the rules describing a glossary term. If the information satisfies the rules, then a mapping is suggested. Data samples are used in a similar way.
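The comparison of profile statistics against term rules can be sketched as follows. This is a simplified illustration under stated assumptions, not the matching logic EDG actually uses: `ColumnProfile`, `TermRule`, `satisfies`, and `suggest_terms` are hypothetical names, and real rules would typically be expressed in SHACL rather than Python.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnProfile:
    name: str
    value_type: str                    # e.g. "string", "date", "number"
    distinct_count: int
    min_value: Optional[float] = None
    max_value: Optional[float] = None


@dataclass
class TermRule:
    term: str
    value_type: str
    max_distinct: Optional[int] = None
    min_allowed: Optional[float] = None
    max_allowed: Optional[float] = None


def satisfies(profile, rule):
    """Check a column profile against every constraint the rule declares."""
    if profile.value_type != rule.value_type:
        return False
    if rule.max_distinct is not None and profile.distinct_count > rule.max_distinct:
        return False
    if rule.min_allowed is not None and (
        profile.min_value is None or profile.min_value < rule.min_allowed
    ):
        return False
    if rule.max_allowed is not None and (
        profile.max_value is None or profile.max_value > rule.max_allowed
    ):
        return False
    return True


def suggest_terms(profile, rules):
    """Return the terms whose rules the profile satisfies (candidate mappings)."""
    return [rule.term for rule in rules if satisfies(profile, rule)]
```

For example, a numeric column whose values range from 18 to 67 would satisfy an “Age” rule that allows 0 to 130, so “Age” would be suggested for a person to confirm or reject.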
18. What is difference among glossary and vocabulary and terms?
A vocabulary is a collection of terms in a particular domain (i.e., field or subject) of knowledge, together with the definitions of those terms. There are many use cases for vocabularies in general and for machine-processable vocabularies in particular.
A Business Glossary is a vocabulary that is developed to help users improve their understanding of data’s context and usage. Glossary terms not only describe their meaning, but also define the business context of use, and they can be linked to the underlying technical metadata to provide a direct association between business terms and data sources and data elements. In TopBraid EDG, glossary terms include descriptions of business rules and permissible values, both in plain English and as structured, executable rules that are used to automate connections between data elements and business terms. They may also connect to reference datasets and enumerations that hold lists of values specific to a given term such as “customer status”.
19. Are relationships between terms an ontology and term is interpretable as entity?
An ontology is a schema, or the underlying data model. For example, a data element may have a physical datatype (be a string or an integer), it may be nullable, it may have a maximum and minimum number of characters, etc. A dataset may have a format, a title, be of a certain size, etc. A glossary term can also have a set of appropriate properties. The definition of what properties (attributes and relationships) exist for a data element or a dataset is captured in an ontology. Please also see the answer to Q2 above.
Things like Data Element, Dataset, and Business Term are classes in the ontology. Then there are instances that are members of these classes, i.e., specific data elements from some specific data source, specific glossary terms, etc.
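The distinction between classes and instances can be illustrated loosely with Python dataclasses. This is an analogy only: EDG ontologies are defined in RDF and SHACL, and the class and field names below are hypothetical simplifications of the properties mentioned above.

```python
from dataclasses import dataclass, fields
from typing import Optional


# Classes play the role of the ontology: they declare which properties exist.
@dataclass
class DataElement:
    name: str
    physical_datatype: str
    nullable: bool
    max_length: Optional[int] = None


@dataclass
class BusinessTerm:
    label: str
    definition: str


# Instances are the specific members of those classes.
customer_id = DataElement("CUST_ID", "integer", nullable=False)
customer_term = BusinessTerm("Customer", "A party that purchases goods or services.")

# The "schema" of a class can be inspected independently of any instance.
data_element_properties = [f.name for f in fields(DataElement)]
```

The list `data_element_properties` describes what can be said about any data element, while `customer_id` is one particular data element described in those terms.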
20. Terms can be complex combinations of data elements, how do you maintain the relationship of terms to relationships among data elements?
Business terms are typically not used to describe data structures. They are used to provide business definitions and business context.
In principle, you can use something like a “contains” relationship to connect the term “Purchase Order” with the term “Purchase Order Number”. However, this type of thing is typically left to logical data models and not included in business glossaries. For a more specific answer to this question, we would need to better understand the use case you are trying to enable.
21. Data Dictionaries capture all data elements but not necessarily all Metadata, statistics. Is there a standard way of Data Catalog?
Please see answers to Q2 and Q3 above.
22. What is the difference between Master data and Data Catalog?
They have different purposes and contain different information.
A Data Catalog captures information about enterprise data sources. These can be databases, datasets, etc. A data catalog contains an inventory of these data sources, with enough information for catalog users to identify the data of interest.
Master data contains master data entities used by an enterprise, such as its customers or products. These entities participate in business transactions as captured by transactional data records such as records about orders and receipts.
Thus, a Customer Master dataset will contain information about all customers. A Data Catalog will not contain a list of all customers, but it may contain a catalog record for the Customer Master dataset.
23. The attributes captured in data catalog may not be always easy – as data, data combinations, information, and business term mapping may not be that simple?
Correct. We are not talking about a fully automated catalog, but rather a catalog that is curated by people with the assistance of technology. The role of automation is to assist, do the bulk of the easy work, narrow down some possibilities, etc.
24. How do mappings from data element to terms appear on the Term page?
For any Term, you can go into the Explore menu, then select Usages in other Asset Collections. This will show all references to the term that are stored outside of the glossary.
25. It appears that the automation requires item/column-level details to first be entered into both the glossary and the catalog, then it searches for likely matches. IOW: the glossary must be very detailed, down to the level of Employee Gender, DOB, etc., to match the level of detail in the catalog/database(s). Is that correct?
The business glossary has to be as detailed as needed to meet your goals. This will differ across different organizations and even within a single organization depending on what data is being described and the goals for establishing a catalog and a glossary.
To automatically infer a connection between a column and a term, you would need some level of explicitly defined information for the machine to base the decision on. If you rely on people for making these connections, then you need enough information
26. What type of data sources can you ingest? And Is this a cloud or on prem solution?
TopBraid EDG is available both as an on-premises solution and as a SaaS solution.
Out of the box, TopBraid EDG can catalog data sources that have JDBC connectivity (such as relational databases), spreadsheet files (such as Excel), and unstructured documents (such as Word and PDF). There are open APIs through GraphQL and other options that make it possible to extend this to other types of data sources.