We fielded several questions as part of our recent webinar, Reference Data Management in TopBraid Enterprise Data Governance. The webinar focused on how to start a data governance initiative with reference data management and how to expand at your own pace to create an end-to-end data governance program that is more consistent and cost-effective. Here is a recap of the questions as well as our responses.

Questions about Reference Data Governance

Q1: How do you handle the case when an old code becomes obsolete and is replaced by one or more new codes? For example, when East Germany stopped being a separate country.
It is a best practice to never delete codes that are in use, and TopBraid EDG/RDM enforces this practice. Instead, the status of the code is changed. This is done because historical data used the code, and it is unrealistic to expect that all of this data will be modified. Thus, we need to know that the code existed and what it meant when it was in use.

In cases where an organization wants to migrate some of its historical data to the new codes, TopBraid EDG/RDM can help facilitate this process by capturing the mapping relationships between old and new codes.
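
The status-plus-mapping approach can be sketched in a few lines of Python. This is an illustration only, not how TopBraid EDG stores data: the `Code` class, its field names, the status values, and the `resolve_current` helper are all assumptions made for the example. DD and DE are the ISO 3166-1 alpha-2 codes for the former East Germany and for Germany.

```python
from dataclasses import dataclass, field


@dataclass
class Code:
    value: str
    label: str
    status: str = "active"  # hypothetical status values: "active" or "retired"
    replaced_by: list = field(default_factory=list)  # values of successor codes


# The retired code is kept, not deleted, so historical data stays interpretable.
codes = {
    "DD": Code("DD", "German Democratic Republic", status="retired", replaced_by=["DE"]),
    "DE": Code("DE", "Germany"),
}


def resolve_current(value, codes):
    """Follow replaced_by links until only active codes remain."""
    code = codes[value]
    if code.status == "active":
        return [code.value]
    current = []
    for successor in code.replaced_by:
        current.extend(resolve_current(successor, codes))
    return current
```

Calling `resolve_current("DD", codes)` returns the active successor codes, while `codes["DD"].label` still records what the old code meant when it was in use.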

Q2: What are the differences between and reasons for managing internal, external and what are called “standard” reference datasets?
Most organizations use both external and internal reference data. They’ll use external reference data in situations where a suitable “standard” dataset exists. They use internal reference data for codes that are unique to an organization such as product categories or location codes and sometimes for legacy codes where suitable standards exist, but an organization has already created their own alternatives.

Increasingly, organizations prefer to use standards as their reference data: for example, country and currency codes from ISO, occupation codes from the Bureau of Labor Statistics, industry codes from NAICS, and security identifiers from Bloomberg. Using external reference data instead of creating your own codes has advantages, such as:

  • You can rely on the third party to create and maintain data.
  • It may often be more complete and future-proof than what you would have created since organizations that maintain such datasets have to consider a broad range of requirements.
  • Integration can become easier as other parties may also be using the same standards.
  • In some cases, you may be required to use the standard version – for example, for certain government reporting.

It is common for organizations to connect external and internal reference datasets and extend the external datasets with internal organization-specific codes. Some standards even have conventions (such as an allocated group of codes) designed specifically for extensions.
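
One example of such a convention is ISO 3166-1, which sets aside the alpha-2 codes AA, QM to QZ, XA to XZ, and ZZ for user assignment. The sketch below extends a standard code list with internal codes and checks that they stay inside those ranges. The function names and the internal "XA" entry are hypothetical, and this is not a description of how EDG implements extensions.

```python
def is_user_assigned(code):
    """True if code falls in the ISO 3166-1 alpha-2 user-assigned ranges."""
    if len(code) != 2 or not code.isascii() or not code.isalpha() or not code.isupper():
        return False
    return (
        code in ("AA", "ZZ")
        or (code[0] == "Q" and "M" <= code[1] <= "Z")
        or code[0] == "X"
    )


def extend(standard, internal):
    """Return a merged copy, rejecting collisions and codes outside the reserved ranges."""
    merged = dict(standard)
    for code, label in internal.items():
        if code in standard:
            raise ValueError(f"{code} collides with a standard code")
        if not is_user_assigned(code):
            raise ValueError(f"{code} is outside the user-assigned ranges")
        merged[code] = label
    return merged


standard = {"DE": "Germany", "FR": "France"}
internal = {"XA": "Offshore platform (internal)"}  # hypothetical internal code
combined = extend(standard, internal)
```

Keeping internal codes inside the reserved ranges means a future revision of the standard cannot assign the same code a different meaning.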

Q3: What are crosswalks?
Crosswalks let you create connections between the terms in two different vocabularies. This is especially useful for defining connections between two different standard vocabularies or between a standard one and a specialized local one. Applications can use saved crosswalk connection data to enhance the use of either vocabulary by taking advantage of the connected data and metadata for search, classification, and other operations. Included in EDG Packages: Reference Data Management, Vocabulary Management
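
As a rough illustration of the idea, a crosswalk can be thought of as a set of pairs that is indexed in both directions. The sketch below connects ISO 3166-1 alpha-2 and alpha-3 country codes. The variable names and the `translate` helper are assumptions made for the example and do not reflect the EDG data model.

```python
from collections import defaultdict

# Crosswalk pairs: (ISO 3166-1 alpha-2, ISO 3166-1 alpha-3).
pairs = [("DE", "DEU"), ("FR", "FRA"), ("US", "USA")]

# Index the pairs in both directions so either vocabulary can be the starting point.
forward = defaultdict(set)
backward = defaultdict(set)
for alpha2, alpha3 in pairs:
    forward[alpha2].add(alpha3)
    backward[alpha3].add(alpha2)


def translate(code, index):
    """Return the connected terms for code, or an empty list if none are mapped."""
    return sorted(index.get(code, set()))
```

Using sets as values allows one term to connect to several terms in the other vocabulary, which is common when a local vocabulary is coarser or finer than the standard one.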

Q4: Is it possible to make/model relations between different reference data sets/tables?
Yes. There is a detailed blog with screenshots about it, Deep Dives into Reference Data Management, and this feature was also demoed in the previous TopQuadrant Reference Data Management Webinar.

Q5: Does EDG enable populating description/discovery metadata attribute values in external applications? For example, populating product description fields from a controlled vocabulary list.

Yes, all information managed by TopBraid is readily accessible through RESTful APIs. An application can connect to TopBraid EDG in real time to get, for example, reference data for a particular product or a set of products, or any other type of information. Some customers use it to improve search and navigation, as it can provide synonyms and related links. Others use it to display forms for business entities such as an employee, as it contains all the relevant attributes and relationships for an object.

As a default API there is a SPARQL endpoint. We also include pre-built (template) APIs for the most common requests, and customers can define new APIs through modeling. For more on APIs and web services, see Web Services and TopQuadrant Products and Creating Web Services with the TopBraid Platform.
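
To give a feel for what calling a SPARQL endpoint looks like, here is a minimal sketch that builds a query request following the standard SPARQL 1.1 Protocol (an HTTP GET with a `query` parameter). The endpoint URL is a placeholder rather than a real EDG address, and the query assumes, purely for illustration, that labels are stored as `skos:prefLabel`. Check your own installation's documentation for the actual endpoint path and authentication.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder URL (assumption): substitute the endpoint of your own server.
ENDPOINT = "https://edg.example.org/sparql"

# Illustrative query; the use of skos:prefLabel is an assumption about the data.
QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?code ?label WHERE {
  ?code skos:prefLabel ?label .
} LIMIT 10
"""


def build_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
# urllib.request.urlopen(request) would send it; that step is omitted here
# because the endpoint above is only a placeholder.
```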

Q7: How are mappings to reference/master data in EDG synchronized with the sources?
We showed mappings between two reference datasets. Both are being governed by EDG. Once an organization decides to use a reference data management tool, they typically commit to changing their processes.

In other words, they would no longer update the job categories table directly. Instead, they would update the HR Job Categories reference dataset in EDG, taking such updates through the approval/stewardship process as needed.

TopBraid EDG will help the organization understand what depends on the reference data and what would be impacted by a change to it, including any mapping updates. For example, if they decided to split an existing code for Software Engineers into two, Developers and Architects, they may need to update how it maps to the standard SOC codes and may also need to make some updates to the applications. Once the changes are made, the HR database would get the new set of codes from EDG. Centralizing the provision of reference data to applications and databases is part of EDG's role. There is also a pre-built compliance checking service that can be used to check whether the data in the reference table in the HR database corresponds to the current version of the reference dataset.
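
Conceptually, a compliance check of this kind is a comparison of two sets of codes. The sketch below shows the idea only. It is not the pre-built EDG service, and the `compare` function and the sample codes are hypothetical.

```python
def compare(reference_codes, table_codes):
    """Report codes the table is missing and codes the reference dataset does not know."""
    reference, table = set(reference_codes), set(table_codes)
    return {
        "missing_from_table": sorted(reference - table),
        "not_in_reference": sorted(table - reference),
    }


# Hypothetical data: the reference dataset after "SWE" was split into "DEV" and "ARCH",
# and an HR table that has not yet picked up the change.
reference = ["DEV", "ARCH", "QA"]
hr_table = ["SWE", "QA"]
report = compare(reference, hr_table)
```

Here `report` lists "ARCH" and "DEV" as missing from the table and "SWE" as no longer present in the reference dataset, which is the kind of drift such a check is meant to surface.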

The blog Dashboards Provide a Clear View on Data Governance Rain or Shine may also provide some relevant information, especially the paragraph about Operational Maturity metrics.


Q8: What does semantic standards-based mean?

Simply put, when we speak of semantic standards, we are referring to an existing set of standards for meaningful, machine-readable communication. These standards describe a rigorous way to identify a “thing”, another “thing”, and the relationship between them. Together they form a simple, meaningful “sentence” made up of a “subject” (a thing), a “predicate” (a relationship) and an “object” (another thing or value). When you capture enough of these standardized “sentences”, you have a “semantic” model that describes your domain of interest. This “semantic” description is similar to how a teacher or a manual could use a collection of sentences to say what is important to learn or know about a domain.
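
The “sentence” idea can be illustrated with plain Python tuples. This is a simplification: real semantic standards identify subjects and predicates with URIs, and the short names and the `match` helper below are assumptions made to keep the example readable.

```python
# Each "sentence" is a (subject, predicate, object) tuple.
triples = [
    ("DE", "type", "Country"),
    ("DE", "label", "Germany"),
    ("DD", "replacedBy", "DE"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching every part that was given; None matches anything."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]
```

For instance, `match(triples, subject="DE")` returns everything stated about DE, and `match(triples, predicate="replacedBy")` finds every replacement relationship in the model.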

Like all data governance products, TopBraid EDG uses models that describe the assets being governed: reference data, but also data sources in general, glossary terms, business applications, and so on. These models include attributes such as names, descriptions, and level of security, as well as relationships between assets, such as the inputs and outputs of an application. TopBraid EDG is a fully model-driven application. The models included with the product cover the data governance space in the broadest sense, from business processes to data elements. If users want to add a piece of information that is not included in the pre-built models, remove something they don’t need, or even add a new class of assets, they simply modify the models.

Unlike other data governance products, TopBraid EDG does not use a proprietary representation for these models (that is, some binary format our engineers made up). Instead, it uses standards for describing data and metadata from the World Wide Web Consortium (W3C). This is the same organization that created and manages standards like HTML and XML. When a data governance application uses a proprietary approach, the information it contains becomes yet another silo that is not understood outside of the application. Using standards means that the models and data stored by TopBraid EDG are fully interoperable, because their meaning is well defined by the open standards. Other applications can use them more easily, making integration simpler and less expensive. W3C standards are web-based. Each resource in our system has a URI as a unique identifier, so it can be queried, dereferenced over HTTP, and linked to. When customers need to extend the models, they do not need to learn a proprietary approach. They can use the standard languages and tools.