I had the pleasure of attending and representing TopQuadrant at the recent Enterprise Data World 2015 Conference (EDW) in Washington, DC. TopQuadrant was a Silver Sponsor and exhibitor, and gave this talk.

Despite a few challenges with the venue (there was ongoing construction noise at the hotel, although the really great food compensated for it), it was an exciting and informative event. I had the opportunity to listen to presentations from, and talk one-on-one with, many knowledgeable practitioners. They shared success stories but, above all, outlined ongoing and new complex challenges for which they, like everyone at the conference, were eager to learn about new solution approaches and technologies. Among all the talks and data-related discussions at EDW this year, information governance clearly stood out to me as an area demanding, and getting, increased attention.

The need for more attention is driven by the proliferation of complex, diverse data from multiple sources, which is making governance harder to achieve. Additionally, more and more data is coming from external sources, making it even more important to keep track of who created it, how often it changes and what the policies are for using it. At the same time, the growing business reliance on data and analytics and the need for increasingly rapid, agile development of corresponding business applications create demands on systems to deliver more data faster, whether it is well governed or not. This results in many opportunities for data to be used incorrectly, often leading to critical errors in business reporting, compliance and other key data-centric business functions.

In its predictions for 2015, Gartner comments that “The rise of data discovery, access to multi-structured data, data preparation tools and smart capabilities will further democratize access to analytics and stress the need for governance.”

A number of talks at EDW pointed out that the governance of reference data is increasingly seen as an ideal place to start or improve an organization’s data governance program. The reasons are simple:

  • Reference data is used very widely across all enterprise systems, so any errors can have a far-reaching impact. In several industries, such as healthcare and financial transactions, any mistake in coding can have dire consequences and be very costly. Indeed, in a very impressive case study on selecting and implementing a new comprehensive reference data solution in the healthcare area, one of the essential requirements stressed was that reference data: “Must be 100% accurate, for both internal and external customers.”
  • Correspondingly, improvements also have wide business impact and high ROI.
  • At the same time, reference data management and governance are tractable. Unlike master data, the volumes of reference data are relatively small, making it easier to tackle.

There are several specialized reference data management products on the market today. A few of these were presented at EDW. All of these products share a common set of features: they import reference data from different sources into a common repository where it can be governed, and then make it available to the systems that need it, either through exports or web services. Changes are tracked. Impacted parties are notified of the changes. Access is controlled.
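The common feature set described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not how any particular product works: the `ReferenceRepository` class, its method names and the sample country codes are all hypothetical, and access control is left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ReferenceRepository:
    """Hypothetical in-memory repository for governed reference codes."""

    # (dataset name, code) -> label
    codes: Dict[Tuple[str, str], str] = field(default_factory=dict)
    # One entry per actual change: (user, dataset, code, new label)
    change_log: List[Tuple[str, str, str, str]] = field(default_factory=list)
    # Callbacks standing in for "impacted parties"
    subscribers: List[Callable[[str, str, str], None]] = field(default_factory=list)

    def import_codes(self, dataset: str, rows: Dict[str, str], user: str) -> None:
        """Load codes from a source, logging and announcing only real changes."""
        for code, label in rows.items():
            if self.codes.get((dataset, code)) == label:
                continue  # unchanged, so nothing to track or notify
            self.codes[(dataset, code)] = label
            self.change_log.append((user, dataset, code, label))
            for notify in self.subscribers:
                notify(dataset, code, label)

    def export(self, dataset: str) -> Dict[str, str]:
        """Return one dataset as a plain code -> label mapping."""
        return {code: label for (ds, code), label in self.codes.items() if ds == dataset}


repo = ReferenceRepository()
repo.subscribers.append(lambda ds, code, label: print(f"changed: {ds}/{code}"))
repo.import_codes("country", {"US": "United States", "DE": "Germany"}, user="alice")
```

A second import that repeats the same rows would add nothing to the change log, which is the behaviour a consuming system would want from tracked changes.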

Some of the products, including TopBraid Reference Data Manager (TopBraid RDM), offer flexibility in the “columns” or properties included in a reference dataset. This includes relationships between the reference codes within each dataset and across datasets. Here, solutions based on a relational DBMS approach are at a disadvantage. For them, model changes require development work and IT intervention to enhance the reference data repository, its screens, and interfaces. This further reinforces the need for a solution that supports flexible semantic modeling and implementation of reference data.

This need is well addressed by TopBraid RDM, as it is based on NoSQL graph database technology. Users can easily create their own semantic models to capture different types of reference data and link all the different datasets to each other, creating a network of reference data. End-to-end business lineage between the reference data and the related business and technical metadata, policies, rules and data assets becomes very easy to create. While the TopBraid RDM repository is a graph database, it is persisted using traditional RDBMS storage. As a result, users enjoy the robustness of the storage layer and can rely on their standard procedures for backup and recovery.
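To make the idea of a graph of reference data more concrete, here is a minimal sketch in plain Python. It is not TopBraid RDM's API or storage model; the `ReferenceGraph` class, the predicate names (`label`, `usesCurrency`, `isoNumeric`) and the sample codes are assumptions chosen for illustration. The point is that each fact is a subject-predicate-object statement, so a new property or a link across datasets is just one more statement rather than a schema change.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class ReferenceGraph:
    """Hypothetical, minimal triple store for reference codes."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self.triples.add((subject, predicate, obj))

    def objects(self, subject: str, predicate: str) -> List[str]:
        """All values of one property for one code, sorted for stable output."""
        return sorted(o for s, p, o in self.triples if s == subject and p == predicate)


graph = ReferenceGraph()
graph.add("country:US", "label", "United States")
graph.add("currency:USD", "label", "US Dollar")

# A relationship across two datasets is an ordinary statement.
graph.add("country:US", "usesCurrency", "currency:USD")

# Adding a new "column" later needs no table alteration.
graph.add("country:US", "isoNumeric", "840")

print(graph.objects("country:US", "usesCurrency"))
```

In a relational design, the last two `add` calls would typically correspond to a new join table and a new column, which is the development work the paragraph above refers to.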

Beyond these basic core capabilities, organizations are increasingly looking for the ability to manage information about reference datasets:

  • Is this an internal dataset or external one? Who is the body responsible for maintaining it? Is there a subscription cost? How often does the dataset change?
  • What are the policies and procedures for on-boarding a dataset? For governing? For provisioning and use?
  • Who is using the dataset and how? What applications and systems? Through what access method? For what data? What will be impacted by the change?
  • What is the meaning of the “columns” or properties in the dataset? What do they represent? How does one set of codes relate to another set of codes? What is the meaning of these relationships?
  • What is the meaning of each reference code? What broader information is available about them?
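One way to picture the information these questions ask for is as a metadata record kept alongside each dataset. The sketch below is purely illustrative: the `DatasetMetadata` and `Consumer` classes, their field names and the sample values are hypothetical and do not come from any product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Consumer:
    """A system that uses the dataset, and how it gets the data."""
    system: str
    access_method: str  # e.g. "export" or "web service"


@dataclass
class DatasetMetadata:
    """Hypothetical record describing a reference dataset as a whole."""
    name: str
    is_external: bool
    maintainer: str
    update_frequency: str
    subscription_cost: Optional[float] = None
    # Meaning of each "column" or property, keyed by property name
    property_definitions: Dict[str, str] = field(default_factory=dict)
    consumers: List[Consumer] = field(default_factory=list)

    def impacted_systems(self) -> List[str]:
        """Systems to notify when this dataset changes."""
        return sorted({c.system for c in self.consumers})


countries = DatasetMetadata(
    name="country codes",
    is_external=True,
    maintainer="external standards body",  # placeholder value
    update_frequency="as needed",
    property_definitions={"label": "Display name of the country"},
    consumers=[Consumer("billing", "web service"), Consumer("reporting", "export")],
)
print(countries.impacted_systems())
```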

Knowing the answers to these questions is necessary for the successful governance and use of reference data. TopBraid RDM is unique in providing its users with comprehensive capabilities for capturing semantic metadata about individual codes and also about the reference datasets that contain them. When a solution doesn’t provide such features, users (as one of the EDW presenters shared) try to fill the void by building supplementary stand-alone systems in MS Access or MS Excel for this information. As a result, organizations often end up with multiple systems for different parts of reference data management, as discussed in our previous blog.

TopBraid RDM offers a standardized approach for defining reference data and associated information. All the models are expressed using open web standards for data and metadata descriptions, making proprietary data representation a thing of the past and ensuring a future-proof solution.

Reference data management certainly isn’t the only way to get started with data governance, but it’s a good step to consider, whether you are just beginning your data governance journey or planning your next step along the way. Pack your bags and get started!