Many large businesses, from pharmaceuticals to aerospace to consumer goods, as well as government agencies, rely on data governance to bring together their disparate data sources and to make data across the organization available, meaningful, usable, reusable, and secure, all while maintaining data quality. To show why connecting data elements in a data catalog to the corresponding business terms in a glossary matters for reaching data governance goals, we recently conducted the webinar Applied Data Governance: A Day in the Life of a Technical Data Steward. (The recording and slides are available there.)
This was the second in our current series about applied data governance, which focuses on the practical, day-to-day integration of top-down, bottom-up and middle-out processes and capabilities to deliver business value by aligning management of data with an organization’s uses of data.
Data Governance is Key.
We all know too well how easy it is for the data environment in a growing organization to “explode”: all of a sudden, there are different databases operating independently in different departments, with some overlapping and some unique information. What data are we storing where? Are we storing the same information in the same way? Is confidential or personally identifiable information (PII) secure? Using specific data for strategic and operational decision-making becomes inefficient and time-consuming, because the data elements are often fragmented, confusing, and incomplete. Data governance can address these needs by connecting information in a meaningful way with the people, processes, and technology within an organization.
TopBraid EDG at Work.
Let’s look at a specific example of data governance in action, at a high level, using the role of a technical data steward and TopBraid Enterprise Data Governance (TopBraid EDG). The technical data steward in an organization has been asked to integrate a human resources database into the overall data governance structure. He knows this because his list of tasks is accessible through TopBraid EDG, as is the ongoing workflow, which enables collaboration with other team members.
He can easily pull the metadata and any existing glossaries into TopBraid EDG. This provides an understanding of the data in the database, the data structure (field names, field lengths, business rules, etc.), and how it fits within the overall “Employee Data Domain,” which contains all the organization’s employee data and the relationships between those data. Automated mapping to standard SQL data types helps to unify the data. He also takes advantage of the existing data profiling libraries within TopBraid EDG to identify common fields such as phone numbers or email addresses. He reviews the automated connections made by TopBraid EDG to make sure they are correct and manually creates connections where necessary.
While the technical data steward is importing the database and setting up connections, he works in a “sandbox” environment until he is sure that it is ready to go live. Because of the workflows in TopBraid EDG, all involved stakeholders can see that work is underway, what work is being done, and how it is being done. When he is satisfied that the data are described and connected as they should be, the technical data steward can advance the task for review by the data governance officer, all within the same workflow.
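To make the idea of mapping imported column types to standard SQL data types more concrete, here is a minimal Python sketch. It is not TopBraid EDG's implementation: the lookup table, its entries, and the function name to_standard_type are assumptions chosen for illustration, and a real mapping would cover far more vendor types.

```python
import re

# Hypothetical lookup table: vendor-specific type name -> standard SQL type.
# The entries are illustrative and are not taken from TopBraid EDG.
STANDARD_SQL_TYPES = {
    "VARCHAR2": "VARCHAR",    # Oracle
    "NUMBER": "NUMERIC",      # Oracle
    "INT4": "INTEGER",        # PostgreSQL alias
    "INT8": "BIGINT",         # PostgreSQL alias
    "DATETIME": "TIMESTAMP",  # MySQL / SQL Server
}

def to_standard_type(declared_type: str) -> str:
    """Map a declared type such as 'VARCHAR2(50)' to a standard SQL type.

    Length/precision arguments are kept as-is; unknown types are
    returned upper-cased and otherwise unchanged.
    """
    match = re.fullmatch(r"\s*([A-Za-z0-9_ ]+?)\s*(\(.*\))?\s*", declared_type)
    if match is None:
        return declared_type
    base = match.group(1).upper()
    args = match.group(2) or ""
    return STANDARD_SQL_TYPES.get(base, base) + args

print(to_standard_type("VARCHAR2(50)"))  # VARCHAR(50)
print(to_standard_type("number(10,2)"))  # NUMERIC(10,2)
```

Keeping the mapping in a plain lookup table means a steward could review or extend it without touching the normalization logic.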
The figure below shows the actual metadata imported through the JDBC connection, along with the profile statistics and sample values collected, which the user can inspect further. Collected samples can be seen by clicking the “View Samples” link at the right. Notice the “EMPLOYEE_ID” column. Because the employee ID values follow a consistent pattern, TopBraid EDG can use a business rule to recognize this pattern and suggest enriching this metadata with definitions from the appropriate glossary terms.
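As a rough illustration of how a pattern-based rule could turn sample values into glossary suggestions, consider the Python sketch below. Everything in it is assumed for the example: the employee ID format (the letter E followed by six digits), the glossary term names, the 90% threshold, and the function name suggest_glossary_terms. It does not reflect how TopBraid EDG stores or evaluates its business rules.

```python
import re

# Hypothetical pattern rules: (glossary term, compiled pattern).
# Neither the terms nor the patterns come from TopBraid EDG.
PATTERN_RULES = [
    ("Employee Identifier", re.compile(r"E\d{6}")),
    ("Email Address", re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")),
]

def suggest_glossary_terms(samples, threshold=0.9):
    """Return (term, match_ratio) for every rule that most samples satisfy."""
    values = [s for s in samples if s]  # ignore empty / NULL samples
    if not values:
        return []
    suggestions = []
    for term, pattern in PATTERN_RULES:
        matched = sum(1 for v in values if pattern.fullmatch(v))
        ratio = matched / len(values)
        if ratio >= threshold:
            suggestions.append((term, ratio))
    return suggestions

employee_id_samples = ["E000123", "E004567", "E120001"]
print(suggest_glossary_terms(employee_id_samples))
# [('Employee Identifier', 1.0)]
```

The threshold leaves room for a few dirty values in the sample; the result is only a suggestion, so a steward would still review it before the connection is made.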
Reduce Costs, Improve Quality, Create Opportunity.
A quick review of how the data steward accomplished his task shows that he automated the cataloging, ingestion, and profiling of this metadata all at the same time. Thanks to machine-readable business rules, he was able to automate mappings to glossary terms, connecting existing assets with these columns in the database for better understanding of the data and meaningful reuse. Because of the collection of W3C standards used in TopBraid EDG, a common vocabulary could be used throughout the organization, allowing communication across departments and across diverse, siloed data contexts. Why is that important? Because in a larger organization, we are inevitably trying to reduce costs, improve quality and efficiency, and respond more effectively to new opportunities; applying meaning to these data across the enterprise enables these benefits.
Note: For additional information on connecting data elements to relevant business terms, the following document, with screenshots excerpted from the webinar, gives a quick summary of how data asset collections (data catalogs) and glossaries are built and connected to each other in TopBraid EDG: TopBraid EDG – Illustrative Screenshots Showing Support for Data Asset Collections (“Data Catalog”), Business Glossaries, and their Connection.
Q&A from the Webinar
As usual, we fielded a few questions at the end of the webinar. Here is a recap of the questions and the answers.
Q1: You showed a Business Glossary briefly with some useful fields built in. Can you tell us more about how Glossaries are supported? Can different groups have a different glossary design in terms of the fields they want, etc.?
The previous webinar, A Day in the Life of a Business Steward (Recording and Slides are available there), covered Glossaries in TopBraid EDG more thoroughly.
Essentially, EDG comes with a prebuilt model for Glossaries, but you can easily adjust it. There is a lot of flexibility in EDG to accommodate how you want to organize data governance. You can have a set of fields that are common for the entire enterprise, and you can also have local variations, e.g., for a business area or even just for one specific glossary.
Q2: Can you talk about the operational governance model in your tool, and how it enables an organization to set up their particular governance model?
As you saw briefly in the webinar demo, TopBraid EDG’s operational governance model enables you to set up a Charter Statement, Business Areas, Subject Areas, Roles, Policies, Workflows, etc. You can create your own Roles and assign users to them. You can create your own templates for Workflows and add these to the ones provided in EDG. Different Subject Areas can have different workflows, etc.
Q3: Do you have APIs to access metadata within EDG?
Absolutely. It is very easy to just use what’s provided out of the box. Any button that you saw me touch today was most likely API-driven, which means you could actually do those micro-transactions, such as creating a new asset collection, creating a new glossary term, or deleting a glossary term, via our RESTful API. You can also create saved searches and custom templates that are exposed as web services, all without taking the system down. These are all live-deployed to TopBraid EDG. So you can use what’s provided out of the box, and you can always create your own custom web services.
Jack: Yeah, if I can just emphasize that last thing that Jesse pointed out, because that’s probably one of the most powerful aspects of EDG: unlike many APIs, which just expose certain standard calls for standard information, all of the information that is stored inside of EDG, whatever it is, whatever you’ve stored there, can be accessed by creating custom services that expose custom API calls. So pretty much anything that you want: if it’s in EDG, it can be gotten out and accessed through the API.
Q4: Did I understand correctly that JDBC makes a live (bi-directional) connection to an external database, so that data instances are actually stored in an external DB, while EDG can manage the semantics of that data?
Yes, that is correct. TopBraid EDG is not interested in duplicating data. It wants to leave the data where it is, where it’s operational, and to help provide the semantic metadata about that data. You pull in the metadata of those external data sources through JDBC and the many other connectors and integration points that we have, such as APIs, JMS, and spreadsheet/CSV importers, so that you can connect it, enrich it, and build bridges to things you’ve never seen before in order to answer questions that you’ve never been able to answer before.
Then external systems, external workflows, and other applications, whatever they are, can get this rich information from TopBraid EDG. You don’t always have to use the TopBraid EDG interface alone. Embed connections into your existing workflows and just make API calls back to EDG so you’re always aware of the semantics, relationships, and extra enrichment that you’ve never seen before.
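The distinction between pulling metadata and copying data can be shown with a small Python sketch. It uses an in-memory SQLite database from the standard library purely as a stand-in for an external database reached over JDBC; the table, its columns, and the function name harvest_column_metadata are assumptions for the example and say nothing about how EDG's connectors work internally.

```python
import sqlite3

def harvest_column_metadata(conn, table):
    """Read column-level metadata for one table without reading any rows."""
    # PRAGMA table_info returns one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [
        {
            "table": table,
            "column": name,
            "type": col_type,
            "nullable": not notnull,
            "primary_key": bool(pk),
        }
        for _cid, name, col_type, notnull, _default, pk in rows
    ]

# Stand-in for the external, operational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE (EMPLOYEE_ID TEXT PRIMARY KEY, EMAIL TEXT NOT NULL)"
)

metadata = harvest_column_metadata(conn, "EMPLOYEE")
for entry in metadata:
    print(entry["column"], entry["type"])
```

Only the catalog description leaves the database here; the employee records themselves stay where they are operational, which is the point made in the answer above.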
Q5: Can we use EDG to verify data quality for instances from an external DB via JDBC?
TopBraid EDG can sample external databases (via JDBC) for cursory visual checking and can also use rules to procedurally evaluate these samples against expected patterns. Although TopBraid EDG can potentially collect and evaluate large numbers of such samples, it is not designed to directly perform such demanding quality control at scale. As demonstrated during the webinar, EDG can capture business rules associated with data elements. Having metadata for a data source and access to business rules for data elements makes it possible for EDG to auto-generate SQL queries that represent some of these rules. EDG could potentially execute them directly, or the queries could be exported for running on the database directly as a script and/or for running in a different tool as part of a dedicated checking and cleansing environment. In either case, the results of the run(s) could then be captured in EDG as data quality metrics.
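The following Python sketch illustrates the general idea of generating SQL checks from stored rules and capturing the results as metrics. It is a sketch under stated assumptions, not EDG's query generator: the rule format (a column name mapped to a SQL predicate), the employee ID pattern, the function names, and the use of SQLite (including its GLOB operator) as a stand-in database are all hypothetical.

```python
import sqlite3

# Hypothetical machine-readable rules: column -> SQL predicate that valid
# values must satisfy. GLOB is SQLite-specific; other databases differ.
RULES = {
    "EMPLOYEE_ID": "EMPLOYEE_ID GLOB 'E[0-9][0-9][0-9][0-9][0-9][0-9]'",
    "EMAIL": "EMAIL LIKE '%_@_%._%'",
}

def build_quality_query(table, column, predicate):
    """Build a query counting all rows and the rows that violate the rule."""
    # Table and column names are assumed to come from trusted catalog metadata.
    return (
        f"SELECT COUNT(*) AS total, "
        f"COALESCE(SUM(CASE WHEN {column} IS NULL OR NOT ({predicate}) "
        f"THEN 1 ELSE 0 END), 0) AS violations "
        f"FROM {table}"
    )

def run_quality_checks(conn, table, rules):
    """Run one generated query per rule and collect simple quality metrics."""
    metrics = {}
    for column, predicate in rules.items():
        query = build_quality_query(table, column, predicate)
        total, violations = conn.execute(query).fetchone()
        pass_rate = (total - violations) / total if total else None
        metrics[column] = {
            "total": total,
            "violations": violations,
            "pass_rate": pass_rate,
        }
    return metrics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (EMPLOYEE_ID TEXT, EMAIL TEXT)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?)",
    [
        ("E000123", "a@b.com"),
        ("X12", "c@d.org"),           # malformed employee ID
        ("E000456", "not-an-email"),  # malformed email
        ("E000789", None),            # missing email counts as a violation
    ],
)

metrics = run_quality_checks(conn, "EMPLOYEE", RULES)
print(metrics["EMPLOYEE_ID"])  # {'total': 4, 'violations': 1, 'pass_rate': 0.75}
```

Because the generated query is just a string, it could equally be exported and run by a separate tool, with only the resulting counts reported back as metrics.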
EDG is designed to acquire, aggregate, evaluate, centralize, and display metrics associated with governance metadata. For these purposes, TopBraid EDG can help govern large-scale database quality control initiatives by integrating (via API and other connections) with dedicated third-party software tools optimized for quality control and direct database interaction at scale. EDG can display metadata gathered from such tools directly or transform it as needed (using business rules stored in EDG) to calculate the desired key performance indicators (KPIs).
Q6: Does it also support data quality with complex rules going vertical and horizontal to cross-reference and drive meaningful metrics?
Yes, the model, the metadata, and the data itself can be “quality checked.” Capturing the results of such checks is critical for providing meaningful, combined metrics.