Metadata and its use are at the heart of the data fabric. The metadata collected by the data fabric needs to be described using rich and comprehensive models. These models inform and power the process of activating metadata – using the metadata to derive new knowledge, make recommendations and automate data integration tasks.
The models must be:
- Flexible and easily extensible to accommodate evolution of the data fabric
- Open and exposed to access by all the tools that participate in the data fabric
- Precise and clear in their definition
Ontologies expressed using graph data modeling languages are uniquely effective in addressing these requirements. We particularly recommend the use of the graph data modeling language SHACL (Shapes Constraint Language) because it is a standard and was designed to have the expressivity, extensibility and connectivity needed to power the data fabric.
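For readers new to SHACL, here is a minimal sketch of what such a model looks like. The ex: namespace and all class and property names in it are hypothetical, introduced only for illustration:

```turtle
# A minimal SHACL shape describing a dataset (all ex: names are hypothetical)
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/fabric#> .

ex:DatasetShape
    a sh:NodeShape ;
    sh:targetClass ex:Dataset ;
    sh:property [
        sh:path ex:title ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;           # every dataset must have a title
    ] ;
    sh:property [
        sh:path ex:steward ;
        sh:class ex:Person ;      # connects datasets to people in the graph
    ] .
```

Because the shape is itself a set of RDF statements, it can be extended with new properties at any time and queried by any tool that can read the graph.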
In the last article we talked about how the data fabric collects, enriches and uses metadata. Models are essential to each of these component processes. In this article we will describe why the data fabric needs models and provide some examples of how the models are used.
COLLECT
In the collection step, the role of the model is to represent the metadata to be captured. As we discussed in the previous article, metadata can be of four types: business, technical, operational and social. The model behind the data fabric needs to describe all four of these types. Different data assets are described by different sets of metadata, so the model needs to cover the relevant aspects of each source of metadata.
A good, organic starting point is to see what metadata is available at each participating source. This will largely be technical metadata. Most data cataloging solutions have already done this work and have built support for technical metadata into their products. Some may go further and describe other types of metadata. As an example, the diagram below shows a subset of metadata available in TopBraid EDG for a relational database.
The diagram shows some technical metadata such as foreign keys. It also shows business metadata such as the connection from a business term to a database column. Operational metadata is represented by attributes like criticality.
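To make this concrete, here is a hypothetical Turtle sketch of the kinds of statements the diagram depicts; the property names are illustrative and are not EDG's actual vocabulary:

```turtle
# Three types of metadata about one database column (hypothetical ex: names)
@prefix ex: <http://example.org/fabric#> .

ex:OrdersTable a ex:DatabaseTable ;
    ex:column ex:CustomerIdColumn .

ex:CustomerIdColumn a ex:DatabaseColumn ;
    ex:foreignKeyTo ex:CustomerTable ;          # technical metadata
    ex:representsBusinessTerm ex:CustomerId ;   # business metadata
    ex:criticality "High" .                     # operational metadata
```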
Having pre-built metadata models already defined will shorten time to deployment. A key stumbling block to watch for, however, is how difficult the pre-built model may be to modify and expand to support current and future needs.
Questions to Consider:
Some of the questions to ask when considering a data catalog that will provide the foundation for the data fabric are:
- Are there any limitations in the model behind the catalog?
- Can it be used to describe JSON, XML, NoSQL and other data sources as well as it can describe relational databases?
- How easy is it to extend the model?
- How will the subject matter expert interact with the model?
To support describing all types of metadata, the modeling language needs to be powerful and comprehensive. This includes the ability to describe a variety of data structures and sources. While relational databases remain a bread-and-butter utility for most organizations, sources using other data structures are increasingly becoming a key part of the data landscape. It is imperative for the data fabric model to be able to accommodate them. As the recent blog How Data Modeling is Different Today discussed, ontologies defined with graph data modeling languages are very effective in supporting modern data modeling needs. They make it possible to create highly expressive models that are both modular and connectable.
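As a sketch of this flexibility, the same hypothetical vocabulary used above for relational columns can describe a field in a JSON source, reusing the very same property to connect it to a business term:

```turtle
# The same modeling language describing a JSON source (hypothetical ex: names)
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/fabric#> .

ex:JSONFieldShape
    a sh:NodeShape ;
    sh:targetClass ex:JSONField ;
    sh:property [
        sh:path ex:jsonPath ;                # e.g. "$.customer.id"
        sh:datatype xsd:string ;
        sh:minCount 1 ;
    ] ;
    sh:property [
        sh:path ex:representsBusinessTerm ;  # same property used for columns
        sh:class ex:BusinessTerm ;
    ] .
```

Because both shapes live in one graph and share properties, metadata about relational and non-relational sources stays connected rather than siloed by source type.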
ENRICH
The enrich and connect step is key to the data fabric. Its “secret sauce” is that it analyzes collected metadata to see what “conclusions” can be drawn from it. We may call these conclusions inferences. Inferences are simply new facts derived from already known facts. The goal of these conclusions is to enable the data fabric to recommend or automate tasks.
For example:
- Based on the information known about a data element and the semantics of business terms captured in the model, we can infer that the data element maps to a particular business term. For a discussion of how a model together with the collected metadata can help to infer the mapping, see our related post Mapping Data Elements to Business Terms.
- Knowing what term is represented by the data, the data fabric can assess and suggest the criticality of the data, how it needs to be protected, and who may or may not access it.
There are a number of techniques that enable the creation of inferences. The model must facilitate this processing by supporting rich semantic rules, graph analytics and integration with machine learning.
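For instance, SHACL's advanced features include rules that derive new triples from existing ones. The sketch below, using the hypothetical ex: vocabulary from the earlier examples, shows a SHACL-SPARQL rule that infers a column's criticality from the classification of the business term it represents:

```turtle
# A SHACL-SPARQL rule: if a column represents a business term classified as
# sensitive, infer that the column is highly critical.
# All names outside the sh: namespace are hypothetical.
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/fabric#> .

ex:ColumnShape
    a sh:NodeShape ;
    sh:targetClass ex:DatabaseColumn ;
    sh:rule [
        a sh:SPARQLRule ;
        sh:construct """
            PREFIX ex: <http://example.org/fabric#>
            CONSTRUCT { $this ex:criticality "High" . }
            WHERE {
                $this ex:representsBusinessTerm ?term .
                ?term ex:classification ex:Sensitive .
            }
        """ ;
    ] .
```

The inferred criticality statement then becomes ordinary metadata that downstream processes, such as access control recommendations, can query like any other fact in the graph.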
Questions to Consider:
Some of the questions to ask when considering a data catalog’s ability to activate metadata are:
- Can it capture semantics of data?
- Can it go beyond data structures and express meaning that will drive smart inferences?
- Are models and data expressed using graph structures with support for graph analytics?
- How are taxonomies and ontologies integrated and used?
USE
Data fabric is not a single tool or product that an organization can buy and call the implementation done. Instead, it is created from a collection of products that work together. The model behind the data fabric must make this collaboration possible. Hardwired connections will not work for the data fabric: the enterprise data integration it needs to enable is dynamic, fluid and evolving, while a hardwired approach is static and brittle.
This means that the models behind the data fabric must support discoverability. Tools participating in the data fabric must be able to ask what information is available. Equally, they must be able to tell the data fabric that they have new information and ask it to expand its metadata model to accommodate that information. They need read and write access not only to the metadata itself, as it is available in the data fabric, but also to the model of the metadata the fabric uses.
In the post Querying TopBraid EDG with GraphQL, we give some concrete examples of how this may work.
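Because SHACL models are themselves just RDF data, expanding the model can be as simple as adding triples. Below is a hypothetical Turtle sketch, reusing the ex: names from the earlier examples, of a data quality tool contributing a new attribute to the dataset shape:

```turtle
# A tool with write access extends the metadata model at runtime by adding
# a property shape for a quality score it computes (hypothetical ex: names).
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/fabric#> .

ex:DatasetShape
    sh:property [
        sh:path ex:qualityScore ;     # new attribute contributed by the tool
        sh:datatype xsd:decimal ;
        sh:maxCount 1 ;
    ] .
```

Since the extended shape lives in the same graph as the metadata it describes, every other participating tool can discover the new attribute simply by querying the model.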
Questions to Consider:
Some of the questions to ask when considering how all the different products participating in the data fabric architecture will take advantage of and contribute to the metadata are:
- How will the tools participating in the data fabric interact with the metadata model?
- What APIs are available?
- What standards are being used?
- Is the core of the data fabric model-driven or is it hard-wired?