Strategies for resolving identity across data sources

TopQuadrant’s latest product, TopBraid Insight (TBI), federates queries across diverse data sources. It then brings the results back and merges them so that if, for instance, one data source has employment data about Irene Polikoff and another has data about the books and articles she wrote, we can see all of this data together, establishing a 360-degree view of all the information available about her.

Then, as we click on any related “thing” such as a company or a book, we similarly will see all of the merged information about each one. And we can run queries like “Give me all people who worked at company X between 2005 and 2010 and who at any point published an article in Information Week”. More complex queries such as this can rarely be answered by consulting one data source; answering them requires the fusion of information from the data available in several sources.

Key to enabling this capability is the ability to correctly resolve identity across multiple data sources. In other words, TBI must match entities (in RDF lingo, resources) found in one data source with entities in other data sources. Since the same resource will have a different ID in each data source, TBI needs to know how to match across these varying identities.

For maximum flexibility we decided to support three strategies for achieving this critical matching of resources:

1. Canonical URIs

In some cases, it may be possible to use a URI creation policy to merge data about the same entity. TBI uses SPINMaps to map between the schema of a data source and a unified canonical model used for the federation.
(Note that TopBraid Insight supports multiple federation mapping sets, that is, sets of connected data sources. Each can use a different combination of data sources, different mapping strategies and a different canonical model.)

SPINMaps are rules expressed in SPARQL that:

Say which class in the source corresponds with which class in the target
Define how a data field in the source model (or schema) relates to a respective field in the target model

Since SPINMaps are executable rules, in mapping classes, they define how to create a “target resource” from a “source resource”. This boils down to a rule for building a URI for the target resource.

For example, if we have multiple data sources that contain information about people and each data source includes social security numbers, we can build target URIs from the social security numbers in a way that is consistent across all data sources. As a result, data coming from the different sources will merge automatically using the native capabilities of RDF. This will happen simply because the same entities will have the same URIs.
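
To make the idea concrete, here is a minimal sketch in plain Python rather than in SPINMap itself. The namespace, the record layout and the function names are all assumptions made for illustration; they are not part of TBI. The point is only that once every source derives the URI from the same normalized key, records about the same person land under the same identifier.

```python
import re
from collections import defaultdict

# Hypothetical namespace for the canonical model (an assumption, not a TBI default).
CANONICAL_NS = "http://example.org/person/"


def canonical_person_uri(ssn: str) -> str:
    """Build a target URI from a social security number.

    Non-digit characters are stripped so that differently formatted
    values of the same number produce the same URI.
    """
    digits = re.sub(r"\D", "", ssn)
    if len(digits) != 9:
        raise ValueError(f"expected 9 digits, got {ssn!r}")
    return CANONICAL_NS + digits


def merge(*sources):
    """Collect property values from every source under the canonical URI."""
    merged = defaultdict(dict)
    for records in sources:
        for record in records:
            uri = canonical_person_uri(record["ssn"])
            for key, value in record.items():
                if key != "ssn":
                    merged[uri][key] = value
    return dict(merged)


# Two illustrative sources that format the same number differently.
hr = [{"ssn": "123-45-6789", "employer": "TopQuadrant"}]
library = [{"ssn": "123 45 6789", "article": "Example Article"}]

merged = merge(hr, library)
```

Both records end up under a single URI because the URI depends only on the normalized number, not on which source supplied it.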

This is a great strategy that is easy to implement. However, often it can only be used with a subset of the data sources and/or data entities. For example, some data sources may have social security numbers, but others may not. They may, instead, have customer IDs or just customer names and telephone numbers. In each case, a different strategy and different logic will be needed to match data.

2. LinkMaps

LinkMaps describe which resources are the same by using functional schema-level mappings. Each LinkMap works on a pair of data sources. Link mappings are statements that may say, for example, that if a person’s first name, last name and telephone number match in both data sources, this is the same person.
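
A rule of this kind can be sketched as follows. This is plain Python for illustration only, not LinkMap syntax; the field names (`first`, `last`, `phone`, `id`) and the normalization choices are assumptions.

```python
import re


def person_key(record):
    """Normalized (first name, last name, phone digits) used for matching."""
    phone = re.sub(r"\D", "", record["phone"])
    return (
        record["first"].strip().lower(),
        record["last"].strip().lower(),
        phone,
    )


def link(source_a, source_b):
    """Return (id_in_a, id_in_b) pairs whose matching keys are equal."""
    index = {}
    for rec in source_b:
        index.setdefault(person_key(rec), []).append(rec["id"])
    return [
        (rec["id"], b_id)
        for rec in source_a
        for b_id in index.get(person_key(rec), [])
    ]


# Illustrative records; the people and numbers are made up.
people = [
    {"id": "A-1", "first": "Jane", "last": "Doe", "phone": "(555) 010-0000"},
    {"id": "A-2", "first": "John", "last": "Roe", "phone": "555-010-0001"},
]
authors = [
    {"id": "B-7", "first": "jane ", "last": "DOE", "phone": "555-010-0000"},
]

links = link(people, authors)
```

The mapping is functional in the sense that it is stated once at the schema level and then applies to every pair of records, rather than listing matching records one by one.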

TBI can traverse these LinkMaps transitively, so we don’t need to create mappings for each pair of data sources. If a Person class from data source A is link-mapped to an Author class in data source B and the Author class in data source B is link-mapped to a Customer class in data source C, there is no need to link-map the Person class in A to the Customer class in C. This avoids the explosion of required mappings as the number of data sources grows.
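
With n data sources, a chain of n - 1 pairwise mappings is enough to connect them all, whereas mapping every pair directly would take n(n - 1)/2. The sketch below shows one way such a chain could be followed: a breadth-first search over class-level links. It is an illustration in plain Python; the data structure and function name are assumptions and say nothing about how TBI implements traversal internally.

```python
from collections import deque

# Hypothetical LinkMaps: each connects a class in one source to a class in another.
LINK_MAPS = [
    (("A", "Person"), ("B", "Author")),
    (("B", "Author"), ("C", "Customer")),
]


def mapping_path(start, goal, link_maps):
    """Find a chain of link-mapped classes from start to goal, or None."""
    neighbors = {}
    for left, right in link_maps:
        # Links are treated as symmetric: "same as" holds in both directions.
        neighbors.setdefault(left, set()).add(right)
        neighbors.setdefault(right, set()).add(left)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = mapping_path(("A", "Person"), ("C", "Customer"), LINK_MAPS)
```

Here the Person class in A reaches the Customer class in C through the Author class in B, even though no LinkMap connects A and C directly.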

3. LinkSets

LinkSets are RDF datasets that contain explicit links between specific resources in two data sources. For example, a LinkSet record may say that a Person with ID 123456 from data source A is the same as an Author with ID ABC789 from data source B.

LinkSets are useful in cases where it is either impossible or too complex to describe functional mappings. LinkSets can also be used to hold exceptions to link mappings. It is quite common for functional mappings based on property values to cover 80% of the data that can be connected, while the other 20% doesn’t follow the rules closely enough for that strategy to work. LinkSets can help cover these cases. Ultimately, the creation of LinkSets can be crowdsourced. If a user sees that entities that should have been merged were not, they could click to create a new link record for them. The same is true if TBI merges two entities that are, in fact, different.
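
One way to picture how explicit link records might sit alongside functional mappings is shown below. This is a hedged sketch in plain Python: the idea of keeping separate "same" and "different" record lists, and the names `same_as` and `different_from`, are assumptions made for illustration, not a description of how TBI stores LinkSets.

```python
def resolve_links(functional_links, same_as, different_from):
    """Combine rule-derived links with explicit exceptions.

    functional_links: pairs produced by property-based rules
    same_as:          explicit records adding links the rules missed
    different_from:   explicit records vetoing links the rules got wrong
    """
    links = set(functional_links) | set(same_as)
    return sorted(links - set(different_from))


# Illustrative identifiers only.
functional = [("A:1", "B:1"), ("A:2", "B:2")]
same_as = [("A:3", "B:9")]          # a user says these should have merged
different_from = [("A:2", "B:2")]   # a user says these were merged wrongly

result = resolve_links(functional, same_as, different_from)
```

The rules do the bulk of the work, and the explicit records only need to cover the cases where the rules fall short.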

Like LinkMaps, LinkSets can be traversed transitively, so if there are links between sources A and B and links between sources B and C, there is no need to create links between sources A and C even though they contain information about the same things.
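
At the level of individual resources, following links transitively amounts to grouping resources into clusters that all denote the same thing. A minimal sketch using a union-find structure is given below; it is plain Python for illustration, and the identifier `C-42` and the function name are made up.

```python
def identity_clusters(links):
    """Group resources connected by "same as" links, directly or transitively."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


# Links exist only for A-B and B-C; no record connects A and C directly.
links = [
    (("A", "123456"), ("B", "ABC789")),
    (("B", "ABC789"), ("C", "C-42")),
]

clusters = identity_clusters(links)
```

All three resources fall into one cluster, so data about the Person in A and the resource in C can be merged without an explicit A-to-C link record.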

With these three complementary strategies, TBI lets you resolve identity in whichever way best fits both your requirements and the matching opportunities actually present in the data you want to connect.

Do you think these three strategies are sufficient to address your data merging needs? Have you heard of or worked with other approaches to this problem? We’d like to hear from you.
