Combining your Knowledge Graph in TopBraid EDG with Wikidata or other external Knowledge Graphs
Introduction
Like many people today, you or your organization are likely to have lots of data you “own” and curate. However, with the growing amount of information available from multiple sources, you may also find it important to take advantage of external data. For example, Wikidata contains background information about almost any topic in the world, such as the current population of each country.
To avoid duplication and manual data entry, you may want to set up your local repository to fetch, say, the latest population count from an external source. Wikidata defines unique identifiers for each resource in its store. Even if you do not need to fetch external data to see it in TopBraid EDG, it may be useful to have any EDG glossary, reference dataset or taxonomy that talks about Australia link to the Wikidata resource for Australia – using Wikidata as a knowledge hub of reference.
TopBraid EDG, starting with version 6.2 released in 2019, has offered an easy, “no code required” way to connect to and use data from Wikidata or any other external knowledge graph that provides a SPARQL endpoint. This includes external endpoints and endpoints managed by your own organization.
Technical background: SPARQL endpoints provide an access point for receiving and processing SPARQL protocol requests. The SPARQL protocol is an HTTP-based protocol that carries SPARQL query requests in GET and POST operations. Generally, SPARQL endpoints are offered by graph databases (known as RDF triple stores) and Knowledge Graph/Linked Data platforms. However, SPARQL is a standard and could be offered as a service by any data source, even one that does not store its data in a graph store. TopBraid EDG provides a SPARQL endpoint and can access third-party SPARQL endpoints.
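To make the protocol concrete, here is a minimal Python sketch of a SPARQL query sent over HTTP GET, using only the standard library. The endpoint URL is the Wikidata one used later in this article. The sample query, the function names and the User-Agent string are illustrative assumptions and are not part of TopBraid EDG.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# Hypothetical sample query: the English label of the Wikidata entity Q2685.
SAMPLE_QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  wd:Q2685 rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""


def build_query_url(endpoint, query):
    """Encode a SPARQL query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def run_select(endpoint, query):
    """Send the query and return the result bindings (needs network access)."""
    request = urllib.request.Request(
        build_query_url(endpoint, query),
        headers={
            # Ask for the standard SPARQL JSON results format.
            "Accept": "application/sparql-results+json",
            # Illustrative client name; public endpoints usually expect one.
            "User-Agent": "sparql-sketch/0.1 (illustrative example)",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_select(WIKIDATA_ENDPOINT, SAMPLE_QUERY)` would send the request and return one dictionary per result row. The same query could also be sent as an HTTP POST, which is the usual choice for long queries.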
In this article, we will illustrate how this works by using an example. We will establish a link from a local People database to matching resources in Wikidata, reusing data such as date of birth, height and images directly from Wikidata.
The capability we will demonstrate here is commonly referred to as data fabric or data mesh because it seamlessly connects and uses data from multiple sources. A data fabric can exist at multiple levels – it can bring together the actual data or, as our example demonstrates, bring together metadata (data about data). TopBraid EDG can help you with both scenarios.
Let’s get started. In the diagram below, we distinguish between local resources (aka assets) shown in blue and remote resources shown in green.
Local resources are maintained in TopBraid EDG by dedicated staff, while selected data from remote resources is periodically copied over (or used dynamically on demand) and thus maintained by the remote source. Our local data includes people’s names, gender, where they went to school and who they are married to. The remote source has a lot of information, including a few data points we would like to reuse – height, date of birth, date of death (for deceased people) and an image.
Dedicated link properties (such as “wikidata person” in the diagram above) are used to point from local resources to the corresponding remote resources. On-the-fly inferences using SHACL property value rules are employed to copy (transforming if necessary) selected remote values into properties of the local resources, so that they can be queried just like any other local values. We can then see and use the remote data in EDG:
Let’s now walk through the steps to make this happen. This step-by-step example uses TopBraid EDG 6.4.
Creating a Link Property
In this example we have an Ontology (schema) as an EDG asset collection. It uses a SHACL version of the schema.org namespace as a starting point. The class schema:Person already declares various property shapes for values of properties such as schema:givenName and schema:height. In order to store links from our local assets of type schema:Person to corresponding resources on Wikidata, we introduce a new property called wikidataPerson. Since it connects two resources, it will be a relationship. We create it using the Create Relationship dialog:
For now, leave the Class of Values empty: these are remote resources, so EDG is unlikely to know their type, and we have not yet defined anything about them in the model. In the screenshot below we display the property definition both on the form and in Turtle syntax, so that it becomes evident what changes once you declare this property to be a link to an external graph.
We will now pick “Make this property a Wikidata link” from the Modify menu:
This adds a value for the property dash:detailsEndpoint, linking it to the SPARQL endpoint of the Wikidata server: https://query.wikidata.org/sparql
Technical background: Whenever a property shape carries a value for dash:detailsEndpoint, TopBraid EDG will understand that the property values are URIs and that more RDF statements for these URIs can be queried from the given SPARQL endpoint. If the endpoint happens to be exactly the URL above then additional features for Wikidata get activated. The menu makes it easy to create the Wikidata connection with one click. However, you can update the value to any SPARQL endpoint of your choice.
That’s all for describing the link. The local schema:Person class now carries a link to the Wikidata SPARQL endpoint.
Linking Local Resources to Wikidata Resources
We have created an EDG Data Graph with instances of the schema:Person class representing the usual suspects from the Kennedy family:
Not much detailed data is captured in EDG for these people. As shown in the model image above, we have the names of the people, their gender, family relationships between individuals and schools they went to. Since these are well known people, much more data about them is available on Wikidata.
The names of the people will be sufficient to automatically generate links from our local resources to the corresponding Wikidata resources. Since we have a Wikidata link property, running Problems and Suggestions will suggest suitable Wikidata resources as values for the link property. Suggestions are based on (approximate) similarity of the labels. To generate suggested matches, TopBraid EDG will run a sequence of queries to a web service kindly provided by Wikidata. This may take a while but can be interrupted at any time:
The resulting page can be used to review the suggestions and accept those that seem plausible:
As an alternative to the batch process for all local resources, you can use “Suggest matching Wikidata entities…” from the Modify menu for each individual local resource. This will bring up a dialog such as the following:
If we accept some of the suggestions, our local resources will now have outgoing links to remote ones – such as the Q2685 link for Arnie:
You can follow the link to explore whatever Wikidata knows about this person:
Now that our local resources have references to corresponding Wikidata resources, we can start using the property values of the remote resources.
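The article does not say which Wikidata web service EDG calls or how it scores candidates, so the following Python sketch only illustrates the general idea of label-based suggestions. It builds a request URL for Wikidata’s public wbsearchentities API and ranks candidate labels with a simple similarity ratio from the standard difflib module. The function names, the 0.8 threshold and the candidate IDs in the usage note are hypothetical.

```python
import difflib
import urllib.parse

# Public MediaWiki API of Wikidata; that EDG uses this service is an assumption.
WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(label, language="en", limit=5):
    """Build a wbsearchentities request URL that searches items by label."""
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_candidates(local_label, candidates, threshold=0.8):
    """Keep (id, label) candidates whose label is similar enough, best first."""
    scored = [(similarity(local_label, label), cid, label)
              for cid, label in candidates]
    scored.sort(reverse=True)
    return [(cid, label) for score, cid, label in scored if score >= threshold]
```

For example, `rank_candidates("Arnold", [("ID-2", "zzzz"), ("ID-1", "Arnold")])` keeps only the exact match. A reviewer would still confirm each suggestion by hand, as in the dialogs shown in this section.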
Defining the Remote Data of Interest
Our schema doesn’t know anything about the remote resources yet. We need to tell TopBraid EDG which properties we are interested in, and what format they have. The W3C Shapes Constraint Language (SHACL) is well suited for that job. To identify remote properties we are interested in, we will define a SHACL node shape that carries property shapes for these properties. This acts like a “view” on the remote data and informs the system what kinds of SPARQL queries it needs to use to fetch the actual values.
Back in our example schema, we define a node shape called “Wikidata Person”. We are using a node shape that is not a class because these are remote resources with data we only hold in TopBraid EDG as a “refreshable cache”. Alternatively, we could define a class as well. If you are interested in more information about the differences between classes and node shapes, see this blog.
Click on the Node Shapes panel and press the “Create Node Shape” button:
TopBraid EDG now offers another wizard that greatly simplifies the linkage with Wikidata. From the context menu of the new node shape, select “Add property shapes from Wikidata sample…”:
The resulting dialog asks you for the ID of any example instance that may hold typical values. In our example, we pick Arnold’s Wikidata ID Q2685 and click on “Load”:
This dialog fetches all properties of this sample instance and allows you to browse the values by expanding each property you are interested in. You can then select the properties of interest and (optionally) set cardinality and datatype constraints for them. Above, we have selected the “height” property with a maximum cardinality (sh:maxCount) of 1, and datatype xsd:decimal.
We can repeat this process for other sample instances in case they have values for other properties that we may want, for example to pick “death date”, which wasn’t available for Arnold. TopBraid EDG will generate suitable SHACL property shape declarations for all the selected properties, and attach them to our Wikidata Person shape:
For experts, here is the definition of this node shape in Turtle syntax:
people_schema:Wikidata_Person
  rdf:type sh:NodeShape ;
  rdfs:label "Wikidata Person" ;
  sh:property [
      rdf:type sh:PropertyShape ;
      sh:path schema:description ;
      sh:name "description" ;
    ] ;
  sh:property [
      rdf:type sh:PropertyShape ;
      sh:path rdfs:label ;
      graphql:name "rdfs_label" ;
      sh:name "label" ;
    ] ;
  sh:property [
      rdf:type sh:PropertyShape ;
      sh:path wdt:P18 ;
      sh:name "image" ;
    ] ;
  sh:property [
      rdf:type sh:PropertyShape ;
      sh:path wdt:P2048 ;
      sh:datatype xsd:decimal ;
      sh:maxCount 1 ;
      sh:name "height" ;
    ] ;
  sh:property [
      rdf:type sh:PropertyShape ;
      sh:path wdt:P569 ;
      sh:datatype xsd:dateTime ;
      sh:maxCount 1 ;
      sh:name "date of birth" ;
    ] ;
  sh:property [
      rdf:type sh:PropertyShape ;
      sh:path wdt:P570 ;
      sh:datatype xsd:dateTime ;
      sh:maxCount 1 ;
      sh:name "date of death" ;
    ] .
Now that we have described the data from the remote resources we are interested in, we tell our link property (wikidata person) about it, using sh:node. We are further defining the link property by identifying what kind of resources can be its values. (If we used a class to describe Wikidata Person, we could also use sh:class.)
This is enough to instruct TopBraid EDG about the values we want to fetch from the endpoint. However, it does not yet tell EDG into what properties of our local resources to copy these values.
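The article does not show the SPARQL queries that EDG derives from a node shape. As a rough illustration of how a shape can act as a “view”, the Python sketch below turns a simplified stand-in for the Wikidata Person shape (a mapping from names to property IRIs) into a query that fetches only those properties for one entity. The dictionary, the function name and the query layout are assumptions made for this example.

```python
WD = "http://www.wikidata.org/entity/"
WDT = "http://www.wikidata.org/prop/direct/"

# Simplified stand-in for the Wikidata Person node shape: name -> property IRI.
WIKIDATA_PERSON_VIEW = {
    "image": WDT + "P18",
    "height": WDT + "P2048",
    "date of birth": WDT + "P569",
    "date of death": WDT + "P570",
}


def build_view_query(entity_id, view):
    """Build a SELECT query that fetches only the properties named in the view."""
    properties = " ".join(f"<{iri}>" for iri in view.values())
    return (
        "SELECT ?property ?value WHERE {\n"
        f"  VALUES ?property {{ {properties} }}\n"
        f"  <{WD}{entity_id}> ?property ?value .\n"
        "}"
    )
```

`build_view_query("Q2685", WIKIDATA_PERSON_VIEW)` returns a query string that could be sent to the Wikidata SPARQL endpoint. Properties that are not listed in the shape are never requested, which is what keeps the local “view” small.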
Defining Local Properties to Populate
We have seen how to link our local resources with the related ones in Wikidata. Now, we want to tell EDG which properties of a local resource should get the data of interest from Wikidata. Let’s say we want the values of the local property schema:height to hold the same values as the property wdt:P2048 (aka “height”) of the corresponding remote resource in Wikidata.
SHACL property value rules can be used to instruct the system that certain property values shall be computed on the fly, whenever they are queried. The resulting values are “inferred” and not editable in TopBraid EDG. A simple form of property value rule can be employed to walk from the locally stored instance of the schema:Person class to the associated wikidata person, and from the remote person to its height value. More complex rules could also be defined to perform additional transformations when needed.
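The “walk” performed by such a rule can be pictured as following a sequence of properties through the graph. The Python sketch below mimics this over a tiny in-memory graph. It is not how EDG evaluates SHACL rules. The resource names and the height value are made up for illustration.

```python
# Tiny in-memory graph: (subject, predicate) -> list of objects.
# All names and values below are hypothetical sample data.
GRAPH = {
    ("ex:LocalPerson", "ex:wikidataPerson"): ["ex:RemotePerson"],
    ("ex:RemotePerson", "wdt:P2048"): ["1.80"],
}


def follow_path(graph, start, path):
    """Follow each property in 'path' in turn and return the values reached."""
    nodes = [start]
    for predicate in path:
        # Replace the current nodes with everything reachable via this property.
        nodes = [obj for node in nodes
                 for obj in graph.get((node, predicate), [])]
    return nodes


# Inferred local height: first step to the linked person, then to its height.
inferred_height = follow_path(
    GRAPH, "ex:LocalPerson", ["ex:wikidataPerson", "wdt:P2048"]
)
```

Because the value is computed from the link each time, nothing has to be edited locally when the remote height changes. A local resource without a link simply yields no inferred value.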
You can either enter such rules by hand, or use the new wizard in TopBraid EDG. Start by navigating into the person’s property shape that defines the local height property:
Once there, select “Create property value rule from template…” from the Modify menu:
This wizard offers a growing number of templates, including the one that just copies a value from a related (in our case, linked remote) resource:
Once finished, the property shape of schema:height carries a SHACL property value rule:
To confirm that this is all now working, we can visit the local Arnold instance, and use “Refresh details of remote values” to fetch the remote values from the Wikidata SPARQL endpoint:
Once this has completed, we can see that our local Arnold instance has a schema:height property, which is inferred straight out of the Wikidata knowledge graph:
We can repeat the same steps for the other properties. In some cases, the property value rules may need to be post-processed to include extra transformations. Here, we have modified the rule for schema:deathDate so that the xsd:dateTime value from Wikidata is automatically turned into an xsd:date literal:
If you are not familiar with the syntax, check the SHACL Advanced Features 1.1 draft. The above roughly means “query the values of wikidataPerson, then query the values of P570 of those, and finally convert those to xsd:date using the SPARQL xsd:date(v) function”. Similarly, we can use the function sparql:iri to convert the image URL strings delivered by Wikidata into IRI resources. (To see the sparql: functions, include the “SPARQL vocabulary for SHACL” in your Ontology.)
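For readers who think more easily in a general-purpose language, here is a rough Python analogue of the two transformations. It is only a sketch: the function names are invented, the sample values are made up, and a real SPARQL engine may treat time zones and unusual date forms differently from this simplified version.

```python
import urllib.parse
from datetime import datetime


def to_date(date_time_lexical):
    """Drop the time part of an xsd:dateTime string such as 2001-02-03T04:05:06Z."""
    # Older Python versions do not accept a trailing "Z", so spell out UTC.
    normalized = date_time_lexical.replace("Z", "+00:00")
    return datetime.fromisoformat(normalized).date().isoformat()


def to_iri(value):
    """Turn a URL string into an IRI term in angle-bracket notation."""
    if not urllib.parse.urlparse(value).scheme:
        raise ValueError(f"not an absolute IRI: {value!r}")
    return f"<{value}>"
```

With these helpers, `to_date("2001-02-03T04:05:06Z")` yields the date-only string, and `to_iri` raises an error for strings that lack a scheme instead of silently producing a broken link.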
Refreshing and Querying Remote Values
Now that all shapes have been set up, we can use batch processes to periodically refresh the remote values, e.g. once a night. In TopBraid EDG, this can be automated using scheduled jobs. The batch process can be triggered from the Transform tab:
Alternatively, individual resources can be refreshed as shown earlier.
We can now see that all local person resources that have links to Wikidata entities carry values for height, birth date, death date and image:
You can also query these values, consistently with locally defined values, using GraphQL:
Since TopBraid’s GraphQL support is based on shape declarations, we can even query the values of the remote resources, as follows. Note that this requires the Wikidata Person shape to be marked with graphql:protectedShape in the Ontology.
Oh, and since we have used SHACL node shapes to declare the structure of the Wikidata entities, we can also perform constraint validation on that data. Combined with TopBraid EDG workflows, this means that data can be pulled from the remote service and then validated before it is accepted into production.
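To give a feel for what such validation checks, the Python sketch below mimics the two constraints declared for “height” in the Wikidata Person shape: sh:maxCount 1 and sh:datatype xsd:decimal. In practice EDG’s SHACL engine performs this work. The function name and message wording are assumptions for illustration.

```python
import re

# Lexical form of xsd:decimal: optional sign, digits, optional fraction.
XSD_DECIMAL = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def validate_values(values, max_count=1):
    """Return a list of problems for the maxCount and xsd:decimal checks."""
    problems = []
    if len(values) > max_count:
        problems.append(
            f"expected at most {max_count} value(s), found {len(values)}"
        )
    for value in values:
        if not XSD_DECIMAL.fullmatch(value):
            problems.append(f"not a valid xsd:decimal: {value!r}")
    return problems
```

An empty result means the fetched values satisfy both constraints. A non-empty one could be used to hold the data back in a workflow until someone has reviewed it.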
In Conclusion
You may have already heard the term data fabric – the trend is growing rapidly. Data fabrics are emerging as an approach to help organizations better deal with fast growing sources of data, speed of data evolution, distributed processing and changing requirements for data. For the longest time, each application had its own unique approach to storing and retrieving data, creating data silos that were very hard to cross. Today, we increasingly envision data-centric architectures where data from different sources can be more easily re-used in a variety of contexts and applications. A data fabric crosses different data stores and brings together the right data for the right application.
We have demonstrated a simple example of how you can easily implement a data fabric using TopBraid EDG and standards-based access endpoints. Think of a data fabric as a web stretched over a large network that connects multiple locations, types, and sources of data, both on-premises and in public clouds, using different methods for accessing that data to process, move, manage, and store it within the confines of the fabric. The connected data fabric is based on concepts similar to those underlying the web. It is no wonder that Knowledge Graph technology, built on the same standards and principles as the web, is an excellent fit for implementing data fabrics.