Inferring Data with SHACL Property Value Rules

This document is part of the TopQuadrant GraphQL Technology Pages

This document introduces SHACL property value rules, a proposed extension to the SHACL-AF specification. Property value rules can be used to instruct an engine to dynamically derive (or "infer") property values at query time even if no matching statements have been asserted in the data graph. In TopBraid, these inferred property values can be queried like any other field in GraphQL, and also through SPARQL. The features described here are available from TopBraid 6.1 onwards, and the base features are also exposed via matching SHACL API builds.

Motivation

RDF data and knowledge graphs contain all kinds of statements that have been explicitly entered (or: asserted) by users. Many applications query these graphs to derive additional statements that can be inferred from the asserted statements. For example, assume we have assertions about persons, their gender and a parent relationship:

kennedys:JohnKennedy
	a schema:Person ;
	schema:birthDate "1917-05-29"^^xsd:date ;
	schema:deathDate "1963-11-22"^^xsd:date ;
	schema:gender "male" ;
	schema:givenName "John" ;
	schema:familyName "Kennedy" .

kennedys:CarolineKennedy
	a schema:Person ;
	schema:birthDate "1957-11-27"^^xsd:date ;
	schema:gender "female" ;
	schema:parent kennedys:JohnKennedy .

kennedys:JohnKennedyJr
	a schema:Person ;
	schema:gender "male" ;
	schema:parent kennedys:JohnKennedy .

kennedys:PatrickBKennedy
	a schema:Person ;
	schema:gender "male" ;
	schema:parent kennedys:JohnKennedy .

Common sense defines rules that can be used to derive additional statements from this data:

A person's children are those who have the person as parent
A person's sons are those children that have male gender
A person's siblings are anyone with overlapping parents minus the person itself
A person's full name is the concatenation of given name, a space and the family name
A person's age is the number of years between the current time and the birth date, unless the person is already deceased

So the following statements could be inferred from the sample data and the inference rules above:

kennedys:JohnKennedy
	schema:children kennedys:CarolineKennedy ;
	schema:children kennedys:JohnKennedyJr ;
	schema:children kennedys:PatrickBKennedy ;
	schema:fullName "John Kennedy" ;
	schema:son kennedys:JohnKennedyJr ;
	schema:son kennedys:PatrickBKennedy .

kennedys:CarolineKennedy
	schema:sibling kennedys:JohnKennedyJr ;
	schema:sibling kennedys:PatrickBKennedy ;
	schema:age 60 .  	# in August 2018

...

The designer of a data graph may decide to assert all these statements so that they can be readily queried, yet there are downsides that often make this impractical. Besides the larger storage requirements for the inferred triples, there is a maintenance problem: when data changes, it is not always easy to update all dependent statements. If data changes frequently, or depends on external factors (such as the current date for age in the example), then keeping the inferences correct and up to date becomes a technical nightmare. In those cases, it is far easier to compute these values dynamically, even if the necessary computations cause a performance penalty.

This document introduces a new mechanism that makes it possible to describe inferred values for SHACL property shapes. The new property sh:values can be used to link a property shape with a SHACL node expression that encodes instructions on how the values of the property shall be computed. SHACL node expressions were introduced with the SHACL Advanced Features 1.0 specification. In order to support more use cases, new kinds of node expressions are proposed and described here. See also the updated SHACL Advanced Features 1.1 Community Draft.

Example: Family Relationships

Let's go through the examples introduced above and show how they can be implemented using sh:values. The starting point is a SHACL shape that targets all instances of schema:Person and defines property shapes for the asserted statements:

schema:Person
	a rdfs:Class, sh:NodeShape ;
	sh:property [
		sh:path schema:parent ;
		sh:class schema:Person ;
	] ;
	sh:property [
		sh:path schema:gender ;
		sh:in ( "female" "male" ) ;
		sh:maxCount 1 ;
	] ;
	sh:property [
		sh:path schema:birthDate ;
		sh:datatype xsd:date ;
		sh:maxCount 1 ;
	] ;
	sh:property [
		sh:path schema:deathDate ;
		sh:datatype xsd:date ;
		sh:maxCount 1 ;
	] .

Example 1: Inferring Children using Parents (Inverse Relationships)

It is often convenient to be able to query relationships in both directions. Here, only the schema:parent is actually asserted in the data, yet it would be nice to query schema:children as well. We declare schema:children using a property shape as follows:

schema:Person
	sh:property [
		sh:path schema:children ;
		sh:class schema:Person ;
		sh:values [
			sh:path [
				sh:inversePath schema:parent ;
			]
		]
	] .

The value of sh:values is a blank node with a value for sh:path. This specifies a path expression that represents all values of the specified SHACL path. Here, it's a path consisting of the schema:parent relationship, but walked in the opposite direction. We also use this property shape to tell the system that all values of this inferred property will also be instances of schema:Person. This is necessary to instruct the GraphQL processor to derive a schema from the SHACL shapes, and furthermore could be used to validate data.

Based on this definition, we can now issue a GraphQL query using TopBraid's GraphQL support as follows:

{
  persons(uri: "http://topbraid.org/examples/kennedys#JohnKennedy") {
    label
    children {
      label
    }
  }
}

This produces the following results, computing the values of the children field by walking the schema:parent relationship in the inverse direction.

{
  "data": {
    "persons": [
      {
        "label": "John Kennedy",
        "children": [
          {
            "label": "John Kennedy Jr"
          },
          {
            "label": "Caroline Kennedy"
          },
          {
            "label": "Patrick B. Kennedy"
          }
        ]
      }
    ]
  }
}

It is perfectly fine to use more complex path expressions than this inverse relationship, including deeper traversal of relationships. See the SHACL path syntax for details.
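To make the evaluation in Example 1 more concrete, here is a minimal Python sketch of what walking schema:parent in the inverse direction amounts to. It is not TopBraid's implementation: the triples are copied from the sample data, and the helper names subjects_of and children are made up for this illustration.

```python
# Asserted schema:parent triples from the sample data: (subject, predicate, object).
TRIPLES = {
    ("kennedys:CarolineKennedy", "schema:parent", "kennedys:JohnKennedy"),
    ("kennedys:JohnKennedyJr", "schema:parent", "kennedys:JohnKennedy"),
    ("kennedys:PatrickBKennedy", "schema:parent", "kennedys:JohnKennedy"),
}


def subjects_of(triples, predicate, obj):
    """Inverse step (sh:inversePath): every s with an asserted (s, predicate, obj)."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def children(triples, focus):
    """Hypothetical stand-in for the sh:values rule of schema:children."""
    return subjects_of(triples, "schema:parent", focus)


jfk_children = children(TRIPLES, "kennedys:JohnKennedy")
print(sorted(jfk_children))
```

Nothing is stored for schema:children in this sketch; the values are recomputed from the schema:parent triples each time they are requested, which is the point of a property value rule.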

Example 2: Inferring Sons using Children and Gender (Filter Shapes)

We define schema:son as follows:

schema:Person
	sh:property [
		sh:path schema:son ;
		sh:class schema:Person ;
		sh:description "The son(s) of a person. These values are inferred as the children that have male gender." ;
		sh:name "son" ;
		sh:values [
			sh:nodes [
				sh:path schema:children ;
			] ;
			sh:filterShape [
				sh:property [
					sh:path schema:gender ;
					sh:hasValue "male" ;
				] ;
			] ;
		] ;
	] .

The sh:values node expression is a filter shape expression, consisting of a path expression that fetches all values of schema:children for the current focus node, and a filter shape (defined in SHACL) that these values are validated against. Only the children that have schema:gender "male" are returned as inferred values.

Such rules can also be visualized in diagrams:

Such diagrams illustrate that SHACL node expressions are essentially streams of RDF nodes (here: flowing from left to right), so that the output nodes of one step are the input to the next step. Each of these steps can modify the stream of RDF nodes, for example to filter or transform certain nodes.

Note that the sh:path of the example references schema:children, which is itself an inferred property. When path expressions are used, TopBraid recursively evaluates the required inferences, allowing rules to be chained together.
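The stream view of Example 2 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the filter shape is reduced to a single sh:hasValue check, the data is copied from the sample graph, and the function names (children, has_value, sons) are hypothetical.

```python
# Asserted triples from the sample data: (subject, predicate, object).
TRIPLES = {
    ("kennedys:CarolineKennedy", "schema:parent", "kennedys:JohnKennedy"),
    ("kennedys:JohnKennedyJr", "schema:parent", "kennedys:JohnKennedy"),
    ("kennedys:PatrickBKennedy", "schema:parent", "kennedys:JohnKennedy"),
    ("kennedys:CarolineKennedy", "schema:gender", "female"),
    ("kennedys:JohnKennedyJr", "schema:gender", "male"),
    ("kennedys:PatrickBKennedy", "schema:gender", "male"),
}


def children(focus):
    """Step 1 (sh:nodes): stream of nodes reached via the inverse of schema:parent.

    schema:children is itself inferred, so this step chains onto another rule.
    """
    for s, p, o in TRIPLES:
        if p == "schema:parent" and o == focus:
            yield s


def has_value(node, predicate, value):
    """Tiny stand-in for a filter shape with one sh:hasValue constraint."""
    return (node, predicate, value) in TRIPLES


def sons(focus):
    """Step 2 (sh:filterShape): pass on only the nodes that conform to the filter."""
    return (n for n in children(focus) if has_value(n, "schema:gender", "male"))


jfk_sons = sorted(sons("kennedys:JohnKennedy"))
print(jfk_sons)
```

Each function consumes the output of the previous one, which mirrors the left-to-right flow of nodes described above.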

We can now issue a GraphQL query to fetch the sons of John Kennedy:

{
  persons(uri: "http://topbraid.org/examples/kennedys#JohnKennedy") {
    label
    son {
      label
    }
  }
}

{
  "data": {
    "persons": [
      {
        "label": "John Kennedy",
        "son": [
          {
            "label": "John Kennedy Jr"
          },
          {
            "label": "Patrick B. Kennedy"
          }
        ]
      }
    ]
  }
}

Filter shapes can be of arbitrary complexity, including any of the rich validation features of SHACL.

Any inferred field can be consistently used in GraphQL queries like any other (asserted) field, including for filtering. Here we ask for all persons who have at least 2 sons:

{
  persons (where: { son: { minCount: 2 } }) {
    label
  }
}

{
  "data": {
    "persons": [
      {
        "label": "John Kennedy"
      }
    ]
  }
}

Example 3: Inferring Siblings (Minus)

Siblings of a person are defined as follows:

schema:Person
	sh:property [
		sh:path schema:sibling ;
		rdfs:comment "The siblings are inferred to be the children of the parents, minus the focus node itself." ;
		sh:class schema:Person ;
		sh:values [
			sh:nodes [
				sh:path ( schema:parent [ sh:inversePath schema:parent ] )     # schema:parent/^schema:parent 
			] ;
			sh:minus sh:this ;
		] ;
	] .

The node expression at sh:values above uses a path expression that first walks up to the parents of the focus person and then walks down again to their children. This yields all persons that share at least one parent with the focus person, including the focus person itself. The minus expression then removes the focus person from the results.

{
  persons(uri: "http://topbraid.org/examples/kennedys#JohnKennedyJr") {
    label
    sibling {
      label
    }
  }
}

{
  "data": {
    "persons": [
      {
        "label": "John Kennedy Jr",
        "sibling": [
          {
            "label": "Patrick B. Kennedy"
          },
          {
            "label": "Caroline Kennedy"
          }
        ]
      }
    ]
  }
}

Actually, let's make this more interesting, and use the inferred field as a filter in the query:

{
	persons (where: { sibling: { hasValue: "http://topbraid.org/examples/kennedys#JohnKennedyJr"}}) {
		label
	}
}

This produces all persons who have John Kennedy Jr as one of their siblings:

{
  "data": {
    "persons": [
      {
        "label": "Caroline Kennedy"
      },
      {
        "label": "Patrick B. Kennedy"
      }
    ]
  }
}
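The sibling rule of Example 3 combines a two-step path with a set difference. The following Python sketch shows that combination on the sample data; it is an illustration only, and the helper names step and siblings are assumptions of this sketch rather than part of any API.

```python
# Asserted schema:parent triples from the sample data.
TRIPLES = {
    ("kennedys:CarolineKennedy", "schema:parent", "kennedys:JohnKennedy"),
    ("kennedys:JohnKennedyJr", "schema:parent", "kennedys:JohnKennedy"),
    ("kennedys:PatrickBKennedy", "schema:parent", "kennedys:JohnKennedy"),
}


def step(nodes, predicate, inverse=False):
    """Follow one predicate from every node in `nodes`, forwards or inverted."""
    if inverse:
        return {s for s, p, o in TRIPLES if p == predicate and o in nodes}
    return {o for s, p, o in TRIPLES if p == predicate and s in nodes}


def siblings(focus):
    """Sketch of schema:parent/^schema:parent followed by sh:minus sh:this."""
    parents = step({focus}, "schema:parent")
    same_parents = step(parents, "schema:parent", inverse=True)
    return same_parents - {focus}  # the focus node would otherwise be its own sibling


print(sorted(siblings("kennedys:JohnKennedyJr")))
```

Without the final set difference, the focus person would always appear among its own siblings, because walking up to a parent and back down necessarily returns to the starting node.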

Example 4: String Operations (Functions)

For the sake of this example, the full name of a person is defined as the concatenation of given/first name and family/last name, with a space in between. SHACL node expressions can call functions, including SPARQL functions. Among the built-in SPARQL functions (see the sparql: namespace) is the CONCAT operation that we can use here:

schema:Person
	sh:property [
		a sh:PropertyShape ;
		sh:path schema:fullName ;
		sh:name "full name" ;
		sh:datatype xsd:string ;
		sh:description "A person's full name, consisting of given name and family name, separated by a space." ;
		sh:maxCount 1 ;
		sh:values [
			sparql:concat (
				[ sh:path schema:givenName ]
				" "
				[ sh:path schema:familyName ]
			) ;
		] ;
	] .
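As a rough mental model for the function expression above, the Python sketch below evaluates each argument to a set of values and calls the function once per combination. That per-combination behaviour is an assumption of this sketch, not something stated in this document, and the names path_values, constant, concat_expression and full_name are invented for illustration.

```python
from itertools import product

# Name parts from the sample data; Caroline has none asserted there.
DATA = {
    "kennedys:JohnKennedy": {
        "schema:givenName": {"John"},
        "schema:familyName": {"Kennedy"},
    },
    "kennedys:CarolineKennedy": {},
}


def path_values(focus, predicate):
    """Path expression: the asserted values of `predicate` at `focus`."""
    return DATA.get(focus, {}).get(predicate, set())


def constant(value):
    """Constant expression: always yields exactly one value."""
    return {value}


def concat_expression(argument_values):
    """Concatenate once per combination of argument values (assumed semantics)."""
    return {"".join(combo) for combo in product(*argument_values)}


def full_name(focus):
    return concat_expression([
        path_values(focus, "schema:givenName"),
        constant(" "),
        path_values(focus, "schema:familyName"),
    ])


print(full_name("kennedys:JohnKennedy"))
```

Under this model, a person with no given name or no family name gets no full name at all, because there is no combination of arguments to call the function with.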

Here is a screenshot from TopBraid EDG 6.1 illustrating a possible visualization of this rule:

Example 5: Inferring the Age (SPARQL queries)

The age of a person can be computed dynamically, using the current date and the person's date of birth. As this is a reasonably complex operation, we resort to SPARQL to implement it. SPARQL includes a NOW() operation that delivers the current time stamp, and in the example below the TopBraid function spif:timeMillis() is used to convert time stamps into milliseconds.

schema:Person
	sh:property [
		sh:path schema:age ;
		sh:datatype xsd:integer ;
		sh:description "A person's age derived from the current date and the given birth date. No value if the person is already deceased." ;
		sh:maxCount 1 ;
		sh:name "age" ;
		sh:values [
			sh:prefixes <http://topbraid.org/examples/schemashacl> ;
			sh:select """
				SELECT ?age
				WHERE {
					$this schema:birthDate ?birthDate .
					FILTER NOT EXISTS { $this schema:deathDate ?any }
					BIND (365 * 24 * 60 * 60 * 1000 AS ?msPerYear) .
					BIND (spif:timeMillis(NOW()) - spif:timeMillis(?birthDate) AS ?ms)
					BIND (xsd:integer(floor(?ms / ?msPerYear)) AS ?age)
				}""" ;
		] ;
	] .

We can use this field to ask for all persons younger than 70:

{
  persons (where: {age: {maxExclusive: 70}}) {
    label
    age
  }
}

In our example this only delivers one match:

{
  "data": {
    "persons": [
      {
        "label": "Caroline Kennedy",
        "age": 60
      }
    ]
  }
}
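The arithmetic of the age rule can be checked outside of SPARQL. The Python sketch below mirrors the query under two stated assumptions: dates are handled as whole days instead of the millisecond values returned by spif:timeMillis, and the current date is passed in as a parameter instead of calling NOW(). The function names are hypothetical, and calendar_age is a separate comparison technique that is not part of the original rule.

```python
from datetime import date

MS_PER_DAY = 24 * 60 * 60 * 1000
MS_PER_YEAR = 365 * MS_PER_DAY  # same constant as ?msPerYear in the query


def age(birth, death, today):
    """Mirror of the SELECT: no value without a birth date or with a death date."""
    if birth is None or death is not None:
        return None
    ms = (today - birth).days * MS_PER_DAY
    return ms // MS_PER_YEAR  # floor, like xsd:integer(floor(...))


def calendar_age(birth, today):
    """Comparison only: age counted by birthdays actually reached."""
    return today.year - birth.year - ((today.month, today.day) < (birth.month, birth.day))


caroline = date(1957, 11, 27)
print(age(caroline, None, date(2018, 8, 15)))
print(age(date(1917, 5, 29), date(1963, 11, 22), date(2018, 8, 15)))
```

Because the rule treats every year as 365 days, leap days accumulate: for a date such as 13 November 2017 the sketch already returns 60 for Caroline, while calendar_age still returns 59. If that matters for an application, a birthday-based comparison is the safer choice.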

In general, SPARQL expressions can be used to further process the results of any other node expression, i.e. it is possible to chain together various node expressions and then use SPARQL to modify them. Details are found under SPARQL SELECT expressions and SPARQL ASK expressions. Users should of course be considerate of potential performance pitfalls, since SPARQL queries may need to be executed many times before results are produced.

FWIW, the above example may also be expressed without SPARQL syntax, but using SHACL node expressions instead. The computation here is quite complex, so I skip the source code of how to do that. The following image gives you an idea :)

Example: Shapes as Database Views

Let's assume we have two databases: one with FOAF Persons, and another with schema.org Persons. Here is some sample data:

db1:KlausSchulze
	a foaf:Person ;
	foaf:firstName "Klaus" ;
	foaf:surname "Schulze" .
	
db2:SteveRoach
	a schema:Person ;
	schema:givenName "Steve" ;
	schema:familyName "Roach" .

However, our application is about customer management and would like to pretend that the data had the following shape instead:

db1:KlausSchulze
	ex:firstName "Klaus" ;
	ex:lastName "Schulze" ;
	ex:fullName "Klaus Schulze" .
	
db2:SteveRoach
	ex:firstName "Steve" ;
	ex:lastName "Roach" ;
	ex:fullName "Steve Roach" .

Using SHACL property value rules, we can create the second data structure as a virtual view on the data without moving data around. We define a node shape that targets all instances of foaf:Person and schema:Person, and define the properties that we want to expose, including the sh:values rules to compute them when queried:

ex:Customer
	a sh:NodeShape ;
	rdfs:label "Customer" ;
	sh:targetClass foaf:Person ;
	sh:targetClass schema:Person ;
	sh:property [
		a sh:PropertyShape ;
		sh:path ex:firstName ;
		sh:name "first name" ;
		sh:description "The first name, based either on foaf:firstName or schema:givenName." ;
		sh:datatype xsd:string ;
		sh:maxCount 1 ;
		sh:values [
			sh:path foaf:firstName ;
		] ;
		sh:values [
			sh:path schema:givenName ;
		] ;
	] ;
	sh:property [
		a sh:PropertyShape ;
		sh:path ex:lastName ;
		sh:name "last name" ;
		sh:description "The last name, based either on foaf:surname or schema:familyName." ;
		sh:datatype xsd:string ;
		sh:maxCount 1 ;
		sh:values [
			sh:path foaf:surname ;
		] ;
		sh:values [
			sh:path schema:familyName ;
		] ;
	] ;
	sh:property [
		a sh:PropertyShape ;
		sh:path ex:fullName ;
		sh:name "full name" ;
		sh:description "The full name, consisting of first name and last name, separated by a space." ;
		sh:datatype xsd:string ;
		sh:maxCount 1 ;
		sh:values [
			sparql:concat ( [ sh:path ex:firstName ] " " [ sh:path ex:lastName  ] ) 
		] ;
	] .

Note that some properties can have multiple sh:values expressions, and the resulting triples are the union of them all.
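The union behaviour can be sketched as follows. This Python illustration uses the two sample customers; the RULES table, which lists the source paths for each exposed property, and the function names values and full_name are assumptions made for this sketch.

```python
# Sample data from the two databases, as predicate -> set of values.
DATA = {
    "db1:KlausSchulze": {"foaf:firstName": {"Klaus"}, "foaf:surname": {"Schulze"}},
    "db2:SteveRoach": {"schema:givenName": {"Steve"}, "schema:familyName": {"Roach"}},
}

# One entry per sh:values expression of the ex:Customer shape (path expressions only).
RULES = {
    "ex:firstName": ["foaf:firstName", "schema:givenName"],
    "ex:lastName": ["foaf:surname", "schema:familyName"],
}


def values(focus, prop):
    """Union of the results of all sh:values expressions declared for `prop`."""
    result = set()
    for source_path in RULES[prop]:
        result |= DATA.get(focus, {}).get(source_path, set())
    return result


def full_name(focus):
    """ex:fullName builds on the inferred ex:firstName and ex:lastName."""
    return {
        f"{first} {last}"
        for first in values(focus, "ex:firstName")
        for last in values(focus, "ex:lastName")
    }


for customer in sorted(DATA):
    print(customer, full_name(customer))
```

For each customer only one of the two source paths has a value, so the union contains a single name and the sh:maxCount 1 declaration still holds.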

Using TopBraid's GraphQL support, we can now issue this query:

{
  customers {
    uri
    fullName
  }
}

TopBraid produces this JSON output:

{
  "data": {
    "customers": [
      {
        "uri": "http://example.org/db2#SteveRoach",
        "fullName": "Steve Roach"
      },
      {
        "uri": "http://example.org/db1#KlausSchulze",
        "fullName": "Klaus Schulze"
      }
    ]
  }
}

Use of Inferred Values using SPARQL

TopBraid includes a magic property (aka property function) tosh:values that can be used to fetch inferred values, or to check whether a given focus node has certain inferred values for a given predicate. Here is an example query:

SELECT *
WHERE {
	?person a schema:Person .
	(?person schema:age) tosh:values ?age .
}

Note that this magic property can only be used to derive the right-hand value from the left-hand values, not vice versa. So the caller needs to make sure that both variables on the left-hand side are bound when tosh:values is evaluated. This magic property makes property value rules available to any SPARQL-based technology in the TopBraid platform, including SWP, SPARQLMotion, SPIN and SHACL-SPARQL itself.

tosh:values falls back to any declared sh:defaultValue if no other value exists for the focus node and predicate.

We are currently evaluating whether this SPARQL integration should also apply directly to every use of an inferred property in a SPARQL query. For example, the following would then also work:

SELECT *
WHERE {
	?person a schema:Person .
	?person schema:age ?age .
}

We welcome feedback on whether TopBraid should support this syntax in SPARQL or whether tosh:values is sufficient.

Use of Inferred Values in TopBraid EDG

TopBraid Enterprise Data Governance (EDG) is an agile data governance solution for today's dynamic enterprises. Among many other features, it provides an editing and browsing environment to manage metadata about data assets such as databases, database tables and database columns. The data model behind these capabilities is built around SHACL - for example it contains a type shape edg:DatabaseTable with a property edg:tableOf and a type shape edg:DatabaseColumn with a property edg:columnOf. The following screenshot shows how we are using SHACL inferences to derive all kinds of additional information for users and software agents:

In the screenshot, the inferred values are marked with a blue label (inferred). Here is the definition of "number of tables":

edg:Database
	sh:property edg:Database-tableCount .

edg:Database-tableCount
	a sh:PropertyShape ;
	sh:path edg:tableCount ;
	sh:datatype xsd:integer ;
	sh:description "The number of tables in this database, automatically computed." ;
	sh:group edg:StatisticsPropertyGroup ;
	sh:maxCount 1 ;
	sh:name "number of tables" ;
	sh:values [
		sh:count [
			sh:path [
				sh:inversePath edg:tableOf ;
			] ;
		] ;
	] .

Using TopBraid's GraphQL service, this data can be easily queried:

Here is a more complex example, computing the total number of columns across all tables and views associated with a database.

edg:Database
	sh:property edg:Database-totalColumnCount .

edg:Database-totalColumnCount
	a sh:PropertyShape ;
	sh:path edg:totalColumnCount ;
	sh:datatype xsd:integer ;
	sh:description "The number of overall columns in this database, automatically computed." ;
	sh:group edg:StatisticsPropertyGroup ;
	sh:maxCount 1 ;
	sh:name "total number of columns" ;
	sh:order 10 ;
	sh:values [
		sh:count [
			sh:path (
				[
					sh:alternativePath ( [ sh:inversePath edg:tableOf ] [ sh:inversePath edg:viewOf ] )
				]
				[
					sh:inversePath edg:columnOf ;
				]
			) ;
		] ;
	] .
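The two count rules can be illustrated with a small Python sketch. The database, table, view and column names below are invented for this example, as are the helper names; only the predicates edg:tableOf, edg:viewOf and edg:columnOf come from the definitions above. The sketch counts distinct nodes, which is an assumption about how the counted values are collected.

```python
# Hypothetical metadata triples: (subject, predicate, object).
TRIPLES = {
    ("ex:Orders", "edg:tableOf", "ex:Northwind"),
    ("ex:Customers", "edg:tableOf", "ex:Northwind"),
    ("ex:OrderSummary", "edg:viewOf", "ex:Northwind"),
    ("ex:Orders_id", "edg:columnOf", "ex:Orders"),
    ("ex:Orders_date", "edg:columnOf", "ex:Orders"),
    ("ex:Customers_id", "edg:columnOf", "ex:Customers"),
    ("ex:OrderSummary_id", "edg:columnOf", "ex:OrderSummary"),
    ("ex:OrderSummary_total", "edg:columnOf", "ex:OrderSummary"),
}


def inverse(nodes, predicate):
    """sh:inversePath: all subjects pointing at any node in `nodes` via `predicate`."""
    return {s for s, p, o in TRIPLES if p == predicate and o in nodes}


def table_count(database):
    """sh:count over ^edg:tableOf."""
    return len(inverse({database}, "edg:tableOf"))


def total_column_count(database):
    """sh:count over (^edg:tableOf | ^edg:viewOf) / ^edg:columnOf."""
    tables_and_views = inverse({database}, "edg:tableOf") | inverse({database}, "edg:viewOf")
    return len(inverse(tables_and_views, "edg:columnOf"))


print(table_count("ex:Northwind"), total_column_count("ex:Northwind"))
```

The alternative path corresponds to the set union of tables and views, and the sequence path corresponds to feeding that union into the next inverse step.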

The new generation of form displays in TopBraid also provides a SHACL-based widget to display multiple resources in tabular form, see the Overview section in the screenshot. To produce such tables, define a SHACL shape with properties for each column that you want to render:

edg:DatabaseTableSummary
	a sh:NodeShape ;
	sh:targetClass edg:DatabaseTable ;
	rdfs:comment "A shape that can be applied to DatabaseTables to provide a summary view." ;
	rdfs:label "Database table summary" ;
	sh:property [
		a sh:PropertyShape ;
		sh:path edg:name ;
		sh:datatype xsd:string ;
		sh:maxCount 1 ;
		sh:minCount 1 ;
		sh:name "name" ;
		sh:order 0 ;
	] ;
	sh:property [
		a sh:PropertyShape ;
		sh:path edg:columnCount ;
		sh:datatype xsd:integer ;
		sh:description "The number of columns, inferred from columnOf triples." ;
		sh:maxCount 1 ;
		sh:name "column count" ;
		sh:order 1 ;
		sh:values [
			sh:count [
				sh:path [
					sh:inversePath edg:columnOf ;
				] ;
			] ;
		] ;
	] ;
	sh:property [
		a sh:PropertyShape ;
		sh:path edg:recordCount ;
		sh:datatype xsd:integer ;
		sh:description "The number of records." ;
		sh:maxCount 1 ;
		sh:name "record count" ;
		sh:order 2 ;
	] .

Such shapes can be edited with the EDG Ontology Editor or with TopBraid Composer, or with any similar tool. To instruct the system to display these summary values in an HTML table, use the following:

edg:Database
	sh:property edg:Database-tableSummary .

edg:Database-tableSummary
	a sh:PropertyShape ;
	sh:path edg:tableSummary ;
	tosh:viewWidget swa:SummaryTableViewer ;
	sh:description "The tables in this database as summaries, automatically computed." ;
	sh:group edg:OverviewPropertyGroup ;
	sh:name "table summary" ;
	sh:node edg:DatabaseTableSummary ;
	sh:values [
		sh:path [
			sh:inversePath edg:tableOf ;
		] ;
	] .

As shown above, the property tosh:viewWidget provides UI metadata that is used by TopBraid and potentially other tools. The sh:node edg:DatabaseTableSummary statement selects the shape that declares the columns that shall be used, and their order. From there, the system can collect all relevant information. For example, it can understand that certain properties always return xsd:integer values, which instructs the tabular display to right-align the values. Note that in the table above, not only are the values of edg:columnCount computed on the fly, but the rows of the table themselves are inferred as well. So SHACL can be used to define views on data that is stored in RDF, for reporting and analytical purposes.

As this little table is also driven by GraphQL, software agents and users can use the same shape definitions to run queries:

{
  databases {
    label
    largeTables: tableSummary (orderBy: columnCount, orderByDesc: true, where: { columnCount: { minInclusive: 10 } }) {
      label
      columnCount
    }
  }
}

This returns all databases and, for each database, the tables that have at least 10 columns, ordered by column count in descending order.

{
  "data": {
    "databases": [
      {
        "label": "NORTHWIND",
        "largeTables": [
          {
            "label": "DBO.EMPLOYEES (NORTHWIND)",
            "columnCount": 18
          },
          {
            "label": "DBO.ORDERS (NORTHWIND)",
            "columnCount": 14
          },
          {
            "label": "DBO.SUPPLIERS (NORTHWIND)",
            "columnCount": 12
          },
          {
            "label": "DBO.CUSTOMERS (NORTHWIND)",
            "columnCount": 11
          },
          {
            "label": "DBO.PRODUCTS (NORTHWIND)",
            "columnCount": 10
          }
        ]
      }
    ]
  }
}

Note that the GraphQL service would also derive values based on sh:defaultValue if no other value exists for a field. In TopBraid, the values of sh:defaultValue may be node expressions too.

Caveats and Pitfalls

Note that on-the-fly inferences are only visible in certain circumstances; you cannot simply query them as you would normal triples. Currently, they are exposed only through GraphQL fields, and when a path node expression is used as part of a property value rule and the sh:path of that node expression is an IRI. In other words, path expressions such as skos:narrower* are not supported at this stage (this is potential future work).

The inferences are not exposed in SPARQL triple matches or similar technology, unless a SHACL inferencing engine has been executed beforehand. In TopBraid Composer, press the Run Inferences button to materialize the inferences. In TopBraid EDG, use Transform > Execute Rules.

There is also a current limitation in how the system selects which property value rules are executed for a given focus node: The system selects node shapes that have property shapes with sh:values based on the rdf:type of the focus node. It will look for any non-deactivated node shape that is either a class and has the focus node as instance, or that has a sh:targetClass matching one of the types of the focus node. Other types of targets, including user-defined targets, sh:targetSubjectsOf, sh:targetObjectsOf and sh:targetNode, are not supported at this stage due to the potential performance impact that they might have. This may be improved in future versions.
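The selection logic described above can be sketched in Python as follows. This is a simplified illustration, not TopBraid's code: the NodeShape record and the applicable_shapes function are hypothetical, subclass reasoning is ignored, and the shapes ex:Archived and ex:OnlyJohn are invented to show the two cases that are skipped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeShape:
    iri: str
    is_class: bool = False                      # shape is also an rdfs:Class
    target_classes: frozenset = frozenset()     # values of sh:targetClass
    deactivated: bool = False                   # sh:deactivated true


def applicable_shapes(shapes, focus_types):
    """Return the IRIs of shapes whose rules would be considered for a focus node.

    `focus_types` holds the rdf:type values of the focus node.
    """
    selected = []
    for shape in shapes:
        if shape.deactivated:
            continue
        implicit_target = shape.is_class and shape.iri in focus_types
        class_target = bool(shape.target_classes & focus_types)
        if implicit_target or class_target:
            selected.append(shape.iri)
    return selected


SHAPES = [
    NodeShape("schema:Person", is_class=True),
    NodeShape("ex:Customer", target_classes=frozenset({"foaf:Person", "schema:Person"})),
    NodeShape("ex:Archived", target_classes=frozenset({"schema:Person"}), deactivated=True),
    NodeShape("ex:OnlyJohn"),  # stands for a shape that only uses sh:targetNode
]

print(applicable_shapes(SHAPES, frozenset({"schema:Person"})))
```

A shape that relies only on sh:targetNode, sh:targetSubjectsOf, sh:targetObjectsOf or a user-defined target never matches in this sketch, which corresponds to the limitation stated above.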