SPARQL Best Practices

Writing efficient SPARQL queries for TopBraid requires a good understanding of both the data model and how TopBraid executes SPARQL queries. TopBraid bundles the Apache Jena SPARQL engine, which does not optimize queries automatically, so the onus is on the query author to follow the best practices outlined below.

How SPARQL queries are processed

The WHERE clause of a SPARQL query produces bindings of variables. In the following query, the variable ?continent will be bound to all instances of the class Continent.

SELECT ?continent
WHERE {
    ?continent a g:Continent .
}

The row in the WHERE clause above is a triple pattern. A sequence of one or more triple patterns forms a Basic Graph Pattern, which is matched against the triples found in the current query graph. In general, a query is executed from top to bottom, so the variable bindings produced by one row are used as input to the next row.

Order of Basic Graph Patterns

The most important rule of writing efficient SPARQL queries is to eliminate combinations of variables as early as possible. The following (bad) query will return all continents together with their labels:

SELECT ?continent ?label
WHERE {
    ?continent skos:prefLabel ?label .
    ?continent a g:Continent .
}

When the above query is executed, the engine will first walk through all triples that have skos:prefLabel as predicate and only then combine those triple matches with the next row, which checks whether the bindings of ?continent are in fact instances of g:Continent.

This is very inefficient because there may be hundreds of thousands of triples with skos:prefLabel but only a handful of continents. The engine would do a lot of extra work that could be eliminated by reordering the clauses to the following (better) query:

SELECT ?continent ?label
WHERE {
    ?continent a g:Continent .
    ?continent skos:prefLabel ?label .
}

The difference is that here the engine first finds the (few) instances of g:Continent and then looks up labels only for those.
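
One way to check which pattern is the more selective one before deciding on an order is to count the matches of each pattern on its own. The following is a minimal sketch using standard SPARQL 1.1 aggregates and subqueries. It assumes the same g: and skos: prefixes as the queries above, and the variable names ?labelCount and ?continentCount are only illustrative:

SELECT ?labelCount ?continentCount
WHERE {
    # Number of triples that match the label pattern
    { SELECT (COUNT(*) AS ?labelCount) WHERE { ?s skos:prefLabel ?o . } }
    # Number of triples that match the type pattern
    { SELECT (COUNT(*) AS ?continentCount) WHERE { ?c a g:Continent . } }
}

The pattern with the smaller count is usually the better candidate to place first.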

FILTER Placement

TopBraid’s SPARQL engine will usually automatically move FILTER clauses to the end of their surrounding { … } block. For example, the following FILTER clause will be executed after all basic graph patterns have been processed:

SELECT ?continent ?label ?country
WHERE {
    ?continent a g:Continent .
    ?continent skos:prefLabel ?label .
    FILTER langMatches(lang(?label), 'en') .
    ?country skos:broader ?continent .
}

The “real” execution order of the query above will be as follows:

SELECT ?continent ?label ?country
WHERE {
    ?continent a g:Continent .
    ?continent skos:prefLabel ?label .
    ?country skos:broader ?continent .
    FILTER langMatches(lang(?label), 'en') .
}

As a result, the line that fetches the skos:broader matches will be executed even if the language of the label is not English. In practice this means that the engine would do a lot of unnecessary work with triples that will later be filtered out anyway.

To avoid this, you can introduce extra {…} blocks to make sure that the FILTER is executed earlier:

SELECT ?continent ?label ?country
WHERE {
    {
        ?continent a g:Continent .
        ?continent skos:prefLabel ?label .
        FILTER langMatches(lang(?label), 'en') .
    }
    ?country skos:broader ?continent .
}

Hint

It is important to understand that SPARQL is executed “from the inside out”. This means that the engine will first execute the inner {…} blocks.
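
The same rule can also work against you. In the following sketch (same assumed prefixes as in the queries above), the inner block is evaluated on its own. If blocks are processed from the inside out as described, it collects every English skos:prefLabel in the graph before the result is joined with the continents:

SELECT ?continent ?label
WHERE {
    ?continent a g:Continent .
    {
        # Evaluated independently of the outer pattern,
        # so this matches English labels of all resources
        ?continent skos:prefLabel ?label .
        FILTER langMatches(lang(?label), 'en') .
    }
}

Keeping the restrictive g:Continent pattern inside the same block as the label pattern, as in the last query of the FILTER Placement section, avoids this.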