Bash Datalog:
Answering Datalog Queries
with Unix Shell Commands
Thomas Rebele, Thomas Pellissier Tanon, and Fabian Suchanek
Whom did Elvis Presley influence?
?
influenced
2
3
Triple store
?
influenced
Whom did Elvis Presley influence?
4
Triple store
Samuel Hui
Carlo Wolf
...
?
influenced
Whom did Elvis Presley influence?
Triple store
5
?
influenced
Samuel Hui
Carlo Wolf
...
Whom did Elvis Presley influence?
Triple store
6
> grep facts.nt
> cut
> sort | uniq
Bash
?
influenced
Samuel Hui
Carlo Wolf
...
Whom did Elvis Presley influence?
When Bash makes sense
7
Using Bash makes sense when
only a single query is executed
on the KB.
Examples:
• checking whether a KB is useful for a goal
(e.g.: does the KB contain Elvis at all?)
• filtering the data (e.g., when I am
interested only in the data about people)
• preprocessing the data
(removing the 4th column from quads)
• single‐use queries (e.g., debugging a KB)
Bash
> grep facts.nt
> cut
> sort | uniq
Scenario
8
Assume that we want to execute
only a single query
.
Query (SPARQL
or Datalog)
Result file
Data (one or
several n‐triples
or TSV files)
Bash
?
influenced
Samuel Hui
Carlo Wolf
...
> grep facts.nt
> cut
> sort | uniq
Our goal
9
Query (SPARQL
or Datalog)
Bash
Our goal is to translate the query automatically to Bash commands
?
influenced
> grep facts.nt
> cut
> sort | uniq
Step 1: SPARQL to Datalog
10
SELECT ?a WHERE {
<Elvis> <influenced>+ ?a .
}
facts(S,P,O) :~ cat facts.tsv
influenced(X, Y) :‐ facts(X, "<influenced>", Y).
influenced(X, Z) :‐ influenced(X, Y), influenced(Y, Z).
main(X) :‐ influenced("<Elvis_Presley>", X).
SPARQL Query
Datalog
Automated translation (as in RDFox)
Step 1: SPARQL to Datalog
11
Datalog is our native input query language. We support
• arbitrary arity
• joins, disjunctions, selections, projections, unions
• recursions
• stratified negation
•
arbitrary Unix commands that produce tab‐separated data
facts(S,P,O) :~
cat facts.tsv
influenced(X, Y) :‐ facts(X, "<influenced>", Y).
influenced(X, Z) :‐ influenced(X, Y), influenced(Y, Z).
main(X) :‐ influenced("<Elvis_Presley>", X).
Datalog
Step 1: SPARQL to Datalog
12
Datalog is our native input query language. We support
• arbitrary arity
• joins, disjunctions, selections, projections, unions
• recursions
• stratified negation
• arbitrary Unix commands that produce tab‐separated data
•
n‐triples files
facts(S,P,O) :~
ntriples facts.nt
influenced(X, Y) :‐ facts(X, "<influenced>", Y).
influenced(X, Z) :‐ influenced(X, Y), influenced(Y, Z).
main(X) :‐ influenced("<Elvis_Presley>", X).
Datalog
Step 2: Datalog to Relational Algebra
13
facts(S,P,O) :~ cat facts.tsv
influenced(X, Y) :‐ facts(X, "<influenced>", Y).
influenced(X, Z) :‐ influenced(X, Y), influenced(Y, Z).
main(X) :‐ influenced("<Elvis_Presley>", X).
recursion
start
recursion
step
14
facts(S,P,O) :~ cat facts.tsv
influenced(X, Y) :‐ facts(X, "<influenced>", Y).
influenced(X, Z) :‐ influenced(X, Y), influenced(Y, Z).
main(X) :‐ influenced("<Elvis_Presley>", X).
Step 3: Optimization
A
B
C
A
D
15
facts(S,P,O) :~ cat facts.tsv
influenced(X, Y) :‐ facts(X, "<influenced>", Y).
influenced(X, Z) :‐ influenced(X, Y), influenced(Y, Z).
main(X) :‐ influenced("<Elvis_Presley>", X).
Step 3: Optimization&Materialization
A
D
Step 4: Translation to Bash
16
awk $1=Elvis $2=influenced {print $3}
Step 4: Translation to Bash
17
awk $2=influenced {print $1 TAB $3}
Step 4: Translation to Bash
18
awk $2=influenced {print $1 TAB $3}
> m.tsv
join ...
m.tsv
>lock
Step 4: Translation to Bash
19
mkfifo lock
(
awk $2=influenced ...
> m.tsv
exec 3> lock; exec 3> &-)&
...
cat lock
join ...
m.tsv
# Create a lock
# Materialize in parallel
# Free lock
# Try to read from lock
# Use the materialized file
Step 4: Translation to Bash
green stuff > result.tsv
while
join result.tsv m.tsv > new.tsv
comm -23 new.tsv result.tsv
sort -merge result.tsv new.tsv
[-s new.tsv]; do continue; done
# Compute new results
# Remove old results from new
# Merge new and old
# Loop until no new results
Experiments
21
Competitors:
• Datalog-based systems (DLV, Soufflé, RDFox, BigDatalog)
• Triple stores (Jena, Stardog, Virtuoso)
• Database management systems (MonetDB, Postgres)
• Others (RDF Slice, NoDB)
Remember, our goal is to execute
just a single query
.
Therefore, we also count the loading time for our competitors.
If you want to execute several queries, our system is not for you!
Use a different system then!
Experiments on the LUBM benchmark
22
dashed =
cannot answer
all LUBM queries
Bash Datalog
There may be other systems that are faster than us, but: