Benchmarking rdflib parsers

data.ox.ac.uk has a Python frontend, which queries a Fuseki instance over HTTP. When a request for a page comes in it performs a SPARQL query against Fuseki, asking for the results back as RDF/XML. The results are parsed into an in-memory rdflib graph, which can be queried to construct HTML pages or transformed into other formats (other RDF serializations, JSON, RSS, etc.).

In a bid to make things a bit quicker I decided to benchmark some of the rdflib parsers. I timed rdflib.ConjunctiveGraph.parse() ten times for each parser (interleaved) over 100,000 triples. Here are the results:

FORMAT  MIN      AVG      MAX      SD
   xml  42.0758  42.7914  43.0649  0.2903
    n3  32.6210  32.7803  33.4146  0.2188
    nt  14.4455  14.7031  15.3278  0.2745

This isn’t a perfect benchmark, as my work box was doing who-knows-what at the same time, but things should have evened out enough for a comparative analysis. It’s quite clear that the N-Triples parser is about three times faster than the expat-based RDF/XML parser. On that basis I’m going to make the data.ox.ac.uk frontend request data as N-Triples; hopefully it’ll have a noticeable effect on response times. I am slightly shocked that the RDF/XML parser only manages an average of around 2,300 triples per second (100,000 triples in ~42.8 seconds).

Here’s the hacky code I used:

import collections
import math
import time

import rdflib

files = (('test.rdf', 'xml'),
         ('test.ttl', 'n3'),
         ('test.nt', 'nt'))

scores = collections.defaultdict(list)

# Ten interleaved runs per parser, timing only the parse itself.
for i in range(10):
    for name, format in files:
        g = rdflib.ConjunctiveGraph()
        with open(name) as f:
            start = time.time()
            g.parse(f, format=format)
            end = time.time()
        scores[format].append(end - start)
        print('%2i %3s %6.4f' % (i, format, end - start))

print(scores)

print('FORMAT  MIN      AVG      MAX      SD')
for format, ss in scores.items():
    avg = sum(ss) / len(ss)
    sd = math.sqrt(sum((s - avg) ** 2 for s in ss) / len(ss))
    print('%6s  %7.4f  %7.4f  %7.4f  %6.4f' % (format, min(ss), avg, max(ss), sd))
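As a sanity check on the throughput figures, the averages in the table above imply roughly the following triples-per-second rates (a quick back-of-the-envelope calculation, assuming the 100,000-triple input):

```python
# Average parse times (seconds) from the table above, for 100,000 triples.
averages = (('xml', 42.7914), ('n3', 32.7803), ('nt', 14.7031))

for fmt, avg in averages:
    print('%3s  ~%d triples/s' % (fmt, 100000 / avg))
```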

One Response to “Benchmarking rdflib parsers”

  1. TimP says:

    I’m shocked that you are shocked that XML parsing is slow!

    My view is that all code is shonky. The shonkiness of a piece of code is the product of the shonkiness of its components. Hence the need to keep the number of elements as small as possible.

    XML is too big. XML is always bigger than you need for a particular task. If you use XML for anything there is something unnecessary, unneeded and unused slowing you down.
