data.ox.ac.uk has a Python frontend which queries a Fuseki instance over HTTP. When a request for a page comes in, the frontend performs a SPARQL query against Fuseki, asking for the results back as RDF/XML. These are then parsed into an in-memory rdflib graph, which can be queried to construct HTML pages, or transformed into other formats (other RDF serializations, JSON, RSS, etc.).
In a bid to make things a bit quicker I decided to benchmark some of the rdflib parsers. I timed rdflib.ConjunctiveGraph.parse() ten times for each parser (interleaved) over 100,000 triples. Here are the results:
```
FORMAT     MIN     AVG     MAX      SD
   xml 42.0758 42.7914 43.0649  0.2903
    n3 32.6210 32.7803 33.4146  0.2188
    nt 14.4455 14.7031 15.3278  0.2745
```
This isn’t a perfect benchmark, as my work box was doing who-knows-what at the same time, but things should have evened out enough for a comparative analysis. It’s quite clear that the N-Triples parser is about three times faster than the expat-based RDF/XML parser. On the basis of this I’m going to make the data.ox.ac.uk frontend request data as N-Triples; hopefully it’ll noticeably improve response times. I am slightly shocked that the RDF/XML parser only manages an average of around 2,340 triples per second.
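Switching the frontend over is mostly a matter of content negotiation. Here's a rough sketch — the endpoint URL is a hypothetical local Fuseki instance, not the actual data.ox.ac.uk configuration, and N-Triples was conventionally served as text/plain (newer servers also understand application/n-triples):

```python
import urllib.parse
import urllib.request

# Hypothetical local Fuseki endpoint; the real deployment's URL differs.
query = 'CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 10'
url = ('http://localhost:3030/dataset/query?query='
       + urllib.parse.quote(query))

# Ask for N-Triples instead of RDF/XML via the Accept header.
request = urllib.request.Request(url, headers={'Accept': 'text/plain'})

# The response could then be fed straight to the faster parser:
# response = urllib.request.urlopen(request)
# g = rdflib.ConjunctiveGraph()
# g.parse(response, format='nt')
```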
Here’s the hacky code I used:
```python
import collections
import math
import rdflib
import time

files = (('test.rdf', 'xml'), ('test.ttl', 'n3'), ('test.nt', 'nt'))
scores = collections.defaultdict(list)

for i in range(10):
    for name, format in files:
        g = rdflib.ConjunctiveGraph()
        with open(name) as f:
            start = time.time()
            g.parse(f, format=format)
            end = time.time()
        scores[format].append(end - start)
        print('%2i %3s %6.4f' % (i, format, end - start))

print(scores)

print('FORMAT     MIN     AVG     MAX      SD')
for format, ss in scores.items():
    avg = sum(ss) / len(ss)
    sd = math.sqrt(sum((s - avg) ** 2 for s in ss) / len(ss))
    print('%6s %6.4f %6.4f %6.4f %6.4f' % (format, min(ss), avg, max(ss), sd))
```