Benchmarking rdflib parsers

data.ox.ac.uk has a Python frontend, which queries a Fuseki instance over HTTP. When a request for a page comes in, the frontend performs a SPARQL query against Fuseki, asking for the results back as RDF/XML. These are parsed into an in-memory rdflib graph, which can then be queried to construct HTML pages, or transformed into other formats (other RDF serializations, JSON, RSS, etc.).
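
For context, the round trip looks roughly like this. This is a minimal sketch rather than the actual frontend code; the endpoint URL and query are placeholders.

import urllib
import urllib2

import rdflib

# Placeholder endpoint and query; the real frontend builds these per request.
FUSEKI_ENDPOINT = 'http://localhost:3030/dataset/query'
QUERY = 'CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 100'

def fetch_graph(query, accept='application/rdf+xml', format='xml'):
    # Ask Fuseki to serialize the CONSTRUCT results in the requested format,
    # then parse the response into an in-memory rdflib graph.
    url = FUSEKI_ENDPOINT + '?' + urllib.urlencode({'query': query})
    response = urllib2.urlopen(urllib2.Request(url, headers={'Accept': accept}))
    graph = rdflib.ConjunctiveGraph()
    graph.parse(data=response.read(), format=format)
    return graph

graph = fetch_graph(QUERY)  # RDF/XML, as things stand
# For N-Triples instead: fetch_graph(QUERY, accept='text/plain', format='nt')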

In a bid to make things a bit quicker I decided to benchmark some of the rdflib parsers. I timed rdflib.ConjunctiveGraph.parse() ten times for each parser (interleaved) over 100,000 triples. Here are the results:

FORMAT  MIN      AVG      MAX      SD
   xml  42.0758  42.7914  43.0649  0.2903
    n3  32.6210  32.7803  33.4146  0.2188
    nt  14.4455  14.7031  15.3278  0.2745

This isn’t a perfect benchmark, as my work box was doing who-knows-what at the same time, but things should have evened out enough for a comparative analysis. It’s quite clear that the N-Triples parser is about three times faster than the expat-based RDF/XML parser. On the basis of this I’m going to make the data.ox.ac.uk frontend request data as N-Triples; hopefully it’ll noticeably improve response times. I am slightly shocked that the RDF/XML parser only manages an average of around 2,340 triples per second (100,000 triples in an average of 42.8 seconds).

Here’s the hacky code I used:

import collections
import math
import rdflib
import time

# (filename, rdflib parser name) pairs; each file contains the same triples.
files = (('test.rdf', 'xml'),
         ('test.ttl', 'n3'),
         ('test.nt', 'nt'))

scores = collections.defaultdict(list)

# Parse each file ten times, interleaving the parsers so that any transient
# load on the machine affects all of them roughly equally.
for i in range(10):
    for name, format in files:
        g = rdflib.ConjunctiveGraph()
        with open(name) as f:
            start = time.time()
            g.parse(f, format=format)
            end = time.time()
        scores[format].append(end-start)
        print '%2i %3s %6.4f' % (i, format, end-start)

print scores

print 'FORMAT  MIN      AVG      MAX      SD'
for format, ss in scores.items():
    avg = sum(ss)/len(ss)
    # SD here is the population standard deviation of the ten timings.
    print '%6s  %6.4f  %6.4f  %6.4f  %6.4f' % (format, min(ss), avg, max(ss), math.sqrt(sum((s-avg)**2 for s in ss)/len(ss)))
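
For completeness: the three test files can be generated from a single source using rdflib itself. A quick sketch, assuming test.rdf is whatever 100,000-triple RDF/XML file you start from:

import rdflib

# Load the source data once, then write it back out in each format under test.
g = rdflib.ConjunctiveGraph()
g.parse('test.rdf', format='xml')
g.serialize(destination='test.ttl', format='n3')
g.serialize(destination='test.nt', format='nt')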

Open Data Hack Days at the ODI

Back in October I attended the Open Data Hack Days at the new Open Data Institute (ODI) offices.

Keynotes

On the morning of the first day we had keynotes from Jeni Tennison (@JeniT; Technical Director at the ODI), Chris Gutteridge (@cgutteridge; lead developer for data.soton.ac.uk, and facilitator of data.ac.uk), and Antonio Acuña (@diabulos; head of data.gov.uk).

Jeni started her talk by explaining that the ODI existed to demonstrate the value of open data. “Data helps us make decisions”, and by implication better use of data leads to better decisions, improved public (and private) services, and time and money efficiencies.

Here are some of the things we need to consider if we want to realize our vision of better use of public data. Most of what follows is a paraphrase of Jeni, but I’ve added some of my own thoughts, just to confuse you. You may also be interested in Tanya Gray’s notes on Jeni’s talk.

Inferring data
Partly a UI issue around data collection; using data we already have to help us collect good quality data; auto-completion and auto-suggestion; offering to correct mistakes
“living off spreadsheets”
Spreadsheets are everywhere, and contain a lot of valuable data. We’re not going to be able to get people to give them up (and why should we?), so we need to be good at getting data out through transformation. XLSX and ODS are just zipped XML, so we can make them a bit more manageable with tools like tei-spreadsheet; there’s a quick sketch of peeking inside one after this list.
Validation
We need to know that the data we have makes sense. As a community we’re not very good at this, preferring to assume it works and waiting for feedback. Antonio mentioned in his talk a tool they use to check the validity and recency of spending data; we need more stuff like that! Other ideas include the automatic detection and flagging of outliers, and gamification for collaborative validation.
Combining data
A mixture of co-reference resolution, resolving differences in modelling granularity, and probably a few other things I haven’t considered.
Aggregation for Data Protection
When producing statistics over datasets containing personal data (e.g. employees, patients) we need to implement automatic aggregation so as not to expose information that is too fine-grained.
Analysis
We tend to ignore probability, uncertainty and statistical significance when analysing the data we have. For example, “the UK economy has lost 15,000 jobs in the last month” on its own doesn’t signify a trend, or any causal relationships. It doesn’t help that modelling uncertainty in RDF is Difficult™ and/or introduces modelling incompatibilities.
Publication issues
When did the data last get updated? How do I subscribe to changes? Where did the data come from, and how was it transformed? We need to attach provenance metadata to datasets, and a (machine-readable) feed of changes wouldn’t go amiss.
Visualization
Visualizations shouldn’t just look pretty; they should prompt us to make decisions and take action. They should also show uncertainty.
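
As a quick illustration of the “spreadsheets are just zipped XML” point above, here’s a minimal sketch that lists the XML parts inside a workbook; the filename is a placeholder.

import zipfile

# Both .xlsx and .ods files are ordinary zip archives full of XML parts.
archive = zipfile.ZipFile('spending.xlsx')  # placeholder filename
for name in archive.namelist():
    print name  # e.g. xl/worksheets/sheet1.xml, xl/sharedStrings.xml
archive.close()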

Chris talked about motivating people to publish data openly, and models for aggregation.

Antonio talked about data.gov.uk, their wealth of datasets, and how they work to improve the usefulness and “findability” of the datasets they have. I’m sorry I have neither notes nor links to slides!

The hacking

Chris and I decided we would tackle automated discovery of datasets by software agents. The goals were:

  • An agent, starting at an organisation’s homepage, should be able to discover structured information about that organization
  • The information should be categorised by concern (e.g. vacancies, energy usage, news feed)
  • Separately, the information should be categorised by format (e.g. a profile of RDF or CSV, RSS, an API specification)
  • People should be able to not care about the abstract concept of ‘a dataset’, just about embodiments thereof
  • As ever, the barrier to entry should be low; it should be simple for people to implement

To bootstrap the discovery, we decided to use a /.well-known/ URI. These support discovery of host or site metadata in a consistent way, with an IANA-maintained registry of URIs and their specifications. VoID already provides a way to discover RDF datasets using /.well-known/, but we’re not concerned with datasets per se, nor exclusively with data modelled in RDF.

Chris and I have started writing up a specification for an organisation profile document on the OpenOrg wiki. The general idea is that a client can request http://www.example.org/.well-known/openorg and get back something like:

# dc: and ootheme: are declared here so that the example below parses;
# the exact namespaces are still to be pinned down in the spec.
@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix dcmit:   <http://purl.org/dc/dcmitype/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix org:     <http://www.w3.org/ns/org#> .
@prefix oo:      <http://purl.org/openorg/> .
@prefix ootheme: <http://purl.org/openorg/theme/> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
 
<> a oo:OrganizationProfileDocument ;
   foaf:primaryTopic <http://id.example.org/> .
 
<http://id.example.org/> a org:FormalOrganization ;
   # organization metadata
   skos:prefLabel "Example Organization" ;
   foaf:logo <http://www.example.org/images/logo.png> ;
   foaf:homepage <http://www.example.org/> ;
   # profile documents
   oo:profileDocument
     <http://www.example.org/news.rss> ,
     <http://energy.example.org/> ,
     <http://data.example.org/.well-known/void> ,
     <http://data.example.org/dumps/places.rdf> .

<http://www.example.org/news.rss> a foaf:Document ;
  dc:format "application/rss+xml" ;
  oo:theme ootheme:news ;
  foaf:primaryTopic <http://id.example.org/> .

<http://energy.example.org/> a dcmit:Service ;
  oo:theme ootheme:energy-use ;
  foaf:primaryTopic <http://id.example.org/> .

# and so on
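
To give a feel for the consumer side, here’s a minimal sketch of the discovery step: fetch a host’s well-known document and list its profile documents by theme. It assumes the draft terms above (oo:profileDocument, oo:theme) and a hypothetical host, and the spec may well change underneath it.

import rdflib

OO = rdflib.Namespace('http://purl.org/openorg/')

def discover(hostname):
    # Fetch and parse the organisation profile document for this host.
    g = rdflib.ConjunctiveGraph()
    g.parse('http://%s/.well-known/openorg' % hostname, format='n3')
    # List each advertised profile document, along with its theme(s).
    for doc in g.objects(None, OO['profileDocument']):
        for theme in g.objects(doc, OO['theme']):
            print doc, theme

discover('www.example.org')  # hypothetical host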

Chris has also started putting together a list of themes.

It’s still got a long way to go before we can make the registration request for the /.well-known/ URI, but it’s a start.


The future of data.ox.ac.uk

In the spirit of Alex Bilbie’s post about the future of data.lincoln.ac.uk, and Southampton’s post about the technical background and direction of data.southampton.ac.uk, I thought it’d be a good idea to chime in with where data.ox.ac.uk is going.


The Vacancies Dataset

The vacancies dataset contains job listings from the University’s central recruitment site.

Updates and timeliness

We currently check for new vacancies every fifteen minutes.

Modelling and data

The data are represented in RDF, primarily using the vacancies vocabulary. Data are stored across two RDF graphs, one for current vacancies, and another for those that have closed.

We also take a copy of the associated documents, which are stored on source.data.ox.ac.uk. Where possible we pull out the plain text of each document and include it in the RDF. At some point we aim to implement full-text search, which should make it possible to search for phrases within job descriptions.

Querying the data

In addition to using SPARQL, you can retrieve vacancy data using the following methods:

Feeds

It’s possible to get the data as RSS and Atom using URLs of the following formats:

http://data.ox.ac.uk/feeds/vacancies/xxxxxxxx[.format]
This will return all vacancies within the unit with the OxPoints ID of xxxxxxxx. It will exclude any attached to its sub-units, so if you ask for the Social Sciences Division, you won’t get back anything in Classics. You can find the base URL by finding the unit you want on this page. Once you’ve found the base URL, you can append the name of a format. We currently support various RDF serializations, RSS and RSS 2.0, and Atom.
http://data.ox.ac.uk/feeds/all-vacancies/xxxxxxxx.format
This is the same as above, but includes vacancies advertised as being within the sub-units of the requested unit. So in this case you’d also get vacancies in Classics when you ask for the Social Sciences Division. This means that you can get a list of all vacancies within the University hierarchy from http://data.ox.ac.uk/feeds/all-vacancies/00000000.

In both of these cases you can add a ?keyword=keyword parameter to filter the results by a sub-string.

For the RSS and Atom feeds we’ve given the closing date as the publication date, which should help when re-displaying these feeds as part of another website.
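
As a quick illustration, here’s a sketch of pulling the University-wide feed with a keyword filter using the feedparser library. The .atom extension and the keyword value are illustrative assumptions.

import feedparser  # third-party: pip install feedparser

# Assumed URL shape: feed base URL + '.atom', plus the optional keyword filter.
url = 'http://data.ox.ac.uk/feeds/all-vacancies/00000000.atom?keyword=librarian'
feed = feedparser.parse(url)

for entry in feed.entries:
    # Note that the publication date in these feeds carries the closing date.
    print entry.title, entry.link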

Bugs and limitations

At the moment we only pull data from the University’s recruitment site, which excludes collegiate appointments. We’re working to re-integrate work that was done to pull additional vacancies from www.jobs.ac.uk.

Salary information is currently just plain text. In due course we plan to model the University’s salary scales, which we can then link to. Once that is done it’ll be possible to perform range-based SPARQL queries on grade and upper and lower annual remunerations.

Our location parsing isn’t perfect, so we sometimes assign vacancies to the wrong unit. We hope to make this a bit cleverer soon!


About the University of Oxford’s open data service

data.ox.ac.uk is intended to be a repository for the University’s institutional open data. By collecting these data together and making them available in a machine-readable and re-usable way we hope to:

  1. Improve information flows in and around the University, allowing groups and departments to easily ingest and re-present data produced by another part of the University.
  2. Enable other people to use the data in new and innovative ways; to re-express the data to show something previously unknown.

To do this we’re creating a platform for storing and querying these data using standards-based technologies. A lot of the data will be available as RDF and queryable using SPARQL. However, where we can we are producing web-friendly APIs for those who want a low barrier to getting started and who aren’t familiar with RDF and SPARQL. It’s also possible to download the data in bulk for processing and manipulation off-line.
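
As a quick illustration, here’s a minimal sketch of querying the service with the SPARQLWrapper library (the /sparql/ endpoint path is assumed for illustration; see the site for the actual query URL).

from SPARQLWrapper import SPARQLWrapper, JSON  # third-party: pip install SPARQLWrapper

# Assumed endpoint location; see data.ox.ac.uk for the actual query URL.
sparql = SPARQLWrapper('http://data.ox.ac.uk/sparql/')
sparql.setQuery('''
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 10
''')
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results['results']['bindings']:
    print binding['s']['value'], binding['p']['value'], binding['o']['value']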

At the moment the service is in beta; it’s not officially supported and there’s no official commitment to keep it running. This may change as we demonstrate its usefulness.
