The future of data.ox.ac.uk

In the spirit of Alex Bilbie’s post about the future of data.lincoln.ac.uk, and Southampton’s post about the technical background and direction of data.southampton.ac.uk, I thought it’d be a good idea to chime in with where data.ox.ac.uk is going.

Background

The idea for a data.ox.ac.uk came about a year and a half ago. The Erewhon Project had done some work to produce a dataset describing places and organisations within the University, and we began to see the value that could be achieved were this data linked with other data sources around the University. We bid internally for a project to collect and publish electricity meter data from our ION WebReach system, noting that by linking meters to the buildings and spaces to which they pertain, we’d be able to provide a more transparent and cohesive picture of energy usage at the University. As a part of the continuation of this work we’ve open-sourced a Django application implementing a time-series API which we hope will be adopted by other institutions.
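
To make the point about linking meters to buildings concrete, here is a minimal sketch of the kind of roll-up that linkage allows. It is an illustration only, not code from our time-series application: the function name, the meter identifiers and the readings are all made up.

```python
from collections import defaultdict


def usage_by_building(meter_to_building, readings):
    """Sum readings (in kWh) per building, given a meter -> building mapping."""
    totals = defaultdict(float)
    for meter_id, kwh in readings:
        # Each reading is attributed to the building its meter belongs to.
        totals[meter_to_building[meter_id]] += kwh
    return dict(totals)


# Hypothetical data: two meters in one building, one in another.
meter_to_building = {"meter-a": "library", "meter-b": "library", "meter-c": "lab"}
readings = [("meter-a", 1.5), ("meter-b", 2.5), ("meter-c", 4.0)]
totals = usage_by_building(meter_to_building, readings)
```

Without the meter-to-building link, all you have is a pile of per-meter numbers; with it, the same readings can be reported per building, per department or per site.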

We developed a Django-based frontend called humfrey, which sits in front of a Fuseki triple-store. Local aspects of the project were kept in a separate dataox project on GitHub, hopefully meaning that the humfrey part could be repurposed by others.

From there, we kept on the lookout for other datasets that it would be suitable for us to publish. In the summer of 2011, as part of OUCS’s internship scheme, we took on Austin Kinross, who did a superb job of getting up to speed with RDF and the linked data web, and worked to scrape vacancy data out of the University’s recruitment website and transform it into RDF.

Current work

Oxford is part of JISC’s course data programme, and we intend to publish data about the vast majority of our graduate training opportunities. Our approach is to model all our data as RDF, load it into our public triple-store, and expose the required XCRI-CAP feeds from there. This involves round-tripping XCRI-CAP XML to RDF and back again, and we’ve published our transformation code on GitHub.
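
For a flavour of what round-tripping means here, the sketch below turns a drastically simplified course catalogue into subject-predicate-object triples and back again. It is not our transformation code and the XML is not real XCRI-CAP: the element names, the `href` attribute and the example.org URL are assumptions made for illustration. Only the Dublin Core `title` property URI is a real identifier.

```python
import xml.etree.ElementTree as ET

# Real Dublin Core property URI; everything else here is illustrative.
DCTERMS_TITLE = "http://purl.org/dc/terms/title"


def xml_to_triples(xml_text):
    """Read <course href="..."><title>...</title></course> elements as triples."""
    root = ET.fromstring(xml_text)
    triples = []
    for course in root.findall("course"):
        triples.append((course.get("href"), DCTERMS_TITLE, course.findtext("title")))
    return triples


def triples_to_xml(triples):
    """Rebuild the simplified catalogue XML from title triples."""
    root = ET.Element("catalog")
    for subject, predicate, obj in triples:
        if predicate == DCTERMS_TITLE:
            course = ET.SubElement(root, "course", href=subject)
            ET.SubElement(course, "title").text = obj
    return ET.tostring(root, encoding="unicode")


source = (
    '<catalog><course href="https://example.org/course/1">'
    "<title>Intro to RDF</title></course></catalog>"
)
triples = xml_to_triples(source)
round_tripped = triples_to_xml(triples)
```

The useful property to check is that converting the regenerated XML to triples again yields the same triples, i.e. nothing is lost on the way through the store.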

Our other current focus is publishing data about research equipment and facilities. This will result in an internal catalogue for our researchers and a feed of data to share with other universities, and should lead to better utilisation of equipment and greater efficiency.

Technical details

Screenshot of admin.data.ac.uk
humfrey contains an update management system, which lets us add new datasets with relative ease (at least, for simple datasets). Each dataset update definition is made up of some number of pipelines, with a pipeline saying things like “retrieve this file from that resource, transform it using this XSL, and push it into the store with this graph name”.
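
The shape of an update definition can be sketched roughly as follows. This is not humfrey’s actual data model: the class names, the in-memory dictionary standing in for the triple-store, and the use of a plain Python callable where we would really apply an XSL transform are all assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for a triple-store: graph name -> loaded payload.
Store = Dict[str, str]


@dataclass
class Pipeline:
    """Retrieve a resource, transform it, and load it into a named graph."""
    retrieve: Callable[[], str]
    transform: Callable[[str], str]  # stands in for an XSL transform
    graph_name: str

    def run(self, store: Store) -> None:
        raw = self.retrieve()
        store[self.graph_name] = self.transform(raw)


@dataclass
class UpdateDefinition:
    """A dataset update definition: a slug plus some number of pipelines."""
    slug: str
    pipelines: List[Pipeline] = field(default_factory=list)

    def run(self, store: Store) -> List[str]:
        """Run every pipeline in order; return the graph names that were updated."""
        updated = []
        for pipeline in self.pipelines:
            pipeline.run(store)
            updated.append(pipeline.graph_name)
        return updated


store: Store = {}
definition = UpdateDefinition(
    slug="vacancies",
    pipelines=[
        Pipeline(
            retrieve=lambda: "job,Research Assistant",  # pretend download
            transform=str.upper,                        # pretend transform
            graph_name="https://example.org/graph/vacancies",
        )
    ],
)
updated_graphs = definition.run(store)
```

Keeping each pipeline as a small retrieve/transform/load unit is what makes simple datasets cheap to add: a new dataset is mostly a new list of pipelines rather than new code.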

Screenshot showing search on data.ox.ac.uk
After a dataset has been updated, the site kicks off a couple of tasks to push the dataset metadata to thedatahub.org and update our ElasticSearch-based search indices. In time we’ll be able to say “always update this dataset after that one” and support pushing update notifications to downstream consumers.
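
One way the “always update this dataset after that one” idea could work is a topological sort over declared dependencies, with the follow-up tasks run after each update. The sketch below is speculative rather than a description of anything we have built: the function names and dataset names are invented, and the two tasks just return strings instead of talking to thedatahub.org or ElasticSearch. It needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter


def update_order(depends_on):
    """Order datasets so that each one comes after the datasets it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


def run_update(dataset, post_update_tasks):
    """Pretend to update a dataset, then run each follow-up task; return a log."""
    log = [f"updated {dataset}"]
    for task in post_update_tasks:
        log.append(task(dataset))
    return log


# Hypothetical follow-up tasks; real ones would call external services.
def publish_metadata(dataset):
    return f"metadata sent for {dataset}"


def rebuild_search_index(dataset):
    return f"search index rebuilt for {dataset}"


# "vacancies" must always be updated after "places".
order = update_order({"vacancies": {"places"}, "places": set()})
log = run_update("vacancies", [publish_metadata, rebuild_search_index])
```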

Behind the scenes we’re working on improving our deployment processes so that the whole thing becomes rather less ad hoc.

The future

The most important thing is that we’ve got the go-ahead from on high to continue working on data.ox.ac.uk, and that it’s poised to become a more strategic element of the University’s data flows.

There are still a few datasets we’ve got our eyes on:

  • Undergraduate course data
  • Lecture timetables
  • Exam timetables
  • Club and society data
  • Events data
  • The contact directory

For most of these the systems that hold the data are in some state of transition (e.g. the student systems replacement project) or are otherwise unable to provide the data in a suitable format.

Other outstanding issues include:

  • Licensing: Ideally we’d like all the data we provide to be openly licensed, and will push our upstream providers to use a Creative Commons license or the PDDL. However, for the foreseeable future we will be publishing some data with either non-open or under-specified licensing (in which case the consumer should assume that they need to seek permission from the maintainer).
  • RDF? What’s that? We also realise that RDF isn’t everyone’s cup of tea, and that a lot of people want simple APIs which return chunks of JSON or XML. Our search API should go some way towards filling this void, and we’ll look to create bespoke APIs as appropriate.
  • Documentation: We realise that we’re woefully short on documentation. I’ll make a concerted effort to document our datasets and APIs on this blog.
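
On the “simple chunks of JSON” point, a bespoke API largely comes down to flattening triples into plain records with friendly key names. The following is a minimal sketch of that idea, not our search API: the function name, the `id` key and the label mapping are assumptions, and the example.org URL is a placeholder.

```python
import json


def triples_to_json(triples, labels):
    """Group triples by subject into plain records with short key names."""
    records = {}
    for subject, predicate, obj in triples:
        # Fall back to the full predicate URI when no short label is configured.
        key = labels.get(predicate, predicate)
        records.setdefault(subject, {"id": subject})[key] = obj
    return json.dumps(list(records.values()), sort_keys=True)


labels = {"http://purl.org/dc/terms/title": "title"}
triples = [
    ("https://example.org/course/1", "http://purl.org/dc/terms/title", "Intro to RDF"),
]
payload = triples_to_json(triples, labels)
```

A consumer then sees an ordinary list of objects and never has to know that the data started life as RDF.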

Plans are also afoot for a data.ac.uk to act as a hub for co-ordination and discovery of data published by HE and FE institutions. It’s still early days, but planning is currently taking place on the DATA-AC-UK mailing list. Community is a huge part of the push for open data: we’ll achieve so much more if we talk to one another.
