Open Data Hack Days at the ODI

Back in October I attended the Open Data Hack Days at the new Open Data Institute (ODI) offices.

Keynotes

On the morning of the first day we had keynotes from Jeni Tennison (@JeniT; Technical Director at the ODI), Chris Gutteridge (@cgutteridge; lead developer for data.soton.ac.uk, and facilitator of data.ac.uk), and Antonio Acuña (@diabulos; head of data.gov.uk).

Jeni started her talk by explaining that the ODI existed to demonstrate the value of open data. “Data helps us make decisions”, and by implication better use of data leads to better decisions, improved public (and private) services, and time and money efficiencies.

Here are some of the things we need to consider if we want to realize our vision of better use of public data. Most of this is a paraphrase of Jeni, but I’ve added some of my own thoughts in here, just to confuse you. You may also be interested in Tanya Gray’s notes on Jeni’s talk.

Inferring data
Partly a UI issue around data collection; using data we already have to help us collect good quality data; auto-completion and auto-suggestion; offering to correct mistakes
“living off spreadsheets”
Spreadsheets are everywhere, and contain a lot of valuable data. We’re not going to be able to get people to give them up (and why should we?), so we need to be good at getting data out through transformation. (XLSX and ODS are just zipped XML; we can make them a bit more manageable with tools like tei-spreadsheet; see the quick sketch after this list)
Validation
We need to know that the data we have makes sense. As a community we’re not very good at this, preferring to assume it’s fine and wait for feedback. Antonio mentioned in his talk a tool they use to check the validity and freshness of spending data; we need more stuff like that! Other ideas include automatic detection and flagging of outliers, and gamification for collaborative validation.
Combining data
A mixture of co-reference resolution, resolving differences in modelling granularity, and probably a few other things I haven’t considered.
Aggregation for Data Protection
When producing statistics over datasets containing personal data (e.g. employees, patients) we need to implement automatic aggregation so as not to expose information that is too fine-grained.
Analysis
We tend to ignore probability, uncertainty and statistical significance when analysing the data we have. For example, “the UK economy has lost 15,000 jobs in the last month” on its own doesn’t establish a trend or any causal relationship. It doesn’t help that modelling uncertainty in RDF is Difficult™ and/or introduces modelling incompatibilities.
Publication issues
When did the data last get updated? How do I subscribe to changes? Where did the data come from, and how was it transformed? We need to attach provenance metadata to datasets, and a (machine-readable) feed of changes wouldn’t go amiss.
Visualization
Visualizations shouldn’t just look pretty; they should prompt us to make decisions and take action. They should also show uncertainty.
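To illustrate the spreadsheet point above with a minimal sketch (the file name is made up): an .xlsx or .ods file really is just a zip archive of XML parts, so even the standard library can peek inside before a proper transformation tool like tei-spreadsheet takes over.

import zipfile
from xml.etree import ElementTree

# An .xlsx (or .ods) file is an ordinary zip archive of XML parts.
with zipfile.ZipFile("spending.xlsx") as workbook:  # hypothetical file name
    for name in workbook.namelist():
        print(name)  # e.g. xl/workbook.xml, xl/worksheets/sheet1.xml

    # Each worksheet is plain XML (sheet1.xml is the usual part name),
    # so getting the data out is "just" a transformation problem.
    sheet = ElementTree.fromstring(workbook.read("xl/worksheets/sheet1.xml"))
    print(sheet.tag)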

Chris talked about motivating people to publish data openly, and models for aggregation.

Antonio talked about data.gov.uk, their wealth of datasets, and how they work to improve the usefulness and “findability” of the datasets they have. I’m sorry I have neither notes nor links to slides!

The hacking

Chris and I decided we would tackle automated discovery of datasets by software agents. The goals were:

  • An agent, starting at an organisation’s homepage, should be able to discover structured information about that organisation
  • The information should be categorised by concern (e.g. vacancies, energy usage, news feed)
  • Separately, the information should be categorised by format (e.g. a profile of RDF or CSV, RSS, an API specification)
  • People shouldn’t have to care about the abstract concept of ‘a dataset’, only about concrete embodiments of it
  • As ever, the barrier to entry should be low; it should be simple for people to implement

To bootstrap the discovery, we decided to use a /.well-known/ URI. These support discovery of host or site metadata in a consistent way, with an IANA-maintained registry of the URIs and their specifications. VoID already provides a way to discover RDF datasets using /.well-known/, but we’re not concerned with datasets as such, nor exclusively with data modelled in RDF.

Chris and I have started writing up a specification for an organisation profile document on the OpenOrg wiki. The general idea is that a client can request http://www.example.org/.well-known/openorg and get back something like:

@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix dcmit:   <http://purl.org/dc/dcmitype/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix org:     <http://www.w3.org/ns/org#> .
@prefix oo:      <http://purl.org/openorg/> .
# ootheme is a placeholder namespace for the theme list; not yet finalised
@prefix ootheme: <http://purl.org/openorg/themes/> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
 
<> a oo:OrganizationProfileDocument ;
   foaf:primaryTopic <http://id.example.org/> .
 
<http://id.example.org/> a org:FormalOrganization ;
   # organization metadata
   skos:prefLabel "Example Organization" ;
   foaf:logo <http://www.example.org/images/logo.png> ;
   foaf:homepage <http://www.example.org/> ;
   # profile documents
   oo:profileDocument
     <http://www.example.org/news.rss> ,
     <http://energy.example.org/> ,
     <http://data.example.org/.well-known/void> ,
     <http://data.example.org/dumps/places.rdf> .

<http://www.example.org/news.rss> a foaf:Document ;
  dc:format "application/rss+xml" ;
  oo:theme ootheme:news ;
  foaf:primaryTopic <http://id.example.org/> .

<http://energy.example.org/> a dcmit:Service ;
  oo:theme ootheme:energy-use ;
  foaf:primaryTopic <http://id.example.org/> .

# and so on
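
To give a feel for how a consuming agent might use a profile like this, here’s a minimal sketch in Python using rdflib. It assumes the property names from the draft above, the made-up example.org URIs, and the same placeholder ootheme namespace as in the example, so treat it as illustrative rather than part of the spec.

from urllib.parse import urljoin

from rdflib import Graph, Namespace

OO = Namespace("http://purl.org/openorg/")
OOTHEME = Namespace("http://purl.org/openorg/themes/")  # placeholder theme namespace

def profile_documents(homepage, theme=None):
    """Fetch an organisation's /.well-known/openorg profile and yield its
    profile documents, optionally filtered by theme."""
    graph = Graph()
    graph.parse(urljoin(homepage, "/.well-known/openorg"), format="turtle")

    for org in graph.subjects(OO.profileDocument, None):
        for doc in graph.objects(org, OO.profileDocument):
            if theme is None or theme in set(graph.objects(doc, OO.theme)):
                yield doc

# e.g. find where an organisation publishes its energy-use data
for doc in profile_documents("http://www.example.org/", OOTHEME["energy-use"]):
    print(doc)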

Chris has also started putting together a list of themes.

It’s still got a long way to go before we can make the registration request for the /.well-known/ URI, but it’s a start.
