Discovering DiscoverEd

Recently I’ve been trying a few queries on Creative Commons’ new DiscoverEd OER search page. For example, this query for any CC-licensed material referencing the doomed Franklin Expedition – my current pet interest outside work, and something I thought might well produce no results – turns up an interesting open-access paper on the metabolism of the Inuit and its similarity to the metabolism of people on the Atkins diet.

DiscoverEd is currently only indexing the following sites:

  • Connexions (http://cnx.org)
  • National Science Digital Library (http://nsdl.org)
  • OER Commons (http://oercommons.org)
  • OpenCourseWare Consortium (http://ocwconsortium.org)

and so is considerably more limited than the general Creative Commons search tool, which acts as a concentrator for licence-based search across Yahoo!, Google and Flickr. Does it have to be more limited in its indexing because its scope is deliberately focused on education? The white paper (pdf) published alongside the launch of DiscoverEd has many interesting things to say on how we ought to be flagging resources as both open and educational. It opens with a brief discussion of metadata syndication technologies like OAI-PMH – which DiscoverEd currently uses to find out more about the web-available resources it spiders – before making the case for embedding metadata in the page itself.
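For anyone who hasn’t met it, OAI-PMH is exactly the kind of metadata side channel the paper goes on to argue against: a harvester issues plain HTTP GET requests against a repository’s base URL and gets XML records back. A minimal sketch follows – the repository URL and record are invented for illustration, and the response is trimmed of its envelope elements:

    GET http://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc

    <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
      <ListRecords>
        <record>
          <header>
            <!-- hypothetical identifier: the metadata lives here, in a
                 separate channel, not in the resource's own HTML page -->
            <identifier>oai:repository.example.org:item-42</identifier>
            <datestamp>2009-04-01</datestamp>
          </header>
          <metadata>
            <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
              <dc:title>Example learning resource</dc:title>
              <dc:rights>http://creativecommons.org/licenses/by/3.0/</dc:rights>
            </oai_dc:dc>
          </metadata>
        </record>
      </ListRecords>
    </OAI-PMH>

Nothing in the resource’s own page points at this record; harvester and repository have to agree on the protocol in advance. The paper continues: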

We believe that structured data are more useful when widely exposed and linked. Furthermore, such data are more likely to be provided once they are parsed by useful and popular web-based tools, such as search engines. A key benefit pertains to auto-discovery: if you’ve got the URL, then you have also got access to the structured data over HTTP and by following HTML links. There is no longer any need for a specialized protocol to benefit from the virtually linked structured data.

Structured data can gain the widest exposure and opportunities for linking when published in [X]HTML, visible to both people as well as software. Creative Commons addressed this issue as part of our work with licenses and identified the following as important principles for structured data in HTML documents:

Independence and Extensibility: The means of expressing information in HTML should be (1) independent of any central authority and (2) extensible, i.e., enabling the reuse of existing data models and the addition of new properties by anyone. Adding new properties should not require extensive coordination across communities or approval from a central authority. Tools should not suddenly become obsolete when new properties are added, or when existing properties are applied to new kinds of data sets.

Don’t Repeat Yourself: Providing machine-readable structure should not require duplicating data in a separate format. Notably, if the human-readable links or text are changed, a machine processing the page should automatically note this change without the publisher having to update another part of the HTML file to keep it “in sync” with the human-readable portion. This helps reduce the overall load of creating structured data after the fact.

Visual Locality: An HTML page may contain multiple items, for example a dozen photos, each with its own structured data. It should be easy for tools to associate the appropriate structured data with their corresponding visual display.

Remix Friendliness: It should be easy to copy an item from one document and paste it into a new document with all appropriate structured data included. In a world where people constantly remix old content to create new content, copy-and-paste, widgets, and sidebars are crucial elements of the remixable Web.

Apologies for the extensive quote, but these principles do strike me as very useful. Having implemented OAI targets and their backend storage solutions in the past, I heartily approve of the opinion that metadata side channels are awkward and rarely used, and that metadata is more useful the ‘closer’ it is to the resource it describes. More specifically for the project in hand, I have recently been discussing with a member of technical staff here the best method for exposing RDF descriptions of media items and their licensing terms to search engines. The issue of multiple items on a page, each with its own RDF description, caused us some puzzlement: how would this data be parsed by Yahoo! and Google? I’m going to set up some experiments and – as the Creative Commons paper suggests – take a look at RDFa.
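As a first sketch of what those experiments might look like – the resource URLs and titles here are invented, and this reflects my reading of how RDFa works rather than anything specific from the paper – the about attribute scopes a set of statements to a single resource, so several items can share a page without their metadata getting tangled:

    <div xmlns:cc="http://creativecommons.org/ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
      <!-- First item: every statement inside this div is about photo1.jpg -->
      <div about="http://example.org/media/photo1.jpg">
        <img src="http://example.org/media/photo1.jpg" alt="First photograph" />
        <span property="dc:title">First photograph</span>
        <a rel="cc:license"
           href="http://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a>
      </div>
      <!-- Second item: a different licence, kept separate by its own about value -->
      <div about="http://example.org/media/photo2.jpg">
        <img src="http://example.org/media/photo2.jpg" alt="Second photograph" />
        <span property="dc:title">Second photograph</span>
        <a rel="cc:license"
           href="http://creativecommons.org/licenses/by-nc/3.0/">CC BY-NC 3.0</a>
      </div>
    </div>

An RDFa-aware parser should extract triples along the lines of:

    <http://example.org/media/photo1.jpg> <http://purl.org/dc/elements/1.1/title> "First photograph" .
    <http://example.org/media/photo1.jpg> <http://creativecommons.org/ns#license> <http://creativecommons.org/licenses/by/3.0/> .

which ticks the boxes from the quote above: the licence link is simultaneously the human-readable anchor and the machine-readable statement (Don’t Repeat Yourself), each item’s metadata wraps its visual display (Visual Locality), and either div can be copied into another page with its statements intact (Remix Friendliness).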

Posted in Oxford, partnerships
