I am currently working on the project Discovering Babel: enhanced language resource discovery, as part of a wider project to bring the latest technologies for finding and sharing data to the Oxford Text Archive. The project is funded by JISC, as part of the Infrastructure for Resource Discovery programme. This blog post sets out the project plan.
Aims, Objectives and Final Outputs of the project
The Oxford Text Archive is home to almost 2,000 literary and linguistic resources in electronic form. Many of the resources are the outputs of projects funded by UK and other funding agencies, including the British Academy, AHRC and AHRB. In the past two and a half years since the demise of the Arts and Humanities Data Service (AHDS), there have been more than 60,000 separate downloads of resources from the OTA.
This project aims to enhance and upgrade the resource discovery mechanisms of the OTA to ensure that it continues to offer free services to researchers in response to their needs, and to ensure that the technical infrastructure of the OTA is in line with the latest good practice in the Resource Discovery Task Force vision (see http://rdtf.jiscinvolve.org/wp/).
In particular, the OTA will implement procedures to ensure that:
- each resource in the collection is assigned an HTTP URI;
- each resource URI is registered as a persistent identifier with the EPIC Handle Service;
- URIs are made available via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) in a machine-processable form in XML, in the following formats: Dublin Core; Open Language Archives Community (OLAC) metadata set; Text Encoding Initiative (TEI) Header; CLARIN Component Metadata Infrastructure element set (CMDI);
- metadata is made freely available as open linked data;
- every URI is resolved to a machine-processable resource containing relevant metadata.
As the vast majority of the collections described by the metadata are made freely available by the OTA, these enhancements will not only facilitate aggregation of metadata, but will also be key steps towards enabling enhanced access services for the collections, where standards-conformant web services can perform operations on them, and they can be deployed in virtual research environments.
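To give a feel for what harvesting over OAI-PMH involves, here is a minimal Python sketch of a harvester's side of the exchange: building a ListRecords request for Dublin Core and reading identifiers and titles out of the response. The endpoint URL, the record and its identifiers are invented for illustration (the OTA's real base URL is not given in this post), and the sketch parses a canned response rather than calling a live service.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint: an assumption for illustration, not the OTA's real address.
BASE_URL = "https://example.org/oai"

# Namespaces defined by the OAI-PMH 2.0 specification and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for one metadata format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Invented sample response, trimmed to the parts this sketch reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:0001</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An invented example text</dc:title>
          <dc:identifier>https://example.org/resource/0001</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return one dict per record: OAI identifier, title and resource URI."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": record.findtext(".//dc:title", namespaces=NS),
            "uri": record.findtext(".//dc:identifier", namespaces=NS),
        })
    return records


if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    for rec in parse_records(SAMPLE_RESPONSE):
        print(rec)
```

Requesting a different format would only mean passing another metadataPrefix value; the prefixes a given repository supports are advertised by its ListMetadataFormats response, so the names used for OLAC, TEI or CMDI would need to be checked there rather than assumed.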
Wider Benefits to Sector & Achievements for Host Institution
The OTA provides a free service to Higher Education institutions and other users worldwide. The enhancements to the service will be of benefit to all users, the vast majority of whom are outside of the University of Oxford. Indeed, a successful outcome for the project would lead to easier and more widespread discovery of the OTA’s services. The workshop, manual and other dissemination and networking activities will help to share across the sector the lessons we learn and the mistakes we make.
For the University of Oxford, these changes to the technical set-up of the OTA will help to connect together more of our services, lowering maintenance costs, sharing facilities, spreading expertise and allowing new services to be built linking together disparate datasets. Learning to deploy these technologies will help us to acquire skills which will be transferable to other projects.
Risk Analysis and Success Plan
The risks associated with this project appear to be relatively low. No new recruitment has been necessary, and the key technical tasks are the responsibility of a team of developers in OUCS, so the potential risks associated with staffing are low. There is the chance that some of the services with which we aim to interoperate will change – for example new protocols for harvesting metadata might be applied – but we have sufficient flexibility to adapt our workplan. The biggest risk with this type of project is usually the sustainability of the outputs. While this can never be certain, we have tried to embed, as far as possible, our work in ongoing production services which are part of Oxford’s core institutional IT infrastructure. The OTA has 35 years of success at ensuring the sustainability of our services, and we aim to continue this proud record!
The issues relating to intellectual property rights will not be a barrier to the successful completion of this project. Metadata for the resources in the OTA have been created by OTA staff and have always been freely available.
The datasets to which the resource discovery metadata refer are not owned by the OTA, but the OTA has permission to make the resources available subject to a user licence, which restricts use to exploitation for the purposes of education and research.
Project Team Relationships and End User Engagement
Martin Wynne is the principal investigator and project manager. The technical tasks will be carried out by the InfoDev team at OUCS, which includes Sebastian Rahtz, James Cummings, Joseph Tlabot, Alexander Dutton and Richard Buckner. Ylva Berglund of OUCS will also contribute to dissemination activities.
Projected Timeline, Workplan & Overall Project Methodology
The specific items of work and outputs of the project are as follows:
By the end of March:
- Add records for British National Corpus datasets in OTA catalogue
By the end of April:
- OAI-PMH target for harvesting OTA metadata (minimum 1000 records)
- Establish persistent locations for metadata
- Establish persistent locations for datasets
- Enhanced and corrected metadata to ensure meaningful interoperation with access services and aggregators
- Crosswalks for metadata from TEI Headers to DC, OLAC, CMDI
- Crosswalks for metadata from TEI Headers to RDF
By the end of May:
- Register persistent identifiers with a handle service
- Ensure visibility of metadata in OLAC and CLARIN aggregators
- Workshop ‘How to make your language resources discoverable’ in Oxford
By the end of July (end of project):
- A freely available online manual ‘How to make your language resources discoverable’
- Make XSLT for crosswalks available from OTA website
- Deliver enhanced metadata in all above formats via OAI-PMH
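The crosswalks in the workplan will be written in XSLT, but the idea behind them can be illustrated in a few lines of Python: a crosswalk is essentially a table saying which element in the source format feeds which element in the target format. The sketch below maps a handful of TEI Header elements to Dublin Core. The mapping table and the sample header are assumptions made up for this illustration, not the OTA's actual crosswalk or data.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
DC_NS = "http://purl.org/dc/elements/1.1/"
NS = {"tei": TEI_NS}

# Illustrative mapping only (an assumption): TEI Header path -> Dublin Core element.
CROSSWALK = [
    ("tei:fileDesc/tei:titleStmt/tei:title", "title"),
    ("tei:fileDesc/tei:titleStmt/tei:author", "creator"),
    ("tei:fileDesc/tei:publicationStmt/tei:publisher", "publisher"),
    ("tei:fileDesc/tei:publicationStmt/tei:idno", "identifier"),
]

# Invented sample header, not a real OTA record.
SAMPLE_HEADER = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt>
      <title>An invented example text</title>
      <author>A. N. Author</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Example Archive</publisher>
      <idno>https://example.org/resource/0001</idno>
    </publicationStmt>
    <sourceDesc><p>Invented for illustration.</p></sourceDesc>
  </fileDesc>
</teiHeader>"""


def tei_header_to_dc(header_xml):
    """Copy the mapped TEI Header values into a flat record of DC elements."""
    header = ET.fromstring(header_xml)
    ET.register_namespace("dc", DC_NS)  # serialise with the usual dc: prefix
    record = ET.Element("record")
    for path, dc_name in CROSSWALK:
        for source in header.findall(path, NS):
            text = "".join(source.itertext()).strip()
            if text:  # skip empty source elements
                ET.SubElement(record, "{%s}%s" % (DC_NS, dc_name)).text = text
    return record


if __name__ == "__main__":
    dc_record = tei_header_to_dc(SAMPLE_HEADER)
    print(ET.tostring(dc_record, encoding="unicode"))
```

A real crosswalk has to deal with much more than this, such as repeated or missing elements, dates and language codes, which is part of the reason for enhancing and correcting the metadata first.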
The overall cost of the project is £37,465, of which the JISC grant pays £28,632, and the University of Oxford is providing the rest. Expenditure in the major cost categories is broken down in the figure below: