Discovering Babel: technical issues

The Discovering Babel project aims to make the digital resources in the Oxford Text Archive easier to discover for potential users. The technical issues in the project relate to the new ways in which we are making the OTA catalogue data available. There are several aspects to this work:

  1. making the catalogue records available to be collected by online resource discovery services;
  2. transforming the catalogue records into a variety of different formats for the different services;
  3. updating catalogue records for the items in the archive.

Making the records available

Before Discovering Babel, the OTA metadata was available only in abbreviated form in the catalogue list on the website and on the webpages for each resource, or in full when a user downloaded the resource. An important additional service in the project workplan was to make the full metadata available for online services to collect, or harvest. We chose to do this using the most widely used protocol for this purpose, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH for short).

To do this we had to take the following steps:

  • add the appropriate Apache and Perl modules to our web server to allow OAI-PMH queries to our web service;
  • implement crosswalks (using XSLT) from our metadata in TEI Header format to the Dublin Core format;
  • register as a metadata provider with relevant aggregators;
  • set up procedures to ensure the ongoing availability, persistence, maintenance and updating of the OAI-PMH service.
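To give an idea of what a crosswalk does, here is a minimal sketch in Python. It is not the XSLT we use in the project: the function name `tei_header_to_dc`, the sample header and the two mappings shown (TEI title to Dublin Core title, TEI author to Dublin Core creator) are assumptions chosen for illustration only.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
DC = "http://purl.org/dc/elements/1.1/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Use the conventional prefixes when the record is serialised.
ET.register_namespace("dc", DC)
ET.register_namespace("oai_dc", OAI_DC)

# Illustrative mapping only (an assumption, not the project's real crosswalk).
MAPPING = [
    (f".//{{{TEI}}}titleStmt/{{{TEI}}}title", "title"),
    (f".//{{{TEI}}}titleStmt/{{{TEI}}}author", "creator"),
]

def tei_header_to_dc(tei_header_xml):
    """Return a small oai_dc record built from one TEI Header string."""
    header = ET.fromstring(tei_header_xml)
    record = ET.Element(f"{{{OAI_DC}}}dc")
    for path, dc_name in MAPPING:
        for node in header.findall(path):
            text = "".join(node.itertext()).strip()
            if text:  # skip empty elements
                ET.SubElement(record, f"{{{DC}}}{dc_name}").text = text
    return ET.tostring(record, encoding="unicode")

# Hypothetical header, much shorter than a real OTA record.
SAMPLE = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc><titleStmt>
    <title>Example text</title><author>Anon.</author>
  </titleStmt></fileDesc>
</teiHeader>"""

print(tei_header_to_dc(SAMPLE))
```

A real crosswalk covers many more fields, but the principle is the same: each element in the source format is mapped to its nearest equivalent in the target format.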

We have chosen to make metadata available in a number of formats via OAI-PMH, to fit the expectations and requirements of a number of harvesters relevant to our field. We therefore deliver Dublin Core, with extensions for the Open Language Archives Community (OLAC), and the TEI Headers. We were also planning to provide CMDI metadata for the CLARIN aggregator, but this format has not yet achieved sufficient maturity and stability, so we aim to add this later. In the meantime, the CLARIN aggregator is harvesting OLAC metadata, and in this way OTA resources are presented in the Virtual Language Observatory service at http://www.clarin.eu/vlo.

The OTA records are harvested from http://ota.oerc.ox.ac.uk/oai2/XMLFile/ota/oai.pl.
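For readers unfamiliar with OAI-PMH, a harvester simply sends HTTP requests with a `verb` parameter to this address. The short Python sketch below builds two such request URLs without contacting the server. The helper name `oai_request` is our own invention for illustration. `oai_dc` is the metadata prefix that the protocol requires every repository to support; the prefixes for the other formats can be found with `ListMetadataFormats`.

```python
from urllib.parse import urlencode

# Base URL as given in the text above.
BASE_URL = "http://ota.oerc.ox.ac.uk/oai2/XMLFile/ota/oai.pl"

def oai_request(verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    params = {"verb": verb}
    params.update(arguments)
    return BASE_URL + "?" + urlencode(params)

# Ask the repository which metadata formats it offers.
print(oai_request("ListMetadataFormats"))

# Ask for all records in unqualified Dublin Core.
print(oai_request("ListRecords", metadataPrefix="oai_dc"))
```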

Crosswalks: transforming the metadata to different formats

We initially wrote the crosswalks using XSLT 2.0, but we found that the performance was very poor, and too slow for the harvesting services. We therefore backported the code to XSLT 1.0, which provided adequate performance and enabled the harvesters to operate. We plan to investigate these issues further together with other CLARIN centres to see if future improvements to the performance can be achieved.

What we understand so far is that the repeated calls to the Java-based XSLT 2.0 processor Saxon (in our case, using the saxonb-xslt package on Ubuntu) seem to be the problem. The original stylesheet which we wrote to transform the TEI Headers worked on a directory of header files. However, due to the way in which the OAI-PMH architecture works, the stylesheets had to be written to work on one file at a time. So the Java Virtual Machine starts again for each call of Saxon, i.e. for every metadata item. This was very costly computationally, and simply providing more computing power would not have been a good solution, since the procedure does not seem to scale easily.
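The cost of starting a fresh process for every record can be measured with a few lines of Python. This is only a sketch under stated assumptions: the function name `time_per_call` is hypothetical, and the stand-in command is a Python interpreter that starts and exits straight away, not Saxon. To compare real processors, replace `STAND_IN` with the command line you actually use.

```python
import subprocess
import sys
import time

def time_per_call(command, repetitions):
    """Run `command` as a new process `repetitions` times.

    Returns the mean wall-clock time in seconds per call, which
    includes the start-up cost of the process each time.
    """
    start = time.perf_counter()
    for _ in range(repetitions):
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
    return (time.perf_counter() - start) / repetitions

# Stand-in command (an assumption): an interpreter that does nothing.
STAND_IN = [sys.executable, "-c", "pass"]

if __name__ == "__main__":
    print(f"{time_per_call(STAND_IN, 5):.3f} seconds per call")
```

Multiplying the per-call figure by the number of records gives a rough lower bound on the time for a full harvest when each record triggers its own process.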

A key point for us to consider at this stage was that our original stylesheets made use of XSLT 2.0 features, but there are few 2.0 processors available. None seem to be based on C or C++. The only real alternative to Saxon of which we were aware was the closed-source AltovaXML products, only available for Windows 32-bit architectures.

We therefore ran tests with C-based XSLT 1.0 processing (with the xsltproc package on Ubuntu), which was lightning fast in comparison: for the hundreds of metadata records, it was 100 to 200 times faster than Saxon. We therefore rewrote the pertinent parts of the stylesheets to conform to XSLT 1.0 and implemented this solution.

We also considered another possibility, that of moving to a servlet-based solution. There is a Java-based OAI implementation (jOAI), for example, to be deployed on a Tomcat server. Another option would have been to investigate setting up the Java-based Saxon XSLT 2.0 processor as a service in its own right, which could be consumed by the Perl code. Neither solution would involve starting up the JVM again and again. However, either solution would make it necessary to set up a server (Tomcat or Jetty, respectively). We considered that, as well as requiring additional effort to implement, this would create an additional maintenance overhead, with serious risks to the robustness, persistence and sustainability of the service.

Updating the records

The OTA has always made freely available the descriptions of the electronic resources in the archive. These descriptions take the form of catalogue records, or metadata, and contain information useful to potential users about the resource, including its title, a summary of the content, where the electronic resource came from (its provenance), technical formats, types of annotation, size of the files, any restrictions on its use, etc.

The metadata for each resource is stored in an XML file, and the information is encoded according to the latest (P5) version of the guidelines of the Text Encoding Initiative (TEI). In the area of literary and linguistic computing, the TEI Guidelines are a widely recognized and respected reference point and standard for the encoding of data and metadata. The metadata for OTA resources is therefore in the form of a TEI Header.

The work in Discovering Babel on making this metadata more visible, and on transforming it into other formats, has revealed some areas where it was necessary to update, correct or add to the existing information in the metadata. For example, it was found that the description of the language of a resource was missing in some cases, usually where the language was English, and was perhaps considered the default value in the past!
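A check for missing language descriptions can be sketched in a few lines of Python. This is an illustration and not the procedure we followed: it assumes that the language is recorded in the usual TEI P5 place, a `language` element with an `ident` attribute inside `langUsage` in the `profileDesc`. The function name `declared_languages` and the two sample headers are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def declared_languages(tei_header_xml):
    """Return the language codes declared in one TEI Header string."""
    root = ET.fromstring(tei_header_xml)
    languages = root.findall(
        ".//tei:profileDesc/tei:langUsage/tei:language", TEI_NS
    )
    return [lang.get("ident") for lang in languages if lang.get("ident")]

# Hypothetical headers: one declares its language, one does not.
WITH_LANGUAGE = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <profileDesc><langUsage>
    <language ident="en">English</language>
  </langUsage></profileDesc>
</teiHeader>"""

WITHOUT_LANGUAGE = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <profileDesc/>
</teiHeader>"""

for name, header in [("with", WITH_LANGUAGE), ("without", WITHOUT_LANGUAGE)]:
    found = declared_languages(header)
    print(name, found if found else "no language declared")
```

Run over all the header files, a check of this kind produces a list of the records that need a language description to be added.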
