TEI-Comparator

I have just finished my poster for DRHA 2009, which is about the TEI-Comparator that RTS worked on for the Holinshed Project. My poster is available online in PDF and PNG formats (though, for the record, it was created in Inkscape as an SVG file).

The poster discusses the creation of the tool for the Holinshed Project at the University of Oxford. Holinshed’s Chronicles of England, Scotland, and Ireland was the crowning achievement of Tudor historiography and an important historical source for contemporary playwrights and poets. Holinshed’s Chronicles was first printed in 1577 and a second revised and expanded edition followed in 1587. EEBO-TCP had already encoded a version of the 1587 edition, and the Holinshed Project specially commissioned them to create a 1577 edition using the same methodology. The resulting texts were converted to valid TEI P5 XML and used as a base to construct a comparison engine, known as the TEI-Comparator, to assist the editors in understanding the textual differences between the two editions.

Using the TEI-Comparator involved several stages. The first was to decide which elements in the two TEI XML files should be compared. In this case the appropriate granularity was at the paragraph (and paragraph-like) level. The project was primarily interested in how portions of text were re-used, replaced, expanded, deleted, and modified from one edition to another. This first stage ran a short preparatory script which added unique namespaced IDs to each relevant element in both TEI files. It is the proper linking of these IDs between the two editions that the TEI-Comparator was designed to facilitate.
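
As a rough illustration of that preparatory step, here is a minimal Python sketch that stamps an ID on each paragraph-like element. It is not the project's actual script: the `http://example.org/comparator` namespace, the `cmp` prefix, the ID format, and the choice of `p`, `head`, and `note` as the compared elements are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
CMP_NS = "http://example.org/comparator"  # hypothetical namespace for the IDs
COMPARED = ("p", "head", "note")          # assumed "paragraph-like" elements


def add_ids(xml_text, prefix):
    """Return (serialised XML, count) after adding a namespaced ID
    attribute to every compared element, in document order."""
    ET.register_namespace("", TEI_NS)
    ET.register_namespace("cmp", CMP_NS)
    root = ET.fromstring(xml_text)
    wanted = {f"{{{TEI_NS}}}{name}" for name in COMPARED}
    count = 0
    for elem in root.iter():
        if elem.tag in wanted:
            count += 1
            # e.g. cmp:id="h1577-00001"
            elem.set(f"{{{CMP_NS}}}id", f"{prefix}-{count:05d}")
    return ET.tostring(root, encoding="unicode"), count


SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<head>The historie of England</head>"
    "<p>First paragraph.</p><p>Second paragraph.</p>"
    "</body></text></TEI>"
)

tagged, n = add_ids(SAMPLE, "h1577")
```

Running the same function over the other edition with a different prefix (say `h1587`) would give every candidate element in both files an ID that a later linking step can refer to.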

The second stage was to prepare a database of initial comparisons between the two texts using a bespoke fuzzy text-comparison n-gram algorithm designed by Arno Mittelbach (the technical lead for the TEI-Comparator). This algorithm, called Shingle Cloud, transforms both input texts (needle and haystack) into sets of n-grams. It matches the haystack’s n-grams against the needle’s and constructs a huge binary string marking where they match. This binary string is then interpreted by the algorithm to determine whether the needle can be found in the haystack and, if so, where. The algorithm runs in linear time and, given the language of the originals, was found to work better if the strings of text were regularized (including removal of vowels).

The third stage in using the comparator was for the research assistant (RA) on the project to confirm, remove, annotate, or create new links between one edition and the other, using a custom interface to the TEI-Comparator constructed in Java using the Google Web Toolkit API. The final stage was to produce output from the RA’s work by generating two standalone HTML versions of the texts, linked together on the basis of the now-confirmed IDs.
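
To make the second stage's n-gram matching a little more concrete, here is a much-simplified Python sketch of the general idea. It is not Shingle Cloud itself: the regularization rules (lower-casing, dropping vowels, and treating `y` as a vowel), the use of word bigrams, and the "longest run of 1s" interpretation of the binary string are all assumptions chosen to keep the example short.

```python
import re


def regularize(text):
    """Lower-case, keep alphabetic words, and strip vowels (assumed rules)."""
    words = re.findall(r"[a-z]+", text.lower())
    stripped = (re.sub(r"[aeiouy]", "", w) for w in words)
    return [w for w in stripped if w]


def shingles(tokens, n):
    """All runs of n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def match_string(needle, haystack, n=2):
    """One character per haystack shingle: '1' if it occurs in the needle."""
    needle_set = set(shingles(regularize(needle), n))
    return "".join(
        "1" if s in needle_set else "0"
        for s in shingles(regularize(haystack), n)
    )


def locate(bits, min_run=2):
    """Naive interpretation: span of the longest run of 1s, or None."""
    best = None
    for m in re.finditer(r"1+", bits):
        if best is None or m.end() - m.start() > best[1] - best[0]:
            best = (m.start(), m.end())
    if best is not None and best[1] - best[0] >= min_run:
        return best
    return None


needle = "The kyng rode to London"
haystack = "In that yere the king rode to London with his lords"
bits = match_string(needle, haystack)  # "0001111000"
span = locate(bits)                    # (3, 7)
```

Here `kyng` and `king` both regularize to `kng`, so the spelling difference between the two passages disappears before comparison. Each step is a single pass with set lookups, which is in keeping with the linear running time mentioned above, though a real interpreter of the binary string would need to tolerate gaps rather than insist on one unbroken run.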

The TEI-Comparator will shortly be publicly available on SourceForge, with documentation and examples to make it easy for others to re-purpose the software for similar uses and to submit bugs and requests for future development.

Although known as the ‘TEI-Comparator’, the program does not require TEI input; it works with XML files of any vocabulary as long as the elements being compared have sufficient unique text in them.

For more information about the TEI-Comparator e-mail: tei@oucs.ox.ac.uk


6 Responses to “TEI-Comparator”

  1. Dave Postles says:

    I wonder what TEI-Comparator does which Wordsmith 5.0 doesn’t. When it is available on Sourceforge, will there be a Linux package and will there be .deb, .rpm, as well as source code? If source code, can it be easily compiled from tar.gz or tar.bz2 with tar, ./configure, cd, make, make install on all distros (currently using Mandriva, PCLinuxOS, Crunchbox, and Mepis-Antix)
    Cheers,
    D.

  2. Dave Postles says:

    add question mark at end!

  3. jamesc says:

    Hi,

    What TEI-Comparator does and what Wordsmith does are very different things. TEI-Comparator is basically designed to do one simple task: to say this id here in this file is the same as this id there in this other file. In pre-processing, IDs are assigned to the elements in the files, and it then attempts, through fuzzy comparison, to match each one with similarly id’ed items in the other file. It then provides a web frontend to allow someone to confirm these matches, or remove them.

    TEI-Comparator is technically already up there on sourceforge in its first release beta, in the subversion archive, but we haven’t packaged it yet because it is missing documentation. (And it needs a bit more tidying up…) Basically one downloads the source and then builds a .war file to deploy under tomcat… so we’ll probably release it as a .tar.gz or zip or something because there is so much customisation needed for each use of it that it probably wouldn’t work well as a packaged .deb … but we’ll certainly explore that.

    It will be released shortly before the TEI Members Meeting in November… when I’ve got around to writing more comprehensive documentation. ;-)

    -James

  4. [...] Chronicles, a broad bibliography and a number of working papers. Technophiles can also explore a blog on the use of a TEI-Comparator for the project.  The launch of this site is by no means the end of this project. An Oxford [...]

  5. Arno Mittelbach says:

    A first packaged version of the TEI-Comparator is now available at http://tei-comparator.sourceforge.net/