I have just finished my poster for DRHA 2009 which is about the TEI-Comparator that RTS worked on for the Holinshed Project. My poster is available online in PDF and PNG formats. (Though for the record it was created in Inkscape as an SVG file).
The poster discusses the creation of the tool for the Holinshed Project at the University of Oxford. Holinshed’s Chronicles of England, Scotland, and Ireland was the crowning achievement of Tudor historiography and an important historical source for contemporary playwrights and poets. Holinshed’s Chronicles was first printed in 1577 and a second revised and expanded edition followed in 1587. EEBO-TCP had already encoded a version of the 1587 edition, and the Holinshed Project specially commissioned them to create a 1577 edition using the same methodology. The resulting texts were converted to valid TEI P5 XML and used as a base to construct a comparison engine, known as the TEI-Comparator, to assist the editors in understanding the textual differences between the two editions.
Using the TEI-Comparator has several stages. The first was to decide what elements in the two TEI XML files should be compared. In this case the appropriate granularity was at the paragraph (and paragraph-like) level. The project was primarily interested in how portions of text were re-used, replaced, expanded, deleted, and modified from one edition to another. This first stage ran a short preparatory script which added unique namespaced IDs to each relevant element in both the TEI files. It is the proper linking of these two IDs which the TEI-Comparator hoped to facilitate.
The second stage was to prepare a database of initial comparisons between the two texts using a bespoke fuzzy text-comparison n-gram algorithm designed by Arno Mittelbach (the technical lead for the TEI-Comparator). This algorithm, called Shingle Cloud, transforms both input texts (needle and haystack) into sets of n-grams. It matches the haystack’s n-grams against the needle’s and constructs a huge binary string where they match. This binary string is then interpreted by the algorithm to determine whether the needle can be found in the haystack and if so where. The algorithm runs in linear time and, given the language of the originals, was found to work better if the strings of text were regularized (including removal of vowels).
The third stage in using the comparator was for the research assistant on the project to confirm, remove, annotate, or create new links between one edition and the other using a custom interface to the TEI-Comparator constructed in Java using the Google Web Toolkit API. The final stage was to produce output from the work put in by the RA through generating two standalone HTML versions of the texts which were linked together based on the now-confirmed IDs.
Shortly the TEI-Comparator will be publicly available on Sourceforge with documentation and examples to make it easy for others to re-purpose this software for other similar uses, and submit bugs and requests for future development.
Although known as the ‘TEI-Comparator’, the program does not require TEI input, it works with XML files of any vocabulary as long as the elements being compared have sufficient unique text in them.
For more information about the TEI-Comparator e-mail: email@example.com