ODDly Pragmatic: Documenting encoding practices in Digital Humanities projects

[This is the rough draft text of a plenary lecture I will have given at JADH2013 http://www.dh-jac.net/jadh2013/abst21.html#plenary2 (click to expand abstract). This isn’t necessarily the precise text I delivered but the notes from a couple of days before.

It is written very much to go with the slides, really a prezi, at: http://tinyurl.com/jc-JADH2013 and doesn’t really make lots of sense without it.  (I’m not claiming it makes lots of sense with it either!) It re-uses much material I’ve discussed and written about in other locations so I’m not making claims of originality either. Credit is due to everyone involved in the TEI, DH Projects mentioned, and many ideas from the DH community at large. All errors and misrepresentations are mine and unintentional, I apologise in advance. The intention is to superficially expose a slightly larger audience at JADH2013 to some of the concepts and benefits of TEI ODD Customisation.]

 The TEI

Use of the TEI Guidelines for Electronic Text Encoding and Interchange is often held up as the gold standard for Digital Humanities textual projects. These Guidelines describe a wide variety of methods for encoding digital text and in some cases there are multiple options for marking up the same kinds of thing. The TEI takes a generalistic approach to describing textual phenomena consistently across texts of different times, places, languages, genres, cultures, and physical manifestations, but it simultaneously recognises that there are distinct use cases or divergent theoretical traditions which sometimes necessitate fundamentally different underlying data models.  Unlike most standards, however, the TEI Guidelines are not a fixed entity as they give projects the ability to customise their use of the TEI — to constrain it by limiting the options available or extending it into areas the TEI has not yet dealt with. It is this act of customisation and the benefits of it that I will speak of today.

But what is the TEI?

The Text Encoding Initiative Consortium (TEI) is an international membership consortium whose community and elected representatives collectively develop and maintain the de facto standard for the representation of digital texts for research purposes. The main output of the community is the TEI Guidelines, which provide recommendations on encoding methods for the creation of digital texts. Generally the TEI is used by academic research projects in the humanities, social sciences, and linguistics, but also by publishers, libraries, museums, and individual scholars, for the creation of digital texts for research, teaching, and long-term preservation.

It is also a community of volunteers: institutions like the University of Oxford donate a fraction of staff time (like part of mine) towards the TEI, as do other institutions with elected volunteers or contributors working on research projects.

The TEI is also the outputs that it creates, such as the Guidelines themselves, definitions and examples of over 530 markup distinctions, and various transformation software to convert to and from the TEI. It is also a consensus-based way of structuring textual resources – it isn’t determined by the weight of a single institution or commercial company but by the Technical Council members elected by the membership. The TEI is a way to produce customised, internationalised schemas for validating a project’s digital texts. It is a format that allows you to document your interpretation and understanding of a text, but it is also a well-understood format suitable for long-term preservation in digital archives. But most of all, it is a community-driven standard, so it is a product of all of those involved in it.

What the TEI is not:

It isn’t the only standard in this area. It is the most popular, but there are others, and people re-invent the wheel unnecessarily all the time. It isn’t objective or non-interpretative: the application of markup is an interpretative act that shouldn’t just be left to junior research assistants; it is the intellectual and editorial content of a digital text. The TEI isn’t used consistently in different projects, and often not even in the same project. (Which is why TEI customisation for consistency is an important form of documentation.) The TEI isn’t fixed and unchanging: unlike most standards, which are static, the TEI evolves as the community finds new and important textual distinctions, and customisation gives you a way to document precisely what version of the TEI you are using. It isn’t your research end-point: the creation of a collection of digital texts isn’t an end in itself; it is what you can then do with those texts, the research questions they enable you to answer, that is important.

Nor is it automatic publication of your materials in a useful way. Off-the-shelf TEI publication systems will all need customising to deal with the specific and interesting reasons you were encoding these texts in the first place. In general, though, experience teaches us that the benefits of a shared vocabulary far outweigh any difficulties in adopting the TEI.

Generalistic Approach:

As noted above, the TEI takes a generalistic approach while recognising that distinct use cases and divergent theoretical traditions sometimes necessitate fundamentally different underlying data models. The ability to customise the TEI scheme is something which sets it apart from other international standards. At first glance this may seem contradictory: how can one have a standard that any project is allowed to change? It is because the TEI’s approach to the creation of this community-based standard is not to create a fixed entity, but to provide recommendations within a framework in which projects are able to extend or constrain the scheme itself. They can constrain it by limiting the options available to their project or extend it into areas not yet covered by the TEI.

It is nonsensical for a project to dismiss use of the TEI because it does not yet have elements specific to its needs as that project is able to extend it in that direction.

This combination of the generalistic nature and ability to customise the TEI Guidelines is both one of its greatest strengths as well as one of its greatest weaknesses: it makes it extremely flexible, but this can be a barrier to the seamless interchange of digital text from sources with different encoding practices. Any difficulty can be lessened by documentation through proper customisation.

TEI ODD Customisation

Every project using the TEI is dependent upon some form of customisation (even if it is the ‘tei_all’ customisation, with everything in it, that the TEI provides as an example). The TEI has many elements covering textual distinctions from linguistics and the marking up of speech to transcribing or describing medieval manuscripts and more. The TEI organises all these elements into a wide array of modules. A module is simply a convenient way of grouping together a number of associated element declarations. Sometimes, as with the TEI’s Core module (containing the most common elements), these may be grouped together for practical reasons. However, it is more usual, as with the ‘Dictionaries’ module, to group the elements together because they are all semantically related to one particular sort of text or encoding need. As one would expect, an element can only appear in one module, lest there be a conflict when modules are combined.

Almost every chapter of the TEI Guidelines has a corresponding module of elements. In the underlying TEI ODD customisation language both the prose of that chapter of the Guidelines and the specifications for all the elements are stored in one file. It is from this file that both the TEI documentation and the element relationships that are used to generate a schema are created. So there is a chapter on dictionaries and it also creates the module for dictionaries.

The TEI method of customisation is written in a TEI format called ‘ODD’, or ‘One Document Does-it-all’, because from this one source we can generate multiple outputs such as schemas, localised encoding documentation, and internationalised reference pages in different languages. A TEI ODD file is a method of documenting a project’s variance from any particular release of the full TEI Guidelines. The TEI provides a number of methods for users to undertake customisation ranging from intuitive web-based interfaces to authoring TEI ODD files directly. These allow users to remove unwanted modules, classes, elements, and attributes from their schema and redefine how any of those work, or indeed add new ones. One of the benefits of doing this through a meta-schema language like TEI ODD is that these customisations are documented in a machine-processable format which indicates precisely which version of the TEI Guidelines the project was using and how it differed from the full Guidelines. This same format is what underlies the TEI’s own steps towards internationalisation of the TEI Guidelines into a variety of languages (including Japanese).
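To give a flavour of the format, here is a minimal sketch of an ODD customisation (the ident value and the selection of modules are illustrative, not any particular project’s):

  <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader><!-- the usual metadata, here about the customisation itself --></teiHeader>
    <text>
      <body>
        <p>Prose documenting the project's encoding practice sits alongside the schema specification.</p>
        <schemaSpec ident="myProject" start="TEI">
          <!-- the four modules almost every TEI schema needs -->
          <moduleRef key="tei"/>
          <moduleRef key="header"/>
          <moduleRef key="core"/>
          <moduleRef key="textstructure"/>
        </schemaSpec>
      </body>
    </text>
  </TEI>

From this one file the TEI stylesheets can generate a Relax NG schema, HTML or PDF documentation, and more.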

This concept of customisation originates from a fundamental difference between the TEI and other standards — it does not try to tell users that if they want to be good TEI citizens they must do something this one way and only that way, but instead while making recommendations it gives projects a framework by which they can do whatever it is that they need to do but document it in a (machine-processable) form that the TEI understands. This is standardisation by not saying “Do what I do” but instead by saying “Do what you need to do but tell me about it in a language I understand”.

The result of a customisation might be only to include certain modules, and by doing so lessen the amount of choice available when using a generated schema to encode a digital text. But of course, even inside these modules there will be elements that your project does not need.

ROMA

We do not necessarily need to learn the underlying TEI ODD format to create our customisation. The TEI community provides various tools to do this, such as ‘Roma’ which is a basic web interface for creating customisations. It gives you a way to build up from the most minimal schema, reduce down from the largest possible one, use one of the existing templates, use one of the common TEI example customisations, or upload a customisation that you had saved previously.

And of course the TEI strongly believes in internationalisation, so wherever we can get volunteers to translate the website and the descriptions of elements into their own languages, we can incorporate that into the interface. What’s more, this means that the schemas you generate can have glosses and tooltips in your XML editing software that come up in that particular language.

On the ‘Modules’ tab we see a list of all of the modules, and it is an easy thing to click ‘Add’ on the left-hand side; the modules will then be included in our schema and appear in the list on the right-hand side. Removing them is just as easy.

Clicking on any of the modules enables us to include or exclude those elements we want from the schema we are building.

But what is happening underneath? In this case we’re generating a TEI ODD XML file which stores the changes we have made. We document that we want to include these modules, but also that we want to delete these elements, or in the case of the last one an attribute on an element. Back in the web interface we could look at the attributes for the <div> element and choose to include or exclude those that we want.
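The generated ODD records all of these choices in a form something like this (the module selection and idents here are illustrative):

  <schemaSpec ident="myProject" start="TEI">
    <moduleRef key="tei"/>
    <moduleRef key="header"/>
    <moduleRef key="core" except="l lg sp speaker stage"/>
    <moduleRef key="textstructure"/>
    <!-- delete a whole element -->
    <elementSpec ident="said" mode="delete"/>
    <!-- delete a single attribute on an element -->
    <elementSpec ident="div" mode="change">
      <attList>
        <attDef ident="decls" mode="delete"/>
      </attList>
    </elementSpec>
  </schemaSpec>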

And for each of those attributes, here the @type attribute, we could choose whether it was required or not, whether its list of values was closed or open, and what those values might be.

Again, underneath this is XML that documents how we are changing the TEI schema: here making the @type attribute required, and giving it a closed value list of prose, verse, drama, and other.
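In the ODD this choice comes out roughly as follows; note that a <desc> for each value (a limitation of the web interface discussed below) can be added by editing the file directly:

  <elementSpec ident="div" mode="change">
    <attList>
      <attDef ident="type" mode="change" usage="req">
        <valList type="closed" mode="replace">
          <valItem ident="prose">
            <desc>continuous prose</desc>
          </valItem>
          <valItem ident="verse"/>
          <valItem ident="drama"/>
          <valItem ident="other"/>
        </valList>
      </attDef>
    </attList>
  </elementSpec>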

But there are limitations to this web interface: for example, it currently doesn’t allow you to provide a description for each of these values (the <desc> element here). There is no reason it shouldn’t; it is just that the creators haven’t had the time or money to improve the software in the last few years. The TEI Council is actively looking at ways to encourage the community to create newer ODD editors.

From our customisation we can generate a variety of documentation, and this documentation will be localised, meaning that your changes will be reflected in it, as well as internationalised, in that it will use your choice of language where it can. One of the great things about TEI ODD files is that you can also include as much prose as you want describing your project’s encoding practice. And, of course, you can also generate a variety of schema languages to validate your documents. The TEI tends to recommend Relax NG as its preferred format. And although you can generate DTDs as well, DTDs are now a dated document validation format that I would not recommend.

One of the interesting recent developments is that a user can now ‘chain’ customisations together: their TEI ODD file points at an existing one as its source, and so on. This means that if there is an existing customisation that you like (for example, the EpiDoc customisation for classical epigraphy), then a project can point at that to use it as a starting point and add to it, and regenerate its schemas with the new additions any time the original source has changed.
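In the ODD source, chaining works by pointing the @source attribute at an existing compiled ODD rather than at the TEI itself. A rough sketch (the URL and the specification changes are made up for illustration):

  <schemaSpec ident="myEpigraphy" start="TEI"
    source="https://example.org/epidoc/tei-epidoc.compiled.odd">
    <!-- declarations are drawn from the EpiDoc customisation... -->
    <moduleRef key="tei"/>
    <moduleRef key="header"/>
    <moduleRef key="core"/>
    <moduleRef key="textstructure"/>
    <!-- ...and can then be constrained or extended as usual -->
    <elementSpec ident="note" mode="change">
      <attList>
        <attDef ident="type" mode="change" usage="req"/>
      </attList>
    </elementSpec>
  </schemaSpec>

Regenerating against a newer version of the source ODD then picks up its changes automatically.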

Such documentation of variance of practice and encoding methods enables real, though necessarily mediated, interchange between complicated textual resources. Moreover, with time a collection of these meta-schema documentation files helps to record the changing assumptions and concerns of digital humanities projects.

OxGarage

OxGarage is the web front end to a set of conversion scripts the TEI provides to convert to and from TEI. It is really easy to use: you choose what type of input document you have, and if the service can get from that format to any other format through a pipeline of conversions, then you can choose that as an output format. Once you’ve chosen the output you can convert to it, or there are all sorts of advanced options for handling things like embedded images. One of the benefits of this freely available tool is that it is a web service, and so you can build it into other platforms. For example, when the Roma tool we saw converted to HTML documentation, or indeed to the Relax NG schema, behind the scenes it was sending the ODD to this OxGarage web service to do the conversion.

The Stationers’ Register Online

The Stationers’ Register Online project is a good example of how TEI ODD customisation can save a project money and further its research aims. This project received minimal institutional funding from the University of Oxford’s Lyell Research Fund to transcribe and digitise the first four volumes of Arber’s edition of the Register of the Stationers’ Company. The Register is one of the most important sources for the study of British book history after the books themselves, being the method by which the ownership of texts was claimed, argued, and controlled between 1557 and 1924. This register survives intact in two series of volumes which are now at the National Archives and Stationers’ Hall itself. The pilot SRO project has created full-text transcriptions of Edward Arber’s 1894 edition of the earliest volumes of the Register (1557–1640) and the Eyre, Rivington, and Plomer 1914 edition (1640–1708). It has also estimated the costs involved in the proofing and correction of the resulting transcription against the manuscript originals, as well as potential costs of transcription of the later series from both manuscript and printed sources.

A typical entry lists the members of the company registering the book, to ensure their right to print it, the name of the author, and title of the book. There is also an amount shown which is the cost of registering it. In this case the book is the Comedies, Histories, and Tragedies, of one Mr William Shakespeare. As Edward Arber’s nineteenth-century edition of the Stationers’ Register existed as a source, it was decided that this was a much better starting point for the pilot than the manuscript materials themselves. In the earlier volumes the register is also used as a general accounts book for the Stationers’ Company, but over time evolves into a more or less formulaic set of entries following a fairly predictable format.

Although hard to see in this low-res scan, even in the nineteenth century Arber recognised the potential usefulness of markup and thus marked particular features of the Register surprisingly consistently in the volumes he edited. The encoding tools at his disposal, however, were only page layout and choice of fonts. The ‘nineteenth-century XML’, as the presentational markup he chose was termed within the project, was used to indicate basic semantic data categories. For members of the Stationers’ Company Arber uses a different font, Clarendon. Other names are in roman small capitals, but the names of authors are in italic capitals.

Arber’s extremely consistent use of this presentational markup, and the subsequent encoding of it by the data keying company, meant that the project could generate much of the descriptive markup itself. If this presentational markup had not existed then a pilot project (with very minimal funding) to produce a digital textual dataset would not have been possible. As with all TEI customisations, this was done with a TEI ODD file. This TEI ODD file used the technique of inclusion rather than exclusion (that is, it said which elements were allowed instead of taking all of them but deleting the ones it did not want). What this meant was that when the project regenerated its schemas or documentation using the TEI Consortium’s freely available services, only the original requested elements were included, and new elements that had been added to the TEI since the project created the ODD would be excluded.
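In ODD terms, inclusion means using the @include attribute on <moduleRef> to list exactly the elements wanted, rather than @except to remove unwanted ones. A sketch of the technique (the element lists here are illustrative, not the project’s actual selection of 34):

  <schemaSpec ident="tei_corset" start="TEI">
    <moduleRef key="tei"/>
    <moduleRef key="header" include="teiHeader fileDesc titleStmt publicationStmt sourceDesc"/>
    <moduleRef key="core" include="p title name num list item label hi gap lb pb"/>
    <moduleRef key="textstructure" include="TEI text body div"/>
  </schemaSpec>

Any element the TEI later adds to these modules simply never enters the schema, because it is not on the list.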

The Bodleian Library’s relationship with a number of keying companies meant that the SRO project was able to find one willing to encode the texts in XML to any documented schema. And indeed, very importantly, this particular keying company charged for their work by kilobyte of output. Owing to this, the project realised that it would save money if it could create a byte-reduced schema which resulted in files of smaller size. Our ODD customisation replaced the long, human-readable names of elements, attributes, and their values with highly abbreviated forms.

For example, the <div> element became <d>, the @type attribute became @t, and the allowed values for @t were tightly controlled. This meant that what might be expanded as <div type="entry"> (24 characters with its closing tag) was coded as <d t="e"> (13 characters). The creation of such a schema was intended solely to reduce the number of characters used in the resulting edited transcription, as an intermediate step in the project’s workflow — document instances matching this schema are not public, since it is the expanded version that is more useful. This sacrificed the extremely laudable aims of human-readable XML and replaced it with cost-efficient brevity. Because of this compression of elements we called our customisation tei_corset.

This sort of literate programming becomes fairly straightforward once one is used to the concept. However, there is an important additional step here, which is the use of the <equiv> element. This informs any software processing this TEI ODD that a filter for this element exists in a file called ‘corset-acdc.xsl’ which would revert to, or further document or process, an equivalent notation. In this case a template in that XSLT file transforms any <ls> element back into a <list> element. In addition to renaming the @type attribute to @t, some of the other element customisations constrain the values that it is able to contain. For example, in the <n> element (which is a renamed TEI <name> element) the @t attribute has a closed value list enabling only the values of ‘per’ (personal name), ‘pla’ (place name), and ‘oth’ (other name). In most cases, though, the names are documented by Arber using his presentational markup, and this is captured with the @rend attribute (or its renamed version, @r).
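Put together, the specifications just described look something like this sketch (reconstructed from the description above, not the project’s actual file):

  <elementSpec ident="list" mode="change">
    <altIdent>ls</altIdent>
    <equiv filter="corset-acdc.xsl"/>
  </elementSpec>

  <elementSpec ident="name" mode="change">
    <altIdent>n</altIdent>
    <equiv filter="corset-acdc.xsl"/>
    <attList>
      <attDef ident="type" mode="change">
        <altIdent>t</altIdent>
        <valList type="closed" mode="replace">
          <valItem ident="per"><desc>personal name</desc></valItem>
          <valItem ident="pla"><desc>place name</desc></valItem>
          <valItem ident="oth"><desc>other name</desc></valItem>
        </valList>
      </attDef>
    </attList>
  </elementSpec>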

As with many TEI customisations designed solely for internal workflows, the tei_corset schema is not in fact TEI Conformant. The popular TEI mass-digitisation schema tei_tite has the same non-conformance issues. Both of these schemas make changes which fly in the face of the TEI Abstract Model as expressed in the TEI Guidelines. The tei_corset schema, in addition to temporarily renaming the <TEI> element as <file>, changes the content model of the <teiHeader> element beyond recognition.

This bit of the customisation documents the renaming of the <teiHeader> element to <header>, which compared to the other abbreviations is quite long, but it was only used once per file so there was less pressure to abbreviate it highly. The @type attribute is deleted and, more importantly, the entire content model is replaced. This uses embedded Relax NG schema language to say that a <title> element (which is later renamed to <t>) is all that is required, but that it can have zero or more members of the model.pLike class after it. This enabled the keying company to put a basic title in the file (to say what volume it was), but gave them nothing but some paragraphs as a place to note any problems or questions they had. Usually TEI documents have much more metadata, but this is unproblematic because these headers were replaced with more detailed ones at a later stage in the project’s data workflow. Other changes meant that elements that are usually empty would be (temporarily) allowed text inside. In the process of up-converting the resulting XML, these were replaced with the correct TEI structures. In this customisation of the TEI <gap> element, in addition to allowing text, the locally-defined attributes @agent, @hand, and @reason are removed.
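A sketch of those two changes, reconstructed from the description above using the embedded Relax NG idiom:

  <elementSpec ident="teiHeader" mode="change"
    xmlns:rng="http://relaxng.org/ns/structure/1.0">
    <altIdent>header</altIdent>
    <content>
      <rng:group>
        <rng:ref name="title"/>
        <rng:zeroOrMore>
          <rng:ref name="model.pLike"/>
        </rng:zeroOrMore>
      </rng:group>
    </content>
    <attList>
      <attDef ident="type" mode="delete"/>
    </attList>
  </elementSpec>

  <elementSpec ident="gap" mode="change"
    xmlns:rng="http://relaxng.org/ns/structure/1.0">
    <!-- temporarily allow text content inside the normally empty element -->
    <content>
      <rng:text/>
    </content>
    <attList>
      <attDef ident="agent" mode="delete"/>
      <attDef ident="hand" mode="delete"/>
      <attDef ident="reason" mode="delete"/>
    </attList>
  </elementSpec>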

In a full tei_all schema the <gap> element would have the possibility of many more attributes, but these are provided by its claiming membership in particular TEI attribute classes. For the tei_corset schema many TEI classes were simply deleted which meant that the elements that were claiming membership in these classes no longer received these attributes.
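In the ODD such a deletion is a single line, and every element claiming membership then loses the attributes that class supplied (the class name here is illustrative):

  <classSpec ident="att.editLike" type="atts" mode="delete"/>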

The result of the customisation is a highly abbreviated, and barely human-readable form of TEI-inspired XML. For example here we have a <n> element marking ‘Master William Shakespeers’ with the forename and surname marked with ‘fn’ and ‘sn’. The conversion of this back to being a <persName> element with <forename> and <surname> is very trivial renaming in XSLT.
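A sketch of what such a reversion stylesheet looks like (this is not the project’s corset-acdc.xsl itself; it assumes the keyed files are in no namespace and that ‘fn’ and ‘sn’ are the abbreviated element names, and a real conversion would also expand abbreviated attribute values):

  <xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.tei-c.org/ns/1.0">

    <!-- identity: copy anything not handled by a more specific template -->
    <xsl:template match="@* | node()">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:template>

    <!-- <n t="per"> becomes <persName>; the @t is consumed by the mapping -->
    <xsl:template match="n[@t = 'per']">
      <persName>
        <xsl:apply-templates select="@* except @t, node()"/>
      </persName>
    </xsl:template>

    <!-- assumed abbreviated forms for forename and surname -->
    <xsl:template match="fn">
      <forename><xsl:apply-templates/></forename>
    </xsl:template>
    <xsl:template match="sn">
      <surname><xsl:apply-templates/></surname>
    </xsl:template>

    <!-- the abbreviated @r and @t become @rend and @type elsewhere -->
    <xsl:template match="@r">
      <xsl:attribute name="rend" select="."/>
    </xsl:template>
    <xsl:template match="@t">
      <xsl:attribute name="type" select="."/>
    </xsl:template>

  </xsl:stylesheet>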

Passing a couple of centuries’ worth of records through the transformation results in much more verbose markup.

But it wasn’t just simple renaming that we undertook in reverting this highly compressed markup to a fuller form; there was more detailed up-conversion as well. Such entries contain fees paid, which are almost always aligned to the right margin by Arber and recorded in roman numerals. The keying company was asked to mark these fees (the <num> element having been renamed to <nm>) and to use the @r attribute to indicate the formatting of ‘ar rm’ (aligned to the right, and in roman numerals). The benefit to the project of them doing this is that the SRO project could then up-convert this simple number into a more complex markup for the fee.

The up-conversion I wrote here isn’t simply to revert numbers back to the correct TEI markup, but to make them into even better markup by deriving information from the textual string that is encoded. The tokenization of the provided amounts into pounds, shillings, and pence, and the consistent encoding of the unit indicator as superscript, are key parts of this. Arber’s edition provided all the markers of pounds/shillings/pence as superscript, so the keying company was not asked to provide this, as the project realised it could be done automatically after the fact and would save even more characters. I also converted the roman numerals to arabic numbers so that easy calculations of the total amount in pence (for comparative purposes) could be provided. To do this, the XSLT stylesheet converted the keyed text string back into pure TEI and simultaneously broke up the string based on whether it ended with a sign for pounds, shillings, pence, or half-pence. An additional XSLT function converted the roman numerals in between these to arabic, and then to pence, so that the individual and aggregate amounts could be stored. The markup that results provides significantly more detail than the original input.
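The numeral conversion itself can be a small XSLT 2.0 function. A sketch (the sro namespace is my invention; note the early-modern use of ‘j’ for a final ‘i’, as in ‘vij’):

  <xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:sro="http://example.org/ns/sro">

    <!-- value of a single letter; 'j' counts as 'i', anything unrecognised as 0 -->
    <xsl:function name="sro:digit" as="xs:integer">
      <xsl:param name="c" as="xs:string"/>
      <xsl:sequence select="
        if ($c = ('i', 'j')) then 1    else if ($c = 'v') then 5
        else if ($c = 'x')   then 10   else if ($c = 'l') then 50
        else if ($c = 'c')   then 100  else if ($c = 'd') then 500
        else if ($c = 'm')   then 1000 else 0"/>
    </xsl:function>

    <!-- sro:fromRoman('xiiij') returns 14: a letter before a larger one is subtracted -->
    <xsl:function name="sro:fromRoman" as="xs:integer">
      <xsl:param name="numeral" as="xs:string"/>
      <xsl:variable name="vals" as="xs:integer*" select="
        for $i in 1 to string-length($numeral)
        return sro:digit(lower-case(substring($numeral, $i, 1)))"/>
      <xsl:sequence select="sum(
        for $i in 1 to count($vals)
        return if ($i lt count($vals) and $vals[$i] lt $vals[$i + 1])
               then -$vals[$i] else $vals[$i])"/>
    </xsl:function>

  </xsl:stylesheet>

With the amounts tokenized, the aggregate in pence is then simple arithmetic: pounds × 240 plus shillings × 12 plus pence.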

The benefit of this customisation was based entirely on the keying company both using whatever XML schema we gave them, and charging per kilobyte of output. Originally we’d calculated that by having them use this schema rather than full TEI we were saving around 40%. In the end, if we include the up-converted information as well, this rises to a 60% saving. The extra money we had left meant that we were able to include the 1640-1708 material as well even though it had been out of scope for the original project.

The Godwin Diary project

The Godwin Diary project was funded by the Leverhulme Trust to digitise and produce a full-text edition of the 48 years of William Godwin’s diary. William Godwin (1756-1836) was a philosopher, writer, and political activist. He is perhaps most commonly known as the husband of Mary Wollstonecraft and the father of Mary Wollstonecraft Shelley, the author of Frankenstein. Godwin faithfully kept a diary from 1788 until his death in 1836; the diary is now preserved in the Abinger collection in the Bodleian Library. It is an extremely detailed resource of great importance to researchers in fields such as history, politics, literature, and women’s studies. The concise diary entries consist of notes of who Godwin ate with or met with, his own reading and writing, and major events of the day. The diary gives us a glimpse into this turbulent period of radical intellectualism and politics, and many of the most important figures of the time feature in its pages, including Samuel Taylor Coleridge, Richard Sheridan, Mary Wollstonecraft, William Hazlitt, Charles Lamb, Mary Robinson, and Thomas Holcroft, among many others.

The project team was small, consisting mostly of Mark Philp and David O’Shaughnessy and a couple of their students in the politics department. It is important to note that it was the politics department, since it was less Godwin’s life as a literary figure than his social network of relationships which concerned the project.

The Bodleian has provided hi-res images of the diary, and done so under an open licence that has already significantly benefited research in this area. In providing the technical support to the project, it is worth noting that I gave the team only two days of technical training. Partly this is a benefit of the TEI ODD customisation: they didn’t have to learn the entirety of the TEI, only the bits they were using. I provided this training, created the TEI ODD customisation, developed the website, and was also a source of general technical support during the life of the project.

However, even with basic training they were able to mark up the 48 years of the diary, categorise every meal, meeting, event, text mentioned, and person named. In addition they identified more than 50,000 of the ~64,000 name instances recorded in the diary and linked these to additional prosopographical information.

Godwin’s diaries are simultaneously immensely detailed (recording the names of almost everyone he ever met with) and frustratingly concise (he only rarely gives details of what they talked about). Godwin’s diary is quite neatly written and easy to read. The dates, here in a much lighter ink, are usually given (and given correctly) and generally a day’s entry forms the basic structural unit of the diary. In only a very few instances do the notes from one day stray into the page area already pre-ruled for the following day. Occasionally there are marginal notes to provide more information, but in most cases the textual phenomena are quite predictable – mostly substitutions and interlinear additions. In many ways the hierarchical nature of a calendrical diary entry makes it ideal for encoding in XML.

There is some indication that Godwin may have returned to certain volumes at a later date to rewrite or correct them. And yet, it is certainly impressive that there are entries for most days, and that, however minimal the information given, the names of those attending the frequent meetings Godwin had with those in his circle are recorded. The majority of his diary entries could be broken down into several categories and sub-types. These include his meals, who he shared them with, who he met, very rarely what they talked about, and what works he was reading or writing at the time. The political historians, it is easy to understand, were eager to use the resource to explore which individuals might be meeting with which other friends of Godwin’s at specific times. Meanwhile, those exploring Godwin’s writings might be interested in knowing what works he was reading when he was writing specific parts of some of his own.

But that is enough about Godwin, back to the project itself. Of course, having the hi-res images means that I included a typical pan/zoom interface, here built on top of Google Maps, to show each page of the diary. Two links are important to notice on this screenshot though: one is the link to the Creative Commons ‘full image’. There is no barrier to getting the full image, no one that researchers need to ask; they can just download it. The same is true for all the underlying XML. The other link is a direct link to the diary text for this page. This means that one can browse the diaries based on their physical manifestation, as a series of images, and jump to the text at any point. Or one can read the transcribed text and jump to the image for that page. The project specifically asked for there not to be a side-by-side facing image/text view because they wanted to preserve the distinction between these two experiences of reading the text.

The customised TEI ODD in the case of the Godwin project wasn’t made to create highly abbreviated element names for some keying company. Instead it was to create aliases for elements to give those encoding the diary a small and easy set of elements through which to categorise the parts of a diary entry in terms that made sense to them.

So there were element specifications created for divisions that renamed them to be diary year, month, and day. There were specialised elements to mark segments of text, really re-namings of the TEI <seg> element, for those portions of diary entries recording meals, meetings, events, and more, all with specific names that made sense to the project.

For example, the element specification showing here creates a new element called ‘dMeal’, which is a diary-entry meal. There is an <equiv> element pointing back to an XSLT file which can revert this to pure TEI.

There is a description of the new element, and some information about what classes it is a member of and what is allowed inside it. There is a locally-defined @type attribute which has been made required, and has a list of values for each type of meal, but also indicates whether the person was dining at Godwin’s place or whether he was visiting them.
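A sketch of what such a specification looks like (the meal values, class memberships, and filter name here are illustrative reconstructions, not the project’s actual ODD):

  <elementSpec ident="dMeal" mode="add">
    <equiv filter="godwin-totei.xsl"/>
    <desc>marks the portion of a diary entry recording a meal</desc>
    <classes>
      <memberOf key="att.global"/>
      <memberOf key="model.segLike"/>
    </classes>
    <content>
      <rng:ref name="macro.paraContent"
        xmlns:rng="http://relaxng.org/ns/structure/1.0"/>
    </content>
    <attList>
      <attDef ident="type" usage="req">
        <valList type="closed" mode="add">
          <valItem ident="breakfast"/>
          <valItem ident="dinner"/>
          <valItem ident="sup"><desc>supper at Godwin's</desc></valItem>
          <valItem ident="supAt"><desc>supper while visiting</desc></valItem>
        </valList>
      </attDef>
    </attList>
  </elementSpec>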

As with the Stationers’ Register project markup, this was easily converted back to pure TEI P5 XML. You can see some of the @type attribute values preserve the original name of the customised markup. Once restored, this dMeal element becomes a <seg type="dMeal">.

In this case it is a supper, where Godwin has supped at his friends the Lambs’ with a variety of other people. While at the meal he has had a short little side meeting with H Robinson.

The structure of the diary is also quite straightforward. As you can see, each month has an @xml:id attribute which gives its year and month, and each day has precisely the same thing but with the day added. These were required by the ODD customisation and, moreover, the schema requires that each day entry have a <date> element with a @when attribute encoded in it. This means that in creating the processing for the diary entries I could be sure that each diary entry would have a date, and each month a clearly understandable ID, so creating transformations which produce the website by year, month, or day becomes very straightforward.
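The converted skeleton of a volume therefore follows a completely predictable pattern, something like this (since an @xml:id cannot begin with a digit, I have assumed a prefix; the real project’s id scheme may differ):

  <div type="year" xml:id="e1788">
    <div type="month" xml:id="e1788-04">
      <div type="day" xml:id="e1788-04-02">
        <date when="1788-04-02"/>
        <p><seg type="dMeal">sup at Lambs</seg></p>
      </div>
    </div>
  </div>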

The changes to the TEI scheme, in renaming elements this time not for brevity but for simplicity, meant that the project’s ability to mark up the documents in XML increased dramatically. The other changes, such as requiring a <date> element with a @when attribute, meant that the processing of the documents was even easier. In short, the customisation made both my life and the encoders’ lives easier.

In the resulting webpages, one can toggle on or off a variety of formatting for indicating all the categories of information they recorded: people, places, meals, meetings, reading, writing, topics mentioned, and events. The general website is clear, cleanly minimalistic, and intuitive, with a calendar for each year one is looking at, and anything that can be a link has been turned into one. But one of the great strengths of the website is the amount of work they have put into the marking of all those people’s names. Because they have done that, we can pull out sortable data tables of information about the people: birth date, death date, gender, occupation, how many times they are mentioned in all of the diary volumes, and whether this was when they were acting as a venue (Godwin visits them) or were listed by Godwin as ‘not-at-home’.

For each person we produce a prosopographical page listing biographical details, editorial notes, a bibliography of works, and a generated graph showing when and how much they are mentioned in the diary. Of course, each of these references links back to that diary entry for a very circular navigation through the resource.

Extracting information from the diary was the reason the project team put so much effort into adding this encoding to the XML files. This means that we’re able to extract this information for any of the categories that they marked and each of the sub-types within that. In this case one of the sub-types of events was ‘theatre’, used to note when he went to the theatre and, if known, which theatre he was going to. With this data available in the eXist XML database that powers the resource, it is then easy to pull out all of the trips to the theatre, which theatre it was, and show the event, usually containing the title of the play he went to see. The website does this for every single category and sub-type of information they marked, so researchers can indeed compare how many times he ate supper with someone at his house compared to how many times he ate supper at their house. (If they really want!)

EEBO-TCP

Another benefit of the documentation of local encoding practice is for the legacy data migration of document instances in the future. Even the conversion of closely related documents such as those from the Early English Books Online – Text Creation Partnership into pure TEI P5 XML can be an onerous task. We recently converted the more than 40,000 texts of the EEBO-TCP corpus to TEI P5 XML. As the first phase of these will enter the public domain in 2015, we’re testing and improving the conversions we have for them to do fun things like create ePubs, so we can read these early printed books on our iPads and phones.

The EEBO-TCP markup was based on TEI P3 but then evolved separately when it encountered problems the TEI hadn’t yet dealt with. However, it did not document these in a TEI extension or customisation file. In converting the texts to TEI P5 we used the TEI ODD customisation language to understand and record the variations between EEBO-TCP markup and the more modern TEI P5. One proven approach to comparing texts is to define their formats in an objective meta-schema language such as TEI ODD: in doing so, the precise variation between the categories of markup used is exposed and, more importantly, provided in a machine-processable form. As part of the process of converting these to TEI P5, one of the things we looked at was the markup before and after conversion, and thus the frequency of certain elements. The resulting markup has almost 40 million instances of highlighting, but that is because highlighting is one of the basic things captured by the TCP project.

Most of the elements that are highest in frequency are structural in nature. Remember how the Stationers’ Register project limited its schema to a tiny 34 elements? In all of EEBO-TCP there are only 78 distinct elements used in the entire corpus. This reflects the nature of the TCP encoding guidelines, which capture basic structural and rendering markup. There are very few interoperability problems between EEBO-TCP texts, as their markup is fairly consistent and basic. But what is interesting about these newly converted EEBO-TCP files is that, now that we are able to convert them, they are becoming the source for further research. Projects can take our TEI P5 XML files and add more markup to them to document the aspects of the texts that they are interested in.

Three EEBO-TCP Projects

Very briefly I’d like to mention three projects which have benefited from these conversions of EEBO-TCP materials, each of which I could go into more detail about at another time.

This project (Verse Miscellanies Online) recently went online at the Bodleian: we took the converted EEBO-TCP texts, and researchers from another university edited them and provided information about genre, rhyme scheme, and editorial notes for each of the poems. They also glossed any unfamiliar words and provided pop-up regularisations for others. From these enhanced texts we built them a website to use for teaching and reading of the 8 verse miscellanies they encoded during the project.

Similarly, in this project (Poetic Forms Online) researchers again took the TEI P5 converted versions of the EEBO-TCP texts that we supplied them and provided highly detailed metrical analysis, counted syllables, and marked the type and location of all rhyme words as well as a regularisation of their rhyme sounds. From these enhanced texts we built them a faceted searchable website with all of these categories, which they plan to expand by adding more texts as time goes on.

The Holinshed project was slightly different, being one of the earlier conversions of EEBO-TCP material that we did. In this case there are two editions of a very large text, Holinshed’s Chronicles of England, Scotland, and Ireland, one published in 1577 and the other in 1587. The academics in question were writing a secondary guide to this huge work and wanted a way of following where paragraphs in one edition had been fragmented and moved around in the creation of the second edition. Sometimes whole sections had been moved; sometimes parts of paragraphs had been moved around and mixed with others. In this case we converted the texts to TEI P5 and then designed a fuzzy string comparison system to find the most probable matches and record their paragraph ID numbers. We then built a website where the researchers could confirm that these were indeed the correct matches, before using the resulting links between the two editions to generate a website where, when reading the text, a user could jump to the same paragraph in the other edition and see how the social changes during Queen Elizabeth’s reign had affected the topics, especially religious topics, in the chronicle.

All of these projects have benefited from our ongoing work to improve the transformations of EEBO-TCP to TEI P5, which itself is dependent on the TEI ODD customisation language.

The Unmediated Interoperability Fantasy

One of the misconceptions about the TEI, and indeed any sufficiently complex data format, is that once one uses this format that interoperability problems simply vanish. This is usually not the case. Following the recommendations of the TEI Guidelines does, without question, aid the process of interchange especially when there is a fully documented TEI ODD customisation file. However, interchange is not and should not be confused with true interoperability.

I would argue that being able to seamlessly integrate highly complex and changing digital structures from a variety of heterogeneous sources through interoperable methods without either significant conditions or intermediary agents is a deluded fantasy. In particular, this is not and should not be the goal of the TEI. And yet, when this is not provided as an off-the-shelf solution some blame the format rather than their own use of it. The TEI instead provides the framework for the documentation and simplification of the process of the interchange of texts. This is a good thing and is a much better goal for the TEI. If digital resources do seamlessly and unproblematically interoperate with no careful or considered effort then:

  • the initial data structures are trivial, limited or of only structural granularity,
  • the method of interoperation or combined processing is superficial,
  • there has been a loss of intellectual content, or
  • the results gained by the interoperation are not significant

It should be emphasised that this is not a terrible thing, nor a failing of digital humanities nor any particular data format, but instead this truly is an opportunity. The necessary mediation, investigation, transformation, exploration, analysis, and systems design is the interesting and important heart of digital humanities.

Open Data

While proper customisation of the TEI and open standards generally are a good start, what still isn’t happening as much as it should is the release of the underlying data openly. All projects, especially publicly funded projects, need to release their data openly, but they also need centralised institutional support to enable them to do so. If other people can’t see your data, then they can’t re-use it or test it, and there is little benefit to the world in having made it.

I don’t know the situation here in Japan, but in the UK and the USA it is certainly the case that funding bodies are increasingly requiring data to be open.

I leave you with the final thought that the “coolest thing to be done with your data will be thought of by someone else”.
