The Text Encoding Initiative (TEI) has developed over 20 years into a key technology in text-centric humanities disciplines. It has been able to achieve its range of use by adopting a descriptive rather than prescriptive approach and by eschewing any attempt to dictate how the digital texts should be rendered. However, this flexibility has come at the cost of rather limited interoperability and a virtual absence of tools that can publish TEI documents out-of-the-box in a sensible way. While TEI’s power, scope and flexibility are essential for many research projects, there is a distinct set of more conventional uses, especially in the area of digitized ‘European’-style books, that would benefit from a prescriptive recipe for digital text. Such a recipe would come with a ‘cradle to grave’ processing model that associates the schema with explicit and standardized options for displaying texts. These are the premises of the Mellon-funded TEI Simple project. TEI Simple both restricts the TEI tagset to a limited subset of elements and aims to provide default prescriptions for processing TEI Simple documents. It will also provide a way to customize and extend these default prescriptions, and an implementation of a processor that will generate transformations from users’ customizations.
TEI Simple started officially in September, when the project PIs (Sebastian Rahtz, Brian Pytlik-Zillig and Martin Mueller) gathered in Oxford, bringing me and James Cummings along for a week of intensive work. We started with an analysis of existing large corpora, mostly consisting of ‘books by dead white men’, like the Oxford Text Archive or EEBO-TCP. After some lively discussions and cross-searching the reference corpora for evidence of actual usage of TEI elements, we were able to cut the list down to about a hundred elements that can still carry all the information that was present in the original TEI P5 encoding. While we discussed, I was writing the migration tool* that translates elements not available in the Simple schema into their Simple equivalents wherever possible.
The next step was to decide how to tackle the other TEI Simple goal: the processing model. In other words, we needed to supply directions for how TEI Simple documents are supposed to be processed into a range of output formats like HTML, PDF etc. An additional goal was not to confine users to our ideas about the required outputs, but to allow them to override the default Simple rendering to achieve different results.
As the question “what is it going to look like?” always weighs heavy on the minds of editors, this is probably the greatest challenge of the TEI Simple project. Specifying how to deal with all TEI elements (in all the specific contexts they might occur in across a collection of documents) can be a nightmare even in one’s native language; doing this in computer-speak is definitely not everyone’s cup of tea. Yet editors have to make these decisions at some point and either write their own programs or communicate them clearly to their tech-savvy collaborators. Both options require the editors to state explicitly what is expected to happen. Assuming that editors already know and understand TEI/XML, it would be a relatively small leap of faith to hope that they both can and would add small bits of XML to their schema that describe in a formal way the rules for the intended processing. Obviously ‘relatively small bits of XML’ cannot be expected to carry the very same power of expression as a full-fledged programming language, yet I hope that at least for some this can be powerful enough that the benefits will justify the trade-offs.
Pondering this, we parted in September without a very clear idea of how to achieve our goals, but with a strong resolution to try, even if we were to fail. I spent a few days after that building a proof-of-concept implementation of a processing model, starting with the agreed-upon assumption that we need to include the intended processing instructions in the ODD. My previous experience of collaborating with editors had already taught me that in order to create a program that generates the required output I need a set of rules that tell me how to render particular elements or combinations in the source, and most probably conditions to determine which rule to apply. In communication with editors the ‘how to render’ part is usually given in quite simple language: ‘this is a note’, ‘only this part should be present in the output’, ‘turn it into a tooltip box’, etc. I started getting hopeful that simply specifying what to do, when to do it and how to decorate the output could in fact get us quite a long way. I was rather impressed when I was indeed able to reproduce with my implementation the look of the Oxford Text Archive HTML view for a simple novel, play or essay just by adding little more than 10 rules to the ODD and linking to a CSS file.
We presented our preliminary results at the TEI Conference in Chicago in October and solicited feedback from the TEI community there. Many shared their very reasonable doubts, some of which we kept raising ourselves: are we reinventing XSLT? Is it too hard and technical? Is it powerful enough? How is it going to deal with some tricky (yet real) problems? The overall attitude, though, was positive and assured us again that this is something very much worth pursuing, even if the road ahead of us is still not terribly clear.
The New Year started with another week-long meeting in Oxford, where we concentrated on the processing model. We discussed which elements and attributes we need to arrive at a solution that will be simple, flexible and explicit. After much agonizing over naming conventions and banging our heads on the wall, we came up with both a rationale for our decisions and a model for additional ODD elements. The important principle** is that the ODD should be as explicit as possible and provide maximum expressivity to the editor. We have agreed to add a few elements to the TEI ODD language, the most important one being <model>, which makes it possible to document the intended processing rule through its attributes: behaviour (which defines what is going to happen), predicate (which defines the conditions under which the rule applies) and output (which tells which processing scenario the rule belongs to). <model>, together with its <rendition> child (which defines the desired formatting), is therefore at the heart of the processing model, but an important part of the TEI Simple magic lies in the behaviour attribute. Behaviour specifies which function from a specialized TEI Simple library to apply; currently we have about 20 functions, named note, paragraph, anchor, block, figure etc. Thus prepared, we were finally able to produce a preliminary version of the ODD enhanced with our new models and renditions before we split up again.
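To make the roles of behaviour, predicate, output and rendition more concrete, here is a minimal Python sketch of how such rules might be matched and applied. It is not the TEI Simple implementation: the Model class, the BEHAVIOURS table, the render function and the sample rules are all hypothetical, and the predicate is written as a plain Python function where the real ODD would more plausibly carry an expression over the source XML.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    """Hypothetical in-memory stand-in for one <model> rule."""
    element: str                             # element the rule is attached to
    behaviour: str                           # what is going to happen
    predicate: Callable[[ET.Element], bool]  # when the rule applies
    output: str                              # processing scenario, e.g. "web"
    rendition: str = ""                      # desired formatting (CSS here)

# Two illustrative behaviour functions; the names follow the post,
# the bodies are assumptions made for this sketch.
def paragraph(el: ET.Element, rendition: str) -> str:
    return f'<p style="{rendition}">{el.text or ""}</p>'

def note(el: ET.Element, rendition: str) -> str:
    return f'<span class="note" style="{rendition}">{el.text or ""}</span>'

BEHAVIOURS = {"paragraph": paragraph, "note": note}

# Sample rules, invented for illustration; the first matching rule wins.
MODELS = [
    Model("note", "note", lambda el: el.get("place") == "foot",
          "web", "font-size: small"),
    Model("note", "paragraph", lambda el: True, "web"),
    Model("p", "paragraph", lambda el: True, "web"),
]

def render(el: ET.Element, output: str) -> str:
    """Apply the first rule matching the element, scenario and predicate."""
    for m in MODELS:
        if m.element == el.tag and m.output == output and m.predicate(el):
            return BEHAVIOURS[m.behaviour](el, m.rendition)
    return el.text or ""  # no rule matched: fall back to the bare text

footnote = ET.fromstring('<note place="foot">See ch. 2</note>')
print(render(footnote, "web"))
```

In this sketch a footnote is handled by the note behaviour with its small-font rendition, any other note falls through to the generic paragraph rule, and an element with no rule for the requested scenario is reduced to its text. Overriding the defaults would then amount to putting a more specific rule ahead of the existing ones.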
Meanwhile I was constantly adapting the implementation of the processing model and eventually caught up with most of the new inventions of that week. The first results can already be seen (like everything else) on GitHub. The next stage will involve testing the prototype against our reference corpora, refining our default processing instructions in the Simple ODD and seeking feedback from the TEI community.
It is definitely an exciting project to be involved in, even more so because the possibility of failure is looming over us all the time. Even if we build something and it works for the documents that are in the scope of the project, will the users come? I am looking forward to seeing what happens with TEI Simple in the coming months and years.
* All source code, element usage statistics and other documentation are available on GitHub.
** Informally named the ‘Turska Tenet’, by the way. We also came up with the Rahtz Rationale, the Pytlik-Zillig Proposal and Mueller’s Method, so you can see where this is going.