Self Study (part 2): Introduction to the Text Encoding Initiative Guidelines

Quite a while ago I posted http://blogs.it.ox.ac.uk/jamesc/2012/03/15/self-study-introducing-xml-and-markup/ as a list of reading and steps I would recommend to someone wanting to learn TEI XML and related technologies. That first step was to learn a little about XML and markup languages like HTML to get some background.

The next step I’d recommend is to learn a bit more about the Text Encoding Initiative and the Guidelines it produces.

Questions:

  1. In what markup language were documents using TEI P1 to TEI P3 written?
  2. How was this changed for TEI P4 and then TEI P5?
  3. In what way is the TEI ‘extensible’?

Questions:

  1. What does ‘ODD’ stand for? What can one generate from a TEI ODD file?
  2. What is a TEI module? What is the relationship between modules and chapters?
  3. What language does one use to define a TEI schema?
  4. Why might a single project use more than one schema at different stages in their project workflow?
  5. What is an attribute class? The att.global attribute class provides @xml:id and @n attributes to every element in the TEI; what is the difference between these two attributes (see the small fragment after this list)? When might it be useful to use @n to number verse lines? When might this be a silly waste of time?
  6. What is the @xml:lang attribute for?
  7. What is the difference between the @rend, @style, and @rendition attributes?
  8. What is @xml:space for?
  9. What is a TEI model class, and what do members of the same class share?
  10. Why are model and attribute classes a good idea?
  11. What is a TEI datatype?
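
To make the attribute questions above more concrete, here is a small invented verse fragment (my own example, not from any project) showing @xml:id and @n together:

<!-- @xml:id must be an XML name and unique in the document;
     @n is just a label and need not be unique -->
<lg xml:id="poem1-stanza2" n="2">
  <l n="1">A first line of verse,</l>
  <l n="2">and a second line.</l>
</lg>

A processor can link to #poem1-stanza2 from anywhere, whereas the line numbers in @n only make sense relative to their stanza.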

Note: If you are confused about modules vs model classes vs attribute class the following blog post might help: http://blogs.it.ox.ac.uk/jamesc/2008/09/01/modules-vs-model-classes-vs-attribute-classes/

  • Next, familiarise yourself with the table of contents of the TEI Guidelines: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index-toc.html
  • And then browse http://www.tei-c.org/release/doc/tei-p5-doc/en/html/REF-ELEMENTS.html which contains a complete list of elements provided by the TEI.
  • Choose a couple of elements which you think you can guess the use of, and click on them to explore their reference pages. For example http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-address.html
  • The information on this page may seem confusing at first, but it lists:
    • the element’s definition; which ‘module’ it comes from
    • what attributes it has (and if they come from attribute classes)
    • what model classes the element might claim membership of (which controls where it is allowed to appear in your document)
    • a list (by module) of the elements which are allowed to contain this element
    • a list (by module) of which elements this element is allowed to contain
    • a declaration of the content model of the element (which can be toggled between Relax NG compact syntax and XML syntax)
    • one or more examples
    • possibly some additional notes on usage.

Questions:

  • The address element does not define any attributes of its own. How does this compare in layout to the availability element? What attribute does availability (at the time of writing) define for itself rather than getting it from a class?
  • The address element has two examples; what is the difference between them?
  • If you click on the ‘Show all’ link in one of the examples, what do you get? Notice, for example, how address is used inside the publicationStmt element to give the address of the publisher of the electronic text.

This is a very basic survey of some of the initial things you might want to learn before diving into the Guidelines in more detail. I plan to continue this series with similar directed reading and questions on some of the topics they cover. In fact, the next post in this series is http://blogs.it.ox.ac.uk/jamesc/2013/01/31/self-study-part-3-the-tei-default-text-structure/ which looks at the TEI’s Default Text Structure.

Posted in SelfStudy, TEI, XML | 1 Comment

Tokenizing and grouping rhyme schemes with XSLT functions

There is a project I work for which has encoded rhyme schemes in TEI using the @rhyme attribute on <lg> elements. This contains some complex strings, as they have used parentheses to indicate an internal rhyme and asterisks to indicate that a particular rhyme is a feminine (multi-syllable) rhyme. Individual rhymes are also marked in the text with <rhyme> elements. So, for example, you get values that look like:

rhyme="(a*)a*(a*)b(c*)c*(c*)bddee(f)fg(h)hg/"

But at any particular point I need to be able to get the following things from this string:

  1. The documented rhyme above for the current <rhyme> element that I’m processing
  2. Whether the current rhyme is an internal (parentheses) or a feminine (asterisk) rhyme or not.
  3. The set of rhymes for the current line
  4. Whether the current line has any internal (parentheses) or feminine (asterisk) rhymes or not.

So the first step is to tokenize the given rhyme scheme. I do this as an XSLT function, and if I want to output the result I could have something like:

 <xsl:variable name="rhyme">
(a*)a*(a*)b(c*)c*(c*)bddee(f)fg(h)hg/
</xsl:variable>
<tokenized-rhymes>
  <xsl:copy-of select="jc:tokenizeRhymes($rhyme)"/>
</tokenized-rhymes>

Here, inside some unseen template, I’ve got a variable with the rhyme scheme in it, and I’m getting a copy of the output of a function I’ve created called jc:tokenizeRhymes(). This isn’t a very difficult XSLT function; it just consists of an xsl:analyze-string, like so:

<xsl:function name="jc:tokenizeRhymes" as="item()*">
<xsl:param name="rhyme"/>
<xsl:variable name="rhymes">
<list>
    <xsl:analyze-string select="$rhyme" regex="\(*[a-zA-Z]\**\)*">
        <xsl:matching-substring>
            <item>
                <xsl:value-of select="."/>
            </item>
        </xsl:matching-substring>
        <xsl:non-matching-substring/>
    </xsl:analyze-string>
</list>
</xsl:variable>
<xsl:copy-of select="$rhymes"/>
</xsl:function>

All this does is define a function which takes a single parameter ($rhyme) and creates a variable containing a list with a bunch of items inside. To do this it uses a regular expression in xsl:analyze-string which looks for an optional opening parenthesis \(*, then any letter from a-zA-Z, optionally an asterisk \**, followed by an optional closing parenthesis \)* … see, simple. The output from this looks like:


  <list>
         <item>(a*)</item>
         <item>a*</item>
         <item>(a*)</item>
         <item>b</item>
         <item>(c*)</item>
         <item>c*</item>
         <item>(c*)</item>
         <item>b</item>
         <item>d</item>
         <item>d</item>
         <item>e</item>
         <item>e</item>
         <item>(f)</item>
         <item>f</item>
         <item>g</item>
         <item>(h)</item>
         <item>h</item>
         <item>g</item>
      </list>

Getting the current rhyme when I’m processing a rhyme is fairly easy, then. I just create a variable $rhymePosition (the number of the rhyme I’m currently on) and then call another function, jc:getCurrentRhyme, with that and the rhyme variable.

<xsl:variable name="currentRhyme">
  <xsl:value-of select="jc:getCurrentRhyme($rhyme, $rhymePosition)"/>
</xsl:variable>

The jc:getCurrentRhyme function is fairly straightforward as well. It looks like:

<xsl:function name="jc:getCurrentRhyme" as="item()*">
   <xsl:param name="rhyme"/>
   <xsl:param name="currentRhyme" as="xs:integer"/>
   <xsl:variable name="rhymes" select="jc:tokenizeRhymes($rhyme)"/>
   <xsl:copy-of select="$rhymes/list/item[$currentRhyme]"/>
</xsl:function>

It takes two parameters, $rhyme and $currentRhyme (an integer giving how many rhymes there have been so far in the <lg>, including the one we are processing). It then creates a new variable, $rhymes, which holds the output of the jc:tokenizeRhymes function above. Getting the current rhyme from the list is then easy because we know its number: we just make a copy of the <item> we’ve created in that variable by using xsl:copy-of and filtering it by the number $currentRhyme. (This is why we made sure that this parameter was an integer.)

Checking whether these are internal or feminine rhymes is now very straightforward: we just test the $currentRhyme we’ve created above with contains($currentRhyme, ')') or contains($currentRhyme, '*').
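
For example (a minimal sketch using the $currentRhyme variable created above):

<!-- true if the current rhyme was marked as internal or feminine -->
<xsl:variable name="isInternal" select="contains($currentRhyme, ')')"/>
<xsl:variable name="isFeminine" select="contains($currentRhyme, '*')"/>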

In order to get all the rhymes for a line, we need to re-process this tokenized list somewhat. We want to group those items which have parentheses together with the letter which follows them, splitting on each non-parenthesised letter (optionally followed by an asterisk). It took me a while to get my brain around that, but eventually I came up with:

<xsl:function name="jc:groupRhymes" as="item()*">
<xsl:param name="rhyme"/>
<xsl:variable name="rhymes" select="jc:tokenizeRhymes($rhyme)"/>
<xsl:variable name="groupedRhymes">
  <list>
   <xsl:for-each-group select="$rhymes/list/item"
      group-ending-with="*[matches(., '^[a-zA-Z]\**$')]">
     <item>
      <list>
       <xsl:for-each select="current-group()">
        <item>
         <xsl:value-of select="."/>
        </item>
       </xsl:for-each>
      </list>
     </item>
    </xsl:for-each-group>
  </list>
</xsl:variable>
<xsl:copy-of select="$groupedRhymes"/>
</xsl:function>

This function takes the parameter $rhyme and tokenizes it using the earlier function, so now we have a list with some individual items in it. It then creates a new list and uses xsl:for-each-group to select all the tokenized items. It creates groups ending with any item whose content matches, from start to finish, a letter followed by an optional asterisk. This means each group will end with a normal rhyme letter, and any internal rhymes (in parentheses) will be included in that group. For each group it puts out a new item with a nested list, and makes each rhyme in that line an item in that nested list. This might seem overkill to some, but having the extra nesting, regardless of whether there are 1, 2, or 20 rhymes in the line, just makes things easier. The output from this looks like:

<list>
<item>
    <list>
        <item>(a*)</item>
        <item>a*</item>
    </list>
</item>
<item>
    <list>
        <item>(a*)</item>
        <item>b</item>
    </list>
</item>
<item>
    <list>
        <item>(c*)</item>
        <item>c*</item>
    </list>
</item>
<item>
    <list>
        <item>(c*)</item>
        <item>b</item>
    </list>
</item>
<item>
    <list>
        <item>d</item>
    </list>
</item>
<item>
    <list>
        <item>d</item>
    </list>
</item>
<item>
    <list>
        <item>e</item>
    </list>
</item>
<item>
    <list>
        <item>e</item>
    </list>
</item>
<item>
    <list>
        <item>(f)</item>
        <item>f</item>
    </list>
</item>
<item>
    <list>
        <item>g</item>
    </list>
</item>
<item>
    <list>
        <item>(h)</item>
        <item>h</item>
    </list>
</item>
<item>
    <list>
        <item>g</item>
    </list>
</item>
</list>

Which, admittedly, is fairly verbose. But you can now have a function that just gets the items for the individual line you are interested in, which would look something like:

<xsl:function name="jc:getCurrentLineRhymes" as="item()*">
  <xsl:param name="rhyme"/>
  <xsl:param name="currentLine" as="xs:integer"/>
  <xsl:variable name="rhymes" select="jc:groupRhymes($rhyme)"/>
  <xsl:copy-of select="$rhymes/list/item[$currentLine]"/></xsl:function>

Which when called with something like:

 <xsl:copy-of select="jc:getCurrentLineRhymes($rhyme, 4)"/>

(where ‘4’ would usually be a variable containing the current line number) it will produce something like:

<item>
 <list>
  <item>(c*)</item>
  <item>b</item>
 </list>
</item>

A simple string test using contains() can again tell you whether there are any feminine (asterisk) rhymes or internal (parentheses) rhymes, etc.

Hurrah! See, that wasn’t so difficult after all. This makes a good example of using XSLT 2.0 functions to call other functions, breaking the overall task down into manageable, more object-oriented tasks which can be re-used for a variety of purposes. (There are a lot of efficiencies which could be implemented here… jc:getCurrentLineRhymes and jc:getCurrentRhyme are almost identical, except that one uses jc:groupRhymes() and the other uses jc:tokenizeRhymes(). These could be one function which tests a parameter to see which is intended.)
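
That combined function might look something like this (a sketch only; the function name and the boolean flag are my own invention):

<xsl:function name="jc:getRhymeItem" as="item()*">
  <xsl:param name="rhyme"/>
  <xsl:param name="position" as="xs:integer"/>
  <xsl:param name="grouped" as="xs:boolean"/>
  <!-- choose the grouped-by-line items or the flat tokenized items -->
  <xsl:variable name="rhymes"
    select="if ($grouped) then jc:groupRhymes($rhyme)
            else jc:tokenizeRhymes($rhyme)"/>
  <xsl:copy-of select="$rhymes/list/item[$position]"/>
</xsl:function>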

The whole XSLT stylesheet is available from https://github.com/jamescummings/conluvies/blob/master/xslt-misc/tokenize-rhyme-test.xsl.

Posted in TEI, XML, XSLT | Leave a comment

Teaching the TEI-Panel

As part of the Text Encoding Initiative Consortium’s annual conference I participated in a panel organised by Elena Pierazzo called “Teaching the TEI: from training to academic curricula” see http://idhmc.tamu.edu/teiconference/program/papers/#teach for the abstract. Florence Clavaud and Susan Schreibman were unable to attend and so at the very last moment Julia Flanders from Brown University graciously agreed to join the panel. The panel consisted of: Elena Pierazzo, Marjorie Burghart, James Cummings, and Julia Flanders.

Elena Pierazzo started off the panel by introducing what it would cover. It looked at the differences and similarities in teaching the TEI in a range of contexts: from a dedicated intensive workshop targeted at professionals to the teaching of TEI as part of a related academic course. These have differences in aims, methodologies, and overall coverage, and the syllabus of each of these types of teaching might cover different chapters of the TEI Guidelines.

The panel discussed which of these approaches seemed most successful, and what was meant by success when teaching the TEI. Does the TEI work better as a tool to solve a problem researchers are currently facing (e.g. a digital edition of a manuscript, a dictionary, a corpus…), or as a method for approaching analysis, or a tool for modelling concepts? Throughout the panel these two types of teaching were contrasted to see what might be learned to benefit the other pedagogical form.

Marjorie Burghart compared the BA and MA level training provided in Lyon with Elena’s examples. She insisted on the importance of embedding TEI teaching in other disciplines, giving the example of one of her courses where students are taught editorial techniques as a whole, from the historical developments of diplomatics and philology to their digital translation. The central message was that not all TEI teaching is done in “TEI courses”, or even courses specifically about digital technology; some of it occurs in field-related academic courses that happen to include sections about the TEI.

James Cummings briefly surveyed the types of TEI training provided at the University of Oxford, noting that the Digital.Humanities@Oxford Summer School (http://digital.humanities.ox.ac.uk/dhoxss/) evolved out of many years of TEI Summer Schools and now always includes a week-long TEI workshop. The introductory TEI workshop in such a context tends to cover a large amount of the TEI Guidelines, giving a broad but shallow and intensive overview. He mentioned that they also do bespoke training for individual research projects, where the whole of the TEI is not taught, just a brief overview followed by specific training in the aspects that the project will be using.

Julia Flanders provided a description of the workshops they teach at Brown University and the workshops provided at DHSI.org, and how these compare to those in Oxford and differ from those that form part of larger academic courses. She discussed various approaches to teaching the underlying concepts and how existing tools such as Roma might be improved to facilitate this. She suggested that introductory tools which allowed ‘Finger painting with semantics’ should be created to allow people to play with the concepts of data modelling in a user friendly manner.

There was much wide-ranging discussion with the audience, with many interesting points made and questions raised. Several participants mentioned that they used text encoding generally, and the TEI specifically, to teach different things. That is, the process of learning the TEI helps students to understand more about other topics (e.g. the nature of text). Michael Gavin commented that it would be nice to have a survey of both TEI courses and courses which include the TEI in higher education. Marjorie mentioned that Florence Clavaud (EnC) had started a similar survey for France / Europe, and that it would be good to get in touch with her.

TEI is taught in a variety of different ways, and the more teaching of it the better, but what has to be closely examined by providers is why any particular course is being offered. Is it to induct a large number of people into a basic understanding of the scope and coverage of the TEI Guidelines? Is it to teach them the practical skills to undertake the work on one particular research project? Or is it to teach them other more ethereal concepts, of which the TEI is one practical and concrete example? The teachers on this panel had all taught in a variety of these kinds of contexts, and the differences in approach and coverage made for an interesting comparison. As the TEI continues to grow and be used more pervasively as the de facto standard for the encoding of digital texts (especially in academic contexts), the community will need to continue to improve its teaching and the organization of that teaching. One promising sign is the network of Digital Humanities training institutes (see those referenced at http://digital.humanities.ox.ac.uk/dhoxss/) which are slowly cooperating to produce a consistent pedagogical basis while retaining their unique character and experiences.

Posted in Conference, TEI | Leave a comment

More about @rend

Lou Burnard has provided a technical summary of some of the issues recently discussed concerning @rend, but I thought I might provide some more explanation for those less familiar with the technical background to the discussion. I would have done so sooner, but I was on holiday driving around narrow farm roads in Cornwall without much reception on my phone. What follows are my own opinions and interpretations of the TEI Guidelines, which are continually evolving based on community consensus.

The @rend attribute

The TEI provides a @rend attribute which indicates an interpretation of how the element in question was rendered or presented in the source text. It has nothing to say about what should be done with the element in any particular output from processing or displaying the TEI text. The assumption that many people make is that processing TEI means outputting HTML designed to help you read the text, but this is certainly not necessarily the case. The TEI text might have any number of outputs, just for reading it might be HTML, ePub, PDF, DOCX, and many more, moreover those encoding the texts might not be intending to read it but process it for other forms of text analysis in any number of formats. While individual projects can provide project documentation on how they intend certain elements to be presented in particular forms of output, other people processing those texts could choose to do something completely different.

@type and @rend values and their whitespace

During a TEI-L discussion concerning why the @type attribute does not allow spaces, it was explained that this is because the @type attribute does not contain free text, but a special token that categorises the element in some way. Moreover, the recommended practice is for projects to customise the TEI to constrain the choices available for the value of the @type attribute on some elements, and to document in their customisation exactly what those special tokens mean. @type attribute values have the datatype data.enumerated, which means that they are “expressed as a single XML name taken from a list of documented possibilities”. That means the value has to obey the rules of what it means to be an XML name, and it should come from a set list that the project has documented (preferably in its TEI customisation, but possibly just in prose documentation preserved with the TEI file). Most elements that have a @type attribute get it by claiming membership in the att.typed attribute class, and if a secondary type classification is allowed they also get @subtype.
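
Such a constraint might look something like this in a TEI ODD customisation (a hedged sketch; the element and the values are invented for illustration):

<!-- constrain @type on <div> to a documented, closed list of tokens -->
<elementSpec ident="div" mode="change">
  <attList>
    <attDef ident="type" mode="change">
      <valList type="closed" mode="replace">
        <valItem ident="chapter">
          <desc>a chapter-level division</desc>
        </valItem>
        <valItem ident="section">
          <desc>a section within a chapter</desc>
        </valItem>
      </valList>
    </attDef>
  </attList>
</elementSpec>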

The discussion moved on (possibly because I referenced my earlier post on @rend) to how this differs for the @rend attribute, and to using CSS inside it. With @rend, though, the situation is slightly more confusing. It allows from one to an infinite number of occurrences of the datatype data.word. A data.word datatype “defines the range of attribute values expressed as a single word or token.” As I’ve discussed elsewhere, this means that if someone marks up a text using:

<hi rend="It looks a bit like that other one">text</hi>

this actually contains 8 tokens: “It”, “looks”, “a”, “bit”, “like”, “that”, “other”, “one”. The point is that the whitespace between these words in the attribute makes each of them a separate value or token, not a phrase. The encoder might just as well have written:

<hi rend="big bold beautiful">text</hi>

or indeed

<hi rend="largeStyle42">text</hi>

The data.word datatype says that “Attributes using this datatype must contain a single ‘word’ which contains only letters, digits, punctuation characters, or symbols: thus it cannot include whitespace.”

Some encoders believe that the TEI should reverse its decision on free text in attributes and allow @rend to contain “It looks like that other one” and this not to be a set of discrete tokens. Personally, I disagree and feel that would be a retrograde step.

@rend values and their order

Other than defining it as a set of data.word occurrences, the TEI does not dictate what @rend values should look like. In my opinion it would be wrong for the TEI to try to codify all the possible rendition values that appear in every sort of text. Moreover, describing the way something appears in a text is always an interpretative process, and two separate encoders looking at the same text, or looking at it for different reasons, might perceive it in very different ways. In fact the Guidelines explicitly say:

“These Guidelines make no binding recommendations for the values of the rend attribute; the characteristics of visual presentation vary too much from text to text and the decision to record or ignore individual characteristics varies too much from project to project.” (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.global.html)

Some encoders believe it is a shame that the TEI has not defined a syntax for @rend attribute values. I disagree, because I feel that the greatest flexibility should be given to projects and sub-communities to customise and constrain such values for themselves. It could be argued that the TEI has indeed provided a syntax, but in a very general way: these are whitespace-separated tokens containing only letters, digits, punctuation characters, or symbols. The point is that these are entirely intended as magic tokens whose meaning individual projects can decide (and document) for their own use. If I put in the magic token ‘bold’ it might mean something different in my project than it means in yours.

It came out in the TEI-L discussion that some encoders believe the order of the @rend values provided should be significant, as if they are making a phrase. Others tend to put the most important rendition classification first, and still others always provide different types of classification in the same order. I find all of these prone to human inconsistency, and so I choose to treat them as an unordered set of values that could be entered in any order; i.e. that:

<hi rend="big bold beautiful">text</hi>

should be understood to be semantically equivalent to:

<hi rend="beautiful big bold">text</hi>

My beliefs here are, perhaps unduly, influenced by long and painful experience in processing hand-encoded texts (which also influences my beliefs about the value of automatic and semi-automatic up-conversion of markup). In my encoding projects I recommend that no special significance be granted to the order of the tokens present in the @rend value. The TEI, I think sensibly, allows individual projects to do what they want, but does specify that these are individual tokens.
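
In processing terms, treating the value as an unordered set is straightforward in XSLT 2.0 (a minimal sketch; the token ‘bold’ is just an example):

<!-- the general comparison is true if any one token equals 'bold',
     whatever the order of tokens in @rend -->
<xsl:if test="tokenize(@rend, '\s+') = 'bold'">
  <!-- handle the bold rendition -->
</xsl:if>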

Some projects decide to put various standard presentation-description formats, e.g. Cascading Style Sheets, into the @rend attribute. I personally feel that this is misguided and sloppy. Partly this is because I suspect that some of them are actually encoding for a particular output format (rather than documenting what the original source looked like), and this is the wrong place to store that information. Partly it is because such presentation-description formats often use significant whitespace (which then means an abuse of the data.word datatype). And partly it is because I feel there is a better and easier way to do this more consistently, using the @rendition attribute.

@rendition and <rendition> really aren’t extreme

As with many other things in the TEI, the Guidelines provide a simple use-case (@rend’s magic tokens) and a more complex system (@rendition). The @rendition attribute allows you to point to a <rendition> element up in the header, where you can use any form of free text to describe how this was rendered in the original source. This means that instead of putting in a set of magic tokens or classifications like “largeStyle42”, an encoder can completely transparently point to a fuller description using the standard URI fragment pointing mechanism that is common throughout the TEI recommendations. Thus instead of writing:

<hi rend="largeStyle42">text</hi>

and having it documented somewhere what this means, the encoder can point to a <rendition> element by its @xml:id attribute and have a fuller description there. For example this could be:

<hi rendition="#largeStyle42">text</hi>

and while that doesn’t look much different, the URI fragment ‘#largeStyle42’ points to a place inside the TEI file’s <teiHeader> (specifically inside the <tagsDecl> element) where there is a better description:

<rendition scheme="free" xml:id="largeStyle42">This text is really big, bold, and beautiful</rendition>

Okay, admittedly that might not be a very useful description. But the point of the ‘free’ scheme is that it is free text. It can be any prose, in any language, and any way of describing it. The @scheme attribute also allows ‘css’ for those wishing to use the cascading stylesheet language, ‘xslfo’ for those wanting to use extensible stylesheet language formatting objects, and ‘other’ for those using some other rendition-description language. So ‘#largeStyle42’ could point to something using CSS that looked like:

<rendition scheme="css" xml:id="largeStyle42">
  font-weight: bold;
  font-size: 75pt;
  font-family: "brushstroke", fantasy;
  color: #002147;
</rendition>

If a more precise description (in whatever language) can be provided for ‘largeStyle42’, then this can be changed at a later date. Equally, this could be broken up into multiple <rendition> elements, so you can have:

<rendition scheme="css" xml:id="bold">font-weight:bold;</rendition>
<rendition scheme="css" xml:id="big">font-size:75pt;</rendition>
<rendition scheme="css" xml:id="beautiful">font-family:"brushstroke", fantasy;</rendition>
<rendition scheme="css" xml:id="oxBlue">color:#002147;</rendition>

and in the text:

<hi rendition="#big #oxBlue #bold #beautiful">text</hi>

Moreover, because @rendition is one of the TEI’s many pointing attributes, it does not need to point to a <rendition> element in the very same file! Instead a project could centralise all their rendition information in a single place. That might look like:

<hi rendition="renditionFile.xml#largeStyle42">text</hi>

or indeed

<hi rendition="http://www.example.com/renditionFile.xml#largeStyle42">text</hi>

Some encoders feel that pointing to a <rendition> element is a lot harder than just sticking some tokens into the @rend attribute. Others argue that as part of the process of hand encoding, users should be able to add whatever they want to @rend and have it be valid, because rationalising these values in advance is harder than doing so afterwards; or indeed that it is more convenient to encode unusual variants ‘in-line’ rather than pointing back to the header. Both of these are good points, and have some truth to them. In the first case, it depends on the level of specification needed. Most encoders in my experience use very general and imprecise @rend categorisations. That is, they could use a rend value of ‘big72pt’, but they tend to just use ‘big’ (or small/medium/large/x-large).

How much time and energy one wants to spend worrying about specifying @rend and/or @rendition values depends on how important it is to your project that this information is documented, and documented in a consistent manner. If you just want to record whether something is in one of a handful of different colours, sizes, or styles, then you probably just want to agree a project specification of @rend values (and what they mean) for your TEI customisation.

Other @rend issues

Some encoders complain that there is no formal way of indicating what syntax you have used for your @rend values. I disagree, because I believe these are magic tokens which are most properly documented in the TEI customisation. This enables an encoder to give a free-text description for every magic token used in @rend attribute values, and moreover, if they wish, it enables a project to constrain @rend to just that set of values. If a project is using a specified syntax inside their @rend attribute values (so-called ‘rendition ladders’ are one such format), then this should be documented inside the <encodingDesc>, perhaps in prose; or perhaps the TEI will add a mechanism, in response to the TEI-L discussion, which enables categorisation and description of the taxonomy of @rend attribute values.
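
In a TEI ODD customisation, documenting and constraining those tokens might look something like this (a sketch only; the values and descriptions are invented):

<classSpec ident="att.global" type="atts" mode="change">
  <attList>
    <attDef ident="rend" mode="change">
      <valList type="closed" mode="replace">
        <valItem ident="bold">
          <desc>printed in a heavier typeface in the source</desc>
        </valItem>
        <valItem ident="big">
          <desc>noticeably larger than the surrounding text</desc>
        </valItem>
      </valList>
    </attDef>
  </attList>
</classSpec>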

Changing @rend

My arguments here are based on my own views and understanding of the current (P5 2.0.2) version of the TEI Guidelines. However, these are subject to change (both my views and the Guidelines). I’ve often been told that the TEI recommendations seem like dictates coming down from on high saying “do it this way”, but that is really not how I view the TEI Guidelines or the community that creates them. The TEI is an open-source project which solicits bug reports and feature requests from anyone and everyone. These can come from someone encoding their very first TEI document, reading the Guidelines for the first time, or from those with a long history of experience with the TEI. Each and every bug or feature request should be considered on its own merits by the TEI Technical Council, elected by the TEI community. [Note: there is scope for electoral reform, but that is a very different topic.] The recommendations of the TEI are not a fixed quantity but an evolving record of the concerns and experience of the community that produces them. In many ways, hearing what users new to the TEI have difficulty with, or where they find the Guidelines confusing, is more valuable in the long run than some of the more arcane technical discussions.

Posted in TEI, XML | 1 Comment

Self Study (part 1): Introducing XML and Markup

I’m occasionally asked what people should read and do if they want to teach themselves TEI P5 XML. Where should they start? This depends, obviously, on what time and resources they have. I tend to recommend directed intensive training, such as the Digital.Humanities@Oxford Summer School, as a good way to get an introduction to such topics.

However, some people are unable to participate in such training and prefer self-directed learning. What should they do? There are lots of resources online such as TEI By Example and the TEI Guidelines. Where to start?

When people are taking an Introduction to TEI workshop I usually introduce markup but move on to TEI and XML very quickly, because in such intensive workshops time is limited. When people are undertaking self-directed learning, however, I think they should use the time they have to learn more about HTML and then XML before starting on the TEI vocabulary of XML itself.

There is a great deal of reading one could suggest for an initial exploration of XML and markup; the resources referenced in the assignments below are a good start.

If I were to suggest a series of assignments someone might undertake based on this reading it would be to do the following, writing up answers to the questions.

  1. Read the W3Schools HTML basic section and XHTML section, do the HTML and XHTML quizzes
  2. Read the W3Schools XML basic section and XML Namespaces page, do the XML quiz
  3. Read the TEI Guidelines’ Gentle Introduction to XML, and the Wikipedia article on XML.
  4. How does XML differ from HTML? Why might it be more powerful to describe what some piece of data is, rather than say how it should be presented?
  5. Download and install the oXygen XML editor (you can get a 1-month free trial license; otherwise it costs $64 USD)
  6. Choose a very short (1 page) sample of a document you are interested in.
  7. Create a list of the overall structural aspects you feel define this sort of document. Create a list of any of data-like entries (like names or dates) in the document. Create a list of presentational aspects of the document that you think important to record.
  8. Funding challenge part 1: Hypothetically, imagine you had funding to mark up several thousand pages of this material. Look at the list of aspects you would like to record. Why is each one important? What benefit does recording each of these things give those wanting to use or understand the text (or the culture from which it originates)? Which would you choose to mark up? How consistently can you mark up each feature? Such document analysis should be done long before any project starts (or asks for funding).
  9. Funding challenge part 2: An uncaring government has slashed its funding for higher education research projects and has reduced your project’s funding by 50%! What would you do? Will you mark up only 50% of the material? If so, how do you decide which parts? Will you only mark up certain aspects? If so, which ones and why?
  10. Using the ‘Text’ (code view) mode of the oXygen XML editor, create a well-formed XML file of your sample document with elements and attributes that you have invented yourself (see the small example after this list). What difficulties do you encounter doing this?
  11. Why might it be better for communities of users to agree on elements, what they mean, and how they should be used?
  12. What are the central ideas of Michael Wesch’s youtube video? How do they relate to the nature of XML and how it is used?
  13. Read the Wikipedia article on RSS, and find an RSS feed to subscribe to in Google Reader to see its application.
  14. Does order really matter in an XML document? What is the difference between:

    <list><item n="1">item 1</item><item n="2">item number 2</item></list>  and
    <list><item n="2">item number 2</item><item n="1">item 1</item></list>

    And how much difference does this make when viewing XML as a data storage format rather than a presentational one?

  15. Join the TEI-L mailing list and start lurking.
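
As an illustration of what assignment 10 might produce, here is a tiny fragment in an entirely invented vocabulary (the element names are made up; the point is only that it is well-formed):

<letter>
  <greeting>Dear Sir,</greeting>
  <body>
    <para>Thank you for your <emphasis>prompt</emphasis> reply
      of <date when="1897-03-02">2 March</date>.</para>
  </body>
</letter>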

This certainly isn’t exhaustive, but with a bit of support, I suggest someone undertaking this would be much better placed to start learning about TEI P5 XML from the online sources available.

The next post in this series is an Introduction to the Text Encoding Initiative Guidelines.

Posted in SelfStudy, TEI, XML | 3 Comments

@rend and the war on text-bearing attributes

The TEI attribute @rend from att.global, although it lets you type just about anything into it, doesn’t actually allow anything more than a set of single tokens. I recently explained to John, Paul, George, or Ringo (can’t remember which) that this really doesn’t mean that spaces are allowed, simply that whitespace is the delimiter in the attribute value.

The definition of @rend is “(rendition) indicates how the element in question was rendered or presented in the source text”, but it is very often used by some encoders to signal to processing how they want the output to appear. The remarks on the values allowed for the attribute say that it:

may contain any number of tokens, each of which may contain letters, punctuation marks, or symbols, but not word-separating characters.

The point here is the ‘word-separating characters’ part. So although you can say <hi rend="It looks a bit like that other one">text</hi>, this actually contains 8 tokens: “It”, “looks”, “a”, “bit”, “like”, “that”, “other”, “one”. Sometimes people stick CSS or CSS-like rendition information into @rend and so have values like “text-align: right”, which I would say is wrong… or at least is saying that there are two classifications applicable to its rendition in the source material: one that it is “text-align:” and another that it is “right”. Of course they could solve this just by deleting the space: “text-align:right” would be better, or even “text-align:right; font-size:large;” if you wanted to add another token. However, even better would be to use @rendition to point to the @xml:id of at least one <rendition> element in the header. This allows you to specify exactly what scheme you are using (e.g. CSS) and to give multiple statements for one classification.

Why does this matter, you might ask? Well, of course, it doesn’t really: they are all magic tokens of one sort or another, to be interpreted (or not) by your processing for whatever reason you are undertaking the encoding. The <rendition> method is the most detailed in documenting precisely how you are interpreting the rendition in the original document.

However, the reason it matters to me is that there are NO attributes in the TEI which allow free text.

By that I mean that all attributes are assigned to one datatype or another, and in none of them can you just type sentences of prose and have them be semantically meaningful. This is a result of the long War on Text-Bearing Attributes that was waged in the run-up to the first release of TEI P5. This took as one of its many principles that, because any bit of free text might need to use a non-Unicode character, and the TEI’s method for documenting non-Unicode characters is its <g> element, you couldn’t have free-text attributes, because you can’t use an element inside an attribute value. This is the reason for the creation of many new child elements, like <desc>, which are intended to contain free text concerning the nature of the element that contains them.

In the case of the @rend attribute, it allows from one to infinity occurrences of the data.word datatype. This datatype, even in P5 1.0.0, “defines the range of attribute values expressed as a single word or token.” Thus when people put space-separated characters into it, they are really putting in multiple tokens. The War on Text-Bearing Attributes attempted to limit the places where people were able to do this, through the use of datatypes and the removal of free text in attribute values.

This helps to highlight the difference between syntactic and semantic validity. Just because your document validates against a schema does not mean that it is semantically valid. You can put the text of a title inside an <author> element and vice versa, and there is no way your schema can know that you have done this.
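
For example, this is perfectly valid against a schema allowing <author> and <title>, but semantically nonsense (an invented illustration with the values swapped):

<author>The Waste Land</author>
<title>T. S. Eliot</title>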

So really, I’ve posted this post so I can point to it later when people ask me about spaces in @rend and similar datatype kerfuffles.

Posted in TEI, XML | 2 Comments

Is it Bill or Ben that is speaking of flowerpot men?

A friend asked a question about how to encode a dramatic speech that possibly should be considered two speeches. Owing to a printing mistake, the second speaker’s name was omitted, so some consider it a single speech by the first speaker. However, a later hand has added the second speaker’s name in the margin after the fact, so some may wish to understand it as two speeches. The question was how to encode these two possibilities simultaneously. Of course an entirely stand-off solution is possible, where you just mark the words and then mark, say, words 1 to 20 as belonging to either one speaker or the other. But ignoring that more complicated solution, here is some of the thinking I went through.

Let’s say we have some play, where Bill has two paragraphs. In the first he says “Bill and Ben, Bill and Ben,” and in the second he says “Bill and Ben, Bill and Ben, flowerpot men”. In TEI we might encode this as:

<!-- bill is speaker -->
<sp who="#bill">
   <!-- #bill points to more information about this speaker somewhere else in the document -->
  <speaker>Bill</speaker>
   <p>Bill and Ben, Bill and Ben,</p>
   <p>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

Now let’s say that the speaker label ‘Bill’ was there but had been crossed out by a later hand and replaced by ‘Ben’. We could indicate who we thought the real speaker was with the @who attribute, whilst still retaining, inside the <speaker> element, the orthographic record that a substitution had been made.

<!-- Ben is speaker but a substitution noted-->
<sp who="#ben">
 <speaker>
    <subst>
      <del>Bill</del>
      <add>Ben</add>
    </subst>
  </speaker>
  <p>Bill and Ben, Bill and Ben,</p>
  <p>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

But this means we have to make the editorial decision, for all outputs, that one of them (here ‘Ben’) is the speaker. Another similar type of occurrence might be when Bill and Ben both say the paragraphs at the same time. In this case, we just note both of them as speakers:

<!-- bill and ben are both simultaneously speakers-->
<sp who="#bill #ben">
  <speaker>Bill</speaker>
  <p>Bill and Ben, Bill and Ben,</p>
  <p>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

Similar to this is the case where the entire speech is spoken by either Bill or Ben, but the text just says Bill. In this case one solution (of a number of them) is not to point to a <person> element but instead to point to a <listPerson> identified as ‘billOrBen’. Then in processing we can choose to assign the speech to one or the other, even though the text still says ‘Bill’. We document that only one of them can be the speaker by using the @exclude attribute on each <person> element to point to the other.

<!-- billOrBen listPerson is speaker, but contents are mutually exclusive, so sort out in processing -->
<sp who="#billOrBen">
  <speaker>Bill</speaker>
  <p>Bill and Ben, Bill and Ben,</p>
  <p>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

<listPerson xml:id="billOrBen">
  <person xml:id="bill" exclude="#ben"><persName>Bill</persName></person>
  <person xml:id="ben" exclude="#bill"><persName>Ben</persName></person>
</listPerson>

But in the case that I was asked about, the speaker’s name is added partway through a speech. Now, one way to deal with this is just to say that ‘Bill’ is the speaker, and the name ‘Ben’ is just an addition in the text. There is nothing wrong with this; you’re just documenting the original printing and the addition of the new name, not changing the structure of the text.

<!-- bill is speaker but addition of name partway through noted -->
<sp who="#bill">
  <speaker>Bill</speaker>
  <p>Bill and Ben, Bill and Ben,</p>
  <p><add place="left"><name>Ben</name></add>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

The other option, of course, would be to understand the intellectual content of the addition as splitting the speech in two, and to encode not the original printed work but the final version with the editorial addition provided by a later hand. (So this would simply be two speeches, as in the following encoding.)

<!-- the addition understood as splitting this into two speeches -->
<sp who="#bill">
  <speaker>Bill</speaker>
  <p>Bill and Ben, Bill and Ben,</p>
</sp>

<sp who="#ben">
   <speaker rend="left">Ben</speaker>
   <p>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

But that isn’t really what was asked for… this says that there are two speeches, and while the enquirer wants to have that as a possibility, they also want to record that it is possible that the ‘flowerpot men’ paragraph was actually said by ‘Bill’, and this ‘Ben’ in the margin is just an addition. One way to do this is to use the @exclude attribute again, at slightly different levels of granularity.

<!-- bill speaks first bit, and possibly second bit, but possibly ben speaks second bit -->
<sp who="#bill">
  <speaker>Bill</speaker>
  <p>Bill and Ben, Bill and Ben,</p>
  <p exclude="#benPara" xml:id="billPara">
    <add place="left"><name>Ben</name></add>
     Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

<sp who="#ben" exclude="#billPara">
  <speaker rend="left">Ben</speaker>
  <p>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

In this case we’re saying that the second paragraph of Bill’s speech is mutually exclusive with the whole speech by Ben. In processing for any particular output we need to decide how to handle this: do we keep the speech by Bill (which has the addition of a name to the left of the second paragraph), or do we keep the speech by Bill consisting of only the first paragraph, plus a speech by Ben?

Another way to do this is to use the <alt> element to record the alternation elsewhere. In this case you just need to make sure there are proper @xml:id attributes on all the elements you want to point to; here ‘billPara2’ is the second paragraph of Bill’s speech, and ‘benPara2’ is the whole of Ben’s speech. We then use the <alt> element to say that these two IDs are mutually exclusive, and specifically that we think it 70% likely that ‘billPara2’ is the correct one to choose and only 30% likely that ‘benPara2’ is.

<!-- bill speaks first bit, and possibly second bit, but (less) possibly ben speaks second bit  stand-off alternation-->
<sp who="#bill">
  <speaker>Bill</speaker>
  <p>Bill and Ben, Bill and Ben,</p>
  <p xml:id="billPara2"><add place="left">Ben</add>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

<sp who="#ben" xml:id="benPara2">
  <speaker rend="left">Ben</speaker>
  <p>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

<alt mode="excl" targets="#billPara2 #benPara2" weights="0.7 0.3"/>

It is important to note that all of this is just a way to document whichever interpretation the encoder wishes to record. I’m not aware of any off-the-shelf processing which will do anything with @exclude or <alt> elements; however, I can imagine that doing this in XSLT would not necessarily be too onerous, depending on the circumstances in which it is used.
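
For example, choosing Bill’s reading might be as simple as this (a hypothetical XSLT 2.0 sketch; ‘billPara’ is the xml:id from the earlier example):

<!-- identity transform: copy everything through unchanged -->
<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

<!-- suppress any element whose xml:id is listed in the @exclude
     of the element we have chosen to keep -->
<xsl:template match="*[concat('#', @xml:id) =
    tokenize(//*[@xml:id eq 'billPara']/@exclude, '\s+')]"/>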

Oh, and obviously the original enquiry did not use a play based on the Bill and Ben theme song, but a much more famous Renaissance poet and playwright.

Posted in TEI, XML | Leave a comment

TEI P4 Support, Survey Results

Introduction

This post contains the results of a survey that collected information which the TEI Technical Council will use to assess the need for ongoing support for the TEI P4 version of its Guidelines. TEI P4 has largely been replaced by the TEI P5 Guidelines since November 2007. At that point it was promised that support would continue for TEI P4 for 5 years, until November 2012. As that is just over a year away, we are starting a slow process of phasing out support for the TEI P4 Guidelines. The TEI Technical Council is planning to de-emphasize the appearance of TEI P4 as an offering, since support for it will be ending in November 2012. We will continue to support it over the next year, but may take steps to stop it being indexed by search engines or make it less prominent on the website. These are the results of the survey, which I’ve also transformed to TEI P5 XML at http://users.ox.ac.uk/~jamesc/SurveySummary.tei.xml.

1. Are you involved with projects that are still using TEI P4?

Answers to question 1

My reading of these results is that many people are either not using TEI P4, or are planning to migrate to TEI P5. I suspect, given the other answers, that those with TEI P4 projects probably do not rely on a lot of support from the TEI Consortium.

2. How important is ongoing TEI P4 support to you?

Answers to question 2

This seems fairly clear: out of 54 respondents, 44 said it was not important, unnecessary, or that we should get rid of it. But that it is important or very important to 18.5% of respondents is still significant, and must be remembered when making decisions concerning ongoing support for TEI P4.

3. How much should the TEI Consortium begin to de-emphasize TEI P4 on its website before November 2012?

Answers to question 3

There seems to be a strong vote for making TEI P4 available only from the TEI Vault and making sure existing links redirect.

4. Should search engines be dissuaded from indexing TEI P4 materials?

Answers to question 4

This result is less clear-cut, with some people feeling it shouldn’t be indexed and some thinking it should be (with slightly more weight on it being indexed than not).

5. Approximately how many TEI P4 projects have you been involved with?

Answers to question 5

This is simply a statistical question (and of course depends on how the respondent interprets ‘projects’). It is interesting that the majority of people seem to be involved with more than one project, but that is hardly unexpected. More were involved with 6-15 projects than I had expected.

6. Approximately how many TEI P5 projects have you been involved with?

Answers to question 6

It is interesting that the percentages are vaguely the same as with TEI P4 projects, though slightly higher overall.

7. What amount of TEI P4 data do your projects have? (In documents, number of files, how many megabytes, or whatever convenient measure makes sense for your project)

This was a textual question, attempting to get a measure of how much TEI P4 material people have. It was deliberately left vague as to how this should be expressed, partly because I was interested to see how people would quantify their TEI P4 data, and partly because I recognised that it would be difficult for everyone to provide the same form of measurement. I was interested to see that the answers ranged more widely than I had expected.

  • 0
  • none
  • zero
  • Several hundred files.
  • I have about 500 texts
  • 3,200 files, 170Mb.
  • nil
  • Very roughly: 60,000 books = 5 million pages = 10 GB of marked-up text.
  • 40 megabytes in the one P4 project I still manage; a bunch more in ones I’m no longer involved in.
  • This varies a lot, but projects range from 3-150 MB. In practice, the TEI files are a small part of the overall operation, which includes authority information usually in non-TEI format, and various generated TEI XML files used for web publication only
  • 50 files
  • Appx. 7000 files, 29 MB total data
  • Appr. 6500 documents (mostly letters)
  • 0
  • less than 10%
  • 0
  • about 3,000 XML files currently in P4.
  • in summa: about 4 Mb
  • All of the [Institution]’s projects are in migration from p4 to p5, so this is a snapshot of the migration process. The data is migrated, but the sites are not all rewritten yet. My hope is that by May of 2012, all of the current [Institution] sites will be serving out texts based on p5.
  • 0
  • Help files used by about 1000 Modes users.
  • 5 text-critical editions
  • 7000+ [P4 Customization] encoded letters
  • Main current project: several dozen megabytes including a few large files but mostly 10-20 kb: roughly 3000 files.
  • Roughly twelve published electronic editions, with at least a dozen more in the pipeline, in process of being finished (though they now have to be migrated to be published).
  • I have no clue, but it’s a lot.
  • The [Institution] has 113MB bytes of P4 documents, of archival interest only.
  • None, since we upgraded.
  • I’m not sure. I think I might have one project that is in TEI P4, but it’s a legacy project and I’m actually not positive. I haven’t looked at it in a while.
  • 2.5 million text pages
  • zero
  • None
  • Between 300 and 600 files.
  • ca. 70 files
  • dozens of documents.
  • Lots. Can’t access the figures quickly.
  • 700MB

This ranges from zero to multiple gigabytes of TEI text. What I should have asked was “And is all the TEI freely available for download?” as, of course, that is something I’d like to encourage.

8. Please list the URLs of any TEI P4 projects you want us to know about.

I’ve decided not to provide these in this summary; if projects wish to provide samples they should add them to http://wiki.tei-c.org/index.php/Samples and/or describe their projects on the wiki.

9. Please list the URLs of any TEI P5 projects you want us to know about.

I’ve decided not to provide these in this summary; if projects wish to provide samples they should add them to http://wiki.tei-c.org/index.php/Samples and/or describe their projects on the wiki.

10. Have you submitted a Bug or Feature Request to the TEI Technical Council?

Answers to question 10

Lots of people have provided bug or feature requests, but most have either just contributed to discussion or not contributed at all. We should, of course, strive to increase feedback from the TEI community. I’d be interested in any ideas on how to make it easier for the community to participate.

11. Where do you think the TEI Technical Council should expend its time and effort?

Answers to question 11

This is also an interesting result. Scoring highest as ‘top priority’ is the idea that the TEI Technical Council should spend its time fixing bugs and implementing feature requests from the community. Analysing where the TEI Guidelines could be improved and undertaking those improvements was also ranked highly, along with developing the infrastructural basis for future versions of the TEI Guidelines. What scored lower was the idea of the TEI Technical Council setting up a repository of TEI texts, or developing software to make publication of TEI texts easier. I suspect this is because maintaining the Guidelines is the central mandate of the TEI Technical Council, and looking for ways it can be improved is related to that, while the creation of repositories is already done better by people who focus on those activities. Although the TEI is a community-based activity, only the TEI is really in charge of maintaining the Guidelines, whereas any third party can develop software or archives. We should certainly encourage those activities and implement community suggestions which facilitate the greater development of community software.

12. Any other comments?

Here are the comments that I received (lightly edited), with my personal responses:
For people with large repositories of transcriptions (where the text content will never be updated), markup stability is essential. P4 to P5 is not essential but recommended, but it’s going to mean a huge effort. My worry is that there will be a far too rapid succession to P6, P7, P8, etc which adds bells and whistles but does not contribute anything meaningful to static repositories.
There is not necessarily any reason to migrate if your systems are set up and working fine with P4. I would, personally, recommend using P5 in any new project. You then probably reach a state where it is easier to migrate the P4 to P5 than to support multiple systems, but different people’s experiences will vary. The Birnbaum Doctrine suggested that the TEI Council should only move to new major versions (P6 etc.) when a large external technological change meant that it would be beneficial (e.g. SGML to XML), or a large internal infrastructural change (e.g. development of the P5 class system) was deemed significantly beneficial. I personally do not believe that we are at a juncture which would necessitate development of P6; I’d rather see P5 2.5, P5 4.5, P5 35.2, etc. than have people feel they need to move major versions. This has its own challenges, of course, and your project can point, in its TEI ODD, to the very specific version of TEI P5 that it uses.
Yes – thanks for doing such a great service to the community!
You’re welcome, it was my pleasure. Although I know filling in surveys can be annoying I think it is a quick and easy way to get at least a vague indication of the community’s feeling on certain issues.
I think that lack of easy tools for presentation / publication of TEI documents is a serious drawback. Many of my younger colleagues would learn (or actually have learned) TEI editing in Oxygen, but they are unable — and not willing! — to learn XSLT for the presentation of their texts (not to mention the publication – servers etc.). An average user who is not able to modify Sebastian’s stylesheets for his edition is left completely alone with his/her TEI document (only *exceptionally* is an XSL expert available for help in big institutions). As for now, the TEI is an ideal tool for only one part of the communication chain — but not for the whole …
This is of course difficult, but so is the publication of research in print or other mediums. Usually these forms of publication involve the work of other people, for which researchers pay in one way or another. Perhaps it is because I happen to help manage a service, InfoDev, which would be more than happy to undertake paid work in this area for you and other external institutions, but I don’t see this as much of a hurdle. If the research is worthwhile, then hopefully funding is available, and some of this could be budgeted for technical development. That said, researchers often spend years learning ancient languages or obscure discipline-based technicalities, and arguably they should be able to learn some basic XSLT and HTML with a very small dedication of their time. Whether they should and could do this is, of course, a personal decision, but these are just more tools in a toolbox that might also include knowledge of how to write complex statistical queries or how to collaborate using version control systems. But again, we’re happy to undertake work, especially TEI-related work, on any part of the digitization, publication, analysis, and visualization aspects of research projects.
Perhaps, a marketing campaign would help.
This would perhaps help get more people involved in the TEI. I suggest we would want anyone applying for funding for a humanities text project to feel (or be advised) that they should be using the TEI, or at least to justify why they are using some other open standard instead. I feel this is probably more in the mandate of the TEI Board than the TEI Technical Council, but I would encourage SIGs and indeed individuals to undertake whatever outreach activities are feasible.
About question 11: it would be interesting to relate software/tools development and training/workshops, offering training sessions dedicated to one tool or category of tools, and looking at how people use tools IRL during the training sessions to get a better idea of needs and specifications…?
This would be interesting, though those who have just been trained in tools are likely to perceive different needs from those who use them on a daily basis. But I do wonder whether this should be a priority for the TEI Technical Council, which has its hands full maintaining, improving, and extending the Guidelines themselves. We should encourage tool development by third parties, and facilitate this development where it is in our power.
Please, please, please don’t spend time and money on building a TEI-wide repository. Instead, convince Google to recognize the TEI format so that one can easily do a web search for TEI texts. Then, get people to put their texts on the web. I think the building of publishing tools and education are very important, but that they shouldn’t be Council functions per se. Similarly, I think the interchange question is very, very important, but Council’s role in it should be limited. This is the kind of thing a SIG (or SIGs) should tackle, and Council should be involved in blessing/criticizing their output.
Personally, I agree with you about building repositories; I feel there are more than enough people with a lot more experience in undertaking this kind of activity. There has already been promising discussion and work with Google regarding exporting from Google Books in TEI P5 XML format. I agree the community, potentially through SIGs, can handle a lot of these issues. I worry about the idea of the Council “blessing/criticizing” the output of SIGs, rather than just being on hand to provide support and implement changes recommended by them.
Creating and managing a content repository is vastly different from developing and maintaining markup guidelines, and would require a serious redirection of TEI-c’s resources. Let others who are already in the repo business (e.g., HathiTrust, OTA) take care of that.
I would agree with this, and it is what I would recommend to the TEI.
Thank you for undertaking this survey.

You’re welcome, it was my pleasure. I’m always interested in getting a sense of where the TEI community agrees on certain issues.


13. You may optionally include your email address so we can contact you if (and only if) we have any follow-up questions concerning your responses.

I’m certainly not going to provide these for spam-bots!

Conclusion

My recommendation to the TEI Council is going to be that we slowly start phasing out TEI P4 support. Closer to the end-of-support date (November 2012) we should move the TEI P4 materials to the TEI Vault and redirect links there. I think this survey bears out my belief that the TEI Technical Council should focus on the maintenance and improvement of the Guidelines, and on looking for ways to improve them in the future.



Posted in TEI | 5 Comments

TEI Consortium and its Future

John Unsworth, interim chair of the TEI Consortium (TEI-C), has asked those running for the TEI Board or TEI Technical Council, and those who are remaining in place, to answer some questions regarding the development of the TEI. I’m already serving a term through 2012, so I am not up for re-election this year. I’ve chosen to write my answers up as a blog post because I found it difficult to adhere to John’s plea for brevity.

1) Should the TEI cease to collect membership fees, and cease to pay for meetings, publications, services, etc.?

I feel it would be difficult for the TEI Consortium to continue its work without collecting membership fees. However, I think the majority of this money should not be reserved for travel; it should instead be available for application in the same manner as the SIG grants we have awarded in the past. (This might fund travel for a particular additional TEI Technical Council workgroup, bursaries for the conference, or targeted tool development (‘bounties’) for tools useful to the TEI-C’s mission, amongst many other things.) There should not necessarily be any limits on what could qualify for an application for funding. Not all revenues would need to be spent in a single year.

2) Assuming paid membership continues, should institutional members have a choice between paying in cash and paying by supporting the travel of their employees to meetings, or committing time on salary to work on TEI problems?

The cost of running meetings for the TEI Board or Technical Council should mostly be borne by the institution and agreed to at time of nomination (i.e. if your institution won’t commit to fund your attendance (travel and subsistence) at a couple of meetings a year, then you should not necessarily be accepted as a candidate). I realise this is unfair, but so is participation in most standards-creating bodies, and there is nothing stopping significant participation by any member of the community (i.e. they don’t need to be on the Board/Council to effect change). It may be that public funds could be sought by the institution or individual to supplement this further. TEI-C money would be used for any overall expenses, such as the costs of room hire, or such things not covered by institutions. If an institutional member were in dire straits financially, but the participation of a person elected from that institution were deemed of such benefit to the TEI-C, they could apply for support from the TEI-C. However, this should not be the norm. All Partner-level institutions should offer services as part of their partnership agreement in addition to the top-level membership fee. These partnership agreements should be made public on the TEI-C website. ‘Membership’ at a lower non-Partner rate might be replaced solely by services. There should be nothing stopping voluntary participation in TEI-C activities by motivated individuals who are not institutional members.

3) Should the TEI have individual members (paying or not) who can vote to elect people to the board and/or council?

All members at every single level, especially including individual subscribers, should have a single vote. Institutions become Partners to support the TEI Consortium and tend to view it as participation in a standardization body; I doubt many care strongly about their privileged position of having a vote at election time. One vote for one member (whether individual, Partner, or otherwise).

4) Should the email discussions of the TEI Board be publicly accessible?

Yes. The TEI Technical Council archives were made public partly because of my suggestion that they be made so; see http://lists.village.virginia.edu/pipermail/tei-council/2006/005757.html … in this post I assumed that the TEI Board mailing list might contain details that would be detrimental if made public. Having had reports back from institutional representatives on the mailing list, I no longer believe that this is true for the majority of posts there. I would recommend that when something of an extremely confidential nature is discussed, this happen off the TEI Board mailing list, but that an edited summary of the discussion be posted back to the list for all to see. However, such in camera discussions should be very unusual and justified before taking place.

5) Should the Board and the Council be combined into a single body, with subsets of that group having the responsibilities now assigned to each separate group?

I agree that the TEI Board and TEI Technical Council might seem a bit cumbersome. I’ve been on the TEI Technical Council since 2004 and have enjoyed that it is not in its remit to worry about the fiscal, marketing, and organizational aspects of the TEI-C. Although I think the TEI Board could do a better job in these areas, especially marketing, these are not my strengths.  If they were merged together I think it might distract from the technical work. If we then made sub-groups with responsibilities for Board-like activities and Technical Council-like activities, aren’t we just reinventing the Board and Technical Council?  If the activities and discussions of the TEI Board were conducted publicly (i.e. the mailing list archives were public), then I think that would be enough. The community could then lobby elected individuals if they wished to get their points of view heard.

6) Assuming we continue to collect funds, we will still have limited resources. Given that, in the next two years, which of the following should be the TEI’s highest priority? Pick only one:

a) providing services that make it easy for scholars to publish and use TEI texts online
b) providing workshops, training, and other on-ramp services that help people understand why they might want to use TEI and how to begin to do so
c) encouraging the development of third-party tools for TEI users
d) ensuring that large amounts of lightly but consistently encoded texts (e.g., TEI Tite) are generated and made publicly available, perhaps in a central repository or at least through some centrally coordinated portal
e) developing a roadmap for P6 that positions the TEI in relation to other standards (HTML5, RDF, etc.)
f) tackling hard problems not addressed in other encoding schemes, in order to maximize the expressive and interpretive power of TEI

This is a difficult choice because so many of these are things that I feel strongly need to be encouraged.

a) is very vague, and I feel it is not the role of the TEI-C to provide lots of services; its role is to maintain a standard.
b) also sounds good, but we already have lots of people providing training (my own institution included) on a cost-recovery basis. Some more basic guides might be beneficial.
c) The TEI-C can encourage these through SIG grants and bounties where appropriate, but third-party tools should be developed by third parties.
d) I’m highly resistant to the idea that any TEI users should even see TEI Tite documents at all! This schema is not TEI Conformant or Conformable by itself, as it breaks the TEI Abstract Model in several ways. Tite is fine as a mass-digitization schema, but should be transformed instantly, and internally to the project, to a proper TEI file with a <teiHeader>. I have nothing against lots of sample TEI texts being made available, in TEI Lite or, better, a slimmed-down, mostly structural encoding. However, I think that having these all in one place is unlikely; distributed collections of archives, all linked to from http://wiki.tei-c.org/index.php/Samples or another location, or aggregated through some OAI-PMH or RDF service, are probably an easier start. Again, this should be done by the community, not the TEI-C. There are no barriers to the community just doing this, and I know the Oxford Text Archive has some plans in this area.
f) is a possibility, but the suggestions and developments for the TEI Guidelines should come from the community. The TEI Guidelines are not, however, Guidelines of the Gaps, handling just those things not done by other standards. They play nicely with other standards wherever possible, and developments should continue to improve them in this area.
e), which I’ve cunningly left to last, is probably central to what the TEI-C, or at least the TEI Technical Council, should be doing. We already have a statement on the conditions for maintenance of P5 and development of things like P6 (http://www.tei-c.org/Activities/Council/Working/tcw09.xml), and I do not believe we have reached such a major change in technology or infrastructure as to warrant TEI P6 yet. However, I agree that there are things we can do with the TEI Guidelines to help those seeking transformations to HTML5, RDF, and other newer formats, and recommendations to be made in this area. I disagree entirely that these somehow replace the need for TEI. A roadmap is a good idea, but a lot of the necessary changes can be done under the umbrella of TEI P5 and its intended deprecation mechanisms.

So, on balance, I plump for ‘e)’; however, I think all the other ideas are beneficial, with c) and f) being my second choices.

Overall, I do not think the TEI-C is horribly broken, and I believe that the TEI has a good and useful role to play in the development of digital resources. The suggested revisions moving towards openness and transparency would be beneficial. I feel the problems people have had with the TEI Board stem from not knowing what is going on there (lack of transparency) and from members of the Board acting as individuals rather than remembering that they are there as representatives of the community at large.

-James

Posted in TEI | 1 Comment

Digital Humanities 2011

My report from Digital Humanities 2011 is below. If anyone wants any more information about the various sessions I attended, I’m happy to try and dredge my memory for a recollection of my impressions. Otherwise the book of abstracts is available. Most of the interesting things were really in between sessions and in the evenings, in talking to people about possible future projects, advertising InfoDev services, etc.

Friday 17 June 2011

Sebastian and I took an afternoon flight to SFO, where we were attending the Digital Humanities 2011 conference. I was lucky enough to get a row to myself, but Sebastian kept to his assigned seat rather than join me and be tormented by my cackling at juvenile films. I watched four films, the only one of which I’d recommend is Submarine, whose screenplay and direction were by Richard Ayoade. Sebastian’s estimate of c.250 attendees is a bit off; there were about 375 registered participants, with various other hangers-on, according to the organisers.

Saturday 18 June 2011

Sebastian and I woke early (thank you jetlag) to teach our Introductory TEI ODD workshop at 8:30am. Unfortunately, nothing on campus that serves anything which even vaguely resembles food opens until 8am on a Saturday. The course materials are at: http://tei.oucs.ox.ac.uk/Talks/2011-06-18-odd/ and we had about 15 participants. We went perhaps a bit too fast, and talked too long, but most of them made it through the first exercise. Some had difficulty with the idea that we weren’t teaching the stated prerequisite of TEI and XML but the TEI’s customization language instead. It really would have been better to do it as a full day workshop.

Afterwards Craig Bellamy and I drove (in a mustang he had rented) down to Santa Cruz and ate a burrito on the beach. It was better than the ones I get here in Oxford and was not dissimilar to the real thing. We also went to look at UC Santa Cruz, where Craig had spent some undergraduate time, a truly bizarre campus. Craig is responsible for setting up the Australasian Association for Digital Humanities (see http://www.craigbellamy.net/2011/05/31/australasian-association-for-digital-humanities-aadh/ and http://aa-dh.org/), which is seeking to join ADHO (Alliance of Digital Humanities Organizations) alongside ACH, ALLC, and SDH-SEMI. Much of our conversation related to this topic and the ADHO Steering Committee meeting the next day. (Boy, don’t we know how to spoil a beach!) We returned to Stanford and met up with various other DH conference goers for ‘food’ and ‘drink’ in the local students’ union.

Sunday 19 June 2011

I intended to go swimming this day, but the lane swimming wasn’t open until the afternoon, so instead I rented a bicycle. I purchased a variety of items to put in the huge fridge that was part of the full-sized kitchen (with stove, sink, dishwasher, microwave, etc.) in my room. Sadly the kitchen didn’t come with anything useful to, you know, cook or eat with. It didn’t come with anything at all. Since Sebastian also had a bicycle we cycled to the Stanford Shopping Centre, where we looked around at things we could possibly buy, had lunch, and eventually cycled back to the residences. The conference’s opening plenary was by David Rumsey (http://www.davidrumsey.com/) talking about “Reading Historical Maps Digitally: How Spatial Technologies Can Enable Close, Distant and Dynamic Interpretations”, but it partly seemed to be demonstrating the proprietary Luna Browser (http://www.davidrumsey.com/view/luna), a Java-servlet-based image viewer, which I didn’t like at all. At the reception afterwards there was much pleasant conversation.

Monday 20 June 2011

I attended a morning session consisting of the following papers:

  • Maciej Eder & Jan Rybicki “Do Birds of a Feather Really Flock Together, or How to Choose Test Samples for Authorship Attribution “
  • Jan Rybicki “Alma Cardell Curtin and Jeremiah Curtin: The Translator’s Wife’s Stylistic Fingerprint.”
  • David L. Hoover “The Tutor’s Story: A Case Study of Mixed Authorship”

And then one with:

  • Yves Marcoux, Michael Sperberg-McQueen, & Claus Huitfeldt “Expressive power of markup languages and graph structures “
  • Gary F. Simons, Steven Bird, Christopher Hirt, Joshua Hou, & Sven Pedersen “Mining language resources from institutional repositories”
  • Thomas Eckart, David Pansch, & Marco Büchler “Integration of Distributed Text Resources by Using Schema Matching Techniques”

Of these, the one by Yves Marcoux on OO-TexMECS was the most interesting (though Eckart’s showed some promise). However, I fundamentally disagreed that breaking with XML is necessary for recording the majority of the graph data-structures he was presenting: TEI-style basic fragmentation, or even basic stand-off linking, seems to do the trick in 99% of cases. It is an interesting discussion for markup geeks interested in the theory behind markup languages, but it solves a problem that I feel isn’t really a problem for the majority of work we do here.
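To give a sketch of what I mean by TEI-style fragmentation (the text and values here are illustrative, not taken from the paper): a metrical line shared between two speeches, a classic overlap case, can simply be split into parts:

<sp who="#speaker1">
   <!-- initial part of the metrical line -->
   <l part="I">To be, or not to be,</l>
</sp>
<sp who="#speaker2">
   <!-- final part of the same line -->
   <l part="F">that is the question.</l>
</sp>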

After lunch I went to a bit of:

  • Reinhild Barkey, Erhard Hinrichs, Christina Hoppermann, Thorsten Trippel, & Claus Zinn “Trailblazing through Forests of Resources in Linguistics “
  • Michele Pasin ” Browsing highly interconnected humanities databases through multi-result faceted browsers “
  • Alan Galey “Approaching the Coasts of Utopia: Visualization Strategies for Mapping Early Modern Paratexts”

before nipping off to the location where the posters were to be displayed to put up my Wandering Jew’s Chronicle poster as well as Sebastian’s Claros poster, both right in front of the doors where you walk in, ensuring maximum throughput of people to look at them. The poster session was quite busy. Shortly before it I had taken photos of all the posters; however, these are on the camera which later went missing. There was a reception that followed this, but I was so busy talking to people about the poster that I seemed to miss it. Luckily someone brought me a drink (and we arranged a tour of SLAC for the next day).

Tuesday 21 June 2011

Sebastian woke up extra early to go on a punishing ‘fun run’ up huge mountains, whereas I slept in. From 08:30 we interviewed a potential ePub and/or OpenData intern via Skype. Since we’d missed the beginning of the sessions (and from their abstracts I didn’t feel cheated), Sebastian went off to catch the end of them while I cycled to the nearby B. Gerald Cantor Rodin Sculpture Garden and looked at a bronze cast of Rodin’s “The Gates of Hell”; see http://museum.stanford.edu/view/rodin__1985_86.html

Afterwards I caught one of the next sessions, specifically a panel discussing “The Interface of the Collection”, consisting of: Geoffrey Rockwell, Stan Ruecker, Mihaela Ilovan, Daniel Sondheim, Milena Radzikowska, Peter Organisciak, & Susan Brown.

Over lunch, instead of nattering away to people about visualization, I went on a visit Mike Toth had arranged to the Stanford Linear Accelerator Complex (http://yfrog.com/ke3m7tmj), now ‘SSRL’. He had done work there using X-ray fluorescence to uncover the Archimedes Palimpsest, and they wrote up a glowing press article about our visit: https://news.slac.stanford.edu/features/digital-humanities-experts-learn-how-ssrl-can-shed-light-past

We can, indeed, use real science tools to help digital humanities.

After this I ate some lunch in the back of the following session:

  • David Beavan “ComPair: Compare and Visualise the Usage of Language “
  • Trevor Muñoz, Virgil Varvel, Allen Renear, Kevin Trainor, & Molly Dolan “Tasks vs. Roles: A Center Perspective on Data Curation Needs in the Humanities “
  • Deborah Anderson “Handling Glyph Variants: Issues and Developments “
  • Scott Weingart & Jeana Jorgensen “Computational Analysis of Gender and the Body in European Fairy Tales “
  • Hiroyuki Akama, Maki Miyake, & Jaeyoung Jung “Automatic Extraction of Hidden Keywords by Producing “Homophily” within Semantic Networks”

Later we went to the Zampolli Prize Lecture in the Dinkelspiel Auditorium and listened to the winner, Chad Gaffield, tell us about “Re-Imagining Scholarship in the Digital Age”. This was a very motivational session by the president of the SSHRC funding body. I wouldn’t have been surprised if he had got everyone up and singing praises, but the auditorium was far too hot for that kind of thing.

Wednesday 22 June 2011

This morning I went to the panel on “Integrating Digital Papyrology” featuring Gabriel Bodard, Hugh Cayless, Ryan
Baumann, Joshua Sosin, & Raffaele Viglianti.

After a break I attended “The ‘#alt-ac’ Track: Digital Humanists off the Straight and Narrow Path to Tenure” featuring Bethany Nowviskie, Julia Flanders, Tanya Clement, Doug Reside, Dot Porter, & Eric Rochester. I attended partly because I have an article (as the last word) in the open access book they were launching: http://mediacommons.futureofthebook.org/alt-ac/.

After lunch there was a panel on Funding Digital Humanities, with funders from the USA and Canada. There was no UK, European, Australian, Japanese, or Mexican funder represented. Still, it was good to hear what they said.

After this there was the closing plenary by JB Michel & Erez Lieberman-Aiden, who had worked with Google to produce the Google Ngram Viewer. The long ‘s’ problem in OCR’ed data is clearly visible by looking at ‘best,beft’ from 1700 to the modern day in http://ngrams.googlelabs.com/. (Something I tweeted about a couple of days after its launch, but using presumption vs prefumption.) Unlike Chad, who seemed to be celebrating what Digital Humanities had done, these two seemed intent on telling us quite obvious things that DH as a community should be doing… most of which I’m pretty sure we are already doing or striving to do. Because it had been so hot during Chad’s talk, on the way to this one I stopped to get a mango smoothie, which made the talk more tolerable.

Following this there was a banquet at the Computer History Museum in Mountain View. The food and drink were so-so, the company was excellent, and the museum was fairly USA-centric in its outlook.

Thursday 23 June 2011

While most people went on organised tours to Silicon Valley or the Sonoma Wine Country, Craig Bellamy (with his mustang), Peter Organisciak, and I instead drove up Highway 1, stopping off for delicious Mexican food and beaches, and crossing the Golden Gate Bridge. In S.F. we walked around Fisherman’s Wharf and some other places, before returning to Stanford. There was simultaneously a meeting on the curation of digital humanities data which I followed via Twitter.

Friday 24 June 2011

I was flying home in the evening, so accompanied by Raffaele Viglianti I went to S.F. on the train, where we met up with some other people, wandered up and down the hills of Chinatown, had some dim sum, and eventually I caught a shared van to SFO to catch my flight. This time I got a seat in the much smaller ‘upper deck’ of the plane, but still didn’t capitalise on it and watched several more films. I arrived back Saturday midday, horribly jetlagged.

Posted in Conference | 2 Comments

grouping by group-adjacent=”boolean(self::lb)”

A project I was doing some work for had some input that looked like:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader xmlns:xi="http://www.w3.org/2001/XInclude" type="text">
<fileDesc>
   <titleStmt>
      <title>A sample file</title>
   </titleStmt>
   <publicationStmt>
      <distributor>InfoDev</distributor>
   </publicationStmt>
   <sourceDesc>
      <p>VSARPJ project</p>
   </sourceDesc>
</fileDesc>
<profileDesc>
   <creation>
      <date/>
   </creation>
   <langUsage>
      <language ident="ojp">Old Japanese</language>
   </langUsage>
   <textClass>
      <catRef target="#bussoku"/>
   </textClass>
</profileDesc>
<encodingDesc>
   <samplingDecl>
      <p>This text was transcribed phonemically and edited to parallel the content from the
         corresponding item in the <title>Nihon koten bungaku taikei</title> version of the
            <title>Man'yôshû</title>, <ref>Man'yôshû I</ref>. </p>
   </samplingDecl>
</encodingDesc>
</teiHeader>
<text>
<body xml:id="BS.1">
   <div>
      <ab type="original" xml:lang="ojp"> 美阿止都久留 <lb xml:id="BS.1-orig_1"
            corresp="#BS.1-trans_1"/> 伊志乃比鼻伎波 <lb xml:id="BS.1-orig_2" corresp="#BS.1-trans_2"
         /> 阿米爾伊多利 <lb xml:id="BS.1-orig_3" corresp="#BS.1-trans_3"/> 都知佐閇由須礼 <lb
            xml:id="BS.1-orig_4" corresp="#BS.1-trans_4"/> 知知波波賀多米爾 <lb xml:id="BS.1-orig_5"
            corresp="#BS.1-trans_5"/> 毛呂比止乃多米爾 </ab>
      <ab type="transliteration" xml:lang="ojp-Latn">
         <s>
            <phr>
               <phr>
                  <cl>
                     <phr type="arg">
                        <w>
                           <m type="prefix">
                              <c type="phon">mi</c>
                           </m>
                           <w>
                              <c type="phon">ato</c>
                           </w>
                        </w>
                     </phr>
                     <w type="verb" function="adnconc" ana="#L031144">
                        <c type="phon">tukuru</c>
                     </w>
                  </cl>
                  <w type="verb" function="adnconc" ana="#L031144">
                     <c type="phon">tukuru</c>
                  </w>
                  <lb xml:id="BS.1-trans_1" corresp="#BS.1-orig_1"/>
                  <w>
                     <c type="phon">isi</c>
                  </w>
                  <w type="particle" subtype="case" function="gen" ana="#L000520">
                     <c type="phon">no</c>
                  </w>
               </phr>
               <w>
                  <c type="phon">pibiki</c>
               </w>
               <w type="particle" subtype="top" ana="#L000522">
                  <c type="phon">pa</c>
               </w>
            </phr>
            <lb xml:id="BS.1-trans_2" corresp="#BS.1-orig_2"/>
            <cl>
               <phr>
                  <w>
                     <c type="phon">ame</c>
                  </w>
                  <w type="particle" subtype="case" function="dat" ana="#L000519">
                     <c type="phon">ni</c>
                  </w>
               </phr>
               <w type="verb" function="infinitive" ana="#L030170">
                  <c type="phon">itari</c>
               </w>
            </cl>
            <lb xml:id="BS.1-trans_3" corresp="#BS.1-orig_3"/>
<!-- etc -->
         </s>
      </ab>
   </div>
</body>
</text>
</TEI>

What they wanted as output was a table layout (icky) that aligned two nested tables of the original and the transliteration, like:

<table>
   <tr>
      <td>
         <table>
            <tr>
               <td><span class="origLine">美阿止都久留</span></td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="origLine">伊志乃比鼻伎波</span></td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="origLine">阿米爾伊多利</span></td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="origLine">都知佐閇由須礼</span></td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="origLine">知知波波賀多米爾</span></td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="origLine">毛呂比止乃多米爾</span></td>
            </tr>
         </table>
      </td>
      <td>
         <table>
            <tr>
               <td><span class="w">miato</span>
                  <span class="w">tukuru</span>
                  <span class="w">tukuru</span>
               </td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="w">isi</span>
                  <span class="w">no</span>
                  <span class="w">pibiki</span>
                  <span class="w">pa</span>
               </td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="w">ame</span>
                  <span class="w">ni</span>
                  <span class="w">itari</span>
               </td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="w">tuti</span>
                  <span class="w">sape</span>
                  <span class="w">yusure</span>
               </td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="w">titipapa</span>
                  <span class="w">ga</span>
                  <span class="w">tame</span>
                  <span class="w">ni</span>
               </td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="w">moropito</span>
                  <span class="w">no</span>
                  <span class="w">tame</span>
                  <span class="w">ni</span>
               </td>
            </tr>
         </table>
      </td>
      <td>BS.1</td>
   </tr>
</table>

If we ignore the icky aspect of using tables for layout and alignment purposes, then the solution has something interesting to teach. This is, at heart, a grouping problem. The solution I came up with was:

<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xpath-default-namespace="http://www.tei-c.org/ns/1.0" version="2.0">
    <xsl:template match="TEI">
        <html>
            <head>
                <title>test corpus</title>
            </head>
            <body>
                <xsl:apply-templates/>
            </body>
        </html>
    </xsl:template>
    
    <!-- You can put things you want to do nothing to all in one template -->
    <xsl:template match="teiHeader | note | entry | list"/>
    
    <!-- Or similarly, things whose tags you just want to vanish. w is matched here and by a more specific template below, hence the low priority. -->
    <xsl:template match="choice | m | w | s | phr | cl" priority="-1"><xsl:apply-templates/></xsl:template>
    
    <!-- If you are using tables for layout purposes (icky) then you don't need to change lb's to BRs. -->
    <!--
    <xsl:template match="lb">
        <br/>
     </xsl:template>-->
    
    <xsl:template match="body">
        <table>
            <tr>
                <td>
                    <!-- Nesting an icky table for each cell, so you can get one tr per line. -->
                   <table> 
                       <xsl:apply-templates select="descendant::ab[@type='original']"/>
                   </table> </td>
                <td>
                    <!-- Nesting an icky table for each cell, so you can get one tr per line. -->
                    <table><xsl:apply-templates select="descendant::ab[@type='transliteration']"/></table>
                 </td>
                <td>
                    <xsl:value-of select="@xml:id"/>
                </td>
            </tr>
        </table>
    </xsl:template>
    
    <!-- Not really necessary but in case you wanted to be able to do something with the original lines, wrap an element around them. -->
    <xsl:template match="ab[@type='original']//text()"><span class="origLine"><xsl:value-of select="normalize-space(.)"/></span></xsl:template>
    
    <!-- For original things group by any child nodes or text, and create the groups adjacent to whether there is a linebreak or not. -->
    <xsl:template match="ab[@type='original']">
        <xsl:for-each-group select="child::node()| child::text()"  group-adjacent="boolean(self::lb)">
        <tr><td><xsl:apply-templates select="current-group()"/></td></tr>
        </xsl:for-each-group>
        </xsl:template>
    
    <!-- For transliterations, first flatten the hierarchy (you could do this a variety of ways) by copying just the top w elements and linebreaks, and then group these adjacent to the line breaks. -->
    <xsl:template match="ab[@type='transliteration']">
        <xsl:variable name="test"><xsl:copy-of select=".//w[not(ancestor::w)] | .//lb"/></xsl:variable>
        <xsl:for-each-group select="$test/*" group-adjacent="boolean(self::lb)">
                    <tr>
                        <td><xsl:apply-templates select="current-group()"/></td>
                    </tr>
            
        </xsl:for-each-group>
    </xsl:template>
    
    <!-- Since we have w's nested inside w's, when we have one of the top ones, wrap an element around it and then take its value, stripping out any spaces. (There are other ways to do this as well.) -->
    <xsl:template match="w[not(ancestor::w)]"><span class="w"><xsl:value-of select="translate(normalize-space(.), ' ', '')"/></span><xsl:text> </xsl:text></xsl:template>
</xsl:stylesheet>

Most of this is pretty straightforward, and I’ve included comments in the XSLT to help anyone wondering why I’m doing something. But if we look at just one bit of it:

  <!-- For original things group by any child nodes or text, and create the groups adjacent to whether there is a linebreak or not. -->
    <xsl:template match="ab[@type='original']">
        <xsl:for-each-group select="child::node()| child::text()"  group-adjacent="boolean(self::lb)">
        <tr><td><xsl:apply-templates select="current-group()"/></td></tr>
        </xsl:for-each-group>
     </xsl:template>
 

The reason this is interesting is the use of @group-adjacent=”boolean(self::lb)”. I’m using the truth or falsity of whether the current node is a line-break element as the test for grouping adjacent nodes. In XSLT2 there are basically two types of grouping conditions: patterns and expressions. @group-starting-with and @group-ending-with require their values to be a pattern, but @group-by and @group-adjacent accept any XPath expression. This means that with those two you can have a bit more fun! With the expression-based attributes, the condition is applied to each item in the population you are grouping in order to calculate a grouping key. With those accepting patterns, the condition must match the specific nodes in the population that will either start or end a newly created group. This is an important distinction to keep in mind, and it means that with group-adjacent you can use something that calculates the key to be matched rather than something that is that key. So in this case we use boolean(self::lb) to test whether the current node is an lb element or not: the lb elements (where the key is true) form their own groups, and each run of adjacent non-lb siblings (where the key is false) is grouped together, giving us one row per line.
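For comparison, here is a sketch of how the same template might look with a pattern-based condition instead. Note that it is not quite equivalent: with group-starting-with each lb begins a group (so you have to filter the lb out of the output yourself), and you lose the empty spacer rows that the lb-only groups produce in the group-adjacent version above:

<xsl:template match="ab[@type='original']">
    <xsl:for-each-group select="node()" group-starting-with="lb">
        <!-- each group (except possibly the first) starts with an lb; drop the lb itself -->
        <tr><td><xsl:apply-templates select="current-group()[not(self::lb)]"/></td></tr>
    </xsl:for-each-group>
</xsl:template>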

Posted in TEI, XML, XSLT | Leave a comment

Ubuntu Twinview Maximizing Windows problem

This is more of a note-to-self. After my recent upgrade to the latest Ubuntu, I had a problem in that my two monitors, when set to ‘twinview’, had the panels, task bars, and maximized windows spanning both monitors. What you really want is for windows to be movable from one monitor to the other, but to stay within a single monitor when maximized.

The solution that I guessed might work, and which it turned out did, was to comment out the ‘metamodes’ option in the Screen section of my xorg.conf, i.e.:


Section "Screen"
Identifier "Screen0"
Device "Device0"
Monitor "Monitor0"
DefaultDepth 24
Option "TwinView" "1"
Option "TwinViewXineramaInfoOrder" "CRT-0"
#Option "metamodes" "CRT-0: 1280x1024 +0+0, CRT-1: 1280x1024 +1280+0"
SubSection "Display"
Depth 24
EndSubSection
EndSection

That sorted out the problem as soon as I logged back in again.

Posted in Ubuntu | Leave a comment

Thunderbird Calendar Automatic Export

Previously I wrote about Thunderbird, DavMail, Exchange, and exporting to Google Calendar, and my system was set up and working fine. Then I upgraded (full wipe and install) to the latest Ubuntu operating system and had to set things up again. Part of the problem was that the Thunderbird Automatic Export add-on wouldn’t work with the new version of Thunderbird. While I know software changes sometimes mean that a plugin will no longer function, I didn’t think this would be a problem for Automatic Export… I mean, all it does is take the calendar you’ve selected and export it, which hopefully isn’t too reliant on the way the program itself works. Hopefully.

It turned out that if I unzipped the Thunderbird plugin package available from the Automatic Export page on the Mozilla add-ons site, I could edit the install.rdf file, which tells Thunderbird about the package. When I did, I found that it had an em:maxVersion value, and all I did was change that to be far past the current version. (Note: there were two of these; I changed both since I wasn’t sure which applied to what.) Zipping the file back up again and renaming it to .xpi was all that was needed for a successful install.
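For anyone wanting to do the same, the relevant part of install.rdf looks something like this sketch (the add-on id here is made up, though the target-application id is Thunderbird’s; the actual file in the package will differ):

<?xml version="1.0"?>
<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:em="http://www.mozilla.org/2004/em-rdf#">
  <Description about="urn:mozilla:install-manifest">
    <em:id>automatic-export@example.org</em:id>
    <em:targetApplication>
      <Description>
        <!-- Thunderbird's application id -->
        <em:id>{3550f703-e582-4d05-9a08-453d09bdfdc6}</em:id>
        <em:minVersion>2.0</em:minVersion>
        <!-- bump this far past the installed version -->
        <em:maxVersion>99.*</em:maxVersion>
      </Description>
    </em:targetApplication>
  </Description>
</RDF>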

Everything now working again perfectly.

Posted in Ubuntu | Leave a comment

Teaching in Helsinki

I was recently invited to Helsinki by Varieng to teach a workshop on TEI XML, specifically concentrating on transcription. The workshop slides and materials are at http://tei.oucs.ox.ac.uk/Oxford/2010-10-helsinki/, though these were largely based on the TEI Summer School 2010 that we taught earlier in the year. We hope to be partnering with Varieng to convert the Helsinki Corpus to TEI P5 XML.

Posted in TEI | Leave a comment

simple dynamic transformation of xml with htaccess, php, and xslt

I often transform from TEI XML to XHTML as part of projects, but in some instances it is difficult to use things like the eXist XML database, Apache Cocoon, or even AxKit to manage this, because the hosting arrangement means that only a limited number of technologies are available.

In most cases these days a Linux-based server will have Apache’s HTTP server installed, and hopefully the Apache mod_rewrite module as well. In addition most hosting, even shared hosting, has PHP installed with libxml for XSL processing. Sadly, this only copes with XSLT 1.0, not XSLT 2.0.

However, one way to use this is to have one’s .htaccess file rewrite incoming URLs to run an xml2html.php conversion.

Basic preceding stuff:

#Turn on Rewriting
RewriteEngine On
RewriteBase /
# Redirect any svn requests 
RewriteRule ^.svn/(.*)$ http://subversion.tigris.org [R]
# utf-8 please
AddDefaultCharset UTF-8
# change directory index to index.xml as default
DirectoryIndex index.xml index.php index.html index.shtml
#ErrorDocuments
ErrorDocument 404 /unavailable.html
ErrorDocument 403 /forbidden.html

Here we start by turning the RewriteEngine on and setting the RewriteBase to the root of the domain. I’ve also got a RewriteRule that takes any requests for stuff in subversion directories and redirects them to the subversion site instead. (Though actually I’m thinking of having that just 404 or 403 instead.) After that we set the default character set to UTF-8, change the default directory index file names, and specify some error documents for 404s and 403s. (These are of course actually unavailable.xml and forbidden.xml, and are transformed by the rule further down.)

After this comes the bit where the rewriting of requests for HTML files get turned into parameters on a PHP script:

# If I ask for .xhtml then give me xml2html
RewriteRule ^(.*).xhtml$ /scripts/xml2html.php?xml=../$1.xml&xsl=site.xsl&%{QUERY_STRING} [L]
# If I have asked for .html then if the .html file exists, then give it.
RewriteRule   ^(.*)\.html$              $1      [C,E=WasHTML:yes]
RewriteCond   %{REQUEST_FILENAME}.html -f
RewriteRule   ^(.*)$ $1.html [L]
# else provide XML dynamically with xml2html.php
RewriteCond   %{ENV:WasHTML}            ^yes$
RewriteCond   %{REQUEST_FILENAME}.xml -f
RewriteRule ^(.*)$ /scripts/xml2html.php?xml=../$1.xml&xsl=site.xsl&%{QUERY_STRING} [L]

The first of these says that when I ask for any URL on the site ending in .xhtml, an XML file of the same name should be transformed using the xml2html.php script and the site.xsl stylesheet, both in the /scripts directory. This is just for me, so that I can force it to run the transformation when a foo.xml and a foo.html exist in the same directory.

After this, the next RewriteRule matches anything asked for on the site that ends in .html and takes the first bit of it (the path and filename). Simultaneously it uses ‘C’ to chain this with the next rule and ‘E’ to set an environment variable ‘WasHTML’ to ‘yes’. Then there is a RewriteCond testing whether this filename with a .html extension exists. If so, it rewrites the request to that filename.html and ends. If not, it tests whether the environment variable WasHTML is set to yes (because remember we’ve taken off the extension), and whether the filename we’ve asked for, ending in .xml, exists. If so, it runs the script, giving the filename with .xml as the xml parameter and, in this case, site.xsl (in the same scripts directory) as the xsl.

That .htaccess file as a whole looks like:

#Turn on Rewriting
RewriteEngine On
RewriteBase /
# Redirect any svn requests 
RewriteRule ^.svn/(.*)$ http://subversion.tigris.org [R]
# utf-8 please
AddDefaultCharset UTF-8
# change directory index to index.xml as default
DirectoryIndex index.xml index.php index.html index.shtml
#ErrorDocuments
ErrorDocument 404 /unavailable.html
ErrorDocument 403 /forbidden.html
# If I ask for .xhtml then give me xml2html
RewriteRule ^(.*).xhtml$ /scripts/xml2html.php?xml=../$1.xml&xsl=site.xsl&%{QUERY_STRING} [L]
# If I have asked for .html then if the .html file exists, then give it.
RewriteRule   ^(.*)\.html$              $1      [C,E=WasHTML:yes]
RewriteCond   %{REQUEST_FILENAME}.html -f
RewriteRule   ^(.*)$ $1.html [L]
# else provide XML dynamically with xml2html.php
RewriteCond   %{ENV:WasHTML}            ^yes$
RewriteCond   %{REQUEST_FILENAME}.xml -f
RewriteRule ^(.*)$ /scripts/xml2html.php?xml=../$1.xml&xsl=site.xsl&%{QUERY_STRING} [L]

The PHP script this is using (which I borrowed from a colleague) uses the http://www.php.net/manual/en/book.xsl.php libxml based XSLT processing in PHP. It is fairly short and consists of:

<script language="php">
#Basic check for directory/site traversal 
if(preg_match('/\.\.\/\.\./',$_REQUEST['xml'])) { die("invalid input"); }
if(preg_match('/http/',$_REQUEST['xml'])) { die("invalid input"); }
if(preg_match('/http/',$_REQUEST['xsl'])) { die("invalid input"); }
if(preg_match('/\.\.\//',$_REQUEST['xsl'])) { die("invalid input"); }
#load xsl document into XsltProcessor
  $xp = new XsltProcessor();
  $xsl = new DomDocument;
  $xsl->load($_REQUEST['xsl']);
  $xp->importStylesheet($xsl);
#load xml document
  $xp->setParameter( null, 'xml', $_REQUEST['xml']);
  $xml_doc = new DomDocument;
  $xml_doc->load($_REQUEST['xml']);
#Process any xincludes
  $xml_doc->xinclude();
#Transform the XML with the XSL or put out error
  if ($html = $xp->transformToXML($xml_doc)) {
      echo $html;
  } else {
      trigger_error('XSL transformation failed.', E_USER_ERROR);
  }
</script>

The first bit of this is just a security precaution against directory (or site) traversal, rejecting anything that has ‘../..’ or ‘http’ in it. I’m sure there are better ways to do this, but just checking the xml and xsl parameters seemed the easiest. I could have made a function and passed each of them to it, or had the regex look for either of the two things, but it all works out the same and doesn’t seem to have much of a speed implication. Then we start a new XsltProcessor() and a new xsl DomDocument, load in the XSL file given in the xsl parameter, and also pass the ‘xml’ parameter to the processor so that we can use it in our XSLT if we want. Then we start a new xml_doc DomDocument, load in the requested XML file, and process any XIncludes in that XML file. We then transform the XML doc to HTML with transformToXML, or otherwise trigger an error and output that.
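I haven’t shown site.xsl because it varies from site to site, but as a sketch of the sort of thing it might contain (entirely hypothetical, and necessarily XSLT 1.0 given libxml’s limitations):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0" version="1.0">
    <xsl:output method="html" encoding="UTF-8"/>
    <!-- the 'xml' parameter that xml2html.php passes in, should we want it -->
    <xsl:param name="xml"/>
    <xsl:template match="/">
        <html>
            <head>
                <title><xsl:value-of select="descendant::tei:title[1]"/></title>
            </head>
            <body>
                <xsl:apply-templates select="descendant::tei:body"/>
            </body>
        </html>
    </xsl:template>
    <!-- turn TEI paragraphs into HTML ones; everything else falls through -->
    <xsl:template match="tei:p">
        <p><xsl:apply-templates/></p>
    </xsl:template>
</xsl:stylesheet>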

This is a fairly lightweight way to transform XML to HTML on the fly using the technologies (PHP and .htaccess) that most hosting solutions provide. I’m using something like this on one of my personal sites and it is in use in a slightly different form in a number of work sites.

Hope it is useful to someone.

Posted in other, TEI, XSLT | Leave a comment

For Loops in XSLT2

A colleague asked me the other day about the proper way to do for-loops in XSLT2 or more specifically in XPath2. He knows all about xsl:for-each and xsl:for-each-group iteration over things, and of course recursively calling a template while passing a variable to let you count how many times you’ve done it.

I’ve always found that kind of recursion annoying, and in XSLT2, if you just want to do something a number of times, it is also unnecessary: XPath2 allows you to do XQuery-like for-loops as part of your path statement. Take this short and stupid XSLT2 stylesheet for example:

 
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">
    <xsl:output indent="yes"/>
        <xsl:param name="start" select="1"/>
        <xsl:param name="end" select="10"/>
	<xsl:variable name="from" select="$start"/>
	<xsl:variable name="to" select="$end"/>
    
    <xsl:template match="/" name="main">
        <foo>
        <xsl:for-each select="
            for $i in $from to $to
            return $i
            ">
            <blort><xsl:value-of select="concat('value is: ', . )"/></blort>
        </xsl:for-each>
        </foo>
    </xsl:template>

</xsl:stylesheet>

Let’s break this simple example down a bit.

First we have some starting stuff:

 
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">
    <xsl:output indent="yes"/>
        <xsl:param name="start" select="1"/>
        <xsl:param name="end" select="10"/>
	<xsl:variable name="from" select="$start"/>
	<xsl:variable name="to" select="$end"/>
<!-- ... -->
</xsl:stylesheet>

All this is doing is starting up the stylesheet, saying that we want the result indented, and saying that there are two parameters ‘start’ and ‘end’ which if they aren’t set should be ‘1’ and ’10’ respectively. I then copy these to global variables ‘from’ and ‘to’ just to make my life easier.

 
    <xsl:template match="/" name="main">
        <foo>
        <xsl:for-each select="
            for $i in $from to $to
            return $i
            ">
            <blort><xsl:value-of select="concat('value is: ', . )"/></blort>
        </xsl:for-each>
        </foo>
    </xsl:template>

The whole template here is fairly simple. It either matches the root node ‘/’ or can be called by its name (i.e. with “saxon -it:main for-loops.xsl”). We then output a ‘foo’ root element for our output document. Then we have an xsl:for-each which isn’t really the for-loop itself but does something for each iteration of the loop: each time we get a new number we put out a ‘blort’ element whose content says what the value is. But in order to create the series which the xsl:for-each iterates over, we have made our select statement “for $i in $from to $to return $i”. This says: for a new variable ‘i’ taking each value in the range from the ‘from’ variable to the ‘to’ variable, give us back the value of ‘i’. So in our case it will create a series from 1 to 10 for the xsl:for-each to operate on.
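As a further sketch (the values are illustrative), the return clause can compute something per iteration, and for-clauses can be combined, so the following puts out the nine products of two small ranges:

<xsl:for-each select="
    for $i in 1 to 3, $j in 1 to 3
    return $i * 10 + $j
    ">
    <blort><xsl:value-of select="concat('value is: ', .)"/></blort>
</xsl:for-each>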

Hopefully that is the last time I hear that XSLT can’t do for-loops. I’ve put this here to remind me later when I’ve forgotten.

Posted in XSLT | Leave a comment

ENRICH

Until December 2009 I worked on the ENRICH project, and as it has now finished, I thought that I should reflect on some of what the project has done and the aspects we’ve been involved with here in Oxford. For the most part the project has been attempting both to aggregate manuscript descriptions into the Manuscriptorium framework and to standardise these manuscript descriptions to a single, common, agreed format. For the background to the ENRICH project, see the website, and especially this article on the ENRICH Project and TEI P5. A list of deliverables is also available.

Standardisation of Specification

The workpackage we were most involved with, partly because we were leading it, was workpackage 3 whose object was:

To ensure interoperability of the metadata used to describe all the shared resources by analysing the various standards used by different partners and ensuring their mapping to a single common format, which will be expressed in a way conformant with current standards.

As one might expect, in practice this common format was a more tightly constrained subset of the TEI recommendations on Manuscript Description. The difficulty in any such endeavour is getting coherent agreement among a large number of representatives on a wide variety of customisations. As part of this process we undertook a comparison of the MASTER, TEI P5, and Manuscriptorium formats. A number of revisions were made to the ENRICH schema through the course of the project. Deliverable D3.1 was a “Revised TEI-Conformant specification” available in a number of schema languages. The ENRICH Schema is publicly and freely available as a DTD, RELAX NG, and W3C Schema, but we recommend the RELAX NG format:

Documentation

The next deliverable, D3.2, was “Documentation and training materials for use with the ENRICH Specification”. Because the TEI ODD had been written with documentation in it, the same TEI ODD which generated the schemas above could also be used to generate project-specific documentation. This meant that in addition to the documentation written specifically for the ENRICH project, it had access to all the internationalised reference material available in the TEI Guidelines as a whole, and that we could produce versions of the documentation which, while still primarily in English, contained glosses of the elements in another language. So for example:

<msIdentifier> (manuscript identifier) contains the information required to identify the manuscript being described.

in the English documentation for the ENRICH Specification became, in the French:

<msIdentifier> (identifiant du manuscrit) Contient les informations requises pour identifier le manuscrit en cours de description.

While this is admittedly of limited benefit, since the bulk of the documentation remains in English, having the element descriptions in their own language can aid comprehension for those reading in a foreign language. The ENRICH Specification documentation is available in the following languages and formats:

(HTML needs odd.css and tei.css)
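To give a sense of the shape of such an ODD (a much-reduced sketch: the module names are real TEI modules, but the ident is hypothetical and the actual ENRICH ODD is far larger), prose documentation sits alongside the schema declarations in one TEI document, from which both the schemas and the documentation above are generated:

<text xmlns="http://www.tei-c.org/ns/1.0">
   <body>
      <p>Project-specific prose documentation goes here, interleaved with the declarations.</p>
      <schemaSpec ident="enrich-like" start="TEI">
         <moduleRef key="tei"/>
         <moduleRef key="header"/>
         <moduleRef key="core"/>
         <moduleRef key="textstructure"/>
         <moduleRef key="msdescription"/>
      </schemaSpec>
   </body>
</text>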

Training Materials

Training materials were also created as part of D3.2 and took the form of slide sets as PDF, HTML, and TEI XML that project partners were free to take, modify, and use in teaching the ENRICH schema:

Migration Tools

While the primary migration tools from other formats to the ENRICH Specification were undertaken by the lead technical partner, we were tasked with undertaking a case-study-based analysis of the construction of migration tools and with making recommendations to the project based on these. The migration case studies focussed on MASTER records that we had accumulated as a testbed and EAD records given to us by the Bodleian Library. The Case Studies on Migration to the ENRICH Specification and all their materials are freely available online. The case studies examined methods for transformation of MASTER and EAD records to TEI P5, mainly using XSLT-based conversions. The report on the Development and Validation of Migration Tools is available online.

ENRICH Garage Engine

Originally D3.4 of the ENRICH Project was a “Report on METS/TEI interoperability, best practice with respect to handling of Unicode and non-Unicode data in Manuscriptorium and P5 conversion techniques”. However, after much investigation it was determined that the use of METS was unnecessary for our extension to the Manuscriptorium platform. (This is not to say that it would not have been suitable for this or other uses.)

Part 1 of D3.4 and some of the work on it was replaced by the development of the ENRICH Garage Engine (EGE) and a report on the Documentation and Use of the ENRICH Garage Engine. This is a primarily web-service-based format conversion engine, developed by PSNC, which enables document conversion between a number of formats. The engine consists of a web service and a website frontend, and underneath of a recognizer, a validator, and a converter. As the EGE website explains:

  • Recognizer – this plug-in is responsible for the recognition of the Internet Media Type (MIME type) of the given input data. For example, it will receive the input data and state that the input data has text/xml MIME type. The recognized data may then be further validated to check the format of the data.
  • Validator – this plug-in is responsible for validation of the input data. For example it may be used to validate the ENRICH TEI P5 data stored in a MIME type (e.g. text/xml) either received from end user or created by one of the converters. The following notation is assumed: ENRICH TEI P5 (text/xml) – it means that validator is able to validate ENRICH TEI P5 format encoded in text/xml.
  • Converter – this plug-in is responsible for converting the input data. It may be, for example, conversion from XML to Word, conversion from Word to PDF, conversion of the XML from one form to another (e.g. MASTER -> ENRICH TEI P5) or even cleaning the input data (e.g. removing redundant information).

You can try the EGE at its website:

ENRICH gBank and Non-Unicode Characters

One problem encountered in the migration of legacy documents to the ENRICH Specification is that these records may use characters which are not currently present in Unicode. The Medieval Unicode Font Initiative (MUFI) campaigns for inclusion of some of these specialized characters in the Unicode Specification. The second half of the D3.4 deliverable we produced was a report on Best practice in handling non-Unicode characters. This included the description of a software tool, ENRICH gBank, produced to assist in the normalization and documentation of non-Unicode characters. It contains a list of all of MUFI’s non-Unicode characters in the Private Use Area (PUA), images of them, and a representation of each using a TEI <char> element. For the most part these were automatically generated from the MUFI spec. Conversion involved exporting the Adobe InDesign file as RTF, converting this to basic presentational TEI XML, and running a transformation script on this to extract just the data we needed for our own tables. In addition, the PUA references were used, in conjunction with the Andron Scriptor Web font, to produce first SVG files (using Apache Batik) and then specific-sized PNG files. This allowed us to have character images for each of the characters in the PUA.
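As a sketch of what such a <char> entry looks like (the xml:id, name, and PUA codepoint here are illustrative rather than copied from gBank):

<charDecl xmlns="http://www.tei-c.org/ns/1.0">
   <char xml:id="r-rotunda">
      <charName>LATIN SMALL LETTER R ROTUNDA</charName>
      <!-- where the character lives in the Private Use Area (codepoint illustrative) -->
      <mapping type="PUA">U+F20E</mapping>
      <!-- a plausible standard-Unicode fallback -->
      <mapping type="standard">r</mapping>
   </char>
</charDecl>

A transcription can then point at the declaration with <g ref="#r-rotunda"/> wherever the character occurs.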

You can see the ENRICH gBank on the ENRICH beta website at:

ENRICH Templates

As part of the ENRICH teaching materials we also created some ENRICH templates, to assist those who wanted a guide as to the kind of material that should be present in an ENRICH manuscript description.

A number of projects have taken these templates as starting points to customise further in their own use of the ENRICH Specification or TEI P5 msDesc.
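
For illustration (a heavily abbreviated sketch of my own rather than the actual ENRICH template), such a template essentially provides an empty msDesc skeleton indicating which sections a cataloguer is expected to fill in:

    <msDesc xmlns="http://www.tei-c.org/ns/1.0">
      <msIdentifier>
        <country/><settlement/><repository/><idno/>
      </msIdentifier>
      <msContents>
        <msItem><author/><title/></msItem>
      </msContents>
      <physDesc>
        <objectDesc><supportDesc><support/></supportDesc></objectDesc>
      </physDesc>
      <history>
        <origin/><provenance/>
      </history>
    </msDesc>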

Conclusions

Working on any large and dispersed EU project has its benefits and drawbacks. In the case of ENRICH we were able to draw on a wide range of experience, technologies, and data because of the diverse nature of the project. One of the major drawbacks stems from being partnered with commercial organisations. While all the work they did in their development and support of the Manuscriptorium platform was top notch, they naturally have the commercial interests of their business model at the forefront of their activities. This meant, for example, that while the ENRICH Specification and all the software, documentation, training materials and tools that we (OUCS) produced were licensed under an open licence, the same was not true of the main commercial company behind Manuscriptorium. The platform itself is not open source: at no point were we able to see its inner workings, nor could we contribute patches or bug fixes to it. This meant that any development we did took place in an isolated manner and at arm’s length.

Fair enough: the EU (via its eContent+ programme) presumably funded this project with the understanding that this would be the case. However, I feel that it is wrong for the EU to fund projects with commercial partners where those partners are not required to release the products of the funded work under an open licence of some sort. I’m not in any way against these commercial companies, but there are plenty of workable business models which would still enable them to profit from materials they have developed and released under an open licence.

The ENRICH project has produced a lot that is good and interesting, and one of its major achievements is the network of individuals, projects, and institutions which are all approaching medieval manuscript description in the same manner. Although ENRICH (as a schema or project) is certainly not the last word in large-scale projects for the aggregation and standardization of medieval manuscript descriptions, it is a good development and milestone along that road.

List of Deliverable Reports

Posted in TEI, XML | 4 Comments

Thunderbird + Lightning Nexus Calendar Export to Google Calendar

There are plenty of ways to sync one’s work calendar (Nexus, Oxford’s version of Exchange) with Google if you are using Windows and Outlook. However, I’m using Ubuntu Linux. The solution I’ve chosen for getting mail and shared calendaring is Thunderbird + Lightning + Davmail. This works, but has idiosyncrasies, such as not letting you share calendars (though you can use calendars you have already shared through another method such as Outlook 2007 or OWA-Messageware).

Let’s be clear here, I do not need full synchronisation. What I want to do is:

  • when looking at my Google Apps calendars (which I intentionally keep separate from my work ones) I want at least a read-only view of my work calendars. Basically I just want to see them, so I know that work activities are not overlapping with personal ones.
  • make my calendars available read-only to specific other people who either are not inside *.ox.ac.uk or whose departments do not use the calendaring aspects of Nexus

The solution I’ve come up with is an ad-hoc one involving a Mozilla Thunderbird extension called Automatic Export. Once it is installed and its icon added to the toolbar, you can select a cyclical export from the icon’s dropdown menu. I have this set to export my calendar every 10 minutes. As long as you export to a web-accessible location, Google Calendar can subscribe to the file. In my case the calendar lives on a remote server, so I have a shell script that scp’s it to the correct location every 10 minutes… so at the very worst it is 20 minutes out of date. On the Google side you just subscribe to the remote .ics file… though it sometimes takes a while for Google to realise it is there.

Drawbacks

  • The export only works while a copy of Thunderbird that is set up to do it is actually running. So, for example, Thunderbird on my laptop is not set to do this, and if I add an appointment with OWA-lite it doesn’t end up in my Google calendar until I load up Thunderbird at work on Monday.
  • It is fairly insecure. The entire calendar is exported as an .ics file that is world-readable. While it lives at a fairly obscure URL, security by obscurity isn’t really security.
  • I tried putting it on password-protected WebDAV storage, but even when given the username/password in the URL, Google had problems finding it.
  • Private events are shared with those with whom you share the calendar… so they basically see anything you see.
  • You need a constantly web-accessible location in which to put the calendar; exporting it to your desktop machine isn’t sufficient, since Google will think the calendar has disappeared whenever the machine is off. (And we all hibernate our desktops and use OUCS’s Wake-On-LAN service to wake them up when needed… don’t we?)

I don’t know if this will be useful to anyone else… but that is how I export my Thunderbird + Lightning + Davmail Nexus calendar to my Google Apps calendar.

-James

Posted in Uncategorized | 1 Comment

TEI-Comparator

I have just finished my poster for DRHA 2009, which is about the TEI-Comparator that RTS worked on for the Holinshed Project. My poster is available online in PDF and PNG formats. (Though, for the record, it was created in Inkscape as an SVG file.)

The poster discusses the creation of the tool for the Holinshed Project at the University of Oxford. Holinshed’s Chronicles of England, Scotland, and Ireland was the crowning achievement of Tudor historiography and an important historical source for contemporary playwrights and poets. Holinshed’s Chronicles was first printed in 1577, and a second, revised and expanded edition followed in 1587. EEBO-TCP had already encoded a version of the 1587 edition, and the Holinshed Project specially commissioned them to transcribe the 1577 edition using the same methodology. The resulting texts were converted to valid TEI P5 XML and used as the base for a comparison engine, known as the TEI-Comparator, built to assist the editors in understanding the textual differences between the two editions.

Using the TEI-Comparator involves several stages. The first was to decide which elements in the two TEI XML files should be compared. In this case the appropriate granularity was at the paragraph (and paragraph-like) level, since the project was primarily interested in how portions of text were re-used, replaced, expanded, deleted, and modified from one edition to another. This first stage ran a short preparatory script which added unique namespaced IDs to each relevant element in both TEI files. It is the proper linking of pairs of these IDs which the TEI-Comparator was designed to facilitate.
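
For illustration (the ID scheme and the sample text here are invented; the real script’s output may differ in detail), after this stage each paragraph in each edition carries its own identifier, and the comparator’s job is to pair them up:

    <!-- from the 1577 edition -->
    <p xml:id="ed1577_p0412">The king tooke ship and sailed to Calais.</p>

    <!-- from the 1587 edition: the link to be confirmed is ed1577_p0412 to ed1587_p0437 -->
    <p xml:id="ed1587_p0437">The king thereupon tooke ship and sailed to Calais.</p>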

The second stage was to prepare a database of initial comparisons between the two texts, using a bespoke fuzzy text-comparison n-gram algorithm designed by Arno Mittelbach (the technical lead for the TEI-Comparator). This algorithm, called Shingle Cloud, transforms both input texts (needle and haystack) into sets of n-grams. It matches the haystack’s n-grams against the needle’s and constructs a huge binary string recording where they match. This binary string is then interpreted to determine whether the needle can be found in the haystack and, if so, where. The algorithm runs in linear time and, given the language of the originals, was found to work better if the strings of text were regularized (including removal of vowels).
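
The following toy version of that idea is my own sketch, not Mittelbach’s implementation: the function name, sample strings, and n-gram size are all invented. It builds the ‘shingle cloud’ binary string for a (vowel-stripped) needle and haystack in XSLT 2.0:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:my="urn:example:shinglecloud" version="2.0">
      <xsl:output method="text"/>

      <!-- split a string into its overlapping n-grams -->
      <xsl:function name="my:ngrams" as="xs:string*">
        <xsl:param name="s" as="xs:string"/>
        <xsl:param name="n" as="xs:integer"/>
        <xsl:sequence select="for $i in 1 to string-length($s) - $n + 1
                              return substring($s, $i, $n)"/>
      </xsl:function>

      <xsl:template name="main">
        <!-- regularized (here vowel-stripped) texts; real input would be paragraph contents -->
        <xsl:variable name="needle" select="'thqckbrwnfx'"/>
        <xsl:variable name="haystack" select="'smtxtbfrthqckbrwnfxndsmftr'"/>
        <xsl:variable name="needleGrams" select="my:ngrams($needle, 4)"/>
        <!-- one bit per haystack n-gram: 1 if it also occurs in the needle -->
        <xsl:value-of select="string-join(
            for $g in my:ngrams($haystack, 4)
            return (if ($g = $needleGrams) then '1' else '0'), '')"/>
      </xsl:template>
    </xsl:stylesheet>

Run (for instance with Saxon’s -it:main option), this prints a string of 0s and 1s; a dense run of 1s marks the stretch of the haystack where the needle probably occurs, which is how candidate matches can be located and scored.
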
The third stage was for the research assistant on the project to confirm, remove, annotate, or create links between one edition and the other, using a custom interface to the TEI-Comparator constructed in Java with the Google Web Toolkit API. The final stage was to produce output from the RA’s work by generating two standalone HTML versions of the texts, linked together on the basis of the now-confirmed IDs.
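
As a rough sketch of that final step (my illustration, not the project’s actual stylesheet: the link-table structure and file names are invented), an XSLT pass over one edition can emit an HTML anchor wherever a confirmed link exists:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0" version="1.0">
      <!-- a hypothetical table of confirmed links:
           <link from="ed1577_p0412" to="ed1587_p0437"/> -->
      <xsl:param name="links" select="document('links.xml')"/>

      <xsl:template match="tei:p[@xml:id]">
        <xsl:variable name="to" select="$links//link[@from = current()/@xml:id]/@to"/>
        <p id="{@xml:id}">
          <!-- where a link was confirmed, point at the matching paragraph in the other edition -->
          <xsl:if test="$to">
            <a href="holinshed1587.html#{$to}">[1587]</a>
          </xsl:if>
          <xsl:apply-templates/>
        </p>
      </xsl:template>
    </xsl:stylesheet>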

The TEI-Comparator will shortly be publicly available on SourceForge, with documentation and examples, to make it easy for others to re-purpose the software for similar uses and to submit bugs and requests for future development.

Although known as the ‘TEI-Comparator’, the program does not require TEI input: it works with XML files of any vocabulary, as long as the elements being compared contain sufficiently distinctive text.

For more information about the TEI-Comparator e-mail: tei@oucs.ox.ac.uk

Posted in TEI | 6 Comments

addingIDs

Rehdon asked me about giving @xml:id attributes to things, so I whipped up this quick XSLT stylesheet. Some people prefer to use generate-id() to get a truly random and unique ID without semantic baggage; in many cases, where IDs are exposed to the public, I prefer ones which make sense and are human-readable.

Warning: the stylesheet does no testing for clashes before applying an @xml:id, and this is a distinct flaw. If something other than a <p> element already has xml:id="p5", it will still add ‘p5’ as the @xml:id of the fifth paragraph. This will produce an XML document that is invalid, since one of the requirements of @xml:id is that its value be unique within the document. It would also number paragraphs in other namespaces. (This may be a bug or a feature depending on your outlook.) It numbers from tei:text, so if you don’t have that element in your document you should change the from attribute on xsl:number.

The XSLT stylesheet takes a parameter ‘e’ to which you can pass the local-name of the element in question. It assumes ‘p’ otherwise, but you could use it to number div, head, w, or really any element just by passing e=div (or whatever).

Update: Rehdon asked about a configurable optional prefix to the ID and a 4-digit zero-padded number for it. So I changed the script to do that.

 
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0"
        xmlns="http://www.tei-c.org/ns/1.0"
        exclude-result-prefixes="tei"
        version="1.0">
      <!-- Parameter to pass to the stylesheet; assumes 'p' if nothing is given -->
      <xsl:param name="e" select="'p'"/>
      <!-- An optional prefix string: include a separator, like 'text1_', to get 'text1_p0005' -->
      <xsl:param name="pre"/>

      <!-- typical copy-all template (node() covers text, comments, and processing instructions) -->
      <xsl:template match="@*|node()" priority="-1">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
      </xsl:template>

      <!-- higher-priority template to match elements -->
      <xsl:template match="*">
        <xsl:copy>
          <!-- if the local-name is the element we've been passed, and there is no @xml:id attribute -->
          <xsl:if test="local-name() = $e and not(@xml:id)">
            <!-- number matching elements at any level from tei:text, zero-padded to four digits -->
            <xsl:variable name="num"><xsl:number level="any" from="tei:text" format="0001"/></xsl:variable>
            <!-- then create an @xml:id concatenating the prefix, the local-name, and the number -->
            <xsl:attribute name="xml:id"><xsl:value-of select="concat($pre, local-name(), $num)"/></xsl:attribute>
          </xsl:if>
          <!-- apply any other templates (i.e. copy everything else) -->
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>
    </xsl:stylesheet>
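
For example, assuming you have saved the stylesheet as addxmlids.xsl (the file names here are mine), you could run it with xsltproc to number div elements with a prefix:

    xsltproc --stringparam e div --stringparam pre text1_ addxmlids.xsl input.xml > output.xml

This would give, for instance, the third div inside tei:text an @xml:id of text1_div0003.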
    

Hope that is useful. I’ll try to remember to add it to the TEI wiki as well.

Posted in TEI, XSLT | 2 Comments