Adding word-level markup

Rehdon, snail, and others have recently asked me about marking up words inside an element that may itself contain markup (sometimes wrapping more than one word), so I thought I’d write it up.

So for example we might have an XML file that looked like:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<line>This is a test</line>
<line>Only a <seg type="foo">test</seg> ok?</line>
<line>And <seg>so; is</seg> this as well.</line>
</root>

Let’s say we want to mark up each of the whitespace-separated words, and for some reason any stray semi-colons, as words with a <w> element. What we can use is <xsl:analyze-string> and a regex. For example:

<xsl:template match="line//text()">
 <xsl:analyze-string regex="(\w+|;+)" select=".">
 <xsl:matching-substring><w><xsl:value-of select="."/></w></xsl:matching-substring>
 <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
 </xsl:analyze-string>
</xsl:template>

In this example we’re matching any text() node anywhere inside a <line> element, and if part of it matches the \w+ regex (or is a run of semi-colons) that part gets wrapped in a <w> element. Anything that doesn’t match is output unchanged. Because this is line//text() (as opposed to line/text()) it also matches text inside grandchild elements and further down.

So assuming we have a copy-all template something like:

<xsl:template match="@*|node()" priority="-1">
  <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
</xsl:template>

(where we basically copy any nodes and attributes unless something else matches them) then we should get the result:

<root>
  <line><w>This</w> <w>is</w> <w>a</w> <w>test</w></line>
  <line><w>Only</w> <w>a</w> <seg type="foo"><w>test</w></seg> <w>ok</w>?</line>
  <line><w>And</w> <seg><w>so</w><w>;</w> <w>is</w></seg> <w>this</w> <w>as</w> <w>well</w>.</line>
</root>

Of course that is only the beginning, as your documents will probably have weird special cases and punctuation that you want to handle differently. And also it would, of course, be useful to create an @xml:id attribute for each word element.
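As a minimal sketch of that last idea (not a definitive approach), the template can build an @xml:id from generate-id() of the text node plus position() within the analysed string. The 'w-' prefix and the $nodeId variable name are my own assumptions, and the ids generate-id() produces differ between processors and runs, so they are only useful within a single transformation.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- copy-all template, as above -->
  <xsl:template match="@*|node()" priority="-1">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <xsl:template match="line//text()">
    <!-- inside xsl:analyze-string the context item is a string,
         so capture the text node's generated id beforehand -->
    <xsl:variable name="nodeId" select="generate-id(.)"/>
    <xsl:analyze-string select="." regex="(\w+|;+)">
      <xsl:matching-substring>
        <!-- position() counts matching and non-matching substrings together,
             so the numbers are unique within this text node but not consecutive -->
        <w xml:id="{concat('w-', $nodeId, '-', position())}">
          <xsl:value-of select="."/>
        </w>
      </xsl:matching-substring>
      <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

If you need ids that stay the same from run to run, a second pass that numbers the <w> elements in document order would probably serve better.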

-James


Evaluate a string as an XPath

Looking at ways to process a suggested change in TEI P5, I wanted to test that there is a straightforward way to evaluate a string that exists in a document as if it was an XPath you had included in your document.

So say I have a made-up document that stores, as bits of text, some XPaths pointing into that very document.

Input

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <paths>
        <path>/foo/blort/wibble[1]</path>
        <path>/foo/blort/wibble[2]</path>
        <path>//*[@xml:id='wibNum2']/splat/@att</path>
    </paths>
    <blort>
        <wibble>test text 1</wibble>
        <wibble>Another wibble </wibble>
        <wibble xml:id="wibNum2">This is <splat att="value1">a
            test</splat></wibble>
    </blort>
</foo>

To grab these and evaluate them as XPaths you unfortunately need to use an extension function in Saxon, saxon:evaluate(). For example, in this stylesheet:

XSLT

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0" xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="#all">
    <xsl:output indent="yes"/>

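    <xsl:template match="/foo">
        <foo>
            <xsl:for-each select="paths/path">
                <out>
                    <!-- evaluate the text of each path element as an XPath expression -->
                    <xsl:value-of select="saxon:evaluate(.)"/>
                </out>
            </xsl:for-each>
        </foo>
    </xsl:template>
</xsl:stylesheet>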

This should produce the output:

Output

<?xml version="1.0" encoding="UTF-8"?>
<foo>
  <out>test text 1</out>
  <out>Another wibble </out>
  <out>value1</out>
</foo>

Note that this relies on the saxon:evaluate() extension function. There are similar extensions in a variety of other implementations, including ones for XSLT1.
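For comparison, XSLT 3.0 defines an xsl:evaluate instruction as an optional feature, so where your processor supports it no vendor extension is needed. The following is only a sketch under that assumption, run against the same made-up input. The variable name $result is mine.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0" exclude-result-prefixes="#all">
    <xsl:output indent="yes"/>

    <xsl:template match="/foo">
        <foo>
            <xsl:for-each select="paths/path">
                <!-- evaluate the stored string as XPath, with the document node
                     as context item so that absolute paths resolve -->
                <xsl:variable name="result" as="item()*">
                    <xsl:evaluate xpath="string(.)" context-item="/"/>
                </xsl:variable>
                <out><xsl:value-of select="$result"/></out>
            </xsl:for-each>
        </foo>
    </xsl:template>
</xsl:stylesheet>
```

The context-item attribute matters here: without it the evaluated expression has no context item, and a path starting with / cannot be resolved.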

-James


XSLT2 collection() with dynamic collections from directory listings

Something I didn’t know about XSLT2’s collection() function. I had previously used it in the form:

<xsl:variable name="files" select="collection('docs.xml')"/>

where docs.xml has a structure of:

<?xml version="1.0"?>
<collection>
    <doc href="blort1.xml"/>
    <doc href="blort2.xml"/>
</collection>

Via the variable you can then address the structure of those files, blort1.xml and blort2.xml, and iterate over them. For example, you can do something like:

<xsl:for-each select="$files/tei:TEI/tei:text/tei:div">
  <xsl:apply-templates mode="TOC" select="tei:head"/>
</xsl:for-each>

Ok… I already knew how to do that and have used it to run XSLT on a whole raft of files. To get the docs.xml file I used to run “xmlstarlet ls” and then I have a dir2collection.xsl that transforms its output to the correct format.

However, what I didn’t know is that I didn’t need to bother creating the collection file at all. Saxon can build the collection itself from a query parameter on the URI that you hand collection(). That is, you can do something like:

<xsl:variable name="files" select="collection('../foo/?select=blor*.xml')"/>

And $files is then addressable in the same way as if you had made a collection document of all the files matching blor*.xml in the directory ../foo/ (and of course you can just use *.xml).

But wait, that’s not all. You can get a bit more complicated about it, pass the path as a parameter, and supply the collection() function extra parameters. So something like:

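    <!-- directory to collect from; override it when invoking the stylesheet -->
    <xsl:param name="path2collection">../foo/</xsl:param>
    <!-- Saxon-style collection URI: every *.xml file, descending into subdirectories.
         Note that '../' is prepended to whatever the parameter contains. -->
    <xsl:variable name="path"
        select="concat('../',$path2collection,'?select=*.xml;recurse=yes;on-error=warning')"/>
    <xsl:variable name="docs" select="collection($path)"/>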

From then on, $docs contains a recursive collection of every XML file under the path you give in the path2collection parameter.
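To check what actually ended up in such a collection, something like the following sketch may help. It assumes Saxon's ?select= directory syntax as described above. The report and file element names are made up for illustration, and I have left out the extra '../' prefix for simplicity.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="xml" indent="yes"/>

    <!-- hypothetical default; adjust to your own directory layout -->
    <xsl:param name="path2collection">../foo/</xsl:param>
    <xsl:variable name="docs"
        select="collection(concat($path2collection, '?select=*.xml;recurse=yes;on-error=warning'))"/>

    <xsl:template match="/">
        <report count="{count($docs)}">
            <xsl:for-each select="$docs">
                <!-- document-uri() gives the URI each document was loaded from;
                     name(*) is the name of its root element -->
                <file uri="{document-uri(.)}" root="{name(*)}"/>
            </xsl:for-each>
        </report>
    </xsl:template>
</xsl:stylesheet>
```

Relative collection URIs are resolved against the stylesheet's base URI, so the default path is relative to where the stylesheet lives rather than to the source document.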

Isn’t that fun? Ok, maybe only me.


XIncluding portions of TEI Documents

‘Leoba’ on another occasion asked me what to do when multiple files need to refer to the same textDesc, msDesc, listPerson or similar elements in their teiHeader.

To me, this is the canonical example use-case for W3C XInclude. You can store the individual bits anywhere you want on the web, and point (for example) to an element with an @xml:id attribute on it. There are ways to write more complicated XPointer fragment identifiers, but these aren’t processed automatically in oXygen, my preferred XML editor. By default, oXygen processes XIncludes in this simple format and so virtually includes the referenced element before validating the file.

So, in your file1.xml where you are encoding an electronic text, you might replace a listPerson element with the following:

<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="people.xml" xpointer="listPerson1" parse="xml">
    <xi:fallback>
        <listPerson>
            <head>People not available</head>
            <person/>
        </listPerson>
    </xi:fallback>
</xi:include>

This will include, at that point in file1.xml, the element with an @xml:id attribute of ‘listPerson1’ (one assumes a listPerson) stored in the (full TEI file) ‘people.xml’. Here an optional fallback supplies an empty listPerson with a message inside a head element. One of the benefits of this is that many texts can refer to the same listPerson, listPlace, textDesc, msDesc, or what have you, so you share resources across multiple documents, projects, and hopefully institutions. When projects use such a system, in addition to their editions, their standalone listPerson, listPlace, etc. files should also be made transparently available so that other people can point to the same people, places, etc.
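To make the target of that pointer concrete, here is a rough sketch of what ‘people.xml’ might look like. Everything in it apart from the listPerson1 identifier is a made-up placeholder, and I have not validated it against any particular TEI customisation.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Shared list of people (hypothetical)</title></titleStmt>
      <publicationStmt><p>Placeholder publication details.</p></publicationStmt>
      <sourceDesc><p>Born digital.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <!-- this is the element that xpointer="listPerson1" picks out -->
      <listPerson xml:id="listPerson1">
        <person xml:id="pers001">
          <persName>Jane Example</persName>
        </person>
      </listPerson>
    </body>
  </text>
</TEI>
```

Any number of editions can then carry the same xi:include and pick up changes to this one file.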


publicationStmt

‘Leoba’ asked me recently about publicationStmt, wondering:

I have always thought that the publicationStmt in the TEI header is to describe the publication of the electronic text, that is the TEI file itself (and the description of fileDesc seems to support this). However when I look at the descriptions of publicationStmt it’s much less clear (examples with very recent dates, for example). I’m dealing with 17thC documents that are being newly transcribed, can you advise me as to whether we are mis-using it to refer to the electronic text?

‘Leoba’ is right to use publicationStmt inside teiHeader/fileDesc to refer to the electronic text itself. That is what fileDesc is documenting. Her confusion comes from the necessarily vague description of publicationStmt as grouping:

…information concerning the publication or distribution of an electronic or other text.

The reason this doesn’t say ‘publication or distribution of this electronic text’ or something like that is that publicationStmt, if we look at the ‘Used by’ section of its reference page, can also be used inside biblFull, which can appear anywhere model.biblLike is allowed. This enables you to give a teiHeader-like bibliographic citation elsewhere in your document.
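As an illustration of that second use, a biblFull describing a printed source might look roughly like this. The title, printer, place and date are invented placeholders, not taken from any real document.

```xml
<biblFull xmlns="http://www.tei-c.org/ns/1.0">
  <titleStmt>
    <title>A hypothetical seventeenth-century pamphlet</title>
  </titleStmt>
  <publicationStmt>
    <!-- publication details of the printed source, not of the TEI file -->
    <publisher>Hypothetical printer</publisher>
    <pubPlace>London</pubPlace>
    <date when="1650">1650</date>
  </publicationStmt>
</biblFull>
```

Here the same publicationStmt element carries an old date because it describes the source, whereas the one in teiHeader/fileDesc describes the electronic text.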


Modules vs Model Classes vs Attribute Classes

‘Dr John Smith’ asked me recently to explain the difference between modules and classes in the TEI.

Modules basically gather together element definitions into a single group. As the TEI P5 Guidelines say:

A module is … simply a convenient way of grouping together a number of associated element declarations. (TEI Modules)

Sometimes, as with the Core module, these are grouped together for practical reasons; in most cases, as with the Dictionaries module, it is because the elements are all semantically related to one particular class of text or sort of encoding. An element can only appear in one module.

Each chapter has a corresponding module of elements. In the underlying TEI ODD language, both the prose of that chapter of the Guidelines, and the specifications for all the elements are stored in one file. From this is created both the TEI documentation, and the element definitions used to generate a schema.

The TEI Class system is slightly different. While an element can only appear in one module, it can be a member of many classes. While a module is a single unit, classes can contain not only elements (or attributes) but also other classes.

Classes are used to express two distinct kinds of commonality among elements. The elements of a class may share some set of attributes, or they may appear in the same locations in a content model. A class is known as an attribute class if its members share attributes, and as a model class if its members appear in the same locations. In either case, an element is said to inherit properties from any classes of which it is a member. (The TEI Class system)

To make the many elements that the TEI Guidelines describe easier to comprehend, these elements are categorised into classes, usually on structural or semantic grounds. The primary division of classes is between attribute classes and model classes. In the first of these, all the elements that are members of the same attribute class share the attributes stored in the definition for that class. For example, the class att.internetMedia contains an attribute @mimeType. There are three members of this attribute class: binaryObject, equiv, and graphic, which means that each of these elements has a @mimeType attribute. Attribute classes may contain other classes, and a subclass will inherit the attributes of any superclass which contains it.

Elements which are members of model classes are all allowed to appear in the same place. What this means is that the content model of an element says what content is allowed inside it, and in many cases that content model will say that members of a particular class of elements may be used there. One of the benefits of this slight indirection is that if you want a new element you have created to appear in the same places as an existing element, you simply need to add it to that class. For example, the class model.noteLike is used by many elements (and indeed another model class, model.global) to allow things which are note-like to be used inside them. The only members of model.noteLike are note and witDetail. So, in any element content model where model.noteLike is referenced, both note and witDetail are able to be used.
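In a TEI ODD customisation, adding a new element to a class might look roughly like the sketch below. The element name editorialAside and its namespace are hypothetical, and the content model is just one plausible choice written in the current ‘pure ODD’ notation.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
    ident="editorialAside" ns="http://www.example.org/ns/nonTEI" mode="add">
  <desc>A hypothetical project-specific kind of note.</desc>
  <classes>
    <!-- attribute class: gives the element the global attributes -->
    <memberOf key="att.global"/>
    <!-- model class: lets it appear wherever note and witDetail can -->
    <memberOf key="model.noteLike"/>
  </classes>
  <content>
    <macroRef key="macro.specialPara"/>
  </content>
</elementSpec>
```

No existing content model has to be edited, since every element that references model.noteLike picks up the new member when the schema is generated.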

You may have noticed that some of the model classes have the suffix ‘Like’ or ‘Part’ in their name. This distinguishes two types of groupings. If a class has ‘Part’ as a suffix, then it is defined by its structural location. For example, model.biblPart contains elements which are used inside the ‘bibl’ element. That is, they are a ‘part’ of that element in the sense of being possibly valid children. However, classes with a ‘Like’ suffix contain elements which are of a similar semantic nature, and thus able to be used at the same point. For example, model.biblLike contains those elements which are ‘like’ the bibl element in that they contain a bibliographic description of some sort. There are other model classes, such as model.inter, which do not have a ‘Like’ or ‘Part’ suffix, and are convenient groupings of elements (often super classes) that all appear in the same place.

How are modules and classes related? Most classes are defined initially in the TEI Infrastructure module, but which attributes or elements are actually available in any TEI schema depends on the modules which are loaded. For example, model.phrase contains many subclasses, one of which is model.lPart (for parts of a metrical line). However, if in generating your schema you’ve not included the Verse module, then the two elements which model.lPart provides, caesura and rhyme, would not appear as an option anywhere you use model.phrase.

Although most classes are defined by the tei infrastructure module, a class cannot be populated unless some other specific module is included in a schema, since element declarations are contained by modules. Classes are not declared ‘top down’, but instead gain their members as a consequence of individual elements’ declaration of their membership. The same class may therefore contain different members, depending on which modules are active. Consequently, the content model of a given element (being expressed in terms of model classes) may differ depending on which modules are active. (Model Classes)
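In ODD terms, which modules are ‘active’ is decided by the moduleRef elements in your schemaSpec. A minimal sketch, with a made-up ident, might be:

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="myProject" start="TEI">
  <!-- infrastructure module: declares most of the classes -->
  <moduleRef key="tei"/>
  <moduleRef key="core"/>
  <moduleRef key="header"/>
  <moduleRef key="textstructure"/>
  <!-- Verse module: including it populates model.lPart with caesura and rhyme;
       remove this line and those elements are no longer available -->
  <moduleRef key="verse"/>
</schemaSpec>
```

The class model.lPart exists either way, but without the verse moduleRef it simply has no members.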

While I hope that clears up some of the confusion, reading The TEI Infrastructure chapter will certainly help, as will perusing Appendix A: Model Classes, Appendix B: Attribute Classes and Appendix C: Elements for reference and examples. Playing with Roma, which allows you to customise your TEI schema (and more), is another option.
