Self Study (part 3): The TEI Default Text Structure

This (long) post follows on from two earlier ones: a basic post Introducing XML and Markup, and an Introduction to the Text Encoding Initiative Guidelines. Neither of these is really complete in itself and both barely scratch the surface, but they are offered up as a help should people find them useful.

In this post we look at the overall basic structure of a TEI file. In many ways this is much more concrete than the infrastructure of the TEI, where it is possible to get lost in the differences between TEI ODD files and the schemas generated from them, or among modules, model classes, and attribute classes. Here, instead, we’re looking at the markup that is part of almost every TEI file: its default text structure. Readers may notice that the ‘Default Text Structure’ chapter of the TEI Guidelines comes after two that I’ve skipped: ‘The TEI Header’ (chapter 2) and the slightly inaccurately named ‘Elements Available in All TEI Documents’ (chapter 3). Have no fear: if you are following this set of blog posts, I will be returning to chapter 3 next and then chapter 2. I just feel it is good to get a sense of a TEI file as a whole before learning about all the core elements and metadata.

A Basic TEI File Structure

A basic TEI file might look like this image below.

In this image the element names are in blue and XML comments (delineated by <!-- comment -->) are in green.

An XML file always should start with an XML Declaration (here at the top in purple). After that we have a <TEI> element in the TEI Namespace (http://www.tei-c.org/ns/1.0). Inside all <TEI> elements the TEI Guidelines require there to be a <teiHeader> element. In order for this to be a real and valid TEI P5 file, there are some elements which would need to appear inside the <teiHeader> element, but I’ll talk about those in another post.

After the <teiHeader> element you can have one or more optional <facsimile> or <sourceDoc> elements. These are for recording image facsimile information, or for a non-interpretative transcription method sometimes used for creating genetic editions.

After these we have a <text> element. Technically this is optional if you have <facsimile> or <sourceDoc> elements but really for most introductory uses of the TEI it is probably a good idea to have a <text> element. If you do use one it has to come last.

Inside a <text> element you can optionally have a <front> element. This is for containing front matter like title pages or prefaces: anything that comes before the main body of the text.

The <body> element is required, because whatever text you are creating (whether a transcription of ancient clay tablets, medieval manuscripts, modern web-pages or teaching slides) it will have a body of some sort. Inside <body> you might get divisions (the <div> element) or just paragraphs (the <p> element) or a wide variety of other things. (We’ll talk more about these in a bit.)

The <back> element which follows the <body> element, as with <front>, is optional but is intended for back matter such as indexes, appendices, bibliographies, addenda, etc.
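To make the description above a little more concrete, here is a minimal sketch of how these pieces might fit together. It is only an illustration, not a reproduction of any particular file: all of the text content is placeholder wording, the type values on the <div> elements are my own assumptions, and the handful of elements shown inside <teiHeader> are just the bare minimum I would expect a valid header to need.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- Required metadata; everything inside is placeholder content -->
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A placeholder title</title>
      </titleStmt>
      <publicationStmt>
        <p>Placeholder publication information</p>
      </publicationStmt>
      <sourceDesc>
        <p>Placeholder description of the source</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- Optional facsimile or sourceDoc elements would go here -->
  <text>
    <!-- Optional front matter (the type value is an assumption) -->
    <front>
      <div type="preface">
        <p>Placeholder preface.</p>
      </div>
    </front>
    <!-- The main body of the text -->
    <body>
      <p>Placeholder paragraph.</p>
    </body>
    <!-- Optional back matter (the type value is an assumption) -->
    <back>
      <div type="appendix">
        <p>Placeholder appendix.</p>
      </div>
    </back>
  </text>
</TEI>
```

Note the order: the header comes first, then any facsimile material, and <text> comes last, with <front>, <body> and <back> appearing in that sequence inside it.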

Now one of the things you might notice about this is that it brings to bear certain assumptions of the TEI. This default text structure reflects the assumption that most text-bearing objects can be transcribed and edited in a way which resembles something that we might usually associate with a codex-like structure (e.g. front matter, the main body stuff, then stuff that comes after). Our association of this with an assumed codex structure is probably a bit misplaced. Manuscript rolls, for example, often have optional ‘stuff at the top’, then ‘the main body stuff’, then optional ‘stuff at the end’, and many other cultures and methods of writing text on objects also have such systems. People have used the TEI to successfully encode a huge variety of texts from different times and cultures, so it is unlikely that this structure will impose too much of a semantic burden on your own use of it.

The TEI Default Text Structure Chapter

This is a long chapter which covers a lot of ground. It looks at the default text structure of the TEI (that I’ve tried to explain briefly above), and then investigates the kind of things that happen inside the <text> element. This includes looking at the types of divisions available inside the <body>, <front> and <back> elements and the elements available inside these divisions. It includes ways of encoding groups of texts (such as anthologies and collections) and virtual divisions that can be automatically generated, such as tables of contents. It also looks at the <front> element, title pages, and the <back> element.

Read this chapter and, to make sure you have, answer these questions:

  • How might you decide whether a text is unitary or composite?
  • Personally I have a strong preference for almost always using un-numbered divisions <div> rather than numbered ones <div1>. In what circumstances might numbered ones be more appropriate to use?
  • Why does the TEI not use numbered headings (cf. HTML, where there are elements <h1>, <h2>, <h3>, etc.) but just a <head> element?
  • If you were digitising my love letters (who knows why?!), how would you mark up the closing bit at the end of a letter where I say:
With love and cuddles,
James
xxx
  • When would you use a <group> element rather than have separate TEI files?
  • What is a <floatingText> element used to indicate? Try to think of examples from your own area of work.
  • Do the texts you work with have front matter that you would encode in the <front> element? How would you encode it? How do you decide to encode something as front matter rather than as the body of the file?
  • On a title page how would you encode a title that has several parts to it?
  • Are there differences between what is allowed in <front> and what is allowed in <back>? Why is this the case?

Try it out

I always think, if possible, it is good to have practical exercises to reinforce things you have learned. If you have time try this:

  • Start up the oXygen editor
  • Create a new document by going to File → New, double-click to expand ‘Framework templates’, scroll down inside it, and do the same to open ‘TEI P5’. Inside this select ‘All’, and click on ‘Create’ to open a new document.
  • Ignoring the schema declarations at the top, you should get a file containing a <TEI> element with a short <teiHeader>, followed by a <text> element whose <body> holds a single placeholder paragraph.

  • Assuming you’ve not turned off automatic document checking, you should have a happy green square in the upper right-hand corner of the editor, near where a scrollbar would appear if our document were longer. This tells you that the document is not only well-formed but also valid according to the rules of the tei_all schema.
  • Delete the entire paragraph element (including <p> and </p> tags) that says:
<p>Some text here.</p>
  • Does that happy green square disappear? Is it angry and red? If document checking is turned on, the opening <body> tag should be underlined in red, that happy green square should now be red, and there should be a red line part way down the right-hand side indicating where the error is in the document.
  • At the bottom of the screen there will be an error message, in this case saying ‘element “body” incomplete’ because it is expecting one of any number of elements.
  • Rather than replacing this paragraph, let’s add a division. Move to inside the <body> element, between the opening <body> tag and the closing </body> tag, where the paragraph was previously. Press the < key and wait a second; oXygen should be helpful and give a drop-down list of the elements allowed by the TEI at this point. Scrolling up and down this list can give you a sense of the vast array of things you could be encoding at this point, but it is also a bit of a mixture because you can have texts with divisions or without them here. Select the <div> element and notice what oXygen does.
  • oXygen should have added both an opening and closing division tag: <div></div> . Move the cursor between these two tags and press enter a couple times to get some space.
  • Add a <head> element and inside it put the text content “My First Heading”.
  • After the closing </head> tag, add a paragraph using the <p> element and the text “My first paragraph.”
  • In all cases make sure you only stop when you have a happy green square indicating that your document is well-formed and valid.
  • Your <body> element should now contain a single <div> holding a <head> followed by a <p>.

  • Add at least one more division after this. (If you had a document with only one division, you don’t really need to use the <div> element at all.) Inside this second division, try nesting a sub-division!
  • If you do, your <body> element will contain two <div> elements, the second of which has a further <div> nested inside it.
  • Save your document.
  • The oxygen-tei framework comes complete with some transformations to other formats. From the oXygen menus choose Document → Transformation → Configure Transformation Scenario(s), select ‘TEI P5 XHTML’, and click on ‘Apply associated’ (though this may be slightly different if you are using a different version of oXygen).
  • You should get a minimal HTML rendering of your file appearing in a browser. Note some of the information that the transformation has added. Try some other transformations or changing the document and seeing the effect.
  • Think about the nature of your own materials and how you might structure them if encoding them according to the default text structure of the TEI!
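
If you want something to compare your own attempt against, here is a rough sketch of what the <body> element might look like once you have added the second division and a nested sub-division. Only “My First Heading” and “My first paragraph.” come from the steps above; the other headings and paragraphs are placeholder text I have made up, and your own file will differ. The fragment is meant to sit inside the <text> element of the file you created, so it relies on the TEI namespace declared on the <TEI> element there.

```xml
<body>
  <!-- The first division, as built in the steps above -->
  <div>
    <head>My First Heading</head>
    <p>My first paragraph.</p>
  </div>
  <!-- A second division; its text content is made-up placeholder wording -->
  <div>
    <head>My Second Heading</head>
    <p>A paragraph in the second division.</p>
    <!-- A sub-division nested inside the second division -->
    <div>
      <head>A Nested Heading</head>
      <p>A paragraph inside the sub-division.</p>
    </div>
  </div>
</body>
```

Notice that the nested <div> comes after the paragraph in its parent division rather than before it, and that every <head> comes first inside its own <div>.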

I’ve intentionally glossed over the introduction of many of the core TEI elements (such as <p>), but don’t worry, we will survey these next time!

Go on to Self Study (part 4) TEI Core Elements next!

Posted in SelfStudy, TEI, XML | 3 Comments

3 Responses to “Self Study (part 3): The TEI Default Text Structure”

  1. Sebastian Rahtz says:

    teiCorpus, not TEICorpus. might as well get it right, even in a comment….

  2. James Cummings says:

    A comment in an image no less. ;-) Ok, corrected, thanks for pointing it out.

    mea culpa, mea maxima culpa.

