RIP SPQR

Sebastian Patrick Quintus Rahtz,
13 February 1955 – 15 March 2016


I have been helped by, worked with, learned from, and been friends with Sebastian Rahtz since before I even came to work at OUCS (now IT Services). During my years working with him there were only a few times that I ever showed him something new, or a better way to do the thing he was doing; more often than not it was the reverse. I learned a lot from Sebastian, not only about the way to approach technical problems, management, workflow, etc., but in many ways about how to be a better human being. I won’t attempt to list the projects, services, people, and organisations that Sebastian has made better by his existence; I would only leave out many crucial and important ones. I will miss him.

sit tibi terra levis


Posted in other | Leave a comment

Teaching for DEMM: Digital Editing of Medieval Manuscripts

 

This is the second year that, as part of my commitment to DiXiT, I have also taught on the Erasmus+ Digital Editing of Medieval Manuscripts network. Digital Editing of Medieval Manuscripts (DEMM) is a joint training programme between Charles University in Prague, Queen Mary University of London, the Ecole des Hautes Etudes en Sciences Sociales, the University of Siena, and the library of the Klosterneuburg Monastery. It equips advanced MA and PhD students in medieval studies with the necessary skills to edit medieval texts and work in a digital environment. This is done through a year-long programme on editing medieval manuscripts and their online publication: a rigorous introduction to medieval manuscripts and their analysis is accompanied by formal training in ICT and project management. The end of each one-year programme sees the students initiated into practical work experience alongside developers, as they work on their own digital editions, leading to their online publication.

Funded by the Strategic Partnership strand of the European Union’s Erasmus+ Programme, DEMM will run for three consecutive years, always with a new group of students. It will lead to the publication, in print and online, of teaching materials, as well as a sandbox of editions.

My institution is not directly involved in it (though there is overlap with DiXiT), and last year I taught and assisted at both the workshop in Lyon and the Hackathon in London. This year the students had a week’s introduction to Palaeography, Codicology and Philology at Stift Klosterneuburg in the autumn and then in March had a week’s workshop on encoding, tagging and publishing in Lyon.

Needless to say, I was providing tuition on the Text Encoding Initiative. A full schedule, with links to my presentations (some of the others are behind a password-protected site), is available at:

Digital Editing, Lyon 2016

This follows a fairly predictable pattern of introducing people to the concept of markup, the formal syntax of XML, and the vocabulary of the TEI. It then expands on this with an introduction to the core elements, named entities, and, the following morning, TEI metadata. Here of course we also single out the elements for manuscript description and transcription, since these are key for those undertaking to build digital editions of medieval manuscripts. The course then continued with critical apparatus, genetic editing, and the publication and interrogation of the results.
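
To give a concrete flavour of what the students work towards, here is a minimal, invented sketch (the shelfmark, witnesses, and readings are made up for illustration and not taken from any DEMM edition) combining a manuscript identifier with a transcribed sentence carrying a simple critical apparatus:

    <!-- manuscript description: where the witness lives and what it is called -->
    <msDesc>
      <msIdentifier>
        <settlement>Oxford</settlement>
        <repository>Bodleian Library</repository>
        <idno>MS. Example 1</idno>  <!-- hypothetical shelfmark -->
      </msIdentifier>
    </msDesc>

    <!-- transcription with a critical apparatus, parallel-segmentation style -->
    <p>In principio
      <app>
        <lem wit="#A">creavit</lem>   <!-- reading of witness A -->
        <rdg wit="#B">creauit</rdg>   <!-- variant spelling in witness B -->
      </app>
      Deus caelum et terram.</p>

The same handful of elements scales from a single variant up to a full edition, which is why the course singles them out.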

Posted in TEI | Leave a comment

Report on the Digital Humanities at Oxford Summer School 2015


About

The Digital Humanities at Oxford Summer School (DHOxSS) http://digital.humanities.ox.ac.uk/dhoxss/ is an annual training event at the University of Oxford, which this year took place on 20–24 July 2015, primarily at St Anne’s College, IT Services, and the Oxford e-Research Centre. The DHOxSS offers training to anyone with an interest in the Digital Humanities, including academics at all career stages, students, project managers, and people who work in IT, libraries, and cultural heritage. Delegates follow one of our week-long workshops, supplementing their training with expert guest lectures, and can also join in events each evening. This year the DHOxSS grew significantly, from 5 workshops in 2014 to 8 in 2015, and the number of delegates and speakers grew correspondingly, from 107 delegates and 54 speakers in 2014 to 163 delegates and 83 speakers in 2015.

The DHOxSS runs primarily on the goodwill of staff from various units of the University of Oxford, who donate their time as DHOxSS Directors, members of the Organisational Committee, Workshop Organisers, and Speakers, and in the work of the IT Services Events Team. Organisers and Speakers are not financially remunerated for their participation, though travel and accommodation expenses for visiting speakers are covered by the DHOxSS. Speakers and Workshop Organisers are rewarded for their labours through attendance at the DHOxSS welcome reception and sometimes other DHOxSS events. The enterprise as a whole is financially underwritten by IT Services, which also contributes several FTEs’ worth of staff time, spread across part of the time of one of the Directors and the commitment of the IT Services Events Team.

DHOxSS Directors

For the last few years James Cummings (IT Services) has been the overall director of the DHOxSS. However, it has grown to such a size that this year Pip Willcox (Bodleian Libraries) joined him as a co-director. In the planning for DHOxSS 2016 the responsibilities of individual directors are already more distinct as a result of this first year of experience: they oversee discrete areas of the summer school in collaboration with the events team and DHOxSS Organisational Committee.

DHOxSS Organisational Committee

The year-long organisation of the DHOxSS is overseen by an organisational committee consisting of stakeholders from across the collegiate university. After DHOxSS 2014 this committee was intentionally re-structured to give broader representation from more stakeholders and the planning of DHOxSS 2015 bears the fruit of this. The committee for DHOxSS 2015 consisted of:

  • Jacqueline Baker, Oxford University Press
  • James Cummings, Co-Director of DHOxSS, IT Services
  • David De Roure, Wolfson College Digital Cluster
  • Kathryn Eccles, TORCH Digital Humanities Champion
  • Andrew Fairweather-Tall, Humanities Division
  • Ruth Kirkham, The Oxford Research Centre in the Humanities
  • Eric Meyer, Oxford Internet Institute
  • Kevin Page, Oxford e-Research Centre
  • Pamela Stanworth, IT Services
  • Tara Stubbs, Continuing Education
  • Jessica Suess, Museums & Collections
  • Kathryn Wenczek, IT Services Events Team
  • Pip Willcox, Co-Director of DHOxSS, Bodleian Libraries

Content

Structure of the DHOxSS

Overall the DHOxSS mostly has a fairly regular daily structure of:

  • 9:30-10:30 Additional Plenary Keynotes or Parallel Lectures
  • 10:30-11:00 Break
  • 11:00-12:30 Individual Workshops
  • 12:30-14:00 Lunch and travel time
  • 14:00-16:00 Workshops Continue
  • 16:00-16:30 Break
  • 16:30-17:30 Workshops Continue
  • Evening Events

However, some individual workshops varied the times of breaks slightly from this. Indeed, the TEI workshop was asked to extend its teaching until 13:00 each day when an overcrowding situation in the OeRC atrium became evident at lunchtime. For DHOxSS 2016 this schedule will need to be revised to include more travel time because of the distances between some of the chosen venues.

Additional Plenary or Parallel Lectures

The DHOxSS structure provides an opening and closing plenary keynote on the Monday and Friday of the week. Tuesday through Thursday provides an opportunity for parallel sessions in smaller venues. The DHOxSS 2015 had 3 parallel sessions on these days.

Monday 20 July 2015, 09:30-10:30

Tuesday 21 July 2015, 09:30-10:30

Wednesday 22 July 2015, 09:30-10:30

Thursday 23 July 2015, 09:30-10:30

Friday 24 July 2015, 09:30-10:30

Workshops

This year the DHOxSS grew from 5 parallel workshops to 8 workshops each running in parallel over the course of the week. This sudden growth and the corresponding need for additional venues posed an additional administrative burden and more complex logistics.

All workshops at DHOxSS run for the full 5 days. Delegates chose a single workshop and stayed with that workshop for the entire week. Workshop organisers were responsible for designing and running the programme of the workshop, providing the necessary information about it, liaising with the speakers, and ensuring it ran smoothly. Organisers were often also speakers on their workshop. A call for workshops issued in 2014 resulted in the committee approving the following workshops for DHOxSS 2015:

An Introduction to Digital Humanities
Crowdsourcing for Academic, Library and Museum Environments
Digital Approaches in Medieval and Renaissance Studies
Digital Musicology
From Text to Tech
Humanities Data: Curation, Analysis, Access, and Reuse
Leveraging the Text Encoding Initiative
Linked Data for the Humanities

Each workshop is given its own colour which carries through on the website, in the printed booklet, and on the lanyard that delegates on that workshop are given. This makes it blindingly obvious if delegates are trying to switch from one workshop to another, which is not allowed for both pedagogical and administrative reasons: where it is permitted at all it incurs an administration fee and needs the express approval of the workshop organiser.

dhoxss2015-workshops.png

The Introduction to Digital Humanities workshop was organised by Pip Willcox (Bodleian Libraries) and was our most popular workshop strand. It is a mostly lecture-based survey of a large number of Digital Humanities topics and those speaking on it are often appearing in other workshops as well.  This year speakers included: Alfie Abdul-Rahman (Oxford e-Research Centre, University of Oxford), James Cummings (IT Services, University of Oxford), David De Roure (Oxford e-Research Centre, University of Oxford), J. Stephen Downie (Graduate School of Library and Information Science, University of Illinois, Urbana-Champaign), Kathryn Eccles (Oxford Internet Institute and TORCH, University of Oxford), Alexandra Franklin (Bodleian Libraries, University of Oxford), Christopher Green (Institute of Archaeology, University of Oxford), David Howell (Bodleian Libraries, University of Oxford), Matthew Kimberley (Bodleian Libraries, University of Oxford), Ruth Kirkham (Oxford e-Research Centre, University of Oxford), James Loxley (University of Edinburgh), Eric Meyer (Oxford Internet Institute, University of Oxford), Kevin Page (Oxford e-Research Centre, University of Oxford), Meriel Patrick (IT Services, University of Oxford), Megan Senseney (Graduate School of Library and Information Science, University of Illinois, Urbana-Champaign), Judith Siefring (Bodleian Libraries, University of Oxford), Ségolène Tarte (Oxford e-Research Centre, University of Oxford), Andrea K. Thomer (Graduate School of Library and Information Science, University of Illinois, Urbana-Champaign), Pip Willcox (Bodleian Libraries, University of Oxford), and James Wilson (IT Services, University of Oxford)

The Crowdsourcing for Academic, Library and Museum Environments workshop was organised by Victoria Van Hyning (Zooniverse, University of Oxford) and Sarah De Haas (Google). It gave participants an in-depth exposure to the full workflow of crowdsourcing and making use of the aggregate data. Speakers on this workshop included: Philip Brohan (Met Office Hadley Centre), Sarah De Haas (Google), Shreenath Regunathan (Google),  and Victoria Van Hyning (Zooniverse, University of Oxford).

The Digital Approaches in Medieval and Renaissance Studies workshop was organised by Judith Siefring (Bodleian Libraries). This workshop explored various innovative approaches in the field in use at Oxford. This included both image and text-based materials, and delegates had the opportunity to view original artifacts from the age of manuscripts and early print. Speakers on this workshop included: James Cummings (IT Services, University of Oxford), Geri Della Rocca De Candal (Faculty of Medieval and Modern Languages, University of Oxford), David De Roure (Oxford e-Research Centre, University of Oxford), Cristina Dondi (Faculty of History, University of Oxford), Iain Emsley (Oxford e-Research Centre, University of Oxford), Alexandra Franklin (Bodleian Libraries, University of Oxford), Matthew Holford (Bodleian Libraries, University of Oxford), David Howell (Bodleian Libraries, University of Oxford), Eleanor Lowe (Department of English and Modern Languages, Oxford Brookes University), Matilde Malaspina (Faculty of History, University of Oxford), Liz McCarthy (Bodleian Libraries, University of Oxford), Matthew McGrattan (Bodleian Libraries, University of Oxford), Monica Messaggi Kaya (Bodleian Libraries, University of Oxford), Kevin Page (Oxford e-Research Centre, University of Oxford), Alessandra Panzanelli (The British Library),  Judith Siefring (Bodleian Libraries, University of Oxford), Daniel Wakelin (Faculty of English, University of Oxford),  and Pip Willcox (Bodleian Libraries, University of Oxford).

The Digital Musicology workshop was organised by Kevin Page (Oxford e-Research Centre). This workshop provided an introduction to computational and informatics methods that can be, and have been, successfully applied to musicology. It brought together a well-rounded programme balancing lectures with practical sessions. Speakers on this workshop included: Chris Cannam (Centre for Digital Music, Queen Mary University London), Rachel Cowgill (Music & Drama, University of Huddersfield), Julia Craig-McFeely (Faculty of Music, University of Oxford), Tim Crawford (Computing Department, Goldsmiths, University of London), David De Roure (Oxford e-Research Centre, University of Oxford), J. Stephen Downie (Graduate School of Library and Information Science, University of Illinois, Urbana-Champaign), Ben Fields (Computing Department, Goldsmiths, University of London), Ichiro Fujinaga (Schulich School of Music, McGill University), David Lewis (Computing Department, Goldsmiths, University of London), Richard Lewis (Computing Department, Goldsmiths, University of London), Kevin Page (Oxford e-Research Centre, University of Oxford), Christophe Rhodes (Computing Department, Goldsmiths, University of London), Carolin Rindfleisch (Faculty of Music, University of Oxford), Stephen Rose (Department of Music, Royal Holloway, University of London), David M. Weigl (Oxford e-Research Centre, University of Oxford), and Tillman Weyde (Department of Computer Science, City University London)

The From Text to Tech workshop was organised by Gard B. Jenset (TORCH) and Kerri Russell (Faculty of Oriental Studies). This workshop, associated with the HiCor research network (http://www.torch.ox.ac.uk/hicor), taught delegates the skills and understanding required to work computationally and quantitatively with corpora of historical texts. Speakers on this workshop included: Gard B. Jenset (The Oxford Research Centre in the Humanities, University of Oxford), Barbara McGillivray (The Oxford Research Centre in the Humanities, University of Oxford), Kerri Russell (Faculty of Oriental Studies, University of Oxford), Gabor M. Toth (University of Passau / The Oxford Research Centre in the Humanities, University of Oxford), and Alessandro Vatri (Faculty of Classics and Faculty of Linguistics, Philology & Phonetics, University of Oxford).

The Humanities Data: Curation, Analysis, Access, and Reuse workshop was organised by Megan Senseney (Graduate School of Library and Information Science, University of Illinois Urbana-Champaign) and Kevin Page, (Oxford e-Research Centre). This workshop provided a clear introductory grounding in data concepts and practices with an emphasis on humanities data curation. Sessions covered a wide range of topics, including data organization, data modeling, big data and data analysis, and workflows and research objects. Case studies included examples from the HathiTrust, EEBO-TCP, and BUDDAH. Speakers on this workshop included: Laird Barrett (Taylor & Francis / Oxford Internet Institute, University of Oxford), Josh Cowls (Oxford Internet Institute, University of Oxford), David De Roure (Oxford e-Research Centre, University of Oxford), J. Stephen Downie (Graduate School of Library and Information Science, University of Illinois, Urbana-Champaign), Tanya Gray Jones (Bodleian Libraries, University of Oxford), Scott Hale (Oxford Internet Institute, University of Oxford), Neil Jefferies (Bodleian Libraries, University of Oxford), Terhi Nurmikko-Fuller (Oxford e-Research Centre, University of Oxford), Kevin Page (Oxford e-Research Centre, University of Oxford), Allen Renear (Graduate School of Library and Information Science, University of Illinois, Urbana-Champaign), and Sally Rumsey (Bodleian Libraries, University of Oxford)

The Leveraging the Text Encoding Initiative workshop was organised by  Magdalena Turska (DiXiT Project / IT Services, University of Oxford) and Lou Burnard, (Lou Burnard Consulting). This workshop tried to balance an introduction to TEI with more technical investigations of software to publish and interrogate TEI XML files. Speakers on this workshop included: Misha Broughton (DiXiT Project, University of Cologne), Lou Burnard (Lou Burnard Consulting), Emmanuel Château (École Nationale des Chartes), Elena Spadini (DiXiT Project, Huygens ING (KNAW)), and Magdalena Turska (DiXiT Project / IT Services, University of Oxford)

The Linked Data for the Humanities workshop was organised by Kevin Page (Oxford e-Research Centre). This workshop introduced the concepts and technologies behind Linked Open Data and the Semantic Web. It taught attendees how they could publish their research so that it is available in these forms for reuse by other humanities scholars, and how to access and manipulate Linked Open Data resources provided by others. Speakers on this workshop included: David De Roure (Oxford e-Research Centre, University of Oxford), Alex Dutton (IT Services, University of Oxford), Barry Norton (British Museum), Terhi Nurmikko-Fuller (Oxford e-Research Centre, University of Oxford), Dominic Oldman (British Museum), Kevin Page (Oxford e-Research Centre, University of Oxford), John Pybus (Oxford e-Research Centre, University of Oxford),

Poster Session

Each year DHOxSS has a peer-reviewed poster session, often held in conjunction with the welcome drinks reception. This gives delegates, speakers, and members of the University of Oxford a chance to get to know each other and to display their digital humanities work. This year posters were presented by:

Evening Events

An important part of DHOxSS is the social events. This year these consisted of:

As mentioned above, the Welcome Drinks Reception and Poster Session is an important networking event for those attending and speaking at DHOxSS, but also for other invited guests. In this case it was also used as a book launch event for one of the DHOxSS’s major sponsors, the AHRC Digital Transformation Theme. The guided walking tour gave visitors to Oxford a chance to explore the historic city.

Teaching Venues

The DHOxSS has reached a size where it can occasionally face venue capacity problems. There are only so many lecture theatres in Oxford which hold 163 delegates (plus additional speakers) and many of these are booked out well in advance. The DHOxSS events team is working on securing locations several years in advance; however, the unprecedented growth from DHOxSS 2014 to DHOxSS 2015 in the number of workshops meant that additional venues needed to be found. The venues used were:

  • The Mathematics Institute: For the opening and closing keynotes DHOxSS 2015 used the Mathematics Institute.

  • St Anne’s College: The additional lectures on the mornings of Tuesday – Thursday were held in the Tsuzuki Lecture Theatre, Seminar Room 9, and the Danson Room.  These rooms were also used for three of the DHOxSS workshops. There were some problems with using the Danson room for presenting, but other spaces worked well.

  • IT Services: Three workshops were held in the IT Services Thames Suite of teaching rooms.

  • Oxford e-Research Centre: Two workshops were held in the Oxford e-Research Centre.

  • The Weston Library Lecture Theatre: This was used for a joint session of two workshops.

Podcasts, Photos, and Social Media

The DHOxSS has always engaged with social media: the #DHOxSS hashtag was well used by delegates, and the @DHOxSS Twitter account was a source of information and advice. Although the DHOxSS Photo Group on Flickr was mentioned to delegates, it did not prove as popular as more immediate open forums such as Twitter and Instagram. Podcasts of the opening and closing keynotes, as well as most of the additional lectures, were made freely and openly available (the only lecture that wasn’t suffered technical difficulties with the footage). These are published in the DHOxSS podcast series at http://podcasts.ox.ac.uk/series/digital-humanities-oxford-summer-school. Individually these are:

  • Uneasy Dreams: the Becoming of Digital Scholarship – James Loxley (University of Edinburgh), the closing keynote
  • The Online Corpus of Inscriptions from Ancient North Arabia – Daniel Burt (Khalili Research Centre, University of Oxford)
  • If a Picture is Worth 1000 Words, What’s a Medium Quality Scan Worth? – David Zeitlyn (Institute of Social and Cultural Anthropology, University of Oxford)
  • Crowdsourced Text Transcription – Victoria Van Hyning (Zooniverse, University of Oxford)
  • Let Your Projects Shine: Lightweight Usability Testing for Digital Humanities Projects – Mia Ridge (Digital Humanities, Open University)
  • Networking⁴: Reassembling the Republic of Letters, 1500-1800 – Howard Hotson (Faculty of History, University of Oxford)
  • Mapping Digital Pathways to Enhance Visitor Experience – Jessica Suess (University of Oxford Museums) and Anjanesh Babu (Ashmolean Museum, University of Oxford)
  • Digital Image Corruption – Where It Comes From and How to Detect It – Chris Powell (Ashmolean Museum, University of Oxford)
  • Digital Transformations – a panel discussion with David De Roure, Lucie Burgess, Tim Crawford, and Jane Winters
  • How I Learned to Stop Worrying and Love the Digital – Jane Winters (Institute of Historical Research, University of London), the opening keynote

This continues a DHOxSS tradition of recording and making openly available the keynotes and additional lectures.

DHOxSS Statistics

Speakers

There were 83 speakers for DHOxSS 2015, 54 of whom were from the University of Oxford. These were contributed by the following departments:

  • Bodleian Libraries: 13 Speakers
  • Oxford e-Research Centre: 9 Speakers
  • IT Services: 7 Speakers
  • Oxford Internet Institute: 6 Speakers
  • Faculty of History: 3 Speakers
  • The Oxford Research Centre in the Humanities: 3 Speakers
  • Oxford University Museums: 3 Speakers
  • Faculty of Music: 2 Speakers
  • Faculty of Classics: 1 Speaker
  • Faculty of English: 1 Speaker
  • Faculty of Medieval and Modern Languages: 1 Speaker
  • Faculty of Oriental Studies: 1 Speaker
  • School of Archaeology: 1 Speaker
  • Institute of Social and Cultural Anthropology: 1 Speaker
  • Khalili Research Centre: 1 Speaker
  • Zooniverse: 1 Speaker

Registration

There were 163 DHOxSS 2015 registrations which were as follows:

  • Academic/Standard/NFP: 92

  • Student: 53

  • Oxford: 16

  • Commercial: 2

DHOxSS2015-registrations.png

The registration charges were:

  • Full Commercial Rate (you work for a commercial or corporate organisation): £695
  • Academic/Education/NFP (you work for an educational institution, library, charity or not-for-profit organisation in any capacity): £590 (15% discount)
  • Student, any institution/level (you are enrolled as a full-time or part-time student at any educational institution at any level): £485 (30% discount)
  • Staff or Student of the University of Oxford (you work or are a student at the collegiate University of Oxford): £485 (30% discount)

This covered the costs of venues, lunches, evening events, speaker travel and accommodation as well as any costs in running the workshops.

As part of the registration process delegates were optionally able to indicate the source of funding they were using to pay for their registration. While 33% chose not to answer, 31% had institutional funding, 22% were self-funding, 8% had project funding, 6% had a bursary/grant of some sort, and 1% indicated a different reason.

dhoxss2015-funding.png

The reasons for attending, when chosen from a list, were mostly career development (38%), a specific project (20%), and general interest (10%), while 33% chose not to answer:

dhoxss2015-reason.png

Delegate Origin

Delegates came from all levels of professional standing, and from over 100 separate institutions. In aggregate the countries of origin can be totalled as:

  • UK: 80
  • Other Europe: 50
  • North America: 26
  • Far East: 3
  • Middle East: 1
  • Russia: 1
  • South America: 1
  • Australia: 1

dhoxss2015-country.png

Delegate Age

dhoxss2015-age.png

How Delegates Heard About DHOxSS 2015

DHOxSS 2015 was advertised through various media. While registering, delegates were able to indicate where they had heard about DHOxSS. Most had heard about it from colleagues (some of whom were previous attendees); others indicated that they had used online searches or found the website through one route or another. Fewer indicated that they had heard about it through social media, though the effectiveness of this channel is hard to determine, since social media and mailing lists may well be how the colleagues they cited heard of it in the first place. Similarly, flyers were distributed at conferences and sent to various UK humanities departments, which might have resulted in some of the institutional recommendations.

dhoxss2015-heard.png

Gender

DHOxSS strives to be a welcoming place for all participants. One of the statistics we have examined over the years is that of gender. In previous years gender was not asked of participants but tracked informally based on apparent gender identity. This has shown that DHOxSS normally attracts approximately 69% female delegates. For the first time, in the registration for DHOxSS 2015, delegates were asked to declare their gender. The ratio of female to male delegates generally held, though it appeared slightly lower because many of those choosing not to answer the question (for whatever reason) appear to be women. The chart below looks at the gender not only of delegates but of all participants: 31% female delegates, 16% male delegates, 20% male speakers, 13% female speakers, and 19% delegates who did not answer. This indicates room for improvement in increasing the number of female speakers so that the speaker body is more representative of the DH community which attends DHOxSS.

dhoxss2015-gender.png

Feedback

In general the feedback from delegates and speakers was positive. There were a number of problems with workshops where abstracts did not entirely match the workshop content or where too many topics were covered. There was very positive feedback on the organisation and administration of DHOxSS 2015. The feedback was summarised for the organisational committee and has formed part of the planning for DHOxSS 2016.

Plans for DHOxSS 2016

The DHOxSS 2016 will be held from 4 to 8 July 2016 at St Hugh’s College, IT Services, the Oxford e-Research Centre, and other venues. The planning for this is already underway (and locations for 2017 and 2018 are being booked), and a call for workshops and additional lectures has already gone out. If you want to subscribe to our DHOxSS announcements mailing list, email: dhoxss-announce-subscribe@maillist.ox.ac.uk and confirm by replying to the confirmation email that gets sent to you. We will notify this mailing list when registration opens.

Posted in DHOxSS | 1 Comment

childish toys

I count religion but a childish toy, and hold there is no sin but ignorance.
The Jew of Malta, Christopher Marlowe

Occasionally, indeed almost cyclically, on some of the mailing lists I’m on, a big theoretical war erupts where someone declares “XML is DEAD: We should all move to using $Thing”. Though to be honest, it could be any format or technology, not just XML.

Sometimes these are well-meaning hunters of the new and shiny: Someone has heard about this brand new shiny $Thing technology and heard that it is the replacement for XML technologies (or whatever existing technologies) and that we should all start using it. With little or no critical examination of their sources, perhaps a shiny youtube promotional video, this then starts a long and usually fruitless discussion. One of the reasons that $Thing technology is quicker, shinier, and much more fun, is that it has dropped lots of the baggage of the old technology — eventually people will realise that baggage was there for a reason and slowly add it back, but this time to a framework not designed to incorporate it. People chip in from both sides but the status quo remains.

Sometimes it is naively theoretically-based: Someone notices, or reads about, the inherent problems in XML (or whatever existing technologies) and sees that using $Thing technology doesn’t have those problems (and either doesn’t notice the other problems it does have, or they don’t apply to their narrow use-case). The poster in this case wants to know “Is this really the next big thing?” but is, or at least should be, open to the reasons why it isn’t. This usually brings up discussion by posters on both sides picking flaws in one technology or the other, or recycling long-dead myths. (“XML has a problem with overlapping hierarchies, $Thing doesn’t! Ha!”, “There are lots of solutions to overlapping hierarchies in XML which enable you to use all these nice tools.”, “Ah, but you can’t do stand-off markup in XML or represent a graph!”, “Erm, yes, you can. Honestly: URI-based pointing, out-of-line markup, linking multiple disparate resources by various taxonomies, all common in XML.”, etc.) This sniping back and forth is hardly productive and just makes people think there is a problem where there isn’t. People chip in from both sides and the status quo remains.
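
For anyone who hasn’t met the stand-off style being alluded to, here is a minimal, invented sketch of URI-based out-of-line markup in TEI terms (the identifiers and the annotation are made up): the base text carries only anchors, and any number of possibly overlapping analyses point back into it from elsewhere:

    <!-- base text: words carry identifiers but no analysis -->
    <p xml:id="p1">
      <w xml:id="w1">Sing</w>, <w xml:id="w2">goddess</w>,
      <w xml:id="w3">the</w> <w xml:id="w4">anger</w> ...
    </p>

    <!-- out-of-line annotation held separately, pointing back by URI;
         a second, overlapping analysis could point at the same words -->
    <spanGrp type="rhetoric">
      <span from="#w1" to="#w2">invocation of the Muse</span>
    </spanGrp>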

Sometimes it is sophisticatedly theoretically-based: Some philosophical guru has been studying the various technologies for quite some time and argues that the problems inherent in one, from their point of view, are dealt with more elegantly in $Thing technology. This is probably true, but is mostly done as a theoretical exercise of trying to perfect the ideal technology and express it in a form that is elegant, beautiful, and rational. More often than not this results in a particular instance of $Thing technology that solves problems most people didn’t really care about; it may be elegant, but it is not human-readable, and the only thing that reads it is the guru’s personal implementation, which works for their use-case. While potentially useful, it is not pragmatic for the majority of people to care about it until it has reached mass adoption. It will never reach mass adoption because this guru, let’s say, isn’t interested in community building. People will gently comfort the technological genius who doesn’t understand why we persist with the well-supported but suboptimal, and the status quo remains.

Sometimes it is religiously-based: A devotee of $Thing technology, or a die-hard opponent of XML (or whatever existing technology), finds some news article or development which they can use to claim the superiority and mass adoption of $Thing technology. The use of $Thing technology in this instance is then cast as a slow but measurable demise of XML (or whatever existing technology). The increase in use of one technology is not necessarily related to the demise of another, and this may be misleading for people viewing the exchange. In my opinion it is usually intellectually dishonest to present such a news article or development as the death knell for another technology, especially when both can happily co-exist, and especially when it is done consciously as a technique by the devotee to intentionally discredit the existing technology. Disliking a particular technology because of its flaws is reasonable, but doing so blindly is not what users should be basing their technological decisions on. Users of the existing technology defend their conscious decision not to be trendy, while inexperienced users choose $Thing technology because of the hype and then contribute to that hype. People chip in from both sides and try patiently to convert the masses or correct the fallacies of the devotee, but the status quo remains.

Sometimes it is implementation-based: A programmer needing to process lots of XML (or whatever existing technology) runs into a problem, often a limitation caused by the poor implementation of the libraries they are using, and either bemoans the fact or is advised that $Thing technology doesn’t have these problems and, look, it comes with a wonderful library of tools. People counter by showing how, if the programmer had been using the appropriate tools, the problem would have been easier to solve. Others point to the growing code base for $Thing technology and get shown the huge number of tools for the existing technology. The code base might be growing because people have seen that $Thing technology is missing support for all their special cases, and thus it agglomerates bits and pieces of new areas of support. People chip in from both sides with examples of how their chosen technology does one thing better, or how they are all bad, but the status quo remains.

There are of course other ways this arises and plays out, and different actors playing many parts. In my case I find almost any of these discussions pathetically juvenile. How many times do we have to say it:

IT ISN’T ABOUT THE FORMAT, IT IS ABOUT GRANULARITY OF INFORMATION AND APPROPRIATE TECHNOLOGIES FOR APPROPRIATE USES!

Instead, let’s help each other do good and useful things rather than needlessly wasting spare cycles proclaiming the death or triumph of one useful format or technology over another. To do otherwise is tiring, pathetic, and just a waste of everyone’s time. Sure, any new project needs to get good and sensible advice on what formats, technologies, and methodologies are suitable for it. These are rarely determined by abstract considerations of the inherent properties of the format, technology, or methodology, however, and instead are determined by what the staff already know, what the local infrastructure will support, and what will give the most useful answers to the research questions with the least amount of investment. The childish toys alluded to by my appropriation of Marlowe here aren’t the formats themselves, but the arguments people have about them. Sure, geek out and enjoy the intricacies of your chosen technologies, but if you find yourself posting to a mailing list about how your $Thing technology is better than some other technology, please have a long hard look in the mirror and go do something more useful with your life.

Although I spend a lot of my time immersed in the world of one particular technology, XML, that doesn’t mean I need to believe it is the right and true answer for all situations. If I were designing a mobile phone app, at the time of writing I’d almost certainly be using JSON or an SQLite DB for data storage. If I were constructing an ontology then RDF would be the way to go. If I wanted to structurally query a large number of documents I’d use a NoSQL document database like eXist-db. If I’m encoding dearly held and deeply nested semantics in the text of a medieval manuscript … I would have to be a complete lunatic to sit down and hand-encode this in JSON or RDF. In that case I’d use TEI XML, because of the power of schema constraints and validation to enforce consistency, its human-readable nature, and its resilience for long-term preservation. I’d do this knowing that I could convert my work to any format I needed based on the granularity of the markup I provided. They are all appropriate at different times and places; what the base storage format should be depends a lot on your project’s needs, the sources of information, and the technology stack you have available to you.
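
To make the nesting point concrete, here is a small, invented fragment of the sort of thing I mean (the hand, the person, and the dating are all made up): a marginal addition by a later scribe, containing an abbreviated personal name with its expansion and an uncertain date, each layer of the claim nesting cleanly inside the previous one:

    <p>
      <add hand="#scribe2" place="margin">   <!-- hypothetical later hand -->
        <persName ref="#william">            <!-- hypothetical person ID -->
          <choice>
            <abbr>Willm<am>'</am>s</abbr>
            <expan>Will<ex>elmu</ex>s</expan>
          </choice>
        </persName>
        <date notBefore="1125" notAfter="1150" cert="medium">s. xii</date>
      </add>
    </p>

Validation against a TEI schema keeps all of those layers consistent, and the same granularity can later be flattened out to JSON or RDF where that is what a consumer actually needs.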

The growth in one of these or other technologies doesn’t ipso facto indicate in any way the ‘death’ of any other technology. Technology will always change; things will always move on. But we should never celebrate even the perception of the marginalisation of widely adopted formats — useful legacy data migration of existing resources, no matter what the format, takes time and effort. Some technologies will eventually become less supported and the mainstream will be using one new $Thing technology or another. This has happened before and will happen again.

I’m all for pointing out the technologies chosen by good and interesting projects, and learning from their successes and, even more importantly, their failures, but this should be done honestly, with a desire to educate, not blindly, with trolling attempts to start a war where there really isn’t any argument.

More people are using $Thing technology? This well-known project has adopted $Thing technology as one of its outputs? Great! Isn’t it good that people are using all these wonderful technologies… what is even more important is what they are doing with them! Maybe we should ask them why they chose to do that rather than making assumptions about the lifecycle of technologies? In fact, one thing that contributes to the strength and power of modern information systems design is the ability to work between multiple formats simultaneously and sometimes even automatically: for example, to store something as XML, but auto-generate a subset of it as JSON metadata which a web frontend then uses to link to PDFs and EPUBs generated from the same XML. To say that “if you want to use JSON you shouldn’t be using XML” is like saying “if you want to play with a Princess Elsa Doll, then you shouldn’t play with a Batman Action Figure”. It is nonsensical. Anyone who thinks you can’t play with both just doesn’t deserve the oxygen of being listened to.

Posted in XML | Leave a comment

What is the TEI? And Why Should I Care? (A brief introduction for classicists)

Recently I gave a lecture to those interested in Digital Classics at the University of Oxford, as part of the Digital Classics Seminar Series, alongside people much more qualified to talk about Classics (digital or otherwise) than me. I’m not, nor have I ever been, nor will I ever be, a classicist. Ok, I did learn Classical Latin at one point, but quickly replaced it with the much more complicated (though not necessarily more sophisticated) Medieval Latin as I did an MA and PhD in Medieval Studies. So I was understandably nervous speaking to a room full of classicists. Luckily I was talking about something I know fairly well, and only making reference to its use in Digital Classics. In this case the title of my talk was “What is the TEI? And Why Should I Care? (A brief introduction for classicists)”. There are versions of the talk online:

I needn’t have worried, of course; the audience was wonderfully attentive as I went through, at a fairly basic level, a brief introduction to:

  • Markup: I looked at the differences between Procedural, Presentational, and Descriptive Markup, and why one might want to annotate information in this way (a small invented example follows this list)
  • XML: I quickly covered the basic descriptions of how XML is formatted and what its rules are; the power of deeply nesting annotation; and compared the pros and cons of XML vs Databases
  • TEI: I surveyed what the TEI is, what it is not, how it is customisable, and how it is developed and used.
  • EpiDoc: Lastly I discussed a vibrant TEI community of epigraphers and the EpiDoc TEI P5 customisation they have made. As someone only on the very edge of this Digital Classicist community I probably didn’t do it justice, but it is a very good example of people customising the TEI (as a pure subset) to create even more targeted resources that conform to the needs of their community.
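
The invented example promised above (not taken from the talk itself): the first line records only how the words look, the second records what they are, and the last shows the sort of epigraphic conventions EpiDoc adds on top of the TEI, with lost text supplied by the editor and a lacuna of known extent encoded rather than merely bracketed:

    <!-- presentational markup: appearance only -->
    <i>Il Principe</i> was printed in 1532.

    <!-- descriptive markup (TEI): what the words actually are -->
    <p><title xml:lang="it">Il Principe</title> was printed in
       <date when="1532">1532</date>.</p>

    <!-- EpiDoc-style transcription of an (invented) inscription -->
    <ab>
      <expan>Imp<ex>eratori</ex></expan> Caes<supplied reason="lost">ari</supplied>
      <gap reason="lost" quantity="1" unit="line"/>
    </ab>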

I encourage people to go to the other Digital Classics Seminar Series lectures or follow them as they are live streamed that evening (or catch up afterwards). The live streams are advertised shortly before the talk at: http://users.ox.ac.uk/~corp1223/DigitalClassics.htm

Posted in TEI, XML | Leave a comment

Text Creation Partnership: Made for everyone

Oxford Text Archive TCP Catalogue

From 1 January 2015 the books transcribed in the first phase of EEBO-TCP (Early English Books Online – Text Creation Partnership) entered the public domain. They join those created by ECCO-TCP (Eighteenth Century Collections Online – Text Creation Partnership) and Evans-TCP (Evans Early American Imprints – Text Creation Partnership). The goal of the Text Creation Partnership is to create accurate XML/SGML encoded electronic text editions of early printed books. They transcribe and encode the page images of books from ProQuest’s Early English Books Online, Gale Cengage’s Eighteenth Century Collections Online, and Readex’s Evans Early American Imprints. The work the TCP does, and hence the resulting transcriptions that they create, are jointly funded and owned by more than 150 libraries worldwide. Eventually all of the TCP’s work will be placed into the public domain for anyone to use, and the release of Phase 1 of EEBO-TCP is a milestone in this process.

The TCP began in 1999 as a partnership among the libraries of the University of Michigan and the University of Oxford, ProQuest, and the Council on Library and Information Resources (CLIR). As and when TCP texts have entered the public domain we have made them available at the Oxford Text Archive, which was already distributing the public domain copies of ECCO-TCP and now adds phase one of EEBO-TCP and Evans-TCP to this collection. The hard work of managing the creation, encoding, checking, and provision of the texts has been done by the Bodleian Library at the University of Oxford and the University of Michigan Library, while the Academic IT group of IT Services at the University of Oxford has undertaken the task of bringing the encoding into full conformance with the Text Encoding Initiative P5 Guidelines and making the results available in various forms.

The Academic IT group of IT Services at the University of Oxford has made use of these texts for a number of projects and so wanted to make sure that the texts were easily available now that they have entered the public domain. To do so we have placed them in a special collection at the OTA which displays the metadata (stored in a PostgreSQL database) as a jQuery DataTable, enabling sorting and filtering by any aspect of it. This table currently lists 61315 texts, but this includes 28462 texts which are ‘restricted’. These are not in the public domain yet, but are available to those at the University of Oxford to use in the meantime. The remaining 32853 texts are freely available to the public. You can see only the free ones by filtering by ‘Free’ in the availability column. Each entry in the table provides basic metadata: the TCP ID, links, the title, availability, date, other IDs associated with the text, the keyword terms the TCP provided for it, and a rough page count. The links provided are to:

  • Web: This is a basic HTML rendering using the XSLT Stylesheets of the Text Encoding Initiative Consortium
  • ePub: This is a basic conversion to ePub format, as above using the XSLT Stylesheets of the TEI Consortium, for reading on mobile and tablet devices which support this format
  • Images: This link is only present for certain texts and takes you to the JISC Historical Texts Platform entry for this text. Historical Texts is a JISC-funded service available via subscription to UK HE and FE institutions and Research Councils who are full Jisc Collections members. We recognise that this is not useful for those at institutions who do not subscribe to this service or are not in the UK. It may also be possible to go back to ProQuest’s EEBO and find the page images directly if your institution subscribes to that. It was decided it was better to include the link for the benefit of users at UK subscribing institutions rather than not include it.
  • Source: In the case of public domain texts we have created a GitHub repository per text, and a couple of additional ones. These are all part of the Text Creation Partnership organization at GitHub, which has representatives from the libraries at both Oxford and Michigan. This is located at https://github.com/textcreationpartnership/ and the repositories take the form of https://github.com/textcreationpartnership/TCP-ID where TCP-ID is the identifier the TCP assigned to the work, e.g. https://github.com/textcreationpartnership/A00021. We hope that the TEI P5 XML provided in such repositories will serve as the base for enhancements and corrections.
  • Analysis: Currently there are no links to text analysis engines, but we are considering the possibility of adding them where they function by giving a simple link with the URL of a source in it. Obviously this will only be able to be provided for freely available texts.

A lot of the work to make these texts available via the Oxford Text Archive, after they were created by the TCP, has been done by Sebastian Rahtz, Magdalena Turska, and James Cummings. The research support team at IT Services can be reached at: researchsupport@it.ox.ac.uk.  You can read more about TCP and EEBO at http://www.textcreationpartnership.org/tcp-eebo/ and http://www.bodleian.ox.ac.uk/eebotcp/.

Posted in TEI, XML | Leave a comment

Report on the Digital Humanities at Oxford Summer School 2014

About

The Digital Humanities at Oxford Summer School (DHOxSS) is the annual training event at the University of Oxford which took place this year on 14–18 July 2014, primarily at Wolfson College and IT Services. The DHOxSS is a chance for lecturers, researchers, project managers, research assistants, students, and anyone interested in Digital Humanities to learn new skills and find out about the DH research taking place in Oxford. DHOxSS delegates are introduced to a range of topics including the creation, management, analysis, modelling, visualization, or publication of digital data for the humanities. Each delegate follows one of the five-day workshops and supplements this with additional keynotes and morning parallel lectures. For more general information see: http://dhoxss.humanities.ox.ac.uk/2014/about.html

DHOxSS 2014 Organisational Committee

The organisation of DHOxSS is a collaborative undertaking and overseen by an organisational committee representing the major DH stakeholders at the University of Oxford. For DHOxSS 2014 the organisation committee consisted of:

  • James Cummings, Director of DHOxSS, (IT Services)
  • Ylva Berglund Prytz (IT Services)
  • David De Roure (Oxford e-Research Centre)
  • Linda Edgar (IT Services)
  • Andrew Fairweather-Tall (Humanities Division)
  • Christine Madsen (Bodleian Libraries)
  • Eric Meyer (Oxford Internet Institute)
  • Kevin Page (Oxford e-Research Centre)
  • John Pybus (Oxford e-Research Centre)
  • Sebastian Rahtz (IT Services)
  • Pip Willcox (Bodleian Libraries)
  • Kathryn Wenczek (IT Services)
  • Martin Wynne (IT Services)

Content of DHOxSS 2014

The DHOxSS has a fairly regular daily structure of:

  • 9:30-10:30 Additional Plenary Keynotes or Parallel Lectures
  • 10:30-11:00 Break
  • 11:00-12:30 Individual Workshops
  • 12:30-13:30 Lunch
  • 13:30 – 14:00 Travel Time for those switching venues
  • 14:00-16:00 Workshops Continue
  • 16:00-16:30 Break
  • 16:30-17:30 Workshops Continue
  • Evening Events

Additional Plenary Keynotes or Parallel Lectures

Each morning DHOxSS 2014 started with either a plenary (opening or closing) keynote lecture  or a choice of three parallel lectures. Delegates registered their choices when booking onto the DHOxSS which enabled us to put each in the most suitable room available to us at Wolfson College.

Workshops

All workshops at DHOxSS run for the full 5 days. Delegates chose a single workshop and stayed with that workshop for the entire week. They are not usually allowed to switch workshops part-way through since this causes problems for workshop organisers and in some workshops it is difficult for those who switch to catch up. Each year some do, and this year there was a £25 administration fee for doing so to discourage it. All workshops had at least one organiser local to the University of Oxford, who acted as the point of contact for organisational and administrative queries concerning the workshop. Workshop organisers were responsible for designing and running the programme of the workshop, providing the necessary information about it, liaising with the speakers, and ensuring it ran smoothly. Organisers were often also speakers on their workshop.

The workshops for DHOxSS 2014 were:

  1. Introduction to Digital Humanities
  2. Taking Control: Practical Scripting for Digital Humanities Projects
  3. Data Curation and Access for the Digital Humanities
  4. A Humanities Web of Data: Publishing, Linking and Querying on the Semantic Web
  5. Using the Text Encoding Initiative for Digital Scholarly Editions

1. Introduction to Digital Humanities

The Introduction to Digital Humanities workshop at DHOxSS 2014 was organised by Pip Willcox (Bodleian Libraries, University of Oxford). This was the most popular workshop at the summer school and included a survey of many Digital Humanities topics with contributions from many speakers: Alfie Abdul Rahman (Oxford e-Research Centre, University of Oxford), John Coleman (Faculty of Linguistics, Philology, and Phonetics, University of Oxford), James Cummings (IT Services, University of Oxford), David De Roure (Oxford e-Research Centre, University of Oxford), J. Stephen Downie (University of Illinois at Urbana-Champaign), Kathryn Eccles (Oxford Internet Institute, University of Oxford), Amanda Flynn (Bodleian Libraries, University of Oxford), Alexandra Franklin (Bodleian Libraries, University of Oxford), David Howell (Bodleian Libraries, University of Oxford), Zena Kamash (School of Archaeology, University of Oxford), William Kilbride (Digital Preservation Coalition), Matthew Kimberley (Bodleian Libraries, University of Oxford), Ruth Kirkham (Oxford e-Research Centre, University of Oxford), Eric Meyer (Oxford Internet Institute, University of Oxford), Meriel Patrick (IT Services, University of Oxford), Michael Popham (Bodleian Libraries, University of Oxford), John Pybus (Oxford e-Research Centre, University of Oxford), Mia Ridge (Open University), Judith Siefring (Bodleian Libraries, University of Oxford), Ségolène Tarte (Oxford e-Research Centre, University of Oxford), Pip Willcox (Bodleian Libraries, University of Oxford), Abigail Williams (Faculty of English, University of Oxford) and James Wilson (IT Services, University of Oxford).

2. Taking Control: Practical Scripting for Digital Humanities Projects

The Taking Control: Practical Scripting for Digital Humanities Projects workshop at DHOxSS 2014 was organised by Sebastian Rahtz (IT Services, University of Oxford). This workshop taught students the skills of transforming data from one format to another for a variety of purposes. It included talks from Alexander Dutton (IT Services, University of Oxford), Janet McKnight (IT Services, University of Oxford), Sebastian Rahtz (IT Services, University of Oxford) and Scott Wilson (IT Services, University of Oxford).

3. Data Curation and Access for the Digital Humanities

The Data Curation and Access for the Digital Humanities workshop at DHOxSS 2014 was organised by Kevin Page (Oxford e-Research Centre, University of Oxford) and Megan Senseney (CIRSS, University of Illinois at Urbana-Champaign). This workshop provided a strong introductory grounding in data curation concepts and practices, focusing on the special issues and challenges of validity and meaning for reuse of humanities research data. Invited experts from the University of Illinois at Urbana-Champaign participated in and helped organise this workshop; we are especially indebted to them for this. This workshop included talks from: Laird Barrett (Taylor & Francis / Oxford Internet Institute, University of Oxford), Jonathan Bright (Oxford Internet Institute, University of Oxford), J. Stephen Downie (University of Illinois at Urbana-Champaign), Tanya Gray Jones (Bodleian Libraries, University of Oxford), Scott Hale (Oxford Internet Institute, University of Oxford), Neil Jefferies (Bodleian Libraries, University of Oxford), Kevin Page (Oxford e-Research Centre, University of Oxford), Carole L. Palmer (University of Illinois at Urbana-Champaign), Allen H. Renear (University of Illinois at Urbana-Champaign), Sally Rumsey (Bodleian Libraries, University of Oxford), Ralph Schroeder (Oxford Internet Institute, University of Oxford), Megan Senseney (CIRSS, University of Illinois at Urbana-Champaign) and Nicholas Weber (University of Illinois at Urbana-Champaign).

4. A Humanities Web of Data: Publishing, Linking and Querying on the Semantic Web

The A Humanities Web of Data: Publishing, Linking and Querying on the Semantic Web workshop at DHOxSS 2014 was organised by Kevin Page (Oxford e-Research Centre, University of Oxford). This workshop introduced the concepts and technologies behind the Semantic Web and taught attendees to publish their research so that it is available as Linked Data, using distinct but interwoven models to represent services, data collections, workflows, and — so as to simplify the rapid development of integrated applications to explore specific findings — the domain of an application. Talks on this workshop were provided by: David De Roure (Oxford e-Research Centre, University of Oxford), Dominic Oldman (British Museum), Kevin Page (Oxford e-Research Centre, University of Oxford), John Pybus (Oxford e-Research Centre, University of Oxford) and Sebastian Rahtz (IT Services, University of Oxford).

5. Using the Text Encoding Initiative for Digital Scholarly Editions

The Using the Text Encoding Initiative for Digital Scholarly Editions workshop at DHOxSS 2014 was organised by James Cummings and Lou Burnard. This workshop provided a mix of lectures and practical exercises introducing the use of the TEI Guidelines for the creation of scholarly digital editions. Speakers on this workshop were: Marjorie Burghart (L’Ecole des Hautes Etudes en Sciences Sociales, Lyon / DiXiT), Lou Burnard (Lou Burnard Consulting), James Cummings (IT Services, University of Oxford) and Magdalena Turska (IT Services / DiXiT, University of Oxford).

Poster Session

DHOxSS 2014 featured a Poster Session at the welcoming reception at the Oxford University Museum of Natural History. This was a lovely location for a reception and poster session and it was enjoyed by all. Presenters’ contributions were peer-reviewed by the DHOxSS Organisational Committee. Presenters were either attending the DHOxSS 2014 or were members of the University of Oxford. This poster session has several benefits: it enables delegates to present the work they are undertaking to other participants at the DHOxSS, and at some institutions it also helps to justify their participation in this training event. Moreover, participation by members of the University of Oxford who are not speakers or delegates at the DHOxSS gives an additional dissemination route advertising the DH work of the University.

  1. James Cummings (IT Services, University of Oxford) CatCor: Correspondence of Catherine the Great
  2. Rebecca Dowson; Margaret Linley (Simon Fraser University) Book Ecology and Migrating Collections: SFU Lake District Digital Humanities Project
  3. Bronwen Hudson (University of Vermont)
  4. Clare Hutton (Loughborough University) Collating Joyce’s Ulysses in the Digital Environment
  5. Alison Kay (Northumbria University)
  6. Hestiasari Rante; Michael Lund; Heidi Schelhowe (University of Bremen / Electronics Engineering Polytechnic Institute of Surabaya) A digital tool to support children understanding and designing the traditional batik patterns within a museum context
  7. Vincent Razanajao; Francisco Bosch-Puche; Elizabeth Fleming (Griffith Institute, University of Oxford) The Topographical Bibliography of Ancient Egyptian Hieroglyphic Texts, Statues, Reliefs, and Paintings
  8. Magdalena Turska et al. (IT Services, University of Oxford) The DiXiT Project
  9. Sarah Wilkin and Ylva Berglund Prytz (IT Services, University of Oxford) The Oxford Community Collection Model
  10. Pip Willcox (Curator of Digital Special Collections, Bodleian Libraries, University of Oxford) The Bodleian First Folio project
  11. Nicola Wilson (University of Reading) Modernist Archives Publishing Project
  12. Martin Wynne (IT Services, University of Oxford) CLARIN
  13. Mary Erica Zimmer (Boston University) Browsing the Bookshops of Paul’s Cross Churchyard

Evening Events

Monday Evening — Welcome Drinks Reception and Poster Session

Oxford University Museum of Natural History

On the evening of Monday 14 July 2014, there was a DHOxSS welcome reception from 7pm at the Oxford University Museum of Natural History, which had recently re-opened after a lengthy refurbishment. This reception gave DHOxSS delegates a chance to meet and talk to the other delegates and speakers. There was a peer-reviewed poster session (as described above) at this event.

Tuesday Evening — Guided Walking Tour of “Oxford Past and Present”

On the evening of Tuesday 15 July 2014, there was an Oxford Official Guided Walking Tour of “Oxford Past and Present”. This is the tourist information office’s main introductory tour of Oxford. The guides led delegates through the heart of the historic city centre illustrating the history of Oxford and its University and describing the architecture and traditions of its most famous buildings and institutions.

Wednesday Evening — DHOxSS Dinner at Wadham College

Wadham Hall

On the evening of Wednesday 16 July 2014 the DHOxSS Dinner was held in Wadham College Hall. A pre-dinner drinks reception was followed by a three-course meal in a stunning Oxford setting! A menu for the DHOxSS dinner is still available. There was no specific dress code for the event. The cost of the DHOxSS dinner (£52.50) was not included in the registration fee.

Thursday Evening — TORCH Open Lecture

Torch Public Lecture

Martin Roth, director of the Victoria and Albert Museum, gave the annual TORCH (The Oxford Research Centre in the Humanities) open lecture at DHOxSS 2014. This free public lecture was held on the evening of Thursday 17 July 2014 at the Mathematics Institute. Delegates and speakers from DHOxSS 2014 reserved a place when registering for DHOxSS.

More information was available from: http://www.torch.ox.ac.uk/martinroth.

Friday Evening — Informal Pub Trip

Victoria Arms Pub

On the Friday evening just after DHOxSS ended some organisers and delegates of DHOxSS 2014 walked from Wolfson College lodge to the nearby Victoria Arms public house. It is described on their website as: “The Victoria Arms sits on the banks of The Cherwell River, just a short way from the dreaming spires of Oxford city centre, but you could be in the depths of the countryside. With large sweeping gardens down to the river we are a perfect spot in the summer, whether you walk, drive or come by river on a punt.”

Teaching Venues

DHOxSS 2014 used three teaching venues, all within a 20-30 minute walk of each other. A Google Map of the important venues and routes is available at: http://tinyurl.com/dhoxss2014-map.

Morning Venues

The DHOxSS registration and all morning sessions were at Wolfson College (Linton Road, Oxford, OX2 6UD). Some information and photos of the teaching spaces at Wolfson College are available at https://www.wolfson.ox.ac.uk/conference/rooms. Information concerning travel to Wolfson and a map of the site are available at: https://www.wolfson.ox.ac.uk/how-get-here. We used the Leonard Wolfson Lecture Theatre, seminar rooms 1 to 3, and the Buttery as teaching venues.

Afternoon Venues

For most workshops the afternoon sessions were in the Thames Suite at IT Services – Banbury Road (13 Banbury Road, Oxford, OX2 6NN). Some information and photos of the Thames Suite at IT Services are available at http://www.oucs.ox.ac.uk/thamessuite/tour/.

The afternoon of the first day of the Introduction to Digital Humanities workshop took place at the Pitt Rivers Museum (the entrance is via the Oxford University Museum of Natural History on Parks Road, Oxford, OX1 3PW; the Pitt Rivers' entrance is at the far side of the ground floor). For the rest of the week the Introduction to Digital Humanities workshop remained at Wolfson College all day.

Both the Introduction to Digital Humanities workshop (except on the Monday) and the Data Curation and Access for the Digital Humanities workshop spent all day at Wolfson College, in the Lecture Theatre and Seminar Room 3 respectively. This was possible because the Introduction to Digital Humanities workshop was lecture-based and did not need computers for practical exercises, while the Data Curation and Access for the Digital Humanities workshop used students' own laptops, supplemented by some borrowed from IT Services.

Future DHOxSS events should consider the use of student laptops to enable a greater number of workshops, or larger ones, that are not limited by the size of the teaching rooms in IT Services.

Videos, Podcasts, Photos, and Social Media

Videos

This year, prior to the DHOxSS, two videos were created advertising Digital Humanities at Oxford. These included one on the DHOxSS itself:

http://www.youtube.com/watch?v=lBO7kT3D94A

as well as one more generally on Digital Humanities at Oxford:

http://www.youtube.com/watch?v=zdlOC0sFo5k

Podcasts

As part of our commitment to the creation of open educational resources the DHOxSS filmed the opening keynotes and additional parallel lectures.  These are available at http://podcasts.ox.ac.uk/series/digital-humanities-oxford-summer-school.

Episode Title

Ukiyo-e to Emoji: Museums in the Digital Age
Beyond Digital Humanities: Skills, Application and Collaboration
Electrifying the ‘Via Lucis’: communication technologies and republics of letters, past, present and future
Creating and Sustaining DH Teams: Scaling from the Smaller to the Larger, from the Individual to the Institution and Beyond
Restoration and revelation: how digital images are far more than simply photographs in the digital medium
Ancient Lives: Classics and Digital Humanities at Oxford
Panel – The Future of Data Access and Preservation
Obtaining the Unobtainable: The Holy Grail of Seed Funding for Small-Scale Digital Projects
If a picture is worth 1000 words what’s a medium quality scan worth?
Panel – Scholarly Digital Editing
Community, Community of Practice, and the Methodological Commons

At DHOxSS 2014 the budget for this filming was postponed until other expenses had been finalised. It is recommended that it be included in the initial budget for DHOxSS 2015.

Photos

For DHOxSS 2014, an open Flickr group (https://www.flickr.com/groups/dhoxss) was created and some attendees uploaded photos.

Social Media

The @DHOxSS twitter account was used extensively before and during DHOxSS 2014, and one delegate created an archive of @DHOxSS and #DHOxSS tweets. Various other social media were used as advertising locations, including popular DH mailing lists.

Mobile Events App

We trialled an event app http://guidebook.com/g/rbkxa9vs/ which received 106 unique downloads. The points of access for these were: iOS: 36; Android: 28; Web: 42. The guidebook.com events app enabled us to provide information concerning the summer school, maps, a detailed (and personalisable) schedule, information on evening events, sponsors, as well as various connections to social media. The use of mobile event apps may become an expected part of events like DHOxSS and the summer school should consider leveraging any solution adopted by DHOxSS stakeholders.

DHOxSS Statistics

Registrations

The registrations for DHOxSS 2014 were:

Registration Type Numbers
Oxford Students or Staff 11
Students 31
Standard 65
Corporate 2
 Total 109

This includes one registrant attending the OII's Summer Doctoral Programme who did not attend workshops, and at least one other who was unable to attend in the end. Only 11 registrations were from the University of Oxford. Although this is a reduction from 17 at DHOxSS 2013, it reflects the reduction in DHOxSS Oxford bursaries from 10 to 5. It would be beneficial to find other sources of funding or agreements to encourage the training of University of Oxford DPhil students and early-career researchers. The majority of registrations were 'Standard' registrations, which cover anyone who is neither a student nor working for a commercial corporation. Students from any university and staff from the University of Oxford received a slightly discounted registration fee.

This was the first year that DHOxSS had block bookings, where 10+ bookings from a single institution received a 10% discount. This required a single purchase order payment and for the originating institution to aggregate the booking details. This resulted in a booking of 14 registrations from the University of Edinburgh. However, there were additional administrative burdens and we should work to streamline this in some way in the future.

registration

Country of Origin

The largest group of DHOxSS 2014 delegates came from the United Kingdom, with 11 from Oxford and 42 from the rest of the UK; the USA was the largest single overseas contingent with 16. Added together, the rest of Europe forms the second-largest group after the UK.

Country Number of Students
Oxford 11
Other UK 42
USA 16
Canada 5
Greece 4
Sweden 4
South Africa 4
Spain 3
France 3
Italy 3
Netherlands 3
Poland 3
Ireland 2
Brazil 1
Chile 1
Germany 1
Denmark 1
Finland 1
Turkey 1
Total 109

country

Events

As discussed above there were evening events each night. The numbers below are students who registered for these events.

Event Number of Students
Monday Poster Reception 97
Tuesday Walking Tour 64
Wednesday Dinner 56
Thursday TORCH Lecture 79

events

Accommodation

DHOxSS acted as a broker for accommodation at Wolfson College for the week of DHOxSS 2014. This does not include speaker bookings at Wolfson, Keble, and St Hugh’s colleges.

Day Number of Student Accommodation Bookings
Sunday 54
Monday 58
Tuesday 58
Wednesday 59
Thursday 58
Friday 36

accommodation

Gender

Of the 108 students attending workshops, 75 were women and 33 were men. This means that 69.44% of DHOxSS 2014 registrants were female. However, a strong caveat must be made here: this is apparent gender, based solely on my own observations. DHOxSS 2014 did not collect gender statistics, but I have chosen to monitor such metrics unofficially because I want to track whether we continue to offer a welcoming environment to all those wanting DH training, which workshops are preferred, and how this compares to other DH events. This means that I have made my own determinations of apparent gender using basic binary categories. Clearly in a modern world this is not sufficient or representative of gender self-identity, but it is only intended as a basic metric. I do not think the increase from 67% in 2013 to 69.44% is statistically significant given the increase in numbers this year.

Workshop Men Women Total
Data Curation and Access 2 10 12
Humanities Web of Data 5 10 15
Introduction to Digital Humanities 14 40 54
Practical Scripting 2 10 12
TEI 10 5 15
  33 75 108

I am not sure whether any firm conclusions can be drawn from the attendance. The numbers suggest we offer a welcoming environment for women interested in DH, but we do not have any clear data on why this may be so.

gender

Finances

We were able to keep the registration costs the same as in the previous couple of years because we got a good deal from Wolfson College. Registration included the costs of the venues, lunch, workshops, speakers' expenses, and some of the evening events.

Registration Type Fee
Student (any institution/level) or University of Oxford Staff: you are enrolled as a (full-time or part-time) student at any educational institution at any level, or are a member of staff of the University of Oxford – £475
Standard: you work for an educational institution, library, charity or non-commercial organisation in any capacity – £575
Commercial: you work for a commercial or corporate organisation – £675

The DHOxSS attempts to be cost-neutral but any profits are put back into the following year’s summer school.

The DHOxSS is only able to run because of the selfless donation of time of the Workshop Organisers, Speakers, Event Administration and others. In 2014, as with previous years, the event administration and overall organisation was donated by IT Services. No speakers were paid to appear at DHOxSS 2014 but reasonable travel and accommodation costs were paid from the income of registration fees. The amount of time in aggregate donated by speakers, organisers, and administration is immense, and we are extremely grateful for this as the event would be impossible without it.

We had four sources of income: registrations, accommodation, banquet tickets and sponsorship. From these we raised approximately £78,000. The accommodation charges were passed on directly, with no profit being made. The general headings of expenses (in order of cost) were: hire of venues at a day delegate rate including lunches, delegate accommodation, speakers' expenses, the banquet, the welcome drinks reception, registration materials (bags, badges, lanyards, etc.), filming, documentation, a contribution to the TORCH drinks reception, the walking tour, and marketing. Precise amounts for each of these (and their breakdown) will be made available to the organisational committee. The most significant costs in any year are the venues (usually charged at a day delegate rate which includes lunch) and, when we are handling it, the accommodation. Against the income of approximately £78,000, we had expenses of approximately £77,700, which includes a deposit to St Anne's College for DHOxSS 2015.

Feedback

Both delegates and speakers were asked to fill in feedback surveys in order to capture what worked well and what could improve. A synthesis of this feedback will be provided to the DHOxSS 2015 organisational committee to help improve the DHOxSS for next year. 45 delegates had responded to the survey at the time of writing. A summary of their feedback is:

  • Plenary and parallel lectures: Most feedback found these ‘Good’ or ‘Excellent’ with comments noting some problems with the venue or audio-visual. Any negative comments on content were primarily describing a mismatch between the advertised abstract and the talk so we should remind speakers to take care when crafting their abstracts.
  • Workshops: These were ranked on the 'level of teaching', 'speed of teaching', 'range of topics', 'quality of lectures', 'quality of practicals', and 'overall quality of teaching'. All scored highly with mostly 'Good' or 'Excellent'. The Introduction to DH workshop had some comments about the lack of practicals (this was predominantly a lecture-based workshop). Some other workshops received comments that the speed of teaching was occasionally too fast in some technical talks.
  • General aspects of the DHOxSS as a whole: Delegates were asked to rank "overall academic content", "balance of workshops vs lectures", "balance of academic vs social content", and "having multiple venues". Each was almost entirely ranked 'Good' or 'Excellent'. Suggestions were made that each lecturer should produce a one-page handout of key terms/points and related reading, and several respondents mentioned the distance of Wolfson College from the city centre, but many positive comments were also received.
  • Teaching Venues: The teaching venues (Wolfson College: Lecture Theatre, Seminar Rooms 1-3 and Buttery; Pitt Rivers Museum; and IT Services: Isis, Evenlode, and Windrush Rooms) were all ranked. Mostly these received 'Good' or 'Excellent' ratings, but there were negative comments concerning the Wolfson College rooms relating to the heat (the week of DHOxSS was particularly warm and the 'passively cooled' rooms of Wolfson's Auditorium do not seem to cope well with this) and the constant noise of staff and college members outside the Buttery. A minority did not think the Pitt Rivers lecture theatre was a satisfactory venue.
  • Food and Drink: Delegates ranked the breakfasts at Wolfson (if staying there), the morning and afternoon tea breaks, and lunch. Generally these were rated well, but there were suggestions for more variety, more fruit and salad at lunch, and fruit as an option instead of biscuits at the breaks. Alternatives to tea and coffee were also suggested.
  • Evening Events: Delegates ranked the quality of food/drink, locations, quality of event, etc. for each of the evening events. All of these scored highly. The location and posters at the welcome reception were praised. The walking tour was a great success. Responses about the banquet suggested the after-dinner talk was too long and, while the location was excellent, the food was perhaps a bit overpriced. The TORCH lecture was ranked well, but some comments indicate that it was seen as too general.
  • Overall Organisation: Delegates ranked registration, administration, the online payment store, announcements/publicity, joining instructions, support during DHOxSS, website information, A/V and teaching facilities, and Wolfson accommodation (if they stayed there). These were all ranked quite strongly as 'Good' or 'Excellent', with Wolfson's accommodation receiving a few lower scores. Comments simultaneously thought Wolfson was great and complained about its beds and pillows, while other comments praised the running of the event and the communication received from the administrative staff.
  • How did they find out about the DHOxSS: Over 90% of respondents found out about DHOxSS from Colleague / Word of Mouth or the Website. Some noted that they found out from colleagues at their University or from an announcement during the DHSI summer school in Victoria.
  • Mobile Events App: Those who downloaded the mobile events app were asked to rate its usefulness. 83% (of 21 respondents) thought it was good or excellent. Comments noted that the social aspects didn't really take off, since social media already fulfils that need, or that it was overly complex for what it provided.
  • Workshops/Lectures/Tech for future DHOxSS: Delegates were asked to provide suggestions for next year. These ranged widely, with Visualisation, Mapping, Digital Publishing, Corpus Linguistics, and XSLT mentioned multiple times. A variety of other technologies, tools, and topics were also mentioned: all of these will be fed back to the DHOxSS 2015 Organisational Committee.

Plans for DHOxSS 2015

DHOxSS 2015 will be held from the 20-24 July 2015 at St Anne’s College and IT Services. We are starting the planning for this and will be setting the registration charges as soon as we have a clearer idea of all of the expenses.  The organisational committee has been re-formed to give a wider distribution across the stakeholders of the University. If you want to subscribe to our DHOxSS announcements mailing list, email: dhoxss-announce-subscribe@maillist.ox.ac.uk and confirm by replying to the confirmation email that gets sent to you. We will notify this mailing list when registration opens.

If you have any additional feedback or suggestions for 2015 do not hesitate to contact the Director: James.Cummings@it.ox.ac.uk.

Posted in DHOxSS | 1 Comment

Self Study (Part 7) Customising the TEI

Self Study (Part 7) Customising the TEI

This post is the seventh in a series of posts providing a reading course of the TEI Guidelines. It starts with

  1. a basic one on Introducing XML and Markup then
  2. on Introduction to the Text Encoding Initiative Guidelines then
  3. one on the TEI Default Text Structure then
  4. one on the TEI Core Elements then
  5. one looking at The TEI Header.
  6. and a sixth one on transcribing primary sources.

None of these are really complete in themselves and barely scratch the surface but are offered up as a help should people think them useful. This seventh post is looking at customising the TEI for your own uses.

The TEI has many different modules and lots of elements that you may or may not need for your project. One of the strongest aspects of the TEI Guidelines compared to other standards is that any project is able to constrain, customise, and extend the Guidelines. One reason for customising the Guidelines is that most projects do not need the vast array of elements they provide, and offering less choice reduces human error and speeds up encoding. The generalised Guidelines need to provide as much choice and flexibility as possible, in order to cope with the different needs of projects and the intellectual methods to be captured, and yet I would not be surprised if the consistency of a project is proportional to the degree to which it constrains that same flexibility.

Roma

The TEI Consortium provides a (quite dated) web interface, Roma, for customising the TEI. This allows you to do some sorts of customisation and is available at: http://www.tei-c.org/Roma/. You should explore it; it is fairly straightforward. I recommend doing the following:

  1. Visit http://www.tei-c.org/Roma/ and notice that you have various options on how to start your customisation, including being able to upload a customisation you had saved earlier.
  2. Choose the ‘Build Up’ method. This takes you to a screen which allows you to change some basic metadata about the customisation. If you change anything, click ‘Save’.
  3. Click on the ‘Modules’ tab to see a list of modules on the left, which you can ‘add’ to the customisation, and a list of modules on the right which have already been added to your customisation. Notice that the core, tei, header, and textstructure modules are already selected.
  4. Add a few more modules, maybe manuscript description, names and dates, critical apparatus, and transcription of primary sources.
  5. Clicking on any individual module name in the customisation you are making takes you to the list of elements in that customisation. For example, click on ‘Core’.
  6. Clicking on ‘Core’ takes you to a list of the elements in the ‘Core’ module. You can choose to ‘Include’ or ‘Exclude’ elements from your customisation (by clicking the radio buttons for each element or clicking the ‘Include’ or ‘Exclude’ at the top to include/exclude all of the elements).
  7. Choose to exclude certain elements from the Core module. For example, you may wish to remove ‘analytic’, ‘biblStruct’, ‘binaryObject’, ‘imprint’, ‘monogr’, and ‘series’, and then click ‘Save’ at the bottom of the screen.
  8. Choose the ‘Schema’ tab and look at the options for generating a schema. I recommend a Relax NG schema (compact or XML syntax). Choose one of these and click ‘Generate’ to create and download your schema.
  9. In an XML editor like oXygen you can associate a document with the schema that you have just generated. (Or take an existing TEI document associated with a schema and change the association to point to your new schema.) Hopefully this uses an xml-model processing instruction at the top of the file (see the example processing instruction just after this list). Maybe try this out! You should find that you are unable to use any of the elements you excluded!
  10. Back in Roma (you didn’t shut the browser down, did you? If so you would have to do the above again!) you should be able to return to the ‘Modules’ tab, click on the ‘textstructure’ module, and then notice that where the ‘div’ element is listed there is a ‘Change Attributes’ link on the far right-hand side. Click on it!
  11. This lists all the attributes available on div (some provided directly on the element, some by an attribute class it is a member of).
  12. Scroll down to the ‘type’ attribute and click on it. This takes you to some settings you can change about this attribute. Some of the things you can change include:
    • You can say that it is not optional (i.e. it is required)
    • You can say whether it is a closed list (whether the values you provide are the only ones)
    • You can provide a list of comma-separated values

    I suggest that you say that it is not optional, a closed list, and give “chapter,section,other” as values. Remember to click ‘Save’.

  13. You could now go back to the ‘Schema’ tab, generate and download a schema, and re-associate it in your document (your operating system will most likely name it something different if there is already a file there…another option is to move the previous schema out of the way).
  14. Something else you should do is click on the ‘Save Customization’ tab. This should download an XML file (‘myTEI.xml’ if you didn’t change the filename on the ‘Customize’ tab).
  15. Open that ‘myTEI.xml’ customisation in your XML editor and have a look at it. This records all of the details of your customisation.
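
For step 9 above, associating a document with the schema you have generated is typically done with an xml-model processing instruction at the very top of the TEI file. A sketch, assuming you generated a Relax NG (XML syntax) schema and saved it as myTEI.rng next to your document (the filename here is purely illustrative):

<?xml-model href="myTEI.rng" type="application/xml"
            schematypens="http://relaxng.org/ns/structure/1.0"?>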

The saved myTEI.xml customisation should look something like:

schemaSpec
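
Roma's saved file wraps the specification in a full TEI document with a header, so the following is only a hand-written sketch of the relevant <schemaSpec> part, following the choices made in the walkthrough above; the file Roma actually generates will differ in ordering and detail.

<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="myTEI" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"
             except="analytic biblStruct binaryObject imprint monogr series"/>
  <moduleRef key="textstructure"/>
  <moduleRef key="msdescription"/>
  <moduleRef key="namesdates"/>
  <moduleRef key="textcrit"/>
  <moduleRef key="transcr"/>
  <elementSpec ident="div" module="textstructure" mode="change">
    <attList>
      <attDef ident="type" mode="change" usage="req">
        <valList type="closed" mode="replace">
          <valItem ident="chapter"/>
          <valItem ident="section"/>
          <valItem ident="other"/>
        </valList>
      </attDef>
    </attList>
  </elementSpec>
</schemaSpec>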

Here a <schemaSpec> element contains <moduleRef> elements for each of the modules you included. In this case Roma defaulted to an ‘exclusion’ method of referencing the elements (that is, “give me all elements from the ‘core’ module except this list of elements”). Using the ‘include’ attribute instead would have let us give a list of specific elements to include. The difference between these is that if you save this customisation and come back to Roma at some point in the future, the exclusion method will pick up any new elements added by the TEI, whereas the inclusion method will never gain any new elements. Both approaches have their uses. Below that you have documented a change to the <div> element (using an <elementSpec>), where the <attDef> element records that use of the ‘type’ attribute is required and has a closed <valList> replacing the existing one.

TEI ODD

Your customisation is written using the TEI ODD language, the part of the TEI Guidelines for describing markup. ODD stands for ‘One Document Does-it-all’, so named because from this one file you can generate project-specific documentation (the ‘Documentation’ tab in Roma). There are elements in the TEI for phrase-level discussion of markup (the <gi>, <att>, and <val> elements) as well as ways to document the customisation or extension of a schema (e.g. the <schemaSpec>).
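
For instance, a sentence of project documentation inside an ODD might be encoded like this (a sketch; the wording is invented purely for illustration):

<p>Each chapter of the edition is encoded as a <gi>div</gi> with its
  <att>type</att> attribute given the value <val>chapter</val>.</p>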

Read more about these documentation elements at: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/TD.html. When you have done so, you should have no problem in answering these questions (to check for yourself that you have read it):

  1. What is the difference between a <gi> element and a <tag>?
  2. What does the ‘atts’ attribute on <specDesc> record?
  3. What is the difference between <gloss>, <desc>, and <remarks>?
  4. How does one use the <equiv/> element?
  5. What is the difference between an <eg> and an <egXML>?
  6. What is a <content> element used for?
  7. Why might you want to use a <constraintSpec>?
  8. How do you provide a <gloss> for an <attDef>?
  9. What is a <classSpec> for?

There are many other important parts of this chapter, but if you understand the above that is a good start.

An example ODD showing some of the basic techniques (with a lot of documentation) from the LEAP project is available at: https://github.com/jamescummings/LEAP-ODD/blob/master/leap.odd.xml

Posted in SelfStudy, TEI | Leave a comment

Auto Update Your TEI Framework in oXygen

One of the great things about the oXygen XML Editor that I use is that it allows frameworks as add-ons (from version 14+, though actually for the TEI one you need 15.2+) for various document types. These can consist of template documents, XSLT files for transformations, CSS, and all manner of customisations to oXygen.

The TEI Consortium jointly maintains an open source and openly-licensed oxygen-tei framework at http://github.com/TEIC/oxygen-tei.

I’ve been asked a number of times to explain to someone how they can keep the TEI framework in their oXygen installation up to date automatically with releases of the TEI P5 Guidelines (and thus the underlying schema) as well as releases to the TEI-XSL Stylesheets.

The process for this isn’t entirely intuitive, but is not too difficult if you follow the steps below.

Add the oXygen-TEI Add-on

Go to Options/Preferences -> Add-ons and click ‘Add’
oxygen-tei-update1

The updateSite.oxygen File

Add the URL http://www.tei-c.org/release/oxygen/updateSite.oxygen. This is a file which is updated every time there is a TEI Guidelines or Stylesheets release. Click ‘OK’.

oxygen-tei-update2

Automatic Updates

Back in the preferences window check ‘Enable automatic updates checking’ and click ‘OK’.
oxygen-tei-update3

Check for Updates

Go to the ‘Help’ menu and select ‘Check for add-ons updates’
oxygen-tei-update4

Update Available!

If there has been a TEI Guidelines or Stylesheets update since you last updated (or installed oXygen) then you should get prompted to install an update. Click ‘Review updates’.
oxygen-tei-update5

Review Updates

Click the checkbox next to ‘TEI P5’ and then click ‘Install’.
oxygen-tei-update6

Downloading…

The oXygen-TEI framework package will be downloaded, at a speed depending on your internet connection.
oxygen-tei-update7

Install the Update

Once oXygen has downloaded the package you must tick the box to agree to all the license terms. (All TEI Consortium materials are dual-licensed as BSD 2-Clause and/or Creative Commons Attribution.) Accept, and then click ‘Continue’.
oxygen-tei-update8

Warning: Valid Signatures

When you install the package you’ll be warned that it doesn’t have valid signatures. If you trust the TEI Consortium then you should click ‘Continue anyway’.
oxygen-tei-update9

To Complete the Update: Restart oXygen

In order to have oXygen start using the new framework, you must restart the application.
oxygen-tei-update10

Next Time

Next time there is a TEI Guidelines or TEI XSL Stylesheets release you will get prompted to install the updates. That is all there is to it!

Posted in TEI, XML | 13 Comments

oXygen-like CSS Colours for XML

I use oXygen as my preferred XML editor. I highly recommend it; it has lots of good features. By default, XML markup in oXygen looks something like:

xml-example

When writing encoding documentation for users, I thought it might be nice to have the CSS for the HTML manual (generated from a TEI P5 ODD XML customization file) use the same colours as in the editor itself. Rather than try to guess the colours I asked the nice oXygen people and they told me that they are:

  • XML text nodes (plain text) – “0,0,0”; #000000
  • ‘<‘, ‘>’ – “0,95,242” ; #005FF2
  • XML tag name – “0,0,150”; #000096
  • XML comments – “0,100,0”; #009600
  • XML DOCTYPE declaration – “0,0,255”; #0000FF
  • XML EMBEDDED DOCTYPE, “0,0,255”; #0000FF
  • XML QUOTED VALUE – “153,51,0”; #993300
  • XML ENTITY – “150,150,0”; #969600
  • XML ATTRIBUTE NAME – “245,132,76”; #F5844C
  • EQUAL sign (for attribute value) – “255,128,64”; #FF8040
  • XML PROCESSING INSTRUCTION – “139,38,201”; #8B26C9
  • XML CDATA section – “0,140,0”; #008C00
  • XML PROLOG – “139,38,201”; #8B26C9
  • not well-formed fragment – “255,0,0”; #FF0000

Although my example above doesn’t show all of these, it now means that my CSS can use these values for colouring XML markup. Thanks oXygen!

Posted in XML | 5 Comments

Projects, Blog Pages, and Trello

Using Trello and Blogs

I have been using Trello to manage project to-do lists for a while now. I also have a set of pages on this blog http://blogs.it.ox.ac.uk/jamesc/projects/ where I openly store descriptions and basic information about a subset of the projects I am working on. In order to give those who care an easy overview of where particular projects are at, I can give them access to my Trello board for that project. However, copying that information out into a table or something here on my blog seems a waste of time. But, I wondered, is there a way to embed the Trello board in my blog post? That would not only save me trouble, but since the board would have to be ‘public’ it would also make me work in the open, which is generally a good thing.

What Trello Provides

Trello, as you may know, is a clever collaborative to-do list managing web app which has a concept of ‘Boards’, each of which can have multiple ‘lists’, and each list can have multiple ‘cards’ on it. (And indeed, each ‘card’ can also have multiple comments, attachments, checklists, due dates, etc.) It is really easy to create boards/lists/cards and to move cards between lists, sort them on lists, etc. There is a handy Android app (I hear there is also an iOS version) that is feature-rich. On each card you can see things like due dates and who has been assigned that task.

So when you create a board it gets a unique URL like https://trello.com/b/g9mdhdzg/test-trello-board, which is a test Trello board I set up with some dummy lists and cards. For public boards, Trello will serve different renderings from this URL if we strip off the name (but leave the ID) and append different file suffixes; the .png suffix used in the embed below, for instance, returns an image snapshot of the board.

WordPress Security

All that sounds great, and you would think I could just embed an iframe with the Trello board in it and it would always be up to date. Unfortunately the way WordPress is set up here (and in many places by default) is not to allow iframes in your posts/pages. This makes sense: including lots of remote content on your blog is potentially dangerous, in that the content could later change and misrepresent your institution. But what you can do is embed an image and link to the board. So I do:

<div>
<h2>Trello Board</h2>
<a href="https://trello.com/b/g9mdhdzg">
<img alt="" src="https://trello.com/b/g9mdhdzg.png" />
</a>
</div>

And this gets me the below:

Trello Board



Limitations

I just mention this in case you are using Trello and want to embed your boards in WordPress. For an example of me actually using this, see my LEAP project page. There are several limitations (which I have mentioned to Trello). Even when you look at a particular card:


you only get to see very basic details (e.g. you can’t see the items on the card’s checklist). Moreover you can’t get a unique ID to do this with an expanded view of the card.

Also, as you can see from the image above you only really see 3 lists, and you get blank space under the lists if they are short (or get them truncated, presumably, if they are very long).

Also, if you click on the Test Trello Board above and go to it, you will notice it looks slightly different. That is because I have a ‘power-up’ called ‘card-aging’ on it, which makes the cards look old when I have not touched them for a few weeks, but here they look normal. I have also changed the background colour, whereas the standard Trello blue is used in this image. Perhaps Trello might change these things at some point.

I hope that is useful or interesting. I plan to add more Trello Boards to my project pages in the future.

Posted in Project | Leave a comment

Batch File Renaming from CSV and Image Resizing

Renaming the files

A colleague in a different section (Software Solutions) of IT Services asked if I could help with renaming and processing some images provided by the Bodleian Library. The Library provided huge TIFF files of manuscripts of Wycliffite Bibles, and he had a spreadsheet with column A being a filename (without extension) and column B being the image ID in the website he is building for the project.

That is, it looked something like:

Image File Name Image ID Number
abc0014 123

with many, many, more entries of course!

The script I wrote to rename the files was:


#!/bin/bash

# Set up variables
INPUT_DIR="input"
OUTPUT_DIR="output"
CSV_FILE="filenames.csv"
EXT=".tif"

# separate on newlines only
IFS=$'\n'

# Loop all lines in CSV
CSV_LINES=($(cat "$CSV_FILE"))
for CSV_LINE in "${CSV_LINES[@]}"
do
    OLD_NAME=`echo "$CSV_LINE" | grep -o '^"[^"][^"]*"' | sed 's/"//g'`$EXT
    NEW_NAME=`echo "$CSV_LINE" | grep -o '"[^"][^"]*"$' | sed 's/"//g'`$EXT

    # output message if $OLD_NAME doesn't exist
    if ! [ -f "$INPUT_DIR/$OLD_NAME" ];
    then
        echo "No $OLD_NAME"
    fi

    # continue only if old file actually exists
    if [ -f "$INPUT_DIR/$OLD_NAME" ];
    then
        # continue only if new filename given
        if [ -n "$NEW_NAME" ];
        then
            # copy file to new name
            cp -vf "$INPUT_DIR/$OLD_NAME" "$OUTPUT_DIR/$NEW_NAME"
            mv "$INPUT_DIR/$OLD_NAME" "$INPUT_DIR/processed/$OLD_NAME"
        fi
    fi
done

IFS=$' \t\n'

What this did was go through the CSV and if there was both an old name (column A) and a new name (column B) it copied the file to a new output directory. It also moved a copy of the input file out of the way into a ‘processed’ directory just to note that it had been processed. There are a lot of other ways one could do this, in bash and using other technologies, but you use the hammer you happen to have to hand.

Bulk resizing of images

Over the years I have done lots of image processing: cropping, resizing, rotating, extracting metadata, etc. My tool of choice for this is ImageMagick, which is cross-platform and incredibly powerful. It can do very complicated things, but simple things like scaling images, cropping them, making montages, etc. are all fairly easy. The more difficult things take a bit of trial and error but really reward study; monotonous repetitive things like this are really quite easy.

Because I thought I might run this command many times, I created a Makefile with a ‘resizeImages’ target.

resizeImages:
        cd output; \
        echo "Converting full sized" ;\
        for file in *.tif; do convert $$file[0] ../converted-images/full/`basename $$file .tif`.jpg; done; \
        cd ../converted-images/full/ ; \
        echo "Doing large / medium / small / thumbs / tiny" ; \
        for file in *.jpg; do echo "Doing $$file" ; \
        convert -scale 1000 $$file ../large/$$file; \
        convert -scale 500 $$file ../medium/$$file; \
        convert -scale 250 $$file ../small/$$file; \
        convert -scale 150 $$file ../thumb/$$file; \
        convert -scale 50 $$file ../tiny/$$file; done; \
        echo "Done";

What this does to begin with is to go into the output directory (from the renaming of files above), and using a standard bash for loop, takes each TIFF file and does:

convert $file[0] ../converted-images/full/`basename $file .tif`.jpg

This uses ImageMagick’s ‘convert’ utility to convert one of the TIFF files, say 123.tif, to a JPEG. The $file ($$file above in the Makefile) has a [0] after it because we only want the first image embedded in the TIFF file. (The second is an embedded thumbnail.) It puts the output in the directory at ../converted-images/full/ and names the file using the ‘basename’ command. This Linux command enables us to strip off the extension from ‘123.tif’ and be left with ‘123’, to which we append ‘.jpg’ to tell ‘convert’ that we want the output to be a JPEG. If we were converting just one file this might be:

convert 123.tif[0] ../converted-images/full/123.jpg

which is a really easy way of converting image file formats. Wrapping it in a for loop is just a convenient way to get the right filenames; ImageMagick can also cope with wildcards in various ways.

After this Makefile target has converted all of the TIFFs to JPEGs, it changes directory to converted-images/full/ and reports that it is now converting them to the large/medium/small/thumb/tiny sizes. This uses another simple bash for loop, just saying

for file in *.jpg; do

and then a list of commands before ending with ‘done’.

In this case the commands use ‘convert’ to scale the full-sized JPEGs to some agreed widths: large (1000px), medium (500px), small (250px), thumb (150px) and tiny (50px). Since only the width is given, the height will automatically scale to whatever is needed to maintain the aspect ratio. There is a lot more that could be said about ImageMagick geometry.

This ends up with files under converted-images in full, large, medium, small, thumb, and tiny directories, with one image file per ID in the CSV file in each of those directories (which the for loop assumes you have already made).

Posted in Bash, images | 1 Comment

Report on the Digital Humanities at Oxford Summer School 2013

The Digital Humanities at Oxford Summer School (DHOxSS) is one of the premier international training events in digital humanities. DHOxSS 2013 took place on 8 – 12 July 2013 at Wolfson College. As overall director of the DHOxSS, I thought it a good idea to write a blog post reviewing the summer school, its content, statistics, feedback, and plans for 2014.

The DHOxSS 2013 had a programme of five parallel workshops running all week. The daily schedule usually began with introductory parallel plenary lectures by invited scholars on topics relating to digital humanities, followed by introductory lectures as part of the workshops, and then lunch (included in the registration cost); most students then moved to IT Services where they received more lectures, with practical exercises done on desktop computers there. Two exceptions to this were the XSLT workshop (whose students used borrowed laptops and remained in Wolfson all day) and the Cultural Connections workshop, whose students did desk-based (rather than computer-based) practical work.

Students at Work

Content

Morning Lectures:

The morning lectures that were available at DHOxSS 2013 included:

Workshops

After their morning lectures students attended the five-day workshop that they had booked on. DHOxSS discourages switching between workshops because they are intended to build up over the week and students suddenly appearing in other workshops can be disruptive to the group work already organised. In the morning, since the students were at Wolfson College and not sat in front of computers, they received introductory lectures on the workshop topic.

At DHOxSS 2013 we had five parallel workshops, which were:

  1. Cultural Connections: exchanging knowledge and widening participation in the Humanities
  2. How to do Digital Humanities: Discovery, Analysis and Collaboration
  3. A Humanities Web of Data: publishing, linking and querying on the semantic web
  4. An Introduction to XML and the Text Encoding Initiative
  5. An Introduction to XSLT for Digital Humanists

After the lunch at Wolfson College, students continued in their workshops (either in IT Services, Radcliffe Humanities, or Wolfson College).

Evening Events

DHOxSS attempts to provide students a good social environment in which to network and enjoy a variety of evening events. While these events are optional, we feel they are part of the content of DHOxSS and thus included an event of some sort each evening:

  • Monday Evening Welcome Drinks Reception: this welcome reception gave the students a chance to mingle at Wolfson College.
  • Tuesday Evening Poster Reception: this reception featured digital humanities posters submitted by delegates in response to an open call. It was sponsored by the Oxford Research Centre in the Humanities and took place in St Luke’s Chapel on the Radcliffe Humanities site. Unfortunately, maximum occupancy levels meant that not all of the DHOxSS could attend and this will be changed for DHOxSS 2014 when looking for a reception site.
  • Wednesday Evening Public Lecture: “Scholarly Social Machines” by Professor Dave De Roure, sponsored by the Oxford Research Centre in the Humanities. This lecture was followed by a smaller drinks reception and took place at the Wolfson College Lecture Theatre.
  • Thursday Evening Banquet: This banquet took place at the historic Queen’s College, and was not included in the registration charge. Feedback indicates that the quality of food received was not in-line with the price and so DHOxSS will be looking for a different venue in the future.
    Sebastian Punting
  • Friday Informal Pub Trip: On the Friday evening, as many students departed, some of those who were left went to the Victoria Arms pub for a relaxing evening. Some students even got a chance to punt!

DHOxSS 2013 Statistics

As you might expect we can produce a variety of statistics relating to the DHOxSS 2013. The total number of students booked on the summer school was 88. Of these, 26 registered as students, 50 registered as academics, and 2 registered as corporate. A ‘student’ in this case is anyone studying in higher education, whereas ‘academic’ was anyone working for an academic or non-profit institution. In addition the OUP John Fell Fund provided 10 Oxford DPhil or Oxford Early-Career Bursaries.

Workshops

The workshops had the following initial registrations on them:

  • Workshop 1: Cultural Connections: 13
  • Workshop 2: How to do DH: 22
  • Workshop 3: A Humanities Web of Data: 14
  • Workshop 4: XML and TEI: 32
  • Workshop 5: XSLT: 7

Plenary Lectures

The statistics on who attended which parallel morning plenary are of course hard to track as students were free to swap between them. However, at time of registration the expressed interest was:
Parallel Lecture at Wolfson College

  • Monday: Opening Keynote: 88
  • Tuesday: Parallel Session 1a (Varieties of Openness): 47
  • Tuesday: Parallel Session 1b (Studying People Who Can Talk Back): 35
  • Wednesday: Parallel Session 2a (Re-imagining the First World War): 35
  • Wednesday: Parallel Session 2b (CodeSharing): 48
  • Thursday: Parallel Session 3a (Agent-Based Computer Modelling): 21
  • Thursday: Parallel Session 3b (Digital Libraries): 60
  • Friday: Closing Keynote: 88

Accommodation

In total 249 nights of accommodation were booked via DHOxSS at Wolfson College. (Some DHOxSS students chose to stay elsewhere.)

Gender

Gender was not an aspect that DHOxSS gathered in the registration process, but having met the students and returned to their registration records later, if basic binary gender categories were to be applied the statistics would be that, of the 88 students, 59 were female (67.04%) and 29 male (32.95%). If one includes everyone, both students and tutors (some appearing in workshops for only 10 minutes), then there were a massive 149 people, of whom 82 were female (55.03%) and 67 male (44.96%). Although these are a very rough measure (and ignore any form of self-identification) I believe it is generally important to track them. These numbers are good, but they show more male tutors than female; in many cases this is because DHOxSS is reliant on the goodwill of those who happen to be undertaking digital humanities research in Oxford, which is out of our control.

Nationalities

Nationality is, of course, also a difficult category to assess. In this case what has been used is the address given by the student when registering, which may be highly inaccurate given the peripatetic lifestyle of early-career researchers! For the 87 students for whom I was able to retrieve this data, the countries were as follows:

  • 36 from the United Kingdom, of which 17 were from Oxford
  • 16 from the USA
  • 5 from Italy
  • 4 from Ireland
  • 3 from Poland; 3 from the Netherlands; 3 from Denmark
  • 2 from Sweden; 2 from France; 2 from Belgium;
  • 1 each from Austria, Canada, Germany, Greece, Iraq, Israel, Lebanon, Portugal, Singapore, Spain, and Switzerland

Finances and Registration Costs

The DHOxSS as an event attempts to break even in its costs. It does not pay tutors for teaching, though it does cover costs such as travel and accommodation, and it does not charge for any of the staff time involved in its organisation. (To do so would make it prohibitively expensive!) All income is spent on room rental, materials provided, evening events, etc. It is underwritten by IT Services at the University of Oxford as an outreach event, and staff from around the University (and outside) donate their time. The registration costs for DHOxSS 2012 and DHOxSS 2013 have been kept static and it is our intention to keep them the same for DHOxSS 2014. We make an initial budget based on how many students we think will register, and if more students register we increase things such as the extras provided in the DHOxSS bag, or the quality of food and drink at the receptions.

The registration costs for DHOxSS 2013 (and 2012-2014) were:

  • Student: £475;
  • Academic: £575;
  • Commercial: £675

Feedback

The DHOxSS collects feedback on all aspects of the summer school. In general most aspects were rated as ‘good’ or ‘excellent’, which is pleasing; however, we do take note of the feedback and look at the comments provided in detail. I will summarise some of the general aspects here. Aside from a known problem with the first day or so of one workshop (which we have put processes in place to avoid in the future), most of the workshops received glowing reviews, rating the speed, level, range of topics, and quality of talks, exercises and handouts as excellent. The majority of respondents rated all of the morning plenaries either good or excellent. Looking at the DHOxSS as a whole, an overwhelming majority felt that the overall academic content was good or excellent, and the same for the balance of workshops vs lectures and academic vs social content. Some feedback indicates that having multiple venues was not as positive, but we are limited by the necessity for some workshops of using desktop computers. In general the individual teaching venues were rated good or excellent, though there were complaints of one venue in particular being too hot, which will be investigated more thoroughly when selecting venues for next year. The lunches (included in the cost of registration) and refreshment breaks generally received good or excellent ratings, though with some comments noting individual problems that have been fed back to those providing these refreshments. The quality of the evening events received glowing good and excellent ratings except for the Banquet, where feedback indicates the location was good but the service and food were not as good as expected. We will take that into consideration when looking for a venue for DHOxSS 2014. The organisational and administrative aspects overall received good and excellent ratings.

All feedback was distributed to the DHOxSS Organisational Committee and will be used to improve DHOxSS in the following years.

Plans for 2014

DHOxSS 2014 is scheduled for 14 – 18 July 2014 at Wolfson College and IT Services. More news will in time be available from http://digital.humanities.ox.ac.uk/dhoxss/2014/. Workshop proposal forms have been circulated inside Oxford and will be reviewed by the DHOxSS organisational committee in early November. By the end of November we hope to be able to announce basic details of the DHOxSS 2014 workshops and other deadlines. We intend to repeat the peer-reviewed poster reception which allowed students to display their work. We intend for the registration costs, which include lunches but not accommodation, to be the same as the previous couple of years (Student: £475; Academic: £575; Commercial: £675).

Dr James Cummings

Director of DHOxSS
Posted in DHOxSS | Leave a comment

ODDly Pragmatic: Documenting encoding practices in Digital Humanities projects

[This is the rough draft text of a plenary lecture I will have given at JADH2013 http://www.dh-jac.net/jadh2013/abst21.html#plenary2 (click to expand abstract). This isn’t necessarily the precise text I delivered but the notes from a couple days before.

It is written very much to go with the slides, really a prezi, at: http://tinyurl.com/jc-JADH2013 and doesn’t really make lots of sense without it.  (I’m not claiming it makes lots of sense with it either!) It re-uses much material I’ve discussed and written about in other locations so I’m not making claims of originality either. Credit is due to everyone involved in the TEI, DH Projects mentioned, and many ideas from the DH community at large. All errors and misrepresentations are mine and unintentional, I apologise in advance. The intention is to superficially expose a slightly larger audience at JADH2013 to some of the concepts and benefits of TEI ODD Customisation.]

 The TEI

Use of the TEI Guidelines for Electronic Text Encoding and Interchange is often held up as the gold standard for Digital Humanities textual projects. These Guidelines describe a wide variety of methods for encoding digital text and in some cases there are multiple options for marking up the same kinds of thing. The TEI takes a generalistic approach to describing textual phenomena consistently across texts of different times, places, languages, genres, cultures, and physical manifestations, but it simultaneously recognises that there are distinct use cases or divergent theoretical traditions which sometimes necessitate fundamentally different underlying data models.  Unlike most standards, however, the TEI Guidelines are not a fixed entity as they give projects the ability to customise their use of the TEI — to constrain it by limiting the options available or extending it into areas the TEI has not yet dealt with. It is this act of customisation and the benefits of it that I will speak of today.

But what is the TEI?

The Text Encoding Initiative Consortium (TEI) is an international membership consortium whose community and elected representatives collectively develop and maintain the de facto standard for the representation of digital texts for research purposes. The main output of the community is the TEI Guidelines, which provide recommendations on encoding methods for the creation of digital texts. Generally the TEI is used by academic research projects in the humanities, social sciences, and linguistics, but also by publishers, libraries, museums, and individual scholars, for the creation of digital texts for research, teaching, and long-term preservation.

It is also a community of volunteers, institutions like the University of Oxford donate a fraction of staff time (like part of mine) towards the TEI, as do other institutions with elected volunteers or contributors working on research projects.

The TEI is also the outputs that it creates, such as the Guidelines themselves, definitions and examples of over 530 markup distinctions, and various transformation software to convert to and from the TEI. It is also a consensus-based way of structuring textual resources: it is not determined by the weight of a single institution or commercial company but by the Technical Council members elected by the membership. The TEI is a way to produce customised, internationalised schemas for validating a project’s digital texts. It is a format that allows you to document your interpretation and understanding of a text, but it is also a well-understood format suitable for long-term preservation in digital archives. But most of all, it is a community-driven standard, so it is a product of all of those involved in it.

What the TEI is not:

It isn’t the only standard in this area. It is the most popular but there are others, and people re-invent the wheel unnecessarily all the time. It isn’t objective or non-interpretative: the application of markup is an interpretative act that shouldn’t just be left to junior research assistants – it is the intellectual and editorial content of a digital text. The TEI isn’t used consistently in different projects, and often not even in the same project. (Which is why TEI customisation for consistency is an important form of documentation.) The TEI isn’t fixed and unchanging. Unlike most standards which are static the TEI evolves as the community finds new and important textual distinctions. But customisation gives you a way to document precisely what version of the TEI you are using. It isn’t your research end-point: The creation of a collection of digital texts isn’t an end in itself — it is what you can then do with those texts, the research questions they can enable you to answer that is important.

Nor is it automatic publication of your materials in a useful way. Any off-the-shelf TEI publication system will need customising to deal with the specific and interesting reasons you were encoding these texts in the first place. In general, though, experience teaches us that the benefits of a shared vocabulary far outweigh any difficulties in adoption of the TEI.

Generalistic Approach:

The TEI takes a generalistic approach to describing textual phenomena consistently across texts of different times, places, languages, genres, cultures, and physical manifestations, but it simultaneously recognises that there are distinct use cases or divergent theoretical traditions which sometimes necessitate fundamentally different underlying data models. The ability to customise the TEI scheme is something which sets it apart from other international standards. At first glance this may seem contradictory: how can one have a standard that any project is allowed to change? This is because the TEI’s approach to creation of this community-based standard is not to create a fixed entity, but to provide recommendations within a framework in which projects are able to extend or constrain the scheme itself. They can constrain it by limiting the options available to their project or extend it into areas not yet covered by the TEI.

It is nonsensical for a project to dismiss use of the TEI because it does not yet have elements specific to its needs as that project is able to extend it in that direction.

This combination of the generalistic nature and ability to customise the TEI Guidelines is both one of its greatest strengths as well as one of its greatest weaknesses: it makes it extremely flexible, but this can be a barrier to the seamless interchange of digital text from sources with different encoding practices. Any difficulty can be lessened by documentation through proper customisation.

TEI ODD Customisation

Every project using the TEI is dependent upon some form of customisation (even if it is the ‘tei_all’ customisation, with everything in it, that the TEI provides as an example). The TEI has many elements covering textual distinctions from linguistics and the marking up of speech to transcribing or describing medieval manuscripts and more. The TEI organises all these elements into a wide array of modules. A module is simply a convenient way of grouping together a number of associated element declarations. Sometimes, as with the TEI’s Core module (containing the most common elements), these may be grouped together for practical reasons. However, it is more usual, as with the ‘Dictionaries’ module, to group the elements together because they are all semantically related to one particular sort of text or encoding need. As one would expect, an element can only appear in one module, lest there be a conflict when modules are combined.

Almost every chapter of the TEI Guidelines has a corresponding module of elements. In the underlying TEI ODD customisation language both the prose of that chapter of the Guidelines and the specifications for all the elements are stored in one file. It is from this file that both the TEI documentation and the element relationships that are used to generate a schema are created. So there is a chapter on dictionaries and it also creates the module for dictionaries.

The TEI method of customisation is written in a TEI format called ‘ODD’, or ‘One Document Does-it-all’, because from this one source we can generate multiple outputs such as schemas, localised encoding documentation, and internationalised reference pages in different languages. A TEI ODD file is a method of documenting a project’s variance from any particular release of the full TEI Guidelines. The TEI provides a number of methods for users to undertake customisation ranging from intuitive web-based interfaces to authoring TEI ODD files directly. These allow users to remove unwanted modules, classes, elements, and attributes from their schema and redefine how any of those work, or indeed add new ones. One of the benefits of doing this through a meta-schema language like TEI ODD is that these customisations are documented in a machine-processable format which indicates precisely which version of the TEI Guidelines the project was using and how it differed from the full Guidelines. This same format is what underlies the TEI’s own steps towards internationalisation of the TEI Guidelines into a variety of languages (including Japanese).

This concept of customisation originates from a fundamental difference between the TEI and other standards — it does not try to tell users that if they want to be good TEI citizens they must do something this one way and only that way, but instead while making recommendations it gives projects a framework by which they can do whatever it is that they need to do but document it in a (machine-processable) form that the TEI understands. This is standardisation by not saying “Do what I do” but instead by saying “Do what you need to do but tell me about it in a language I understand”.

The result of a customisation might be only to include certain modules, and by doing so lessen the amount of choice available when using a generated schema to encode a digital text. But of course, even inside these modules there will be elements that your project does not need.

ROMA

We do not necessarily need to learn the underlying TEI ODD format to create our customisation. The TEI community provides various tools to do this, such as ‘Roma’ which is a basic web interface for creating customisations. It gives you a way to build up from the most minimal schema, reduce down from the largest possible one, use one of the existing templates, use one of the common TEI example customisations, or upload a customisation that you had saved previously.

And of course the TEI strongly believes in internationalisation, so wherever we can get volunteers to translate the website and the descriptions of elements into their own languages, we can incorporate that into the interface. What’s more, this means the schemas you generate can have glosses and tooltips in your XML editing software that come up in that particular language.

On the ‘Modules’ tab we see a list of all of the modules and it is an easy thing to click ‘Add’ on the lefthand side and the modules will then be included in our schema and appear on the list on the righthand side. Removing them is just as easy.

Clicking on any of the modules enables us to include or exclude the elements we want from the schema we are building.

But what is happening underneath? In this case we’re generating a TEI ODD XML file which stores the changes we have made. We document that we want to include these modules, but also that we want to delete these elements, or, in the case of the last one, an attribute on an element. Back in the web interface we could look at the attributes for the <div> element and choose to include or exclude those that we want.

And for each of those attributes, here the @type attribute, we could choose whether it was required or not, whether its list of values was closed or open, and what those values might be.

Again, underneath this is XML that documents how we are changing the TEI schema: here making the @type attribute required and giving it a closed value list of prose, verse, drama, and other.
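Underneath, the ODD being maintained for us looks something like the following sketch (the modules, deleted elements, and attribute choices here are illustrative rather than taken from a real project's file):

  <schemaSpec ident="myProject" start="TEI">
    <!-- the modules we asked for -->
    <moduleRef key="tei"/>
    <moduleRef key="header"/>
    <moduleRef key="core"/>
    <moduleRef key="textstructure"/>
    <!-- elements we deleted -->
    <elementSpec ident="said" mode="delete"/>
    <elementSpec ident="quote" mode="delete"/>
    <!-- changes to the div element: delete one attribute,
         make @type required and give it a closed value list -->
    <elementSpec ident="div" mode="change">
      <attList>
        <attDef ident="decls" mode="delete"/>
        <attDef ident="type" mode="change" usage="req">
          <valList type="closed">
            <valItem ident="prose"/>
            <valItem ident="verse"/>
            <valItem ident="drama"/>
            <valItem ident="other"/>
          </valList>
        </attDef>
      </attList>
    </elementSpec>
  </schemaSpec>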

But there are limitations to this web interface: for example, it currently doesn’t allow you to provide a description for each of these values (a <desc> inside each <valItem>). There is no reason it shouldn’t, just that the creators haven’t had time or money to improve the software in the last few years. The TEI Council is actively looking at ways to encourage the community to create newer ODD editors. From our customisation we can generate a variety of documentation, and this documentation will be localised, meaning that your changes will be reflected in it, as well as internationalised, in that it will use your choice of language where it can. One of the great things about TEI ODD files is that you can also include as much prose as you want describing your project’s encoding practice. And, of course, you can also generate a variety of schema languages to validate your documents. The TEI tends to recommend Relax NG as its preferred format. And although you can generate DTDs as well, these are now a dated document validation format that I would not recommend.

One of the interesting recent developments is that a user can now ‘chain’ customisations together: their TEI ODD file points at an existing one as a source, and so on. This means that if there is an existing customisation that you like (for example the EpiDoc customisation for classical epigraphy), then a project can point at that to use it as a starting point and add to it, and can regenerate its schemas with the new additions any time the original source has changed.
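Very roughly, a chained ODD points its <schemaSpec> at the compiled ODD of its parent using the @source attribute; the URL below is only a placeholder, and the exact selections will depend on the customisation being chained from:

  <schemaSpec ident="myEpigraphyProject" start="TEI"
              source="https://example.org/odds/epidoc-compiled.odd">
    <!-- declarations are now drawn from the parent customisation
         rather than directly from the current TEI release -->
    <moduleRef key="tei"/>
    <moduleRef key="header"/>
    <moduleRef key="core"/>
    <moduleRef key="textstructure"/>
    <!-- local changes and additions then follow as usual -->
  </schemaSpec>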

Such documentation of variance of practice and encoding methods enables real, though necessarily mediated, interchange between complicated textual resources. Moreover, with time a collection of these meta-schema documentation files helps to record the changing assumptions and concerns of digital humanities projects.

OxGarage

OxGarage is the web front end to a set of conversion scripts the TEI provides to convert to and from TEI. It is really easy to use: you choose what type of input document you have, and if the service can get from that format, via a pipeline of intermediate formats, to another format, then you can choose that as your output. Once you’ve chosen the output you can convert to it, and there are all sorts of advanced options for handling things like embedded images. One of the benefits of this freely available tool is that it is a web service, so you can build it into other platforms. For example, when the Roma tool we saw earlier converts a customisation to HTML documentation, or indeed to a Relax NG schema, behind the scenes it is sending the file to the OxGarage web service to do the conversion.

The Stationers’ Register Online

The Stationers’ Register Online project is a good example of how TEI ODD customisation can save a project money and further its research aims. This project received minimal institutional funding from the University of Oxford’s Lyell Research Fund to transcribe and digitize the first four volumes of Arber’s edition of the Register of the Stationers’ Company. The Register is one of the most important sources for the study of British book history after the books themselves, being the method by which the ownership of texts was claimed, argued, and controlled between 1577 and 1924. This register survives intact in two series of volumes which are now at the National Archives and Stationers’ Hall itself. The pilot SRO project has created full-text transcriptions of Edward Arber’s 1894 edition of the earliest volumes of the Register (1557–1640) and the Eyre, Rivington, and Plomer 1914 edition (1640–1708). It has also estimated the costs involved in the proofing and correction of the resulting transcription against the manuscript originals, as well as the potential costs of transcription of the later series from both manuscript and printed sources.

A typical entry lists the members of the company registering the book (to ensure their right to print it), the name of the author, and the title of the book. There is also an amount shown, which is the cost of registering it. In this case the book is the Comedies, Histories, and Tragedies of one Mr William Shakespeare. As Edward Arber’s nineteenth-century edition of the Stationers’ Register existed as a source, it was decided that this was a much better starting point for the pilot than the manuscript materials themselves. In the earlier volumes the register is also used as a general accounts book for the Stationers’ Company, but over time it evolves into a more or less formulaic set of entries following a fairly predictable format.

Although hard to see in this low-res scan, even in the nineteenth century Arber recognized the potential usefulness of markup and thus marked particular features of the Register surprisingly consistently in the volumes he edited. The encoding tools at his disposal, however, were only page layout and choice of fonts. The ‘nineteenth-century XML’, as the presentational markup he chose was termed within the project, was used to indicate basic semantic data categories. For Members of the Stationers’ Company Arber uses a different font, Clarendon. Other names are in roman small capitals, but the names of authors are in italic capitals.

Arber’s extremely consistent use of this presentational markup, and the subsequent encoding of it by the data keying company, meant that the project could generate much of the descriptive markup itself. If this presentational markup had not existed then a pilot project (with very minimal funding) to produce a digital textual dataset would not have been possible. As with all TEI customisations, this was done with a TEI ODD file. This TEI ODD file used the technique of inclusion rather than exclusion (that is, it said which elements were allowed instead of taking all of them but deleting the ones it did not want). What this meant was that when the project regenerated its schemas or documentation using the TEI Consortium’s freely available services, only the original requested elements were included, and new elements that had been added to the TEI since the project created the ODD would be excluded.

The Bodleian Library’s relationship with a number of keying companies meant that the SRO project was able to find one willing to encode the texts in XML to any documented schema. And indeed, very importantly, this particular keying company charged for their work by kilobyte of output. Owing to this, the project realised that it would save money if it could create a byte-reduced schema which resulted in files of smaller size. Our ODD customisation replaced the long, human-readable names of elements, attributes, and their values with highly abbreviated forms.

For example, the <div> element became <d>, the @type attribute became @t, and the allowed values for @t were tightly controlled. This meant that what might be expanded as <div type=”entry”> (24 characters with its closing tag) was coded as <d t=”e”> (13 characters). The creation of such a schema was intended solely to reduce the number of characters used in the resulting edited transcription, as an intermediate step in the project’s workflow — document instances matching this schema are not public, since it is the expanded version that is more useful. This sacrificed the extremely laudable aims of human-readable XML and replaced it with cost-efficient brevity. Because of this compression of elements we called our customisation tei_corset.
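To give a flavour of how such renaming is expressed in ODD (this is a reconstructed sketch rather than the project’s actual file), <altIdent> supplies the short name and <equiv> points at the stylesheet which reverts it:

  <elementSpec ident="list" mode="change" module="core">
    <altIdent>ls</altIdent>
    <equiv filter="corset-acdc.xsl"/>
  </elementSpec>
  <elementSpec ident="div" mode="change" module="textstructure">
    <altIdent>d</altIdent>
    <equiv filter="corset-acdc.xsl"/>
    <attList>
      <!-- rename the @type attribute to @t -->
      <attDef ident="type" mode="change">
        <altIdent>t</altIdent>
      </attDef>
    </attList>
  </elementSpec>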

This sort of literate programming becomes fairly straightforward once one is used to the concept. However, there is an important additional step here, which is the use of the <equiv> element. This informs any software processing this TEI ODD that a filter for this element exists in a file called ‘corset-acdc.xsl’ which would revert to, or further document or process, an equivalent notation. In this case a template in that XSLT file transforms any <ls> element back into a <list> element. In addition to renaming the @type attribute to @t, some of the other element customisations constrain the values that it is able to contain. For example, in the <n> element (a renamed TEI <name> element) the @t attribute has a closed value list allowing only the values ‘per’ (personal name), ‘pla’ (place name), and ‘oth’ (other name). In most cases, though, the names are documented by Arber using his presentational markup, and this is captured with the @rend attribute (or its renamed version, @r).

As with many TEI customisations designed solely for internal workflows, the tei_corset schema is not in fact TEI Conformant. The popular TEI mass digitisation schema tei_tite has the same non-conformance issues. Both of these schemas make changes which fly in the face of the TEI Abstract Model as expressed in the TEI Guidelines. The tei_corset schema, in addition to temporarily renaming the <TEI> element as <file>, changes the content model of the <teiHeader> element beyond recognition.

This bit of the customisation documents the renaming of the <teiHeader> element to <header>, which compared to other abbreviations is quite long, but it was only used once per file so there was less pressure to abbreviate it. The @type attribute is deleted and, more importantly, the entire content model is replaced. This uses embedded Relax NG schema language to say that a <title> element (which is later renamed to <t>) is all that is required, but that it can have zero or more members of the model.pLike class after it. This enabled the keying company to put a basic title for the file (to say what volume it was), but gave them nothing but some paragraphs as a place to note any problems or questions they had. Usually TEI documents have more metadata, but this is unproblematic because these headers were replaced with more detailed ones at a later stage in the project data workflow. Other changes meant that elements that were usually empty would be (temporarily) allowed text inside; in the process of up-converting the resulting XML, these were replaced with the correct TEI structures. In this customisation of the TEI <gap> element, in addition to allowing text, the locally-defined attributes @agent, @hand, and @reason are removed.

In a full tei_all schema the <gap> element would have the possibility of many more attributes, but these are provided by its claiming membership in particular TEI attribute classes. For the tei_corset schema many TEI classes were simply deleted which meant that the elements that were claiming membership in these classes no longer received these attributes.

The result of the customisation is a highly abbreviated, and barely human-readable, form of TEI-inspired XML. For example, here we have an <n> element marking ‘Master William Shakespeers’, with the forename and surname marked with ‘fn’ and ‘sn’. The conversion of this back to a <persName> element with <forename> and <surname> is a very trivial renaming in XSLT.
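A minimal sketch of such a reverting transformation might look like the following; it is not the project’s actual corset-acdc.xsl, it assumes the keyed files are in no namespace, and it ignores attribute renaming:

  <xsl:stylesheet version="2.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                  xmlns="http://www.tei-c.org/ns/1.0">
    <!-- copy everything through unchanged by default -->
    <xsl:template match="@*|node()">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:template>
    <!-- rename the abbreviated elements back to TEI ones;
         a real conversion would also handle attributes like @t and @r -->
    <xsl:template match="n">
      <persName><xsl:apply-templates select="@*|node()"/></persName>
    </xsl:template>
    <xsl:template match="fn">
      <forename><xsl:apply-templates select="@*|node()"/></forename>
    </xsl:template>
    <xsl:template match="sn">
      <surname><xsl:apply-templates select="@*|node()"/></surname>
    </xsl:template>
  </xsl:stylesheet>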

Passing a couple of centuries’ worth of records through the transformation results in much more verbose markup.

But it wasn’t just simple renaming that we undertook in reverting this highly compressed markup to a fuller form; there was also more detailed up-conversion. Such entries contain fees paid, and these are almost always aligned to the right margin by Arber and recorded in roman numerals. The keying company was asked to mark these fees (the <num> element having been renamed to <nm>) and to use the @r attribute to indicate the formatting of ‘ar rm’ (aligned to the right and in roman numerals). The benefit to the project of them doing this is that the SRO project could up-convert this simple number into more complex markup for the fee.

The up-conversion I wrote here isn’t simply to revert numbers back to the correct TEI markup, but to turn them into even better markup by deriving information from the textual string that was encoded. The tokenization of the provided amounts into pounds, shillings, and pence, and the consistent encoding of the unit indicator as superscript, are key parts of this. Arber’s edition provided all the markers of pounds/shillings/pence as superscript, so the keying company was not asked to provide this, as the project realised it could be done automatically after the fact and would save even more characters. I also converted the roman numerals to ‘arabic’ numbers so that easy calculations of the total amount in pence (for comparative purposes) could be provided. To do this, the XSLT stylesheet converted the keyed text string back into pure TEI and simultaneously broke up the string based on whether it ended with a sign for pounds, shillings, pence, or half-pence. An additional XSLT function converted the roman numerals in between these to arabic, and then to pence, so that the individual and aggregate amounts could be stored. The markup that results provides significantly more detail than the original input.

The benefit of this customisation was based entirely on the keying company both using whatever XML schema we gave them, and charging per kilobyte of output. Originally we’d calculated that by having them use this schema rather than full TEI we were saving around 40%. In the end, if we include the up-converted information as well, this rises to a 60% saving. The extra money we had left meant that we were able to include the 1640-1708 material as well even though it had been out of scope for the original project.

The Godwin Diary project

The Godwin Diary project was funded by the Leverhulme Trust to digitise and produce a full-text edition of the 48 years of William Godwin’s diary. William Godwin (1756–1836) was a philosopher, writer, and political activist. He is perhaps most commonly known as the husband of Mary Wollstonecraft and the father of Mary Wollstonecraft Shelley, the author of Frankenstein. Godwin faithfully kept a diary from 1788 until his death in 1836; the diary is now preserved in the Abinger collection in the Bodleian Library. It is an extremely detailed resource of great importance to researchers in fields such as history, politics, literature, and women’s studies. The concise diary entries consist of notes of who Godwin ate with or met with, his own reading and writing, and major events of the day. The diary gives us a glimpse into this turbulent period of radical intellectualism and politics, and many of the most important figures of the time feature in its pages, including Samuel Coleridge, Richard Sheridan, Mary Wollstonecraft, William Hazlitt, Charles Lamb, Mary Robinson, and Thomas Holcroft, among many others.

The project team was small, consisting mostly of Mark Philp and David O’Shaughnessy and a couple of their students in the politics department. It is worth noting that it was the politics department: it was less Godwin’s life as a literary figure than the social network of relationships around him which concerned the project.

The Bodleian has provided hi-res images of the diary, and done so under an open license that has already significantly benefited research in this area. In providing the technical support to the project, it is worth noting that I gave the team only two days of technical training. This is partly a benefit of the TEI ODD customisation: they didn’t have to learn the entirety of the TEI, only the bits they were using. I provided this training, created the TEI ODD customisation, developed the website, and was also a source of general technical support during the life of the project.

However, even with basic training they were able to mark up the 48 years of the diary, categorise every meal, meeting, event, text mentioned, and person named. In addition they identified more than 50,000 of the ~64,000 name instances recorded in the diary and linked these to additional prosopographical information.

Godwin’s diaries are simultaneously immensely detailed (recording the names of almost everyone he ever met with) and frustratingly concise (he only rarely gives details of what they talked about). Godwin’s diary is quite neatly written and easy to read. The dates, here in a much lighter ink, are usually given (and given correctly) and generally a day’s entry forms the basic structural unit of the diary. In only a very few instances do the notes from one day stray into the page area already pre-ruled for the following day. Occasionally there are marginal notes to provide more information, but in most cases the textual phenomena are quite predictable – mostly substitutions and interlinear additions. In many ways the hierarchical nature of a calendrical diary entry makes it ideal for encoding in XML.

There is some indication that Godwin may have returned to certain volumes at a later date to rewrite or correct them. And yet, it is certainly impressive that there are entries for most days, and that, however minimal the information given, the names of those attending the frequent meetings Godwin had with those in his circle are recorded. The majority of his diary entries can be broken down into several categories and sub-types. These include his meals, who he shared them with, who he met, very rarely what they talked about, and what works he was reading or writing at that time. The political historians, it is easy to understand, were eager to use the resource to explore which individuals might be meeting with which other friends of Godwin’s at specific times. Meanwhile, those exploring Godwin’s writings might be interested in knowing what works he was reading when he was writing specific parts of his own works.

But that is enough about Godwin; back to the project itself. Of course, having the hi-res images means that I included a typical pan/zoom interface, here built on top of Google Maps, to show each page of the diary. Two links are important to notice on this screenshot though: one is the link to the Creative Commons licensed ‘full image’. There is no barrier to getting the full image and no one researchers need to ask; they can just download it. The same is true for all the underlying XML. The other link is a direct link to the diary text for this page. This means that one can browse the diaries based on their physical manifestation, as a series of images, and jump to the text at any point. Or one can read the transcribed text and jump to the image for that page. The project specifically asked for there not to be a side-by-side facing image/text view because they wanted to preserve the distinction between these two experiences of reading the text.

The customised TEI ODD in the case of the Godwin project wasn’t made to create highly abbreviated element names for some keying company. Instead it was to create aliases for elements to give those encoding the diary a small and easy set of elements through which to categorise the parts of a diary entry in terms that made sense to them.

So there were element specifications created for divisions that renamed them to be diary year, month, and day. There were specialised elements to mark segments of text, really re-namings of the TEI seg element, for those portions of diary entries for meals, meetings, events, and more, all with specific names that made sense to the project.

For example, the element specification shown here creates a new element called ‘dMeal’, a diary-entry meal. There is an <equiv> element pointing back to an XSLT file which can revert this to pure TEI.

There is a description of the new element, and some information about what classes it is a member of and what is allowed inside it. There is a locally-defined @type attribute which has been made required, and which has a list of values for each type of meal that also indicates whether the person was dining at Godwin’s place or whether he was visiting them.
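A rough sketch of what such an element specification might look like follows; the filter filename, class membership, content model, and value list here are invented for illustration rather than copied from the project’s actual ODD:

  <elementSpec ident="dMeal" mode="add"
               xmlns:rng="http://relaxng.org/ns/structure/1.0">
    <equiv filter="godwin-to-tei.xsl"/>
    <desc>a meal recorded within a day's diary entry</desc>
    <classes>
      <memberOf key="model.segLike"/>
    </classes>
    <content>
      <rng:ref name="macro.paraContent"/>
    </content>
    <attList>
      <attDef ident="type" usage="req">
        <valList type="closed">
          <valItem ident="dinner"><desc>Godwin hosts dinner</desc></valItem>
          <valItem ident="dinnerAt"><desc>Godwin dines at someone else's</desc></valItem>
          <valItem ident="sup"><desc>Godwin hosts supper</desc></valItem>
          <valItem ident="supAt"><desc>Godwin sups at someone else's</desc></valItem>
        </valList>
      </attDef>
    </attList>
  </elementSpec>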

As with the Stationers’ Register project markup, this was easily converted back to pure TEI P5 XML. You can see some of the @type attribute values preserve the original name of the customised markup. Once restored this dMeal element becomes a <seg type=”dMeal”>.

In this case it is a supper, where Godwin has supped at his friends the Lambs’ with a variety of other people. While at the meal he had a short side meeting with H Robinson.

The structure of the diary is also quite straightforward. As you can see, each month has an @xml:id attribute which gives its year and month; each day has precisely the same thing, but with the day added. These were required by the ODD customisation, and moreover the schema requires that each day entry have a date element with a @when attribute encoded in it. This means that in creating the processing for the diary entries I could be sure that each entry would have a date and each month a clearly understandable ID, so creating transformations which produce the website by year, month, or day becomes very straightforward.
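In converted TEI P5 form, the shape of a month and a day is roughly as follows (the ids and @type values are invented for illustration, not taken from the project’s files):

  <div type="dMonth" xml:id="e1788-04">
    <div type="dDay" xml:id="e1788-04-17">
      <date when="1788-04-17"/>
      <!-- the day's entries: meals, meetings, reading, writing ... -->
    </div>
  </div>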

The changes to the TEI scheme, renaming elements this time not for brevity but for simplicity, meant that the project’s ability to mark up the documents in XML increased dramatically. The other changes, such as requiring a date element with a @when attribute, meant that the processing of the documents was even easier. In short, the customisation made both my life and the encoders’ lives easier.

In the resulting webpages, one can toggle on or off a variety of formatting for indicating all these categories of information they recorded, people, places, meals, meetings, reading, writing, topics mentioned, and events. The general website is clear, cleanly minimalistic, and intuitive with a calendar for each year one is looking at, and anything that can be a link has been turned into one. But one of the great strengths of the website is the amount of work they have put into the marking of all those people’s names. Because they have done that it means that we can pull out dataTables of information about the people, birth date, death date, gender, occupation, and how many times they are mentioned in all of the diary volumes and whether this was when they are acting as a venue (Godwin visits them) or were listed by Godwin as ‘not-at-home’.

For each person we produce a prosopographical page listing biographical details, editorial notes, a bibliography of works, and a generated graph showing when and how much they are mentioned in the diary. Of course, each of these references links back to that diary entry for a very circular navigation through the resource.

Extracting information from the diary was the reason the project team put so much effort into adding this encoding to the XML files. It means that we’re able to extract this information for any of the categories that they marked and each of the sub-types within them. In this case one of the sub-types of events was ‘theatre’, used to note when he went to the theatre and, if known, which theatre he was going to. With this data available in the eXist XML database that powers the resource, it is easy to pull out all of the trips to the theatre, note which theatre, and show the event, usually containing the title of the play he went to see. The website does this for every single category and sub-type of information they marked, so researchers can indeed compare how many times he ate supper with someone at his house with how many times he ate supper at theirs. (If they really want!)

EEBO-TCP

Another benefit of the documentation of local encoding practice is for the legacy data migration of document instances in the future. Even the conversion of closely related documents such as those from the Early English Books Online – Text Creation Partnership into pure TEI P5 XML can be an onerous task. We recently converted the more than 40,000 texts of the EEBO-TCP corpus to TEI P5 XML. As the first phase of these will enter the public domain in 2015, we’re testing and improving the conversions we have for them, to do fun things like create ePubs so we can read these early printed books on our iPads and phones.

The EEBO-TCP markup was based on TEI P3 but then evolved separately when it encountered problems the TEI hadn’t yet dealt with. However, it did not document these in a TEI extension or customisation file. In converting the texts to TEI P5 we used the TEI ODD customisation language to understand and record the variations between EEBO-TCP markup and the more modern TEI P5. One proven approach to comparing texts is to define their formats in an objective meta-schema language such as TEI ODD; in doing so the precise variation between the categories of markup used is exposed and, more importantly, provided in a machine-processable form. As part of the process of converting these to TEI P5 one of the things we looked at was the markup before and after conversion, and thus the frequency of certain elements. The resulting markup has almost 40 million instances of highlighting, but that is because this is one of the basic things captured by the TCP project.

Most of the elements that are highest in frequency are structural in nature. Remember how the Stationers’ Register project limited the schema to a tiny 34 elements? In all of EEBO-TCP there are only 78 distinct elements used in the entire corpus. This reflects the nature of the TCP encoding guidelines of capturing basic structural and rendering markup. There are very few interoperability problems between EEBO-TCP texts, as their markup is fairly consistent and basic. But what is interesting about these newly converted EEBO-TCP files is that, now that we are able to convert them, they are becoming the source for further research. Projects can take our TEI P5 XML files and add more markup to them to document the aspects of the texts that they are interested in.

Three EEBO-TCP Projects

Very briefly I’d like to mention three projects which have benefited from these conversions of EEBO-TCP materials, each of which I could go into more detail about at another time.

The first (Verse Miscellanies Online) recently went online at the Bodleian: we took the converted EEBO-TCP texts, and researchers from another university edited them and provided information about genre, rhyme scheme, and editorial notes for each of the poems. They also glossed any unfamiliar words and provided pop-up regularisations for others. From these enhanced texts we built them a website to use for teaching and reading of the 8 verse miscellanies they encoded during the project.

Similarly, in the second project (Poetic Forms Online) researchers again took the TEI P5 converted versions of the EEBO-TCP texts that we supplied and provided highly detailed metrical analysis, counted syllables, and marked the type and location of all rhyme words as well as a regularisation of their rhyme sounds. From these enhanced texts we built them a faceted searchable website with all of these categories, which they plan to expand by adding more texts as time goes on.

The Holinshed Project was slightly different, one of the earlier conversions of EEBO-TCP material that we did. In this case there are two editions of a very large text, Holinshed’s Chronicles of England, Scotland, and Ireland, one published in 1577 and the other in 1587. The academics in question were writing a secondary guide to this huge work and wanted a way of following where paragraphs in one edition had been fragmented and moved around in the creation of the second edition. Sometimes whole sections had been moved; sometimes parts of paragraphs had been moved around and mixed with others. In this case we converted the texts to TEI P5 and then designed a fuzzy string comparison system to find the most probable matches and record their paragraph ID numbers. We then built a website where the researchers could confirm that these were indeed the correct matches, before using the resulting links between the two editions to generate a site where, when reading the text, a user could jump to the same paragraph in the other edition and see how the social changes during Queen Elizabeth’s reign had affected the topics, especially religious topics, in the chronicle.

All of these projects have benefited from our ongoing work to improve the transformations of EEBO-TCP to TEI P5, which itself is dependent on the TEI ODD customisation language.

The Unmediated Interoperability Fantasy

One of the misconceptions about the TEI, and indeed any sufficiently complex data format, is that once one uses this format that interoperability problems simply vanish. This is usually not the case. Following the recommendations of the TEI Guidelines does, without question, aid the process of interchange especially when there is a fully documented TEI ODD customisation file. However, interchange is not and should not be confused with true interoperability.

I would argue that being able to seamlessly integrate highly complex and changing digital structures from a variety of heterogeneous sources through interoperable methods without either significant conditions or intermediary agents is a deluded fantasy. In particular, this is not and should not be the goal of the TEI. And yet, when this is not provided as an off-the-shelf solution some blame the format rather than their own use of it. The TEI instead provides the framework for the documentation and simplification of the process of the interchange of texts. This is a good thing and is a much better goal for the TEI. If digital resources do seamlessly and unproblematically interoperate with no careful or considered effort then:

  • the initial data structures are trivial, limited or of only structural granularity,
  • the method of interoperation or combined processing is superficial,
  • there has been a loss of intellectual content, or
  • the results gained by the interoperation are not significant

It should be emphasised that this is not a terrible thing, nor a failing of digital humanities or of any particular data format; instead it truly is an opportunity. The necessary mediation, investigation, transformation, exploration, analysis, and systems design is the interesting and important heart of digital humanities.

Open Data

While proper customisation of the TEI and open standards generally are a good start, what still isn’t happening as much as it should is the release of the underlying data under open licences. All projects, especially publicly funded ones, need to release their data openly, but they also need centralised institutional support to enable them to do so. If other people can’t see your data then they can’t re-use it or test it, and in that case there is little benefit to the world in having made it.

I don’t know the situation here in Japan, but in the UK and the USA it is certainly the case that funding bodies are increasingly requiring data to be open.

I leave you with the final thought that the “coolest thing to be done with your data will be thought of by someone else”.

Posted in Conference, TEI | 1 Comment

Self Study (part 6) Primary Sources

Self Study (Part 6) Primary Sources

This post is the sixth in a series of posts providing a reading course of the TEI Guidelines. It starts with

  1. a basic one on Introducing XML and Markup then
  2. on Introduction to the Text Encoding Initiative Guidelines then
  3. one on the TEI Default Text Structure then
  4. one on the TEI Core Elements then
  5. one looking at The TEI Header.

None of these are really complete in themselves and barely scratch the surface, but are offered up as a help should people think them useful. This sixth post looks at how to represent primary source documents, including transcription, linking transcriptions to facsimiles, and genetic editing. Already in the core module of the TEI a number of elements are defined specifically for encoding primary sources. If you’ve got this far then you’ve already read about those, for example unclear or the choice element and its component parts abbr/expan, sic/corr, orig/reg. Some of these are further supplemented with additional elements if the ‘transcr’ module (the ‘Primary Sources’ chapter) is included in your schema. For example, the addition of am to abbr to record the abbreviation marker and ex inside expan to mark an editorial expansion. Other elements provided if the ‘transcr’ module is included in the TEI ODD file that created your schema include:

addSpan am damage damageSpan delSpan ex facsimile fw handNotes handShift
line listTranspose metamark mod  redo restore retrace sourceDoc space subst substJoin
supplied surface surfaceGrp surplus transpose undo zone

Annotating the activities of transcription and the relationship of this transcript with the original source document is at the heart of this chapter. This has several aspects, including: more detailed encoding of the act of transcription, the creation of digital facsimiles, and recording the writing process.

Transcription

As you already know from reading about it, the choice element is a way of encoding multiple transcriptional interpretations at a single point in a text: for example, an abbreviation with abbr and its expansion with expan, or an apparent error with sic and its editorial correction with corr, or an original reading with orig and a regularised form with reg. These children of choice are repeatable, so it is possible to encode an abbreviation with multiple possible expansions (hint: one could use the @cert attribute to indicate which of these expansions is more certain).

What could be encoded as:
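(The snippets here are illustrative reconstructions, using the N.A.T.O. example discussed below.)

  <abbr>N.A.T.O.</abbr>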

could also be encoded as:
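  <choice>
    <abbr>N.A.T.O.</abbr>
    <expan>North Atlantic Treaty Organization</expan>
  </choice>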

As mentioned above this chapter also adds am to abbr to record the abbreviation marker and ex inside expan to mark an editorial expansion. The abbreviation marker may or may not be present but is the thing in the original text which indicates to you that the word should be interpreted as an abbreviation. In this case, although NATO is more commonly abbreviated as an initialism with no ‘.’ marking the individual letters, here it has these. We could encode this, using ex inside expan to mark the expanded portions of text as:
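  <choice>
    <abbr>N<am>.</am>A<am>.</am>T<am>.</am>O<am>.</am></abbr>
    <expan>N<ex>orth</ex> A<ex>tlantic</ex> T<ex>reaty</ex> O<ex>rganization</ex></expan>
  </choice>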

This form of markup is different from the superficially similar subst (added in this module), which contains add and del to record additions and deletions. The prime difference is that whereas in choice only one of the child elements is truly transcribing something in the text, with subst both the deletion and the addition are present to be transcribed. There are also elements to record damage to the source, text that has been supplied by the editor or is considered surplus to editorial requirements, and unusual space in the document. There is also a way to note a change of scribal hand, using the handShift element.

Digital Facsimiles

In many scholarly editions the provision of digital images acts as a facsimile or surrogate of the original document, to such a degree as to enable primary source research without recourse to the original object. Although the TEI stands for the Text Encoding Initiative, it is indeed possible to have a TEI document which does not contain a text element. Inside the TEI element, after the teiHeader, there must be either a facsimile, sourceDoc, or text element. But you could have a document which at this point only had a facsimile and no transcribed text. The facsimile element contains images rather than text, and these can appear either directly as graphic elements, or be organised by surface elements for each surface, with zone elements to specify sections on those surfaces. The surfaceGrp element can be used to group multiple surfaces together (e.g. recto and verso of a folio, or indeed gatherings).

A basic facsimile element might look like this:
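(The following sketches are illustrative; the filenames, ids, and coordinates are invented.)

  <facsimile>
    <graphic url="page1.png"/>
    <graphic url="page2.png"/>
  </facsimile>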

These instead could be grouped as individual surfaces:
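  <facsimile>
    <surface xml:id="page1">
      <graphic url="page1.png"/>
    </surface>
    <surface xml:id="page2">
      <graphic url="page2.png"/>
    </surface>
  </facsimile>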

or with zones and coordinates:
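  <facsimile>
    <surface xml:id="page1" ulx="0" uly="0" lrx="600" lry="900">
      <graphic url="page1.png"/>
      <zone xml:id="page1-head" ulx="50" uly="40" lrx="550" lry="120"/>
    </surface>
  </facsimile>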

These can be given as x/y coordinates for the upper left and lower right to draw a rectangular bounding box. The @xml:id attributes in these examples can be pointed to from the page breaks in the transcription of the text (if provided):
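  <pb n="1" facs="#page1"/>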

Or from any other element, such as a division:
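  <div facs="#page1-head">
    <!-- transcription of this division -->
  </div>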

Recording the Writing Process

Linking a transcription to a zone in a facsimile by pointing to it with the @facs attribute is one way to relate the text to images. Another does not prioritise the final text, but the process which was undertaken to create this text. In this case the transcription can be made in the sourceDoc element. When surface, zone and line (for transcription of topographic lines on the document) are used inside sourceDoc they are for transcriptions of the text as they appear as units on the physical document without the semantic interpretation that we find in transcriptions that use the text element. (For example, deciding that they form paragraphs or speeches by particular characters.)

Of course, all of the surfaces, zones, and lines can have coordinates on them, or use the @points attribute for a series of coordinates for non-rectangular areas. There are other elements, for use in recording the text, which also relate to the process of writing. These include metamark, which records any symbol which indicates how the text should be read rather than forming part of the content (for example, an arrow ‘moving’ a paragraph above the one which precedes it in the document). A general mod element can also be used to record a modification in the document without the semantic interpretation of some of the other transcriptional elements.

Additional elements are available that help to record the process of writing the document, including the restore element to indicate a deletion that has been marked as reverting to a previous version by cancelling some textual intervention. This is used for comparatively simple cases, whereas the more general undo element can be used to indicate any form of cancellation. If a cancellation is then marked as being reaffirmed or reasserted in some manner then a redo element can be used. There is also a way to record the act of transposition by using a transpose element, sometimes gathered in a listTranspose, to point to the elements that are transposed.

A retrace element can be used where writing has been overwritten, usually with the intention of clarifying or fixing the text. This is sometimes a distinct phase in the production of the text. Any distinct stages in the text, such as campaigns of revision or editing phases, can be recorded using the listChange and change elements. These are not provided in this chapter but in the one on the Header, where they are used in revisionDesc to record stages of revision in the creation of the electronic file. When used in the creation element in the header they instead record phases of development of the text itself.

Questions about Encoding Primary Sources

As usual I’ve got some self-assessment questions for you to test that you’ve read the chapter carefully.

  1. What is the difference between a surface and a zone?
  2. What are the options for child elements of the TEI element?
  3. Can a zone be larger than its parent surface?
  4. How can you point from a surface element to a page break rather than the other way around?
  5. What do you use to break up textual transcription inside a line element?
  6. Think of an example of metamark used in documents you’re familiar with. How would you encode it?
  7. Why might you use a g element inside an am element?
  8. What is a substJoin element used for?
  9. What is the difference between using damage and unclear with textual content?
  10. How is line different from zone? When would you use line?
  11. How would you record how large an unexpected space was?
  12. Can you think of a reason why recording stages of production of the texts you are interested in might benefit your own work?

You may wish to look at the Image Markup Tool, written by Martin Holmes from the University of Victoria in Canada. This uses the facsimile, zone and surface elements to record the coordinates of the annotation and links the transcription to these.

Posted in SelfStudy, TEI, XML | 3 Comments

Self Study (part 5) The TEI Header

This post is the fifth in a series of posts providing a reading course of the TEI Guidelines.  It starts with

  1. a basic one on Introducing XML and Markup
  2. an Introduction to the Text Encoding Initiative Guidelines
  3. and one on the TEI Default Text Structure
  4. and one on TEI Core Elements

None of these are really complete in themselves and barely scratch the surface but are offered up as a help should people think them useful.

This fifth post is looking at The TEI Header.

The <teiHeader> is an essential part of every TEI file; it is where you record metadata for the digital text you are creating, document what you have done and why, as well as put additional information which may be useful in understanding or interrogating this file.

The <teiHeader>, often just casually referred to as ‘the header’, is in some ways the most important part of your TEI file. Without it we can’t know what the file consists of, what you were trying to do when you created it, what we are allowed to do with it, or anything else about this electronic file. A digital file without proper metadata is only of very limited use. However, the provision of basic metadata need not be an onerous task only completed by well qualified librarians and bibliographers: you too can provide decent metadata for your digital text.

At its most minimal, the TEI requires that the header have a <fileDesc> element, and that this in turn have child elements for a <titleStmt> (information about the title of the digital file), a <publicationStmt> (information about the publication of the digital file), and a <sourceDesc> (information about the source of the digital file, even if newly created).

Minimal teiHeader Element
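At its smallest, then, a valid header might look like this (the titles and statements are of course just placeholders):

  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A minimal TEI document</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished first draft.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Born digital: no pre-existing source.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>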

As siblings to the <fileDesc> one could also have the elements <encodingDesc> (to store information about the encoding of the digital text), <profileDesc> (a text profile of additional information), or <revisionDesc> (to store information about major revisions).

The <fileDesc> Element

Inside <fileDesc> you can store all sorts of information about the file. The RelaxNG Compact Syntax for this content model (excluding its membership in attribute classes) is:

(titleStmt, editionStmt?, extent?, publicationStmt,
seriesStmt?, notesStmt?), sourceDesc+

This means that there is:

  • a required <titleStmt> which allows you to record one or more  <title> (required) and responsibilities such as <author>, <editor>, <funder>, <meeting>, <principal>, <sponsor>, or general purpose <respStmt> followed by
  • an optional <editionStmt>, to record information about this digital edition followed by
  • an optional <extent> element to give a place for information about size followed by
  • a required <publicationStmt> to record necessary information about the publication of the digital file, either as prose paragraphs or as structured information using <distributor>, <authority>, <availability>, <address>, <date>, <publisher>, <pubPlace>, or one or more <idno> elements. This is followed by
  • an optional <seriesStmt>, which gives a place for relating this digital file to a series of any sort of which it might be a part, then
  • an optional <notesStmt>, which gives a place for any notes relating to the file not encoded elsewhere,
  • and after all of this at least one <sourceDesc> is required to record information concerning one or more sources for this electronic file. This can contain either prose paragraphs or more structured information about the bibliographic sources in a variety of formats.

Note that the required elements inside <fileDesc> are <titleStmt> (itself with a required title), <publicationStmt>, and <sourceDesc>.

And that is it! That is all that is required for a valid and useful <teiHeader>.

The <encodingDesc> Element

But of course, sometimes we don’t want to only record the minimal amount of information, we may wish to record other things. As mentioned above after the <fileDesc> we can also have an  <encodingDesc> (to store information about the encoding of the digital text), <profileDesc> (a text profile of additional information), or <revisionDesc> (to store information about major revisions).

The <encodingDesc> element is where one can store information about what decisions were made in the encoding of the text. Like many metadata categories in the TEI this can either be given as prose paragraphs or more structured forms concentrating on the following:

  • when the header module (required) is loaded:
    • information about an application which has edited the TEI file: <appInfo>
    • taxonomies defining any classificatory codes used elsewhere in the text: <classDecl>
    • details of editorial principles and practices applied during the encoding of a text: <editorialDecl>
    • a geographic coordinates declaration: <geoDecl>
    • a list of definitions of prefixing schemes used in data.pointer values: <listPrefixDef>
    • a project description: <projectDesc>
    • a declaration specifying how canonical references are constructed for this text: <refsDecl>
    • a description of the rationale and methods used in sampling texts in the creation of a corpus or collection: <samplingDecl>
    • information about the language in which style information used to describe the original object is supplied: <styleDefDecl>
    • detailed information about the tagging applied to a document: <tagsDecl>
  • when the gaiji module is loaded:
    • information about nonstandard characters and glyphs: <charDecl>
  • when the iso-fs module is loaded:
    • a feature system declaration comprising one or more feature structure declarations: <fsdDecl>
  • when the tagdocs module is loaded:
    • a specification of the schema the document is intended to validate against: <schemaSpec>
  • when the textcrit module is loaded:
    • a declaration of the method used to encode text-critical variants: <variantEncoding>
  • when the verse module is loaded:
    • a metrical notation declaration: <metDecl>

Of course, these are all optional; or, instead of using structured elements, you can just use the <p> element (or, if the linking module is loaded, the <ab> element) to provide one or more prose paragraphs.

The <profileDesc> Element

After the <encodingDesc> it is possible to have a <profileDesc> element to record various non-bibliographic aspects of a text. The information recorded again depends on what modules are loaded when creating your schemas. This allows metadata categories including:

  • when the header module (required) is loaded:
    • a record of the calendaring system used in the dating elements:  <calendarDesc>
    • information about the creation of a text: <creation>
    • a description of the languages, sublanguages, registers, or dialects, represented within a text: <langUsage>
    • a collection of information describing the nature or topic of a text in terms of a standard classification or keywords scheme: <textClass>
  • when the corpus module is loaded:
    • information about identifiable speakers or other participants (of any sort) in the text: <particDesc>
    • a record of the setting(s) within which a language interaction takes place:  <settingDesc>
    • a description of a text in terms of its situational parameters: <textDesc>
  • when the transcr module is loaded:
    • documentation of the different hands identified within the source texts: <handNotes>
    • a list of transpositions, each of which is indicated at some point in a document typically by means of metamarks: <listTranspose>

Unlike <encodingDesc>, you cannot provide just paragraphs inside <profileDesc>; however, you can do so inside many of its child elements.

The <revisionDesc> Element

The final component of the <teiHeader> is an optional single <revisionDesc> which summarises the revision history of the file. Inside <revisionDesc> you usually place a series of change elements, ordered so that the most recent is at the top. The change element has both dating attributes, like @when to provide the date of the change, and a @who attribute to point to information about who made it (such as an author, editor, or more general respStmt in the <titleStmt>).
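For example (with invented dates and ids):

  <revisionDesc>
    <change when="2016-04-02" who="#editor1">Added keywords to the textClass.</change>
    <change when="2016-03-15" who="#editor1">Finished first pass of transcription.</change>
  </revisionDesc>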

And that is the <teiHeader>!

Ok, there is indeed lots more that can be said about each of those individual grandchildren in the XML hierarchy, and some aspects, such as the description of manuscripts and early printed books (using <msDesc>), even get a chapter of their very own (Manuscript Description) that I’ll cover in another post. But this is meant to be a series of blog posts as a reading course of the TEI Guidelines. So below are some basic questions you should be able to answer if you’ve read the TEI Header chapter.

Questions About the <teiHeader> Chapter

  1. What are the four major components of the <teiHeader>?
  2. Inside <titleStmt> inside a <fileDesc> what element would you use to record who transcribed a manuscript?
  3. What is the difference between a new edition of your file and a revision of it? How would you document each of these?
  4. Where would you put general notes about your text?
  5. What element would you use inside <sourceDesc> to provide a manuscript description? What about a script for a spoken text? What about the recordings used to produce a transcription?
  6. Inside the <editorialDecl> how do you indicate whether end-of-line hyphenation has been retained in a text?
  7. What is the rendition element used to describe? What global attribute do you use to reference it from the text?
  8. What elements do you need to construct an arbitrarily-deeply nested taxonomy?
  9. If you were writing a computer program which modified a TEI file, where in the <teiHeader> would you store information about how your program had modified the file?
  10. How (and where) would you indicate that approximately 80% of a text was in Latin and 20% was in English?
  11. How do you provide information about a date that is in a non-Gregorian calendaring system?
  12. The TEI Guidelines can not enforce the provision of all possible metadata. What information do you think should be provided as a minimum? What would you include as recommended components of the <teiHeader> for your own project? How might this differ if you aren’t encoding just one document but hundreds or thousands of them?

Encoding Your Own Material

Continue encoding your own material, but this time return to the <teiHeader> and improve it as much as you can. Think about those aspects that might be useful for you to encode to be able to find this text amongst many others; think about those aspects of the text that might be helpful to encode for those who wish to study texts like this in large collections by examining their metadata through (semi)automated means. Hopefully by doing so you’ll make better use of the <teiHeader>.


Posted in SelfStudy, TEI, XML | 1 Comment

Self Study (part 4) TEI Core Elements

This post is the fourth in a series of posts providing a reading course of the TEI Guidelines.  It starts with

  1. a basic one on Introducing XML and Markup
  2. an Introduction to the Text Encoding Initiative Guidelines
  3. and one on the TEI Default Text Structure

None of these are really complete in themselves and barely scratch the surface but are offered up as a help should people think them useful.

This fourth post is looking at Chapter 3 of the TEI P5 Guidelines: Elements Available in All TEI Documents. This is, of course, a terrible name for this chapter. It has been called this, or something similar, for quite a number of versions of the TEI, so it is probably not worth campaigning to change it until a major new version of the TEI is on the cards. The reason it is a bad name, of course, is that you cannot guarantee that the elements listed in this chapter are available in all TEI documents. Every use of the TEI is through some form of customisation: even if someone uses one of the pre-prepared schemas generated by the TEI Consortium, these are all the result of a TEI customisation stored in a TEI ODD file.

Go and read the chapter Elements Available in All TEI Documents and answer the following questions to make sure you’ve really read it. While you do so, imagine how you might use these common elements in a document you would like to encode. This will be useful because the assignment after reading the chapter will be to encode a small amount of material from your chosen document.

Elements Available in All TEI Documents: Questions

  1. What is the difference between paragraphs, ‘phrase-level’ elements, ‘chunks’, and ‘inter-level’ elements? (Give an example element name of each!) Is this way of describing elements useful?
  2. Think about the hyphens (and hyphen-like symbols) that occur in the documents you are interested in. How do they function? Is there a difference in how you might encode them if they are at the end of a line?
  3. Highlighting is often how texts indicate that a segment of text has a feature or characteristic that is different, in some way, from the surrounding text. Think about the way the text you are interested in is highlighted: does it use colour, special marks, or particular characters to do so? List what the text is trying to convey through this highlighting and how you would mark it up using TEI elements.
  4. How do you mark a bit of Lingua Latina that appears in the middle of some English text?
  5. Quotation marks, another form of highlighting, are used to indicate a wide variety of things.  What is the difference between <q>, <said>, <mentioned>, <soCalled>, <quote>, and <cit>? Can you think of instances in the material you are interested in where you might use these?
  6. The <term> element is used to mark technical terms. Why might you wish to mark technical terms? Which might be useful to mark in your material?
  7. When might the @cert attribute be useful in your encoding?
  8. The <choice> element enables you to present two or more conflicting editorial choices at the same time — does this mean that software processing this needs always to choose just one of these? The <choice> element enables us to group: <abbr> (abbreviations) with their <expan>, a <sic> (apparent error) with a <corr> (corrected form), and an <orig> (original form) with a <reg> (regularization). When might it be useful to have multiple <expan>, <corr>, or <reg>? Is there something fundamentally different between abbreviations and expansions compared to the other two sets of elements concerning which is the original?
  9. How do you indicate that some material is missing because you cannot read it? What if you want to provide your guess as to what the material is?
  10. We’ll skim over looking at the <name> element in detail because there is a whole chapter about more detailed forms of names; but the <name> element has a @type attribute: what types of name occur to you? When there are specialised forms of this, such as <persName> (personal name, which will be introduced in a later blog posting), why might you want to use the simpler <name> element?
  11. How would you encode your own address using the more semantically-rich forms rather than <addrLine>?
  12. As with names, there are more complex discussions to be had about <date> elements in a later post; but, using attributes that are as precise as possible, how would you encode the following dates (try it out in oXygen, making sure your document remains valid):
    • The date text: “17 March 1999”
    • From 17 March 1999 to April 2013
    • The phrase “the thirteenth century”
    • A single date where you know it did not occur before 1971 and certainly could not have happened after the 1st of January
    • The 17th March when you do not know the year
  13. What is the difference between a <ptr/> and a <ref> and why might you prefer one over another?
  14. Lists are ubiquitous in texts of most periods and cultures: how would you encode a list from a text of your choice? When might you encounter nested lists?
  15. The <note> element can appear many places: what different types of notes can you envision using if you were encoding a modern edition of your favourite text?
  16. The <graphic/> element enables you to point to an image to include at this point. Why might this be a bit limited? (Hint: The <figure> element is defined in chapter 14.)
  17. What are milestone elements? What is the main difference between <milestone/> and <pb/>, <gb/>, <cb/>, <lb/>? Can you think of instances when you would use <milestone/>? What might you do if you want to record that a line-break is artificially breaking a word?
  18. There are three main forms of bibliographic citation <bibl>, <biblFull>, and <biblStruct>: Why might you choose <bibl> over <biblStruct> (<biblFull> is used a lot less frequently)? What kind of elements are allowed inside them (compare using their reference pages) and how might that inform your decision to use them?  Try to encode the bibliographic reference for an academic journal article of your choice using both <bibl> and <biblStruct> … of these which do you prefer and why?
  19. What is the difference between <biblScope> and <citedRange>?
  20. How would you mark up this simple piece of drama from Hamlet? (One possible sketch appears after this list.)

    QUEEN GERTRUDE: Came this from Hamlet to her?
    LORD POLONIUS: Good madam, stay awhile; I will be faithful.
    [Reads]
    Doubt thou the stars are fire;
    Doubt that the sun doth move;
    Doubt truth to be a liar;
    But never doubt I love.
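For that final question there is, of course, no single correct answer, but one possible sketch using core elements such as <sp>, <speaker>, <stage>, <quote>, and <l> (the @who values here are invented identifiers) might be:

<sp who="#gertrude">
 <speaker>QUEEN GERTRUDE</speaker>
 <l>Came this from Hamlet to her?</l>
</sp>
<sp who="#polonius">
 <speaker>LORD POLONIUS</speaker>
 <l>Good madam, stay awhile; I will be faithful.</l>
 <stage type="delivery">Reads</stage>
 <quote>
  <lg>
   <l>Doubt thou the stars are fire;</l>
   <l>Doubt that the sun doth move;</l>
   <l>Doubt truth to be a liar;</l>
   <l>But never doubt I love.</l>
  </lg>
 </quote>
</sp>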

Encoding Your Own Material

That is an awful lot of questions above! Sorry! If you still have time left, then try to encode a small amount of material that you are interested in, creating a valid TEI XML file. (If you don’t, well, do it next time you get a chance before moving on to the next blog post!) Where appropriate encode:

  • The structure of the text including any paragraphs and lists
  • Forms of highlighting, colours, or what you feel the highlighting is indicating
  • Quotations and citations if they exist
  • Notes, both existing, and editorial notes you wish to make
  • Expand some abbreviations, if there are any, using <choice>, and correct any errors
  • Mark any page breaks, line-breaks (if not encoding metrical lines), gathering breaks, column breaks, etc.
  • In a <p> in the <sourceDesc> element in the header make a note of the date of the material using as precise a date as you can
  • The list of elements created by the core module (and chapter) are at the bottom of the chapter; are there features you want to encode which are not covered by these? Make a list of them and think about what chapter may enable you to encode these.
  • What other problems or limitations in encoding your text do you find? Are these problems likely to be unique? Try to find a good TEI way of solving them!

Next time, we’ll move on to looking at the <teiHeader> element and how to make better use of it.

 

Posted in SelfStudy, TEI, XML | 1 Comment

Self Study (part 3): The TEI Default Text Structure

This (long) post follows on from posts on a basic one Introducing XML and Markup, and one on an Introduction to the Text Encoding Initiative Guidelines. Neither of these are really complete in themselves and barely scratch the surface, but are offered up as a help should people think them useful.

In this post we look at the overall basic structure of a TEI file. In many ways this is much more concrete than the infrastructure of the TEI, where it is possible to get lost in the differences between TEI ODD files and the schemas generated from them, or modules, model classes, and attribute classes. Instead here we’re looking at the markup that is part of almost every TEI file: its default text structure. Readers may notice that the ‘Default Text Structure’ chapter of the TEI Guidelines comes after two that I’ve skipped: ‘The TEI Header’ (chapter 2) and the slightly inaccurately named ‘Elements Available in All TEI Documents’ (chapter 3). Have no fear if you are following this set of blog posts: I will be returning to chapter 3 next and then chapter 2; I just feel it is good to get a sense of a TEI file as a whole before learning about all the core elements and metadata.

A Basic TEI File Structure

A basic TEI file might look like this image below.

In this image the element names are in blue and XML comments (delimited by <!-- comment -->) are in green.
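In case the screenshot does not come through here, a rough plain-text sketch of the same overall structure (with the contents of the header and any schema association left out, so not yet valid TEI as it stands) would be:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
  <!-- metadata about the file and its source goes here -->
 </teiHeader>
 <!-- optional <facsimile> and/or <sourceDoc> elements may appear here -->
 <text>
  <front>
   <!-- optional front matter, e.g. a titlepage or preface -->
  </front>
  <body>
   <!-- the main body of the text: divisions, paragraphs, etc. -->
  </body>
  <back>
   <!-- optional back matter, e.g. indexes or appendices -->
  </back>
 </text>
</TEI>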

An XML file should always start with an XML declaration (here at the top in purple). After that we have a <TEI> element in the TEI namespace (http://www.tei-c.org/ns/1.0). Inside all <TEI> elements the TEI Guidelines require there to be a <teiHeader> element. In order for this to be a real and valid TEI P5 file, there are some elements which would need to appear inside the <teiHeader> element, but I’ll talk about those in another post.

After the <teiHeader> element you can have one or more optional <facsimile> or <sourceDoc> elements. These are for recording image facsimile information, or for a non-interpretative transcription method sometimes used for creating genetic editions.

After these we have a <text> element. Technically this is optional if you have <facsimile> or <sourceDoc> elements but really for most introductory uses of the TEI it is probably a good idea to have a <text> element. If you do use one it has to come last.

Inside a <text> element you can optionally have a <front> element. This is for containing front matter like titlepages or prefaces, anything that comes before the main body of the text.

The <body> element is required, because whatever text you are creating (whether a transcription of ancient clay tablets, medieval manuscripts, modern web-pages or teaching slides) it will have a body of some sort. Inside <body> you might get divisions (the <div> element) or just paragraphs (the <p> element) or a wide variety of other things. (We’ll talk more about these in a bit.)

The <back> element which follows the <body> element, as with <front>, is optional but is intended for back matter such as indexes, appendices, bibliographies, addenda, etc.

Now one of the things you might notice about this is that it brings to bear certain assumptions of the TEI. This default text structure reflects the assumption that most text-bearing objects can be transcribed and edited in a way which resembles something we might usually associate with a codex-like structure (e.g. front matter, the main body of stuff, then stuff that comes after). Our association of this with an assumed codex structure is probably a bit misplaced: manuscript rolls, for example, often have optional ‘stuff at the top’, then ‘the main body of stuff’, then optional ‘stuff at the end’, and many other cultures and methods of writing text on objects also have such systems. People have used the TEI to successfully encode a huge variety of texts from different times and cultures, so it is unlikely that this structure will impose too much of a semantic burden on your own use of it.

 The TEI Default Text Structure Chapter

This is a long chapter which covers a lot of ground. It looks at the default text structure of the TEI (which I’ve tried to explain briefly above), and then investigates the kinds of things that happen inside the <text> element. This includes looking at the types of divisions available inside the <body>, <front>, and <back> elements and the elements available inside these divisions. It includes ways of encoding groups of texts (such as anthologies and collections) and virtual divisions, such as tables of contents, that can be generated automatically. It also looks at the <front> element, title pages, and the <back> element.

Read this chapter and in order to make sure you have, answer these questions:

  • How might you decide whether a text is unitary or composite?
  • Personally I have a strong preference for almost always using un-numbered divisions <div> rather than numbered ones <div1>. In what circumstances might numbered ones be more appropriate to use?
  • Why does the TEI not use numbered headings (c.f. HTML where there are elements <h1>, <h2>, <h3>, etc.) but just a <head> element?
  • If you were digitising my love letters (who knows why?!), how would you mark up the closing bit at the end of a letter where I say the following (one possible sketch appears after this list):
With love and cuddles,
James
xxx
  • When would you use the <group> element rather than having separate TEI files?
  • What is a <floatingText> element used to indicate? Can you think of examples from your own area of work?
  • Do the texts you work with have front matter that you would encode in the <front> element? How would you encode it? How do you decide to encode something as front matter rather than as the body of the file?
  • On a title page how would you encode a title that has several parts to it?
  • Are there differences between what is allowed in <front> and what is allowed in <back>? Why is this the case?
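For the love-letter question above there is no single right answer, but one possible sketch, using <closer> with <salute> and <signed> (whether the kisses belong inside <signed> is an editorial choice of mine), might be:

<closer>
 <salute>With love and cuddles,</salute>
 <signed>James<lb/>xxx</signed>
</closer>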

Try it out

I always think, if possible, it is good to have practical exercises to reinforce things you have learned. If you have time try this:

  • Start up the oXygen editor
  • Create a new document by going to File → New, then double-click to expand ‘Framework templates’, scroll down inside it and do the same to open ‘TEI P5’. Inside this select ‘All’, and click on ‘Create’ to open a new document.
  • Ignoring the schema declarations at the top, you should get a minimal tei_all template: a <TEI> element containing a skeleton <teiHeader> and a <text> whose <body> holds a single paragraph reading ‘Some text here.’

  • Assuming you’ve not turned off automatic document checking, you should have a happy green square in the upper right-hand corner of the editor, near where a scrollbar would appear if our document was longer. This tells you not only that it is well-formed but also valid according to the rules of the tei_all schema.
  • Delete the entire paragraph element (including <p> and </p> tags) that says:
<p>Some text here.</p>
  • Does that happy green square disappear? Is it angry and red? If document checking is turned on the opening <body> tag should be underlined in red, that happy green square should now be red and there should be a red line part way down the right-hand side indicating where the error is in the document.
  • At the bottom of the screen there will be an error message, in this case saying ‘element “body” incomplete’ because it is expecting one of any number of elements.
  • Instead of replacing this paragraph, let’s instead add a division. Move to inside the <body> element, between the opening tag and the closing </body> tag, where the paragraph was previously. Press the < key and wait a second; oXygen should helpfully give a drop-down list of the elements allowed by the TEI at this point. Scrolling up and down this list can give you a sense of the vast array of things you could be encoding, but it is also a bit of a mixture because at this point you can have texts with divisions or without them. Select the <div> element and notice what oXygen does.
  • oXygen should have added both an opening and closing division tag: <div></div>. Move the cursor between these two tags and press Enter a couple of times to get some space.
  • Add a <head> element and inside it put the text content “My First Heading”.
  • After the closing </head> tag, add a paragraph using the <p> element and the text “My first paragraph.”
  • In all cases make sure you only stop when you have a happy green square indicating that your document is well-formed and valid.
  • Your <body> element should now contain a single <div> with the <head> and <p> you have just added inside it (compare the first division in the sketch after this list).

  • Add at least one more division after this. (If you have a document with only one division, you don’t really need to use the <div> element at all.) Inside this second division, try nesting a sub-division!
  • If you do, your <body> element might look something like the sketch after this list.
  • Save your document.
  • The oxygen-tei framework comes complete with some transformations to other formats. From the oXygen menus choose Document → Transformation → Configure Transformation Scenario(s) and select ‘TEI P5 XHTML’ and click on ‘Apply associated’ (though this may be slightly different if you are using a different version of oXygen).
  • You should get a minimal HTML rendering of your file appearing in a browser. Note some of the information that the transformation has added. Try some other transformations or changing the document and seeing the effect.
  • Think about the nature of your own materials and how you might structure them if encoding them according to the default text structure of the TEI!
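If it helps, here is a rough sketch of what the finished <body> from the steps above might look like (the headings and text are just my own examples; yours will differ):

<body>
 <div>
  <head>My First Heading</head>
  <p>My first paragraph.</p>
 </div>
 <div>
  <head>A Second Heading</head>
  <p>Another paragraph.</p>
  <div>
   <head>A Nested Sub-Division</head>
   <p>Text inside the sub-division.</p>
  </div>
 </div>
</body>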

I’ve intentionally glossed over the introduction of many of the core TEI elements (such as <p>), but don’t worry we will survey these next time!

Go on to Self Study (part 4) TEI Core Elements next!

Posted in SelfStudy, TEI, XML | 3 Comments

Self Study (part 2): Introduction to the Text Encoding Initiative Guidelines

Quite a while ago I posted http://blogs.it.ox.ac.uk/jamesc/2012/03/15/self-study-introducing-xml-and-markup/ as a list of reading and steps I would recommend someone follow if they want to learn TEI XML and related technologies. That first step was to learn a little bit about XML and markup languages like HTML to get a bit of background.

The next step I’d recommend is to learn a bit more about the Text Encoding Initiative and the Guidelines it produces.

Questions:

  1. What markup language did documents using TEI P1 to TEI P3 use?
  2. How was this changed for TEI P4 and then TEI P5?
  3. In what way is the TEI ‘extensible’?

Questions:

  1. What does ‘ODD’ stand for? What can one generate from a TEI ODD file?
  2. What is a TEI module? What is the relationship between modules and chapters?
  3. What language does one use to define a TEI schema?
  4. Why might a single project use more than one schema at different stages in their project workflow?
  5. What is an attribute class? The att.global attribute class provides @xml:id and @n attributes to every element in the TEI; what is the difference between these two attributes? When might it be useful to use @n to number verse lines? When might this be a silly waste of time?
  6. What is the @xml:lang attribute for?
  7. What is the difference between the @rend, @style, and @rendition attributes?
  8. What is @xml:space for?
  9. What is a TEI model class, and what do members of the same class share?
  10. Why are model and attribute classes a good idea?
  11. What is a TEI datatype?

Note: If you are confused about modules vs model classes vs attribute class the following blog post might help: http://blogs.it.ox.ac.uk/jamesc/2008/09/01/modules-vs-model-classes-vs-attribute-classes/

  • Next, familiarise yourself with the table of contents of the TEI Guidelines: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index-toc.html
  • And then browse http://www.tei-c.org/release/doc/tei-p5-doc/en/html/REF-ELEMENTS.html which contains a complete list of elements provided by the TEI.
  • Choose a couple elements which you think you might know what they are used to encode and click on them to explore their reference pages. For example http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-address.html
  • The information on this page may seem confusing at first, but it lists:
    • the element’s definition; which ‘module’ it comes from
    • what attributes it has (and if they come from attribute classes)
    • what model classes the element might claim membership of (which controls where it is allowed to appear in your document)
    • a list (by module) of these elements which are allowed to contain this element
    • a list (by module) of which elements this element is allowed to contain
    • a declaration of the content model of the element (which can be toggled between Relax NG compact syntax and XML syntax)
    • one or more examples
    • possibly some additional notes on usage.

Questions:

  • The address element does not define any attributes of its own. How does this compare in layout to the availability element? What attribute does this (at time of writing) define for itself rather than getting it from a class?
  • The address element has two examples; what is the difference between them?
  • If you click on the ‘Show all’ link in one of the examples what do you get? Notice, for example how address is used inside the publicationStmt element to give the address of the publisher of the electronic text.

This is a very basic survey of some of the initial things you might want to learn before diving into the Guidelines in more detail. I plan to continue this in future with similar directed reading and questions on some of the topics the Guidelines cover. In fact, the next post in this series is http://blogs.it.ox.ac.uk/jamesc/2013/01/31/self-study-part-3-the-tei-default-text-structure/ which looks at the TEI’s Default Text Structure.

Posted in SelfStudy, TEI, XML | 1 Comment

Tokenizing and grouping rhyme schemes with XSLT functions

There is a project I work for which has encoded rhyme schemes in TEI using the @rhyme attribute on <lg> elements. This attribute contains some complex strings, as the project has used parentheses to indicate an internal rhyme and asterisks to indicate that a particular rhyme is a feminine (multi-syllable) rhyme. The rhyming strings themselves are also marked in the text with <rhyme> elements. So, for example, you get values that look like:

rhyme="(a*)a*(a*)b(c*)c*(c*)bddee(f)fg(h)hg/"
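I won’t reproduce the project’s actual markup here, but a hedged sketch of the general approach (the whole scheme on @rhyme, with each rhyming string wrapped in a <rhyme> element tied to the scheme by its @label) might look something like the following, with invented text:

<lg rhyme="(a*)a*b">
 <l><rhyme label="a">Roses</rhyme> are red and so are <rhyme label="a">posies</rhyme>,</l>
 <l>but this second line does not <rhyme label="b">rhyme</rhyme>.</l>
</lg>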

But I need, at any particular point, to be able to get these four things from this string:

  1. The documented rhyme above for the current <rhyme> element that I’m processing
  2. Whether the current rhyme is an internal (parentheses) or a feminine (asterisk) rhyme or not.
  3. The set of rhymes for the current line
  4. Whether the current line has any internal (parentheses) or feminine (asterisk) rhymes or not.

So the first step with this is to tokenize the given rhyme scheme.  I do this as an XSLT function and if I want to output it I could have something like:

 <xsl:variable name="rhyme">
(a*)a*(a*)b(c*)c*(c*)bddee(f)fg(h)hg/
</xsl:variable>
<tokenized-rhymes>
  <xsl:copy-of select="jc:tokenizeRhymes($rhyme)"/>
</tokenized-rhymes>

Here, inside some unseen template, I’ve got a variable with the rhyme scheme in it, and I’m getting a copy-of the output of a function I’ve created called jc:tokenizeRhymes(). This isn’t a very difficult XSLT function; it just consists of some xsl:analyze-string, like so:

<xsl:function name="jc:tokenizeRhymes" as="item()*">
<xsl:param name="rhyme"/>
<xsl:variable name="rhymes">
<list>
    <xsl:analyze-string select="$rhyme" regex="\(*[a-zA-Z]\**\)*">
        <xsl:matching-substring>
            <item>
                <xsl:value-of select="."/>
            </item>
        </xsl:matching-substring>
        <xsl:non-matching-substring/>
    </xsl:analyze-string>
</list>
</xsl:variable>
<xsl:copy-of select="$rhymes"/>
</xsl:function>

All this does is define a function which takes a single parameter (rhyme) and creates a variable containing a list with a bunch of items inside. To do this it uses a regular expression on xsl:analyze-string which looks for an optional opening parenthesis \(*, then any letter from a-zA-Z, an optional asterisk \**, followed by an optional closing parenthesis \)* … see, simple. The output from this looks like:


  <list>
         <item>(a*)</item>
         <item>a*</item>
         <item>(a*)</item>
         <item>b</item>
         <item>(c*)</item>
         <item>c*</item>
         <item>(c*)</item>
         <item>b</item>
         <item>d</item>
         <item>d</item>
         <item>e</item>
         <item>e</item>
         <item>(f)</item>
         <item>f</item>
         <item>g</item>
         <item>(h)</item>
         <item>h</item>
         <item>g</item>
      </list>

Well then, getting the current rhyme when I’m processing a <rhyme> element is fairly easy. I just create a variable $rhymePosition (the number of the rhyme I’m currently on) and can then call another function, jc:getCurrentRhyme, with that and the rhyme variable.

<xsl:variable name="currentRhyme">
  <xsl:value-of select="jc:getCurrentRhyme($rhyme, $rhymePosition)"/>
</xsl:variable>

The jc:getCurrentRhyme function is fairly straightforward as well. It looks like:

<xsl:function name="jc:getCurrentRhyme" as="item()*">
   <xsl:param name="rhyme"/>
   <xsl:param name="currentRhyme" as="xs:integer"/>
   <xsl:variable name="rhymes" select="jc:tokenizeRhymes($rhyme)"/>
   <xsl:copy-of select="$rhymes/list/item[$currentRhyme]"/>
</xsl:function>

It takes two parameters, the $rhyme and the $currentRhyme (which is an integer of how many rhymes there are so far in the <lg> including the one we are processing). It then creates a new variable $rhymes which has the output of the jc:tokenizeRhymes above. Then getting the current rhyme from the list is easy because we know its number so we just make a copy of the <item> we’ve created in that variable by using xsl:copy-of and filtering it by the number $currentRhyme. (This is why we made sure that this parameter was an integer.)

In order to check whether these are internal or feminine rhymes it is now very straightforward: we just test the $currentRhyme we’ve created above with contains($currentRhyme, ')') or contains($currentRhyme, '*').
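For example, a minimal sketch of such a test (the variable names here are my own, not part of the project’s stylesheet) might be:

<!-- true() if the rhyme string sits in parentheses or carries an asterisk -->
<xsl:variable name="isInternal" select="contains($currentRhyme, ')')"/>
<xsl:variable name="isFeminine" select="contains($currentRhyme, '*')"/>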

In order to get all the rhymes for this line, we need to re-process this tokenized list somewhat. We want to group those items which have parentheses together with the letter which follows them, splitting on each non-parenthesised letter (optionally followed by an asterisk). It took me a while to get my brain around that, but eventually I came up with:

<xsl:function name="jc:groupRhymes" as="item()*">
<xsl:param name="rhyme"/>
<xsl:variable name="rhymes" select="jc:tokenizeRhymes($rhyme)"/>
<xsl:variable name="groupedRhymes">
  <list>
   <xsl:for-each-group select="$rhymes/list/item"
      group-ending-with="*[matches(., '^[a-zA-Z]\**$')]">
     <item>
      <list>
       <xsl:for-each select="current-group()">
        <item>
         <xsl:value-of select="."/>
        </item>
       </xsl:for-each>
      </list>
     </item>
    </xsl:for-each-group>
  </list>
</xsl:variable>
<xsl:copy-of select="$groupedRhymes"/>
</xsl:function>

This function takes in the parameter $rhyme and tokenizes it using the earlier function, so now we have a list with the individual items in it. It then creates a new list and uses xsl:for-each-group to select all the tokenized items. It creates groups ending with any item whose content matches, from start to finish, a letter followed by an optional asterisk. This means each group will end with a normal rhyme letter, and any internal rhymes (in parentheses) will be included in that group. For each group it puts out a new item with a nested list and makes each rhyme in that line an item in that nested list. This might seem overkill to some, but having the extra nesting, regardless of whether there are 1, 2, or 20 rhymes in the line, just makes things easier. So the output from this looks like:

<list>
<item>
    <list>
        <item>(a*)</item>
        <item>a*</item>
    </list>
</item>
<item>
    <list>
        <item>(a*)</item>
        <item>b</item>
    </list>
</item>
<item>
    <list>
        <item>(c*)</item>
        <item>c*</item>
    </list>
</item>
<item>
    <list>
        <item>(c*)</item>
        <item>b</item>
    </list>
</item>
<item>
    <list>
        <item>d</item>
    </list>
</item>
<item>
    <list>
        <item>d</item>
    </list>
</item>
<item>
    <list>
        <item>e</item>
    </list>
</item>
<item>
    <list>
        <item>e</item>
    </list>
</item>
<item>
    <list>
        <item>(f)</item>
        <item>f</item>
    </list>
</item>
<item>
    <list>
        <item>g</item>
    </list>
</item>
<item>
    <list>
        <item>(h)</item>
        <item>h</item>
    </list>
</item>
<item>
    <list>
        <item>g</item>
    </list>
</item>
</list>

Which, admittedly, is fairly verbose. But you can now have a function that just gets the items for the individual line you are interested in, which would look something like:

<xsl:function name="jc:getCurrentLineRhymes" as="item()*">
  <xsl:param name="rhyme"/>
  <xsl:param name="currentLine" as="xs:integer"/>
  <xsl:variable name="rhymes" select="jc:groupRhymes($rhyme)"/>
  <xsl:copy-of select="$rhymes/list/item[$currentLine]"/></xsl:function>

Which when called with something like:

 <xsl:copy-of select="jc:getCurrentLineRhymes($rhyme, 4)"/>

(where ‘4’ here usually would be a variable containing the current line number) it will produce something like:

<item>
 <list>
  <item>(c*)</item>
  <item>b</item>
 </list>
</item>

And a simple string test using contains() can again tell you whether that line has any feminine (asterisk) rhymes or internal (parentheses) rhymes, etc.

Hurrah! See, that wasn’t that difficult after all. In this case it makes a good example of using XSLT 2.0 functions to call other functions, breaking the overall task down into manageable, more object-oriented-like tasks which can be re-used for a variety of purposes. (There are a lot of efficiencies which could be implemented here… jc:getCurrentLineRhymes and jc:getCurrentRhyme are almost identical, except that one uses jc:groupRhymes() and the other uses jc:tokenizeRhymes(); this could be one function which tests a parameter to see which is intended.)
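A hedged sketch of that refactoring (the function and parameter names here are my own invention) might be:

<!-- Hypothetical merged function: the third parameter chooses between
     the line-grouped list and the flat token list -->
<xsl:function name="jc:getRhymes" as="item()*">
  <xsl:param name="rhyme"/>
  <xsl:param name="position" as="xs:integer"/>
  <xsl:param name="byLine" as="xs:boolean"/>
  <xsl:variable name="rhymes"
    select="if ($byLine) then jc:groupRhymes($rhyme) else jc:tokenizeRhymes($rhyme)"/>
  <xsl:copy-of select="$rhymes/list/item[$position]"/>
</xsl:function>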

The whole XSLT stylesheet is available from https://github.com/jamescummings/conluvies/blob/master/xslt-misc/tokenize-rhyme-test.xsl.

Posted in TEI, XML, XSLT | Leave a comment