The Text Creation Partnership (TCP) creates standardized, accurate XML/SGML-encoded electronic text editions of early printed books. This work, and the resulting text files, are jointly funded and owned by more than 150 libraries worldwide. All of the TCP’s work will be released into the public domain for anyone to use.

On January 1, 2015, TCP reached a major milestone: all restrictions were lifted from EEBO-TCP Phase I, which consists of the first 25,000 texts transcribed and encoded by the TCP. These texts are now freely available to anyone wishing to use them, with no restrictions on sharing the files, which are licensed under the Creative Commons Public Domain Dedication (CC0 1.0 Universal).

To make this tranche of texts available not only in law and theory but also in practice, the team at the University of Oxford had to provide the means of accessing the HTML, ePUB and TEI P5 XML versions via the Oxford Text Archive.


TCP is full of all sorts of advice, so why go to the self-help section of the bookstore when there’s this huge collective wisdom to enjoy?

Sebastian Rahtz and James Cummings of Oxford were mainly responsible for the launch, while my part in these efforts was to create the script that extracts the catalogue data from a PostgreSQL relational database and presents it to the world as the searchable table that can now be enjoyed on the TCP catalogue page. We wanted the complete catalogue of TCP texts, both freely available and restricted, to be displayed in a manner that enables simple and quick searching, filtering and browsing of the resources.

The DataTables jQuery plugin is perfect for such an application, but I had to make sure that performance on a set of more than 60,000 records would be satisfactory. Luckily, DataTables comes with a server-side processing option, which means that all the paging, searching and ordering actions that DataTables typically performs in the browser are handed off to a server, where an SQL engine (or similar) can perform them on the large data set much more efficiently. The DataTables website hosts example implementations with PHP and various database engines. The only obstacle was that the TCP catalogue records were hosted in a Postgres database, and the PHP/PostgreSQL example script was definitely not in the mood to work on our setup. Eventually I ended up porting one of the PHP/MySQL examples to Postgres.
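To make the hand-off concrete, here is a minimal sketch of what a server-side processing endpoint has to do for each request. It is written in JavaScript over an in-memory array rather than PHP and SQL, so it is not the script behind the TCP catalogue; the sample records, the column names and the `handleRequest` function are all made up for illustration. The request fields (`draw`, `start`, `length`, `search`, `order`) and the reply fields (`draw`, `recordsTotal`, `recordsFiltered`, `data`) follow the DataTables server-side protocol, shown here as they look once the query string has been parsed into nested objects.

```javascript
// In-memory stand-in for the catalogue table (invented sample data).
const columns = ["id", "title", "author", "status"];
const records = [
  { id: "A00001", title: "A sermon", author: "Smith", status: "Free" },
  { id: "A00002", title: "A treatise", author: "Jones", status: "Restricted" },
  { id: "A00003", title: "Poems", author: "Smith", status: "Free" },
  { id: "A00004", title: "A voyage", author: "Brown", status: "Free" }
];

// Answer one server-side processing request.
function handleRequest(params, rows, columnNames) {
  const term = ((params.search && params.search.value) || "").toLowerCase();

  // Global search: keep rows in which any column contains the term.
  const filtered = term
    ? rows.filter(row =>
        columnNames.some(c => String(row[c]).toLowerCase().includes(term)))
    : rows.slice();

  // Ordering: this sketch applies only the first requested sort column.
  const order = (params.order || [])[0];
  if (order) {
    const key = columnNames[order.column];
    const sign = order.dir === "desc" ? -1 : 1;
    filtered.sort((a, b) => sign * String(a[key]).localeCompare(String(b[key])));
  }

  // Paging: DataTables sends length = -1 to ask for every row.
  const start = Number(params.start) || 0;
  const length = Number(params.length);
  const page = Number.isFinite(length) && length >= 0
    ? filtered.slice(start, start + length)
    : filtered.slice(start);

  return {
    draw: Number(params.draw),         // echoed back so replies can be matched to requests
    recordsTotal: rows.length,         // size of the table before filtering
    recordsFiltered: filtered.length,  // size after filtering, before paging
    data: page.map(row => columnNames.map(c => row[c]))
  };
}
```

With a real database, the filter, sort and slice steps collapse into a single SQL query, so the browser only ever receives one page of rows at a time.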

Porting involved changing all the MySQL-specific dialect into something that PostgreSQL can grok.
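The post does not list the individual changes, so the following is only a sketch of the kind of dialect differences such a port typically runs into, not the actual catalogue script. It assumes the usual suspects: PostgreSQL quotes identifiers with double quotes where MySQL uses backticks, needs `LIMIT length OFFSET start` where MySQL also accepts `LIMIT start, length`, and offers `ILIKE` for case-insensitive matching (MySQL's `LIKE` is usually case-insensitive already). The table and column names, and the `quoteIdent` and `buildPageQuery` helpers, are hypothetical, and the example is in JavaScript rather than PHP to keep it short.

```javascript
// Quote an identifier the PostgreSQL way (double quotes instead of backticks).
function quoteIdent(name) {
  return '"' + String(name).replace(/"/g, '""') + '"';
}

// Build a parameterised query for one page of results.
// req: { search, orderColumn, orderDir, start, length }
function buildPageQuery(table, columns, req) {
  const values = [];
  let where = "";

  if (req.search) {
    values.push("%" + req.search + "%");
    // ILIKE matches case-insensitively; the cast lets it run on non-text columns too.
    where = " WHERE " + columns
      .map(c => "CAST(" + quoteIdent(c) + " AS TEXT) ILIKE $1")
      .join(" OR ");
  }

  // Only a known column and a fixed direction keyword reach the ORDER BY clause.
  const orderCol = columns[req.orderColumn] || columns[0];
  const dir = req.orderDir === "desc" ? "DESC" : "ASC";

  values.push(req.length, req.start);
  const n = values.length;

  // PostgreSQL form of MySQL's "LIMIT start, length".
  const text =
    "SELECT " + columns.map(quoteIdent).join(", ") +
    " FROM " + quoteIdent(table) + where +
    " ORDER BY " + quoteIdent(orderCol) + " " + dir +
    " LIMIT $" + (n - 1) + " OFFSET $" + n;

  return { text: text, values: values };
}
```

The numbered `$1`, `$2` placeholders are PostgreSQL's own parameter syntax, so the search term travels separately from the SQL text instead of being spliced into it.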

Plugging the script back into the HTML catalogue page requires just this bit of JavaScript:

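$(document).ready(function() {
   $('#example').dataTable( {
       // show a "processing" indicator while waiting for the server
       "processing": true,
       // hand paging, searching and ordering over to the server
       "serverSide": true,
       // script that answers the Ajax requests
       "ajax": "scripts/server_processing.php"
   } );
} );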

In the end, though, we went with something a bit more elaborate to allow filtering on individual columns and some automatically generated content, including links to the HTML versions and XML sources.
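The more elaborate version is not reproduced in this post, but the two additions could be sketched roughly as follows. This is an assumption-laden illustration, not the catalogue's actual code: the base URLs are placeholders, the identifier is assumed to sit in the first column and to be plain alphanumeric text, and the table footer is assumed to contain one `<input>` per column. It uses the DataTables `render` option to generate the links and `column().search()` to send per-column filters to the server.

```javascript
// Placeholder base URLs; the real link targets are not given in the post.
const HTML_BASE = "https://example.org/html/";
const XML_BASE = "https://example.org/xml/";

// DataTables calls a render function with the cell data and the purpose of
// the call ("display", "filter", "sort" or "type"). Only the display version
// gets the generated links; sorting and filtering keep the raw identifier.
function renderLinks(id, type) {
  if (type !== "display") {
    return id;
  }
  const safe = encodeURIComponent(id);
  return id +
    ' <a href="' + HTML_BASE + safe + '.html">HTML</a>' +
    ' <a href="' + XML_BASE + safe + '.xml">XML</a>';
}

function initCatalogue() {
  const table = $("#example").DataTable({
    processing: true,
    serverSide: true,
    ajax: "scripts/server_processing.php",
    // Assumed layout: the text identifier is in column 0.
    columnDefs: [{ targets: 0, render: renderLinks }]
  });

  // Wire each footer input to its column; with serverSide enabled the
  // per-column search term is sent along with the next Ajax request.
  table.columns().every(function () {
    const column = this;
    $("input", column.footer()).on("keyup change", function () {
      if (column.search() !== this.value) {
        column.search(this.value).draw();
      }
    });
  });
}

// Run only in a browser where jQuery has been loaded.
if (typeof window !== "undefined" && window.jQuery) {
  window.jQuery(initCatalogue);
}
```

Per-column filters only have an effect if the server-side script also reads the column search terms from the request and adds them to its query.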


Excerpt from the source summary

Each text has its own repository on GitHub comprising the TEI P5 XML source plus a Markdown readme file gathering some information extracted from the TEI source. The scripts to generate the latter are again my doing and can be found in the MDown subdirectory of a special TCPTools repository. Source files of TCP texts can now be forked from GitHub to do with as one pleases. If one should want more of those, there’s yet another interesting repository that lists all TCP repositories in CSV and JSON formats and provides scripts to clone everything at once. It will be interesting to see what people do with all this bounty!

