Boxing up

In 36 years you accumulate a lot of paper even if yr schtick is supposed to be all about digitization, etexts, etc. You still accumulate reports and articles and minutes and logs and offprints and publicity glossies and lists, things you wrote, things your friends wrote, things you read and enjoyed, things you felt you ought to read and enjoy, things you printed out because they looked as if you might enjoy them if you ever got round to reading them, things people sent you with touching dedicatory notes, things with sentimental association with times and places long since disappeared…. What’s to do with all that paper?

[Photo caption: State of play on 30 Oct 2010]

Most of it is by no means unique to me, so I could confidently think of its preservation as Someone Else’s Problem. But there seems to be a law somewhere about the inverse relationship between the size of the SEP field and the size of one’s ego, since I keep finding myself thinking about how jolly interesting this collection of stuff might be to anyone interested in the last three decades of digital humanities, as refracted through my experience of it. I know such beings must exist, I see they even get degrees in it, though who knows for how much longer.

So I spent the last couple of weeks of October skimming through the junk.

Fortunately (or not) OUCS has a large dry basement, formerly used to house the air conditioning systems for the machine room, now used to store old furniture and other things people cannot bear to be parted from. I have been allocated some space and into it my boxes are to be conveyed just as soon as I finish packing and labeling them. As of 1 Nov, when I gave up, there are neatly queued up for the basement store

  • 41 small, numbered boxes containing miscellaneous articles, offprints, manuals, reports etc.
  • 10 chronologically-ordered small boxes containing detritus from assorted events and conferences I attended between 1977 and 2000
  • 2 boxes containing the delegate packs and abstract volumes from the ALLC/ACH or ACH/ALLC or DH conference, also going back to the late seventies, with just a few gaps.
  • 3 boxes of papers and other material relating to the short glory days of the Humanities Computing Unit
  • One or two more about the Oxford Text Archive (already very well represented down in the OUCS Basement)
  • One box of very dusty old junk relating to my activities during the 80s, in the field of database design and support
  • Four or five boxes of TEI memorabilia, including committee documents and drafts and working papers never digitized. Several copies of P1 to P5… CDs from various TEI training events.

The contents of many of these boxes were catalogued back in the days when I had time to catalogue and an assistant to actually do it, on a Macintosh using a HyperCard stack I wrote. Many years ago also I wrote a script to export that in SGML (of course), and that SGML file has since been converted into a TEI XML file (quelle surprise), so I know (at least in theory) what should be in many of them without looking. So I contented myself with cataloguing very roughly only the conference fallout boxes in addition, and put off to a rainy day any investigation of the others.

And acting on the suggestion of the current librarian of Senate House, who just happens also to be the current chair of the DRHA standing committee, I also put into a separate box every surviving document relating to or produced by the DRH(A) series of conferences… yes every one of them from 1996 to 2010 inclusive. Said box has now been shipped off to the library at Senate House for digitization, and will form the raw material for my forthcoming intensely interesting study on the evolution and sociology of academic conferences concerning the emergence of the digital medium and its effects on the Humanities. Maybe.


Three dozen years at OUCS — vol three: 1996-2010

Somewhere around 1995, about the time that Harold Short and I were busy designing the ill-fated UK Arts and Humanities Data Service, I made the transition into management. It’s a bit of a blur now, but I think someone must have noticed I was having too much fun. For whatever reason, in one of those fits of enthusiasm for reorganisation which periodically grips OUCS, it was decided to knock together various hitherto discrete activities and create a new thingie called the “Humanities Computing Unit” with me as its manager. This rapidly proved to be not so much herding cats, as trying to prevent the huskies from disagreeing too much about the destination of our sledge: you will appreciate the justice of this metaphor if I remind you that the huskies in question included a comparatively youthful pair of doctors called Lee and Fraser, and Mr Michael Popham, amongst others.

Running the HCU was my introduction to the wonderful world of Oxford committees, the complexities of which rapidly made database knitting, international encoding research projects, and even corpus design look like child’s play. I also found myself having to develop something I’d never heard of called “interpersonal skills” (some will assure you I was as bad at this as at bit twiddling). Still, it was actually quite a lot of fun running a unit dedicated to something that I had vigorously argued in public didn’t exist, on a largely fictitious budget, along with a bunch of other enthusiastic lunatics. The HCU had a strong sense of its own identity, largely because it was sequestered away from the rest of OUCS in the afore-mentioned soggy basement, but also because its members were all so, how can I put this tactfully, weird. It also, I can proudly say, introduced many fine traditions to OUCS, not least the Xmas party. And I am particularly proud of the fact that its lunatics have now taken over the asylum.

Talking of lunatics, on Sept 11, 2001, as everyone knows, the world changed forever. Quite apart from some bizarre incident in New York, that was when we realised that the HCU had become too noticeable to continue as a discrete entity within OUCS, and that unfortunately the newly invented Humanities division didn’t have the money or the will power to offer it any other home. (It probably now has the latter, but sadly still not the former). We made a few attempts to redefine ourselves without the dread “humanities” badge we had been so proud of only five years earlier, but no-one was fooled. It was time to redeploy the components of the HCU across the rest of OUCS and I had to grow up and start doing serious things like amalgamate the OUCS front line support services, help square the circle that is co-ordination of distributed IT services across the university, manage a Research Technologies Service, set up and run an internal forum called the User Services Team, and manage the Core User Directory pilot service. Did I mention staying awake at SMG meetings? Yes, I had to do that too.

Fortunately I was still able to take some time off for good behaviour. Amongst other things, the TEI was reborn as an open source community, largely under the prodding of the very same Sebastian Rahtz I had known in the late eighties, and also underwent some long overdue technical enhancements which I could expatiate on at length, but won’t. The success of the TEI also inspired a number of other projects in which I was involved, notably a European effort to standardize manuscript descriptions originally masterminded by Peter Robinson, which led to my (and others) spending much of the start of the 21st century trying to explain the delights of XML to bewildered librarians across many parts of eastern Europe. And a bit nearer home, when some time around 2008 les digital humanités suddenly became cool again, I found myself invited to participate in some French events and projects, most notably an infrastructural project called ADONIS to which I was seconded for a year in 2009.

I should probably close this entry with some kind of magisterial summary, but the muse is fickle and the retirement party is close — and later today I expect to have to summarize at least some of this for the amusement of those gathered at the nice party OUCS is laying on for me here in Oxford. So all I will say now is that, despite the occasional grumbling, from my perspective it’s hard to conceive of a better employer than the University of Oxford has been. Throughout my career, I’ve worked closely with all sorts of people who don’t work for OUCS, and time and again I’ve found myself being quietly smug, and even mildly surprised by the way other institutions don’t seem to function in quite the same way. It’s not just the free coffee, or the readiness to fund the occasional cricket match or Artsweek event; it’s not even just the professionalism and the mutual respect that permeates all of those responsible for our technically very sophisticated environment: it’s the presence of a culture that allows, encourages, even requires people to find their own way and to develop their own enthusiasms. I used to joke that I’d always benefitted from years of benign neglect, but I have also learned from the other side of the fence just how hard it is to find the right balance between helping people to build their own way, and making sure that they build something worthwhile. So my thanks for that are due to the many previous OUCS directors and managers I’ve reported to — from Alan, via Christopher, Alan (a different one), Linda, Alan (yet another), Alan (the same one again), Alec (just for a change), Paul, and Stuart.

What next? That would be telling. If they let me keep this blog going….


Three dozen years at OUCS – vol two: 1986-1995

This decade began as the age of the BBC micro, the Amstrad Word Processor, and the Computers in Teaching Initiative. At OUCS, the ICL 1906A had been switched off in 1981 and replaced by an unloved but bright orange 2900 series mainframe, which we had the difficult task of persuading a sceptical university to like better than the really boring but much more efficient VAX VMS system which complemented it. But the days of the mainframe, whatever colour, were clearly numbered. In my office, an Olivetti microcomputer running something called MS-DOS appeared (catch me using an IBM like everyone else); a bit later on I installed a Macintosh with an A4 monitor and started preparing overheads for my talks on that instead of Charles Curran’s ICL PERQ. Computing stopped being something you did in batches or at a terminal, and became something you did on your desk. The phrases “information technology”, “desktop publishing” and “word processing” were heard in the land. Amongst other seminal events, in November 1986, I attended a conference at the University of Waterloo on the possibilities offered by the forthcoming first ever digitized edition of the Oxford English Dictionary. In April 1987, Sebastian Rahtz organised a conference on Computers and Teaching in the Humanities at Southampton University. And in November 1987, I attended an international conference at Poughkeepsie College in upstate New York from which was born the Text Encoding Initiative.

The Internet started its insidious transformation of everyday life. From 1989 onwards, earnest intellectual discussion on the newly founded Humanist mailing list led to new acquaintances and new social networks (only we didn’t call them that). In 1993 I went to a Network Services Conference, organised by something called the European Academic Research Network. Here a man called Robert Cailliau from CERN demonstrated live a program called Mosaic which could display data from sites on three different continents, which was almost as amazing as the sight of a room full of people from Eastern and Central Europe taking advantage of Poland’s recent accession to EARN (the European end of BITNET) and consequent unwonted connectivity by sending email messages back home in dozens of funny languages. To say nothing of the crazy notion, which I first heard voiced there, that one day people would actually use this World Wide Web thing as a means of making money. In Oxford, of course, after much deliberation, at this time we had just installed something called Gopher to run our information services.

I did a huge amount of travelling in the nineties, much of it on behalf of the TEI, which I joined as European editor in 1989. Between 1990 and 1994, when the TEI Guidelines were finally published as two big green books, I must have made more than a dozen trips to the US, and as many to various places in Europe, to attend the umpteen committee and workgroup meetings whose deliberations formed the basis of the TEI Guidelines, to argue with the TEI’s North American editor — one Michael Sperberg-McQueen — about how those deliberations should best be represented in SGML, and to make common cause with him in defending our decisions to the TEI’s occasionally infuriating steering committee. Michael has described the first of these processes as resembling the herding of cats but Charles Goldfarb, self-styled inventor of SGML, called it an outstanding exercise in electro-political audacity, which I like better. My participation in the TEI as “European Editor” was financed partly through a series of European grants obtained by the ingenious, charismatic, and sadly missed Antonio Zampolli, at that time one of the first people to identify and successfully tap the rich sources of research funding headquartered in Luxembourg.

My other major activity of the nineties was the creation of the British National Corpus: another attempt to create a REALLY BIG collection of language data, this time for the use of lexicographers. This project got some serious funding due to an unusual coincidence of interest amongst commercial dictionary publishers, computational linguists, and the UK Government, which was at that time keen to develop something called the “language engineering industries”. At OUCS, it led to the installation of some massive Sun Workstations in what is now a rather soggy meeting room called Turing but was then a rather soggy basement, along with three people to run them, one of whom (not me) actually knew how to design and implement a workflow for the production of several thousand miscellaneous texts and text samples with detailed SGML markup. It seems astonishing that both the TEI and the BNC are still alive and well, despite occasional reports of their obsolescence, but they are. The history of the TEI has yet to be written, if only because it’s far from over; I have however written an article about the history of the BNC called “Where did we go wrong?” a title which regularly confuses the non-native English speakers who keep buying copies of the corpus year after year.

Almost every project or institution mentioned in this blog entry now has an entry in Wikipedia. I don’t know what to make of that, but it is certainly sobering to reflect that when the decade I’m talking about began, we were somehow muddling along without any of mobile phones, Wikipedia, Facebook, or the Channel Tunnel, most of which had become unremarkable parts of everyday life by its end. That sort of thing makes a chap feel old.


Three dozen years at OUCS — vol one

I first joined OUCS as a data centre operator in the autumn of 1973. Newly returned with wife and baby from Malawi, I needed the money and OUCS wanted someone to look after its new “remote job entry” facility — a room containing a couple of teletypes, a card reader, a lineprinter, and a large mysterious white box called a MOD 1 located in the Atmospheric Physics department. My job was to keep an eye on the teletype monitoring the state of this computer by means of a noisy typewriter-like device which would tell me the time every five minutes, and the date every fifteen minutes on a very long roll of paper. Occasionally, it would also display a message to or from the other operators on the system. If it stopped working, I was responsible for calling out Geoff Lescott (my first boss) who would usually suggest rebooting the MOD1 by running a paper tape through it. If that didn’t work, we’d send for the engineers and take the rest of the day off, hoorah. I was also responsible for sorting out the reams of paper being churned out by the line printer in the middle of the room. There was a right way of doing this and a wrong way. The right way was to find the banner page with the name of the person whose job output followed and fold it neatly in half horizontally, without detaching it from the rest, tear off the rest of the batch belonging to that username, and place it tidily on the table in the corner. The wrong way was to just leave the whole pile in the corner, and let the users deal with it themselves. Since the users concerned were either boffins pressed for time or D Phil students who didn’t know their left hand from their right, the wrong way usually worked quite well.

In April 1974 I applied for, and got, a more white-collar kind of job, as an application programmer. This was almost like moving from the shop floor into middle management. My new contract was “academically related” and conferred on me the right to use the staff tearoom at the Computing Services’ swish new premises at 19 Banbury Road, as well as my very own desk in the corner of one of the top floor offices. I shared this with Carol Bateman, Bob Douglas, and Edwin Taylor. Charles Curran made a similar transition across the class divide at about the same time. Access to the time-sharing facility (“MOP”) on the 1906A was via a couple of teletypes outside the office, but most of my time was spent leafing through huge wodges of lineprinter output and scribbling on it. In those distant days, tobacco smoke was an acceptable, even accepted, accompaniment to office life, so I used to consume large amounts of fragrant Gold Block while wondering why my latest attempt to move some bits from one end of an 8 bit byte to the other had failed ignominiously.

The bit twiddling was done in a language called PLAN, which was the low level machine code for the ICL 1906A. My initiation as a programmer was to write a routine which could be used to pack six-bit characters from the ICL filesystem into eight-bit bytes on the PDP 8 mini used to drive Oxford’s state of the art graphics display devices. It took me weeks, and a lot of patience on the part of one of the real system programmers, Alan Fuller, who had been charged with my initiation into the mysteries of machine code programming, an art which I took to like a brick to water.
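The original PLAN routine is of course long gone, but the packing job itself is easy to state and fiddly to get right. By way of illustration, a rough modern sketch in Python of the general idea (the function name is invented, and the MSB-first bit order is my assumption — I no longer remember which way round the real hardware wanted its bits):

```python
def pack6(sixbit_values):
    """Pack a sequence of 6-bit values (0..63) into 8-bit bytes.

    Bits are accumulated MSB-first, so every four 6-bit characters
    fill exactly three bytes; any trailing bits are zero-padded.
    """
    buf = 0       # bit accumulator
    nbits = 0     # how many bits the accumulator currently holds
    out = bytearray()
    for v in sixbit_values:
        if not 0 <= v < 64:
            raise ValueError("not a 6-bit value: %r" % (v,))
        buf = (buf << 6) | v
        nbits += 6
        while nbits >= 8:             # emit every complete byte
            nbits -= 8
            out.append((buf >> nbits) & 0xFF)
    if nbits:                         # left-justify and pad the remainder
        out.append((buf << (8 - nbits)) & 0xFF)
    return bytes(out)
```

The awkwardness, then as now, is that six and eight share no common factor bigger than two: characters keep straddling byte boundaries, which is presumably why it took weeks in machine code.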

OUCS had a team of about 12 programmers in those days, divided into “system” programmers, who knew how to make the computer sit up and do stuff, and “application” programmers, who knew what sorts of stuff might actually be useful to anyone, chiefly libraries of arcane mathematical routines. The boundaries were fluid however, and further confused by the introduction during the eighties of a third estate called “communications” programmers who knew all about the bits of string by which computers and computer users communicated. The top man on the systems side was a charismatic fellow called Chris Cheetham, while the applications side reported to a strong-minded lady called Linda Hayes.

Once a week every programmer had to sit in a little room called “Programme Advisory” for a couple of hours, and attempt to console harassed users, usually clutching reams of Fortran printout. This was not as difficult as it sounds — nine times out of ten the problem was caused by misaligned pointers to COMMON blocks, wrong length variables (Ah the joys of the Fortran IMPLICIT statement), or NAG routines being called with faulty arguments. If all else failed, you could get rid of the user by suggesting insertion of some PRINT statements at random points in the code, or in really dire cases summoning another expert to repeat the diagnostic process. The chief benefit of an Oxford education, as any fule kno, is the ability to sound convincingly knowledgeable on anything, so I survived this experience unscathed, even though I didn’t (and don’t) know an eigenvector from a fast Fourier transform.

In those days of course, any university worth its salt had a massive mainframe computer, partially funded by a now defunct government agency called the Computer Board, and also maintained a small staff of technicians to look after it. This was necessary since those mainframe computers were temperamental beasts, and (moreover) all had to be controlled using different languages and command systems; us ICL experts remained for the most part blissfully ignorant of the workings of the IBM system at Liverpool or the CDC system in London. “Information Technology” as a career was in the process of being invented at the time. I was encouraged by the management to take an active role in representing the University on what we would now call community development activities such as user groups and inter-university committees for this, that, and the other. My collected visit reports from the period indicate how mutually baffling an experience it was for me and (say) the DP Manager of West Midlands Gas to find ourselves jointly lobbying ICL to get on with delivering some enhanced database product or other. They also contain a fair amount of academic gossip, in which some famous names pop up occasionally.

When it became apparent that bit-twiddling was not my forte, I took up (or invented) the role of database expert, which I occupied for most of the eighties. I also backed up Susan Hockey in providing some support for bewildered boffins from non-scientific or mathematical faculties who had heard rumours about the possibility of applying computing hardware to literary or linguistic research questions. I was of course already doing this for my own interest (I had spent quite a lot of my time as a data centre operator in producing the world’s first computer-generated concordance to the works of Bob Dylan, distributed to discerning friends as a Christmas present in 1973). In 1977, I also invented a role for myself as custodian and evangelist for something called the Oxford Text Archive, the object of which was partly to encourage the sharing of those expensively produced machine-readable source texts, despite the laws of copyright and the immense variability of practice amongst the strange people who transcribed texts for manipulation by computer in those days. And of course to satisfy my own curiosity about what might be done with a really big collection of such things, if it ever existed.

In the days before Google, the process of searching through large amounts of computer-held text was moderately problematic, and academically the concern of a minor subdiscipline of librarianship optimistically called “information retrieval”. The current wisdom was that you needed lots of predefined indexes until such time as expert systems came along to do the reading for you. At the end of the eighties I had a lot of fun exploring the possibilities of a wizard gadget called the “Content Addressable File Store” or CAFS. ICL wanted to use this now-forgotten British invention to revolutionize their transaction processing and database systems; I used it to provide high speed non-indexed searching of huge amounts of text, e.g. the complete works of Shakespeare, or the Bodleian pre-1920 catalogue, which was weird enough to feature extensively in ICL’s marketing literature, as was I.



Exit

The deal being now more or less wrapped up, it seems appropriate to announce it here first. I’m taking early retirement from OUCS, and moving on after (count them) thirty-six years in academic computing support. When you start thinking more about the decades that have disappeared than those which are yet to come, it’s usually a sign that you need to shake things up a bit, so that’s my plan. A touch regretfully of course, because Oxford University has been a very good employer, and I have thrived here in ways I couldn’t begin to imagine being possible elsewhere. Thanks for everything chaps: you’ve all been wonderful. There are loose ends to be sorted out in the next few weeks, but my leaving date is fixed for 30 September, so those that don’t get sorted by then probably won’t.


TEI Prospects and Practice in France

Attached to both the Institut des Sciences de l’Homme and the ENS de Lyon, there is in France an interesting project called MuTEC whose role in life (it says here) is to promote and to share expertise and experience in the digital humanities, in particular with respect to the creation of digital editions and corpora. Organization of discussion amongst representatives of some of the major users and would-be adopters of the TEI in France therefore seems to fall well within their remit. Their recent two-day event (financed by TGE ADONIS) included presentations on a carefully chosen range of topics with speakers from several different centres, plenty of debate, and two half days of highly concentrated training sessions. And, this being Lyon, a respectable amount of good eating.

Marie Luce Demonet (Bibliotheque Virtuelle des Humanistes, Univ of Tours) was down to talk about the BVH as an exemplary case of how such projects can achieve long life and happiness; the range and variety of activities and output which this project has achieved, and continues to achieve, remain exceptional however, particularly in view of its resources. Discussion focused on the way that the TEI was now being proposed as the glue which held its texts, databases of authority files, documentation, and other resources together, rather than simply one of the possible outcomes from the project. Tours is also organizing another TEI training session later this year, as a part of a “Masters Pro” course.

Bertrand Gaiffe from the Centre National de Ressources Textuelles et Lexicales in Nancy gave a good overview of the features provided by the TEI for use in linguistic analysis, protesting however that he knew nothing about linguistics (which is manifestly not true). He managed to convey the essential aspects of such arcana as the ISO data category register and the TEI feature structure system in a painless manner, such that even the casual TEI text creator could see why (and how) you might want to use them; no mean feat. I did however have to protest when he asserted that <s> elements could self-nest. (Note to self: must add Schematron constraint in P5 to prevent such folly).
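The offending condition is at least easy to detect mechanically. Pending a proper Schematron rule in P5, a minimal stand-alone check might look like the following Python sketch (standard library only; the function name and the sample documents in the tests are my own invention, not anything from the actual Guidelines):

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def has_self_nested_s(xml_text):
    """Return True if any TEI <s> element contains another <s>.

    <s> marks an orthographic sentence, so one sentence should
    never appear inside another.
    """
    root = ET.fromstring(xml_text)
    for s in root.iter(TEI_NS + "s"):
        # iter() yields the element itself first, so a count greater
        # than one means an <s> lurks among this element's descendants
        if sum(1 for _ in s.iter(TEI_NS + "s")) > 1:
            return True
    return False
```

The equivalent Schematron rule would simply assert, within the context of an `s` element, that no `s` ancestor or descendant exists.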

The “Intensive TEI Training” parts of the programme consisted of two morning sessions in classrooms at ENS, jointly taught by Florence Clavaud from the Ecole Nationale des Chartes and myself. The plan was to focus on just a few topics in some detail rather than give the usual overview of TEI Super Lite: selection of the topics was carried out in advance by polling the participants for their preferences – which revealed somewhat to my surprise a strong desire to know more about the TEI Header, amongst other things. Each session contained two talks and two hands-on practical sessions and the programme, complete with sample texts (contributed by Florence), talks, and workshop exercises is all on t’web so I will not describe it here. The participants’ expertise varied enormously, but everyone managed to get through the exercises somehow and even seemed to be enjoying the experience.

Alexei Lavrentev from the ICAR lab at the ENS de Lyon gave a presentation about how TEI schemas come into being, with some good material on ODD, and a description of the Base du Francais Medieval project. This sparked vociferous debate, both on the traditional philological issue of what texts should be edited, and on the traditional TEI argument about whether TEI-ALL should ever be used, ever.

Standing in for my Adonisian colleagues Richard Walter and Stephane Pouyllau, I presented a hastily confected overview of the current state of things TEI in France, using a nice map, and making some suggestions about ways in which a francophone TEI network might be further invigorated. I then somehow coerced a seven person panel (consisting largely of provocateurs handpicked from amongst the persons present) into orchestrating an energetic and wide ranging discussion. Although no-one had any concrete suggestions for how we might reclaim the south west corner of the hexagon for TEI, there was a general feeling that we ought to try a bit harder to use such facilities as the tei-fr mailing list, and a recognition that the TEI was very much a part of the new enthusiasm for a Digital Humanities agenda in France.

For the final session of the two days, we heard from Dominique Roux and Pierre-Yves Buard about ways in which TEI can fit well into both technical and economic models for small academic publishers, based on their experience at the Presses universitaires de Caen, which remains (sadly) unique amongst the numerous small French University presses in actually putting into practice some enlightened views about the role of a university press in the digital age, particularly with respect to the use of XML in the publishing process. This was held in a basement lecture room converted from a dungeon formerly used by the Gestapo, or so I was told; it felt more like a Turkish bath.


Why I was right about Project Gutenberg all along

Just came across this nice, if old, article about the limitations of crowdsourcing:

“I do not want the arguments above to suggest that Gracenote is worthless or Project Gutenberg useless. Far from it. Both are immensely useful. Nonetheless, both suffer from problems of quality that are not addressed by what I have called the laws of quality — the general faith that popular sites that are open to improvement iron out problems and continuously improve. In the case of Gracenote, it may be that only users with minority tastes suffer and they should be prepared to look after themselves. In the case of Project Gutenberg, by contrast, the Project does greatest disservice to those it most seeks to serve, the general reader who may not know enough about the texts he or she is reading to be able to distinguish nonsense from complexity, editorial misjudgment from authorial teasing, bowdlerization from Nordic prudery. In both cases, whether to guide users better or to improve the system, these limitations need to be recognized.”


What I am doing in Paris, since you ask

Last week I attended a face-to-face meeting of almost all the staff now working directly on the ADONIS project, a fairly unusual event (I think this was the first one in the current development cycle) since the project now has many people working in different locations around France. Probably as a punishment for typing too noisily during the meeting, I was asked to draft a brief report on it for the ADONIS website, which seems a good pretext for me to write up a blog posting here.

ADONIS is a TGE (Très Grand Equipement) directly financed by the Ministry of Higher Education (MESR) and reporting to the CNRS, the National Research Council. (Inevitably these two bodies occasionally have different points of view and priorities, which makes life, shall we say, interesting). ADONIS is so far the only TGE to have been specifically charged with responsibility for infrastructural support of the humanities and social sciences (SHS) and has defined an ambitious programme of work, currently underway. The TGE is co-ordinated by a small team based in Paris but is highly distributed, with most of its key activities and services being run by people attached to other labs and service organizations around the country, some of whom had never met before this meeting, though they feature on the official staff list.

After a set of mutual introductions around the table, Yannick Maignien (Director) and Richard Walter (Assistant Director) began by sketching out the TGE’s overall structure and objectives, placing them in the international and national contexts respectively. The mission of ADONIS is to improve the quality and efficiency of French research in SHS by facilitating better access to shared resources, promoting best practice, and defining and implementing the key infrastructural resources and services needed for the humanities research environment. These include, for example, provision of archival resources; development of an intelligent search engine for existing digital resources; training on the use of specific relevant technologies; and promotion of open access and other digital publishing methods.

Administratively speaking, ADONIS is a “Unité Propre de Service” (Specific Service Unit) attached to the CNRS, and funded like others of its kind on a four-year rolling programme. It has a Steering Committee (comité de pilotage), chaired by Michel Spiro, with representatives of the Ministry and the CNRS Institute for Human and Social Sciences, and other interested parties. It also has an Advisory Board (comité scientifique) with a dozen or so distinguished members (e.g. Simon Hodson from JISC, Stefan Gradmann from Humboldt University, Françoise Genova from the Strasbourg Observatory).

Since 2005, the CNRS has also funded a number of centres de ressources (National Resource Centres). These are subject-specific centres attached to one or more existing labos (research units) and charged with the task of sharing their expertise and their services with other research units. In the hard sciences, typically, research is carried out at one of a small number of large labos; in the Humanities, by contrast, there is a very large number of small units. Hence the importance of developing shared solutions to common problems, and the important role that ADONIS has with regard to the centres. Just to make life even more interesting, there are other relevant national organizations, notably the CNRS network of professional staff (collectively known in the CNRS as ITA: Ingénieurs, Techniciens, Administrateurs), and the completely different regional network of Maisons des Sciences de l’Homme, which provide support services via the Universities rather than via the CNRS.

Stephane Pouyllau, the third member of the central ADONIS team to speak, gave an overview of the Digital Humanities (sciences numériques) à la française, a set of topics which overlaps significantly but not entirely with the concerns and activities promoted under that badge elsewhere in the world. In France, it is preservation of, and access to, digital resources of all kinds which are the major concerns and which constitute the “digital turn” (le tournant du numérique); the major threats are seen to be such things as loss of data, dispersion of expertise, duplication of effort, solutions which do not scale, and lack of international visibility and recognition. The need for skills in helping non-technical experts gain confidence in technical areas is largely unrecognized by existing professional training for computing support staff, with consequent problems of communication. And of course, in France the word “science” includes the study of the Humanities as well as the study of “les sciences dures“.

ADONIS aims to address these problems in several ways. It will provide a socle de services (core set of services), including such key activities as archival services and cataloguing of resources needed by many labos, and, from the end of this year, it will also be offering a sophisticated search engine called Isidore. Isidore (Integrated Service for Indexing the Data of Research and Education — or something like that) will crawl and index a wide variety of existing data sources, analysing a variety of standard metadata formats (OAI-PMH, RSS, Sitemap, Z39.50…) to access and merge information into a single RDF store with its own SPARQL endpoint. Although this may entail working closely with service providers, the need to involve them in the project should mean that the quality, relevance, and accuracy of data provided will be much higher than is currently available from (e.g.) Google.
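The harvesting model sketched above can be illustrated in a few lines. This is not Isidore’s actual code, just a toy under stated assumptions: a canned OAI-PMH response with an invented identifier and record, whose Dublin Core fields are mapped to simple triples of the kind an RDF store could ingest.

```python
# Minimal sketch (not the actual Isidore implementation) of OAI-PMH
# harvesting: parse a ListRecords response and map Dublin Core fields
# to (subject, predicate, object) triples. Sample data is invented.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.fr:123</identifier></header>
      <metadata>
        <dc xmlns="http://purl.org/dc/elements/1.1/">
          <title>Les parlers du Croissant</title>
          <creator>Dupont, Marie</creator>
        </dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def harvest(xml_text):
    """Yield (subject, predicate, object) triples from an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    for rec in root.iter(OAI + "record"):
        subject = rec.find(f"{OAI}header/{OAI}identifier").text
        dc = rec.find(OAI + "metadata").find(DC + "dc")
        for field in dc:
            # strip the namespace prefix to get a readable predicate name
            yield (subject, "dc:" + field.tag[len(DC):], field.text)

for triple in harvest(SAMPLE):
    print(triple)
```

A real harvester would of course page through resumption tokens and handle many more metadata schemas; this shows only the core mapping step.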

Returning to the topic of the National Resource Centres, Richard gave a brief overview of the activities and responsibilities of each, as currently configured, noting in passing that the ANR (Agence Nationale de la Recherche, the main French research funding agency) had funded dozens of digital data creation projects in the past but had no policies in place to ensure their preservation or their continued accessibility. The skill sets available within the five existing centres include archaeology and 3-D modelling, modelling in social science, iconographic and visual indexing, and linguistic and terminological analysis, as well as appropriate technologies for the digitization and encoding of manuscript or spoken materials.

Stephane then presented the recently completed pilot project on long-term archiving, for which the CRDO (Centre de ressources pour la description de l’oral: the National Resource Centre concerned with spoken data) had served as guinea pig. Financed by ADONIS, this project had involved the CRDO as source of the original data and manager of access to its archived form; the CINES (National Computing Centre for Higher Education), which had managed the whole archival process in conformance with Open Archival Information System norms; and the CC-IN2P3 (Computing Centre of the National Institute for Nuclear and Particle Physics), which had provided the computing resources (a Fedora repository) for the resulting system. The project had thus demonstrated the viability of ADONIS’ distributed approach to the resourcing of such services, and shown the value of an “e-Research” mode of operation: the storage facilities of CC-IN2P3 are able to preserve the results of fieldwork on surviving old French dialects just as well as the experimental data resulting from nuclear reactions, while the CINES’ expertise in the international standard methodology for creating and managing long-term metadata is equally applicable to either.

Thus encouraged, we broke for a pizza lunch (see photo) at Casa valentino, rue St Jacques.

After lunch, we heard more technical detail about the archival experiment from Pierre-Yves Jalud, who is responsible for managing the project at CC-IN2P3. Pierre-Yves had of necessity become expert in the use of Fedora Commons as a repository management system; there was some discussion as to the merits of this open source solution as opposed to DSpace, which seems to be its closest rival. Pierre-Yves noted that the latter did not support iRODS, the de facto software platform for grid applications. He also cited Carl Lagoze’s article from 2006 as a foundational text for the definition of what a digital library should be.

His colleague Huân Thebault spoke in more detail about the authentication and authorisation solutions adopted for the project. RENATER (the French equivalent of JANET) already provides a Shibboleth-based national network of trust, so that the credentials from any RENATER site can be used to log in to any of the others, including the CC-IN2P3. However, this clearly needs to be complemented by something else for users coming from outside RENATER. The solution adopted is a hybrid architecture combining Shibboleth with OpenSSO, which is used to authenticate “foreign” users. The drawback, from some points of view, is that they have to maintain their own LDAP directory to hold authorisation data — but they would have to do this in any case.

Jean-Baptiste Génicot described some of the technical problems behind implementing efficient Z39.50-based access where the various repositories concerned have wildly varying notions both of what data items should be made available and of what technical infrastructure should be used to deliver them. His solutions used PHP’s existing SOAP library to handle data delivered via WSDL, derived from legacy formats such as BiblioML and MARCXML, in MODS 3.3.
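As an illustration of the kind of format crosswalk involved, here is a toy sketch (not Génicot’s code, which used PHP; the sample record and the single-field mapping are my own simplifications) that lifts a MARC 245$a title out of MARCXML and re-expresses it in a minimal MODS document.

```python
# Toy crosswalk: map the MARC 245$a (title) field from a MARCXML record
# into a minimal MODS document. Real crosswalks cover hundreds of fields;
# the sample record below is invented for illustration.
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
MODS_NS = "http://www.loc.gov/mods/v3"

MARC_SAMPLE = f"""<record xmlns="{MARC_NS}">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Histoire de la langue francaise</subfield>
  </datafield>
</record>"""

def marc_title_to_mods(marc_xml):
    """Build a minimal MODS document carrying the MARC 245$a title."""
    record = ET.fromstring(marc_xml)
    title = record.find(
        f"{{{MARC_NS}}}datafield[@tag='245']/{{{MARC_NS}}}subfield[@code='a']"
    ).text
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title
    return ET.tostring(mods, encoding="unicode")

print(marc_title_to_mods(MARC_SAMPLE))
```

The point is only to show the shape of the problem: each legacy format needs its own extraction rules, but all converge on a common target schema.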

The CCSD (Centre for Direct Scientific Communication) is one of the key bibliographic service providers in France, responsible amongst many other things for HAL (HyperArticles Online), the major open access archive of French research papers and resources. It also hosts another ADONIS-supported project, the Open Archive for Photographs and Scientific Images. Philippe Correia and Loïc Comparet reported on a new service under development there called Scienceconfs, which (when it opens) will provide a full range of conference management services, from announcements (an area already well served elsewhere), through reviewing and programme planning, to proceedings production. The project is still under development and there is good scope for consultation and review.

The CLEO (Centre for Open Electronic Publishing), also supported by ADONIS, has established itself as a key academic publisher in France and beyond, with several hundred journal titles published through its portal and a raft of other complementary services. It was represented at our meeting by Andréa Pirastru, a new recruit, who talked about his experience in working with Drupal, the content management system on which the portal depends.

As I noted above, ADONIS also has an important role outside the hexagon, as the sole French contributor to the European Research Roadmap for infrastructural support in the Humanities. Britta Moehring described some current activities in this connexion: notably the elaboration of a business plan and organizational structure for DARIAH, the EU-funded project which is supposed to be defining a Digital Research Infrastructure for the Arts and Humanities at a European level. The model that is emerging is, appropriately, a highly distributed one, in which a number of essential “competences” are identified, and then provided by possibly many partners acting in collaboration. This is being worked out with other DARIAH partners, notably the Max Planck Institute, the University of Göttingen, King’s College London, and DANS (Data Archiving and Networked Services), the Dutch project leaders. Further collaboration is to be anticipated with other key infrastructural initiatives, such as CLARIN (in which OUCS is also represented, by the way), and CESSDA.

ADONIS is well placed to do this work, since it faces exactly the same issues at the local (i.e. national) level. No single institution can provide all the components of an infrastructure of the kind needed, whether because of financial or skill-set limitations; the organization and maintenance of productive partnerships, and the distribution of specialist services, seem the only way forward.

My own brief intervention came at the end of a long day, so I kept it short. I commented on the “internationalisation” aspects and activities of ADONIS, some of which I have already touched on in this report. As well as DARIAH, and knowledge transfer with the non-francophone Digital Humanities community, I suggested that work with the Text Encoding Initiative was also an important component of the project (well, I would, wouldn’t I). I described briefly the TEI Demonstrator project in which we are collaborating closely with the Max Planck Digital Library and noted its synergy with the development of the Isidore platform. But mostly I showed the following nice picture of a Virtual Research Environment, as envisaged in France at the start of the 20th century. ADONIS is the box on the right, and its team is the hard-working boy turning the handle.

L'Utopie, 1910

Posted in Reports | 3 Comments

XAIRA meets her Maj

Looking around for an interesting data set to play with (for the TEI Demonstrator project inter alia) the other day, I discovered that the British Royal Family’s very own website includes transcripts of every one of the Queen’s Christmas Day broadcasts, from 1953 to date. A fascinating slice of English social history, reflecting our sovereign lady’s unchanged obsession with family values and the Commonwealth over the last half century, and also pretty easy to hoover up and reprocess into something susceptible of automatic analysis. (I’m not the first to notice this, by the way; I stole the idea from those clever chaps at the Times Online Labs.)

This post just summarizes what I did to make the corpus.

  1. I used wget to download the relevant chunks of the website (the bits I wanted were conveniently all in one subdirectory (ImagesandBroadcasts/TheQueensChristmasBroadcasts), but it proved easier to just grab the whole site and throw away countless uninteresting photos)
  2. I wrote an XSLT stylesheet to extract from the XHTML files on the website just the chunks I wanted and spit them out into separate plain TEI XML files. There were two files which didn’t follow exactly the same coding conventions as all the others, so I hand-edited them into conformity. There were three files which were not valid XHTML (weirdo character entity references) so I wrote a perl script to hack them into submission. It happens.
  3. This gave me a bunch of files which start off like this:
    <div n="1974">
    <head>Christmas Broadcast 1974</head>
    <!--The Queen's Christmas Broadcast in 1974 alludes to problems such as continuing violence in Northern Ireland and the Middle East, famine in Bangladesh and floods in Brisbane, Australia. -->
    <p>There can be few people in any country of the Commonwealth who are not anxious about what is happening in their own countries or in the rest of the world at this time.</p>
    <p>We have never been short of problems, but in the last year everything seems to have happened at once. There have been floods and drought and famine: there have been outbreaks of senseless violence. And on top of it all the cost of living continues to rise - everywhere.</p>
    <p>Here in Britain, from where so many people of the Commonwealth came, we hear a great deal about our troubles, about discord and dissension and about the uncertainty of our future.</p>
  4. Next, I used treetagger to add simple linguistic analysis to the texts. By default, treetagger takes XML marked-up text, leaves the markup alone, tokenizes the text, one word or punctuation mark per line, and adds POS codes and lemmata. I keep meaning to do something about making it output the results in a nice clean TEI-conformant version, but somehow it’s always quicker to just run an after-the-event perl script to tidy up its output. This gave me a bunch of files that contained lines like this:
    <div n="1974"><head><s><w type="NP" lemma="Christmas">Christmas</w>
    <w type="NP" lemma="Broadcast">Broadcast</w>
    <w type="CD" lemma="@card@">1974</w>
    </s></head><p><s><w type="RB" lemma="there">There</w>
    <w type="MD" lemma="can">can</w>
    <w type="VB" lemma="be">be</w>
    <w type="JJ" lemma="few">few</w>
    <w type="NNS" lemma="people">people</w>
    <w type="IN" lemma="in">in</w>
    <w type="DT" lemma="any">any</w>
    <w type="NN" lemma="country">country</w>
    <w type="IN" lemma="of">of</w>
    <w type="DT" lemma="the">the</w>
    <w type="NP" lemma="Commonwealth">Commonwealth</w>
    <w type="WP" lemma="who">who</w>
    <w type="VBP" lemma="are">are</w>
    <w type="RB" lemma="not">not</w>
    <w type="JJ" lemma="anxious">anxious</w>
  5. Finally, I wrote a TEI header file to put all the files together into a single TEI document or corpus.
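The after-the-event tidy-up mentioned in step 4 was a perl script not shown in the post; as a sketch of what such a script does, here is a rough Python equivalent. It assumes TreeTagger’s default vertical output (token, POS tag, and lemma separated by tabs, with XML tags passed through on lines of their own) and emits the kind of <w> elements shown above.

```python
# Rough stand-in (not the original perl script) for tidying TreeTagger's
# default vertical output into TEI-ish <w> elements:
#   token<TAB>POS<TAB>lemma   ->   <w type="POS" lemma="lemma">token</w>
# XML tags, which TreeTagger passes through untouched, are kept as-is.
from xml.sax.saxutils import escape, quoteattr

def treetagger_to_tei(lines):
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("<") and line.endswith(">"):
            out.append(line)  # pass XML markup through untouched
        else:
            token, pos, lemma = line.split("\t")
            out.append("<w type=%s lemma=%s>%s</w>"
                       % (quoteattr(pos), quoteattr(lemma), escape(token)))
    return "\n".join(out)

sample = ["<s>", "There\tRB\tthere", "can\tMD\tcan", "be\tVB\tbe", "</s>"]
print(treetagger_to_tei(sample))
```

A fuller version would also wrap punctuation in <pc> and handle TreeTagger’s occasional multiword tokens, but this captures the basic line-by-line rewrite.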

Then, just for fun, I moved the corpus onto a (virtual) Windows machine (why? Because the all-singing, all-dancing web client for XAIRA is not quite ready yet) and followed the handy Indexing with Xaira Tutorial to produce a XAIRA-searchable version of it. I’ll put up a few screen shots to prove the point later.

Posted in Hackery, TEI Chat | Comments Off on XAIRA meets her Maj

apt-get install hadopi : la loi installée par les geeks – Numerama

This hits so many of my favourite buttons at once it’s hardly likely anyone else will get it. But it still made me chortle.

apt-get install hadopi : la loi installée par les geeks – Numerama.

Posted in Uncategorized | Comments Off on apt-get install hadopi : la loi installée par les geeks – Numerama