The Oxford Text Archive in 2016

Analysis of the logs for downloads of resources from the Oxford Text Archive in the calendar year 2016 reveals a continuing increase in usage. A few years ago there was a big leap in downloads thanks to the ingest of a large number of texts from the Text Creation Partnership, which became available via the OTA when it became legally possible to share them openly.

Other factors aiding increased usage of the OTA include:

  • BNC free for download: during 2016 the British National Corpus was made available for direct download without having to fill in a form and wait for authorization, and as a result downloads continue to increase;
  • Freeing the texts: an ongoing programme of reassessing legacy data, and, where possible, removing access restrictions;
  • Higher visibility: resource discovery via the CLARIN Virtual Language Observatory, which aggregates OTA records and offers a new way for users to find the texts;
  • Shibbolization: a small number of resources are currently available to UK users only, but are slowly being opened up Europe-wide thanks to CLARIN and eduGAIN;
  • More digital research: demand grows as more users in the humanities start to engage in digital scholarship.

The grand total for discrete downloads of resources from the Oxford Text Archive was 1263810, or 1.26 million. Each of these represents the successful download of the content of a resource; the numbers were calculated after filtering out all hits from spiders, crawlers, robots and other automated processes, and ignoring failed downloads. The total is an increase of around 38% on last year’s total. Of these, 395812 could be identified as originating from users in the University of Oxford, approximately 31% of the total and more than double the number from last year. Of the total downloads, more than 99.6% were direct downloads of resources made available at open URLs; the rest were downloads of the various resources where access restrictions require authorization.
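The filtering described above can be sketched roughly as follows. This is only an illustration: it assumes the web server writes the Apache "combined" log format, the bot patterns are a small illustrative list, and the paths in the sample lines are invented rather than real OTA URLs.

```python
import re
from collections import Counter

# Apache "combined" log format (an assumption; the real OTA log format is not described).
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Illustrative user-agent hints for automated clients; a real list would be much longer.
BOT_HINTS = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)


def count_downloads(lines):
    """Count successful, non-robot GET requests per requested path."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not parse
        if match["method"] != "GET" or match["status"] != "200":
            continue  # ignore failed or non-download requests
        if BOT_HINTS.search(match["agent"]):
            continue  # ignore spiders, crawlers and robots
        counts[match["path"]] += 1
    return counts


# Invented sample lines: one human download, one robot hit, one failure, one human download.
SAMPLE = [
    '192.0.2.1 - - [01/Feb/2016:10:00:00 +0000] "GET /text/3259.xml HTTP/1.1" 200 51234 "-" "Mozilla/5.0"',
    '192.0.2.2 - - [01/Feb/2016:10:00:05 +0000] "GET /text/3259.xml HTTP/1.1" 200 51234 "-" "Googlebot/2.1"',
    '192.0.2.3 - - [01/Feb/2016:10:00:09 +0000] "GET /text/2554.zip HTTP/1.1" 404 0 "-" "Mozilla/5.0"',
    '192.0.2.4 - - [01/Feb/2016:10:01:00 +0000] "GET /text/2554.zip HTTP/1.1" 200 999 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for path, n in count_downloads(SAMPLE).most_common():
        print(n, path)
```

Matching on user-agent strings alone will miss robots that do not identify themselves, so a production count would probably combine this with other signals such as request rates.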

Here are this year’s top ten:

Number of downloads | Title | Author | ID | Class
9313 | The poems of John Keats | Keats, John, 1795-1821 | 3259 | text
8351 | VOICE: Vienna-Oxford International Corpus of English | Barbara Seidlhofer | 2542 | corpus
6543 | British National Corpus, XML edition | BNC Consortium | 2554 | corpus
4936 | British National Corpus, Baby edition | BNC Consortium | 2553 | corpus
4616 | The four seasons, and other poems. By James Thomson | Thomson, James, 1700-1748. | 3549 | ECCO
4407 | An account of the proceedings against the rebels, and other prisoners, tried before the Lord Chief Justice Jefferies: and other judges in the west of England, in 1685. for taking arms under the Duke of Monmouth. … To which is prefix’d, the Duke of Monmouth’s, the Earl of Argyle’s, and the Pretender’s declarations, that the reader may the better judge of the cause of the several rebellions. | | 4431 | ECCO
3696 | Beggar’s opera. Libretto. | Gay, John, 1685-1732 | 3257 | text
3663 | New York newspaper advertisements and news items: 1777-1779 | | 3151 | text
3613 | The history of the most noble Order of the Garter: Wherein is set forth an account of the town, castle, chappel, and college of Windsor; … To which is prefix’d, a discourse of knighthood in general, … Collected by Elias Ashmole, … The whole illustrated with proper sculptures. | Ashmole, Elias, 1617-1692. | 5268 | ECCO
3564 | The peerage of Scotland: containing an historical and genealogical account of the nobility of that Kingdom. … By George Crawfurd, Esq;. | Crawford, George, fl. 1710. | 5301 | ECCO

There is also a table with the top 20 downloads of 2016. Overall, more than 36000 different resources were downloaded.

The table below shows the most popular items with access restrictions, which required an online application and manual authorization before they could be downloaded. There were 4401 of these downloads, an average over the year of more than ten per day, each of which needed to be manually authorized by a member of staff. Last year there were 3681. Some of the resources below were made freely available during the year, and so were accessed via direct download as well.

Number of downloads | Title | ID | Notes
2664 | British National Corpus, XML edition | 2554 | Also 3543 direct downloads and 356 via Shibboleth
320 | British National Corpus, Baby edition | 2553 | Also 4439 direct downloads and 176 via Shibboleth
236 | The Lancaster Corpus of Mandarin Chinese | 2474 |
111 | Helsinki corpus of English texts | 1477 |
97 | British Academic Written English Corpus | 2539 | Also 663 direct downloads
97 | Complete corpus of Old English: the Toronto dictionary of Old English corpus / compiled by the University of Toronto Centre for Medieval Studies | 0163 |
74 | Parsed Corpus of Early English Correspondence (PCEEC) | 2510 |
67 | British Academic Spoken English corpus | 2525 |
55 | Cat on a hot tin roof / Tennessee Williams | 1233 |
43 | A Corpus of English Dialogues 1560-1760 (CED) | 2507 |
43 | British National Corpus Sampler | 2551 |
43 | The York-Toronto-Helsinki Parsed Corpus of Old English prose (YCOE) | 2462 |
31 | Dictionary of Old English Corpus in Electronic Form (DOEC) | 2488 |

There were 556 downloads from the experimental site hosted by the Oxford e-Research Centre, where users can download one of a small number of resources (of which the BNC is the most popular) by authenticating with their institutional single sign-on. This is an increase from 321 last year, despite some periods of down-time for the service. Only thirty-six of these downloads were from the University of Oxford.


Oxford Text Archive moving to the Bodleian Libraries

Photo of the Bodleian Library

The Oxford Text Archive (OTA) is getting a new lease of life. It is moving to the Bodleian Library, in a transition which should guarantee its long-term sustainability, and open up many new opportunities.

The OTA has had a home in Oxford University Computing Services, now named IT Services, since it was founded by Susan Hockey and Lou Burnard in 1976. You can read a little more about its history in the ‘OTA at 40’ post. The OTA archives, preserves and makes available digital texts and related resources, and is involved in a number of collaborations with other repositories, researchers around the world, and the CLARIN European Research Infrastructure. In recent times it has become increasingly clear that these activities are a better fit for the mission and strategic plan of the university library, and so the decision has been taken to move the OTA into a new partnership with the Bodleian Library, starting from 1st November 2016.

The OTA will continue to offer the same services that it offers now, mainly free downloads of texts for academic, educational and research use, from the website at http://ota.ox.ac.uk/, and will remain committed to the long-term preservation of digital literary and linguistic resources, and to making them available for re-use. The OTA will work closely with Electronic Enlightenment – letters and lives online to help people find connections between primary texts and online information about people and social networks. With the backing of the extensive research data management facilities and services of the library, the OTA will be closer to the centre of exciting ongoing developments in digital preservation, data access, resource creation, and digital publishing in the University of Oxford.

In the short term, as far as users are concerned, there should be no visible differences in the service, apart maybe from some subtle changes to the branding. But in the longer term, watch out for lots of improvements and more links with digital collections in the library and beyond!


The Oxford Text Archive at 40

The Oxford Text Archive (OTA) was founded by Lou Burnard in 1976, and has been in continuous operation at the University of Oxford ever since.

Oxford Text Archive at 40

2016 is therefore our fortieth anniversary. Ten years ago we organized a one-day event to celebrate the thirtieth anniversary, and to look back and forwards. A summary of the day can be found on the OTA Thirtieth Birthday page.

The OTA is a repository for digital texts. The collection includes large numbers of digital editions, language corpora, and some more complex digital collections, such as databases, collections of data from websites, and images and audio data. Most of the items are the outputs of academic research projects, and one of the main roles of the OTA is to offer a route for digital outputs to be preserved, shared and reused beyond the end of fixed-term projects.

The OTA offers long-term preservation for its collections with secure storage in an HFS archive account. Accession of new items to the collection continues, although the OTA does not currently actively seek new accessions. Funded research projects from any institution are welcome to get in touch to discuss deposit of new works.

The OTA continues to participate in a number of collaborations. It is a centre in the CLARIN European Research Infrastructure Consortium, and is home to the coordination of CLARIN-UK. One of the results of this is that resources in the OTA can be discovered via the CLARIN Virtual Language Observatory. Users of the OTA can explore many of the texts online by clicking on the link to explore the text in Voyant Tools. Since all of the unrestricted resources are made available with stable and persistent URIs, other services can be used to process individual texts or batches of them.

Standards-conformant texts encoded in XML can be accessed in a variety of formats thanks to the OxGarage document conversion service – see for example this edition of Twelfth Night.

Below is a brief timeline of some of the key milestones in the past forty years:

  • 1976 Start of the Oxford Text Archive, based in Oxford University Computing Services (OUCS)
  • 1978 Oxford Concordance Programme launched by Susan Hockey
  • 1979 Kurzweil data entry machine (KDEM) installed in OUCS
  • 1987 Start of the Text Encoding Initiative (TEI)
  • 1989 Start of the project to build the British National Corpus
  • 1989 Computers in Teaching Initiative (CTI) Centre for Textual Studies
  • 1994 Launch of British National Corpus
  • 1995 First publication of TEI Guidelines
  • 1995 Humanities Computing Unit formed in OUCS
  • 1996 Start of Arts and Humanities Data Service
  • 2008 Start of Common Language Resources and Technology Infrastructure (CLARIN)
  • 2008 End of Arts and Humanities Data Service
  • 2015 EEBO TCP texts available via OTA

Read more

Burnard, Lou (1988), ‘Report of Workshop on Text Encoding Guidelines’, Literary and Linguistic Computing 3: 131–3.

Burnard, Lou, and Harold Short (1996), An Arts and Humanities Data Service, JISC [http://www.ahds.ac.uk/about/documents/ahds-feasibility-study.pdf]

Burnard, Lou (undated), ‘Humanities Computing in Oxford: a Retrospective’ [http://users.ox.ac.uk/~lou/wip/hcu-obit.txt]

Hockey, Susan (2004), ‘The History of Humanities Computing’, in A Companion to Digital Humanities, ed. Susan Schreibman, Ray Siemens, John Unsworth. Oxford: Blackwell, [http://www.digitalhumanities.org/companion/]

Proud, Judith K. (1989), The Oxford Text Archive. London: British Library Research and Development Report.

Pajares Tosca, Susana (2000), Report on the Humanities Computing Unit, [https://pendientedemigracion.ucm.es/info/especulo/hipertul/HCUreport/HCUeng.htm]


Connecting and Integrating Language Resources and Tools with CLARIN

As part of the meeting of the CLARIN technical centres, an open tutorial session was held at the SURF offices adjacent to Utrecht train station. SURF is the collaborative ICT organisation for Dutch higher education and research.

After a welcome from Dieter Van Uytvanck, there were presentations and discussion on various aspects of the CLARIN infrastructure.

Federated Search Results

The Federated Content Search (FCS) version 2 was presented by Leif-Jöran Olsson from Språkbanken in Sweden. FCS decouples the back-end functionality of online corpus search engines from the results, and aims to aggregate and integrate results from multiple sites. CLARIN-FCS 2.0 is not a replacement for, but an extension of, the existing functionality. It will be backwards compatible with 1.0. It is built on the SRU/CQL interface specification. There are guides and tutorials on the CLARIN technical wiki.

The transport protocol encompasses an endpoint, which sits between the FCS client and the idiosyncrasies of the various search engines based in different repositories. The endpoint translates queries from the client, written in CQL or FCS-QL, into the query dialect of the local search engine, and translates query results back into the format required by the client. A discovery phase allows a client to see the functional capabilities of a search engine, and which resources are available for search.

FCS 2.0 allows advanced search in multiple layers of annotation: token, lemma, pos (UD-17), orth, norm, phonetic, text. Only text search was possible in the earlier version. The FCS-QL is a superset of CQL 3.0, and there are adapters for CWB. Hits are serialized as CLARIN-FCS results, with each hit serialized as one record. Data views allow different presentations of results. The technical requirements for offering an endpoint include reference libraries SRUServer, SRUClient, FCS-QL, FCSSimpleEndpoint, and translation libraries.
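To make the difference between a simple text query and a multi-layer query more concrete, here is a rough sketch of how a client might build the two kinds of request URL. The endpoint address is hypothetical, and the parameter names (`operation`, `queryType`, `query`, `maximumRecords`, `x-fcs-endpoint-description`) reflect my reading of SRU and the CLARIN-FCS specification rather than anything shown in the session, so they should be checked against the published specification before use.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint address, used for illustration only.
ENDPOINT = "https://example.org/fcs-endpoint"


def search_url(endpoint, query, query_type="cql", maximum_records=10):
    """Build a searchRetrieve URL; parameter names are assumptions (see above)."""
    params = {
        "operation": "searchRetrieve",
        "queryType": query_type,   # "cql" for basic search, "fcs" for FCS-QL
        "query": query,
        "maximumRecords": maximum_records,
    }
    return endpoint + "?" + urlencode(params)


def explain_url(endpoint):
    """Build a discovery request asking the endpoint to describe itself."""
    params = {"operation": "explain", "x-fcs-endpoint-description": "true"}
    return endpoint + "?" + urlencode(params)


# Basic text-layer search, as in the earlier version of FCS.
basic = search_url(ENDPOINT, '"walk"')

# Advanced search across annotation layers (lemma and part of speech).
advanced = search_url(ENDPOINT, '[lemma="walk" & pos="VERB"]', query_type="fcs")

if __name__ == "__main__":
    print(explain_url(ENDPOINT))
    print(basic)
    print(advanced)
```

The sketch only builds URLs and does not contact any server; sending the requests and parsing the returned records is left out.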

In the discussion it was established that it is possible to set up an endpoint for some online search engine in another location not curated by a CLARIN centre.

It would also be useful to know, and for the information to be disseminated, what would be the best course if a service manager was making a decision now about which corpus search engine platform to implement locally. What should they bear in mind for compatibility with FCS? Do certain software stacks work better out of the box? Where is there most expertise and support? I would guess that Corpus Workbench might be a good choice, but I don’t have confirmation of this.

CLARIN-FCS 3.0 is already planned and will include syntactic search and more advanced views of multiple layers. Possible hurdles to more advanced cross-searching include the fact that mapping between POS tagsets can be problematic, and the resulting mappings might be too poor to use effectively. It would be interesting to explore whether the system’s reliance on annotation in the text (rather than stand-off annotation) means that advanced search on resources using annotation is not usually possible or meaningful when cross-searching resources, since annotations are normally specific to a particular corpus. Related to this is the important question of how far CLARIN will be able, or will want, to go in adding advanced features. Is the intention for the FCS to remain as a ramp to discover and reach more advanced features on the various interfaces for each corpus, or will it become a corpus search platform, where researchers can carry out all of their data exploration?

More information on Federated Content Search can be found at https://www.clarin.eu/content/federated-content-search-clarin-fcs.

Metadata

After a short break, we launched into a session on the latest version of the CLARIN metadata standard, CMDI 1.2, with Mitchell Seaton and Menzo Windhouwer, although this took rather a narrow perspective of presenting the changes in version 1.2, without offering an overview of what CMDI is and does.

More information on CMDI can be found at https://www.clarin.eu/content/component-metadata.

Connecting applications to data

Claus Zinn introduced the Language Resource Switchboard, which is under development as part of the CLARIN-PLUS project. The switchboard suggests applicable web-based tools for a given resource, and forwards relevant details of the resource to the application and starts processing. There is already a demonstrator at http://weblicht.sfs.uni-tuebingen.de/clrs, and it works! There are some challenges to integrate this with the VLO. For example, some records in the VLO only contain descriptive metadata, and don’t link to the actual resource files.
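The matching idea behind the Switchboard can be sketched in a few lines. This is my own toy illustration, not the Switchboard’s actual code or data model: the tool names, tasks, and the choice of media type and language as the matching criteria are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """A hypothetical registry entry describing what a web-based tool accepts."""
    name: str
    task: str
    mimetypes: frozenset
    languages: frozenset


# Invented registry entries, for illustration only.
REGISTRY = [
    Tool("Example Tokenizer", "tokenization",
         frozenset({"text/plain"}), frozenset({"en", "de", "nl"})),
    Tool("Example Parser", "dependency parsing",
         frozenset({"text/plain", "application/tei+xml"}), frozenset({"en"})),
    Tool("Example NER", "named entity recognition",
         frozenset({"text/plain"}), frozenset({"de"})),
]


def suggest_tools(mimetype, language, registry=REGISTRY):
    """Return the tools whose declared inputs match the resource's type and language."""
    return [
        tool for tool in registry
        if mimetype in tool.mimetypes and language in tool.languages
    ]


if __name__ == "__main__":
    for tool in suggest_tools("text/plain", "en"):
        print(f"{tool.task}: {tool.name}")
```

A record that carries only descriptive metadata, with no link to the resource file and no media type, would give a function like this nothing to match on, which is one way of seeing the integration problem with the VLO.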

The current Switchboard is a very impressive demonstrator, but there are a lot of issues to be addressed before it can go beyond demonstrator status to a service. These include consideration of the vocabulary used for ‘tasks’, and whether an existing taxonomy such as TaDiRAH can be used, offering the opportunity to link to other services.

There are also likely to be use cases where the user starts with a raw text and is not necessarily happy to follow a fully automated processing chain with applications available as web services. What about data inputs that already have bespoke analysis and interpretation, and manually corrected encoding? Is it possible to interrupt workflows, then re-insert manually tweaked data inputs?

In the evening there was a short walking tour of the historic city of Utrecht, including a visit to the new CLARIN ERIC offices, and a most gezellig evening meal.

LINDAT DSpace for CLARIN Repositories

On the following morning, a session introduced how a reusable instantiation of DSpace has been created with bespoke modifications for CLARIN, known as LINDAT DSpace. The modifications include an enhanced submission interface for depositors, and ingest workflows. The interface is set up to capture CMDI metadata, but multiple profiles are possible, and conversions are built in, including Dublin Core, OLAC, and TEI headers.

Developed in Prague, the repository package is currently implemented in Slovenia, Poland, Italy, Sweden, and Norway. LINDAT DSpace is currently on DSpace 5.6, and is following upgrades in the core packages, as well as contributing code back into the DSpace project.

The repository would normally be installed on a dedicated virtual machine, and the technical requirements include the packages ant, postgresql, jdk, tomcat, maven, make, and apache or nginx. Most instances have been installed with help from staff from Prague, typically supported by a CLARIN ERIC mobility grant.

I am certainly interested in this as an option for migrating the Oxford Text Archive to a new platform. It is most attractive that it offers various aspects of integration with CLARIN out of the box, including a path to CLARIN B-Centre accreditation, and there appears to be excellent community support available. One of the challenges for the OTA would be to ensure that there is an effective crosswalk from TEI headers to CMDI 1.2, but discussion here revealed that there are several successful instances of this, for example in the CLARIN centres Språkbanken (Sweden), CST (Denmark) and ACDH (Austria).
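As a rough idea of what the first step of such a crosswalk involves, the sketch below pulls a few fields out of a TEI header into a flat record with Dublin Core-style names. It is not a CMDI 1.2 conversion: a real crosswalk would target a specific CMDI profile and cover far more of the header. The sample header is invented for illustration (the title, author and ID are taken from the OTA download tables), and the `type="ota"` attribute and the field mapping are my assumptions.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented, minimal TEI header for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>The poems of John Keats</title>
        <author>Keats, John, 1795-1821</author>
      </titleStmt>
      <publicationStmt>
        <distributor>Oxford Text Archive</distributor>
        <idno type="ota">3259</idno>
      </publicationStmt>
      <sourceDesc><p>Illustrative record only.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""

# Assumed mapping from Dublin Core-style field names to paths in the TEI header.
FILE_DESC = "tei:teiHeader/tei:fileDesc/"
FIELD_PATHS = {
    "title": FILE_DESC + "tei:titleStmt/tei:title",
    "creator": FILE_DESC + "tei:titleStmt/tei:author",
    "publisher": FILE_DESC + "tei:publicationStmt/tei:distributor",
    "identifier": FILE_DESC + "tei:publicationStmt/tei:idno",
}


def crosswalk(tei_xml):
    """Extract a flat metadata record from a TEI document's header."""
    root = ET.fromstring(tei_xml)
    record = {}
    for field, path in FIELD_PATHS.items():
        node = root.find(path, TEI_NS)
        if node is not None and node.text:
            # Normalise internal whitespace, which is common in hand-edited headers.
            record[field] = " ".join(node.text.split())
    return record


if __name__ == "__main__":
    for field, value in crosswalk(SAMPLE_TEI).items():
        print(f"{field}: {value}")
```

Fields missing from a header are simply left out of the record, which matters for legacy texts whose headers are sparse or inconsistent.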

 


The Oxford Text Archive – downloads in 2015

Analysis of the logs for downloads of resources from the Oxford Text Archive in the calendar year 2015 reveals a dramatic increase in usage. This increase can be attributed to a number of factors, of which the most significant is the large number of additional texts from the Text Creation Partnership which became available via the OTA at midnight on the first day of the year, when it became legally possible to share them openly.

Most of the credit for this is due to the late Sebastian Rahtz, who did most of the work, ably assisted by James Cummings and Magdalena Turska. Sebastian, who passed away last week, has been instrumental in building and maintaining all of the technical infrastructure of the Oxford Text Archive in the past eight years or so. He will be sorely missed for this, and for the numerous other activities in which so many people became so reliant on him for his hard work, energy and brilliance.

Other factors aiding increased usage of the OTA include:

  • BNC free for download: at the start of 2014 the British National Corpus was made available for download for free, replacing the old system of paying for postal delivery of optical disks, and as word continues to spread about this development, so downloads continue to increase;
  • Freeing the texts: an ongoing programme of reassessing legacy data, and, where possible, removing access restrictions;
  • Higher visibility: resource discovery via the CLARIN Virtual Language Observatory, which aggregates OTA records and offers a new way for users to find the texts;
  • Shibbolization: a small and growing number of resources are currently available to UK users only, but are soon to be opened Europe-wide thanks to CLARIN and eduGAIN;
  • More digital research: demand grows as more users in the humanities start to engage in digital scholarship.

The grand total for the discrete downloads of resources from the Oxford Text Archive was 917077. Of these 180452 could be identified as originating from users in the University of Oxford, approximately 20%. Of the total downloads, more than 99.5% were direct downloads of resources made available at open URLs, the rest made up of the various resources where access restrictions require authorization.

The table below shows the top twenty of the downloads of all types:

Number of downloads | ID | Title
10659 | 2542 | VOICE: Vienna-Oxford International Corpus of English
9181 | 5268 | The history of the most noble Order of the Garter: Wherein is set forth an account of the town, castle, chappel, and college of Windsor; … To which is prefix’d, a discourse of knighthood in general, … Collected by Elias Ashmole, … The whole illustrated with proper sculptures.
6957 | 3016 | The spy who came in from the cold
6382 | 4431 | An account of the proceedings against the rebels, and other prisoners, tried before the Lord Chief Justice Jefferies: and other judges in the west of England, in 1685. for taking arms under the Duke of Monmouth. … To which is prefix’d, the Duke of Monmouth’s, the Earl of Argyle’s, and the Pretender’s declarations, that the reader may the better judge of the cause of the several rebellions.
5266 | 5314 | The peerage of Scotland: containing an historical and genealogical account of the nobility of that kingdom, … collected from the public records, and ancient chartularies of this nation, … Illustrated with copper-plates. By Robert Douglas, Esq;.
4806 | 5301 | The peerage of Scotland: containing an historical and genealogical account of the nobility of that Kingdom. … By George Crawfurd, Esq;.
4480 | 3549 | The four seasons, and other poems. By James Thomson
4255 | 5299 | The history and antiquities of the town and county of the town of Newcastle upon Tyne: including an account of the coal trade of that place and embellished with engraved views of the publick buildings, &c. … By John Brand, … [pt.1]
4146 | 3151 | New York newspaper advertisements and news items: 1777-1779
3377 | 5244 | The history of Newcastle upon Tyne: or, the ancient and present state of that town. By the late Henry Bourne, …
3175 | 3094 | The Life of Charlotte Brontë by Elizabeth Gaskell
3154 | 5309 | The history of English poetry: from the close of the eleventh to the commencement of the eighteenth century. To which are prefixed, two dissertations. … By Thomas Warton, … [pt.2]
2984 | 4835 | The history and antiquities of the county palatine, of Durham: by William Hutchinson … [pt.2]
2948 | 4652 | Miscellaneous works: of Edward Gibbon, Esquire. With memoirs of his life and writings, composed by himself: illustrated from his letters, with occasional notes and narrative, by John Lord Sheffield. In two volumes. … [pt.1]
2410 | 5308 | The history of English poetry: from the close of the eleventh to the commencement of the eighteenth century. To which are prefixed, two dissertations. … By Thomas Warton, … [pt.1]
2321 | 4949 | The history of the parishes of Whiteford, and Holywell
2252 | 4786 | The history of Scotland from the accession of the House of Stuart to that of Mary. With appendixes of original papers. By John Pinkerton. In two volumes.: [pt.1]
2209 | 5730 | Treasure Island by Robert Louis Stevenson
2118 | 4384 | Charles and Charlotte: In two volumes. [pt.2]
2079 | 2554 | British National Corpus, XML edition

And the table below shows the most popular items with access restrictions, which required an online application and manual authorization before they could be downloaded. There were 3681 of these downloads – over the year an average of ten per day which needed to be manually authorized by a member of staff.

Number of downloads | ID | Title
1857 | 2554 | British National Corpus, XML edition
411 | 2553 | British National Corpus, Baby edition
324 | 2539 | British Academic Written English Corpus
220 | 2474 | The Lancaster Corpus of Mandarin Chinese
93 | 1477 | Helsinki corpus of English texts
81 | 2551 | British National Corpus Sampler
79 | 2525 | British Academic Spoken English corpus
62 | 0163 | Complete corpus of Old English: the Toronto dictionary of Old English corpus / compiled by the University of Toronto Centre for Medieval Studies
53 | 2462 | The York-Toronto-Helsinki Parsed Corpus of Old English prose (YCOE)
52 | 2510 | Parsed Corpus of Early English Correspondence (PCEEC)
50 | 2507 | A Corpus of English Dialogues 1560-1760 (CED)
31 | 2488 | Dictionary of Old English Corpus in Electronic Form (DOEC)

There were 321 downloads from the experimental site hosted by the Oxford e-Research Centre, where users can obtain authorization for an instant download of a small number of resources (of which the BNC is the most popular) by authenticating with their institutional single sign-on. Only eighteen of these downloads were from the University of Oxford.


Exploring Online Language Resources

A course in the IT Learning Programme at the University of Oxford in Hilary Term 2016 will explore how we can use online language datasets to explore language, history and culture. This course is the latest stage in the evolution of the ‘Corpus Linguistics’ course which has run for the past few years.

We now have at our fingertips huge amounts of language data in digital form, representing unprecedented opportunities for exploring and analysing language and discourse. How can we use the evidence of language usage in digital resources to draw conclusions about language, culture and society? Drawing on techniques and methods from corpus linguistics, this course will offer guidance on finding and evaluating digital sources, hands-on exercises to explore and analyse data, and some suggestions on how to assess, use and interpret evidence from digital sources.

Each session will focus on the exploration and analysis of a different corpus or dataset, with practical hands-on exercises in using the resource to find evidence to explore linguistic, socio-cultural and historical research questions. Participants are free to attend the whole course or individual sessions, although there will be benefits, to beginners in particular, in attending the whole course and building week by week on the techniques and insights offered by each session. Prior registration with the online services described below will save time on the day.

1. BNCWeb – exploring a corpus of late twentieth century English

12:30-13:30 Thursday 28th January (HT week 2) at IT Services, Banbury Road. Sign up for free to attend the course here.

The British National Corpus is a very widely used and cited dataset, which was designed and built in the 1990s to provide a representative and balanced sample of modern British English, in speech and writing, across a number of varieties in a wide range of contexts. This session will introduce and explore basic concepts of corpus design and construction, and introduce techniques, functions and methods for corpus analysis. BNCWeb is a customized application of CQPweb, with the facilities to exploit and analyse the linguistic annotation of the texts in the BNC, and to make use of the detailed descriptions of the sources. Participants can access BNCWeb with Oxford single sign-on via a link at https://ota.oerc.ox.ac.uk/ (and can also register to use the BNCWeb service hosted at Lancaster University, which will be the back-up in case of problems with BNCWeb at Oxford).

2. CQPweb – exploring a range of corpora

12:30-13:30 Thursday 4th February (HT week 3) at IT Services, Banbury Road. Sign up for free to attend the course here.

The online application CQPweb offers an interface to a powerful corpus search and analysis engine which can be applied to any textual dataset. CQPweb is an open source software application, deployed at many institutions around the world to offer access to a wide range of corpora. This session will focus on mining a large corpus from Early English Books Online for historical information. Participants should register to use the service at http://cqpweb.lancs.ac.uk/.

3. corpus.byu.edu – historical and cultural investigations

12:30-13:30 Thursday 11th February (HT week 4) at IT Services, Banbury Road. Sign up for free to attend the course here.

The set of large corpora hosted at Brigham Young University includes contemporary and historical corpora of British and American English, Spanish and Portuguese, and the Hansard Corpus of UK parliamentary proceedings. This session will further extend the exploration beyond linguistic research questions to explore historical and political texts. Participants should register to use the service at http://corpus.byu.edu/.

4. The Oxford English Corpus – lexicography and beyond

12:30-13:30 Thursday 18th February (HT week 5) at IT Services, Banbury Road. Sign up for free to attend the course here.

The Oxford English Corpus, and related datasets, offer the opportunity to explore current and recent trends in the English language, via a very large and growing corpus which is regularly updated with new texts. This corpus is used by the lexicographers at Oxford University Press to create and update entries in the Oxford English Dictionary and other dictionaries, reference works and teaching materials, and can also be used to monitor and discover social trends via the discourses revealed in the data. The Oxford English Corpus uses the SketchEngine software to manage, filter and reveal patterns in these multi-billion word corpora. Log-in credentials for the Oxford English Corpus are kindly supplied by OUP and will be issued during the tutorial session.

5. Exploring modern European languages with CLARIN

12:30-13:30 Thursday 25th February (HT week 6) at IT Services, Banbury Road. Sign up for free to attend the course here.

A wealth of corpora and other language resources is becoming more easily available to researchers thanks to the CLARIN European Research Infrastructure Consortium. The UK has recently joined CLARIN as an Observer, allowing access to all UK researchers with institutional single sign-on via the UK Federation. We’ll take a whistle-stop tour of some of the available languages and corpora, with a focus on the facility for Federated Content Search, which finds hits for a search term across a large number of resources held in different repositories. Access is available to all University of Oxford users, and via institutional single sign-on to users in higher education institutions from participating countries (see more about access at http://clarin.eu/content/easy-access-protected-resources).

The course will take place Thursday lunchtimes weeks 2-6, Hilary Term 2016, at IT Services, Banbury Road, and will be taught by Ylva Berglund Prytz and Martin Wynne of IT Services, University of Oxford. It is open to all members of the University of Oxford and there is no charge.


A Virtual Museum of Language for Oxford

The University of Oxford is home to unrivalled collections of materials and expertise relating to human language, dispersed across faculties, museums, libraries and other units. A pilot project is proposed to imagine a new virtual museum, which would bring together and feature online resources and information about physical collections at the University of Oxford, in the various museums, galleries and libraries. The visitor to the website would see images, video, sound and text, to entertain and to inform about some aspect of language and its use. The collections of the virtual museum would start by creating exhibits which draw together existing materials, and then be built up over time with a series of new online exhibits created by guest curators. Exhibits could feature digital images of artefacts in museums, opportunities for exploring language resources online, interactive language games, podcasts and blogs about research in the University. The intended audience would be the general public, including sixth form students.

The museum could work closely with, and not try to replace or compete with, existing outreach activities. It should be seen as a portal to help web visitors to find ways to navigate to the wealth of language-related online resources created by people in different parts of the University. Obvious partners, already involved in outreach, engagement, dissemination and knowledge exchange activities online, would be TORCH, Oxford Sparks, Digital Humanities at Oxford, Digital.Bodleian and numerous websites provided by OUP for language learners and others interested in language. The museum could also provide a platform for new citizen science and crowd-sourcing projects.

Language is a feature of all disciplines in the university, sometimes directly as the sole object of study, but more often as one component in more complex objects and processes, and almost ubiquitously as the medium of communication. Linguists study language, but all human and social scientists study social and cultural phenomena which are infused with language. There are scientists who study physical, mental and medical aspects of language, but all conduct the large part of their communications via language.

The distributed collections would include obvious candidates for material from the following faculties: Modern Languages; Linguistics, Philology and Phonetics; Oriental Studies; Classics; English Language and Literature, but all would be welcome to contribute. One of the key uses of the museum would be to offer a route for dissemination and outreach for research projects by presenting their research and their outputs to a general audience, and the museum could be a central pathway to impact for these and other disciplines.

In Social Sciences, initial exhibits could be sought from Anthropology (relating to endangered languages), Education (particularly relating to language teaching and learning), and the Oxford Internet Institute (language on the web). The Migration Observatory (http://migrationobservatory.ox.ac.uk/) is involved in large-scale linguistic analysis of public discourse about migration, as a sociological research topic and to inform both social policy and the general public, and has existing materials relevant for a museum exhibit on this topic (see, for example, http://www.compas.ox.ac.uk/2013/pr-2013-migration_media/).

In the Mathematical, Physical and Life Sciences Division (MPLS), the e-Research Centre and the Computational Linguistics Group in Computer Science are already home to research projects focussed on language, and Medical Sciences is involved in research, treatment and therapy relating to numerous aspects of human language. Exhibits could build on existing collaborations on themes relating to medical humanities, the history of science, and humanities and science (especially TORCH themes and networks), and the degree course in Psychology, Philosophy and Linguistics (http://www.ox.ac.uk/admissions/undergraduate/courses-listing/psychology-philosophy-and-linguistics).

Oxford University Press could potentially provide exhibits relating to:

  • The Press Archive (e.g. old printing presses, books, artefacts relating to the history of the OED);
  • Oxford English Dictionary (OED);
  • Dictionaries and Scholarship online;
  • Relevant entries from the DNB (e.g. famous linguists and linguistic innovators);
  • Monographs (some full-text online) on linguistics and the history of language, and also examples of different language varieties exemplified in publications from different periods;
  • Corpora of contemporary language (e.g. Oxford Corpus of English, Oxford Twitter Corpus).

Relevant exhibits from the Bodleian Libraries are practically limitless, and could certainly include materials from:

  • Centre for Digital Scholarship;
  • Centre for the Study of the Book;
  • Special collections;
  • Digital.Bodleian;

and all of the libraries in the University would of course be encouraged to create exhibits. The Language Centre and the Department for Continuing Education are also likely to be potential collaborators.

The representation of materials and artefacts in the Museums would be a key pillar of the virtual museum. All University Museums would be encouraged to contribute, with some obvious candidates being early language inscriptions and tablets in the Ashmolean Museum, and audio recordings by anthropologists in the Pitt Rivers. The University Museum of Natural History could bring in an evolutionary perspective, and perhaps something on animal communication. Collaborations with the galleries and visual art should also be explored.

Featured themes could bring together cross-disciplinary perspectives, and could include, for example, real-time monitoring of public and social media discourse on selected themes (e.g. Europe, environment, migration, etc.), 500 years of the history of printing, the language of the Reformation (2017 will be the five-hundredth anniversary of Luther’s 95 Theses), the language of the First World War, etc.

Some exhibits in the virtual museum would draw on existing online materials.

Simple interfaces would also be offered to search language resources online, such as the British National Corpus (BNC) simple search (http://www.natcorp.ox.ac.uk/), and the similar services for many other languages, e.g. http://clarin.eu/lrtshowcases/. Further possible services from the virtual museum could be ‘Ask an expert’, where language-related queries could be answered and discussed.

Other potential participants outside the university could be invited to contribute, although the intention would be to maintain a strong University of Oxford identity and branding for the museum. Potential external partners could include the Oxford Brookes publishing course, Digital Oxford, schools, other publishers in addition to OUP, The Story Museum, and other museums and libraries. The CLARIN-UK network could also be invited to contribute exhibits. CLARIN-UK is a consortium of linguistics experts in the UK who have come together to promote the use of online and digital resources in research in the humanities and social sciences and beyond. Potential exhibits here could be based on the Metaphor Map, SCOTS corpus, the ESRC Centre for Corpus Approaches in Social Sciences, text mining from the GATE team at the University of Sheffield, etc. Some of the activities and resources involved can be seen at the CLARIN-UK website.

As well as being inspired by existing collections and research in Oxford, the developing ideas for this museum have drawn on a number of sources, including discussions with Ludwig Eichinger, Director of a new Museum of the German Language which is under development in Mannheim, the proposals for an English Language Museum in Winchester (http://www.englishproject.org/resources/english-language-museum-winchester), and a proposal made by David Crystal some years ago for a London Language Museum (http://www.davidcrystal.community.librios.com/?fileid=-4845 [PDF file]). These are all proposed as physical museums, but they usefully draw attention to the gap in the market, and the potential breadth of relevant exhibits. The Language at Leeds initiative also shows how language can be at the centre of a truly multi-disciplinary activity in a University.

More suggestions and volunteers to contribute are welcome!

CLARIN: what’s in it for us?

Now that the UK has joined the CLARIN European Research Infrastructure Consortium, it is time to consider the actual and potential benefits for UK researchers. See also CLARIN for Beginners and UK joins the CLARIN family for more background.

Exploring online language resources

The CLARIN Virtual Language Observatory offers a single point to search for thousands of language resources held in hundreds of repositories around the world. The CLARIN Federated Content Search allows researchers to search for patterns in these resources, for example cross-searching multiple corpora held in different repositories with a single query, such as searching for occurrences of a word or phrase.
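
As a rough illustration of the idea behind federated content search, the minimal sketch below sends a single query to several repositories in parallel and merges the hits. It is not the CLARIN implementation: the repositories are hypothetical in-memory lists of sentences, the names `REPOSITORIES`, `search_repository` and `federated_search` are invented for this example, and matching is a plain case-insensitive substring test rather than a real corpus query language.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for corpora held in different repositories.
REPOSITORIES = {
    "repo_a": ["The cat sat on the mat.", "A dog barked."],
    "repo_b": ["Cats and dogs.", "Nothing relevant here."],
    "repo_c": ["No match in this one."],
}


def search_repository(name, sentences, term):
    """Return (repository, sentence) pairs whose text contains the term."""
    needle = term.lower()
    return [(name, s) for s in sentences if needle in s.lower()]


def federated_search(term, repositories=REPOSITORIES):
    """Send one query to every repository in parallel and merge the hits."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(search_repository, name, sentences, term)
            for name, sentences in repositories.items()
        ]
        hits = []
        # Collect results in submission order so the output is stable.
        for future in futures:
            hits.extend(future.result())
    return hits


if __name__ == "__main__":
    for repo, sentence in federated_search("cat"):
        print(f"{repo}: {sentence}")
```

The point of the sketch is only the shape of the operation: one query, many independent sources, one merged result list that records where each hit came from.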

Access to protected resources via single sign-on

While CLARIN aims for open access where possible, in many cases, for a variety of good reasons, users need to log in and obtain authorization to access certain online resources. For this to work for you in the CLARIN domain, you need to have a Shibboleth identifier and password supplied by your parent institution, which needs to be registered as an identity provider with the UK Access Management Federation, and to have opted in to eduGAIN. Usually, this just means using your usual institutional single sign-on credentials. You can see some of the resources which are available this way on the web page Easy access to protected resources. As well as being used for negotiating authentication and authorization to access data and tools, a persistent and reliable identifier enables services to recognize you, which allows you to save settings, datasets, workflows, etc., enabling an enhanced user experience.

Single sign-on access to your resources

As well as all researchers in the UK having access to protected resources in all of the CLARIN repositories via eduGAIN and the CLARIN Federation, you can use these mechanisms to control user access to your online resources. To do this you need to register as a Shibboleth service provider with the UK Access Management Federation, and then to make some configuration changes on your web server to allow authentication via eduGAIN. The Oxford Text Archive is currently going through this process, and I’ll report back in a future blog post with more information on how to do this.

Attending events

Certain CLARIN events are only open to, or offer preferential access to, persons working in CLARIN member countries, so now we are eligible to attend these. In occasional cases, but by no means all, funding is available to facilitate attendance (see, for example, this workshop last year, and this forthcoming ‘Creative Camp’). More events to follow soon!

Eligibility to access services

Certain CLARIN services (e.g. the legal helpdesk) are only available to persons working in CLARIN member countries. As the number of advice centres increases in the ‘knowledge-sharing infrastructure’, this is likely to become a more significant benefit.

Mobility grants

Researchers in CLARIN countries are eligible for small CLARIN mobility grants to facilitate short visits between centres in different countries, to carry out CLARIN-related work. There is a call open now for proposals for mobility grants.

Other benefits for CLARIN-UK consortium members

  • Visibility in the Virtual Language Observatory for online resources and repositories
  • Publicity and dissemination via www.clarin.ac.uk, email lists and newsflashes
  • Funding to attend CLARIN workshops, conferences and events in the CLARIN-PLUS Horizon2020 project
  • Opportunities to host events
  • Participation in future Horizon2020 projects.

This last point is likely to be of particular importance. Participation can take place if your institution joins a project consortium, which is similar to how it worked in FP7, or via a new mechanism which allows individuals to work on H2020 projects on secondment to CLARIN ERIC. Such secondments usually include payment of full direct costs plus an overhead, although the arrangements in particular funding schemes and projects may vary.

Collective Intelligence

This post was originally composed 23-02-15, in the wake of the event ‘Digital Humanities Collective Intelligence: a workshop to foster international cooperation’ held in the Anatomy Theatre and Museum at King’s College London on the 21st and 22nd February 2011.

A two-day workshop at King’s College London in February explored the idea of ‘Collective Intelligence’ in relation to DARIAH and the Digital Humanities. Two dozen participants, representing numerous countries, organisations, domains and backgrounds, were in attendance, including DARIAH partners from London, Oxford and Dublin. The workshop kicked off with the presentation of position papers from Jan Christoph Meister (participating remotely), Andrew Prescott and Susan Schreibman.

Jan Christoph Meister (Hamburg University) outlined the plans of the Association for Literary and Linguistic Computing (ALLC) to relaunch its website with three major functions, including the provision of a moderated Digital Humanities information platform for the association’s members and affiliates, which will offer a “one-stop” overview of current DH activities, funding opportunities and services, with links to more detailed external repositories.

As a precondition to the wider sharing of such data, Meister emphasised the need “to define a data curation protocol stipulating standards for the moderation and validation of DH information by information gatherers and providers”, warning that without such a protocol there would be too much variation in the shared information, making it obsolete and “creating ‘white noise’ that will frustrate information seekers.”

Meister therefore proposed “the definition of a DH atlas or a DH taxonomy enabling us to systematize DH information”.

Andrew Prescott (University of Glasgow) proposed that we need a new generation of tools that will work with publishers and other content providers, and enable new perspectives on data and humanities research questions.

Susan Schreibman (Digital Humanities Observatory, Ireland) outlined the detailed and extensive work done in DRAPIer (Digital Research And Projects in Ireland) to scope and describe digital humanities work and to act as a collaboration space for sharing expertise. Susan proposed greater use of Web 2.0 technologies in future initiatives in this area.

There was also a presentation of the arts-humanities.net portal, and a discussion of the lessons to be learned from its six years of existence. The possibility of following a design path more oriented towards ‘apps’, ‘gadgets’ or ‘toolkits’ was considered.

Group discussions considered how to move forward to create more interoperable metadata. Do we already have adequate standards and procedures for sharing information? Do we need the carrot or the stick to encourage data creators to follow them? Do we need to link communities and expect the metadata to follow, or vice versa? Some concrete suggestions emerged for potential ways forward to capture, disseminate and use the potential knowledge that is embedded in our current and past activities. An aggregation of information about events was strongly promoted, and the idea of a service for mining the collected knowledge of past discussions on relevant email lists and forums was mooted. There are plenty of organisations and initiatives producing useful information which there is a general willingness to share, but due to various factors there is a certain inertia tending to block efforts to do so. Measures to overcome this inertia and to make it easier to exploit our collective intelligence should be a key guiding principle of our next steps.

Beyond the Digital Humanities

The NeDiMAH Conference ‘Beyond the Digital Humanities’ was held at the School of Advanced Study, University of London, on Tuesday 5th May 2015. NeDiMAH has run for four years as a project of the European Science Foundation, with the backing of research funders in a large number of European countries. Outputs of the project include the NeDiMAH Methods Ontology (NeMO), to be sustained by DARIAH, and the Methodology Map of DH in Europe.

I have an interest, having organised a joint CLARIN-NeDiMAH workshop in December 2014 in The Hague, together with colleagues from the University of Passau and Huygens ING, on the topic of ‘Exploring Historical Sources with Language Technology: results and perspectives’.

The opening keynote of the day, ‘Open Policy Making in a Digital World: Opportunities and Possibilities for Academic Research’, was given by Lucy Kimbell (University of Brighton), who took on the difficult task of getting the audience excited about the bureaucratic manoeuvrings of the civil service in relation to academic research and innovation. I didn’t feel that Lucy ever quite got to explaining the relevance of initiatives like open policy making, the Government Digital Service, the Open Data Institute, GovJam, Policy Lab UK, etc. for the digital humanities. She made it clear that data science and social science research were informing the bureaucracy, but struggled to articulate the role of the arts and humanities, or digital variants thereof, except for the rather bizarre assertion that Ed Miliband’s desperate interview with Russell Brand (aka #Milibrand) was a ‘cultural intervention’ in the general election campaign, presumably cited as a model for arts and humanities practitioners.

A roundtable on creativity and cultural heritage explored the aspects of the digital humanities relating to art, architecture and design. Alessio Assonitis suggested that there is too much arrangiare (roughly, making do, or makeshift arrangements) in Italian cultural heritage, and too much reliance on digital projects to prop up ailing institutions, and called for a more radical approach to promoting digital research. Helle Porsdam explored the difficulties of ethical and legal issues relating to the digital surrogates of intangible cultural heritage, focussing on the recent example of the prehistoric Chauvet cave. Jon Pratty of Arts Council England brought some scepticism about the ‘smart cities’ agenda, and, in particular, the aspiration or expectation that city-wide content management systems and centralised data dashboards might lie behind a future data-driven society, and made a plea to reorient towards creativity rather than heritage. Teal Triggs of the Royal College of Art (does everyone who works there have to adopt a colour as a forename?) asserted the importance of ‘design’ in data curation and analysis, and in forming the bridge between the physical and the digital.

Brett Bobley from the National Endowment for the Humanities (a US federal funder of research in the humanities) looked back to the report ‘Our Cultural Commonwealth’, published almost ten years ago, to see what has changed and what is still relevant. Interestingly, he drew attention to the weirdness of the notion of the ‘digital humanist’, not foreseen by the report and still contested. Brett introduced the Trans-Atlantic Platform, which is building on the success of the Digging into Data challenge to develop more international funding schemes, and now involves 11 countries.

A panel discussion on ‘new forms of data and collaboration’ featured Keri Facer (University of Bristol), who started by appealing for more involvement of the diversity of humans who do ‘digital humanities’, and talked about the AHRC Connected Communities programme. We were treated to the call to ‘check our privilege’ and to start counting the number of women and ethnic minorities in the room. Whatever the digital humanities are, I think they need to be part of the humanities, and the humanities need to be informed by the intellectual traditions of the enlightenment, not the political correctness of the students’ union. If this is what beyond the digital humanities means, you can count me out.

A scientist in the audience, Peter Fletcher from the Science and Technology Facilities Council in the UK, suggested that a lot of discussion was about sharing data and tools, and that this needs infrastructure. Various academic communities have come together and agreed priorities for building central repositories and experimental facilities. Milena Zic-Fuchs, a linguist from the University of Zagreb, supported the call for infrastructure to support digital research, and urged the audience to support initiatives such as CLARIN and DARIAH, but also to look towards not just pan-European but global collaborations.

A final panel on ‘Genres of scholarly knowledge and production’ featured Andrew Prescott, who offered a clear and useful explanation of the polar positions of (i) empirical, data-driven research and (ii) critiquing, questioning and problematizing the assumptions inherent in data and tools, such as canonicity, and post-colonial and environmental critiques. Barry Smith gave an entertaining presentation of work on smells from the Centre for the Study of the Senses, which engaged the public, neuroscientists and restaurant chefs with a philosopher in a humanities research project. Patrik Svensson made an appeal to the builders of infrastructure to cater not just for data and tools, but for the research processes and methods which humanists employ. Rounding off the day, Milena Zic-Fuchs outlined some of the background to NeDiMAH and the concurrent emergence of research infrastructures in the social sciences and humanities.

My overall impression was that the various suggestions put forward to promote the legitimacy of DH were not convincing, apart from Lorna Hughes’s straightforward presentation of an example of exemplary research (http://eira.llgc.org.uk/). This reinforced my view that what we really need are compelling case studies which demonstrate the possibilities of digital transformations and show a real-life success story (warts and all) which stands on its own as a good piece of research in the humanities.

The discussion on the day may have left some with the impression that we are faced with a choice between, on the one hand, the utopian folly of building Procrustean infrastructure, anti-theoretical and populated with non-contextualized data, and, on the other hand, the development of a critical digital humanities with the goal of exposing the folly, puncturing the hubris, limiting environmental impact, and checking the privilege of the digital humanities. I hope there is a middle way.
