New OTA corpus: VOICE

The OTA is proud to announce the latest addition to the archive: the Vienna-Oxford International Corpus of English (VOICE).

The corpus consists of transcriptions of over 110 hours of audio recordings of English as a lingua franca – a common means of communication between speakers for whom English is not their native language. VOICE is the first freely available resource which systematically samples what is now the most widespread use of English in the world.

The corpus is freely available and can be downloaded from the OTA website:

Posted in news, resources

Discovering Babel Workshop

A workshop on ‘How to make your language resources discoverable’ was held at Oxford University Computing Services on Friday June 24th, as part of the JISC-funded Discovering Babel project.

Ylva Berglund-Prytz from OUCS welcomed the participants, who introduced themselves and revealed that they came from numerous universities, representing teachers, researchers, post-graduate students and archivists, from the UK and abroad. See slides (pptx).

Andy McGregor introduced the work of the Resource Discovery Task Force and the JISC programme ‘Infrastructure for Resource Discovery’, with a refreshing willingness to acknowledge the different standards and practices in different disciplines. See slides (pptx).

Martin Wynne then spoke about Discovering Babel, the project within the programme which relates to language resources, focussing on the issues relating to the different ways of describing and cataloguing language corpora (and other resources) and making those descriptions available to users in a variety of ways. See slides (pdf).

Alexander König of the Max Planck Institute for Psycholinguistics then gave a demonstration of the CLARIN Virtual Language Observatory, which is collecting and making available to users in a single place the information about language resources from all around Europe. Most impressive was the overlay of the geographical data on Google Earth, allowing users to find resources via the map. See slides (ppt).

James Wilson then spoke about the suite of projects (many of them JISC-funded) in OUCS which are addressing the more general data management needs of researchers. After the discipline-based and pan-European scope of the CLARIN initiative, it was fascinating to compare it with the kind of service provision which we might hope to find within an institution. See slides (pptx).

In the afternoon, a ‘show-and-tell’ session allowed participants to share information about the resources and services that they make available to other researchers. This whirlwind tour of the resources currently available in the UK showed us all what a variety of extremely valuable datasets continue to be created.

The presentations included:

The final session was a discussion which went beyond concerns about discovering resources and focussed more on the re-use of resources, and on ways in which they can be exploited online, cross-searched, combined, and connected with online tools and services.

From a very open and frank discussion about our needs, concerns and frustrations there emerged a strong feeling that a UK network was needed to express our requirements more forcefully to funders and other relevant organisations who can help us to build the kind of services that we need.

Recent informal meetings with partially overlapping sets of people in Glasgow, Newcastle and Oxford have reinforced my impression that there is a strong desire to form a UK network of researchers interested in language data and tools. The motivations and proposed activities are to:

  • find ways to find, share and reuse resources;
  • develop joint projects to build resources and services;
  • promote interoperability of resources so that they can more easily be used with generic tools, and combined with each other;
  • lobby for UK funders to invest in infrastructure for creating and using language resources;
  • lobby for language data and tools to be included in national computing infrastructure;
  • lobby for UK participation in the European CLARIN infrastructure;
  • provide channels of communication between UK researchers and CLARIN (e.g. to feed in our requirements, get access to services, participate in technical discussions, etc.).

Clearly this meeting was only a starting point!

Posted in Uncategorized

Discovering Babel workshop programme

The programme for the Discovering Babel workshop is now available.

The workshop, with the title ‘How to make your language resources discoverable’, will be held on Friday June 24th at 23 Banbury Rd, Oxford. The workshop is aimed at researchers who create and use corpora and other digital language resources. More information about the event can be found in a separate blog post.

For more information (or last-minute registration), please contact:
Martin Wynne
Ylva Berglund-Prytz


09:30 Registration and coffee
10:00 Welcome and introductions (Martin Wynne & Ylva Berglund-Prytz, Discovering Babel)
10:15 Introduction to the ‘Infrastructure for Resource Discovery’ Programme (Andy McGregor, JISC)
10:30 Discovering Babel (Martin Wynne & Jens Stegmann, Oxford & IDS, Mannheim)
11:00 Coffee break
11:30 The CLARIN Virtual Language Observatory (Alexander König, Max Planck Institute for Psycholinguistics, Nijmegen)
12:10 Managing data in your institution (James A. J. Wilson, Oxford)
12:30 Lunch
13:30 Scoping the field: what language resources can we share, and what do we want? (5-minute intros and discussion)
14:30 Coffee break
15:00 How to make your language resources usable (round table with John Coleman, Hermann Moisl & Peter Austin)
16:00 End

Posted in Babel, events

Do we need language corpora?

The speakers in the debate with Martin Wynne (OTA)

Speakers and Chair

The ICAME32 conference in Oslo started with a number of pre-conference workshops. The Oxford Text Archive was involved in one – a debate on the motion:

“Language corpora are no longer necessary for linguistic research.”

The debate was recorded and a podcast will be produced and made available later. In the meantime, here follow a few illustrations of some of the arguments put forward. They should not be seen as either comprehensive or necessarily representative of the debate as a whole, but are offered as a taster of what the participants had to say.

The debate was opened by Silvia Bernardini (University of Bologna) who spoke in favour of the motion. She argued that the availability of large quantities of digital texts has changed the world of corpus building and use. Earlier, when textual material in digital form was rare, corpus building had to be done by experts and corpora were small. Today, we can find material online that can be used to help inform us about language, and we should use that.

Janne Bondi Johannessen (University of Oslo) spoke against the motion. She talked about how we need carefully crafted spoken corpora to answer certain questions about language. On the web, even data that may appear speech-like (such as chat room exchanges) still show greater similarity to written than spoken language.

In her opening statement for the motion, Elena Tognini-Bonelli (University of Siena) discussed how we need to change our methodology as we get other types of data to work with. We cannot use the same methods as before, when we were working with small, well-defined sets of data. As corpus linguists we need to develop new query languages and new ways of filtering the new types of data we now have.

The last of the four speakers, Gregory Garretson (Uppsala University), spoke against the motion. He maintained that one problem with studying the language we find on the web is that we do not know what this language represents. Using a corpus allows us to make comparisons and our studies can be replicated – doing the same investigation again will return the same result, an important feature of science.

After the four opening statements, the floor was opened to general debate and discussion. It was encouraging to see that this is obviously a question that people can relate to, as a large proportion of the audience took part and shared their thoughts. Many good points were put forward, as listeners will be able to hear in the podcast once it is published.

At the end of the discussion, the four speakers each offered a closing remark before the participants voted. The result of the vote was that the motion was defeated, possibly a fortunate result considering that the debate took place just before the formal opening of an annual corpus linguistics conference. After all, if corpus linguists do not believe in corpora for linguistic research, who does?

Posted in events

The Big Debate

Introductory slide

As part of the ICAME32 conference in Oslo, we are organising a debate. The motion is:

“Language corpora are no longer necessary for linguistic research.”

and four speakers are kicking the debate off by making an introductory statement for or against the motion. We are hoping the opening statements will encourage a lively debate with participation from the floor. The debate will end with concluding statements by the four invited speakers and proceedings will close with a vote from the floor.

The debate will be recorded and made available online.

Posted in Uncategorized

ICAME 32, Oslo 2011

The OTA is attending the ICAME 2011 conference in Oslo. The conference, running from 1-6 June 2011, has the title ‘Trends and Traditions in English Corpus Linguistics: In Honour of Stig Johansson’ and offers a rich and varied programme, as seen on the conference website.

Before the conference officially opens, proceedings will commence with an afternoon of pre-conference workshops. The OTA is organising a debate around the question ‘Do we still need language corpora?’. An outline of the debate can be found on the conference website. We are hoping to see a lively debate. Reports from the debate and the rest of the conference will be added to the blog. You can also follow us on Twitter @oxtext.

Posted in events

The OTA blog

The Oxford Text Archive now has a blog. The blog will feature posts about OTA activities and resources, as well as information about and reflections on new developments, tools, resources and anything else that relates to the work of the Archive.

Posted in news