CLARIN-UK: latest news and forthcoming opportunities

CLARIN-UK: a new consortium and a new CLARIN country

The UK is currently preparing an application to join the CLARIN European Research Infrastructure Consortium as an Observer. The Arts and Humanities Research Council, in cooperation with JISC and the other research councils, will make the agreement, and will work closely with the CLARIN-UK consortium to monitor the benefits to the national community. CLARIN-UK currently has ten member institutions, and more are welcome. Members of the consortium have resolved to share the cost of the annual fee (c. £1000 per institution per annum) and staff in consortium member institutions will be eligible for participation in CLARIN activities. Please get in touch if you’d like to join – there is still time to be part of the initial consortium.

CLARIN – a growing family

In the past six months, Portugal, Sweden, Lithuania and Greece have joined CLARIN as full members. Croatia, Finland and Slovenia are close to full membership too, waiting only to clear some minor political and administrative hurdles. As well as the imminent application for observer status from the UK, France and Italy are considering similar moves. Norway should soon progress from observer to full member status, and negotiations are under way in almost all other European countries.


CLARIN-PLUS: a funding boost

CLARIN will receive a major funding boost in the next few years, with the news that the CLARIN-PLUS proposal has been successful and is moving to the grant agreement preparation stage. CLARIN-PLUS offers extra resources to accelerate and extend the construction of the CLARIN infrastructure, including strengthening the central hub, and spreading the reach of CLARIN to more countries and to more users. CLARIN-PLUS will be funded as part of the INFRADEV-3 scheme in Horizon2020, a closed call for ERICs in the implementation phase. The start date is September 2015, and the programme of work includes numerous opportunities for CLARIN-UK consortium members, including workshops on how to create a CLARIN centre, and how to integrate your tools and data.

CLARIN ERIC in Horizon2020

As well as the major success of CLARIN-PLUS, CLARIN ERIC has been successful in a number of Horizon2020 proposals, including LT-Observatory and Parthenos, and a number of further proposals have been made or are in preparation. As soon as the UK has joined CLARIN, we will be able to start participating in these opportunities.

The rules for participation in Horizon2020 allow an ERIC to participate as a consortium member, on terms not available to most other types of participant. The ERIC can assign project work to individuals in universities and other bodies in member countries. This is already working well to significantly reduce the administrative overhead usually associated with forming a consortium involving all participating institutions as partners. Furthermore, as a European extra-national organization, CLARIN counts as an ‘extra country’ where the funding scheme rules prescribe that at least three different European countries are required.

CLARIN Centres Meeting Utrecht 28-29 May

CLARIN-PLUS will offer beginners’ workshops in various aspects of using and developing CLARIN services. If you are keen to dive in at the deep end now, there is a meeting for CLARIN Centres and those wishing to set up centres next month in Utrecht. Sessions include a webservices workshop, and a tutorial for those setting up a CLARIN centre. There is no CLARIN funding available for travel and subsistence, but there is no fee for participation. More details here.

Find out more about CLARIN at


Analyzing the language of social media

The eighth one-day event Corpus Linguistics in the South was held at the University of Reading on Saturday 15th November 2014, and focussed on research analysing the language of social media.

Dawn Knight (Newcastle University) spoke about the ‘spoken-ness of e-language’, exploring the positioning of online discourse in relation to the norms of spoken and written language. In short, is language online more like speech or writing?

There are several interesting aspects to Dawn’s project. Explicit permission to re-use and share the data was obtained from all contributors, who were active and popular participants in online discourses. Anonymization of private personal data was carried out, but the data set does not seem to be available for other scholars to use. Funding for the project came from Cambridge University Press, who, it appears, are not willing to share the data.

One starting hypothesis is that there is a continuum of formality for interactive language, with writing at one end and speech at the other (see, for example, David Crystal, English as a Global Language, 2003). Can we map different forms (blogs, tweets, email, discussion forums, SMS) onto this continuum?

Pronouns and deictic markers are interesting. In spoken interaction there are typically references to people, actions and things in the shared immediate context, and (probably as a result of this) pronouns and deictic markers are typically more common in speech. Corpora also show that personal pronouns, adverbs and interjections are more common in speech. In this sense, the e-language corpus looks more like speech than writing, despite asynchronicity of the discourse and the lack of shared space. Dawn suggests that there might be an over-compensation, since in online forums we are more reliant, almost exclusively reliant, on language for interactional aspects of communication.
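The kind of comparison Dawn describes boils down to relative frequencies of a word class across corpora. A minimal sketch is below; the pronoun list and the two toy "corpora" are invented for illustration, and a real study would of course use tagged corpora such as the spoken and written components of the BNC:

```python
# Compare per-million-word frequencies of personal pronouns across corpora.
# The word set and the toy corpora are illustrative only.

PRONOUNS = {"i", "you", "he", "she", "we", "they",
            "me", "him", "her", "us", "them"}

def per_million(tokens, targets):
    """Relative frequency (per million tokens) of the target word set."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in targets)
    return hits / len(tokens) * 1_000_000

# Invented examples of speech-like and writing-like token streams.
speech = "well I think you know we should go shouldn't we".split()
writing = "the committee considered the proposal at its meeting".split()

# Speech-like text is denser in personal pronouns.
print(per_million(speech, PRONOUNS) > per_million(writing, PRONOUNS))
```

Normalizing to a per-million rate is what makes corpora of different sizes (spoken BNC vs. an e-language corpus, say) comparable at all.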

Some results went against expectations – ‘shall’ and ‘must’ are thought to be generally in decline, particularly in informal registers, but proved to be more frequent than expected in SMS language. Discussion forums proved to be most like speech in many ways, despite their low interactivity and asynchronicity.

Dawn’s approach looks promising, and the initial results are suggestive. Further research could involve visualization of the multidimensional comparisons between corpora, for example, to explore more fine-grained identifications of similarities between different e-language and language types in a reference corpus.

The next two papers in the morning focussed on the analysis of particular online forums. Daniel Hunt (QMU) explored the language of an online forum for sufferers of anorexia. An approach based on keywords showed ways in which participants in the forum present their illness as an entity external to themselves, thus presenting themselves as passive and unaccountable. Amanda Potts (Lancaster University) presented an exploration of ‘queer’ sexual innuendo in an area of discourse and human activity that was new to me – commentaries accompanying highly popular videos of Minecraft games. It wasn’t clear to me that either project drew conclusions that couldn’t have been drawn from simply reading the texts in their relatively small datasets, nor that there was anything particularly interesting or significant about the particular issues, themes and areas of social media they had chosen.
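Keyword analysis of the kind Daniel used is conventionally implemented by scoring each word's frequency in the study corpus against a reference corpus, often with Dunning's log-likelihood (G2) statistic. The sketch below is a generic illustration of that technique, not the project's actual code, and the toy corpora are invented:

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Dunning's G2 for a word occurring a times in a corpus of c tokens
    and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)   # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)   # expected frequency in the reference
    g2 = 0.0
    if a:
        g2 += 2 * a * math.log(a / e1)
    if b:
        g2 += 2 * b * math.log(b / e2)
    return g2

def keywords(target_tokens, reference_tokens, top=5):
    """Rank words in the target corpus by keyness against the reference."""
    tf, rf = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = [(w, log_likelihood(tf[w], rf[w], c, d)) for w in tf]
    return sorted(scored, key=lambda x: -x[1])[:top]

# Invented example: a word over-represented in the study corpus scores highest.
forum = ["recovery"] * 5 + ["the"] * 20
reference = ["the"] * 100
print(keywords(forum, reference, top=2))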

Amy Aisha Brown (Open University) set out to examine English in Japan via Twitter, but the analysis seemed to be of tweets in Japanese which included references to the English language, so I’m afraid that I was a bit lost. She seemed to draw the conclusion that fluency in English is generally considered in a positive way in the Japanese twittersphere. Amy used a Windows desktop application, Tweet Archivist Desktop, and a program called KH Coder for cleaning and analysis. Again, the data is not being shared, and there was no indication of whether this might ever be possible.

Alison Sealey and Chris Pak (Lancaster University) reported on a small-scale analysis of references to animals on Twitter, part of a larger project examining discourse about animals. The project used an online service called Topsy to find tweets, but it wasn’t clear to me how the analysis was carried out, how the results were arrived at, or what the research questions were.

Rachelle Vessey (Newcastle University) took a more theoretical tack, characterising mainstream corpus linguistics as being mainly concerned with, and focussed on, notions of stability and normativity, and on standard languages. The idea of ‘superdiversity’ was presented as a cultural successor to multiculturalism, with an assumption of more diverse and fast-changing cultural formations. She has pursued this issue in the context of Canadian language politics, examining tweets relating to a recent controversy known as ‘pastagate’. The data in this project, like that in other projects presented today, was somewhat complicated by the large number of retweets. The somewhat underwhelming conclusion was that the largely separate English and French language communities operate separately on Twitter.

Yin Yin Lu (Oxford Internet Institute, University of Oxford) is investigating the linguistics of the Twitter hashtag. She used the Streaming API, which offers access to a restricted number of tweets according to filters, applying a keyword filter containing the hundred most frequently used words in English, sampled in one-hour slots over a two-week period. (Interestingly, she didn’t have access to a server in Oxford to do this on, and used a server at a different university thanks to a family connection.) Her analysis focussed on a few examples of how hashtags were used in activist campaigns such as #bringbackourgirls.
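Once tweets have been collected, extracting hashtags for frequency analysis is straightforward. A minimal sketch follows; the regex is a simplification of Twitter's actual hashtag rules (which allow a wider character repertoire), and the collection step via the Streaming API and credentials is omitted entirely:

```python
import re

# Simplified hashtag pattern: '#' followed by word characters.
# Real hashtag syntax is more permissive (e.g. non-Latin scripts).
HASHTAG = re.compile(r"#(\w+)")

def hashtags(text):
    """Return the hashtags in a tweet, lower-cased, without the '#'."""
    return [h.lower() for h in HASHTAG.findall(text)]

tweet = "We stand with them #BringBackOurGirls #bringbackourgirls"
print(hashtags(tweet))
```

Lower-casing collapses variant spellings of the same campaign tag, which matters when counting how often a hashtag recurs across a sample.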

The final talk from Diana Maynard (University of Sheffield) introduced a research project, ‘Decarbonet’, which aims not only to investigate what people think about climate change, but to ‘raise awareness’ and foster ‘behavioural change’. Diana’s part in this is to analyse discourse about climate change in social media.

Overall, it was clear that there are still some serious hurdles to accessing social media data, processing it for use with standard text analysis tools, and sharing the results and datasets. I was hoping to find examples of workflows which could access big datasets and analyze them in close to real time, but I haven’t found that yet. I’ll keep looking!


Advising DigHumLab

University of Oxford researchers from IT Services and the Oxford Internet Institute are playing a key role in advising an important national project in Denmark, and learning a lot about different ways of building and sustaining research infrastructure along the way.

DigHumLab is a national initiative in Denmark to set up a collaboration to advance digital research in the arts and humanities. Starting in 2010 with the drawing up of a roadmap of research infrastructures for Denmark, DigHumLab was awarded €4 million for five years in 2011. DigHumLab encompasses the Danish contribution to the CLARIN and DARIAH European research infrastructures. I was asked to join the small international Advisory Board for the project, and to attend a mid-term meeting in Copenhagen in September 2014 to offer advice.

DigHumLab logo


The vision for DigHumLab is to take actions to strengthen research in the humanities and humanistic social sciences, to improve access to data, develop methods and tools, promote collaboration and support emerging areas of digital research. The project goals are to:

  • create a virtual portal, an access point, and potential partner for international collaborations
  • create a knowledge hub
  • become a provider of software and technical solutions
  • act as a national political advisor on matters relating to digital research in the humanities.

As well as activities to establish these outputs and services, the project includes a significant amount of effort spent on three research themes:

  1. language resources and technologies
  2. media tools
  3. interaction and design

The project has kicked off with participation from four Danish universities, but the intention is not to create a club closed to other universities or research bodies. The project also aims to build links and to coordinate activities with national services for high performance computing, research data management, e-Science, as well as with the National Library and the European research infrastructures. It wasn’t possible in this meeting to find out what measures are being taken to achieve these goals, but it was encouraging that the meeting was hosted by the National Library, in their impressive modernist ‘Black Diamond’ building, with participation by senior staff from the library.

Photograph of Danish Royal Library

The Black Diamond building housing the Danish Royal Library

After spending an initial period establishing the working groups and themes, the project is now moving into a period with a focus on building generic services such as online research environments, awareness raising, a survey of requirements, outreach activities to various research communities, establishing teaching programmes, and increased student involvement.

The first theme, language resources and tools, was presented by Lene Offersgard from the University of Copenhagen, who outlined the key activities, including the establishment of a data repository, now certified with the Data Seal of Approval and CLARIN ‘B’ Centre status, with an accompanying helpdesk, tools for the analysis and annotation of data, and a user engagement programme. There are also PhD teaching modules for students at the University of Copenhagen.

The second theme, audio-visual data and tools in various media, was presented by Niels Brügger and Per Jauert from Aarhus University. Work on this theme acknowledges that “the digital comes in a variety of forms”, which they sub-divide into:

  • Digitized
  • Born-digital
  • Reborn-digital

The enhanced web archive is an example of the last of these, where digital materials have been collected, reassembled and made available with metadata as research data. The focus of this work is on web archives, but it occurred to me that it is a characterization which fits the modern linguistic corpus as well. The team have developed the Digital Footprints software, which is still in beta but is in use for studying online material. As well as developing ways to examine and to improve access to Netarkivet, the national web archive, researchers are working together in international collaborations, including with the British Library and the Oxford Internet Institute, and establishing a transnational European research infrastructure for the study of archived web materials. The NetLab Forum provides wiki space for research projects using the tools so that they can communicate and share experiences, expertise and results. It was pointed out that DigHumLab has been crucial in providing the funding for an IT developer, without whom this work would not have been possible, and on whom ongoing work is reliant. Another risk to the viability of ongoing work was flagged: independent legal advice is needed on the risks associated with access, use and redistribution of online materials.

The theme also encompasses work on audio-visual data and tools. Following on from research projects such as LARM, a research infrastructure has already been established for the use of audio materials, and the challenge is to integrate it with other DigHumLab services. This work has been built on the national library media collections of radio and television programmes. Advanced services already offer streaming access, and ongoing research projects are using these services for research.

Johannes Wagner of University of Southern Denmark introduced the third theme, the “little brother” of the DigHumLab siblings, focussing on “experiential research”, or analysis of human interactions and activities via digital capture and analysis. An example is the VELUX project on non-verbal communication. The experience of the researchers in this area is that “if you build it they will come” doesn’t work in this context. Face-to-face and hands-on bespoke support are needed to engage with researchers and to meet their requirements.

In the discussion with the Advisory Board, Eric Meyer (Oxford Internet Institute) asked the penetrating question of how the success stories of flagship projects are disseminated to other researchers who could potentially engage with DigHumLab. Demonstrators are much more compelling and convincing when they have been used for real research that has been finished and can be shared. Too many e-science case studies have been based on toy data or invented problems, making it difficult for the people who might want to use these tools to envisage real uses, or to deploy the solutions. A variety of instruments are currently used to involve researchers, including travelling workshops, PhD courses, journal articles, lectures, and short courses. The question of how, or whether, to attempt to address all disciplines and all communities in the humanities remains an open one. It was agreed that robust showcases modelled from the user point of view were vital to promote uptake.

The afternoon session focussed on the thorny question of possible business models for the sustainability of DigHumLab beyond its current phase of funding. From 2017 DigHumLab aims to focus on the refinement and improvement of services, including prioritization of research areas, marketing of services and the recruitment of users, and the development of a viable financial model for sustainability.

One model would be for DigHumLab to be based on a core of generic services, with research themes changing over time. Eric Meyer offered a cautionary tale: the generic services and service centres developed as part of the e-Social Science programme in the UK no longer exist. I added the further example of the Arts and Humanities Data Service.

There was also some discussion of how to enter into collaborations with computer scientists. It was agreed that it was important not to treat computer scientists as “code monkeys”. Computer scientists need to address research questions and to publish in high-impact journals relevant to their discipline. We need to approach collaboration as an inter-disciplinary research project with equal academic standing for all partners. Sometimes we just want to build a website or an interface or install some software, and then we need to find a developer, but this is different to an inter-disciplinary collaboration.

Sten Runar Ludvigsen from the University of Oslo made the interesting point that although distributed services can have a certain robustness, a centralized lab means that you only need to change the culture in one place, not in every lab, to run services for the community in a collaborative spirit, and might therefore be more realistic. He also made the crucial point that, as a small country, the Danish humanities community could benefit from focussing on a small number of areas. Clearly they have already done this with the three themes in the current phase of DigHumLab. It would be useful to have further reflection on whether these are the right areas, and then to communicate clearly to stakeholders how the scope of the project will be constrained in future.

To summarize the day, I proposed the following three points for the project, after discussion with, and in agreement with, the other members of the Advisory Board.

1. DigHumLab should articulate a vision and a mission relating to the use of digital data, tools and methods, situated firmly within the wider mission(s) of humanistic research: a strategic vision of what and who should be included, what the priorities are and why, and what is not included. A decision needs to be made on whether to focus on a small number of strategic areas or to try to engage with all areas of the humanities; the former seems likely to be more successful. These statements about vision, mission and scope can be informed by asking where you want to be in ten years’ time. The project is already nicely focussed on specific themes – do you plan to continue to restrict the scope to these, or to expand to other areas of research?

2. A flexible and robust business model needs to be able to survive the withdrawal of a funder, institution, partner, academic domain, key individuals, etc. Staking everything on the support of a ministry or a national funding body is a risky, all-or-nothing strategy. Flexibility means a range of funders can be accommodated (e.g. national and local funders, programmes for libraries, research data management, research grants, e-science, network/conference funds, etc.). The key to this is that various institutions and people want to buy into and sustain the mission, and are prepared to align the local strategies of sustainable institutions with the common aims. This way, there is the opportunity to repurpose existing resources and funding streams to fulfil the aims of DigHumLab, rather than the more difficult task of seeking additional funding on a long-term basis.

3. It would be useful to clarify and define how DigHumLab supports digital research at the various stages of the research life-cycle (initiating, carrying out, connecting, disseminating and sustaining research). Do you want to be involved in some or all of these? How are you adding value to these activities?

You can see and read more about DigHumLab at


The Oxford Text Archive and the British National Corpus: an annual report

The Oxford Text Archive continues to deliver open access to language resources to the academic community, via the website at This year there were 5278 downloads of datasets from the OTA. An exciting development in this period was the arrival of the British National Corpus (BNC) in the OTA collection. This major reference work for the English language is now available from the OTA website, and was downloaded 397 times by researchers from around the world after it went online in January 2014. Two subsets of the corpus, BNC Baby and the BNC Sampler, are also available. Thousands of texts created as part of the Eighteenth Century Collections Online Text Creation Partnership (ECCO-TCP) are available via the OTA in high-quality XML format, and many thousands more will be available in 2015 from the Early English Books Online Text Creation Partnership (EEBO-TCP).

Two new services, introduced as the result of a collaboration with the Oxford e-Research Centre, offer new ways for users to access and use the literary and linguistic texts in the OTA. Users can download certain texts (including the BNC) without waiting for manual authorization of their requests by using their institutional single sign-on, thanks to Shibboleth federated access and identity management. At the moment, only users who are members of an institution which is part of the UK Access Management Federation can use this facility, but we are working to open it to cross-border access to more users throughout Europe via the CLARIN and EduGain federations. More than 300 instant downloads have been made already using this facility.

Screenshot from BNCweb


The second new service is BNCweb, a sophisticated online interface to the BNC, which allows researchers, teachers and language learners across the University to submit queries to identify and analyse distributions and patterns of usage in this large dataset of English speech and writing. In the coming year, we will start to implement an enhanced service offering access to more datasets via a common interface.

The OTA obtained certification as a CLARIN Centre in 2014, which confirms and strengthens its role as a key hub in the European research infrastructure. As a result of the collaboration with CLARIN, OTA resources can now be found via the Virtual Language Observatory, an online research portal, which offers access to electronic language resources held in repositories worldwide.

Screenshot of the Virtual Language Observatory

Virtual Language Observatory

The development of these services, and the expertise in these areas, has enabled staff from IT Services to offer specialized teaching and support in digital methods to members of the University, including teaching on a Masters course in English Language, induction sessions for new postgraduate students, and a course on corpus linguistics, open to all, in the IT Learning Programme.


Web publishing in the University

A new project is investigating what users need from a new service which aims to provide better options for members of the University of Oxford to build and maintain their web presence. All members of the University are invited to fill in a survey to help us understand your requirements. The online survey can be found at the following URL:

Why should you fill in yet another survey? If any of the following apply to you, then please let us know what you want:

  • Would you like to have official University of Oxford web pages which you can edit and manage yourself?
  • Would you like to have a wider range of templates, themes and features for your websites?
  • Would you like to have an easy to configure and edit website for your research group or unit?

IT Services are developing a new way of providing central web publishing services, which will include catering for individual members of the University wanting to manage their own web presence, in addition to clubs, societies, research groups and clusters, as well as other units. The new service will replace the existing web publishing service hosted at, and will offer substantially more features, including a range of templates for different types of website, plus optional modules and bespoke services. Below is a typical site hosted at – pretty old-fashioned, I’m sure you’ll agree:

A home page

Increasing numbers of people are going elsewhere to meet their requirements – to services like WordPress and Google Sites, which allow you to quickly and easily build and publish a new site using one of a selection of templates. Others find web designers and external hosting solutions to build and support their sites, while others rely on social media for their web presence, using Facebook, Twitter and more professionally oriented services such as and LinkedIn. While this mixed economy offers lots of flexibility and benefits to users, it does raise problems. It can often be hard to meet the recurrent costs needed to keep such services going. Users are reliant on external services with no guarantees of their continuation. Money flows out of the University, when it could be used to build capacity here. And, finally, it can be difficult or impossible to give the right impression with University branding and URLs.

The Research Support team in IT Services are gathering requirements from members of the University for the new service. At this stage we want to know which functions should be included in the standard offerings. The survey is designed to find out what features and functions are important to you so that we can design our new service to meet those needs. Please fill in the survey if you have time, and get in touch if you’d like to discuss it further.


Corpus Linguistics, Context and Culture

Edited transcript of a very short introduction to a panel discussion on Corpus Linguistics, Context and Culture, held on 2nd May 2014, in which I participated with Bas Aarts, Stefan Gries, Andrew Hardie, Christian Mair and Peter Stockwell.

We are now tantalisingly close to being able to process and analyse very large-scale textual resources with relative ease; these resources represent significant sections of the human cultural record; the opportunities for digital transformations of research in many disciplines are enormous.

Researchers are starting to use these resources to find and ask new research questions, as well as to address some old questions with more and new data, on a bigger scale, more authoritatively, more systematically; this is starting to happen and will happen with or without corpus linguists.

To engage more effectively in these new forms of interdisciplinary research, we should focus more of our resources and attention on overcoming some important technical and methodological barriers; the main technical barriers are a lack of professional, reliable, persistent and sustainable services open to all – this is what CLARIN is trying to achieve; in terms of methodology, humanities scholars need to take a step towards working on some connected and common research programmes, addressing questions susceptible to big data approaches – this is down to us, the research community.



Popular Representations of Development

A few weeks ago, I was invited to join a panel discussion at Wolfson College, Oxford, to discuss the new book Popular Representations of Development: Insights from Novels, Films, Television and Social Media, edited by David Lewis,  Dennis Rodgers and Michael Woolcock. The book aims to open up a new method of analysis for development studies by treating popular representations of development issues as a data source, and engaging in interdisciplinary research with various disciplines in the humanities to analyse these representations. See more about the event at Below is an edited transcript of what I said, or meant to say.

‘Popular Representations of Development’ is convincing on the key point of argument, that artistic and fictional representations can be useful and important data resources, since they can influence, shape or reflect public perceptions and debates. It’s fairly straightforward to see how a popular novel or film might have rather more impact than social scientific scholarship, certainly as far as popular discourses and perceptions are concerned.

Through the study of representations of various aspects relevant to Development Studies, in various media and from various time periods, the research papers in this volume make illuminating, and sometimes contentious points, about development, about the representations, and about the relationship between them and about methodologies for pursuing this question. For me, they also raise some questions about methodology in an emerging interdisciplinary field. I have some experience of participating in and studying the ways in which new methods emerge, and are contested, particularly in interdisciplinary areas which are transformed by the introduction of digital methods, such as corpus linguistics and digital humanities. I will try to bring some of this experience to bear and make some observations about how this new field might grow.

At the risk of some oversimplification and crudeness, I would say that the research showcased in this volume demonstrates a methodology whereby the researcher hand-picks an example and analyses it according to their chosen methodology, throwing in an overlay of their chosen ideological approach. This raises questions about these choices: questions about bias, representativeness, balance, scope, sampling, and the importance and impact on perceptions and debates of the representations chosen for study. What is an important film or novel? High relevance, popularity, critical acclaim, artistic merit, sociological integrity? Any given representation is unlikely to tick all of these boxes. What is more, we are now at a time when we can exploit the opportunities presented by the large amounts of available texts and media, applying approaches currently characterized by buzzwords like big data, linked data and smart data, enabling us to ask different questions, develop new methods, and engage in different types of interdisciplinary collaboration.

I will now examine in a little more depth what I mean by these points.

How representative are the works examined? Aren’t they just hand-picked examples to back up your points? To take one example from the book, Missing, Under Fire, The Year of Living Dangerously, and Salvador do clearly constitute an interesting tendency, a new sub-genre starring the investigative journalist or war reporter amid political turmoil in the Third World. The chapter in question convincingly relates the emergence of this sub-genre to the early stages of the break-up of cold war certainties in the Third World. But do they help us to understand these processes better, or are they just crude popularizations of certain aspects (from the point of view of the Western media)? What about the influence of other mainstream adventure films with more conventional (and maybe more misleading) narratives? How representative are these films of Hollywood output, and what are the norms that they diverge from, and what are the dominant forms of discourse and representation that they react against?

A further question raised by this new approach relates to the scope of representations of development. There are many places and time periods to examine, various media, different artistic forms, and many theoretical approaches. It’s possible to construct a powerful argument to justify the inclusion of outliers like The Wire by defining urban decay in the USA as a development issue, but it might be difficult to connect the debate about that with studies of popular film in India, with poster campaigns in 1930s Britain, and also with representations of genocide and war in central Africa. With such diversity, and particularly when there is so much focus on outliers, marginal and non-prototypical cases, no coherent picture of the central representations and discourses is built up, and the findings of individual research projects or papers don’t necessarily relate to each other in any way. It’s difficult to build an academic discipline on the basis of a series of largely unconnected research studies. This is a problem shared with the digital humanities, where the objects of study and the methods are so wide and varied that there is little possibility of moving understanding forward in any useful way.

An interesting question is raised, either explicitly or implicitly, by a number of the chapters. Do artistic representations and other narratives merely reflect opinions and debates, or can they somehow provide special insights? Do they inevitably just reflect dominant (and occasionally minority or marginalised) discourses? And if not, how do they achieve more than that? Do novelists and film-makers have better insights than social scientists? As the authors note, representations emanating from the developing countries in question might be based on more local knowledge of everyday life than the social scientists can muster. One could add that writers of fiction have better story-telling skills. So you can argue that representations can easily be more popular, and more engaging, but can they be more right?

An interesting parallel can be drawn with debates around the nineteenth-century realist novel. One popular theory holds that the great realist novelists, such as Balzac, wove stories that dramatized in narrative form how industrial capitalist society worked: the interplay of economic and social forces, their effects on people’s lives, and the ability of the human subject to shape their own destiny and that of society. In an era before sociology, such fictions are often seen as key texts for understanding society. I think that The Wire aims to do something similar (although now also partly informed by sociological literature), and it partially succeeds, although I would argue that the wrong conclusions can be drawn if you treat it as a data source. The chapter on this topic asserts several times that the withdrawal of the state from US post-industrial inner cities is the problem. An alternative narrative, with a wider scope and drawing different conclusions, could point to a longer historical trend in which the state attempted to overcome the exclusion of the black population by intervention, with the effect of undermining traditional forms of civil society. This alternative story would point to the eradication, through state intervention, of effective mechanisms for the exercise of autonomous action by ordinary people in the inner cities, and would identify this as the more fundamental problem. And is this not a possible reading of The Wire in any case?

A further general problem with interdisciplinary studies is the difficulty of engaging with the cutting edge of research methods in all fields, and the danger of adopting a rather conservative or simplified method in the field in which one is dabbling. In some cases, the studies in this volume could be said to be a little conservative in their methods. The humanities are now grappling with new methods of investigation and interpretation which are being made possible by the availability of massively larger amounts of the human cultural record now in digital form, and the possibilities of searching and analysing these records with computational tools. So there is the danger here of only using the microscope to look at tiny details when we have the opportunity to use new instruments which can show us big pictures, and significant patterns and tendencies which appear when we look at lots of representations at the same time.

I am not suggesting a return to a quantitative approach, which is partly what this new approach is trying to get away from. In my view there seems to be rather too rigid a divide between qualitative and quantitative approaches in the social sciences. In the humanities, I think it is rather taken for granted that all research needs to be qualitative in some sense, soundly based on a firm understanding of sources: their provenance, context, value and meaning. Digital and quantitative methods contribute an additional set of tools and approaches, not a replacement. Digital humanities, at its best, is developing techniques which can blend qualitative and quantitative approaches. Scalable reading is now much discussed as an approach which uses tools to analyse the big picture, patterns and trends (distant reading), combined with the ability to zoom in and examine meaning in texts in detail (close reading). In linguistics we have found the need for instruments that can count frequencies and spot trends, but also support close analysis and the interpretation of meaning.
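To make the idea of scalable reading concrete, here is a minimal sketch in Python. Everything in it is invented for illustration (the two-sentence mini-corpus, the function names): it counts word frequencies across a whole collection (the distant view) and produces keyword-in-context lines for a chosen term (the close view).

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def frequencies(texts):
    """Distant reading: aggregate token counts across a whole collection."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

def kwic(texts, keyword, window=3):
    """Close reading: keyword-in-context lines for a chosen term."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(f"{left} [{tok}] {right}")
    return lines

# Toy example (invented sentences, not real corpus data):
corpus = [
    "Development aid shapes public debate about development.",
    "The film frames development as a Western project.",
]
print(frequencies(corpus).most_common(3))
print(kwic(corpus, "development"))
```

The point is only that the two views are complementary: the same tokenized collection feeds both the aggregate counts and the concordance lines, so a researcher can move between trends and individual passages without changing instruments.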

We should also be open to the opportunities to trace present debates in the past, via digital records now becoming available online. Apart from anything else, historical data is often more easily available, thanks to lapsed licensing restrictions and expired intellectual property rights. But there are new possibilities that are not yet being properly exploited. Culturomics using Google n-grams is bad social science (and bad linguistics). Newer, more scholarly initiatives such as the Red Hen Lab might show some possibilities for media studies. CLAROS shows how you can do different studies of ancient art when you have all of the data in one place online; CLARIN is starting to show how literary, linguistic and historical studies are being transformed by the possibility of asking new questions of large datasets, and of linking data in new ways, as well as asking old questions in new ways, more systematically and more authoritatively. This opens up the possibility of asking important and central questions – not just about the hidden voices, the unusual cases, the margins – and this is necessary if we are to build a research community where it is possible to accumulate knowledge, contest and debate issues, and conduct research which builds on earlier findings.

We might look back at this in a few years’ time and say, “Hey, do you remember, that was when we used to look at one film, or one novel, or just a handful of posters at a time, in order to try to understand development?”.

Posted in Uncategorized | Comments Off on Popular Representations of Development

CLARIN for beginners

What is CLARIN?

CLARIN is a network of people, centres and research activities which support advanced digital research based on language data and tools. Formally, it is the Common Language Research Infrastructure, and it exists as a legal entity, a European Research Infrastructure Consortium, with a base in Utrecht in the Netherlands, but CLARIN is really built on important national initiatives in a growing number of countries across Europe. These are building up data centres, connecting resources together and with online tools, creating advisory and support services, and promoting research programmes which make use of them.

Who is CLARIN for?

It’s primarily for anyone interested in digital research in the humanities and social sciences who wants to make use of linguistic data and tools. We’re also very open to scholars from other disciplines, and interested in supporting the use of the infrastructure in teaching and by the general public. In fact, we’re pretty sure that there are lots of cool uses that this stuff could be put to which we haven’t even thought of yet. The funding comes from national and European sources to provide services across the EU, but we’re also keen to make international alliances, and to make as much as possible free for anyone to use. We know that research communities cross many boundaries, and we want to break down barriers, not build them.

What can CLARIN do for me?

It depends who you are and what you’re looking for. If you are a researcher, you could use CLARIN to find services or people to help you to use language resources and tools more effectively, or to ask new research questions. If you create language resources, you might like to deposit them with one of the CLARIN centres so that they can be curated by professionals, and then found and used by many more researchers. If you run a repository, think about registering as a CLARIN centre and making your resources discoverable and usable via CLARIN services like the Virtual Language Observatory, or the Federated Content Search. If you develop or work with language software, you might want to try to get it integrated into the CLARIN architecture.

What can I do for CLARIN?

That’s more like it. CLARIN needs people to build the tools, services and infrastructure that we need. We also need to hear from researchers what they want from the infrastructure. You could also let us know if CLARIN has helped you, so that we can tell our funders about that. If you are in a country which hasn’t joined CLARIN, such as the UK, ask the funding agencies and policymakers why not!

Why should I be interested?

Whatever you do, you probably write, read or otherwise manipulate language in your job, and some of the resources and tools in CLARIN might be useful for you. Want to see how particular words are usually used in English (or French or German or Estonian)? Need to identify the language of a text? Need to identify all of the people and places in a text? Want to get hold of an expert in Dutch dialects? CLARIN is building a one-stop shop for solutions to these sorts of questions. Furthermore, you might not be interested right now in language technology, but you might be interested in how we are trying out novel approaches to building a virtual infrastructure to support research in the humanities and social sciences. This involves cutting edge technologies for authorizing access to resources, expertise in digital curation, new ways to describe, find and share electronic resources online, overcoming legal, administrative and financial barriers to build cross-border infrastructure services, lobbying for more access rights to copyright material for research, and lots more. Visit regularly and watch the story unfold.

What has CLARIN actually achieved?

Here are a few examples: the Virtual Language Observatory, the Federated Content Search, a service provider federation allowing cross-border log-ins to resources, numerous training events, research projects facilitating collaborations, and some really cool websites.

What has it got to do with you, Martin?

I’m Director for User Involvement for CLARIN at the European level, on a 3-year part-time secondment 2013-2016, as well as having been one of the founders and architects back when it first started. So explaining what CLARIN is and encouraging people to get involved is part of my job. Get in touch if you want to know more.


Using large-scale text collections for research

I participated in a recent workshop in Würzburg on using large-scale text collections for research. The workshop was organised as part of the activities of NeDiMAH, the Network for Digital Methods in the Arts and Humanities.

I had the opportunity to give a short introduction on some aspects of my interest in this topic. I outlined how the current problems include the fragmentation of currently available resources into different digital silos, with a variety of barriers to their combination and use, plus a lack of easily available tools for the textual analysis of standardized online resources, and I briefly referred to the plans of the CLARIN research infrastructure to address some of these problems.

Christian Thomas explained how the Deutsches Textarchiv (DTA) is facilitating research with large-scale historical German text collections. The DTA is funded from 2007 to 2015, and now includes resources of more than 200 million words from the period 1600 to 1900. There are images and text, and automatic linguistic analysis is possible. The DTA is a CLARIN-D service centre. Integration into the CLARIN infrastructure means that resources can be discovered via the Virtual Language Observatory (VLO), searched via the Federated Content Search (FCS), and analysed and processed via WebLicht workflows. The DTA also contributes to discipline-specific working groups as part of its outreach and dissemination strategy. The majority of texts are keyed in. The workflow for OCR texts is interesting: structural markup is added to the electronic text (using a subset of TEI P5), and then OCR errors are corrected; they find that it is easier to identify and correct errors in structured text. The Cascaded Analysis Broker provides normalization of historical forms to allow orthography-independent and lemma-based corpus searches, and this is integrated into the DTAQ quality assurance platform. Christian’s slides can be found here.

The DTA is also a key partner in the Digitales Wörterbuch der deutschen Sprache (DWDS), an excellent concept allowing cross-searching of resources in different centres, and very well implemented. This offers a view of the future of corpus linguistics and the study of historical texts online.

Jan Rybicki from the Jagiellonian University in Kraków told us about a benchmark English corpus to compare the success or failure of stylometric tools. There was a very interesting discussion of the idea of how to build representative and comparable literary corpora, which put me in mind of the work of Gideon Toury in descriptive translation studies. There was also discussion of a possible project to build comparable benchmark corpora for multiple European literary traditions.

Rene van Stipirian (Nederlab) outlined the background: the study of history in the Netherlands is characterised by a fragmented environment of improvised resources. The Nederlab project is funded by NWO from 2013 to 2017 to address the integration of historical textual resources for research. Some very interesting statistics were presented: for the period up to the end of the twentieth century there are 500 million surviving pages printed in Dutch; 70 million of these are digitized, but only 5–10 million have good-quality text – most are rather poor-quality OCR. Nederlab brings together linguists, literary scholars and historians, and integrated access to resources will go online in the summer of 2015.

Allen Riddell from Dartmouth College in the US took an interesting and highly principled approach to building a representative literary corpus. He randomly selected works from bibliographic indexes, then went out and found the works, scanning them if necessary. This seems to me to be a positive step, in contrast to the usual, rather more opportunistic, approach of basing corpus composition on the more easily available texts. The approach to correcting the OCR text was also innovative and interesting: he used Amazon Mechanical Turk. Allen also referred to a paper of his on this topic.
This also raised an interesting question – can a randomly selected corpus be representative, or do we need more manual intervention in selection (at the risk of personal bias)?

Tom van Nuenen from Tilburg University described how he scraped professional travel blogs from a Dutch site and started to analyse the language. Puck Wildschut from Radboud University Nijmegen described the early stages of her PhD work comparing Nabokov’s novels using a mixture of corpus and cognitive-stylistic approaches.

The discussion at the end of the first day focussed on an interesting and important question: how do we make corpus-building more professional? Reusability was seen to be key, and dependent on making sure that data was released in an orderly way, with clear documentation, and under a licence allowing reuse. And since what we are increasingly dealing with is large collections of entire texts (rather than the sampled and truncated smaller corpora of the past), then we should ensure that the texts that make up corpora should be reusable, so that others can take them to make different ad hoc corpora. This requires metadata at the level of the individual texts, and would be enhanced by the standardization of textual formats.

Maciej Eder from the Institute of Polish Studies at the Pedagogical University of Kraków introduced and demonstrated Stylo, a tool for stylometric analysis of texts. In this presentation, and one on the following day, I found some of the assumptions underlying stylometric research difficult to reconcile with what I think of as interesting and valid research questions in the humanities. How many literary scholars are comfortable with notions that the frequencies of word tokens, and the co-occurrence of these tokens give an insight into style? And the conclusion of a stylometric study always seems to be about testing and refining the methods. Conclusions like “stylometric methods are too sensitive to be applied to any big dataset” don’t actually engage with anyone outside of stylometry. Until someone comes up with a conclusion more relevant to textual studies, this is likely to remain a marginal activity, but maybe I’ve missed the point.
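To make explicit the assumption queried above – that frequencies of word tokens give an insight into style – here is a toy sketch of the kind of computation classical stylometry rests on, in the spirit of Burrows’s Delta. This is emphatically not the Stylo implementation; the texts, the tiny vocabulary and the function names are invented for illustration.

```python
import re
from statistics import mean, stdev

def relative_freqs(text, vocab):
    """Relative frequency of each vocabulary word in one text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1
    return {w: tokens.count(w) / total for w in vocab}

def burrows_delta(texts, vocab):
    """Return a distance function implementing a simple Burrows's Delta:
    z-score each word's frequency across the collection, then take the
    mean absolute difference of z-scores between two texts."""
    profiles = [relative_freqs(t, vocab) for t in texts]
    zscores = []
    for p in profiles:
        z = {}
        for w in vocab:
            col = [q[w] for q in profiles]
            sd = stdev(col)
            z[w] = (p[w] - mean(col)) / sd if sd > 0 else 0.0
        zscores.append(z)

    def delta(i, j):
        return mean(abs(zscores[i][w] - zscores[j][w]) for w in vocab)

    return delta
```

In real stylometric work the vocabulary would typically be the few hundred most frequent words of the whole collection rather than a hand-picked handful; texts with similar function-word profiles receive low Delta scores, which is then read as stylistic (or authorial) similarity. Whether such scores say anything a literary scholar would recognize as “style” is exactly the question raised above.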

The focus on looking for and trying to prove the differences between the writing of men and women also strikes me as a little odd, and certainly contentious. Why prioritize this particular aspect of variation in the writers? Why try to essentialize the differences between men and women, and why not other factors? I’d be more interested in an approach which identified stylistic differences and then tried to find what the relevant variables might be, rather than an initial starting point assuming that men and women write differently, and trying to “prove” that by looking for differences.

On the second day of the workshop, Florentina Armaselu from the Centre Virtuel de la Connaissance de l’Europe (CVCE) described how they are making TEI versions of official documents on EU integration for research use. I suggested that there might be interesting connections with the Talk of Europe project, which will be seeking to connect together datasets of this type for research use with language technologies and tools.

Karina van Dalen-Oskam from the Huygens Institute in the Netherlands, one of the workshop organisers, introduced the project entitled The Riddle of Literary Quality, which is investigating whether literariness can be identified in the distributions of surface linguistic features. The current phase is focussing on lexical and syntactic features which can be identified automatically, although a later phase might investigate harder-to-identify stylistic features, such as speech presentation. In the discussion, Maciej Eder suggested that the traces of literariness might reside not in absolute or relative frequencies of features, but in variation from norms (either up or down).

Gordan Ravancic (Institute of History in Zagreb) joined us via Skype to introduce his project on crime records in Dubrovnik, “Town in Croatian Middle Ages”, which was fascinating, although not clearly linked to the topic of the workshop, as far as I could tell.

Some interesting notions and terminological distinctions were raised in discussions. Maciej Eder suggested that “big data” in textual studies is data where the files can’t be downloaded, examined or verified in any systematic way. This seems like a useful definition, and it immediately raised questions in the following talk. Emma Clarke from Trinity College Dublin presented work on topic modelling. This approach to distant reading can only be used on a corpus that can be downloaded, normalized and categorized, and would be difficult to apply to big data as defined by Eder, although it could potentially be used as a discovery tool to explore indeterminate datasets. Christof Schöch from the Computerphilologie group in Würzburg differentiated “smart data” from “big data”, and suggested that smart data is what we mostly want to be working with: data cleaned up and normalized to a certain extent, and of known provenance, quality and extent.

The workshop concluded with discussions about potential outcomes of this and a previous NeDiMAH workshop. A possible stylometry project to build benchmark text collections and to promote the use of stylometric tools for genre analysis and attribution was outlined, with perhaps the ultimate goal of an ambitious European atlas of the history of the style of fiction. We also discussed the possible publication of a companion to the creation and use of large-scale text collections.

Read more about the workshop on the NeDiMAH webpages.


Changes to the distribution of the British National Corpus

In January 2014 there will be some changes in the way that the British National Corpus (BNC) is distributed.

It is now possible to download the British National Corpus at no cost from the Oxford Text Archive at the following URL:

BNC Baby, a four-million-word sample of the BNC, is also available:

Click on the ‘apply for approval’ link to request a copy. The BNC continues to be subject to the same user licence conditions. If you have already paid for permission to use the BNC, that permission continues to be valid in perpetuity.

There is an even simpler download option if you have a login ID from a UK or eduGAIN Shibboleth identity provider (usually, this applies to all members of UK universities, and many European institutions). You can follow the links at the locations above to download the corpus directly without applying for approval. We hope that this facility will soon be extended to users from other countries who participate in the CLARIN Federation.

It will remain possible to order the BNC on disks from the University of Oxford until the end of March 2014, with the current administrative charges still applying, from the following URL:

As part of this process, I have to announce that the University of Oxford can no longer offer any support for the XAIRA software, which has for many years been made available with the corpus. We have tried to offer support on a ‘best efforts’ basis in recent years, but we do not have the resources or expertise to help with the installation or use of XAIRA on the latest hardware and software. Users of XAIRA are encouraged to visit the project website and check out the forums and mailing lists which you will find there. The future of XAIRA depends on a committed user community, so please get involved if you have questions or can contribute expertise.

There are excellent services offering instant online access to the BNC, and I am convinced that there is still further potential for the integration and use of the corpus in online services and web applications. There are plans to integrate access to the BNC with the emerging CLARIN infrastructure, enabling a range of applications and web services to be used in conjunction with this and many other corpora.

If you know of other ways of using the BNC, or have any more ideas about its future, I would welcome a discussion on this email list, or email me.
