CLARIN Germany: Happy First Birthday!

A workshop was held in Leipzig last month to mark the end of the first year of CLARIN-D, the national initiative in Germany to build a research infrastructure as part of the Common Language Resources and Technology Infrastructure. The wider CLARIN effort is Europe-wide and aims to link up repositories, services and researchers in the social sciences and humanities who make use of the wide range of digital datasets and tools for processing human language. More details of the workshop, including all of the presentations, are available here.

Greg Crane, the newly appointed Professor of Digital Humanities at the University of Leipzig, kicked off the event with a stimulating presentation which situated CLARIN in the wider context of the evolution of the humanities and, more recently, the digital humanities. Greg suggested that we should provide platforms and tools for students and citizen scholars to contribute to research and to the accumulation of knowledge, culminating in the challenge: “How can we foster a new global Republic of Letters?”

Erhard Hinrichs (Tübingen), the coordinator of CLARIN-D, introduced the overall initiative as a “web and centre-based research infrastructure for the social sciences and humanities”. CLARIN aims to build an integrated, interoperable, scalable and sustainable research infrastructure via a network of centres. Language resources and tools (LRTs) will be deployed as services for researchers in the social sciences and humanities. CLARIN-D has 9 centres: BAS, University of Munich; BBAW, Berlin; IDS, Mannheim; MPI, Nijmegen; University of Hamburg; University of Leipzig; Saarland University; University of Stuttgart; and University of Tübingen.

Erhard reassured us that CLARIN-D has taken to heart the words of John Wood from the Knowledge Exchange Workshop in Berlin in September 2009:

“Research infrastructures that do not take user needs into account from the very start run the risk of becoming empty infrastructures.”

There are working groups for 9 humanities and social science disciplines. These discipline-specific working groups act as catalysts, linking CLARIN-D to the research communities. They choose key resources and tools from their communities, and advise on and supervise their integration into the CLARIN-D infrastructure (in the so-called “curation projects”). CLARIN-D is also working with many of the BMBF-funded eHumanities projects. In addition, CLARIN-D has work packages devoted to liaison with the CLARIN ERIC and with DARIAH, an emerging humanities e-infrastructure, as well as to legal and ethical issues, support and helpdesk, and training and education.

Dieter van Uytvanck (MPI Nijmegen) introduced the distributed technical architecture of CLARIN in the context of an infrastructure to support researchers throughout the life-cycle of their work. He also situated CLARIN in the context of a (European) ecosystem of infrastructures:

  • Community Services  – CLARIN
  • Cross Community Services – DASISH
  • Compute Services – DEISA
  • Data Services – EUDAT
  • Grid Services – EGI
  • Network Services – GEANT

Dieter outlined the services which are available now:

  • WebLicht for resource processing and workflow management;
  • the Virtual Language Observatory for resource discovery;
  • tools to support resource creation and enhancement;
  • European Persistent Identifier Consortium (EPIC) service;
  • repository services in the centres for archiving, preservation and sharing;
  • federated identity management (including a CLARIN Identity Provider, a service provider federation and cross-federation)

Services that will be available in the future include:

  • Federated Content Search (in development)
  • Monitoring (currently in alpha)
  • Center Registry (alpha)
  • Virtual Collection Registry (alpha)
  • Workspaces + SimpleStore (alpha)
  • Safe Replication (alpha)

The workshop then moved on to consider the various projects associated with CLARIN-D. Angelika Storrer (TU Dortmund) spoke about her experiences in corpus-based language analysis in research and teaching. The requirements which she identified were of particular interest:

  • One common interface with a German language version and German online tutorials
  • Tools for working further with the results of search queries (clean-up and search again; manually annotate and search again; interface to statistical tools)
  • Word sense disambiguation / semantic clustering tools
  • Orthographic variation tools: important issue when dealing with historical corpora or with computer-mediated communication, e.g. Stress / Streß
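
The orthographic variation requirement can be illustrated with a very small sketch. It is not a CLARIN tool: it simply assumes that folding case and mapping “ß” to “ss” is enough to treat Stress and Streß as the same word, which is what Python's built-in str.casefold() does. Real historical corpora would need far richer variant rules; the function names below are invented for this illustration.

```python
def normalize(token: str) -> str:
    """Fold case; str.casefold() also maps the German 'ß' to 'ss'."""
    return token.casefold()


def find_variants(query: str, tokens: list[str]) -> list[str]:
    """Return every token whose normalized form equals the normalized query."""
    key = normalize(query)
    return [t for t in tokens if normalize(t) == key]


# Illustrative token list (not taken from any corpus).
tokens = ["Der", "Streß", "und", "der", "Stress", "im", "Stressabbau"]
matches = find_variants("Stress", tokens)
```

Here a search for “Stress” would return both spellings, but not the compound “Stressabbau”, because only whole tokens are compared.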

Annette Hautli (Konstanz) is part of a team that is aiming to tackle the problem with an innovative combination of methods coming from three disciplines: Linguistics, Visual Analytics and Political Science. It is clear that the proposed process of automatic pragmatic annotation of naturally occurring speech data is ambitious, and it is not yet clear that effective results can be obtained. Furthermore, the data set used, which seemed to consist of transcripts of interviews carried out by the political scientists in the project, is not really the sort of “naturally occurring” speech event that the linguistic methods were developed to deal with, and the eradication of biases and formulation of appropriate interpretations of the data will be difficult. In this sense, it will be an interesting collaboration between the social sciences and other disciplines. On a technical note, the team observed that a multi-layered annotation approach would be useful, although they don’t have the tools at present.

Eva-Maria Wunder (Augsburg) introduced her PhD work on searching for evidence of second-to-third language interference in language learners (e.g. if a Chinese speaker learns English and then German, how does this affect their German pronunciation?). While she didn’t address the methodological problem that looking for English influence in pronunciation is difficult when “English” is not one accent, this probably wasn’t the place for such discussions, and she introduced the CLARIN tools Wikispeech and WebMAUS which are supporting her work.

Kirsten Bergmann spoke about the challenges of integrating multimodal resources into the CLARIN infrastructure, such as the SaGa speech and gesture corpus, sign language materials, and “sociable machines” under development in Bielefeld.

Ingmar Schuster (Leipzig) described one of the curation projects, which aims to build a “reproducible research platform” to support “reproducible data-driven linguistics”. The platform is a development of the Potsdam Mind Research Repository (PMR2), and incorporates: pre-prints; Open Journal Systems (including OAI-PMH and a CMDI plug-in); an author submission system (reducing the admin load of the centre supporting the system); data publication; “non-significant” (presumably negative) results; and R integration, with a web application variant, since most researchers in this field use R.

Christian Mair (Freiburg) described the integration of the Virtual Linguistics Campus (VLC, a suite of online distance learning resources) into CLARIN. The aim is to create an accessible digital resource for a mass market, expanded by a large number of users (to build a web-based community of practice). This could evolve from an e-learning resource into a multi-functional digital language resource: from teaching through research-based teaching to research. There are ongoing issues of quality control, as well as unexplored potential and obvious synergies with other CLARIN ventures, e.g. the integration of distributed corpora.

Thomas Gloning (Gießen) described another curation project, on the integration of German historical philological resources. The ultimate aim is to integrate the textual resources of the 15th to the 19th centuries into a reference corpus of historical German, including a workflow for the future integration of further resources. Integrating various textual resources will not provide a corpus in a strict sense but rather a huge repository, from which users can use metadata to build up subcorpora according to relevant criteria, e.g. text type (newspaper reports, plant descriptions), decade (texts from the 1680s), topic (texts on alchemy, cookery, medicine, etc.). Anticipated outcomes of making such a resource available include a new historical dictionary of New High German from the 17th to the 21st centuries based on corpus principles. Innumerable projects on more specific themes would also result, for example investigating the history of foreign words, the emergence of specialized vocabulary, evidence of language change, etc., leading to new models and theories.
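
As a rough illustration of building subcorpora from metadata, the sketch below filters a list of records by text type, decade and topic. Everything in it is an assumption made for the example: the record fields, the function name and the sample entries are invented, and the project's actual metadata schema was not presented in this form.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical metadata record for one text in the repository."""
    title: str
    text_type: str
    year: int
    topic: str


def build_subcorpus(
    records: Iterable[TextRecord],
    text_type: Optional[str] = None,
    decade: Optional[int] = None,
    topic: Optional[str] = None,
) -> List[TextRecord]:
    """Keep records matching every criterion that is not None.

    `decade` is the first year of the decade, e.g. 1680 for the 1680s.
    """
    selected = []
    for record in records:
        if text_type is not None and record.text_type != text_type:
            continue
        if decade is not None and not (decade <= record.year < decade + 10):
            continue
        if topic is not None and record.topic != topic:
            continue
        selected.append(record)
    return selected


# Invented sample entries, for illustration only.
repository = [
    TextRecord("Sample A", "newspaper report", 1684, "medicine"),
    TextRecord("Sample B", "plant description", 1687, "botany"),
    TextRecord("Sample C", "newspaper report", 1702, "cookery"),
]
texts_1680s = build_subcorpus(repository, decade=1680)
```

Criteria can be combined, so build_subcorpus(repository, text_type="newspaper report", decade=1680) would narrow the selection to newspaper reports from the 1680s.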

Alexander Geyken (BBAW) announced plans to write a user manual or handbook (Benutzerhandbuch) for CLARIN-D services. The target audience sectors will be:

  • researchers who have/want to develop Language Resources, Tools and Services (LRTS) and want to make them CLARIN-D compatible
  • researchers who want to learn more about the solutions adopted in CLARIN-D
  • technical staff supporting researchers in resource development and migration.

The manual will aid the migration of LRTS to the CLARIN-D infrastructure, with the following benefits:

  • linking to a larger community / visibility of resources
  • interoperability
  • long-term preservation by CLARIN-D service centers.

Among the challenges presented by the plan are the relations of this manual with the emerging standards and procedures of the CLARIN ERIC, which are intended to be Europe-wide in their application. Also, centres and research creation projects will need to make decisions at particular points in time regarding standards, which might be made difficult by the nature of the handbook as a “living document” with constant updates and changes. Nevertheless, this work should provide an excellent foundation for future work in documenting CLARIN procedures.

Frank Wiegand (BBAW) explained the project to build the Deutsches Textarchiv (DTA), which will identify and integrate distributed text resources into a large reference corpus for German (1650-1900). Some of the work to produce editions for the corpus is being done in

Thomas Eckart (Leipzig) reported on infrastructural and CLARIN-related aspects of the eAqua project, which is working on the extraction of structured knowledge from ancient sources. The project aims to develop tools as small independent components available as services via SOAP and REST, to support the reuse of data and algorithms, to promote interaction and interchange with existing projects in Digital Humanities, and to allow the integration of existing data resources. The team aims to use existing standards, and so plain text and TEI have been selected as input formats for the CLARIN workspace. They have built a TEI text integrator, which automatically ingests texts into a repository, allocates a PID and generates CMDI metadata (which is then pushed to the Virtual Language Observatory aggregator). The full text will be offered to the CLARIN Federated Content Search, with output in TCF/txt/XML/HTML.
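
The integrator steps described above (take in a TEI text, store it, allocate an identifier, generate metadata) might be sketched roughly as follows. This is not the eAqua code. The function name, the in-memory dictionary standing in for a repository, the placeholder identifier and the tiny metadata dictionary are all assumptions for illustration; a real system would obtain PIDs from a persistent identifier service and produce CMDI as XML. Only the TEI namespace URI is the standard one.

```python
import uuid
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def ingest_tei(tei_xml: str, repository: dict) -> dict:
    """Parse a TEI document, store it, and return the stored record."""
    root = ET.fromstring(tei_xml)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    text_el = root.find(".//tei:text", TEI_NS)

    title = title_el.text.strip() if title_el is not None and title_el.text else "untitled"
    # Collect all character data under <text> and collapse whitespace.
    full_text = " ".join("".join(text_el.itertext()).split()) if text_el is not None else ""

    # Placeholder identifier only; NOT a real persistent identifier.
    pid = f"placeholder-pid/{uuid.uuid4()}"
    record = {
        "pid": pid,
        # Stand-in for CMDI metadata, which would really be an XML document.
        "metadata": {"title": title, "format": "TEI"},
        "full_text": full_text,
    }
    repository[pid] = record
    return record


SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample text</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>Hello   world.</p></body></text>
</TEI>"""

repo = {}
stored = ingest_tei(SAMPLE, repo)
```

Pushing the metadata to an aggregator and exposing the full text to a search service are left out, since the report gives no detail on how those steps are implemented.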

After the presentation of this impressive array of projects, Erhard Hinrichs returned to the stage to introduce the CLARIN ERIC, the new legal and organisational framework underpinning the Europe-wide CLARIN research infrastructure, and its relationship with CLARIN-D. In short, ERICs are reliant on national funding and national infrastructure initiatives. The challenge will be to integrate numerous national infrastructures of varying size, scope and maturity into a coherent European infrastructure. The CLARIN ERIC started operation in the spring of 2012, with nine founding members: Austria, Bulgaria, Czech Republic, Denmark, Estonia, Germany, The Netherlands, Nederlandse Taalunie (the Dutch Language Union, an international organization based in Flanders and the Netherlands), and Poland. Six additional members are expected by the end of 2012: Croatia, Finland, Latvia, Lithuania, Norway and Slovenia.

Thomas Zastrow (Tübingen) introduced the EUDAT Data Project, which brings together a consortium of research communities and national data and high performance computing centers. It aims to contribute to the production of a collaborative data infrastructure (CDI) to support Europe’s scientific and research data requirements, and to deal with the “data tsunami” (note that it is no longer merely a deluge!). As well as CLARIN, there are participants from Earth sciences (EPOS), Climate sciences (ENES), Environmental sciences (LIFEWATCH), and Biological and medical sciences (VPH).

Erik Ketzan and Pawel Kamocki (IDS, Mannheim) introduced the CLARIN-D legal helpdesk and “Three Important Legal Concepts for Language Scientists in Germany”. The first two of these concepts represented encouraging news about the relatively liberal provisions of German law for personal scientific use and implied licences. However, we should note that services built on these exceptions will pose problems for the CLARIN infrastructure, the boundaries of which are EU-wide (at least). It remains to be seen how we can deal with problems of identifying the relevant legal jurisdictions for complex workflows involving cross-national collaborations and distributed architectures. It might prove necessary to base services on the assumption of the lowest common denominator of EU-wide legal principles, rather than on those of the most liberal country. (By the way, the third concept was the potential landmine of database rights!)

In summary, it was extremely encouraging to see the plans of CLARIN, first conceived many years ago, start to come to fruition. The connections now being made in Germany with key communities of academic researchers are of paramount importance, and will need to be carried on in other countries. There were a few niggling doubts in this respect – it would have been good to find out more about connections with literary scholars, and with TextGrid and DARIAH. But overall, CLARIN-D shows a remarkable level of maturity, at both technical and organisational levels. There are numerous key challenges ahead, but this community seems well-equipped to address them. We have seen the future of language resources, tools and services, and it works!

