The Oxford Text Archive in 2013

The New Year promises to be an exciting one for the Oxford Text Archive. As well as new accessions to the archive, new services and new collaborations, we plan to integrate the archive further into the new research data management services at the University of Oxford. This will involve working more closely with the Bodleian Libraries, who are embarking on a number of ambitious projects to serve the requirements of researchers for working with digital data.

The last year has seen the biggest ever expansion in the archive, with the accession of more than 2,000 texts from the Eighteenth Century Collections Online Text Creation Partnership. These are made available under Creative Commons licences, another new venture for the OTA, and we plan to release future accessions with the relevant CC licence. These texts, along with all other XML resources, are now made available in a variety of formats, including popular ebook formats, converted automatically by the OxGarage web service. We are planning future releases of Early English Books Online (EEBO) texts as they come into the public domain.

The Oxford Text Archive has taken over the management and distribution of the British National Corpus. We are not able to give support for the Xaira software, which continues as an open source project, but we continue to distribute copies of the corpus. In 2013 we will open a consultation with the corpus linguistics community and other stakeholders on how to open up access to the corpus. We aim to make more widely available a BNCWeb service hosted by the National e-Infrastructure Service with secure authentication for users in educational establishments. The excellent online services listed at http://www.natcorp.ox.ac.uk/ continue to be available.

The OTA also hopes that in 2013 we will be able to make more links with CLARIN infrastructure services and projects. OTA resources are already visible via the CLARIN Virtual Language Observatory, and we hope to participate in the federated content search demonstrator which is being built now. However, proper participation for service centres like the OTA, and for other institutions and individual researchers, does require that the UK funders and policymakers finally acknowledge the importance of the emerging European research infrastructure. Regrettably, attempts to engage research councils, JISC and the UK Access Management Federation in these processes continue to founder. Let’s hope for more progress in 2013, and that policymakers start to act on their promises about building and promoting digital research infrastructure in the UK.


CLARIN Germany: Happy First Birthday!

A workshop was held in Leipzig last month to mark the end of the first year of CLARIN-D, the national initiative in Germany to build a research infrastructure as part of the Common Language Resources and Technology Infrastructure. The wider CLARIN effort is Europe-wide and aims to link up repositories, services and researchers in the social sciences and humanities who are making use of the wide range of digital datasets and tools for processing human language. More details of the workshop, including all of the presentations, are available here.

Greg Crane, the newly appointed Professor of Digital Humanities at the University of Leipzig, kicked off the event with a stimulating presentation which situated CLARIN in the wider context of the evolution of the humanities and, more recently, the digital humanities. Greg suggested that we should provide platforms and tools for students and citizen scholars to contribute to research and to the accumulation of knowledge, culminating in the challenge: “How can we foster a new global Republic of Letters?”

Erhard Hinrichs (Tübingen), the coordinator of CLARIN-D, introduced the overall initiative as a “web and centre-based research infrastructure for the social sciences and humanities”. CLARIN aims to build an integrated, interoperable, scalable and sustainable research infrastructure via a network of centres. Language resources and tools (LRTs) will be deployed as services for researchers in the social sciences and humanities. CLARIN-D has 9 centres: BAS, University of Munich; BBAW, Berlin; IDS, Mannheim; MPI, Nijmegen; University of Hamburg; University of Leipzig; Saarland University; University of Stuttgart; and University of Tübingen.

Erhard reassured us that CLARIN-D has taken to heart the words of John Wood from the Knowledge Exchange Workshop in Berlin in September 2009:

“Research infrastructures that do not take user needs into account from the very start run the risk of becoming empty infrastructures.”

There are working groups for 9 humanities and social science disciplines. These discipline-specific working groups act as catalysts, linking CLARIN-D to the research communities. They choose key resources and tools from their communities and advise and supervise their integration into the CLARIN-D infrastructure (in the so-called “curation projects”). CLARIN-D is also working with many of the BMBF-funded eHumanities projects. It also has work packages devoted to liaison with the CLARIN-ERIC and with DARIAH, an emerging humanities e-infrastructure, as well as to legal and ethical issues, support and helpdesk, and training and education.

Dieter van Uytvanck (MPI Nijmegen) introduced the distributed technical architecture of CLARIN in the context of an infrastructure to support researchers throughout the life-cycle of their work. He also situated CLARIN in the context of a (European) ecosystem of infrastructures:

  • Community Services – CLARIN
  • Cross Community Services – DASISH
  • Compute Services – DEISA
  • Data Services – EUDAT
  • Grid Services – EGI
  • Network Services – GEANT

Dieter outlined the services which are available now:

  • WebLicht for resource processing and workflow management;
  • the Virtual Language Observatory for resource discovery;
  • tools to support resource creation and enhancement;
  • European Persistent Identifier Consortium (EPIC) service;
  • repository services in the centres for archiving, preservation and sharing;
  • federated identity management (including a CLARIN Identity Provider, a service provider federation and cross-federation).

Services that will be available in the future include:

  • Federated Content Search (in development)
  • Monitoring (currently in alpha)
  • Center Registry (alpha)
  • Virtual Collection Registry (alpha)
  • Workspaces + SimpleStore (alpha)
  • Safe Replication (alpha)

The workshop then moved on to consider the various projects associated with CLARIN-D. Angelika Storrer (TU Dortmund) spoke about her experiences in corpus-based language analysis in research and teaching. The requirements which she identified were of particular interest:

  • One common interface with a German language version and German online tutorials
  • Tools to work further with the results of search queries (clean-up and search again; manually annotate and search again; interface to statistical tools)
  • Word sense disambiguation / semantic clustering tools
  • Orthographic variation tools: an important issue when dealing with historical corpora or with computer-mediated communication, e.g. Stress / Streß
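The last requirement can be illustrated with a small sketch. This is not a CLARIN tool; it only shows, assuming that variant spellings should be collapsed to a common key before matching, how Unicode case folding lets a search for Stress also find Streß. The function names `search_key` and `find_variants` are made up for this illustration.

```python
def search_key(word: str) -> str:
    """Return a spelling-insensitive key for matching orthographic variants.

    str.casefold() applies Unicode full case folding, which lowercases the
    word and also maps the German sharp s to "ss".
    """
    return word.casefold()


def find_variants(query: str, tokens: list[str]) -> list[str]:
    """Return every token whose folded form matches the folded query."""
    key = search_key(query)
    return [token for token in tokens if search_key(token) == key]


# Invented sample tokens, for illustration only.
tokens = ["Stress", "Streß", "Strasse", "Straße", "Stroh"]
print(find_variants("Stress", tokens))
```

Case folding only covers this one kind of variation. Historical spelling and the conventions of computer-mediated communication would need rule-based or dictionary-based mapping on top of it.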

Annette Hautli (Konstanz) is part of a team that is aiming to tackle the problem with an innovative combination of methods coming from three disciplines: Linguistics, Visual Analytics and Political Science. It is clear that the proposed process of automatic pragmatic annotations of naturally occurring speech data is ambitious, and it is not yet clear that effective results can be obtained. Furthermore, the data set used, which seemed to be transcripts of interviews carried out by the political scientists in the project, is not really the sort of “naturally occurring” speech events that the linguistic methods were developed to deal with, and the eradication of biases and formulation of appropriate interpretations of the data will be difficult. In this sense, it will be an interesting collaboration between the social sciences and other disciplines. On a technical note, it has been noted that a multi-layered annotation approach would be useful, although they don’t have the tools at present.

Eva-Maria Wunder (Augsburg) introduced her PhD work on searching for evidence of second-to-third language interference in language learners (e.g. if a Chinese speaker learns English and then German, how does this affect their German pronunciation?). While she didn’t address the methodological problem that looking for English influence in pronunciation is difficult when “English” is not one accent, this probably wasn’t the place for such discussions, and she introduced the CLARIN tools Wikispeech and WebMAUS which are supporting her work.

Kirsten Bergmann spoke about the challenges of integrating multimodal resources into the CLARIN infrastructure, such as the SaGa speech and gesture corpus, sign language materials, and “sociable machines” under development in Bielefeld.

Ingmar Schuster (Leipzig) described one of the curation projects, which aims to build a “reproducible research platform” to support “reproducible data-driven linguistics”. The platform is a development of the Potsdam Mind Research Repository (PMR2), and incorporates pre-prints; Open Journal Systems (including OAI-PMH and a CMDI plug-in); an author submission system (reducing the admin load of the centre supporting the system); data publication; “non-significant” (presumably negative) results; and R integration, with a web application variant, since most researchers in this field use R.

Christian Mair (Freiburg) described the integration of the Virtual Linguistics Campus (VLC, a suite of online distance learning resources) into CLARIN. The aim is to create an accessible digital resource for a mass market, expanded by a large number of users (to build a web-based community of practice). This could evolve into a multi-functional digital language resource from an e-learning resource: from teaching through research-based teaching to research. There are ongoing issues of quality control, and as yet unexplored potential and obvious synergies with other CLARIN ventures, e.g. the integration of distributed corpora.

Thomas Gloning (Gießen) described another curation project, on the integration of German historical philological resources, ultimately aiming to integrate the textual resources of the 15th to the 19th centuries into a reference corpus of historical German, and including a workflow for future integration of further resources. Integrating various textual resources will not provide a corpus in a strict sense but rather a huge repository, from which users can use metadata to build up subcorpora according to relevant criteria, e.g. text type (newspaper reports, plant descriptions), decade (texts from the 1680s), topic (texts on alchemy, cookery, medicine, etc.). Anticipated outcomes of making such a resource available include a new historical dictionary of New High German from the 17th to the 21st centuries based on corpus principles. Innumerable projects on more specific themes would also result, for example investigating the history of foreign words, emergence of specialized vocabulary, evidence of language change, etc., leading to new models and theories.
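The idea of building subcorpora from repository metadata can be sketched roughly as follows. This is only an illustration of the principle and not the project's design: the record fields (`year`, `text_type`, `topics`), the function name and the sample records are all invented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical metadata record for one text in the repository."""
    title: str
    year: int
    text_type: str
    topics: frozenset


def build_subcorpus(records, text_type=None, decade=None, topic=None):
    """Select the records that match every criterion that is supplied.

    decade is given as its first year, e.g. 1680 for the 1680s.
    """
    selected = []
    for record in records:
        if text_type is not None and record.text_type != text_type:
            continue
        if decade is not None and not (decade <= record.year < decade + 10):
            continue
        if topic is not None and topic not in record.topics:
            continue
        selected.append(record)
    return selected


# Invented sample records, for illustration only.
repository = [
    TextRecord("Sample A", 1684, "newspaper report", frozenset({"trade"})),
    TextRecord("Sample B", 1687, "plant description", frozenset({"medicine"})),
    TextRecord("Sample C", 1712, "recipe", frozenset({"cookery"})),
]

for record in build_subcorpus(repository, decade=1680):
    print(record.title)
```

Because the selection is driven entirely by metadata, the quality of the resulting subcorpus depends on how consistently the contributing resources describe text type, date and topic.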

Alexander Geyken (BBAW) announced plans to write a user manual or handbook (Benutzerhandbuch) for CLARIN-D services. The target audience sectors will be:

  • researchers who have/want to develop Language Resources, Tools and Services (LRTS) and want to make them CLARIN-D compatible
  • researchers who want to learn more about the solutions adopted in CLARIN-D
  • technical staff supporting researchers in resource development and migration.

The manual will aid the migration of LRTS to the CLARIN-D infrastructure, with the following benefits:

  • linking to a larger community / visibility of resources
  • interoperability
  • long-term preservation by CLARIN-D service centers.

Among the challenges presented by the plan are the relations of this manual with the emerging standards and procedures of the CLARIN ERIC, which are intended to be Europe-wide in their application. Also, centres and research creation projects will need to make decisions at particular points in time regarding standards, which might be made difficult by the nature of the handbook as a “living document” with constant updates and changes. Nevertheless, this work should provide an excellent foundation for future work in documenting CLARIN procedures.

Frank Wiegand (BBAW) explained the project to build the Deutsches Textarchiv (DTA), which will identify and integrate distributed text resources into a large reference corpus for German (1650-1900). Some of the work to produce editions for the corpus is being done in de.wikisource.org.

Thomas Eckart (Leipzig) reported on infrastructural and CLARIN-related aspects of the eAqua project, which is working on the extraction of structured knowledge from ancient sources. The project aims to develop tools as small independent components available as services via SOAP and REST, to support the reuse of data and algorithms, to promote interaction and interchange with existing projects in Digital Humanities, and to allow the integration of existing data resources. They aim to use existing standards, and so plain text and TEI have been selected as input formats for the CLARIN workspace. They have built a TEI text integrator, which automatically pulls texts into a repository, allocates a PID, and generates CMDI metadata (which is then pushed to the Virtual Language Observatory aggregator). The full text will be offered to the CLARIN Federated Content Search, with output in TCF/txt/XML/HTML.

After the presentation of this impressive array of projects, Erhard Hinrichs returned to the stage to introduce the CLARIN ERIC, the new legal and organisational framework underpinning the Europe-wide CLARIN research infrastructure, and its relationship with CLARIN-D. In short, ERICs are reliant on national funding and national infrastructure initiatives. The challenges will be to integrate numerous national infrastructures of varying size, scope and maturity into a coherent European infrastructure. The CLARIN ERIC started operation in the Spring of 2012, with nine founding members – Austria, Bulgaria, Czech Republic, Denmark, Estonia, Germany, The Netherlands, Nederlandse Taalunie (the Dutch Language Union, an international organization based in Flanders and the Netherlands), and Poland. Six additional members are expected by the end of 2012: Croatia, Finland, Latvia, Lithuania, Norway, Slovenia.

Thomas Zastrow (Tübingen) introduced the EUDAT Data Project, which brings together a consortium of research communities and national data and high performance computing centers, aiming to contribute to the production of a collaborative data infrastructure (CDI) to support Europe’s scientific and research data requirements, and to deal with the “data tsunami” (note that it is no longer merely a deluge!). As well as CLARIN, there are participants from Earth sciences (EPOS), Climate sciences (ENES), Environmental sciences (LIFEWATCH), and Biological and medical sciences (VPH).

Erik Ketzan and Pawel Kamocki (IDS, Mannheim) introduced the CLARIN-D legal helpdesk and “Three Important Legal Concepts for Language Scientists in Germany”. The first two of these concepts represented encouraging news about the relatively liberal provisions of German law for personal scientific use and implied licences. However, we should note that services built on these exceptions will pose problems for the CLARIN infrastructure, the boundaries of which are EU-wide (at least). It remains to be seen how we can deal with problems of identifying the relevant legal jurisdictions for complex workflows involving cross-national collaborations and distributed architectures. It might prove necessary to base services on the assumption of the lowest common denominator of EU-wide legal principles, rather than on those of the most liberal country. (By the way, the third concept was the potential landmine of database rights!)

In summary, it was extremely encouraging to see the plans of CLARIN, first conceived many years ago, start to come to fruition. The connections now being made in Germany with key communities of academic researchers are of paramount importance, and will need to be carried on in other countries. There were a few niggling doubts in this respect: it would have been good to find out more about connections with literary scholars, and with TextGrid and DARIAH. But overall, CLARIN-D shows a remarkable level of maturity, at both technical and organisational levels. There are numerous key challenges ahead, but this community seems well-equipped to address them. We have seen the future of language resources, tools and services, and it works!


Silos or fishtanks?

The following is a partial summary of a presentation given at the Interedition Symposium in The Hague in March 2012 on the topic of Scholarly Digital Editions, Tools and Infrastructure.

People often talk about digital silos in the context of digital resources in the humanities. The problem is that resources, although valuable in themselves, are scattered across the web, where they might be difficult to find; they all have their own individual interfaces and registration procedures, and are not connected with similar or related resources. So you can’t easily search the Old English Corpus (available either for download with no software from the OTA, or online via numerous university library portals to local users). Some resources, like the ARCHER corpus, you can’t access at all unless you’re friends with someone at the University of Manchester.

Silo image from Doc Searls (dsearls)

This is clearly far from ideal. But what alternative, more connected, architectures are most appropriate to achieving interoperability and sustainability of the arena of digital textual scholarship? The emergence of fast and high capacity networks, a deluge of data, and web service APIs mean that it is increasingly possible to imagine and build distributed architectures for scholarly services, where data, tools, computing resources, and the outputs of annotation and analysis live in different parts of the network but can be brought together virtually in the user’s desktop environment. The current concerns about ‘digital silos’, in which the outputs of digital humanities projects are deployed online unconnected to other resources, and with limited sustainability, are directly addressed by this vision.

I want to put forward the argument for distributed architectures, while reviewing some of the risks and problems, and to survey some current moves towards such an infrastructure. And I also want to suggest another metaphor as an alternative to the ‘silo’.

An open and fully distributed architecture where the resources are located in different places can have the advantages of allowing the following services to be created:

  • potentially unlimited functionality, since developers can deploy content and tools that they want to use, and which can interoperate with other data, tools and infrastructure services;
  • building ad hoc collections and corpora across different repositories;
  • complex workflows, for example piping together web services from different locations;
  • protected resources (e.g. works in copyright, sensitive data) curated in situ yet still analysed online via web applications which access the data via a secure infrastructure.
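To make the idea of piping services together a little more concrete, here is a minimal sketch. It assumes that each step in a workflow takes the output of the previous one; the three steps are local stand-in functions invented for this illustration, where a real workflow would call web services hosted at different centres.

```python
from functools import reduce


def run_workflow(data, steps):
    """Pass data through each step in turn, like piping between services."""
    return reduce(lambda result, step: step(result), steps, data)


# Local stand-ins for services that would live at different locations.
def tokenise(text):
    return text.split()


def lowercase(tokens):
    return [token.lower() for token in tokens]


def count_types(tokens):
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts


result = run_workflow("The fish and the tank", [tokenise, lowercase, count_types])
print(result)
```

The point of the design is that the workflow runner knows nothing about the individual steps, so any step could be swapped for a remote service with the same input and output.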

All of this can happen in a situation with a better division of labour than we typically have now: the repositories don’t have to worry about tools; tool and content developers don’t have to worry about creating entire online environments; tool developers don’t have to worry about data management; users don’t have to install software; etc. The emergence of an ‘ecosystem’ with numerous actors providing content, tools, computing resources, and other infrastructure services provides a flexibility and resilience, and a potential for sustainability, which is not possible for a single-site or other more closed or monolithic system.

So let’s consider the unconnected, problematic online resource as a fishtank rather than a silo.

Goldfish image from Praveen Gupta (praveengupta)

There are lots of fishtanks out there, and they can be very large, elaborate, pretty, sophisticated, long-standing and sustainable. But they’re all in different places and they are not connected with each other. If you want to see a variety of fish, you have to visit a lot of houses, try to negotiate access to their fishtanks, and make use of whatever facilities they have for viewing or otherwise analysing the fish. Some places are better than others to visit: aquariums might have very good facilities and lots of information, but you still can’t view the fish in one aquarium alongside the fish in another, and it’s hard to compare them.

And if I want to keep a fish I have to build a fishtank and maintain a fishtank, or I could find someone else’s fishtank to put it in, but then it’s difficult for me to get access and control the environment. And who’s going to carry on feeding the fish? We can probably agree that it’s better if we don’t all try to make and look after our own fishtanks, at least not if our main goals are to enable as many people as possible to get into looking after, breeding and sharing fish, and if we want to be able to see a wide variety of fish. Wouldn’t it be better to have an ecosystem where we can all set our fishes free to swim together?

Marine Ecosystem image from www.sciencelearn.org.nz

This way, everyone can access all of the riches of the deep and it’s a lot easier to get into fish research.

Of course, ecosystems can be dangerous places, with predators and diseases, and they can be fragile. You could also argue that what fishkeepers really want is the experience of nurturing their own fish, and the enjoyment of setting up and maintaining their own micro-infrastructure, and therefore fishtanks are the best solution. But there are limits to the applicability and relevance of any metaphor.

There are potential disadvantages to distributed infrastructures, and many of them relate to the additional complexity that they introduce into the access and identity management arrangements. Arranging access to services in one location can be hard enough, but authorization to use, for example, textual data in more than one repository might require passing of information between institutions. It is also the case that while there are reasonably well-established technologies, procedures and agreements for controlling access to online content, the authorization of web services is not such a well-established area. Furthermore, authorization to access online content cannot easily be passed on to authorize access to the computer processing power that is necessary to carry out an online textual analysis, if this is being provided by another centre in the distributed infrastructure. In summary, the fact that distributed services are reliant on cross-institutional agreements and arrangements adds an extra hurdle to be crossed to participate, as data provider or user, and adds a layer of complexity and additional risk to the robustness of services.

Other potential disadvantages of distributed infrastructures include:

  • Registering persistent identifiers with a shared service becomes desirable to sustain the interoperability of content and applications, thus adding another level of complexity to the curation of the data;
  • Monitoring of usage is difficult, since operations are being carried out on remote servers not under the control of the repository;
  • Monitoring of the availability of services is difficult – it might be possible to test the status of individual components but not a complex workflow;
  • Although underlying interoperability is essential, there is no impetus towards consistency in user interfaces, and even a tendency towards heterogeneity, and therefore fragmentation of services is likely to be maintained or even made worse;
  • Various further questions also remain (at least partially) unanswered in many cases, relating to where and how the computer processing is carried out, and how usage and services are monitored and logged.

We also need agreement at some level about our categories, formats and concepts. To get to the promised land, we need to agree on some standards. Linking datasets requires interoperability at the levels of the linguistic representations, annotations and metadata. Visualization of large datasets requires a reduction of variables, and deciding what is important and what is not. There is a tendency in the humanities for everyone to think that their way of looking at things and of categorizing things is unique. Annotations do sometimes embody the unique intellectual work of identification, categorization and interpretation of phenomena, and these are vital operations in the humanities, so it is not a surprise that this is problematic.

Another problem is that building infrastructure takes time and involves addressing complex and difficult administrative, legal, financial, political and technical barriers, often by making international agreements. So, usually, it’s easier to make ad hoc work-arounds. And building tools can be more attractive and rewarding. But actually, it’s a false opposition: enhanced infrastructure should help with tool development and deployment. An infrastructure can provide a range of simple solutions for connecting together data and tools, deploying them as reliable services, managing authentication and authorization, licensing, access to computing power, monitoring availability, connection to virtual research environments, etc.

The mistake would be to try to build the perfect all-purpose tool, or to claim to provide services for end-users which solve all of the infrastructure issues. Or to put it another way, building the biggest and best fishtank in the world doesn’t solve the problem, because you can’t get all the fish in the world in there, or allow everyone access to view every kind of configuration and interaction in there. But all too often this is what people try to do, rather than contributing a part of a wider, distributed system. Understandably, people are impatient, and our efforts and resources go into building new fishtanks, which can be fun to make, and which look good when people come to visit.


What are the Digital Humanities?

The Day of Digital Humanities on 27th March this year has provoked numerous conversations about the nature of Digital Humanities (DH). Some believe DH is a discipline or community, with its own methods, resources, communities of practice, journals, standards of evidence, etc.

Others prefer simply to use the term as a way of looking at activity across a number of humanities-related disciplines which has a significant digital component, and while it is useful to trace connections in terms of methods, resources and tools, it is preferable for digital research in the humanities to live within the historic academic disciplines. It could be argued, for example, that the work of ‘digital classicists’ should be primarily related to addressing research questions in the mainstream of classics (or relevant sub-discipline), not primarily focussed on interacting with an interdisciplinary ‘digital humanities’.

But this is simplistic: digital research can be transformative, allowing new research questions to be formulated and posed, thus transforming existing communities. DH can enable new forms of inter-disciplinary research. Geographical Information Systems (GIS), together with large historical datasets in digital form, can allow visualizations of spatial data in ways that allow new questions to be asked in, for example, economic history, literature, history of science, linguistics, toponymy, climate studies, etc. New points of contact between these disciplines are created, and also with scientists, social scientists, engineers and technologists in geographical sciences.

Where are the Digital Humanities?

Digital research in the humanities takes place in a variety of institutional frameworks, from isolated individuals in otherwise non-digital faculties to large specialist centres. There are 22 member organisations in the ‘Network of Expert Centres in the Digital Humanities in Britain and Ireland’, but there is no common template. To give a few partial examples:

  • The Oxford e-Research Centre has a strong DH team and project portfolio, but is not exclusively humanities-focussed, by any means, and the vast majority of DH activity in the university is outside of this department;
  • CRASSH at Cambridge is focussed on the arts and humanities, but is not exclusively digital;
  • The Department of Digital Humanities at KCL is an academic faculty which comes out of a merger of centres and groups who focussed on infrastructure, teaching, and technical development work on research projects;
  • The Institute of Historical Research offers a wide range of facilities and services which assist the researching, teaching, writing and dissemination of history, not all of them digital;
  • The Archaeology Data Service runs a data repository and associated services to support research, learning and teaching in Archaeology.

In fact, while there are strong overlaps in activities and organizational forms between many of the centres, there is no easily discernible common factor which is true for all centres.

This network of ‘centres’ risks failing to connect with the large number and wide range of academics engaged with digital research in the humanities who are not associated with one of these centres. The problem is writ larger at the international scale with the wider centerNet network. The answer is not necessarily to create and connect more ‘centres/centers’ to encompass the wide range of activity currently outside of them. There is no consensus on what a centre should do and how it should fit into an institution, and the very existence of a centre risks detaching practitioners of digital research from the mainstream of their disciplines.

DH@OX aims to provide a view of the wide range of DH activity across the University, and to support this activity in various ways, including facilitating communication and collaboration between researchers, and building better infrastructure and support services, but without imposing any particular boundaries, organisational models or definitions on the ‘digital humanities’.

It remains to be seen which approaches will prove most fruitful in the long term. The Day of Digital Humanities is likely to be a recurrent catalyst for ongoing reflections and discussions for many years to come.


Discovering Babel – final outcomes

This is a summary of some of the key outcomes of the Discovering Babel project, with links to where you can find out more.

Next steps

Those of you looking for electronic literary and linguistic resources, please visit the Oxford Text Archive (OTA) and the CLARIN Virtual Language Observatory. The OTA will shortly relaunch with a new look and feel, and many new resources. The VLO is under development and constantly improving.

Those of you creating and sharing language resources, please join the CLARIN-UK mailing list. This list is a forum for creators and users of linguistic resources and tools to discuss how we can go forward to develop better facilities and shared services, and to gather user requirements.

Evidence of reuse

The metadata that has been made available as part of the Discovering Babel project is being harvested by the CLARIN Virtual Language Observatory, and can be viewed on their portal. At the moment, we still have some performance issues with delivering the files via OAI-PMH, so there may only be a few records listed there, but we have identified the problem and will be fixing it in the next few days!
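For readers unfamiliar with OAI-PMH, a harvester asks a repository for records with a simple HTTP request. The sketch below only builds such a request URL and does not contact any server. The endpoint is a placeholder and not the OTA's actual address, and the helper name is invented here; the `ListRecords` verb and the `metadataPrefix` and `resumptionToken` arguments come from the protocol itself.

```python
from urllib.parse import urlencode

# Placeholder endpoint for illustration: not the real OTA address.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    When a resumptionToken is supplied it is the only argument sent besides
    the verb, because the protocol treats it as an exclusive argument.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


print(list_records_url(BASE_URL))
print(list_records_url(BASE_URL, resumption_token="abc123"))
```

A large result set is returned in chunks, each carrying a resumption token for the next request, which is one place where delivery performance can become an issue for harvesters.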

The work in Discovering Babel has contributed to an enhanced Oxford Text Archive, with more reliable and more easily discovered catalogue records, and with open access texts at persistent locations. This is designed to allow others to build services on top of our data, in a distributed environment. It has already helped to make possible the JISC-funded Great Writers project, which will, among other things, link to source texts in various formats, including epub, in the OTA.

The OTA is now also working together with the creators of Voyant at the University of Alberta, who have under development exactly the sort of tools that we imagined would bring our texts alive. Visit http://voyeurtools.org/ and paste in the following URI to get a flavour of what will be possible:

http://www.ota.ox.ac.uk/text/3253.xml

You can see more about this text at http://www.ota.ox.ac.uk/desc/3253. At the beginning of 2011, texts from the OTA were only available on request for download. Already, thanks in large part to Discovering Babel, we are seeing on our desktops the emergence of seamless access to distributed texts with remote tools in a service-oriented architecture.

Further collaborations with the National Grid Service in the UK to host language resources in the Cloud for UK researchers, with the development of a cross-repository search service for CLARIN, and shared services in Project Bamboo will all be underpinned in part by work done in Discovering Babel.

Skills needed for the project

The basic technical skills needed were for processing XML, e.g. XSLT 1.0 and 2.0, plus installation of modules in an Apache server, including Shibboleth access and identity management software. Various Perl scripts were also deployed. Exactly how to do these things in the circumstances in which we were working was not something that anyone in the team had done before. For example, we had to read about and learn the specifications for the Open Archives Initiative Protocol for Metadata Harvesting, and about the element set for describing language resources from the Open Language Archives Community, as well as the Shibboleth software. We were able to call on expertise in the Oxford University Computing Services for the fundamental technical areas and administrative procedures, and on experts in the CLARIN network across Europe for guidance on implementation in the specific scenarios for sharing language resources. Perhaps more than technical skills, knowledge of the work that was going on in our institution, nationally, and around Europe in the relevant areas was key to the success of the project.

Most significant lessons learned

  • don’t build a digital silo: engage with infrastructure initiatives, such as CLARIN, and find out about recommendations for good practice in connecting resources, such as the Resource Discovery Task Force, and avoid building an online resource which is difficult to find and unconnected to other data and tools;
  • at the technical level, be flexible. This work touched on fast-changing fields, and we needed to be prepared to learn about new things, and to change the technological solutions which we deployed. This also meant planning for future change in order to make services sustainable;
  • keep it simple: our successes were not the result of great leaps forward, or building complex and flashy front-ends and tools. Instead, we applied good practice in a systematic way in order to provide reliable services to underpin and fit into a shared services infrastructure. So simply providing crosswalks to Dublin Core from our metadata, and establishing an OAI-PMH service, opened many doors. Putting the resource files at accessible URIs on the web allows new types of service to be developed, with much easier access and more powerful functionality.

CLARIN infrastructure notes – on the record

In a recent informal meeting involving various members of the CLARIN and other infrastructure initiatives, we had an open, frank and “off the record” discussion about successes and failures so far, and plans for the future. In preparation for the meeting, and to get the discussions going, we were asked to think of five points in response to each of three questions. I’m happy to go “on the record” with mine here!

What were your original impulses and dreams [when CLARIN planning started around 2006]?

1. To build an Arts and Humanities Data Service for Europe, on the model of the AHDS in the UK, to support digital work in the literary and linguistic subject areas, and link with similar initiatives then emerging, e.g. at the CNRS in France.

2. To promote and integrate Central and East European researchers, resources and languages, continuing the work of the TELRI project in the previous period.

3. To build new European networks, built on transparency, openness and a real desire to engage with, support and improve research, to replace failed European initiatives which were sometimes built on careerism, cronyism and corruption.

4. To move the focus of language resource & tool creators (especially computational linguists) towards the requirements of Humanities researchers, making it easier for users with little technical support to do simple yet powerful things with key resources.

5. To facilitate the participation of literary and linguistic disciplines in the emerging e-Science agenda.

What are the most important successes and failures so far?

1. Success: the initiative is almost pan-European, although some key countries are not fully integrated (UK, Italy), and a very few are not involved at all (Ireland, Switzerland); the integration of former TELRI partners from central and eastern Europe was successfully achieved.

2. Success: we have succeeded in getting enough funding from national funders to make CLARIN happen!

3. Partial failure: we’ve only had fairly small-scale engagement so far of scholars to elicit detailed requirements and to develop use cases.

4. Partial failure: we haven’t made the total shift of focus of the CLARIN community away from traditional concerns (own tools and research) to production infrastructure services for the humanities and social sciences.

5. Partial failure: we have not yet created a standards-oriented ecosystem for resource and tool creators to enable them to contribute to sustainable production services. To put it another way, there is no answer to “How do I make CLARIN-conformant resources?” I hope that the forthcoming Reference Manual will at least partially solve this problem.

What are the top priorities for future work?

1. We need to work out ways to lobby for and secure funding, in a situation where, in the Humanities, there is a lack of a critical mass of researchers (in any given discipline) who want research computing infrastructure, or who see it as a top priority. This means that there is a lack of an effective lobby group of influential scholars in most forums. This is one of the disadvantages of the cross-disciplinary nature of linguistics and the language resources and tools field.

2. We need to deliver something urgently to show the relevant communities that we can do it, and to give them a clearer idea of what we intend to do. Access and authentication infrastructure (AAI) is the key to delivering any kind of production service which can show an end-to-end use case, so we should make solutions in this area a logical priority.

3. Where is the data processing going to take place, who is going to pay for it, and how will we do the accounting? We urgently need to make progress towards solutions here as well if we are to create production-quality services.

4. Humanities and social sciences research has global connections. How will we accommodate users and service providers outside of our AAI domain? As CLARIN starts to rely on national funding, there is an increased danger of two-speed progress, with some countries and communities who are currently engaged being pushed out.

5. What will the platforms for users be, and who is going to make the user interfaces? Are we going to be able to overcome fragmentation and ‘silo-building’ – can we offer a good user experience while still allowing flexibility and connectedness? If so, how, and when?


Making your language resources discoverable and reusable

By Ylva Berglund Prytz and Martin Wynne, University of Oxford

The JISC-funded Discovering Babel project has enabled the Oxford Text Archive to improve the ways in which we make our language resources available for users to find and use. Here we will explain some of the ways in which other resource creators might be able to follow in our footsteps.

Language resources are electronic collections of language data that can be used for language study and research, and are created in a number of contexts. Sometimes the main purpose of a project is to create a dataset, and in many other cases language resources are created as a part of or simply as the result of a larger project to investigate a particular aspect of language. Irrespective of why and how a resource is created, there is usually scope for making the resource available to others. This report will examine some simple ways in which creators of language resources can make it easier for others to find and reuse them.

Why make resources available?

There are many reasons why you may want to make your language resources available to others. It may be a requirement for your funding. It may be that you simply want to give something back to the community, and contribute to our accumulation of knowledge. Sharing your resources can also be a way of drawing attention to your work and getting recognition for what you are doing, and showing that it is having an impact on wider research goals. If you are able to show that something you have created is valuable to a larger group of users, this is likely to work in your favour in future grant applications, and when looking for collaborators and support from the community.

Making language resources available is also a way of minimizing duplication of effort. If you have created a resource that others can use, they do not have to spend time and resources on creating their own resource.

Replicability of research results is another important issue. If others are to test and reproduce your results, or attempt to extend or refine them, then they will need to have access to the data, tools and methods which you used. Making resources available in this way is essential to testing, refining and building on research results, and is considered necessary for the verification of research findings and interpretations in many scientific domains.

Assuming that for one of the above reasons, or for another, you want others to know about and maybe also to reuse your language resources, what are the issues that you need to consider before sharing your resources? Thinking about the questions below should make your task easier and the sharing of the resource more effective.

Issues to consider when deciding whether and how to share your resources include:

  • How do you share?
    • Will you offer metadata, to help users find, evaluate and understand your resource?
    • Will you offer a service for users to access the resource (e.g. online access, or download option only)?
    • Will you deposit the resource in an archive or repository (instead of, or in addition to your own service)?
    • Or do you want to only share on request to users who get in touch with you?
  • Legal issues
    • Do you have the right to share the resources?
    • How do you protect your rights?
    • What kind of licence will you ask users to agree to?
  • Administrative and organizational issues
    • Do you have access to the resources needed to share your resources (server, staff, admin, user support, etc.)?
    • Who will be responsible for the service?
    • Are these reliable, sustainable and likely to be available in the long term?
  • Finding your users
    • How do users find your resource?
    • How can you make it easier for users to find/use your resource?
    • Can you support users?
  • Sustainability
    • How do you ensure you have the necessary resources/support/infrastructure to share your resource?
    • How do you ensure continuation of service?

Let’s now examine in more detail some of the issues relating to how to help users to find your language resources.

Making your resources discoverable

If you want to share your resources you have to make sure people know about them and can find them. The most effective way to do this is to make your metadata available to a portal which brings together information about where to find language resources in different locations. These exist in particular sub-domains (e.g. endangered languages, child language acquisition, learner language, sign language, for particular languages or language families, for historical periods, etc.), and there are a couple of more comprehensive initiatives: the Open Language Archives Community, and the CLARIN Virtual Language Observatory. Some questions to explore in order to market your resource effectively include:

  • Who are the potential users? Where do they currently look for resources?
  • What are the relevant mailing lists, conferences, and publications for your target audience?
  • Where in other domains, or sets of users, or geographical regions (beyond your immediate community or target audience) might you find interest in the resource?
  • If the resource is available online, or has a webpage associated with it, make sure you make it easy for search engines to find and index your page, for example by including the correct keywords in the website metadata (see Google’s guidelines for webmasters).

Once you have decided how and to whom you will make your resource descriptions available, you need to provide the relevant information in the right formats. If you decide to deposit your resource in a repository, you will get some assistance in doing this. If you deposit with the Oxford Text Archive, you will need to fill in a deposit form, and then the repository staff will create an electronic metadata record. This will be transformed automatically to the correct formats for the online catalogue record, for OLAC and for CLARIN. If you want to create your own records, you can follow the guidelines provided by the different repositories. Some expertise in creating and manipulating XML documents will be required.
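To give a flavour of what creating your own record involves, here is a minimal sketch in Python that writes a simple Dublin Core description in the `oai_dc` container format. The function name and the field values are hypothetical and chosen only for illustration; a real record should follow the guidelines of the repository or aggregator you are targeting, which may require more elements than shown here.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for simple Dublin Core and the OAI-PMH oai_dc container.
DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"


def build_dc_record(fields):
    """Return an oai_dc XML string built from (element name, value) pairs.

    This helper is hypothetical: it is not part of any repository's tooling.
    """
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Invented example values, not a real catalogue record.
record = build_dc_record([
    ("title", "A Sample Corpus of Spoken English"),
    ("creator", "Example, Anne"),
    ("language", "en"),
    ("rights", "http://creativecommons.org/publicdomain/zero/1.0/"),
])
print(record)
```

Elements can be repeated (for example, several `creator` entries), which is why the sketch takes a list of pairs rather than a dictionary.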

Social media

You may use social media forums such as blogs, Twitter, Facebook, Digg, del.icio.us, and Zotero, if you think that this might be a way to reach your potential users. It might prove to be a way to reach unexpected groups of users by reaching outside of the academy. Your funders might consider this to be a useful way to increase wider impact. It’s probably still not clear how appropriate and useful such methods are, and it’s a fast-changing field. But it doesn’t take much effort to tweet, announce things on Facebook, or make links on various services. Furthermore, writing blogs can be a good way to report your work to a wide variety of stakeholders and potential users.

The point of making your language resources discoverable is to facilitate the reuse of them by others. Let us now briefly examine some of the issues relating to how you can make this happen as effectively as possible, starting with avoiding any potential legal pitfalls.

Before you can share – a little more on legal issues

Before you make your resource available you have to make sure you have the right to share it. You may also want to look at what you can do to protect your rights (for example release the resource under a particular licence). You also need to consider if there are any restrictions on what users of the resource are allowed to do with it. Can they share it, add to it or develop it further? This could be set out in a user licence. Rights issues can be complex and often vary between different countries. If you have questions about what rights you have or what you need to do to have the right to share a resource, you may want to consult a legal representative for your area, for example the University lawyers or legal department.

If you are making the resource ‘freely available’, you may want to specify this with an open access licence. One way to encourage reuse is to make it simple for users to see under what conditions a resource is available.

Creative Commons (CC) licences can be used as “a simple, standardized way to grant copyright permissions to [your] creative work”. The CC licences can be used to specify that there are no restrictions whatsoever on re-use, or, for example, that people may only use the resource for non-commercial purposes or that they have to acknowledge the original creator when using it. It is also possible to specify that people may create derivatives (for example use part of the resource and/or add to it) and that such derivatives have to be made available under the same licence conditions. For more information about Creative Commons, please see http://creativecommons.org/.

Whatever rights or restrictions you assign to your resource, you need to consider whether the situation is likely to change in the future. For example, will it be the case that restrictions can be lifted after a certain date? Or do you have permission to use certain source texts only for a limited time? If so, you have to ensure that you can deal with this.

As well as considering the legal and ethical issues relating to making your language resources available, you should also consider the licensing of the metadata associated with your resources. In order for users to be able to find, evaluate and reuse the resources, good descriptions of their nature and context are necessary. It is usual in the domains using language resources for these descriptions to be made freely available, but usually there is not a specific and clear statement of the terms under which they are made available. In order to avoid any restrictions on the free sharing of metadata, and to ensure that maximum use is made of it, it is better to assign a specific open access licence to all metadata records, such as ODC-PDDL or a Creative Commons licence.

In the case of the Oxford Text Archive, some of our resources are TEI XML documents, with the metadata embedded in the header of a single file which also contains the resource in the body. For these it was necessary to apply a single licence to both metadata and data, and since we have found that Creative Commons licences best fulfil our needs for licensing the textual data (in most cases), we opted for those. In cases where we make just the metadata available, for example as a catalogue, and to metadata harvesters, we will apply the least restrictive possible Creative Commons licence, usually known as the ‘no copyright’ or ‘CC0’ licence (http://creativecommons.org/publicdomain/zero/1.0/).

How do you enable reuse of your language resources?

Depending on the nature of the resources at your disposal, you can opt to share your resources in various ways. Whatever way you choose, the key point is to ensure that the solution is not dependent on specific people, machines, projects, etc. which are likely to be transient, but rather that it is embedded in a stable organizational set-up which is adequate for providing persistent service with high availability. The key questions to ask in deciding what sort of service to offer and how to provide it are the following:

  • Is what I am setting up sustainable?
    • Is the solution technically robust and not subject to discontinuation should current funding/staffing/equipment be cut?
  • Who is responsible for the service?
    • Is this a person (named or defined by function) or an organisation (unit, department, institution)?
  • Who is responsible for the various bits of infrastructure on which the service depends?
    • Technology (server, scripts, physical server space, etc.)
    • Human resources (server maintenance, user support)
  • What will the situation be in 1, or 2, or 5, or 10 years’ time?
    • What happens if you (or the person responsible for the service or part of it) leave or take on a different role?
    • What happens at the end of the current round of funding?
    • Will additional funding be needed/be available to continue the service?
    • Would it be better to look to move the service to another institutional home?

How can I make it easier for users to use the resources?

Let’s examine some of these options in a little more detail.

Distribution via email or on disk

A simple option, especially where small resources are concerned, is to send the resource to whoever requests it, either as an email attachment (suitable for very small resources only) or on a CD or DVD.

This is only suitable for low-demand, small resources. You still have to consider legal issues, and what provision there is for making the resource available when you are not personally available to respond to requests. For distribution on disk there is also a cost, for the media and postage. What is more, the end user is left to their own devices when it comes to getting their resources connected to the relevant analysis tools. It can be tricky to work out which ones to use – will you be prepared to offer advice and technical support to users? Some will ask for it.

Online delivery

If you make your resource available online, you can opt to either make it available for download only (with some of the same problems identified above), or you may offer an online service which people can access and use via their web browser (for example a corpus with a search interface). Now a new set of questions arises:

  • Who maintains the website?
  • Can the site handle the volumes of traffic, and the amount of processing required?
  • How will you know how many users have visited the site and downloaded your resource, or performed other operations? Do you need to report this to funders or other stakeholders?
  • Who will maintain the server and ensure that the service is available?
  • Will you offer a service level description, setting down exactly what you offer and under what terms?
  • Can you monitor the availability of the online services (i.e. tell if everything is up and working properly)?
  • Do you need to restrict access to certain classes of user? If so, how will you do this?
  • Do you need to recognize users so that they can come back to datasets and workflows that they have started to assemble on previous visits?
  • How will you deal with user support or queries (technical or about the resource/service)?
  • Will the service remain available even if you leave the institution, or change your ISP?
  • Is the URL stable, or is it likely to change when the university re-designs its website (or the website host goes into administration)?
  • What happens when the technology behind the service needs updating/renewing (for example to work on different operating systems or in different browsers)?
  • Are you prepared to offer any guarantees of availability and persistence of service to users who might require stable datasets and tools for their research, or who may want to be able to come back and reproduce results at a later date?
  • How will users cite your datasets and services in their reports and publications?

Depositing your resource in an archive or repository

A lot of the issues arising from running your own web service can be avoided if you deposit your resources in a repository, which will deal with distribution, as well as perhaps offering long-term preservation, help with generating and sharing metadata, and connection with other tools and resources. In deciding whether to do this, and whether a particular repository is appropriate, you may wish to consider:

  • Is there a cost associated? If so, is it a once-off, annual, etc? How will you pay ongoing fees after the end of the project?
  • What do you have to do to deposit (for example format of resource and metadata)?
  • How stable and reliable is the repository? How long is their funding likely to be continued?
  • Who knows about the repository? Is it known to potential users of your resource? Does it share metadata with relevant aggregators, and announce new deposits in appropriate forums?
  • Who has the right to use it? Is access restricted to members of particular institutions, associations, countries, etc.? Are there technical barriers which might exclude some sets of users?

There are several archives and repositories available. The Oxford Text Archive offers a service to deposit resources for a small administrative fee. This has the advantage of being a specialist archive for literary and linguistic resources, offering metadata to aggregators in this domain, and being part of the emerging research infrastructure being developed by CLARIN. Other services exist for more specialist resource types, such as SCOTS at the University of Glasgow for Scottish and historical resources, CHILDES for language acquisition studies, ICAME for English language resources, and the Endangered Languages Archive at the School of Oriental and African Studies, University of London. Each is well embedded in its research community, and so deposit with such an archive is an excellent way to reach particular sets of users.

There is also a lot of ongoing work in developing institutional repositories in Universities in the UK. While some of these are focussed exclusively on e-prints, some offer repository services for research data as well. Creators of resources should check on the facilities and services available in their institution (often based in the library or information services department); depositing with your institutional repository may be a viable option. This may be useful for raising your profile locally and as a secure storage solution. It is however highly unlikely to satisfy all of your needs. An institutional repository which aims to cater for research output of all types and for all disciplines cannot have specialist curation expertise in all areas, and will not, for example, know about all of the relevant metadata standards, best practice in digital preservation of language resources, or connection to relevant discipline-specific resource discovery services. Repositories will typically offer non-exclusive deposit agreements, which means that when you deposit your resources, you do not give up any of your rights. There is normally no barrier to you depositing your resource in numerous archives. This is effective for preservation purposes, although you may need to consider the impact that it might have in terms of version control (will the resource be updated, and how do you check that the latest version is available in all places?), and monitoring usage.

Furthermore, it is increasingly likely that federations of archives will emerge, with the possibility of cross-searching resources, and connecting disparate collections and tools. Beyond this, sophisticated virtual research environments will emerge allowing more operations, as well as collaborations between groups of scholars, and connections to publications and other outputs. It is likely to be the specialist repositories which are connected to this new infrastructure, and it is likely to become increasingly difficult for the individual scholar to connect up their resources without the assistance of the repository and infrastructure specialists.

Whichever of the options you choose, you can help to ensure that users can work with your resource as effectively as possible by considering offering the following facilities:

  • A full description of the resource, carefully crafted user guidelines, FAQ, instructions (preferably with screenshots);
  • Support for answering user queries;
  • A forum for users where they can discuss issues that come up. Make sure that you, or someone with good knowledge of using the resource, is available to respond to queries, in particular if the forum is new or under-used;
  • Make it easy for users to give appropriate accreditation to resource creators and access services, thereby also further promoting your resource and announcing its availability;
  • Make it clear what the title of the resource is, who the creator is and where it is found (at a persistent URL);
  • Make any licence restrictions clear (especially if your licence stipulates that the resource creator/owner should be attributed by any user);
  • Include on your website a sample citation/bibliography entry that users can use for reference;
  • If you are offering an online service, test the interface during development, and try to find some resources for ongoing development in response to user feedback.

In summary, you need to take as wide a view as possible about who the potential users are, how they will find the resources, how they might want to use them, and then to think about how the arrangements will continue in the future. Good luck!


Discovering Babel: technical issues

The Discovering Babel project aims to make the digital resources in the Oxford Text Archive easier to discover for potential users. The technical issues in the project relate to the ways in which we are making the OTA catalogue data available in new ways. There are several aspects to this work:

  1. making the catalogue records available to be collected by online resource discovery services;
  2. transforming the catalogue records into a variety of different formats for the different services;
  3. updating catalogue records for the items in the archive.

Making the records available

Before Discovering Babel, the OTA metadata was available only in abbreviated form in the catalogue list on the website, and on the webpages for each resource, or in full when a user downloaded the resource. An important additional service made available as part of the project workplan was to make the full metadata available for online services to collect, or harvest. We chose to do this using the most widely used protocol for this purpose, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH for short).

In order to do this we had to take the following steps:

  • add the appropriate Apache and Perl modules to our web server to allow OAI-PMH queries to our web service;
  • implement crosswalks (using XSLT) from our metadata in TEI Header format to the Dublin Core format;
  • register as a metadata provider with relevant aggregators;
  • set up procedures to ensure the ongoing availability, persistence, maintenance and updating of the OAI-PMH service.
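For readers unfamiliar with the term, the crosswalk step in the list above amounts to mapping fields in one metadata scheme onto fields in another. Our crosswalks are written in XSLT; the sketch below uses Python instead, purely to illustrate the idea. The mapping table, the function name and the sample header are assumptions made for this example (including the use of the TEI P5 namespace), and are not the OTA's actual crosswalk.

```python
import xml.etree.ElementTree as ET

# Assumes TEI P5 headers, which live in this namespace.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Illustrative mapping from TEI Header paths to Dublin Core element names.
# A real crosswalk would cover many more fields and edge cases.
CROSSWALK = [
    (f"{TEI}fileDesc/{TEI}titleStmt/{TEI}title", "title"),
    (f"{TEI}fileDesc/{TEI}titleStmt/{TEI}author", "creator"),
    (f"{TEI}fileDesc/{TEI}publicationStmt/{TEI}publisher", "publisher"),
    (f"{TEI}fileDesc/{TEI}publicationStmt/{TEI}date", "date"),
]


def tei_header_to_dc(header_xml):
    """Return (Dublin Core element, value) pairs found in a TEI Header string."""
    header = ET.fromstring(header_xml)
    pairs = []
    for path, dc_name in CROSSWALK:
        for node in header.findall(path):
            # itertext() also collects text nested in child elements.
            text = "".join(node.itertext()).strip()
            if text:
                pairs.append((dc_name, text))
    return pairs


# An invented header, for demonstration only.
SAMPLE_HEADER = """
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt>
      <title>An Example Text</title>
      <author>Example, Anne</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Example Archive</publisher>
      <date>2011</date>
    </publicationStmt>
    <sourceDesc><p>Hypothetical source.</p></sourceDesc>
  </fileDesc>
</teiHeader>
"""

for element, value in tei_header_to_dc(SAMPLE_HEADER):
    print(f"dc:{element} = {value}")
```

Keeping the mapping in a table separate from the code that applies it makes it easier to maintain one table per target format.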

We have chosen to make metadata available in a number of formats via OAI-PMH, to fit the expectations and requirements of a number of harvesters relevant to our field. We therefore deliver Dublin Core, with extensions for the Open Language Archives Community (OLAC) and the TEI Headers. We were also planning to provide CMDI metadata for the CLARIN aggregator, but this format has not yet achieved sufficient maturity and stability, so we aim to add this later. In the meantime, the CLARIN aggregator is harvesting OLAC metadata, and in this way they are presenting OTA resources in the Virtual Language Observatory service at http://www.clarin.eu/vlo.

The OTA records are harvested from http://ota.oerc.ox.ac.uk/oai2/XMLFile/ota/oai.pl.
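As a rough illustration of what a harvester sends to this address, the Python sketch below builds two OAI-PMH request URLs. `Identify` and `ListRecords` are verbs defined by the protocol, and `oai_dc` is the metadata prefix that every OAI-PMH repository is required to support. The helper function is hypothetical, and the sketch only constructs the URLs; it makes no assumption about whether the endpoint is responding at the time you read this.

```python
from urllib.parse import urlencode

# Base URL of the OTA OAI-PMH service, as given in the text above.
BASE_URL = "http://ota.oerc.ox.ac.uk/oai2/XMLFile/ota/oai.pl"


def oai_request(verb, **params):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = urlencode({"verb": verb, **params})
    return f"{BASE_URL}?{query}"


# Ask the repository to describe itself.
print(oai_request("Identify"))
# Ask for all records in simple Dublin Core.
print(oai_request("ListRecords", metadataPrefix="oai_dc"))
```

A harvester would normally begin with the `ListMetadataFormats` verb to discover which other prefixes a repository offers before requesting records in a richer format.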

Crosswalks: transforming the metadata to different formats

We initially wrote the crosswalks using XSLT 2.0, but we found that the performance was very poor, and too slow for the harvesting services. We therefore backported the code to XSLT 1.0, which provided adequate performance and enabled the harvesters to operate. We plan to investigate these issues further together with other CLARIN centres to see if future improvements to the performance can be achieved.

What we understand so far is that the repeated calls to the Java-based XSLT 2.0 processor Saxon (in our case, using the saxonb-xslt package on Ubuntu) seem to be the problem. The original stylesheet which we wrote to transform the TEI Headers worked on a directory of header files. However, because of the way the OAI-PMH architecture works, the stylesheets had to be rewritten to work on a file-by-file basis. The Java Virtual Machine therefore starts afresh for each call of Saxon, i.e. for every metadata item. This was computationally very costly, and simply providing more computing power would not have been a good solution, since the procedure does not appear to scale easily.
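
The scaling problem can be put as simple arithmetic: if the processor is started once per record, the start-up cost is paid for every record, whereas a single run over a directory pays it only once. The Python sketch below expresses that cost model. The function names and all timing values are placeholders chosen for illustration, not measurements from the OTA server.

```python
def per_file_cost(n_records, startup_s, transform_s):
    """Total time when the processor (and its JVM) starts once per record."""
    return n_records * (startup_s + transform_s)


def batch_cost(n_records, startup_s, transform_s):
    """Total time when the processor starts once and handles every record."""
    return startup_s + n_records * transform_s


if __name__ == "__main__":
    # Placeholder values only: 500 records, 1 s start-up, 0.01 s per transform.
    n, startup, transform = 500, 1.0, 0.01
    slow = per_file_cost(n, startup, transform)
    fast = batch_cost(n, startup, transform)
    print(f"per-file: {slow:.1f} s, batch: {fast:.1f} s, ratio: {slow / fast:.0f}x")
```

Under this model the per-file total grows with the start-up time multiplied by the number of records, which is why faster hardware alone helps little when start-up dominates.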

A key point for us to consider at this stage was that our original stylesheets made use of XSLT 2.0 features, but there are few 2.0 processors available. None seems to be based on C or C++. The only real alternative to Saxon of which we were aware was the closed-source AltovaXML product, available only for Windows 32-bit architectures.

We therefore ran tests with C-based XSLT 1.0 processing (with the xsltproc package on Ubuntu), which was far faster for the hundreds of metadata records, with a speed-up of 100 to 200 times compared to Saxon. We therefore rewrote the pertinent parts of the stylesheets to conform to XSLT 1.0 and implemented this solution.

We also considered another possibility: moving to a servlet-based solution. There is a Java-based OAI implementation (jOAI), for example, which is deployed on a Tomcat server. Another option would have been to investigate setting up the Java-based Saxon XSLT 2.0 processor as a service in its own right, which could be consumed by the Perl code. Neither solution would involve starting up the JVM again and again. However, either would make it necessary to set up a server (Tomcat or Jetty, respectively), and we considered that, as well as requiring additional effort to implement, this would add a maintenance overhead, with serious risks to the robustness, persistence and sustainability of the service.

Updating the records

The OTA has always made freely available the descriptions of the electronic resources in the archive. These descriptions take the form of catalogue records, or metadata, and contain information useful to potential users about the resource, including its title, a summary of the content, where the electronic resource came from (its provenance), technical formats, types of annotation, size of the files, any restrictions on its use, etc.

The metadata for each resource is stored in an XML file and encoded according to the latest (P5) version of the Text Encoding Initiative (TEI) Guidelines. In the area of literary and linguistic computing, the TEI Guidelines are a widely recognized and respected reference point and standard for the encoding of data and metadata. The metadata for OTA resources is therefore in the form of a TEI Header.

The work in Discovering Babel on making this metadata more visible, and on transforming it into other formats, has revealed some areas where it was necessary to update, correct or add to the existing information in the metadata. For example, it was found that the description of the language of a resource was missing in some cases, usually where the language was English, which was perhaps treated as the default value in the past!
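
A check for the missing-language case could look something like the Python sketch below. It assumes that the language is recorded in the usual TEI P5 place, a `language` element with an `ident` attribute inside `langUsage`; the function name `has_language` and the two sample headers are hypothetical, and this is not the procedure actually used at the OTA.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def has_language(header_xml: str) -> bool:
    """Return True if the header declares at least one language code."""
    root = ET.fromstring(header_xml)
    for lang in root.findall(".//tei:langUsage/tei:language", TEI_NS):
        # An empty or missing ident attribute does not count as a declaration.
        if (lang.get("ident") or "").strip():
            return True
    return False


# Hypothetical, heavily abbreviated headers used only to exercise the check.
WITH_LANGUAGE = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <profileDesc>
    <langUsage><language ident="en">English</language></langUsage>
  </profileDesc>
</teiHeader>"""

WITHOUT_LANGUAGE = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <profileDesc/>
</teiHeader>"""

if __name__ == "__main__":
    for name, header in [("with", WITH_LANGUAGE), ("without", WITHOUT_LANGUAGE)]:
        status = "ok" if has_language(header) else "language missing"
        print(f"{name}: {status}")
```

Running a check of this kind over all headers gives a list of records to correct by hand, which is safer than silently assuming English.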


Discovering Babel workshop

A workshop on How to make your language resources discoverable was held at Oxford University Computing Services on Friday June 24th, as part of the JISC-funded Discovering Babel project.

Ylva Berglund-Prytz from OUCS welcomed the participants, who introduced themselves and revealed that they came from numerous universities, representing teachers, researchers, post-graduate students and archivists, from the UK and abroad. See slides (pptx).

Andy McGregor introduced the work of the Resource Discovery Task Force and the JISC programme ‘Infrastructure for Resource Discovery’, with a refreshing willingness to acknowledge the different standards and practices in different disciplines. See slides (pptx).

Martin Wynne then spoke about Discovering Babel, the project within the programme which relates to language resources, focussing on the issues relating to the different ways of describing and cataloguing language corpora (and other resources) and making those descriptions available to users in a variety of ways. See slides (pdf).

Alexander König of the Max Planck Institute for Psycholinguistics then gave a demonstration of the CLARIN Virtual Language Observatory, which is collecting and making available to users in a single place the information about language resources from all around Europe. Most impressive was the overlay of the geographical data on Google Earth, allowing users to find resources via the map. See slides (ppt).

James Wilson then spoke about the suite of projects (many of them JISC-funded) in OUCS which are addressing the more general data management needs of researchers. After the discipline-based and pan-European scope of the CLARIN initiative, it was fascinating to compare it with the kind of service provision which we might hope to find within a single institution. See slides (pptx).

In the afternoon, a ‘show-and-tell’ session allowed participants to share information about the resources and services that they were making available to other researchers. This fascinating whirlwind tour, a snapshot of the resources available in the UK, showed us all what a variety of extremely valuable datasets continue to be created.

The presentations included:

The final session was a discussion which went beyond concerns about discovering resources, and focussed more on the re-use of resources, and on ways in which they can be exploited online, cross-searched, combined together, and connected with online tools and services.

From a very open and frank discussion about our needs, concerns and frustrations there emerged a strong feeling that a UK network was needed to express our requirements more forcefully to funders and other relevant organisations who can help us to build the kind of services that we need.

Recent informal meetings with partially overlapping sets of people in Glasgow, Newcastle and Oxford have reinforced my impression that there is a strong desire to form a UK network of researchers interested in language data and tools. The motivations and proposed activities are to:

  • find ways to find, share and reuse resources;
  • develop joint projects to build resources and services;
  • promote interoperability of resources so that they can more easily be used with generic tools, and combined with each other;
  • lobby for UK funders to invest in infrastructure for creating and using language resources;
  • lobby for language data and tools to be included in national computing infrastructure;
  • lobby for UK participation in the European CLARIN infrastructure;
  • provide channels of communication between UK researchers and CLARIN (e.g. to feed in our requirements, get access to services, participate in technical discussions, etc.).

Clearly this meeting was only a starting point!


How to make your language resources discoverable

The Oxford Text Archive will host a one-day workshop on Friday June 24th entitled How to make your language resources discoverable, as part of the JISC-funded Discovering Babel project. The workshop is aimed at researchers who create and use corpora and other digital language resources, and will address the following questions:

How can we help users to find corpora and other digital language resources? Can we hope to have a one-stop shop where we can find them all?

Are there ways to describe the content of language resources in ways that help users to compare them, and find the right ones for their research?

How can I make MY language resources easier to discover and use?

Once we discover what we want, how can we make it easier to use language resources and tools? Can we create virtual research environments for corpus users?

What existing initiatives at the national and international level are addressing these problems, and what are the solutions? What can a grass-roots initiative at the UK level do?

Speakers will introduce Discovering Babel and the CLARIN Virtual Language Observatory, including presentation of the work that has been done in the OTA to make it easier for users to find and use the language resources, and how this work might help support other creators and providers of resources and services. A ‘show-and-tell’ session will then allow participants five minutes each to showcase the resources that they wish to share, or would like to have access to. Discussion will then go beyond the discovery of resources to how we can provide the services and tools that we need for online access to a variety of corpora and lexical datasets.

The workshop will also be one of the events which will launch CLARINET, a new network for UK-based researchers with an interest in furthering digital research in the language sciences and related disciplines. CLARINET will be loosely affiliated to the CLARIN European research infrastructure, and other relevant initiatives, but the focus will be on the requirements of researchers in the UK. The workshop will conclude with a round-table discussion on what CLARINET should aim to achieve.

Click here to sign up for this free workshop.
