Connecting and Integrating Language Resources and Tools with CLARIN

As part of the meeting of the CLARIN technical centres, an open tutorial session was held at the SURF offices adjacent to Utrecht train station. SURF is the collaborative ICT organisation for Dutch higher education and research.

After a welcome from Dieter Van Uytvanck, there were presentations and discussion on various aspects of the CLARIN infrastructure.

Federated Search Results

The Federated Content Search (FCS) version 2 was presented by Leif-Jöran Olsson from Språkbanken in Sweden. FCS decouples the back-end functionality of online corpus search engines from the results, and aims to aggregate and integrate results from multiple sites. CLARIN-FCS 2.0 is not a replacement for the existing functionality but an extension of it, and it will be backwards compatible with 1.0. It is built on the SRU/CQL interface specification. There are guides and tutorials on the CLARIN technical wiki.

The transport protocol encompasses an endpoint, which sits between the FCS client and the idiosyncrasies of the various search engines based in different repositories. The endpoint translates queries from the client, written in CQL or FCS-QL, into the query dialect of the local search engine, and translates query results back into the format required by the client. A discovery phase allows a client to see the functional capabilities of a search engine, and which resources are available for search.
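To make the two phases more concrete, here is a minimal sketch of the requests a client might send to an endpoint: one for discovery and one for search. The endpoint address is hypothetical, and the parameter names (`operation`, `queryType`, `x-fcs-endpoint-description`, `maximumRecords`) reflect my understanding of SRU and the FCS specification, so they should be checked against the guides on the CLARIN technical wiki before being relied on.

```python
from urllib.parse import urlencode

# Hypothetical endpoint address, used only for illustration.
ENDPOINT = "https://example.org/fcs-endpoint"


def explain_url(endpoint: str) -> str:
    """Discovery phase: ask the endpoint to describe its capabilities and resources."""
    params = {
        "operation": "explain",
        # Assumed FCS extension parameter requesting the endpoint description.
        "x-fcs-endpoint-description": "true",
    }
    return f"{endpoint}?{urlencode(params)}"


def search_url(endpoint: str, query: str, query_type: str = "cql",
               maximum_records: int = 10) -> str:
    """Search phase: send a CQL or FCS-QL query to the endpoint."""
    params = {
        "operation": "searchRetrieve",
        "queryType": query_type,  # assumed values: "cql" or "fcs"
        "query": query,
        "maximumRecords": maximum_records,
    }
    return f"{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    print(explain_url(ENDPOINT))
    print(search_url(ENDPOINT, '[lemma = "walk"]', query_type="fcs"))
```

The sketch only builds the URLs; the translation into each engine's own query dialect happens on the endpoint side and is not shown here.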

FCS 2.0 allows advanced search in multiple layers of annotation: token, lemma, pos (UD-17), orth, norm, phonetic, text. Only text search was possible in the earlier version. The FCS-QL is a superset of CQL 3.0, and there are adapters for CWB. Hits are serialized as CLARIN-FCS results, with each hit serialized as one record. Data views allow different presentations of results. The technical requirements for offering an endpoint include reference libraries SRUServer, SRUClient, FCS-QL, FCSSimpleEndpoint, and translation libraries.
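As an illustration of what a multi-layer query might look like, the sketch below assembles bracketed token expressions from layer constraints. The layer names are those listed above; the exact FCS-QL syntax (bracketed attribute tests joined with `&`) is my assumption based on its CQP-style heritage, and the helper names `token_expr` and `sequence` are hypothetical.

```python
# Annotation layers listed for FCS 2.0 advanced search.
LAYERS = {"token", "lemma", "pos", "orth", "norm", "phonetic", "text"}


def token_expr(**constraints: str) -> str:
    """Build one token expression, e.g. [lemma = "walk" & pos = "VERB"]."""
    if not constraints:
        raise ValueError("at least one layer constraint is required")
    parts = []
    for layer, pattern in constraints.items():
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        if '"' in pattern:
            # Quoting rules are not handled in this sketch.
            raise ValueError("double quotes in patterns are not supported here")
        parts.append(f'{layer} = "{pattern}"')
    return "[" + " & ".join(parts) + "]"


def sequence(*tokens: str) -> str:
    """Join token expressions into a query over consecutive tokens."""
    return " ".join(tokens)


if __name__ == "__main__":
    # A verb with a given lemma:
    print(token_expr(lemma="walk", pos="VERB"))
    # An adjective followed by a noun:
    print(sequence(token_expr(pos="ADJ"), token_expr(pos="NOUN")))
```

A query string built this way would then be sent to an endpoint, which translates it into the dialect of the local engine.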

In the discussion it was established that it is possible to set up an endpoint for some online search engine in another location not curated by a CLARIN centre.

It would also be useful to know, and for the information to be disseminated, what the best course would be for a service manager deciding now which corpus search engine platform to implement locally. What should they bear in mind for compatibility with FCS? Do certain software stacks work better out of the box? Where is there most expertise and support? I would guess that Corpus Workbench might be a good choice, but I don’t have confirmation of this.

CLARIN-FCS 3.0 is already planned and will include syntactic search and more advanced views of multiple layers. Possible hurdles to more advanced cross-searching include the mapping of POS tagsets, which can be problematic and might prove too unreliable to use effectively. It would be interesting to explore whether the system’s reliance on annotation in the text (rather than stand-off annotation) means that advanced search using annotation is not usually possible or meaningful when cross-searching resources, since annotations are normally specific to a particular corpus. Related to this is the important question of how far CLARIN will be able, or will want, to go in adding advanced features. Is the intention for the FCS to remain a ramp for discovering and reaching the more advanced features of the individual interfaces for each corpus, or will it become a corpus search platform where researchers can carry out all of their data exploration?

More information on Federated Content Search can be found at https://www.clarin.eu/content/federated-content-search-clarin-fcs.

Metadata

After a short break, we launched into a session on the latest version of the CLARIN metadata standard, CMDI 1.2, with Mitchell Seaton and Menzo Windhouwer. The session focused rather narrowly on the changes in version 1.2, without offering an overview of what CMDI is and does.

More information on CMDI can be found at https://www.clarin.eu/content/component-metadata.

Connecting applications to data

Claus Zinn introduced the Language Resource Switchboard, which is under development as part of the CLARIN-PLUS project. The switchboard suggests applicable web-based tools for a given resource, forwards the relevant details of the resource to the chosen application, and starts processing. There is already a demonstrator at http://weblicht.sfs.uni-tuebingen.de/clrs, and it works! There are some challenges in integrating this with the VLO (Virtual Language Observatory). For example, some records in the VLO only contain descriptive metadata, and don’t link to the actual resource files.

The current Switchboard is a very impressive demonstrator, but there are a lot of issues to be addressed before it can go beyond demonstrator status to a service. These include consideration of the vocabulary used for ‘tasks’, and whether an existing taxonomy such as TaDiRAH can be used, offering the opportunity to link to other services.

There are also likely to be use cases where the user starts with a raw text and is not necessarily happy to follow a fully automated processing chain with applications available as web services. What about data inputs that already have bespoke analysis and interpretation, and manually corrected encoding? Is it possible to interrupt workflows, then re-insert manually tweaked data inputs?

In the evening there was a short walking tour of the historic city of Utrecht, including a visit to the new CLARIN ERIC offices, and a most gezellig evening meal.

LINDAT DSpace for CLARIN Repositories

On the following morning, a session introduced LINDAT DSpace, a reusable instantiation of DSpace with bespoke modifications for CLARIN. The modifications include an enhanced submission interface for depositors, and ingest workflows. The interface is set up to capture CMDI metadata, but multiple profiles are possible, and conversions are built in, including Dublin Core, OLAC, and TEI headers.

Developed in Prague, the repository package is currently implemented in Slovenia, Poland, Italy, Sweden, and Norway. LINDAT DSpace is currently on DSpace 5.6, and is following upgrades in the core packages, as well as contributing code back into the DSpace project.

The repository would normally be installed on a dedicated virtual machine, and the technical requirements include the packages ant, postgresql, jdk, tomcat, maven, make, and apache or nginx. Most instances have been installed with help from staff from Prague, typically supported by a CLARIN ERIC mobility grant.
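Before attempting an installation, it might be worth checking that the prerequisites are present on the virtual machine. The following is a minimal sketch of such a check; the command names tried for each package are my assumptions and vary between systems, and Tomcat is left out because it usually runs as a service rather than a single command on the path.

```python
import shutil

# Package name -> commands that suggest it is installed (assumed names).
REQUIRED = {
    "ant": ["ant"],
    "postgresql": ["psql"],
    "jdk": ["javac"],
    "maven": ["mvn"],
    "make": ["make"],
    "apache or nginx": ["apache2", "httpd", "nginx"],
}


def missing_tools(required, which=shutil.which):
    """Return the packages for which none of the candidate commands is found."""
    missing = []
    for package, commands in required.items():
        if not any(which(command) for command in commands):
            missing.append(package)
    return missing


if __name__ == "__main__":
    absent = missing_tools(REQUIRED)
    if absent:
        print("Not found on PATH:", ", ".join(absent))
    else:
        print("All listed prerequisites were found.")
```

This only confirms that the commands exist; it says nothing about versions, which the LINDAT DSpace installation instructions should specify.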

I am certainly interested in this as an option for migrating the Oxford Text Archive to a new platform. It is most attractive that it offers various aspects of integration with CLARIN out of the box, including a path to CLARIN B-Centre accreditation, and there appears to be excellent community support available. One of the challenges for the OTA would be to ensure that there is an effective crosswalk from TEI headers to CMDI 1.2, but the discussion revealed that there are several successful instances of this, for example in the CLARIN centres Språkbanken (Sweden), CST (Denmark), and ACDH (Austria).
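To give a sense of what the first step of such a crosswalk involves, here is a minimal sketch that pulls a few fields out of a TEI header. The TEI namespace and element paths are standard TEI, but the target field names are hypothetical, the sample header is abbreviated, and a real crosswalk would map onto the components of whichever CMDI profile the repository uses.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical target field names -> paths in the TEI header.
FIELD_PATHS = {
    "title": "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title",
    "author": "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:author",
    "publisher": "tei:teiHeader/tei:fileDesc/tei:publicationStmt/tei:publisher",
}

# Abbreviated sample document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Example Text</title>
        <author>A. N. Author</author>
      </titleStmt>
      <publicationStmt>
        <publisher>Example Archive</publisher>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
</TEI>"""


def extract_header_fields(tei_xml: str) -> dict:
    """Collect the text of selected TEI header elements into a flat dict."""
    root = ET.fromstring(tei_xml)
    fields = {}
    for name, path in FIELD_PATHS.items():
        element = root.find(path, TEI_NS)
        if element is not None:
            text = "".join(element.itertext()).strip()
            if text:
                fields[name] = text
    return fields


if __name__ == "__main__":
    print(extract_header_fields(SAMPLE))
```

Fields that are absent from a header are simply omitted from the result, which makes gaps in legacy metadata easy to spot before conversion.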

 
