By Ylva Berglund Prytz and Martin Wynne, University of Oxford
The JISC-funded Discovering Babel project has enabled the Oxford Text Archive to improve the ways in which we make our language resources available for users to find and use. Here we will explain some of the ways in which other resource creators might be able to follow in our footsteps.
Language resources are electronic collections of language data that can be used for language study and research, and are created in a number of contexts. Sometimes the main purpose of a project is to create a dataset, and in many other cases language resources are created as a part of or simply as the result of a larger project to investigate a particular aspect of language. Irrespective of why and how a resource is created, there is usually scope for making the resource available to others. This report will examine some simple ways in which creators of language resources can make it easier for others to find and reuse them.
There are many reasons why you may want to make your language resources available to others. It may be a requirement of your funding. It may be that you simply want to give something back to the community and contribute to our accumulation of knowledge. Sharing your resources can also be a way of drawing attention to your work, getting recognition for what you are doing, and showing that it is having an impact on wider research goals. If you are able to show that something you have created is valuable to a larger group of users, this is likely to work in your favour in future grant applications, and when looking for collaborators and support from the community.
Making language resources available is also a way of minimizing duplication of effort. If you have created a resource that others can use, they do not have to spend time and resources on creating their own resource.
Replicability of research results is another important issue. If others are to test and reproduce your results, or attempt to extend or refine them, then they will need to have access to the data, tools and methods which you used. Making resources available in this way is essential to testing, refining and building on research results, and is considered necessary for the verification of research findings and interpretations in many scientific domains.
Assuming that for one of the above reasons, or for another, you want others to know about and maybe also to reuse your language resources, what are the issues that you need to consider before sharing your resources? Thinking about the questions below should make your task easier and the sharing of the resource more effective.
Issues to consider when deciding whether and how to share your resources include:
- How do you share?
  - Will you offer metadata, to help users find, evaluate and understand your resource?
  - Will you offer a service for users to access the resource (e.g. online access, or download option only)?
  - Will you deposit the resource in an archive or repository (instead of, or in addition to your own service)?
  - Or do you want to only share on request to users who get in touch with you?
- Legal issues
  - Do you have the right to share the resources?
  - How do you protect your rights?
  - What kind of licence will you ask users to agree to?
- Administrative and organizational issues
  - Do you have access to the resources needed to share your resources (server, staff, admin, user support, etc.)?
  - Who will be responsible for the service?
  - Are these reliable, sustainable and likely to be available in the long term?
- Finding your users
  - How do users find your resource?
  - How can you make it easier for users to find/use your resource?
- Can you support users?
- How do you ensure you have the necessary resources/support/infrastructure to share your resource?
- How do you ensure continuation of service?
Let’s now examine in more detail some of the issues relating to how to help users to find your language resources.
If you want to share your resources you have to make sure people know about them and can find them. The most effective way to do this is to make your metadata available to a portal which brings together information about where to find language resources in different locations. These exist in particular sub-domains (e.g. endangered languages, child language acquisition, learner language, sign language, for particular languages or language families, for historical periods, etc.), and there are a couple of more comprehensive initiatives: the Open Language Archives Community, and the CLARIN Virtual Language Observatory. Some questions to explore in order to market your resource effectively include:
- Who are the potential users? Where do they currently look for resources?
- What are the relevant mailing lists, conferences, and publications for your target audience?
- Where in other domains, or sets of users, or geographical regions (beyond your immediate community or target audience) might you find interest in the resource?
- If the resource is available online, or has a webpage associated with it, make sure you make it easy for search engines to find and index your page, for example by including the correct keywords in the website metadata (see Google’s guidelines for webmasters).
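As a small illustration of the last point, the sketch below generates the title, description and keyword elements for the head of a resource's web page. The function name and the example values are hypothetical, and search engines differ in how much weight, if any, they give to each of these fields, so treat this as one possible starting point rather than a recipe.

```python
from html import escape


def build_meta_tags(title, description, keywords):
    """Return HTML <head> lines describing a resource landing page."""
    keyword_list = ", ".join(keywords)
    # escape() protects against characters such as & and " in the values
    return "\n".join([
        f"<title>{escape(title)}</title>",
        f'<meta name="description" content="{escape(description)}">',
        f'<meta name="keywords" content="{escape(keyword_list)}">',
    ])


# Hypothetical resource, used only to show the shape of the output
print(build_meta_tags(
    "Example Corpus of Learner English",
    "A corpus of essays written by learners of English.",
    ["corpus", "learner language", "English"],
))
```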
Once you have decided how and to whom you will make your resource descriptions available, you need to provide the required information in the right formats. If you decide to deposit your resource in a repository, you will get some assistance in doing this. If you deposit with the Oxford Text Archive, you will need to fill in a deposit form, and then the repository staff will create an electronic metadata record. This will be transformed automatically to the correct formats for the online catalogue record, for OLAC and for CLARIN. If you want to create your own records, you can follow the guidelines provided by the different repositories. Some expertise in creating and manipulating XML documents will be required.
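As a rough illustration of what creating your own record involves, the following sketch builds a very small metadata record using Dublin Core elements inside an OLAC wrapper. The field values and the function name are invented for the example, and the record is deliberately incomplete: a real record must follow the schema and guidelines published by the repository or aggregator you are targeting.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OLAC_NS = "http://www.language-archives.org/OLAC/1.1/"


def build_record(fields):
    """Build a minimal OLAC-style record from (element name, value) pairs."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("olac", OLAC_NS)
    root = ET.Element(f"{{{OLAC_NS}}}olac")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical resource description, for illustration only
record = build_record([
    ("title", "Example Corpus of Spoken Swedish"),
    ("creator", "Doe, Jane"),
    ("date", "2011"),
    ("description", "Transcribed interviews, about 100,000 words."),
])
print(record)
```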
You may use social media such as blogs, Twitter, Facebook, Digg, Delicious and Zotero, if you think that this might be a way to reach your potential users. It might also prove to be a way to reach unexpected groups of users outside the academy. Your funders might consider this to be a useful way to increase wider impact. It’s probably still not clear how appropriate and useful such methods are, and it’s a fast-changing field. But it doesn’t take much effort to tweet, announce things on Facebook, or make links on various services. Furthermore, writing blogs can be a good way to report your work to a wide variety of stakeholders and potential users.
The point of making your language resources discoverable is to facilitate the reuse of them by others. Let us now briefly examine some of the issues relating to how you can make this happen as effectively as possible, starting with avoiding any potential legal pitfalls.
Before you make your resource available you have to make sure you have the right to share it. You may also want to look at what you can do to protect your rights (for example, releasing the resource under a particular licence). You also need to consider whether there are any restrictions on what users of the resource are allowed to do with it. Can they share it, add to it or develop it further? This can be set out in a user licence. Rights issues can be complex and often vary between countries. If you have questions about what rights you have or what you need to do to have the right to share a resource, you may want to consult a legal representative for your area, for example the University lawyers or legal department.
If you are making the resource ‘freely available’, you may want to specify this with an open access licence. One way to encourage reuse is to make it simple for users to see under what conditions a resource is available.
Creative Commons (CC) licences can be used as “a simple, standardized way to grant copyright permissions to [your] creative work”. The CC licences can be used to specify that there are no restrictions whatsoever on re-use, or, for example, that people may only use the resource for non-commercial purposes or that they have to acknowledge the original creator when using it. It is also possible to specify that people may create derivatives (for example use part of the resource and/or add to it) and that such derivatives have to be made available under the same licence conditions. For more information about Creative Commons, please see http://creativecommons.org/.
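The way these conditions combine can be sketched as a small function that maps your choices to the short name of the corresponding Creative Commons option. The function and parameter names are our own, chosen for illustration, and the sketch is no substitute for reading the licence texts themselves or taking legal advice.

```python
def choose_cc_licence(waive_all=False, allow_commercial=True,
                      allow_derivatives=True, share_alike=False):
    """Return the short name of the CC option matching the given choices."""
    if waive_all:
        # No restrictions whatsoever on re-use
        return "CC0"
    parts = ["BY"]  # the remaining CC licences all require attribution
    if not allow_commercial:
        parts.append("NC")
    if not allow_derivatives:
        parts.append("ND")
    elif share_alike:
        # Share-alike only makes sense when derivatives are allowed
        parts.append("SA")
    return "CC " + "-".join(parts)


print(choose_cc_licence(allow_commercial=False, share_alike=True))
```

With the choices shown, the function returns the name of the attribution, non-commercial, share-alike licence.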
Whatever rights or restrictions you assign to your resource, you need to consider whether the situation is likely to change in the future. For example, will it be the case that restrictions can be lifted after a certain date? Or do you have permission to use certain source texts only for a limited time? If so, you have to ensure that you can deal with this.
As well as considering the legal and ethical issues relating to making your language resources available, you should also consider the licensing of the metadata associated with your resources. In order for users to be able to find, evaluate and reuse the resources, good descriptions of their nature and context are necessary. It is usual in the domains using language resources for these descriptions to be made freely available, but usually there is not a specific and clear statement of the terms under which they are made available. In order to avoid any restrictions on the free sharing of metadata, and to ensure that maximum use is made of it, it is better to assign a specific open access licence to all metadata records, such as ODC-PDDL or a Creative Commons licence.
In the case of the Oxford Text Archive, some of our resources are TEI XML documents, with the metadata embedded in the header of a single file which also contains the resource in the body. For these it was necessary to apply a single licence to both metadata and data, and since we have found that Creative Commons licences best fulfil our needs for licensing the textual data (in most cases), we opted for one of those. In cases where we make just the metadata available, for example as a catalogue and to metadata harvesters, we will apply the least restrictive possible Creative Commons option, usually known as the ‘no copyright’ or ‘CC0’ licence (http://creativecommons.org/publicdomain/zero/1.0/).
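To illustrate how a licence statement can travel with a TEI file, the sketch below reads the licence URL from the header of a small, made-up TEI document. It assumes the licence is recorded with the TEI `licence` element inside `availability`; the sample document, the function name and the particular licence URL are ours, and this is not necessarily how any given archive encodes its files.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A made-up, minimal TEI document used only for illustration
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Example text</title></titleStmt>
      <publicationStmt>
        <publisher>Hypothetical publisher</publisher>
        <availability>
          <licence target="http://creativecommons.org/licenses/by-sa/3.0/">
            Distributed under a Creative Commons licence.
          </licence>
        </availability>
      </publicationStmt>
      <sourceDesc><p>Hypothetical source.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Sample content.</p></body></text>
</TEI>"""


def licence_target(tei_xml):
    """Return the licence URL declared in the TEI header, or None."""
    root = ET.fromstring(tei_xml)
    element = root.find(".//tei:teiHeader//tei:availability/tei:licence", TEI_NS)
    return element.get("target") if element is not None else None


print(licence_target(SAMPLE))
```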
Depending on the nature of the resources at your disposal, you can opt to share your resources in various ways. Whatever way you choose, the key point is to ensure that the solution is not dependent on specific people, machines, projects, etc. which are likely to be transient, but rather that it is embedded in a stable organizational set-up which is adequate for providing a persistent service with high availability. The key questions to ask in deciding what sort of service to offer and how to provide it are the following:
- Is what I am setting up sustainable?
  - Is the solution technically robust and not subject to discontinuation should current funding/staffing/equipment be cut?
- Who is responsible for the service?
  - Is this a person (named or defined by function) or an organisation (unit, department, institution)?
- Who is responsible for the various bits of infrastructure on which the service depends?
  - Technology (server, scripts, physical server space, etc.)
  - Human resources (server maintenance, user support)
- What will the situation be in 1, or 2, or 5, or 10 years’ time?
  - What happens if you (or the person responsible for the service or part of it) leave or take on a different role?
  - What happens at the end of the current round of funding?
  - Will additional funding be needed/be available to continue the service?
  - Would it be better to look to move the service to another institutional home?
- How can I make it easier for users to use the resources?
Let’s examine some of these options in a little more detail.
Distribution via email or on disk
A simple option, especially where small resources are concerned, is to simply send the resource to whoever requests it either as an email attachment (suitable for very small resources only) or on a CD or DVD.
This is only suitable for low-demand, small resources. You still have to consider legal issues, and what provision there is for making the resource available when you are not personally available to respond to requests. For distribution on disk there is also a cost for the media and postage. What is more, end users are left to their own devices when it comes to connecting the resource to the relevant analysis tools. It can be tricky to work out which tools to use: will you be prepared to offer advice and technical support to users? Some will ask for it.
If you make your resource available online, you can either make it available for download only (with some of the same problems identified above), or offer an online service which people can access and use via their web browser (for example a corpus with a search interface). Now a new set of questions arises:
- Who maintains the website?
- Can the site handle the volumes of traffic, and the amount of processing required?
- How will you know how many users have visited the site and downloaded your resource, or performed other operations? Do you need to report this to funders or other stakeholders?
- Who will maintain the server and ensure that the service is available?
- Will you offer a service level description, setting down exactly what you offer and under what terms?
- Can you monitor the availability of the online services (i.e. tell if everything is up and working properly)?
- Do you need to restrict access to certain classes of user? If so, how will you do this?
- Do you need to recognize users so that they can come back to datasets and workflows that they have started to assemble on previous visits?
- How will you deal with user support or queries (technical or about the resource/service)?
- Will the service be available even if you leave the institution, or change your ISP?
- Is the URL stable, or is it likely to change when the university re-designs its website (or the website host goes into administration)?
- What happens when the technology behind the service needs updating/renewing (for example to work on different operating systems or in different browsers)?
- Are you prepared to offer any guarantees of availability and persistence of service to users who might require stable datasets and tools for their research, or who may want to be able to come back and reproduce results at a later date?
- How will users cite your datasets and services in their reports and publications?
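On the question of knowing how many users have downloaded your resource, one low-tech approach is to count successful requests in the web server's access log. The sketch below assumes log lines in the widely used Common Log Format and a hypothetical `/downloads/` path prefix; the host addresses and file names are invented, and counting distinct hosts is only a crude approximation to counting users.

```python
import re
from collections import Counter

# host ident user [timestamp] "METHOD path protocol" status size
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3}) ')


def count_downloads(lines, prefix="/downloads/"):
    """Count successful GET requests per file, plus distinct requesting hosts."""
    counts = Counter()
    hosts = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        host, method, path, status = match.groups()
        if method == "GET" and status == "200" and path.startswith(prefix):
            counts[path] += 1
            hosts.add(host)
    return counts, len(hosts)


# Invented log lines, for illustration only
sample_log = [
    '192.0.2.1 - - [10/Oct/2011:13:55:36 +0000] "GET /downloads/corpus.zip HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2011:14:02:10 +0000] "GET /downloads/corpus.zip HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2011:14:03:00 +0000] "GET /downloads/missing.zip HTTP/1.1" 404 512',
    '192.0.2.9 - - [10/Oct/2011:14:05:00 +0000] "GET /about.html HTTP/1.1" 200 1024',
]
counts, distinct_hosts = count_downloads(sample_log)
print(counts, distinct_hosts)
```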
Depositing your resource in an archive or repository
A lot of the issues arising from running your own web service can be avoided if you deposit your resources in a repository, which will deal with distribution and may also offer long-term preservation, help with generating and sharing metadata, and connection with other tools and resources. In deciding whether to do this, and whether a particular repository is appropriate, you may wish to consider:
- Is there a cost associated? If so, is it a once-off, annual, etc? How will you pay ongoing fees after the end of the project?
- What do you have to do to deposit (for example format of resource and metadata)?
- How stable and reliable is the repository? How long is their funding likely to be continued?
- Who knows about the repository? Is it known to potential users of your resource? Does it share metadata with relevant aggregators, and announce new deposits in appropriate forums?
- Who has the right to use it? Is access restricted to members of particular institutions, associations, countries, etc.? Are there technical barriers which might exclude some sets of users?
There are several archives and repositories available. The Oxford Text Archive offers a service to deposit resources for a small administrative fee. It has the advantage of being a specialist archive for literary and linguistic resources, offering metadata to aggregators in this domain, and forming part of the emerging research infrastructure being developed by CLARIN. Other services exist for more specialist resource types, such as SCOTS at the University of Glasgow for Scottish and historical resources, CHILDES for language acquisition studies, ICAME for English language resources, and the Endangered Languages Archive at the School of Oriental and African Studies, University of London. Each is well embedded in its research community, and so deposit with such an archive is an excellent way to reach particular sets of users.
There is also a lot of ongoing work in developing institutional repositories in universities in the UK. While some of these are focussed exclusively on e-prints, some offer repository services for research data as well. Creators of resources should check on the facilities and services available in their institution (often based in the library or information services department), as deposit with an institutional repository may be a viable option. This may be useful for raising your profile locally and as a secure storage solution. It is however highly unlikely to satisfy all of your needs. An institutional repository which aims to cater for research output of all types and for all disciplines cannot have specialist curation expertise in all areas, and will not, for example, know about all of the relevant metadata standards, best practice in digital preservation of language resources, or connection to relevant discipline-specific resource discovery services. Repositories will typically offer non-exclusive deposit agreements, which means that when you deposit your resources, you do not give up any of your rights. There is normally no barrier to you depositing your resource in numerous archives. This is effective for preservation purposes, although you may need to consider the impact that it might have in terms of version control (will the resource be updated, and how do you check that the latest version is available in all places?), and monitoring usage.
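One simple way to approach the version control question is to compare checksums of the copies held in different places against your own reference copy. The sketch below does this for content already read into memory; the archive names and the function names are hypothetical, and in practice you would first need to fetch or read each copy.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 checksum of some content as a hex string."""
    return hashlib.sha256(data).hexdigest()


def stale_copies(reference: bytes, copies: dict) -> list:
    """Return the names of the copies whose content differs from the reference."""
    expected = sha256_of(reference)
    return sorted(name for name, data in copies.items()
                  if sha256_of(data) != expected)


# Invented content standing in for real files
latest = b"corpus version 2"
held = {
    "specialist-archive": b"corpus version 2",
    "institutional-repository": b"corpus version 1",
}
print(stale_copies(latest, held))
```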
Furthermore, it is increasingly likely that federations of archives will emerge, offering the possibility of cross-searching resources and connecting disparate collections and tools. Beyond this, sophisticated virtual research environments will emerge, allowing more operations, as well as collaborations between groups of scholars, and connections to publications and other outputs. It is likely to be the specialist repositories which are connected to this new infrastructure, and it is likely to become increasingly difficult for individual scholars to connect up their resources without the assistance of the repository and infrastructure specialists.
Whichever of the options you choose, you can help to ensure that users can work with your resource as effectively as possible by considering offering the following facilities:
- A full description of the resource, carefully crafted user guidelines, FAQ, instructions (preferably with screenshots);
- Support for answering user queries;
- A forum for users where they can discuss issues that come up. Make sure that you, or someone with good knowledge of using the resource, is available to respond to queries, in particular if the forum is new or under-used;
- Make it easy for users to give appropriate credit to resource creators and access services, thereby also further promoting your resource and announcing its availability;
- Make it clear what the title of the resource is, who the creator is and where it is found (at a persistent URL);
- Make any licence restrictions clear (especially if your licence stipulates that the resource creator/owner should be attributed by any user);
- Include on your website a sample citation/bibliography entry that users can use for reference;
- If you are offering an online service, test the interface during development, and try to find some resources for ongoing development in response to user feedback.
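The sample citation mentioned in the list above can be generated from the same details you record in your metadata. The following sketch uses an invented resource and a generic citation layout of our own; adapt the wording and order to whatever style your community expects.

```python
def format_citation(creator, year, title, distributor, url):
    """Return a one-line sample citation for a language resource."""
    return f"{creator} ({year}). {title}. {distributor}. Available at: {url}"


# Hypothetical details, for illustration only
print(format_citation(
    creator="Doe, J.",
    year=2011,
    title="Example Corpus of Spoken Swedish",
    distributor="Example Language Archive",
    url="http://example.org/resources/1234",
))
```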
In summary, you need to take as wide a view as possible about who the potential users are, how they will find the resources, how they might want to use them, and then to think about how the arrangements will continue in the future. Good luck!