This blog has been archived and will not be updated further. For more IT Services blogs, see the front page.
The Sudamih Project has run its course and the final report is now available for your perusal. Although we’ve been rather quiet on the blog front in recent months this reflects the amount of work we’ve been doing on the project rather than the lack of it. Since the start of the year the project has written two full three-hour face-to-face courses and taught them to a varied class of humanities scholars, we’ve added a whole suite of data management guidance and tips to Oxford University’s Research Skills Toolkit, and progress on the Database as a Service has moved on in leaps and bounds. We’ve also had a go at enumerating the many benefits of the work we’ve done (which is relatively straightforward), calculating the ongoing costs of the training (again, not too difficult), and putting together a business case considering the future returns on investment (much trickier). All this is detailed in the final report, along with various lessons learnt and conclusions, both for JISC and for anyone thinking of establishing similar infrastructure at other universities. You can find the project outputs, including de-Oxfordized versions of the training materials designed to be re-used at other institutions, at the Sudamih Project Outputs webpage.
Some key findings include:
- The intellectual value of humanities datasets tends not to depreciate over time.
- There is a need in the humanities for very-long-term data sustainability solutions and cost models designed to deal with effectively permanent storage and access.
- Most researchers are willing in principle to share their data with others, but in practice do not regularly do so, for a variety of reasons. In the humanities, issues surrounding the incompleteness of the original data, or the layer of interpretation often required to render it consistent, can lead to reluctance to share, as researchers worry that their ‘processed’ data may be misinterpreted by others.
- Researchers need help to discover the most appropriate software tools to deal with specific research challenges.
- Researchers should be trained in organizational principles and strategies to enable them to better manage their information and sources.
- Researchers do not understand the terminology used by data librarians. Care must be taken to avoid technical jargon and use unambiguous but straightforward terminology when talking about data management.
- There is a significant amount of confusion over the ownership of research data. This is exacerbated by complex situations in which multiple people or organizations may have different claims on the same resource.
- Different academic departments and institutional service providers should work together to understand who should be responsible for implementing, and sustaining, various aspects of data management training.
- Data management training can have a large positive impact in terms of long-term cost savings relative to the near-term costs of running and maintaining courses and learning materials.
Although Sudamih is now at an end, our efforts to develop a research data management infrastructure at Oxford are still very much ongoing. Our next task is to take the pilot Database as a Service and turn it into a full production service, with polished intuitive interfaces, secure storage, a user manual, and accompanying training materials. We are also transforming the DaaS into a service that can be provided via the cloud, maximising its cost efficiency. The new project is called VIDaaS (Virtual Infrastructure with Database as a Service), and you can find out more about it here.
I have just got back from an enjoyable, if bitterly cold, few days in Chicago attending the 6th International Digital Curation Conference. It was at the same conference last December that I had my first real taste of the digital curation / data management ‘scene’, so it was fascinating to see how things had changed over the year. The theme of the conference this year was ‘growing the curation community’, a subject of great relevance to the EIDCSR and Sudamih projects, especially with regard to their training and researcher involvement components.
Highlights of the conference included: Chris Lintott’s introduction to Galaxy Zoo and the ‘Zooniverse’ platform for citizen science projects; Kevin Ashley’s state-of-the-discipline keynote entitled ‘Curation Centres, Curation Services, How Many is enough?’ (more than three, apparently); and the presentations by Wendy White of Southampton and Robin Rice at Edinburgh detailing the progress of their institutions’ research data infrastructure developments, which were of obvious relevance to our own ambitions at Oxford.
Hosting the conference in Chicago led to a much larger American delegation than had been able to attend last year’s event in Edinburgh, and some of the differences in approach to digital curation between the US and UK quickly became apparent, particularly from the training perspective. Whereas in the UK the emphasis over the last year has been increasingly on involving researchers in the data curation process, with a particular focus on their data management practices, in the US, the emphasis is far more on extending the skills of the library community and providing career pathways for data librarians. Two questions arise from this: firstly, why the difference? Secondly, which approach is likely to reap the most benefits in terms of maximising the value of research data?
The explanation for the different approaches seemed to lie mostly with the funders. Asked why there appeared to be little development of digital curation courses taking place in the UK, Sheila Corrall of the University of Sheffield pointed out that the current squeeze on library budgets this side of the Atlantic left little space for curriculum development in this area. The fact that UK Masters are one-year affairs, compared with the prevalence of two-year Masters courses in the US, additionally left less time for flexibility and optional course components. In contrast, JISC’s funding remit and coordinating role in the UK is directed more towards research. Another aspect that may be influencing national attendance at the conference is differing identifications with the term ‘digital curation’. Whereas the DCC and JISC encourage a broad interpretation of digital curation, including data management considerations at the pre-library ‘ingest’ stage, perhaps it suggests a narrower set of responsibilities elsewhere?
The question of which approach to training most benefits research data is more vexed. EIDCSR and Sudamih take the approach that researchers need to be directly involved with data management activities from the planning stage of a research project, albeit with support from other agencies. In addition, we take the view that it is the researchers who create data who are best positioned to document that data (and particularly the processes involved in its creation). It became apparent during the course of the conference, however, that whilst the emphasis in the US may at present be more upon training librarians rather than researchers to manage digital curation, and consequently more attention is paid to issues of longer-term preservation and curation, the librarians being trained in this way are perfectly aware of the need to support researchers and become involved at the creation stage. Ultimately, one of the strengths of the conference was to bring together the international community to exchange and understand these differences in approach and emphasis, and broaden conceptions regarding who could, and should, take responsibility for managing data at various stages of the data lifecycle, along with what needs to be included amongst these responsibilities.
As part of my work for the Sudamih Project, I recently spent some time surveying the literature on data management. This was initially for internal purposes – to fill in our background knowledge and inform the resources we’re developing – but after attending a workshop where we got the chance to meet people from other projects in the JISC Managing Research Data Programme, it became apparent that others might also be interested in our findings. Data sharing is something the Sudamih Project is keen to promote, so in the spirit of practising what we preach, we’re making the fruits of our research publicly available.
While a bibliography isn’t the sort of dataset that requires expensive equipment or trips to archives to produce, that doesn’t mean it’s a free resource: even when all the information can be found online, the process of collecting, sifting, and compiling it carries a significant cost in staff time. Sharing reduces duplication of effort, so that resources (both financial and human) are used more efficiently.
Two PDFs are available for download from our Project Outputs page. The first is a bibliography with brief abstracts, covering policy issues, data sharing, digital curation and preservation, repositories, metadata, and personal information management. The second is a more detailed review of a subset of the literature dealing with personal information management.
The bibliography can also be accessed via a Zotero group. For those unfamiliar with it, Zotero is a free reference management add-on for Firefox: while you can view the bibliography online, if you register (which is quick and easy) and join the group, you’ll be able to download a copy of the group library, which you can then use to add citations to your own documents. Group members are also able to contribute items, and we’d be delighted if others working in this area want to share their own references in this way.
As part of the Sudamih Project’s requirements gathering exercise, we asked a group of researchers to complete a questionnaire based on the Data Audit Framework.
The Data Audit Framework (which is in the process of being renamed as the Data Asset Framework) was developed by HATII at the University of Glasgow in association with the Digital Curation Centre. It’s intended as a tool to help higher education institutions take an inventory of their research data assets, with a view to ensuring effective preservation and accessibility. Within the confines of the Sudamih Project, a complete data audit was impractical, so instead, our questionnaire was based on just the third stage of the DAF methodology, designed to provide detailed information about individual data assets.
Although we were working with a small sample, we nevertheless got some interesting results. The questionnaire answers highlighted two features common to many humanities datasets. First, they are frequently almost infinitely expandable, and secondly, the data is rarely of a type which goes out of date.
This means that there is huge potential for the reuse of humanities datasets, both by the researchers who created them and (if the creators are willing to make the data public) by others: a database compiled for one project may often form a useful starting point for another in a similar area. This in turn emphasizes the need for stable long-term curation infrastructure for humanities data.
A second area of interest was the cost of producing datasets. A number of our respondents noted that the chief expense in their project had been their own time, but were generally wary of putting a specific price tag on this. The issue is complicated by the fact that a humanities research project may produce both data assets and a book or thesis, and it is often hard to say how the total costs of the project should be apportioned. However, the answers gave a general impression that some humanities scholars may be inclined to undervalue their own time, and hence perhaps the data resources they are producing – despite the potential for long-term usefulness of humanities data assets noted above.
These findings will feed into the next stages of the Sudamih Project, as we begin to think in more detail about the provision of a database service and training for researchers.
A full report on the use of the DAF within the Sudamih Project is available from the project outputs page of the Sudamih website.
On Thursday 22nd July, 2010, the Sudamih project staged a workshop on ‘Data Management Training for the Humanities’ at the Oxford e-Research Centre. The event was well-attended, with approximately forty delegates and speakers. Although it was just a morning workshop, we managed to squeeze quite a lot into the programme, with four speakers talking about the projects they were involved in followed by three ‘national perspectives’, and a panel session at the end intended to get some thoughts from members of the audience and start a debate.
The Sudamih project itself opened the morning, detailing the findings of the recent Researcher Requirements Report and drawing out some key messages. We were followed by Catharine Ward from the INCREMENTAL project, which is looking at practices and infrastructure in Cambridge and Glasgow. Although INCREMENTAL is looking at data management across a wider range of academic disciplines than Sudamih, its conclusions relating to existing practices and training requirements were reassuringly similar to our own: researchers were inconsistent in their information management practices, meaning that they misplaced things; they were happy in theory to share data, but in practice often found it problematic; and they were often not aware of existing central infrastructure, so services were not being fully exploited. The importance of language was also highlighted by both projects – the terminology of data management is not familiar to researchers and can be off-putting. You need to communicate clearly to researchers if you are to encourage and improve their skills.
The workshop then heard from Professor Eric Meyer of the Humanities Information Practices Project, based at the Oxford Internet Institute. This has been looking at the way researchers in the humanities use information, especially when collaborating with one another. Although the project is still in its early stages, its findings are already providing useful insights, such as that it is easier to persuade researchers to do new things if these can be introduced via analogies with practices with which researchers are already familiar.
There then followed a presentation by Gareth Knight about the PeKin project at King’s College London, which is developing tools and advice for managing both electronic business records and research data. This presentation emphasised the need to assess the value of data, as this is key to determining what should be curated and what discarded. The need for training in data management was again identified as crucial, as was support from senior management. The PeKin Project is planning a mixed approach to training, producing various materials including content type reports – covering formats for image and audio files, for example.
Joy Davidson of the DCC emphasised how institutions should try to fit data management training into existing practices and to use what they already had as much as possible. She stressed the importance of targeting where time and money could be invested to maximise returns, considering the career stage at which training was aimed, and the existence of support services.
Ross English explained the role of Vitae in coordinating the training of HE researchers. He spoke of the resources that Vitae provide for trainers, such as the database of practice, and the now pressing need to demonstrate that training is having a real impact upon researcher practices.
Finally, Stéphane Goldstein of the RIN talked about how data management is becoming a bigger issue in researcher training these days. Whereas previously, the RIN had tended to focus its efforts on research information finding and gathering, its new shortly-to-be-published Researcher Development Framework (RDF) would begin to treat data management in earnest as part of the research process. He stressed the need for more examples of good practice from the data management field, as few had yet been identified.
After the presenters had finished there was a short panel discussion, which brought up issues surrounding the role of supervisors in instilling good practice (at the moment it’s a question of luck, with most supervisors doing little to promote skills training), and the difficulty of engaging senior decision makers in universities and filtering policy down to the researchers who are meant to follow it.
The workshop was well-received and we hope to organise another on a related topic before the end of the year. The various workshop presentations may be downloaded from the workshop webpage, on the Sudamih website.
Today sees the official release of the Sudamih Researcher Requirements Report. We have compiled the Report on the basis of interviews conducted with researchers across the humanities disciplines at Oxford. It summarises current data management practices amongst humanities researchers and assesses demand for training and for the development of a ‘database as a service’ – two of the key anticipated outputs of the Sudamih Project. Although the participants in the interviews are all Oxford-based, there is little to suggest that the way humanities scholars approach data management here is very different from at other UK Universities, so we hope that this report will be of broad interest to anyone involved in research service provision for the humanities or data management.
Scholars in the humanities employ a huge range of sources and approaches in their research, making it dangerous to generalise too freely about humanities research data. Nevertheless, one can tentatively identify distinctions between data compiled for humanities research and that generated in the course of scientific investigation. Firstly, humanities data tends to be gathered from existing sources rather than created from scratch, with the possible exception of some linguistic data gathered under ‘laboratory’ conditions. The diverse nature of the sources that humanities researchers gather their information from often results in data which is inconsistent, incomplete, or which relies to a degree on conscious selection and interpretation. All of these factors must be understood before the data can be properly analysed. However, whilst data in the humanities may not be as straightforwardly ‘reliable’ as much scientific data, this is not to say that it has less academic value. On the contrary, the intellectual value of humanities research data often has exceptional longevity, tending not to depreciate over time. A database of Roman cities is potentially of as much use to researchers in fifty years’ time as it is today, provided it is not rendered obsolete through technological change. Humanities scholarship often aggregates to a ‘life’s work’ body of research, with any given researcher often wishing to go back to old datasets in order to find new information.
The challenge faced by Sudamih and other JISC-funded research data infrastructure projects is to build the systems by which researchers can preserve their data so that they are not obsolescent in fifty years’ time and can still be used both by the researcher that created them and potentially by others. This requires documentation, reliable long-term storage, potential migration to more modern data formats, and various other curation activities. It also places responsibilities on researchers themselves, to organise and structure their information so that it is clear and usable, and to consider the future of their data at the stage of its conception. When done well, good data management should bring obvious benefits to the researcher who created the data, as well as potentially extending the usefulness of the data to others. Good data management maximises the value of the data.
The Sudamih Project will be staging a workshop on the 22nd July to find out about how different institutions and supporting organisations are approaching data management training for researchers in the humanities.
One of the major objectives of the Sudamih Project is to develop and trial training modules that can be used to improve researchers’ data management skills. Our recent requirements-gathering exercise found that whilst ‘data management’ is not a phrase that gets humanities researchers particularly excited, it can induce a sense of anxiety. Most researchers find that they sometimes misplace or lose track of information, or organise it in a way that does not necessarily aid re-discovery or re-use further into their academic careers. Data management is often a low priority activity, which can place the data at risk of loss, or simply at risk of obscurity as it sits quietly in a corner of a hard drive, unknown by scholars, unused beyond its initial function, and gradually becoming obsolete as technology moves on. Researchers realise this, but many have little idea of ‘best practice’, and only worry about their data when problems arise. The need for training is increasingly being recognized by those involved in research support activities, as well as by researchers themselves.
Despite recognizing the importance of sound data management, most UK institutions are still only at the early stages in terms of developing training programmes to address the situation. There are plenty of courses on databases, bibliographic software, and disciplinary research skills, but few that really seek to improve research information and data management skills more broadly. We hope that the workshop on the 22nd July will bring interested parties together so that we can all benefit from finding out about the current state of affairs and what people are proposing to improve matters.
We shall, of course, be relating the findings of our own interviews with humanities researchers at Oxford, and attempting to draw out recommendations from these, which we shall follow up over the next few months. Besides Sudamih, delegates will hear from representatives of the Digital Curation Centre, the Research Information Network, Vitae (the national researcher training body), and from projects at Oxford, Cambridge, and King’s College London. We also hope to have a lively panel session where the audience can get the chance to ask questions and relate their own experiences.
The workshop is free to attend, and includes lunch. Further details, along with registration instructions, can be found at the workshop webpage: http://sudamih.oucs.ox.ac.uk/training_workshop.xml.
With the completion of thirty-one interviews, a significant phase of the Sudamih Project requirements gathering workpackage draws to a close. It’s been both a busy and an interesting few weeks, dashing about Oxford to talk to researchers from across the spectrum of humanities disciplines about how they use and organize data.
Balancing the competing demands of research, teaching, and dreary-but-essential admin, academics are immensely busy people, and we were very grateful so many of them found time in their packed schedules to meet with us. Once we had found a free slot in their diary, however, most of our interviewees needed little prompting to talk about their work, and we heard about a huge range of fascinating projects – more than once it required a fair amount of self-discipline to move on to our questions about database requirements and training rather than just letting them keep talking about their research topic.
The interviews brought me to a new appreciation of the huge variety of types of research within the humanities. This isn’t limited to diversity in subject matter (though there certainly is that), but also includes a wide range of very different research methods. My own research experience has been largely limited to the ‘read something, think about it, write something’ model, but I quickly came to realize that those of us who work this way are a minority. We talked to linguists who record and analyse speech, English literature scholars transcribing manuscripts, classicists working with inscriptions, an Orientalist working on an undeciphered writing system, musicologists… and that list barely scratches the surface. I was frequently amazed by how much people crammed into their time: one researcher reeled off a list of five major research projects which sounded sufficient to fill the working week several times over, and then casually added ‘And I also write books and articles.’
A useful side effect of asking researchers about their data management practices was being prompted to re-evaluate my own ways of working. As the interviews progressed, I found I was being more proactive: rather than, for example, keeping all my files in one folder and only reorganizing them when this became unwieldy, I thought ahead and created a finer-grained system. Discussing versioning and backing up prompted me to be more conscientious about this myself – something which became acutely relevant when my home computer recently suffered a hard drive failure.
For most of the interviews, we adopted a belt-and-braces approach, recording them using a digital device, and also taking written notes. One interviewee commented that she was surprised to see us making notes by hand, when it would surely be more efficient to type straight into a laptop. I was in turn surprised: not by the suggestion itself, but by the fact that it hadn’t occurred to me earlier that there was indeed something slightly incongruous in an OUCS representative using a method of data collection which was distinctly old school. (This happened mostly for practical reasons: I don’t have a suitably portable laptop, and the duration of the project wasn’t sufficient to justify the expense of acquiring one.)
With the interviews over and written up, we’re now in the process of drawing our findings together into a report. I may be biased, but I think it makes pretty interesting reading so far: watch this space for further details.
I am very pleased to announce that the Digital Curation Centre will be paying a visit to Oxford on the 16th June to present a workshop on managing research data. The workshop is aimed primarily at researchers interested in bidding for funding for projects with a data output, although it should also appeal to those who assist and support research activities and who would like to find out more about the challenges of data curation.
Although the workshop will obviously be of relevance to those interested in either the Sudamih or EIDCSR projects, it will not focus exclusively on a particular academic discipline but should be useful across the board. Sessions will include: the roles and responsibilities associated with conceptualising, creating and managing research data during the life of a project; the responsibilities associated with the longer-term management of research data after a project has ended; developing a data management plan; and preparing data for long-term curation and re-use.
The workshop is free for members of the University of Oxford, £50 for non-members.
Anyone interested in attending the workshop should register at http://www.dcc.ac.uk/training/digital-curation-101/digital-curation-101-lite-oxford