SPINDLE Frequently Asked Questions

What is Spindle?

SPINDLE was a project funded by JISC as part of its “Rapid Innovation in Open Educational Resources” programme. The project experimented with speech-to-text technologies to automatically create transcripts of Open Educational Resources (OER), and developed new tools to generate better keywords to help with the indexing and description of OER.

How do I transcribe automatically from speech to text?

We investigated a number of options for the automatic transcription of podcasts:

Adobe Premiere Pro is excellent for video editing, but not for transcribing thousands of podcasts automatically. Its Speech Analysis tool can be helpful if you need to transcribe one or a handful of audio or video podcasts, but it cannot batch-process large numbers of files. CMU Sphinx, on the other hand, allowed us to run batch transcriptions of thousands of podcasts efficiently.

How accurate is automatic Speech to Text?

It depends. The key factor in our experience is the quality of the recording – a professional recording using a good tie-clip microphone gives the best results, while a microphone far away from the speaker in a noisy, echoing room gives the worst. It also depends, of course, on the clarity and accent of the speaker. In the very best cases, around 6 out of every 10 words are transcribed correctly, and the gist of the lecture is obvious. With poor-quality recordings this can drop to around 3 out of 10 words, at which point the result is probably too confused to read as normal English and too poor to generate a good range of keywords. It is important to realise that all automatic transcripts will need significant editing and checking, particularly to insert correct punctuation, in order to make them readable for human users.
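As a rough illustration of how figures like “6 out of 10 words” can be measured, the sketch below compares an automatic transcript against a hand-made reference using a standard edit-distance alignment and reports the proportion of reference words recovered. This is a minimal example for illustration only, not the scoring script used in the project.

```python
# Minimal word-accuracy sketch: aligns a hypothesis transcript against a
# reference transcript with Levenshtein (edit-distance) alignment and
# reports the proportion of reference words recovered correctly.
# Illustration only, not the project's own evaluation code.

def word_accuracy(reference, hypothesis):
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    errors = d[len(ref)][len(hyp)]
    return max(0.0, 1.0 - errors / len(ref))

print(word_accuracy("the cat sat on the mat", "the cat sat on a mat"))  # ~0.83
```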

How do I generate keywords automatically from transcriptions?

We used two methods:

AntConc is a desktop application (which works on Windows, Mac and Linux). Generating keywords involves starting the programme, loading the text and the reference word list, and manually running the function that generates the keywords. The user can adjust various parameters and change the reference corpus, so we found this useful when investigating the best ways to generate relevant keywords. But the interactive nature of the application meant that it could not be deployed in an automated workflow to generate keywords from multiple podcasts.

So, instead, we wrote a script to generate the keywords, which could be inserted into our automated workflow and invoked programmatically without human intervention.

How does the algorithm for keyword filtering work?

We compared the word frequencies in the automatic transcription with those of transcribed speech in a large reference corpus of English, the British National Corpus. Words that occur much more often in the transcript than in normal speech are likely to be keywords.
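A minimal sketch of that comparison is shown below. It scores each word in a transcript with the log-likelihood (G²) statistic against word frequencies from a reference corpus such as the BNC, and keeps the highest-scoring over-represented words as candidate keywords. The reference frequencies and smoothing value here are placeholders for illustration; the real filtering program is in the SPINDLE repository.

```python
import math
from collections import Counter

def keywords_by_log_likelihood(transcript_words, ref_freqs, ref_total, top_n=20):
    """Rank words that are over-represented in the transcript relative to a
    reference corpus (e.g. the British National Corpus) using the
    log-likelihood (G^2) statistic. ref_freqs maps word -> count in the
    reference corpus; ref_total is the corpus size in words."""
    counts = Counter(w.lower() for w in transcript_words)
    doc_total = sum(counts.values())
    scored = []
    for word, a in counts.items():
        b = ref_freqs.get(word, 0.5)          # smooth words unseen in the reference
        # expected counts under the null hypothesis of equal relative frequency
        e1 = doc_total * (a + b) / (doc_total + ref_total)
        e2 = ref_total * (a + b) / (doc_total + ref_total)
        g2 = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b > 0 else 0))
        if a / doc_total > b / ref_total:      # keep only over-used words
            scored.append((g2, word))
    return [w for _, w in sorted(scored, reverse=True)[:top_n]]

# Hypothetical reference frequencies; in practice these come from a BNC word list.
bnc = {"the": 6_000_000, "of": 3_000_000, "control": 12_000, "financial": 9_000}
print(keywords_by_log_likelihood("financial control of the budget".split(), bnc, 100_000_000))
```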

Where can I download the code generated during the SPINDLE project?

The code is available from https://github.com/ox-it/spindle-code/.

How can I align audio and transcription automatically?

We used the Penn Phonetics Lab Forced Aligner (P2FA), an application which has emerged from academic research in phonetics. The staff at the Phonetics Laboratory at the University of Oxford had identified this in an earlier JISC-funded research project as the state of the art for the automatic alignment of everyday contemporary English speech, and had gained expertise in using it. P2FA is free to download and use, and doesn’t have any licence conditions attached to it.

P2FA is a Python script which interfaces with the Hidden Markov Model Toolkit (HTK) aligner and a set of good-quality acoustic models. It is necessary to install HTK and use it according to the HTK End User Licence Agreement, which is not restrictive in terms of how the software is used. HTK is usually available from http://htk.eng.cam.ac.uk, although the site was not accessible as of 12 September 2012.
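In practice a forced-alignment run looks something like the sketch below: P2FA’s aligner script is given a WAV file and a plain-text transcript and writes a Praat TextGrid with word and phone timings. The exact script location and arguments depend on the P2FA release you download, and HTK must already be installed and on the path.

```python
import subprocess

def force_align(wav_file, transcript_file, out_textgrid):
    """Call the P2FA aligner (which drives HTK behind the scenes) to produce
    a Praat TextGrid with word- and phone-level timings. Assumes P2FA has
    been unpacked into ./p2fa and that the HTK binaries are on the PATH;
    the invocation may differ slightly between P2FA releases."""
    subprocess.run(
        ["python", "p2fa/align.py", wav_file, transcript_file, out_textgrid],
        check=True,
    )

force_align("lecture.wav", "lecture.txt", "lecture.TextGrid")
```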

What formats did you use for caption work?

We used WebVTT – a simple-to-understand HTML5 web format for presenting groups of words over a video.
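To show how simple the format is, here is a small sketch that renders time-coded caption chunks as a WebVTT document; the `HH:MM:SS.mmm --> HH:MM:SS.mmm` cue timings are all a browser needs to overlay each chunk on the video. The cues shown are made-up examples.

```python
def to_webvtt(cues):
    """Render (start_seconds, end_seconds, text) triples as a WebVTT document."""
    def ts(seconds):
        h, rem = divmod(seconds, 3600)
        m, s = divmod(rem, 60)
        return "%02d:%02d:%06.3f" % (h, m, s)

    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append("%s --> %s" % (ts(start), ts(end)))
        lines.append(text)
        lines.append("")
    return "\n".join(lines)

print(to_webvtt([(0.0, 3.2, "Welcome to this lecture."),
                 (3.2, 7.5, "Today we look at financial control.")]))
```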


SPINDLE Project Outputs

  • SPINDLE set up and documented a workflow to generate automatic transcriptions of future open-access audio and video podcasts using an online platform, concentrating on automatic keyword extraction for better cataloguing.
  • SPINDLE tested and documented this workflow by:
    • developing a method to generate keywords and relevant word pairs automatically
    • generating, in a batch process, automatic speech-to-text keywords and time-coded transcripts from a database of over 3,400 podcasts
    • documenting the problems of accuracy in automatic transcriptions by testing and reporting the results of using two commonly used speech-to-text tools and services against baseline hand-transcribed transcripts
    • investigating the use of Automatic Speech-to-Phoneme alignments for our existing manual transcriptions that did not already include time-code information
  • SPINDLE also successfully designed and documented a filtering program for automatically extracting keywords and relevant word pairs from uncorrected time-coded transcripts by selecting non-common words.
  • SPINDLE extended the functionality of the keyword extraction tool by creating an online web application to manage the transcription of online media podcasts. The main functionality of this online platform is:
  • Caption editor:
    • to edit time-coded transcripts whilst reviewing against the original online media file
    • to allow registered users to transcribe in parallel, with support for crowd-sourcing corrections
    • to import time-coded transcriptions in XMP, SRT or CMU Sphinx formats
    • to edit transcriptions to provide corrections, punctuation, caption length chunking, speaker labels, etc.
  • Batch converter:
    • Create automatic transcriptions from an online media file using a CMU Sphinx installation
    • Create batches of media for automatic transcription
    • Create a list of automatic keywords with relevance statistics
  • Export Tool:
    • Support for media metadata and Open Educational Resource (OER) licences
    • Support for exporting time-coded transcriptions in multiple formats:
      • human readable:  plain text and HTML
      • HTML5-compatible captions: online media caption format (WebVTT)
      • XML format suitable for archiving and preservation
  • Data feed in RSS format to facilitate online visibility

All SPINDLE code is available from the open repository https://github.com/ox-it/spindle-code/


SPINDLE project: Lessons Learnt

The SPINDLE project is wrapping up and will end in September 2012. Please find below some of the lessons learnt during the project.

  • We can obtain good keywords even if the automatic transcription contains many errors.
  • You do not need perfect automatic transcription to implement word search for your Open Educational Resources.
  • Time-coded transcriptions are important for creating captions, chapters or markers for your Open Educational Resources. Automatic speech-to-text alignment can help if you already have a manual transcription.

  • Adobe Premiere Pro is excellent for video editing, but not for automatically transcribing thousands of podcasts. Its Speech Analysis tool can be helpful if you need to transcribe one or a few audio or video podcasts, but it is not suited to batch processing. In contrast, CMU Sphinx allowed us to run batch transcriptions of thousands of podcasts efficiently.
  • The Pareto principle (or 80/20 rule) applies to automatic keyword generation from automatic transcriptions. We would need to dedicate 80% extra time to generate accurate automatic keywords for the 20% of our podcasts with difficult conditions (poor recording conditions, distant microphones, multiple speakers, specialised vocabulary, multiple accents, etc.). We were able to generate accurate keywords for the majority of our podcasts without having to deal with those issues. The podcasts that are difficult to transcribe automatically could be transcribed manually in the future, or wait for further funding.
  • The use of a High Throughput Computing cluster (Condor) was extremely beneficial for the project. We could submit all the transcription jobs to the cluster and get the results in a timely manner. Usually there were up to 60 transcription jobs running in parallel in the cluster.
  • The combination of skills of the project members was an important factor to the success of this short project. We had a diversity of skills in our team, from open educational resources to natural language processing, automatic speech recognition and web development.
  • The variety of ways of representing time-coded transcripts was also a subject of discussion during the project. In the end, we decided on a TEI/XML representation of the automatic/manual transcription that includes the time information and the automatic keywords. From this representation, a transcription can be exported into a variety of formats (text, HTML, SRT, WebVTT, XML) in the online caption editor platform we developed.

SPINDLE Speech to text caption engine for media producers

The SPINDLE project is proud to announce that we have extended the functionality of the keyword extraction tool by creating a much bigger set of tools based around a captioning editor. The idea is that the tools will hide some of the complexity of moving media between various research tools and allow cataloguers to work together on the correction of online media podcasts. We’re calling this prototype toolset the SPINDLE caption engine, but it does much more than captions.

Many thanks to our programmers Sergio Grau and Jonathan Oddie for many weeks of solid coding in this new area over the summer. A huge learning curve for everyone.

The main functionality of this online toolkit is threefold:

  • Batch Speech to Text Conversion
  • Caption Editor
  • Export Tool

Batch Conversion

The first area the toolkit covers is allowing a manager to send batches of files to a server instance of the CMU Sphinx speech-to-text application. The conversion of audio to text within Sphinx is relatively slow, roughly 1–2× real time, so it is best to batch-encode files overnight. The SPINDLE system supports this by reading in a list of files and then allowing the manager to select and prioritise the files to be converted to time-coded text.
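A heavily simplified sketch of how such a queue can be wired up with Celery and RabbitMQ (the stack listed under Technical Details below) is shown here. The task and command names are illustrative assumptions, not the actual spindle-code API, and the real system adds prioritisation and progress tracking on top.

```python
# Sketch of an asynchronous transcription queue using Celery with a RabbitMQ
# broker. Task and command names are illustrative; see the spindle-code
# repository for the real implementation.
from celery import Celery
import subprocess

app = Celery("spindle_sketch", broker="amqp://localhost")

@app.task
def transcribe(media_path, out_path):
    """Run a (hypothetical) command-line CMU Sphinx transcriber on one file.
    Sphinx runs at roughly 1-2x real time, so jobs are queued and left to
    run overnight rather than handled inside a web request."""
    subprocess.run(["sphinx_transcribe", media_path, out_path], check=True)
    return out_path

def queue_batch(media_files):
    """Queue a prioritised list of files; workers pick them up as they are free."""
    return [transcribe.delay(path, path + ".txt") for path in media_files]
```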

Caption Editor

As we have discussed in detail throughout the project, there will always be a need for a correction tool when using automatic speech-to-text. The project has therefore developed a simple editor to correct mistakes in the text and, importantly, to add punctuation. The editor preserves the original time information. Captioning can be started from scratch via a movie or audio URL, or from the output file of the Sphinx batch process. The toolkit also supports simple manual importing of the two types of marked-up text file the project has been testing:

  • CMU Sphinx format
  • Adobe Premiere XMP format

The caption editor has a list of chunked text entry slots that are synced to a time period in the video. The idea is that you use the play/pause buttons under the recording, perhaps using the HTML5 control to play the recording 75% slower, and start correcting the automatic speech-to-text output. It took quite a while to find a good compromise in the UI that lets a cataloguer break the text up into chunks suitable for overlay captions yet still navigate quickly up and down the file. Correction remains a slow, time-consuming process, so there is the ability to stop and come back later, with an indicator showing the percentage of the file transcribed.

Features include:

  1. Allowing user corrections and punctuation
  2. Generates time-coded captions in the HTML5-supported WebVTT format
  3. Generates a keyword list from transcript text
  4. Exports to caption format (WebVTT), plain text or HTML transcript and archive XML

Automatic keywords

The caption toolset supports the core work done on automatic keyword extraction. Keyword quality obviously improves as you tidy up the mistakes in the captions, but, as we have seen over the course of the project, the algorithm we developed still performs well on uncorrected text.

Technical Details

Server:

  • Python (Django) Server
  • Twitter bootstrap UI framework
  • Celery (celeryproject.org) for queuing
  • Also requires a message queue server (RabbitMQ)
  • Installation of CMU Sphinx (optional)

Transcript Editor:

  • AJAX-based
  • jQuery and backbone.js
  • REST API through django-tastypie

Searching the Great Writers Library

‘Libraries are the wardrobes of literature, whence men, properly informed, may bring forth something for ornament, much for curiosity, and more for use’ – William Dyer

Are you looking for a particular ebook or a lecture on a literature topic? Try the Great Writers Inspire Resources Library. Here, you can search for and access thousands of literary open education resources such as audio and video podcasts, essays, ebooks, images and more. Explore the virtual shelves by searching by keyword in the title or author and, importantly, by media type.

The new portal has hundreds of free Oxford literature lectures, over 1,500 ebooks and many thematic essays.

If you don’t know where to begin, you can browse our writer collections and themed collections for inspiration.

The Great Writers library:
http://writersinspire.org/library

The Great Writers Inspire collections:
http://writersinspire.org


Project Spindle Update: Condor Cluster

Stop the presses! As part of the SPINDLE project we are running the whole University of Oxford podcast database through an automatic speech recognizer using the Phonetics Laboratory Condor cluster.

Condor is a workload management system for compute-intensive jobs such as automatic speech recognition. We can submit jobs to the cluster and it will process them whenever a node is available; so far, up to 50 speech recognition jobs have been running in parallel. As an example, we took one of these automatic transcriptions straight from the cluster and extracted the relevant keywords (as explained in this previous post) from the automatic transcription of the podcast Understanding Financial Control from the Building a Business series.
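For readers unfamiliar with Condor, submitting such a batch looks roughly like the sketch below: a small submit description is written for each podcast and handed to condor_submit, and Condor schedules the jobs as nodes become free. The run_sphinx.sh wrapper script is a placeholder, not part of the SPINDLE code.

```python
import subprocess
from pathlib import Path

# One submit description per podcast; run_sphinx.sh is a hypothetical wrapper
# script around the speech recogniser.
SUBMIT_TEMPLATE = """\
executable = run_sphinx.sh
arguments  = {audio}
output     = logs/{stem}.out
error      = logs/{stem}.err
log        = logs/{stem}.log
queue
"""

def submit_batch(audio_files):
    """Write a Condor submit file for each audio file and pass it to
    condor_submit; the cluster runs the jobs whenever nodes are available."""
    Path("logs").mkdir(exist_ok=True)
    for audio in audio_files:
        stem = Path(audio).stem
        sub_file = Path(stem + ".sub")
        sub_file.write_text(SUBMIT_TEMPLATE.format(audio=audio, stem=stem))
        subprocess.run(["condor_submit", str(sub_file)], check=True)

submit_batch(["podcasts/financial_control.wav"])
```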

More results coming soon, stay tuned.


Spindle Project in a snapshot

Please find below a figure representing the main goal of the SPINDLE project, using as an example one of our most successful podcasts, The nature of human beings and the question of their ultimate origin.

An additional figure below describes the project workflow for obtaining those keywords automatically. Note that the automatic transcription using automatic speech recognition is 56.29% accurate (out of 100 words, around 56 are correct) and that we automatically obtain the most relevant keywords using the log-likelihood measure, as explained in this blog post. The word cloud (using Wordle) of the resulting 100 most significant keywords can be found in the following figure.

If you would like more information about the SPINDLE project, please see our recent blog posts.


PDF, XML, TextGrid, XMP, TXT and then…

As part of the SPINDLE project we are producing a set of automatic transcriptions and automatic keywords for the University’s podcasts to improve OER discoverability. In this post we analyse the variety of formats already in use to represent these transcriptions.

At the moment, we already have a small set of human transcriptions stored as .pdf and .xml files.

Please find below a snapshot of the pdf file for the podcast Globalisation and the effect on economies:

Please find below a snapshot of the .xml file for the same podcast:

If we perform the automatic speech-to-text alignment we obtain a Praat TextGrid containing the time information for each individual word, which can be accessed as a regular text file, as below,

or visualised using Praat:

We are also creating automatic speech-to-text transcriptions using Large Vocabulary Continuous Speech Recognition software.

If we use Adobe Premiere Pro we will obtain an XMP file. Please find below a snapshot of the XMP file (note the low word accuracy for the first sentence; the total word accuracy of the automatic transcription is 66.08%):

If we use Sphinx-4 we obtain a text output that can be post-processed into any other format (again with very low word accuracy in the first sentence; total word accuracy 46.56%).

So far we have the following file formats: XML, PDF, TextGrid, XMP and TXT, and we would like to obtain a unified representation of our transcriptions that includes the time information. We are thinking of using TEI/XML, similar to the approach we suggested for linking transcriptions to the British National Corpus audio. This TEI/XML representation could then be exported to a variety of formats such as TTML, SRT or WebVTT. Should we use this TEI/XML representation, or should we use a video caption standard such as TTML or WebVTT directly? Pros and cons? We will report back with an answer soon. Meanwhile, any thoughts or suggestions are welcome. Stay tuned!
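To make the question concrete, the sketch below builds a very reduced TEI-flavoured document from time-coded words: a timeline of <when> points plus <w> elements pointing at them. It is only an illustration of the kind of master representation we have in mind, not schema-valid TEI, and the element and attribute choices are simplifications.

```python
import xml.etree.ElementTree as ET

def words_to_tei(words):
    """Wrap (start_seconds, word) pairs in a much-simplified TEI-style document:
    a <timeline> of <when> points plus <w> elements whose @synch attributes
    point back at those points. Illustrative only, not schema-valid TEI."""
    tei = ET.Element("TEI")
    text = ET.SubElement(tei, "text")
    timeline = ET.SubElement(text, "timeline", unit="s")
    utterance = ET.SubElement(text, "u")
    for i, (start, word) in enumerate(words):
        ET.SubElement(timeline, "when",
                      attrib={"xml:id": f"t{i}", "absolute": f"{start:.2f}"})
        w = ET.SubElement(utterance, "w", synch=f"#t{i}")
        w.text = word
    return ET.tostring(tei, encoding="unicode")

print(words_to_tei([(0.12, "globalisation"), (0.85, "and"), (1.02, "the"), (1.20, "effect")]))
```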


Great Writers Going for Gold

When UK album sales reach 100,000, the album is awarded a gold disc by the BPI. Using that as a benchmark, the Great Writers Inspire series of short talks has reached a total of 118,579 downloads through iTunes U. We knew these inspirational talks, donated by Oxford academics for the benefit of all for free, were gold standard, but it is nice to see that thousands of people across the world agree.

This series of talks was promoted by iTunes U from 1 May (see Peter Robinson’s post) and we have benefited from some pretty impressive downloads ever since. When one of our podcast series is promoted on the iTunes U site it does wonderful things to our download statistics. And of course, once a series receives high downloads because of the promotion, it appears in the top download charts and the ‘what’s hot’ section, and so it continues. Our Great Writers Inspire series of talks has experienced this success and has regularly appeared in the global top 10 in recent weeks. The figure of 118,579 covers audio and video downloads from mid April until the second week in July.

Download figures are influenced by all sorts of things: recordings which appear at the top of the list always receive higher downloads, as do talks with familiar titles. Stars of the series include Chaucer by Professor Daniel Wakelin with over 10,000 downloads and Shakespeare and the Stage by Professor Tiffany Stern with more than 9,500. Jane Austen’s Manuscripts Explored by Professor Kathryn Sutherland has achieved over 8,000 downloads in just 5 weeks! High download statistics are not the only indicator of quality, and talks with more obscure titles naturally appear lower in our charts. Even so, talks on lesser-known writers have achieved still-impressive download figures of 5,000 and more.

If you haven’t already done so, please visit the talks on iTunesU or via the University of Oxford’s podcast portal – a global audience of this size can’t be wrong!


Schools engagement – feedback from students and teachers

Two of our Student Ambassadors recently visited a local school to show them the Great Writers Inspire website. Cleo Hanaway has reported back on the feedback received from both the students (Year 12) and their teachers in a couple of posts on the Great Writers Inspire blog. These posts are repeated here.

On Friday 22nd June I returned to my old school – Cheney, Oxford. Accompanied by fellow student ambassador Kate O’Connor, I introduced A-level English students (year 12, going on 13) to http://writersinspire.org/, discussed my ‘great writer’, and received some really useful feedback on the website.


From a personal point of view, it was great to see how the school had progressed since I left (ten years ago!). When I was there, we just had a couple of computers in the library – now there are around 30! The librarian and teachers are very keen to use online learning resources where possible; they were very interested to find out what http://writersinspire.org/ has to offer.

The students’ feedback focused on seven main areas: usability; layout; writers; themes; ebooks; podcasts; essays. Below, I have transcribed a list of direct quotations from the students.

Before you read the list of students’ quotes, I’d like to point out that we’ve already acted upon some of their suggestions. For example, as requested by Cheney, both Oscar Wilde and Thomas Hardy are now in our writers list. Unfortunately, it doesn’t look like we’re going to be able to move forward with George Orwell – he’s still in copyright. We’ve also improved our search functionality; if you search for ‘pastoral’, for example, you now get two pages of search results showing all of the items which include the word ‘pastoral’.

USABILITY:

Good points:

  • ‘It was easy to navigate around – it was really well laid out’
  • ‘It was good that it was a collection where you could easily find stuff’

Areas for improvement:

  • ‘My only issue about the website is the search function. It’s hard to get specific themes like “Pastoral”, for example – we’re doing that at the moment’
  • ‘It would be good to have, like, a blog for users – or a discussion forum’

LAYOUT:

Good points:

  • ‘It looks very nice’
  • ‘It’s a good layout’
  • ‘I thought the layout was really nice’
  • ‘I thought the layout of the website was very easy to navigate and pleasant’

Areas for improvement:

  • ‘If it was more kind-of-like “jumps-out-at-you” then it would be more like a cool a website. Maybe more colourful and things popping out at you – I don’t know. It’s quite dull when you look at it.’

WRITERS:

Good points:

  • ‘My favourite part was the selection of writers that are available’
  • ‘We were researching Charlotte Bronte and we liked the fact that there was stuff about her personal life, not just her work; you can get a more rounded view’
  • ‘It was great learning about Aphra Behn – we’d never heard of her before and she’s really interesting’

Areas for improvement:

  • ‘You should have Orwell and Wilde’
  • ‘Thomas Hardy, George Orwell, and Oscar Wilde weren’t on there – we learn about them in school and it would be useful to have them on here’
  •  ‘It would be good to have more obscure authors that we wouldn’t learn about in school’
  •  ‘It would be helpful to have a small amount on more writers, rather than just not having them at all’
  • ‘It would be good to have links to similar writers – it would help with comparative coursework’
  • ‘In the writers section it would be good to differentiate between poets, novelists, and dramatists, for students who don’t know anything about them yet’

THEMES:

Good points:

  • ‘I really like the authors and themes – being able to go through it in different ways’
  • ‘The section on Victorian Gothic is really good – we’re doing that at school at the moment’
  • ‘I thought the themes things were interesting. I was recently researching Victorian Gothic and it took quite a while. It was useful having it all here’

Areas for improvement:

  • ‘It would be good to split up Shakespearean tragedy and comedy’
  • ‘In the themes section it would be useful to have a sample of a political work’
  • ‘A couple more themes would be good’

eBOOKS:

Good points:

  • ‘The library is really useful – it’s so wide-ranging’
  • ‘I thought the fact that you could read ebooks that are now out of copyright was really helpful – like Ulysses’

Areas for improvement:

  • ‘The massive PDFs took a long time to download’

PODCASTS:

Good points:

  • ‘It was really good having video recordings and sound recordings’
  • ‘I really liked the lectures and stuff’
  • ‘I liked the lectures – I think you should really like big this up as this is what’s special about this website’

Areas for improvement:

  • ‘It would be good to have a comments section for the videos, so you could say, like “there’s a really good bit about 2 minutes in”’

ESSAYS:

Good points:

  • ‘The bibliographies at the end of each essay are useful for further research’
  • ‘The biographies of the writers give a real insight into the writers’ lives – it’s good’
  • ‘The biography section is really interesting. A lot of times when I look up writers’ lives I’m not sure if it’s true – websites write different things to each other. By looking it up on here I’m more sure that it’s trustworthy’
  • ‘I thought the short essays on authors and themes were really useful. They were concise and give you good background, so you don’t have to trawl through – like – really useless stuff on the web’

Areas for improvement:

  • ‘It would be useful to have a list of other books (with synopsises) at the end of the author essays’

Below are some brief comments from two of Cheney’s English teachers: Pat Tope, Leader of Key Stage 5, and Gary Snapper, a department teacher with a research interest in the transition from A-Level to university.

The quotes below have been transcribed from an audio recording.

Pat Tope:

‘It was useful to give us a forum for people to discuss literature. Some of the students have definitely got into considering writers that they wouldn’t have considered before. For example, two girls were interested in Aphra Behn and they’d never heard of her before. They were interested in the fact that she was such an early female author. I think the website prompts that sort of thing; students can investigate aspects of literature that they wouldn’t have thought of before. I think that the problem with the site is that it is very random in terms of the people that you’ve got there – it’s difficult to direct students there and say, well, “whatever you want you’ll find it here”. It’s a little bit hit and miss as to whether you’d find it or not. So that would be the issue. But no, I liked the way that both of you interacted with the students; it was really good and really appropriate. It was great.’

Gary Snapper:

‘I think the session was great. I really think that having people come in from outside, from a university, is a very very positive thing because it, simply by your presence and by seeing people who have gone to the next stage and the stage after that, brings literature alive. It brings literature alive in a way we can’t do because we’re fixtures. I think that connection is always very important and I always look forward to that. It just makes them think a little bit differently and a bit more widely. But, beyond that, the site itself was very useful in doing that. I particularly like the lectures on the site; I think they’re very very useful. We’ve found that the recent proliferation of good lectures on the internet is great – although there aren’t many about. It is good to have another source of them. In fact, I have used the Oxford University site that has lectures on it already. As a way of finding out what’s there – although, as Pat says, it’s still a bit random – it’s really useful. It will be useful for specific texts both for us and for them. For instance, when we come to do As You Like It, and do a bit more of the Gothic and the Pastoral, I think it will come into its own. But obviously it would be great if it were more consciously geared to what A-Level students are actually doing. And in terms of the timing, although it was good in that it coincided with students thinking about their comparative coursework, it would have also been good timing at a stage later in the middle of year 13 or a few weeks into year 13 when they were beginning to get into the texts themselves and exploring ideas in more depth. I think that the session was really useful and just about the right length, well-timed – although, perhaps a little less time on browsing the website might have been better. Again, if it had been later in the course and they had been looking at specific texts they could have spent a bit more time – things are still a bit general at the moment.’

You can view some of Gary Snapper’s research articles here.
