What is SPINDLE?
SPINDLE was a project funded by JISC as part of its “Rapid Innovation in Open Educational Resources” programme. The project experimented with speech-to-text technologies to automatically create transcripts of Open Educational Resources (OER), and developed new tools to generate better keywords to help with the indexing and description of OER.
How do I transcribe automatically from speech to text?
We investigated three options for automatic transcription of podcasts:
Adobe Premiere Pro is excellent for video editing, but not for transcribing thousands of podcasts automatically. Its Speech Analysis tool can be helpful for transcribing one or a few audio or video podcasts, but it cannot batch-process transcriptions. CMU Sphinx, on the other hand, allowed us to run batch transcriptions of thousands of podcasts efficiently.
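As a rough sketch of how such a batch run might be driven (this assumes the `pocketsphinx_continuous` command-line decoder; the exact flags, models and output handling depend on your Sphinx installation, so treat the invocation below as illustrative):

```python
import subprocess
from pathlib import Path

def sphinx_command(wav_path):
    # Build the decoder invocation for one file; '-infile' tells
    # pocketsphinx_continuous to read audio from a wav file rather than a microphone.
    return ["pocketsphinx_continuous", "-infile", str(wav_path)]

def transcribe_directory(podcast_dir):
    # Run the recogniser over every .wav file in a directory, collecting
    # the hypothesis text the decoder prints to stdout.
    results = {}
    for wav in sorted(Path(podcast_dir).glob("*.wav")):
        completed = subprocess.run(sphinx_command(wav),
                                   capture_output=True, text=True)
        results[wav.name] = completed.stdout.strip()
    return results
```

A loop like this is what makes Sphinx suitable for thousands of podcasts where an interactive tool is not: no human input is needed between files.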
How accurate is automatic Speech to Text?
It depends. The key factor in our experience is the quality of the recording: a professional recording using a good tie-clip microphone gives the best results, while a microphone far away from the speaker in a noisy, echoing room gives the worst. It also depends, of course, on the clarity and accent of the speaker. In the very best cases we have had 6 out of every 10 words correctly transcribed, and the gist of the lecture is then obvious. With poor-quality recordings this can drop to as low as 3 out of 10 words, in which case the result is probably too confused to read as normal English and too poor to generate a good range of keywords. It’s important to realise that all automatic transcripts will need significant editing and checking, particularly to insert correct punctuation, in order to make them readable for human users.
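Figures like “6 out of 10 words” correspond roughly to word accuracy, i.e. one minus the word error rate (WER) that speech-recognition research normally reports. A minimal, illustrative way to compute word accuracy from a reference transcript and an automatic one (plain word-level edit distance, ignoring punctuation and casing concerns):

```python
def word_accuracy(reference, hypothesis):
    """Rough word-level accuracy: 1 - WER, via Levenshtein distance over words.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return max(0.0, 1 - dp[-1][-1] / len(ref))
```

So one wrong word in a five-word sentence gives an accuracy of 0.8, i.e. “4 out of 5 words”.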
How do I generate keywords automatically from transcriptions?
We used two methods:
AntConc is a desktop application (which works on Windows, Mac and Linux). Generating keywords involves starting the program, loading the text and the reference word list, and manually running the function that generates the keywords. The user can adjust various parameters and change the reference corpus, so we found this useful when we were investigating the best ways to generate relevant keywords. But the interactive nature of the application meant that it couldn’t be deployed in an automated workflow to generate keywords from multiple podcasts.
So, instead, we wrote a script to generate the keywords, which slotted into our automated workflow and could be invoked programmatically without human intervention.
How does the algorithm for keyword filtering work?
We compared the frequencies of words in the automatic transcription with their frequencies in the spoken part of a large corpus of English, the British National Corpus. Words that are repeated much more often than in normal speech are likely to be keywords.
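A minimal sketch of that comparison (a simple relative-frequency ratio; the actual SPINDLE scripts use a proper keyness statistic, and the reference counts below stand in for real BNC frequencies):

```python
from collections import Counter

def keyness(transcript_words, ref_freqs, ref_total, min_count=2):
    """Rank transcript words by how much more frequent they are
    in the transcript than in a reference corpus (e.g. the BNC).

    ref_freqs: word -> count in the reference corpus
    ref_total: total word count of the reference corpus
    """
    counts = Counter(w.lower() for w in transcript_words)
    total = sum(counts.values())
    scores = {}
    for word, c in counts.items():
        if c < min_count:
            continue  # very rare words are unreliable evidence
        # add-one smoothing so words unseen in the reference don't divide by zero
        ref_rate = (ref_freqs.get(word, 0) + 1) / (ref_total + 1)
        scores[word] = (c / total) / ref_rate
    # most over-represented words first: these are the candidate keywords
    return sorted(scores, key=scores.get, reverse=True)
```

Common function words like “the” occur often in the transcript too, but because they are just as common in normal speech their ratio stays near 1, so they fall to the bottom of the ranking.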
Where can I download the code generated during the SPINDLE project?
The code is available from https://github.com/ox-it/spindle-code/.
How can I align audio and transcription automatically?
We used the Penn Phonetics Lab Forced Aligner (P2FA), an application which has emerged from academic research in phonetics. The staff at the Phonetics Laboratory at the University of Oxford had identified this in an earlier JISC-funded research project as the state of the art for the automatic alignment of everyday contemporary English speech, and had gained expertise in using it. P2FA is free to download and use, and doesn’t have any licence conditions attached to it.
P2FA is a Python script which interfaces with the Hidden Markov Model Toolkit (HTK) aligner and with a set of good-quality acoustic models. It is necessary to install HTK and use it according to the HTK End User Licence Agreement, which is not restrictive in terms of how the software is used. HTK is usually available from http://htk.eng.cam.ac.uk, but the site was not accessible on 12-09-2012.
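In an automated workflow, P2FA can itself be driven as a subprocess. The sketch below assumes P2FA’s entry point is `align.py` taking a wav file, a plain-text transcript and an output path for the resulting Praat TextGrid; check the argument order against your copy of P2FA before relying on it:

```python
import subprocess
from pathlib import Path

def p2fa_command(wav_path, transcript_path, out_path, p2fa_dir="p2fa"):
    # Build the P2FA invocation; align.py shells out to HTK's HVite internally,
    # so HTK must be installed and on the PATH for this to work.
    return ["python", str(Path(p2fa_dir) / "align.py"),
            str(wav_path), str(transcript_path), str(out_path)]

def align(wav_path, transcript_path, out_path, p2fa_dir="p2fa"):
    # Run the aligner and fail loudly if it returns a non-zero exit code.
    subprocess.run(p2fa_command(wav_path, transcript_path, out_path, p2fa_dir),
                   check=True)
```

The output TextGrid contains word- and phone-level start/end times, which is what the captioning step consumes.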
What formats did you use for caption work?
We used WebVTT, a simple-to-understand HTML5 web format for presenting timed groups of words (captions) over a video.
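For illustration, here is a minimal helper that writes aligned cues in WebVTT form. The format is just a `WEBVTT` header followed by blank-line-separated cues, each with a `start --> end` timing line; the cue timings below are invented:

```python
def vtt_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp: HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def to_webvtt(cues):
    """Render a WebVTT file from (start_seconds, end_seconds, text) tuples."""
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(text)
        lines.append("")  # blank line terminates each cue
    return "\n".join(lines)
```

Because WebVTT is plain text, output like this can be hand-corrected by an editor and then attached to an HTML5 video via a `<track>` element.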