University media services are traditionally staffed by content creators trained in audio and video filming and editing, yet they are increasingly tasked with providing media for multiple open online delivery platforms, which entails complex workflows, new production costs, and greater IT and cataloguing skills. Digital audio and video resources ideally need time-coded transcripts (in part to meet accessibility requirements), rich subject metadata (as no free-text indexing is available), and multiple open formats suited to online delivery platforms (e.g. mobile, iOS).
A typical HE production workflow for a central media service might generate around 500 to 1,000 audio-video items per year, with potentially a good proportion of these released openly as OER under Creative Commons licences for public outreach. This equates to around 750 hours of material per annum to process. Producing transcripts is expensive: around £120 per hour of material for human transcription services, with additional costs for organising the material, checking timecode alignment, exporting to text formats and creating subject-related keywords. Keywords need to be generated and, ideally, exposed on the same web page as the media so that search engines can index them, which improves content discoverability. The annual cost of extended cataloguing and full transcripts can therefore be prohibitive for general use.
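To make the scale of that cost concrete, the two figures quoted above can be multiplied out. This is only a rough sketch: the function name is illustrative, and the extra costs for organising, alignment checking, export and keywording are not quantified in the text, so they are left out.

```python
# Figures quoted in the text above; both are approximate.
HOURS_PER_YEAR = 750        # hours of material to process per annum
COST_PER_HOUR_GBP = 120     # human transcription cost per hour of material


def annual_transcription_cost(hours: float, rate_per_hour: float) -> float:
    """Return the yearly cost of human transcription alone.

    Hypothetical helper: it deliberately ignores the additional
    cataloguing costs, which the text does not put a number on.
    """
    return hours * rate_per_hour


baseline_cost = annual_transcription_cost(HOURS_PER_YEAR, COST_PER_HOUR_GBP)
print(f"Human transcription only: £{baseline_cost:,.0f} per year")
```

On these figures, human transcription alone comes to roughly £90,000 a year before any of the additional cataloguing work is counted.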
Systems and tools are needed to facilitate the semi-automatic processing of audio or video files into time-coded transcripts with a set of relevant subject keywords. All automatic transcription services (ATS) using LVCSR software have a large degree of error, estimated at three incorrect words in every ten, but ATS do offer potential for much richer subject-specific metadata for cataloguing purposes. If new OER tools and a production workflow could be developed to expose this targeted vocabulary of keywords online in the HTML (via LRMI RDFa and Schema.org microformats) and in RSS feeds, it would make the OER material more accessible to users through improved discoverability in search engines, and greatly improve cataloguing locally and in OER repositories.
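As a rough illustration of the idea, the sketch below pulls frequent words out of a transcript and wraps them in a small HTML fragment. It is not the project's method: the frequency-based extraction, the tiny stop-word list, the function names, and the choice of RDFa Lite attributes with the Schema.org `AudioObject` type and `keywords` property are all assumptions made for the example.

```python
import html
import re
from collections import Counter

# Small illustrative stop-word list (an assumption); a real one would be longer.
STOP_WORDS = {"that", "this", "with", "from", "have", "which",
              "there", "their", "about", "would"}


def extract_keywords(transcript: str, top_n: int = 5, min_length: int = 4) -> list[str]:
    """Return the most frequent longer words in a transcript.

    Simple frequency counting is used here because repeated subject terms
    tend to survive even when individual words are misrecognised.
    """
    words = re.findall(r"[a-z]+", transcript.lower())
    counts = Counter(
        w for w in words if len(w) >= min_length and w not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(top_n)]


def keywords_to_rdfa(title: str, keywords: list[str]) -> str:
    """Render a minimal RDFa Lite fragment using Schema.org terms."""
    safe_title = html.escape(title)
    safe_keywords = html.escape(", ".join(keywords))
    return (
        '<div vocab="https://schema.org/" typeof="AudioObject">\n'
        f'  <span property="name">{safe_title}</span>\n'
        f'  <meta property="keywords" content="{safe_keywords}">\n'
        '</div>'
    )


sample = ("Globalisation affects trade. Trade and globalisation "
          "shape the economy, and trade grows.")
print(keywords_to_rdfa("Globalisation", extract_keywords(sample, top_n=2)))
```

A production workflow would likely need a proper stop-word list and some human review of the suggested keywords before they are published alongside the media.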
Sergio Grau, Programmer for SPINDLE, writes …
As part of the SPINDLE project we are going to try to let users know what is inside a podcast. Our goal is to automatically generate additional information to complement the existing metadata (title, series, speakers, unit, short description and list of keywords).
As part of this additional information we are going to use Speech-to-Text software to automatically generate the transcription of an audio podcast. This automatic transcription can be used to index and describe the existing audio data. It could also be useful as a starting point for human-generated transcriptions.
How do we automatically generate the transcription of an audio podcast?
Audio podcasts are going to be converted into text automatically using state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) software.
Which software packages are we going to use?
We are going to evaluate two different software packages: an open-source package, CMU Sphinx, and a commercial video-editing package, Adobe Premiere Pro.
How do we evaluate the performance of each software package?
We are aware the automatic transcription process is not going to be 100% accurate (not even humans can transcribe 100% accurately). In order to know how good our automatic transcription is, we are going to run an experiment using a test corpus.
The test corpus is going to be composed of 20 audio podcasts from the University of Oxford podcast directory. We are selecting podcasts that have already been transcribed by humans. We will select podcasts covering a variety of accents, topics and recording conditions to evaluate the robustness of the LVCSR software.
We will then compare the automatically generated transcriptions obtained using both CMU Sphinx and Adobe Premiere Pro with the human-generated transcriptions to evaluate the overall performance of both packages.
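One common way to make such a comparison concrete is the word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the automatic transcript into the human one, divided by the number of words in the human transcript. The post does not say which metric the project will use, so WER is an assumption here, and the function below is a minimal sketch rather than the project's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate of `hypothesis` against `reference`.

    Uses a word-level Levenshtein (edit) distance. Text is lower-cased and
    split on whitespace; punctuation handling is left out for brevity.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


human = "the cat sat on the mat"
automatic = "the cat sit on mat"   # one substitution, one deletion
print(f"WER: {word_error_rate(human, automatic):.2f}")  # prints "WER: 0.33"
```

On this measure, an error level of three incorrect words in every ten would correspond to a WER of about 0.3.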
Here is the waveform for the podcast on Globalisation that was mentioned in the first SPINDLE post. Note that it is impossible to recognise anything other than perhaps gaps between words.