Can we use automatic speech-to-text technology to generate better cataloguing of our open content?
A new project at OUCS starts in March, funded by JISC as part of their ‘Rapid Innovation in Open Educational Resources’ programme. SPINDLE brings together the Open Spires / iTunes U media team at IT Services with experts in linguistics and speech technology from the Phonetics Laboratory and the Faculty of Linguistics, Phonetics and Phonology to work on the University’s growing collection of video and audio podcasts of lectures. The project will experiment with speech-to-text technologies to automatically create transcripts of lectures, and develop new tools to generate better keywords to help with the indexing and description of the lectures.
The success of automatic speech-to-text processing depends on various factors, including the quality of the recording; the accent, dialect, and speech style of the speaker; and the topics of the discourse.
Unfortunately the Oxford University podcasts exhibit a lot of variability in all of these aspects! The project team will report on these factors, and investigate and discuss approaches that can be shared with similar academic organisations with large collections of content suitable for digitization and indexing. The project is funded by JISC in order to develop these techniques and technologies for the benefit of the wider academic community, and an important part of the project will be to report, in the form of publicly available blogs, on the various technical options, barriers, problems, and work-arounds.
The project aims to:
- Make demonstrators in HTML5 of synchronised media playback utilising time-coded transcripts and keywords
- Automatically align a series of 300 existing manually transcribed lectures with their media (speech-to-word alignment) to generate time-coded transcripts for baseline comparison and evaluation tests
- Report on the success of two large-vocabulary continuous speech recognition (LVCSR) software toolkits for batch speech-to-text transcription:
  - Using an open-source speech recognition toolkit – CMU Sphinx
  - Using a traditional media services desktop editing system – Adobe Premiere
- Generate and store lists of time-coded keywords by filtering with:
  - Scripts to remove high-frequency function words
  - Scripts for rapid lexicon development
- Generate code and algorithms to help parse any text file for non-common spoken words
- Transfer all outputs into the two main captioning formats – TTML and WebVTT
- Store the associated keywords in the institutional media OXITEMS database
- Expose the material in RSS, and an open source Drupal CMS delivery platform:
  - Expose keywords in the HTML (using RDFa / LRMI and Schema.org microformats)
  - Expose tags that allow searches by keyword
- Report on the SEO and discoverability implications of improved keyword capturing, including potential guidance for interested parties.
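The keyword-filtering aim above (removing high-frequency function words from a time-coded transcript) can be sketched roughly as follows. This is only an illustration, not the project's scripts: the `FUNCTION_WORDS` stop list, the `extract_keywords` name, and the `(start_seconds, word)` input shape are all assumptions made for the example.

```python
from collections import Counter

# Assumption: a tiny illustrative stop list. A real list of high-frequency
# function words would be much longer and tuned to lecture speech.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is", "it",
    "that", "this", "we", "for", "on", "as", "with",
}


def extract_keywords(timed_words, top_n=5):
    """Return the top_n most frequent non-function words.

    timed_words: list of (start_seconds, word) pairs (hypothetical format).
    Result: list of (keyword, count, first_start_seconds).
    """
    counts = Counter()
    first_seen = {}
    for start, word in timed_words:
        # Normalise case and strip surrounding punctuation.
        token = word.lower().strip(".,;:!?\"'()")
        if not token or token in FUNCTION_WORDS:
            continue
        counts[token] += 1
        # Remember where the keyword first occurs so a player can seek to it.
        first_seen.setdefault(token, start)
    # Most frequent first; ties broken by earliest occurrence.
    ranked = sorted(counts, key=lambda w: (-counts[w], first_seen[w]))
    return [(w, counts[w], first_seen[w]) for w in ranked[:top_n]]
```

Keeping the first time code alongside each keyword is what would let a keyword index link back to a point in the media.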
|WP1||Write project plan. Set up project blog tag. Plan dissemination activities including presentation at Beyond Text and video overview. Schedule start-up meeting with academic partner. Survey the OER collections for a suitable series of early tests. Video screencast of objectives released. Deliverable: project plan, project blog, dissemination plan||M1|
|WP2||Make demonstrators in HTML5 of synchronised audio playback utilising test media, time-coded test transcripts in TTML and WebVTT. Repeat with video. Test HTML5 demonstrator with keyword index linking to time code points within the media and to example related OER services. Deliverables: A series of HTML5 demonstrations||M1-3|
|WP3||Automatically align a series of 300 existing manually transcribed lectures with the associated media to generate a time-coded transcript series. Deliverables: Time-coded transcripts for baseline comparison tests with automatic versions||M1-2|
|WP4||Testing two methods of speech to text conversion and comparing results with hand transcribed baseline material.
New Caption tools developed to convert the output formats to WebVTT caption formats. Test with ~ 500 hrs.
|WP5||Generate Keyword parser code and store lists of time-coded keywords by filtering with:
Deliverable: Parser scripts and workflows, experimental simplified web application to link the services
|WP6||Expose the material in an open source CMS delivery platform – Drupal:
Deliverable: Report on the discoverability implications of improved keyword capturing. Final outputs and code deposited in IT Services github, summary blog reports. Videos of project activities and training material released.
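As a rough idea of what the caption conversion step in WP4 involves, the sketch below writes time-coded transcript segments out as WebVTT. It is a minimal illustration, not the project's caption tools: the function names and the `(start, end, text)` segment format are assumptions, and it presumes the cue text contains no blank lines or `-->` sequences, which WebVTT does not allow inside a cue.

```python
def format_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp (hh:mm:ss.mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(segments):
    """Build WebVTT file content from (start, end, text) segments.

    segments: iterable of (start_seconds, end_seconds, text); this input
    shape is hypothetical, not a recogniser's actual output format.
    """
    lines = ["WEBVTT", ""]  # required header followed by a blank line
    for start, end, text in segments:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(text.strip())
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)
```

For example, `to_webvtt([(0.0, 2.5, "Welcome to the lecture.")])` yields a single cue that an HTML5 `<track>` element could load alongside the audio or video.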
Factors that affect automatic speech to text success
The success of automatic speech-to-text processing depends on various factors, including the quality of the recording; the accent, dialect, and speech style of the speaker; and the topics of the discourse. Unfortunately the Oxford University podcasts exhibit a lot of variability in all of these aspects! The project team will test the speech-to-text tools on a variety of lectures and report on the key factors that affect the success of the automatic transcriptions compared to a human transcriber. It’s hoped that we can show that under ideal recording conditions the machine transcriptions will offer efficiencies to a media service. The recording factors will affect the quality of the keywords and further captioning work, and SPINDLE will investigate and discuss technical approaches that may improve cataloguing and that can be shared with similar academic organisations with large collections of content suitable for digitization and indexing. The project is funded by JISC in order to develop these multi-disciplinary techniques and technologies for the benefit of the wider academic community, and an important part of the project will be to report, in the form of publicly available blogs, on the various technical options, barriers, problems, and work-arounds.
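One common way to compare an automatic transcript against a human one is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the machine output into the reference, divided by the reference length. The post does not say which measure the project will use, so the following is only an illustrative sketch with a hypothetical `word_error_rate` helper and a deliberately simple whitespace tokenisation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts / reference length.

    reference: the human (baseline) transcript as a string.
    hypothesis: the automatic transcript as a string.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A lower value means the machine transcript is closer to the human baseline; because insertions count as errors, the value can exceed 1.0 for very poor output.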
Why do text versions and better keywording of audio and video lectures matter?
The team will also investigate further potential uses of the transcripts, such as providing full text transcripts for accessibility, and full-text search of the lectures, and will build demonstrators of synchronised media playback utilizing time-coded transcripts in the emerging HTML5 standard. As well as enhancing the usability and discoverability of the OERs, the transcripts, time-aligned with the digital audio and video, will also represent a useful research resource for those interested in studying language, for example in linguistics and English-language teaching, and as much-needed training data for further developments of speech recognition and alignment.
The project will allow much more accessible material to be produced by media teams working with Open Educational material. One last teaser from the Great Writers project:
Open Spires – Open Education projects - http://openspires.oucs.ox.ac.uk/
Open Spires Podcasts released for reuse with Creative Commons licence – http://podcasts.ox.ac.uk/open/
SPINDLE project blog posts – http://blogs.it.ox.ac.uk/openspires/spindle
JISC Technical Strand – Rapid Innovation in Open Educational Resources