Automatic Speech-to-Text Transcription: Preliminary Results

As part of the SPINDLE project, we are running a series of experiments to evaluate Large Vocabulary Continuous Speech Recognition software for the automatic transcription of podcasts. We already know that the automatic transcription will not be 100% accurate at the word level, but we expect it to be ‘good enough’ to enrich the existing metadata of the University podcasts with a set of keywords generated from it.

Today we present some preliminary results for three podcasts already available on the University of Oxford Podcasts website. We used the Speech Analysis tool from Adobe Premiere Pro CS5 to transcribe these three podcasts automatically, selecting the English UK language option and the High (slower) quality parameter.

The table below shows the characteristics of the three podcasts (title, duration, number of words in the manual transcription and number of words in the automatic transcription). We report the automatic transcription results in terms of Word Accuracy, computed from the Levenshtein distance between the manual transcript and the automatic transcript (the higher, the better).

Title                                                                  Duration  # Words (manual)  # Words (Premiere)  Word Accuracy
Copenhagen COP 15: What happened and What next?                        1:08:12   11194             8744                17.14%
Global Recession: How Did it Happen?                                   36:21     6019              6044                36.37%
The nature of human beings and the question of their ultimate origin   1:28:16   12916             13021               56.29%
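
For readers who want to reproduce this kind of measurement, here is a minimal sketch of a Word Accuracy computation based on a word-level Levenshtein distance. It is an illustration and not the exact script behind the figures above: the function names, the lower-casing and whitespace tokenisation, and the normalisation by the number of words in the manual transcript are all assumptions on our part.

```python
def word_edit_distance(reference, hypothesis):
    """Word-level Levenshtein distance between two lists of words."""
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(manual_text, automatic_text):
    """Assumed definition: 1 - distance / number of words in the manual transcript."""
    # Assumption: lower-case and split on whitespace; punctuation is not stripped.
    reference = manual_text.lower().split()
    hypothesis = automatic_text.lower().split()
    if not reference:
        raise ValueError("manual transcript is empty")
    return 1.0 - word_edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    manual = "the nature of human beings"
    automatic = "the nature of human being"
    print(f"{word_accuracy(manual, automatic):.2%}")  # one substitution in five words: 80.00%
```

Under this definition the score can drop below zero when the automatic transcript contains many extra words, so other normalisations (for example, dividing by the longer of the two transcripts) are possible.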

Analysing the results, we see that accuracy ranges from 17% to 56%. Why is it so variable? Listening to the recordings and analysing the audio signals, we find that the recording conditions of the three podcasts differ greatly, and we consider this the main factor behind such different results.

The first podcast contains background noise and even a video conference speaker.

The second podcast has a really low signal.

The last podcast was professionally recorded and edited and therefore obtains the best results.

Other factors may also affect the accuracy of the automatic transcription, such as the podcast topic (language model), out-of-vocabulary words (dictionary) or accents (acoustic model).

In the following weeks we will report on how we generate keywords automatically from these automatic transcripts. Stay tuned!
