Automatic Speech-to-Text Alignment for Audio Indexing

A small percentage of our podcasts have been manually transcribed by a professional transcription company. For these podcasts, we do not need to generate an automatic transcription using Large Vocabulary Continuous Speech Recognition software. However, we still need to know when each word of the transcript is uttered in the podcast.

Our goal, then, is to automatically obtain a time-aligned word index of a podcast from its word transcript. We will use this time alignment information in the project's HTML5 demonstrators to index our audio podcasts, together with the keywords we will obtain automatically through the SPINDLE project.

To obtain this time-aligned index, we will automatically align each audio podcast with its transcription at word level. In other words, we will find the time at which each word is uttered in the podcast. We will perform the alignment using an automatic speech-to-text aligner (for example, the HTK-based P2FA).
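To make the idea of a time-aligned word index more concrete, here is a minimal Python sketch. It assumes the aligner has already given us one (word, start, end) entry per word, with times in seconds. The names `AlignedWord` and `build_word_index` are hypothetical, and the timings are made up for illustration; they are not results from our test set.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class AlignedWord(NamedTuple):
    """One word of the transcript with its position in the audio (seconds)."""
    word: str
    start: float
    end: float


def build_word_index(alignment: List[AlignedWord]) -> Dict[str, List[float]]:
    """Map each lower-cased word to the list of times at which it starts."""
    index = defaultdict(list)
    for item in alignment:
        index[item.word.lower()].append(item.start)
    return dict(index)


# Illustrative timings only (invented, not real aligner output).
alignment = [
    AlignedWord("Globalisation", 0.50, 1.32),
    AlignedWord("and", 1.32, 1.47),
    AlignedWord("the", 1.47, 1.58),
    AlignedWord("effect", 1.58, 2.01),
    AlignedWord("on", 2.01, 2.15),
    AlignedWord("economies", 2.15, 2.90),
]

index = build_word_index(alignment)
# Look up where a keyword is spoken, e.g. to seek an audio player to that point.
effect_times = index.get("effect", [])
```

A dictionary from word to start times is enough for jumping to a keyword in a player; a real index would probably also keep the end times and the speaker information.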

For example, the transcription of “Globalisation and the effect on economies” contains a word transcript and speaker information.

Please find below a snapshot of Praat showing the automatic alignment at word and phoneme level of “Globalisation and the effect on economies” obtained using P2FA.
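Praat stores alignments like this one in TextGrid files, so the word timings can also be read back by a script. The sketch below is only an illustration: it assumes the long (verbose) TextGrid text format, a tier named "word", and pauses labelled "sp". The function `read_tier_intervals` is a hypothetical helper, and the embedded sample is a toy file, not the alignment of the podcast above.

```python
import re

# Toy TextGrid in Praat's long text format (invented values).
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.0
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "phone"
        xmin = 0
        xmax = 1.0
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.4
            text = "sp"
        intervals [2]:
            xmin = 0.4
            xmax = 1.0
            text = "AH0"
    item [2]:
        class = "IntervalTier"
        name = "word"
        xmin = 0
        xmax = 1.0
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.4
            text = "sp"
        intervals [2]:
            xmin = 0.4
            xmax = 1.0
            text = "A"
'''

INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*xmin = ([\d.]+)\s*xmax = ([\d.]+)\s*text = "([^"]*)"'
)


def read_tier_intervals(textgrid_text, tier_name):
    """Return (label, start, end) tuples for the named interval tier."""
    match = re.search(r'name = "%s"' % re.escape(tier_name), textgrid_text)
    if match is None:
        return []
    rest = textgrid_text[match.end():]
    # Stop at the start of the next tier, if there is one.
    next_tier = re.search(r'^\s*item \[\d+\]:', rest, flags=re.MULTILINE)
    if next_tier is not None:
        rest = rest[:next_tier.start()]
    return [(label, float(lo), float(hi)) for lo, hi, label in INTERVAL.findall(rest)]


# Keep only real words, dropping the assumed "sp" pause label.
words = [iv for iv in read_tier_intervals(SAMPLE, "word") if iv[0] != "sp"]
phones = read_tier_intervals(SAMPLE, "phone")
```

A regular expression is enough for a well-formed file like this toy one; for real data, a dedicated TextGrid reader would be more robust to the short format and to escaped quotes in labels.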

In the following weeks we will describe this automatic speech-to-text alignment process and start reporting results from our test set. Stay tuned!

Posted in oerri, podcasting, Spindle, ukoer