Automatic Keyword Generation from Automatic Speech-to-Text Transcriptions

In this post we show keywords generated automatically from the automatic transcriptions produced by the Speech Analysis Tool in Adobe Premiere Pro.

First of all, we use Natural Language Processing techniques to normalise the automatic transcription.
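As a rough illustration of what such a normalisation step might look like, here is a minimal Python sketch. The post does not list the exact steps used, so the choices here (lowercasing, dropping punctuation, keeping apostrophes inside words) and the function name `normalise` are assumptions, not the actual pipeline.

```python
import re

def normalise(text):
    """Lowercase a transcription and split it into word tokens.

    Assumed steps only: the original normalisation may differ.
    """
    text = text.lower()
    # Match runs of letters/digits, optionally followed by an
    # apostrophe suffix so that "don't" stays a single token.
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text)
```

For example, `normalise("Hello, World! Don't stop.")` gives the tokens `hello`, `world`, `don't` and `stop`.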

Next, we need a reference corpus against which we can compare our automatic transcriptions. We chose the spoken part of the British National Corpus (BNC). In the future we may use other corpora, such as the British Academic Spoken English corpus or our own collection of podcast transcriptions.

Finally, the Automatic Keyword Generation system (developed as a Python script) compares the frequency of each word in the automatic transcription with its frequency in the reference corpus, using the Log-likelihood statistic to identify unusually frequent and unusually infrequent words.
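The comparison can be sketched as follows. This is not the script used for this post; it is a minimal version of the Log-likelihood calculation commonly used for corpus comparison, where a word's observed frequencies in the two corpora are compared with the frequencies expected if the word were equally common in both. The function names and the `reference_counts` and `reference_size` parameters are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood for one word.

    a, b: frequency of the word in the target and reference corpus.
    c, d: total number of tokens in the target and reference corpus.
    """
    # Expected frequencies if the word were equally common in both.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:  # a zero frequency contributes nothing to the sum
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(target_tokens, reference_counts, reference_size):
    """Score every word of the transcription against the reference."""
    target_counts = Counter(target_tokens)
    target_size = len(target_tokens)
    scored = []
    for word, a in target_counts.items():
        b = reference_counts.get(word, 0)
        ll = log_likelihood(a, b, target_size, reference_size)
        # True when the word is relatively more frequent in the target.
        overused = a / target_size > b / reference_size
        scored.append((word, ll, overused))
    # Most significant words first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored
```

Note that this sketch only scores words that occur in the transcription, so a word that is frequent in the reference corpus but absent from the transcription would need to be scored separately.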

We have plotted word clouds using Wordle. These word clouds show the 100 most significant words ranked by the Log-likelihood measure (the larger the Log-likelihood, the bigger the word in the word cloud).
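Selecting the words for a cloud could look something like the sketch below. It assumes the scores are held in a dictionary mapping each word to its Log-likelihood, and it writes one `word:weight` line per word; both the function name and that output format are assumptions for illustration, so check what input your word cloud tool expects.

```python
def word_cloud_weights(scores, top_n=100):
    """Return 'word:weight' lines for the top_n highest-scoring words.

    scores: dict mapping word -> Log-likelihood (assumed structure).
    """
    # Rank by Log-likelihood, largest first, and keep the top_n words.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    # The weight is the Log-likelihood itself, so a larger value
    # means a bigger word in the cloud.
    return [f"{word}:{ll:.2f}" for word, ll in ranked]
```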

We should note that we are generating keywords from automatic transcriptions and not from human transcriptions. We therefore give the word accuracy (WA) of each automatic transcription next to its title, obtained as explained in the previous post; the higher the word accuracy, the better.

Copenhagen COP 15: What happened and What next? (WA = 17.14%)

Global Recession: How Did it Happen? (WA = 36.37%)

The nature of human beings and the question of their ultimate origin (WA = 56.29%)

Finally, we need to know how accurate these automatically generated keywords are. Our next task is therefore to compare the keywords generated from the automatic transcriptions with keywords generated from the corresponding human transcriptions. We will report the results in the coming weeks.
