In this post we show how to generate keywords automatically from the automatic transcription of a podcast. The figure below shows the main workflow of the SPINDLE project.

From our podcasts, we obtain an automatic transcription by using CMU Sphinx or the Speech Analysis Tool from Adobe Premiere Pro. Alternatively, a podcast could be transcribed by our media team or by using an external transcription service.
Once we have a transcription, how can we obtain the most relevant words? We use the Log-likelihood method, which compares the frequency of a word in the transcription with the frequency of the same word in a large corpus. For example, the word “banks” occurs 17 times in the automatic transcription of the podcast Global Recession: How Did it Happen? and 201 times in a large corpus. Why is the word “banks” relevant?
Collecting word frequencies from a large corpus
First of all, we need a reference corpus against which we can compare our automatic transcriptions. This corpus should be large enough to contain most words and general enough to be representative of the language. For our experiments we chose the spoken part of the British National Corpus (BNC).
The characteristics of the spoken part of the BNC corpus can be found below:
- 589,347 sentences
- 11,606,059 words
So we have more than 11 million words in our reference corpus. Given that the word “banks” occurs 201 times out of 11.6 million words in the corpus and 17 times out of 5,439 words in our transcription, how do we calculate the relevance of the word “banks”?
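Before the formal calculation, a quick look at relative frequencies gives some intuition. The sketch below is only an illustration using the counts quoted above; the helper name `per_million` is ours and is not part of the SPINDLE tool.

```python
# Counts quoted in this post for the word "banks".
corpus_count, corpus_total = 201, 11_606_059     # spoken part of the BNC
transcript_count, transcript_total = 17, 5_439   # automatic transcription


def per_million(count, total):
    """Return the number of occurrences per million words."""
    return count / total * 1_000_000


corpus_rate = per_million(corpus_count, corpus_total)
transcript_rate = per_million(transcript_count, transcript_total)
ratio = transcript_rate / corpus_rate

print(f"corpus:        {corpus_rate:.1f} per million words")
print(f"transcription: {transcript_rate:.1f} per million words")
print(f"ratio:         {ratio:.0f}x")
```

With these counts, “banks” is roughly 180 times more frequent in the transcription than in the reference corpus. The Log-likelihood value turns that kind of gap into a single score that also takes the size of both texts into account.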
Step 1
- Use Natural Language Processing techniques to normalise the corpus (remove punctuation and stopwords)
- Calculate, for each word in the British National Corpus, how many times that word occurs in the corpus (a)
- Calculate the total number of words in the corpus (c)
The final file contains 56,029 distinct words, each with its number of occurrences. An extract of that file can be found below:
- banks: 201
- crisis: 195
- companies: 758
- ….
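Step 1 can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the code used in the project: the stopword list is a tiny hypothetical one, the tokenizer is a simple regular expression, and the sample sentence is invented and merely stands in for the BNC text.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "were", "by"}


def word_frequencies(text, stopwords=STOPWORDS):
    """Lowercase the text, drop punctuation and stopwords, and count the remaining words.

    Returns (counts, total): the per-word counts and the total number of kept words.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    words = [t.strip("'") for t in tokens]
    words = [w for w in words if w and w not in stopwords]
    return Counter(words), len(words)


# Invented sample text standing in for the reference corpus.
sample = "The banks were hit by the crisis, and the banks cut credit."
counts, total = word_frequencies(sample)

print(counts.most_common(3))
print(total)
```

Here `counts[word]` plays the role of (a) and `total` the role of (c). The same function could be applied to a transcription to obtain the per-word counts and the total needed for it.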
Generating relevant keywords and bigrams
Step 2
- Use Natural Language Processing techniques to normalise the transcription (remove stopwords and, if necessary, punctuation)
- Calculate, for each word in the transcription, how many times that word occurs in the transcription (b)
- Calculate the total number of words in the transcription (d)
Step 3
- Calculate the Log-likelihood, G2, of each individual word



- Sort the words by Log-likelihood value (the higher the better)
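With the notation of Steps 1 and 2 (a and c for the corpus, b and d for the transcription), the usual form of the Log-likelihood statistic for comparing a word across two texts is G2 = 2 * (a * ln(a / E1) + b * ln(b / E2)), where E1 = c * (a + b) / (c + d) and E2 = d * (a + b) / (c + d) are the expected counts. We assume this is the form used here; the sketch below implements it and is not the project's own code. The function names are ours, and the count for “people” is a hypothetical value added only to show the sorting.

```python
import math


def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in a corpus of c words
    and b times in a transcription of d words."""
    e1 = c * (a + b) / (c + d)  # expected count in the corpus
    e2 = d * (a + b) / (c + d)  # expected count in the transcription
    g2 = 0.0
    if a > 0:  # a zero count contributes nothing to the sum
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def rank_keywords(transcript_counts, corpus_counts, c, d):
    """Score every word of the transcription and sort by G2, highest first."""
    scores = {
        word: log_likelihood(corpus_counts.get(word, 0), b, c, d)
        for word, b in transcript_counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


c, d = 11_606_059, 5_439  # totals quoted in this post

banks_g2 = log_likelihood(201, 17, c, d)
print(f"banks: {banks_g2:.2f}")

# "people" and its counts are hypothetical, for illustration only.
ranked = rank_keywords({"banks": 17, "people": 17}, {"banks": 201, "people": 20_000}, c, d)
print(ranked)
```

For “banks” this gives a value close to the 141.12 reported in the example further down. The small difference presumably comes from the totals, which in the original run were likely counted after stopword removal rather than taken from the raw figures used here.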
Step 4
- Find frequent bigrams (pairs of adjacent words) by counting the number of occurrences of each one
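Counting bigrams can be sketched as follows. This is an illustration under our own assumptions rather than the project's code: we discard any pair that contains a stopword, the stopword list is a tiny hypothetical one, the `min_count` threshold is ours, and the sample text is invented.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption).
STOPWORDS = {"the", "a", "an", "and", "as", "of", "to", "in"}


def frequent_bigrams(text, min_count=2, stopwords=STOPWORDS):
    """Count pairs of adjacent words, skipping pairs that contain a stopword,
    and return those seen at least min_count times, most frequent first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pairs = zip(tokens, tokens[1:])
    counts = Counter(
        pair for pair in pairs
        if pair[0] not in stopwords and pair[1] not in stopwords
    )
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


# Invented sample text, for illustration only.
sample = (
    "Interest rates fell. The credit crunch hit as interest rates rose "
    "and the credit crunch spread."
)
bigrams = frequent_bigrams(sample)

for (first, second), n in bigrams:
    print(f"[{first} {second}] : {n}")
```

Note that this simple version lets a bigram span a sentence boundary; splitting the text into sentences first would avoid that.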
Example of Automatic Keywords Generation
We used the keyword generation tool to generate the relevant keywords and bigrams from the automatic transcription of the podcast Global Recession: How Did it Happen? (Correct Words = 32.9%). We deliberately selected a poor automatic transcription to show that, even with a low proportion of correct words, we can extract some relevant keywords and bigrams automatically.
Keywords Generated (word: Log-likelihood)
banks : 141.12175627
crisis : 73.3976004078
companies : 67.8498685789
assets : 61.8910800051
haiti : 47.7956942776
interest : 41.3390170289
credit : 39.6149918395
crunch : 35.9334074944
senate : 32.4501608202
profited : 30.625124757
sitcom : 30.625124757
ansa : 30.625124757
nineteen : 29.0864140753
economy : 28.6440250819
nineties : 27.5138518651
haitian : 26.8069860979
sanctioning : 26.8069860979
center : 26.8069860979
regulate : 25.4923775621
hashing : 25.0818400138
haitians : 25.0818400138
stimulus : 24.5089608603
united : 24.1102094531
successful : 21.8091735308
financial : 21.7481087661
key : 21.6791751296
caught : 21.1648006228
eases : 21.0970376283
bankruptcy : 21.0970376283
rates : 21.0105869453
kind : 20.8040324729
cited : 20.6246470912
backs : 19.9877139071
borrowing : 19.9877139071
crimes : 19.5817617075
countries : 19.5490491082
essentially : 19.334521352
fiscal : 19.1532240523
Collocations Generated (collocation: #occurrences)
[interest rates] : 5
[financial crisis] : 4
[wall street] : 3
[nineteen nineties] : 3
[credit crunch] : 3
[british government] : 3
Word Cloud (using Wordle)

Conclusion
We should note that we are generating keywords from automatic transcriptions and not from human transcriptions. Therefore, alongside relevant keywords and bigrams, we also obtain some that are less relevant or simply off topic. However, through the SPINDLE project we have automatically generated thousands of relevant keywords and bigrams for our collection of podcasts, which should increase its discoverability and accessibility in the near future.