SPINDLE Automatic Keyword Generation: Step by Step

In this post we show how to generate keywords automatically from the automatic transcription of a podcast. The figure below shows the main workflow of the SPINDLE project.

From our podcasts, we obtain an automatic transcription by using CMU Sphinx or the Speech Analysis Tool from Adobe Premiere Pro. Alternatively, a podcast could be transcribed by our media team or by using an external transcription service.

Once we have a transcription, how can we obtain the most relevant words? We use the log-likelihood method, which compares the frequency of a word in the transcription with the frequency of the same word in a large corpus. For example, the word “banks” occurs 17 times in the automatic transcription of the podcast Global Recession: How Did it Happen? and 201 times in a large corpus. Why is the word “banks” relevant?

Collecting word frequencies from a large corpus

First of all, we need a reference corpus against which we can compare our automatic transcriptions. This corpus should be large enough to contain most words and general enough to be representative of the language. For our experiments we chose the spoken part of the British National Corpus (BNC) as our reference corpus.

The characteristics of the spoken part of the BNC corpus can be found below:

  • 589,347 sentences
  • 11,606,059 words

We now know that our reference corpus contains more than 11 million words. The word “banks” occurs 201 times out of 11.6 million words in the corpus (about 17 occurrences per million words) and 17 times out of 5,439 words in our transcription (about 3,126 per million). How do we turn that difference into a measure of the relevance of the word “banks”?

Step 1

  1. Use Natural Language Processing techniques to normalise the corpus (remove punctuation and stopwords)
  2. For each word in the British National Corpus, count how many times it occurs in the corpus (a)
  3. Calculate the total number of words in the corpus (c)

The resulting file contains 56,029 distinct words, each with its number of occurrences. An extract of that file can be found below:

  • banks: 201
  • crisis: 195
  • companies: 758
  • ….
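Step 1 can be sketched in a few lines of Python. This is only an illustration, not the SPINDLE code: the stopword list is a tiny hypothetical one, the regular expression treats apostrophes crudely, and counting the total after stopword removal is an assumption, since the post does not say which total was used.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Return (per-word counts, total word count) for a normalised text."""
    # Lowercase and keep alphabetic runs only, which drops punctuation and digits.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stopwords before counting (an assumption about the original tool).
    kept = [t for t in tokens if t not in stopwords]
    return Counter(kept), len(kept)

counts, total = word_frequencies("The banks, the crisis and the banks.")
print(counts["banks"], counts["crisis"], total)
```

The same function can be applied unchanged to a transcription, which is what Step 2 below requires.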

Generating relevant keywords and bigrams

Step 2

  1. Use Natural Language Processing techniques to normalise the transcription (remove punctuation if necessary and stopwords)
  2. For each word in the transcription, count how many times it occurs in the transcription (b)
  3. Calculate the total number of words in the transcription (d)

Step 3

  1. Calculate the Log-likelihood, G2, of each individual word
  2. Sort the words by Log-likelihood value (the higher the better)
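The post does not spell out the formula, so the sketch below assumes the standard two-corpus log-likelihood statistic, with a, b, c and d as defined in Steps 1 and 2. The function names are illustrative and are not taken from the SPINDLE tool.

```python
import math

def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in a reference corpus of c words
    and b times in a transcription of d words."""
    # Expected counts if the word were equally frequent in both texts.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    # A zero observed count contributes nothing (0 * ln 0 is taken as 0).
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def rank_keywords(transcript_counts, d, corpus_counts, c):
    """Score every transcription word and sort by G2, highest first."""
    scores = {
        word: log_likelihood(corpus_counts.get(word, 0), b, c, d)
        for word, b in transcript_counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Counts quoted earlier in the post for the word "banks".
g2_banks = log_likelihood(a=201, b=17, c=11606059, d=5439)
print(round(g2_banks, 2))
```

With the counts quoted earlier for “banks”, this gives a value of roughly 141, in line with the score listed in the example below; the exact decimals depend on the totals c and d after normalisation, which the post does not give. Note that G2 is also large for words that are unusually rare in the transcription, so a stricter version might keep only words whose relative frequency is higher in the transcription than in the corpus.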

Step 4

  1. Find frequent bigrams by counting the number of occurrences of each pair of adjacent words
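Counting bigrams needs nothing more than a counter over adjacent word pairs. The sketch below assumes the tokens have already been normalised as in Step 2; the `min_count` threshold is a hypothetical parameter, since the post does not state which cut-off was used.

```python
from collections import Counter

def frequent_bigrams(tokens, min_count=3):
    """Count adjacent word pairs and keep those seen at least min_count times."""
    # zip pairs each token with its successor, giving every adjacent pair once.
    pairs = Counter(zip(tokens, tokens[1:]))
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]

tokens = "interest rates rose interest rates fell interest rates".split()
print(frequent_bigrams(tokens, min_count=2))
```

In this toy input only the pair "interest rates" is repeated, so it is the only bigram that passes the threshold.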

Example of Automatic Keyword Generation

We used the keyword generation tool to generate the relevant keywords and bigrams from the automatic transcription of the podcast Global Recession: How Did it Happen? (Correct Words = 32.9%). We deliberately selected a poor automatic transcription to show that, even with a low proportion of correctly recognised words, we can still extract some relevant keywords and bigrams automatically.

Keywords Generated (word: Log-likelihood)

banks : 141.12175627
crisis : 73.3976004078
companies : 67.8498685789
assets : 61.8910800051
haiti : 47.7956942776
interest : 41.3390170289
credit : 39.6149918395
crunch : 35.9334074944
senate : 32.4501608202
profited : 30.625124757
sitcom : 30.625124757
ansa : 30.625124757
nineteen : 29.0864140753
economy : 28.6440250819
nineties : 27.5138518651
haitian : 26.8069860979
sanctioning : 26.8069860979
center : 26.8069860979
regulate : 25.4923775621
hashing : 25.0818400138
haitians : 25.0818400138
stimulus : 24.5089608603
united : 24.1102094531
successful : 21.8091735308
financial : 21.7481087661
key : 21.6791751296
caught : 21.1648006228
eases : 21.0970376283
bankruptcy : 21.0970376283
rates : 21.0105869453
kind : 20.8040324729
cited : 20.6246470912
backs : 19.9877139071
borrowing : 19.9877139071
crimes : 19.5817617075
countries : 19.5490491082
essentially : 19.334521352
fiscal : 19.1532240523

Collocations Generated (collocation: number of occurrences)

[interest rates] : 5
[financial crisis] : 4
[wall street] : 3
[nineteen nineties] : 3
[credit crunch] : 3
[british government] : 3

Word Cloud (using Wordle)


We should note that we are generating keywords from automatic transcriptions, not from human transcriptions. Therefore, alongside relevant keywords and bigrams we also obtain some that are less relevant or simply off topic. However, through the SPINDLE project we have automatically generated thousands of relevant keywords and bigrams for our collection of podcasts, which will increase its discoverability and accessibility in the near future.


5 Responses to “SPINDLE Automatic Keyword Generation: Step by Step”

  1. Hi Sergio,

    I’m just going through your SPINDLE blog posts now and I must say I like the simplicity of your workflow described here which is reiterated in your project outcomes post also.

    I’m interested in how to scale this work by bringing these creative commons podcasts and the open-source automatic transcription tool into the FLAX project for further English language learning and teaching resources development and collections building. We both presented at the Beyond Books event in June so you may remember me from the TOETOE project? In addition to what you outline here as a positive outcome with generating keywords for improved discoverability and accessibility of the resources, these automatically generated keywords make for good vocabulary pre-teaching/learning resources also. I’m interested to know what your workflow for eliminating erroneous keywords that were automatically generated by the software was, so this work can be scaled also.

    You’ve used a Wordle which is automatically recognisable to students and teachers to identify frequency across the reference BNC and sub OpenSpires podcast corpora and this is a useful visual aid pedagogically-speaking. I think it would also be useful to explore how these keyword lists from the automatic transcription software can be linked to existing tools and collections in FLAX, including the BNC which is OUCS-managed and further open corpora and tools so that students and teachers can see how these keywords are used across different corpora, both spoken and written, to show the contextualisation of key words and phrases in use.

    This might be a good time to start working with the BASE corpus (also managed by the Oxford Text Archive) for linking these generated keyword lists to for the development of further language learning and teaching OER in FLAX. I’m thinking off the cuff here at the minute but will be with the FLAX team in NZ from early November until the end of the year to have a play with more corpus resources and ways for opening them up for teaching and learning. For TOETOE I will also be doing a lot of work with the evaluation of existing open corpus-based resources by engaging different stakeholders – English teaching practitioners working in different contexts e.g. classroom-based, online, open education etc. – in workshops and interviews around their perceptions of the resources currently in FLAX and other open projects and also by trying to gauge the types of resources that they want. I think it would be useful to refer to SPINDLE and its findings for the potential development of language learning and teaching OER and how these findings can be of relevance to using openly licensed podcast content on iTunesU, the TED lectures etc. I mentioned your project in a twitter discussion yesterday among English for Academic Purposes practitioners on the topic of automatic lecture transcription – see the transcript here http://chirpstory.com/li/25724

    If you’re OK with me commenting on your project here once I’m back in NZ for how we can re-use your resources and findings in our open projects this might be a useful way for other resource developers and interested teaching practitioners to see the types of issues that arise.

    All the best,


  2. [...] another rapid innovation JISC-funded OER project at the Beyond Books conference at Oxford. The Spindle project, also based at OUCS, has been exploring linguistic uses for Oxford podcasts with work based [...]

  3. Sergio Grau Puerto says:

    Hi Alannah,

    I am back from holidays so sorry for the late reply. Thanks for your interest in our project. I do remember your presentation at the Beyond Books event.

    The good news about the SPINDLE project is that the keyword software is ready to be re-used for any other particular purpose. I like the idea to show the keywords in context to visually see how these words are used in the podcast transcription and/or in any other corpora.

    On the other hand, we have not yet started filtering the keywords but it is in our to do list and we will keep everybody informed through our blog.

    I will be happy to discuss more about the SPINDLE project or any other re-use of our resources or findings so feel free to comment here or in your blog about your findings.


  4. Hi again, Sergio and Peter, and sorry I didn’t see this response until recently (I didn’t get any pingback through my email – I guess this is because I can’t subscribe…anyway, will check daily whenever I send you a reply to your blog☺ Also, my comments will cut across a lot of your blog post topics – would you prefer I respond directly to those individual posts?) We’re also experiencing a slight delay at this end as we’re having to re-start the FLAX server to refresh the functionality of the language collections for some mysterious reason. Anyway, it’s given time over to planning for further resources development and refinement and this is where we’d like to start with what we’ve learned from Spindle for the use of Oxford podcasts in FLAX:

    For a demonstration collection in FLAX:

    1. Linking resources:
    We’d like to start small with higher quality sound podcast resources that can be linked to existing FLAX collections e.g. the learning collocations collection and text analysis tools like the noun/verb/adverb/adj phrases and wikify functions that were created for the BAWE corpus collections in FLAX. The FLAX project also has a focus on collections building in addition to offering powerful reference tools so that users can build their own collections and assign activities for the learning of key phrases e.g. automated cloze and scrambled sentences exercises for language learning of searched items in the collections – it would be useful to explore how we can build pathways to help teachers and learners do this with the podcast corpora in FLAX as well. In addition to linking podcast keywords to e.g. collocations from written corpora, e.g. the BAWE and Wikipedia corpora in FLAX, for a comparison of uses of keywords across different discourse types, we will also look into linking to further contextualised uses for the same keywords in the BASE and the spoken part of the BNC. Do you have a bunch of filtered podcast transcripts that we could start with? And, perhaps we should also start with some of the professionally transcribed podcasts just so we can get going with the linking of resources.

    2. Linking tools:
    We’re also really keen to have the display of audio time coded waveform data to accompany the playing of audio and audiovisual files. We think the automatic Speech-to-Text alignment process that you discuss here is more effective than having a sound-only file playing in the background of a static transcribed text file, which is often the case with a lot of language learning resources dealing with speaking and listening, because of the added visual and linguistic support with time coding that this offers. Peter, you showed me a demo of part of the Spindle Toolkit at Oxford ITS on your computer for the spindle caption editor prototype with waveform data for the crowd-sourcing of transcripts. Is there something we need to do to join or override Nexus so we can access the github links to source code and resources posted on the Spindle blog for re-using the various toolkit items in FLAX? By the way, the video in the last blog post about the toolkit is very helpful also, thank you.

    3. Improving quality:
    The word accuracy data that you got from comparing manual and automatic transcriptions was very telling. Do you anticipate that these quality issues in recording and editing will have an impact on IT Services re. policy for future podcast recordings and post-production editing? No doubt poor sound and editing quality may increase costs in time with deciphering etc. in manually transcribed work also. I think this will be an issue for helping to scale this type of work with generating keywords for metadata (discoverability) as well as linguistic support (resource enhancement) for podcast OER use/re-use/re-purposing.

    Thanks for reading and looking forward to hearing from you.


  5. OK, we’ve got into the github for your toolkit. Will write back soon with ideas for what we can do with it.

    Sergio, we’ve just taken a look at the BASE corpus xml files but it’s not POS tagged so we won’t be able to effectively link it to the keywords in the podcasts for e.g. collocations and phrases as there would be a lot of errors if we tried to assign automatic POS tags due to the lack of sentence boundaries, the use of uncapitalised i etc. in the BASE transcripts. It’d be better if we built a collection for this in the same way as we have for the BAWE corpus where we show the complete transcripts but it probably wouldn’t be that interesting without the audiovisual files that accompany these so we’ll need to get permission to use these as well.
    Do you know of a well POS tagged and free to use for educational purposes academic spoken corpus that we could use for linking keywords in the podcasts to? We will link to the fabulously POS-tagged BNC spoken section and then we can link to the written corpora we have for comparative purposes, but it would be nice to have some more academic spoken English resources. Any thoughts and suggestions would be welcome. I’ve taken a quick look at MICASE but it appears to be transcripts and audiofiles only like the BASE.

    Will be in touch again soon,

