Automatic Keyword Generation: Human Transcriptions vs Automatic Transcriptions

As part of the SPINDLE project we are producing a set of automatic transcriptions for the university podcasts. We will then use these automatic transcriptions to automatically produce a set of keywords to increase the OER discoverability of our podcasts. For a small subset of the podcasts we already have human transcriptions. We will use these human transcriptions to compare with the automatic transcriptions, both at transcription level and at keyword level.

Please find below a snapshot of the human transcription for the podcast Globalisation and the effect on economies. We could use this transcription to generate a list of keywords automatically. However, when no human transcription is available (remember that they are expensive to produce), or, as in our case, to compare against the human transcription, we create an automatic transcription of the podcast using Large Vocabulary Continuous Speech Recognition (LVCSR) software.

Using Adobe Premiere Pro we obtain an automatic transcription with a Word Accuracy (WA) rate of 66.08%. Please find below the automatic transcription of the first two paragraphs using Premiere Pro.

Using Sphinx-4 we obtain an automatic transcription with a Word Accuracy rate of 46.56% (this could be improved using acoustic adaptation or by extending the language model; work in progress!). Please find below the automatic transcription of the first two paragraphs using Sphinx-4.

We would then like to compare the keywords generated from these transcriptions, so we plotted the keywords obtained using the Log-likelihood measure of each individual word (as explained in this post) using Wordle (the larger the Log-likelihood, the bigger the word in the plot).

Human Transcription (WA = 100%)

Automatic Transcription: Adobe Premiere Pro (WA = 66.08%)

Automatic Transcription: CMU Sphinx (WA = 46.56%)

We can see that words like borders, political, transnational, decisions, transactions, finance, crisis and many others are relevant in all three word clouds. Unfortunately, globalisation was not recognised by the Sphinx-4 LVCSR software. We can see that there is a way of automatically obtaining keywords, using automatic tools, that are similar to the keywords that would be obtained from a human-generated transcription. However, we would like to measure this similarity; we will report on how to quantify it in future posts.

In the next blog post we would like to discuss how we will store the information generated by the automatic transcription tool and the keyword generation tool. Should we use TEI/XML, or should we use a captioning format such as TTML or WebVTT? Stay tuned!
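To make that comparison concrete, here is a minimal sketch of a single cue in the WebVTT format (the timing and text are illustrative, borrowed from the kind of word-level alignments discussed elsewhere in this series):

```
WEBVTT

00:00:02.470 --> 00:00:03.630
Globalisation and the effect on economies
```

Each cue pairs a start and end time with the text spoken in that interval, which maps naturally onto a time-segmented transcript.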

Posted in oerri, Spindle, ukoer

Automatic Keywords of the Podcast of the Week

Which one of our podcasts is the podcast of the week? It had to be the most recent podcast about the most discussed topic in the news this week.

As part of the Spindle project we are investigating the use of automatic keywords for OER discoverability. Please find below the automatic keywords generated from the automatic transcription of the podcast of the week. Click on the image to listen!

Posted in oerri, podcasting, Spindle, ukoer

Automatic Keyword Generation from Automatic Speech-to-Text Transcriptions

In this post we are going to show the keywords generated automatically from the automatic transcriptions obtained using the Speech Analysis Tool from Adobe Premiere Pro.

First of all, we use Natural Language Processing techniques to normalise the automatic transcription.
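As an illustrative sketch (not the project's actual code), normalisation along these lines can be as simple as lower-casing, stripping punctuation and tokenising:

```python
import re

def normalise(text):
    """Lower-case, strip punctuation and tokenise a transcription.

    A simplified stand-in for the project's normalisation step.
    """
    text = text.lower()
    # Keep letters, digits, apostrophes and whitespace; drop everything else.
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    return text.split()
```

For example, `normalise("Globalisation, and the Effect!")` returns `["globalisation", "and", "the", "effect"]`.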

Next, we use a reference corpus to which we can compare our automatic transcriptions. We chose the spoken part of the British National Corpus (BNC) as our reference corpus. We may use other corpora in the future, such as the British Academic Spoken English corpus or our own collection of podcast transcriptions.

Finally, the Automatic Keyword Generation system (developed as a Python script) compares the frequency of words in the automatic transcription to their frequency in the reference corpus, using Log-likelihood to identify unusually frequent and infrequent words.
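The standard Dunning log-likelihood formula from corpus linguistics captures this comparison. The sketch below is an illustration under that assumption, not the project's actual script; it scores each word of a transcription against a reference corpus:

```python
import math
from collections import Counter

def keyword_loglikelihood(transcript_tokens, reference_tokens):
    """Score each transcript word by Dunning log-likelihood keyness
    against a reference corpus (higher = more distinctive)."""
    t = Counter(transcript_tokens)
    r = Counter(reference_tokens)
    n_t, n_r = sum(t.values()), sum(r.values())
    scores = {}
    for word, a in t.items():
        b = r.get(word, 0)
        # Expected frequencies if the word were equally common in both corpora.
        e1 = n_t * (a + b) / (n_t + n_r)
        e2 = n_r * (a + b) / (n_t + n_r)
        ll = 2 * a * math.log(a / e1)
        if b:  # words absent from the reference contribute nothing here
            ll += 2 * b * math.log(b / e2)
        scores[word] = ll
    # Most distinctive words first: these become the keyword candidates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The top-scoring words from this ranking are the ones that would be drawn largest in the word clouds.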

We have plotted word clouds using Wordle. These word clouds show the 100 most significant words sorted by the Log-likelihood measure (the larger the Log-likelihood the bigger the word in the word cloud).

We should note that we are generating keywords from automatic transcriptions, not from human transcriptions. Therefore, we indicate the word accuracy (WA) of each automatic transcription next to the title (the higher the word accuracy, the better), obtained as explained in the previous post.

Copenhagen COP 15: What happened and What next? (WA = 17.14%)

Global Recession: How Did it Happen? (WA = 36.37%)

The nature of human beings and the question of their ultimate origin (WA = 56.29%)

Finally, we need to know how accurate these automatically generated keywords are. Our next task is to compare the keywords generated from automatic transcriptions with the keywords generated from their respective human transcriptions. We will report results in the following weeks.

Posted in oerri, Spindle, ukoer


Congratulations to the Great Writers Inspire Student Ambassadors who were runners up in the Student Podcasting category at last night’s OxTALENT Awards, an annual event held to celebrate Oxford’s innovative use of technology in teaching and learning. This award recognises the amazing contribution of our seven post-graduate students from the English Faculty: Alex Pryce, Kate O’Connor, Cleo Hanaway, Charlotte Barrett, Erin Johnson, Colleen Curran and Dominic Davies. With varying subject specialisations, they find existing open content for the project, write contextual essays, blog, and promote the project via social media and other engagement activities. You can read their blog posts on the Great Writers Inspire blog.

Posted in dissemination, events, Great Writers, Oxford, podcasting, ukoer, Uncategorized

Schools engagement

This Friday Cleo Hanaway, one of our Student Ambassadors on the Great Writers Inspire project, will be running a workshop with some sixth form students (year 12s) at a local school. This will be quite a test for us, and the first time that we will have shown the site to students from the pre-University sector.

Cleo has worked with the teacher to find out what the students will be studying in the next academic year so that she can tailor the session and demonstrate some materials relevant to their course. The students will be given the opportunity to search the site and offer feedback on their user experience. Then Cleo will set the students the task of writing 100-300 words on their great writer with the aim that the best contributions are published on the Great Writers Inspire blog.

This workshop will form part of our evaluation activities and will hopefully show how successful the site is at engaging school students in the subject area and giving them a flavour of university study. Cleo will post a report on the Great Writers blog with her findings.


Posted in Content, dissemination, events, Great Writers, Oxford, ukoer

Jumping for Joyce

Next week one of the Great Writers Inspire Student Ambassadors, Cleo Hanaway, will be attending the XXIII International James Joyce Symposium in Dublin. Cleo will be chairing a session on ‘Joyce, Impact and Public Engagement’ and will be sharing her experiences of working on the Great Writers project in her talk ‘pro bono publico: Picturing Ulysses and Blogging about Joyce’. It is a great time to be talking about Joyce, as the works published in his lifetime have just come out of copyright: perfect timing for our project!

Cleo will also be taking part in a panel discussion on RTE Radio 1 on Wednesday 13 June at 7 pm, and will do her best to squeeze in a mention of Great Writers Inspire, so listen in if you get the chance.

Our Student Ambassadors are doing amazing work for the project – not only collecting and creating content to inspire our target audience, but providing inspiration to the project team on ways to engage, disseminate and improve the impact of the site. They are, to quote David Kernohan, ‘awesome’!

Posted in Content, dissemination, Great Writers, ukoer

Automatic Speech-to-Text Transcription: Preliminary Results

As part of the SPINDLE project we are running a series of experiments to evaluate the use of Large Vocabulary Continuous Speech Recognition software for the automatic transcription of podcasts. We already know this automatic transcription is not going to be 100% accurate at transcription level, but it should be ‘good enough’ to enrich the existing metadata of the University podcasts with a set of keywords generated from this automatic transcription.

Today we will present some preliminary results of three different podcasts already available at the University of Oxford Podcasts website. We used the Speech Analysis tool from Adobe Premiere Pro CS5 to automatically transcribe these three podcasts. We selected the English UK language option and the High (slower) quality parameter.

The table below shows the characteristics of the three podcasts (title, duration, number of words in the manual transcription and number of words in the automatic transcription). We report the automatic transcription results in terms of Word Accuracy, computed using the Levenshtein distance between the manual transcript and the automatic transcript (the higher, the better).

Title                                                              Duration   #Words (manual)   #Words (Premiere)   Word Accuracy
COP 15: What happened and What next?                               1:08:12    11194             8744                17.14%
Recession: How Did it Happen?                                      36:21      6019              6044                36.37%
nature of human beings and the question of their ultimate origin   1:28:16    12916             13021               56.29%
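A sketch of the word-accuracy computation, assuming WA is one minus the word-level Levenshtein distance divided by the length of the manual transcript (the exact normalisation used in the project may differ):

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy from the Levenshtein (edit) distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return 1.0 - dp[-1][-1] / len(ref)
```

With a perfect transcript this returns 1.0; one substituted word in a four-word reference gives 0.75.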

Analysing the results, we see that accuracy ranges from 17% to 56%. Why is accuracy so variable? Listening to the recordings and analysing the audio signals, we see that the recording conditions of these three podcasts are very different from each other, and we consider this the most important factor behind such different results.

The first podcast contains background noise and even a video conference speaker.

The second podcast has a very low signal level.

The last podcast was professionally recorded and edited and therefore obtains the best results.

There may be other factors affecting the accuracy of the automatic transcription such as the podcast topic (language model), out of vocabulary words (dictionary) or accents (acoustic model).

In the following weeks we will report how we generate keywords automatically from these automatic transcripts. Stay tuned!

Posted in oerri, podcasting, Spindle, ukoer

Automatic Speech-to-Text Alignment for Audio Indexing

A small percentage of our podcasts has been manually transcribed by a professional transcription company. In that case, we do not need to generate an automatic transcription using Large Vocabulary Continuous Speech Recognition software. However, we do need to know when each individual word is uttered in the podcast.

Our goal, then, is to automatically obtain a time-aligned word index of a podcast using its word transcript. We will use this time-alignment information to index our audio podcasts in the project HTML5 demonstrators, alongside the keywords we obtain automatically through the SPINDLE project.

To obtain this time-aligned index we will perform an automatic alignment of an audio podcast with its transcription at word level. This means we will obtain the particular time in the podcast at which each word is uttered. We will perform the alignment using an automatic speech-to-text aligner (for example, the HTK-based P2FA).

For example, the transcription of “Globalisation and the effect on economies” contains a word transcript and speaker information.

Please find below a snapshot of Praat showing the automatic alignment at word and phoneme level of “Globalisation and the effect on economies” obtained using P2FA.

In the following weeks we will describe this automatic speech-to-text alignment process and start reporting results from our test set. Stay tuned!

Posted in oerri, podcasting, Spindle, ukoer

The Main Event: tracking reuse using Google Analytics

On our Great Writers Inspire site pages, say an introduction to modernist poet Ezra Pound, there are a lot of possibilities for reuse. We have worked hard to maximize (and also simplify) OER reuse for anyone coming to the site. However, much as making things both possible and simple is desirable, the real gold in these hills is not the act of reuse itself, but knowing that it has taken place. It is great that people visit Great Writers and use the resources, but it is a shame (for the sake of funding, and of encouraging more content) that we cannot tell when this has happened.

With Great Writers we’ve tried to make sure we can at least tell when people have expressed an interest in reuse. Apologies for segueing into legalspeak: we can record that someone clicks on one of our reuse instructions, but we can’t tell whether they action it further. So let’s look through one of our pages and see what information we can store (all anonymously, and yes, we have a cookie policy).

So, on this example of a talk on William Blake, we can record:

1) The audio being played

2) How much of the audio is played

3) A share on twitter or facebook, or via an email

4) Whether the file is downloaded

5) Whether the embed HTML code, or HTML 5 is copied

6) Whether the “cite” text is copied

7) Whether any of the text is copied at all

These examples are all implemented by adding a Google Analytics event to the <a> HTML element for that link.

As an example, a short piece of JavaScript attached to the download link is all of the code we need to add to track this event in Google Analytics.
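A minimal sketch of such an event, using the classic asynchronous ga.js API of the time (the href and link text here are illustrative, not our actual markup):

```html
<a href="/media/blake-talk.mp3"
   onclick="_gaq.push(['_trackEvent', 'Download', 'Audio']);">
  Download audio
</a>
```

The `_trackEvent` call takes the event category and action (plus optional label and value), which Google Analytics then aggregates in its reports.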

Download is the “Category” of the event, and Audio is the “Action” of the event. We have a schema tying categories to actions, and have aimed to create a lexicon we can share across all our projects, including our other current OER project on World War One.

As this data goes to Google Analytics we can track this and generate reports on it relatively simply, with Google doing all the heavy lifting.

Posted in Content, copyright, dissemination, Great Writers, technical, ukoer

Types of Transcription

Previously in the SPINDLE post series, we discussed the automatic transcription of podcasts using Large Vocabulary Continuous Speech Recognition software. In that case, we had an audio podcast and we wanted to obtain an automatic transcription. Fortunately, for a subset of our podcasts we do have human-generated transcripts. So, what additional information could be obtained automatically using an audio podcast and its manual transcription?

The answer comes next week. First of all, we are going to have a look at different types of transcription.

Transcription types

Transcription is a linguistics term used to describe the representation of spoken language in a written form. For our project we distinguish two different types of transcription:

  1. Word transcripts: human transcribers produce a full word-by-word transcript, but no segmentation of the audio. This word transcription can contain additional information such as speaker identification, speech changes, changes of topic, untranscribed words, etc., but no time information.
  2. Time-segmented transcripts: human transcribers segment the audio file into small clips (usually reflecting a change of speaker, a new sentence, etc.) and transcribe each of these clips individually with the most appropriate sentence. Segmenting the audio file reduces the complexity of transcribing large audio files, and at the same time it is a way of synchronising the audio with the transcription at sentence or speaker level.
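As a rough illustration (this is not the project's storage format, which is still under discussion), the two types might be represented like this:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One clip of a time-segmented transcript."""
    start: float   # seconds from the beginning of the podcast
    end: float
    speaker: str
    text: str

# A word transcript is simply running text (plus optional speaker labels):
word_transcript = "Globalisation and the effect on economies ..."

# A time-segmented transcript is a list of clips (times illustrative):
segments = [
    Segment(2.47, 3.63, "Speaker 1",
            "Globalisation and the effect on economies"),
]
```

The second representation carries the same words as the first, plus the timing and speaker structure that makes audio synchronisation possible.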

Transcription software

Human transcribers use specialised software to transcribe audio podcasts (for example, I usually use Transcriber or Praat). This specialised software allows users to transcribe at phrase, word or phoneme level, add speaker identification, indicate a change of topic, segment the audio into clips and then listen to these clips, adjust the length of a clip, navigate through clips, etc.

Please find below a snapshot of Praat showing an excerpt (seconds 2.47 to 3.63) of the transcription and alignment, at word and phoneme level, of “Globalisation and the effect on economies”.

In the following weeks we will describe how we obtained this word- and phoneme-level alignment automatically from a word transcript with no time information. Stay tuned!

Posted in oerri, podcasting, Spindle, ukoer