OER International Case Study Published

Further to the recent post about Oxford’s OER International project funded by the HEA, you can now read the full case study.

The purpose of Oxford OER International was to identify suitable elements of the University of Oxford’s existing OER collection to be showcased internationally. By improving the web presence of Oxford’s OER outputs, designed with the international user in mind, the project was able to promote a selection of resources hand-picked for their suitability for an international audience.

The project enhanced the potential for engagement with international audiences by ensuring that the selected content was more easily discoverable through improved descriptions and additional metadata to indicate level (introductory, intermediate, advanced). Advocacy from world-class academics and appreciative users, clear routes to Oxford’s other OER projects, and links focussed on international admissions were all included to present a true showcase of Oxford’s best international outputs. The project evaluated strategies to improve discoverability of content by a global audience and investigated a range of tracking and feedback methods for understanding their use.

This case study highlights successful approaches to understanding the needs of an international audience: exploring how improved cataloguing metadata can be used to enhance discoverability, demonstrating how targeted promotion of relevant content through better visibility and marketing can lead to higher usage, and introducing a tracking analytics strategy to evaluate usage and search behaviour. It also includes a simple 5-step methodology, offered as a model for other OER creators to follow.

The 5 steps to gaining global reach

1. Getting a feel for your audience

  • focussing on your target audience
  • understanding some key aspects of their data, for example: most popular pages, best traffic sources, most popular countries and languages.

2. Framing your objectives

  • avoiding vanity metrics
  • working out metrics for your key stakeholders.

3. Audit where you are in relation to your objectives

  • what are your primary traffic sources?
  • what behaviour can you see from your visitors?
  • what can you tell about where your visitors are located?

4. Revising your objectives, planning and implementing some improvements

  • increase your traffic sources
  • optimise your content for searches.

5. Evaluate and repeat steps 3 and 4.

Posted in Content, dissemination, impact, Oxford, ukoer, Uncategorized

Oxford OER International

We have recently received funding for a short project from the OER International strand of the HEA/JISC Open Educational Resources Phase 3 Programme. As part of this very quick turn-around project we have just made live our new open content page on the University’s podcasting website http://podcasts.ox.ac.uk/open. Huge thanks go to Steve Pierce for making the vision come alive.

The purpose of Oxford OER International was to identify suitable elements of the University of Oxford’s existing OER collection to be showcased internationally. By improving the web presence of Oxford’s OER outputs, designed with the international user in mind, the project was able to promote a selection of resources hand-picked for their suitability for an international audience. The project enhanced the potential for engagement with international audiences by ensuring that the selected content was more easily discoverable through improved descriptions and additional metadata to indicate level (introductory, intermediate, advanced). Advocacy from world-class academics and appreciative users, clear routes to Oxford’s other OER projects, and links focussed on international admissions were all included to present a true showcase of Oxford’s best international outputs. The project briefly explored strategies to improve discoverability by an international audience and methods for tracking and understanding their use, and these are to be included in the final case study. The case study will highlight successful approaches, for example by describing how metadata can be used to enhance discoverability and demonstrating how tracking methods can support international promotion.

Details of the final case study will be posted here when it is published.

Posted in dissemination, Oxford, podcasting, ukoer

SPINDLE – Speech to Text to Keywords to Captions – The Grand Finale

SPINDLE: Increasing OER discoverability by improved keyword metadata via automatic speech to text transcription.

A summary of the project, using the words of the voice-over that accompanies the SPINDLE overview video.

1. Aim – Generate keywords automatically from recorded lectures

2. Spindle was funded by JISC through the “Open Educational Resources – Rapid Innovation” strand. - http://www.jisc.ac.uk/whatwedo/programmes/ukoer3/rapidinnovation.aspx

3. Spindle was a technical project whose key objective was to explore generating cataloguing keywords from recorded lectures.

4. Spindle reviewed the accuracy of “speech to text” tools available to media producers for automatically generating a text transcript from a recording file.

5. Spindle created a program that automatically filters the uncorrected transcript to a set of statistically interesting keywords. The program analyses the lecturer’s words and compares them with the British National Corpus of Spoken Words.

Better keywords improve the discoverability of open content!

6. Spindle went much further than the initial plan, creating a “captioning” toolset to help media producers deal with cataloguing media.

With this toolkit, a media service can now:

- batch process recordings to create transcripts automatically (using the free toolset CMU Sphinx)

- generate keywords

- correct any transcript errors while listening to the media

- and export into time-coded captioning and archive formats

7. The Spindle captioning toolset was written in Python using the Django framework

8. The Spindle code is publicly available for re-use in an online repository under an open source licence - GitHub code repository - https://github.com/ox-it/spindle-code (hashtags #spindle #OERRI)

9. All reports and further information are available through the Spindle blog – http://blogs.it.ox.ac.uk/openspires/category/spindle – hashtag #spindle

SPINDLE Overview Movie

Watch the two-minute SPINDLE overview video, which uses the above text as the voice-over, at:

http://media.podcasts.ox.ac.uk/oucs/spindle/spindle_overview.mp4

The SPINDLE Workflow and Caption Editor Toolkit

Posted in dissemination, grandfinale, Spindle

SPINDLE – Benefits and Impact

Project SPINDLE is about to end. As lead on the project here at the Academic IT Services, I’ve tried to summarise the main impact and benefits of the work:

  • Training – improved skills within the OpenSpires and Media teams
  • Discoverability – making media more discoverable and accessible
  • Content – the creation of better cataloguing resources, tools and data
  • Knowledge exchange – through the documentation of the workflow and the creation of free to use open source tools helping others to build on our work
  • Community building – working with others to explore ideas for time-coded texts and media

The project was funded by the JISC to rapidly innovate around technical issues that support the release of Open Educational Resources. The single biggest benefit of the project has been in training and skills acquisition for our media production team, by allowing time and funding to foster a multidisciplinary collaboration across linguistics, phonetics and computer science to research and create the prototype service. The fast-paced, short five-month project has achieved all of its original aims. By pairing our summer intern programmer with an expert in speech-to-text software, we have managed to move beyond the area of keyword cataloguing and create a more complex prototype web application to process transcripts as media is created. This captioning toolkit will speed up work, be very cost-effective and allow crowd-sourced corrections to be exported into emerging HTML5 captioning and archival formats.

Here is a list of the substantial benefits of the project:

  • SPINDLE developed a round-trip workflow for transcription correction and created over 20 blog reports evaluating this work.
  • SPINDLE researched the use of automatic speech to text programs to generate transcriptions automatically. This automatic transcription serves as a starting point to create manual transcriptions and captions, as well as the base to generate keywords automatically.
  • SPINDLE documented how to use Adobe Premiere to make transcripts and how a media unit might install the research toolkit CMU SPHINX 4 to transcribe podcasts - https://github.com/ox-it/spindle-code/tree/master/speechToText
  • A large corpus of text – SPINDLE proved that the workflow could generate keywords automatically for 3,426 podcasts. Once these keywords are migrated into our delivery channels they will lead to better indexing and cataloguing, and better discoverability of our Open Educational Resources (OER) by search engines.
  • Accessibility – We generated unchecked and uncorrected caption file data in WebVTT timecoded format for our OER video podcasts
  • Archival formats – We investigated an archival format for the keywords and transcripts using the Text Encoding Initiative encoding format, which also includes OER licence information

We developed code:

  • Programming scripts for finding non-common keywords from text transcripts - http://github.com/ox-it/spindle-code
  • A new prototype online transcription editor – a toolkit that aids captioning work, freely available in a GitHub code repository - http://github.com/ox-it/spindle-code
  • Integration of the SPINDLE Caption Editor with CMU Sphinx, import of Adobe Premiere XMP transcript files, and an investigation of an API to the Koemi commercial web service
  • Export to plain text, HTML, WebVTT and an RSS data feed, to help accessibility via text and video caption formats.

We improved speech-to-text skills across the OpenSpires and media services team, and hence the University of Oxford, by fostering a multidisciplinary collaboration across academic IT services, linguistics, phonetics and computer science to create the prototype service. We also developed expertise in other subjects such as research tools (CMU Sphinx), text encoding (TEI, XML and HTML5), programming (Django, web services), accessibility formats (WebVTT) and automatic speech-to-text alignment.

The next technical steps are to:

  • Test the prototype software in a day-to-day production server environment
  • Review and reduce any minor keyword cataloguing errors
  • Ingest the cataloguing data into our main databases
  • Expose the new cataloguing keywords on the 4,000+ media items delivered by the Academic IT Services in feeds and web pages – primarily Oxford on iTunesU and http://podcasts.ox.ac.uk

The next research work:

  • Explore ways of filtering the keywords even further by ranking and removing words that are unlikely to be used in online searches
  • Explore the practicalities and costs of crowd-sourcing the correction of raw automatic transcriptions of the lectures with the new caption software
  • Explore the benefits and weaknesses of using automatic draft text for full-text search
  • Compare the costs of managing volunteers correcting automatic transcripts to the cost and accuracy of using a professional transcription service.

Further work with academic authors:

  • Attitudes to OER text transcript release – information on contributor attitudes to displaying texts alongside a lecture.
  • Policy for approval of texts
  • Investigating storing a voice-bank or key subject terms database to help the software improve regular transcription

Future research ideas

The project also offers many future benefits and avenues to explore for researchers and HE services:

  • Corpus Linguistics and language – SPINDLE offers a unique snapshot of text representing the academic language over a four-year period at Oxford.
  • English as a foreign language – There has been interest and debate by the language learning community on SPINDLE and captioning lectures here – http://chirpstory.com/li/25724
  • Media Production Services – there is interest in using the SPINDLE work within automatic lecture capture solutions - http://opencast.org
  • Translation of texts to foreign languages
  • Data mining – research across the disciplines

Posted in dissemination, impact, oerri, Spindle

Navigating Open Oxford: the new OpenSpires Mind Map

Are you interested in seeing the bigger picture of Open Oxford? Try the new interactive OpenSpires Mind Map, freely available online. This new map is designed as a gateway into Open Educational practice at the University of Oxford. Here, you can explore the story and achievements of OpenSpires, read how the openness initiative can benefit academic practice and find ways to get involved at the University of Oxford.

As a part of the OER revolution OpenSpires has now overseen a number of major OER projects at the University of Oxford, and is still growing. This new interactive map showcases all the diverse projects under the OpenSpires umbrella since it was established in 2009. It is a useful starting point for beginners, including Key Definitions, and How To, as well as answering some FAQs. It also goes deeper, offering information about the strategies behind OpenSpires projects like Ripple, Triton and Great Writers Inspire. It is hoped that this map will be a multi-faceted tool to help explain and celebrate various aspects of OpenSpires.

For more information explore the Mind Map or the OpenSpires homepage, or read the LTG Case Studies blogpost.

The OpenSpires Mind Map was created by Alexandra Paddock as part of a summer internship at IT Services.

Posted in Oxford, ukoer

Great Writers – taking stock

With a fast-paced one-year project it is easy to forget some of the interesting bits along the way. As we write our final report we have taken the opportunity to reflect on all aspects of the project, and this has been made easier by the excellent blogging of our student team and our academic supporters. The final report will be available in mid-October, but until then here are some mini-reports and reflective posts which give a taste of our outputs and findings.

Ebooks

http://writersinspire.wordpress.com/2012/05/10/the-ipad-in-the-library/, http://writersinspire.wordpress.com/2012/04/19/engage-event-ebooks-ereaders-elearning/

Teaching case study (video)

http://writersinspire.org/content/teaching-shakespeare-schools

Engagement with schools

http://writersinspire.wordpress.com/2012/07/17/schools-engagement-at-cheney-teachers-comments/, http://writersinspire.wordpress.com/2012/07/17/schools-engagement-at-cheney-oxford/

Engagement with the wider community

http://writersinspire.wordpress.com/category/events/engage-events/

How to inspire students

http://writersinspire.wordpress.com/2012/04/20/engage-and-inspire/

Copyright/CC

http://writersinspire.wordpress.com/2012/04/19/engage-event-copyright-and-licencing/, http://writersinspire.wordpress.com/2012/04/17/copyright/, http://writersinspire.wordpress.com/2012/03/28/releasing-and-reusing-creative-commons-material/, http://writersinspire.wordpress.com/2012/02/09/who-owns-scholarship/, http://writersinspire.wordpress.com/2012/01/25/creative-commons/

Digital literacy

http://writersinspire.wordpress.com/2012/08/17/down-the-rabbit-hole-discovering-open-educational-resources/, http://writersinspire.wordpress.com/2012/05/01/the-satisfaction-of-a-reliable-and-interesting-source/

Posted in Content, dissemination, Great Writers, ukoer

SPINDLE Automatic Keyword Generation: Step by Step

In this post we are going to show the automatic generation of keywords from the automatic transcription of a podcast. First of all, please find below a figure showing the main workflow of the SPINDLE project.

From our podcasts, we obtain an automatic transcription by using CMU Sphinx or the Speech Analysis Tool from Adobe Premiere Pro. Alternatively, a podcast could be transcribed by our media team or by using an external transcription service.

Once we have a transcription, how can we obtain the most relevant words? We use the Log-likelihood method. This method compares the frequency of a word in the transcription with the frequency of the same word in a large corpus. For example, the word “banks” occurs 17 times in the automatic transcription of this podcast, Global Recession: How Did it Happen?, and 201 times in a large corpus. Why is the word “banks” relevant?

Collecting word frequencies from a large corpus

First of all we need a reference corpus to which we can compare our automatic transcriptions. This corpus should be large enough to contain most words and general enough to be representative of the language. We chose for our experiments the spoken part of the British National Corpus (BNC) as our reference corpus.

The characteristics of the spoken part of the BNC corpus can be found below:

  • 589,347 sentences
  • 11,606,059 words

We now know that our reference corpus contains more than 11 million words. Given that the word “banks” occurs 201 times out of 11.6 million words in the corpus and 17 times out of 5,439 words in our transcription, how do we calculate the relevance of the word “banks”?

Step 1

  1. Use Natural Language Processing techniques to normalise the corpus (remove punctuation and stopwords)
  2. For each word in the British National Corpus, count how many times it occurs in the corpus (a)
  3. Calculate the total number of words in the corpus (c)

The final file is composed of 56,029 words and the number of occurrences of each word. An extract of that file can be found below:

  • banks: 201
  • crisis: 195
  • companies: 758
  • ….

Generating relevant keywords and bigrams

Step 2

  1. Use Natural Language Processing techniques to normalise the transcription (remove stopwords and, if necessary, punctuation)
  2. For each word in the transcription, count how many times it occurs in the transcription (b)
  3. Calculate the total number of words in the transcription (d)
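
The normalising and counting in Steps 1 and 2 could be sketched roughly as follows. This is an illustration only, not the SPINDLE code: the tiny stopword list and the function names `normalise` and `word_counts` are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def normalise(text):
    """Lowercase the text, drop punctuation and remove stopwords."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def word_counts(text):
    """Return (occurrences of each word, total number of words)."""
    tokens = normalise(text)
    return Counter(tokens), len(tokens)


counts, total = word_counts("The banks failed, and the banks were rescued.")
print(counts["banks"], total)
```

The same two functions can be applied to the reference corpus (giving a and c) and to a transcription (giving b and d).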

Step 3

  1. Calculate the Log-likelihood, G2, of each individual word
  2. Sort the words by Log-likelihood value (the higher the better)
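
As a rough sketch of Step 3, the function below assumes the two-term form of the Log-likelihood statistic that is commonly used for comparing corpora; the post does not spell out the exact formula SPINDLE used, so treat this as an assumption. The names `log_likelihood` and `rank_keywords` are made up for the example. With the counts quoted above for “banks” (a = 201, b = 17, c = 11,606,059, d = 5,439) it gives a value a little over 141, in line with the figure reported in the example further down.

```python
import math


def log_likelihood(a, b, c, d):
    """G2 for one word.

    a: occurrences in the reference corpus, c: total words in the corpus
    b: occurrences in the transcription,   d: total words in the transcription
    """
    # Expected counts if the word were equally frequent in both texts.
    expected_corpus = c * (a + b) / (c + d)
    expected_transcription = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_corpus)
    if b > 0:
        g2 += b * math.log(b / expected_transcription)
    return 2 * g2


def rank_keywords(transcription_counts, corpus_counts, c, d):
    """Sort the transcription's words by G2, highest first."""
    scored = {
        word: log_likelihood(corpus_counts.get(word, 0), b, c, d)
        for word, b in transcription_counts.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


g2_banks = log_likelihood(a=201, b=17, c=11_606_059, d=5439)
print(round(g2_banks, 2))
```

Note that this statistic is also high for words that are unusually rare in the transcription, so a fuller version might keep only words that are over-represented relative to the corpus.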

Step 4

  1. Find frequent bigrams by counting the number of occurrences of each pair of adjacent words
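
Step 4 can be illustrated with a few lines of Python. This is a minimal sketch, assuming the tokens have already been normalised; the function name `frequent_bigrams` and the `min_count` threshold of 3 are assumptions for the example, not values taken from the SPINDLE code.

```python
from collections import Counter


def frequent_bigrams(tokens, min_count=3):
    """Count adjacent word pairs and keep those seen at least min_count times."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]


tokens = "interest rates rose as interest rates fell then interest rates held".split()
print(frequent_bigrams(tokens))
```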

Example of Automatic Keyword Generation

We used the keyword generation tool to generate the relevant keywords and bigrams of the automatic transcription of the podcast Global Recession: How Did it Happen? (Correct Words = 32.9%). We selected a bad automatic transcription to show that even with a low number of correct words we can extract some relevant keywords and bigrams automatically.

Keywords Generated (word: Log-likelihood)

banks : 141.12175627
crisis : 73.3976004078
companies : 67.8498685789
assets : 61.8910800051
haiti : 47.7956942776
interest : 41.3390170289
credit : 39.6149918395
crunch : 35.9334074944
senate : 32.4501608202
profited : 30.625124757
sitcom : 30.625124757
ansa : 30.625124757
nineteen : 29.0864140753
economy : 28.6440250819
nineties : 27.5138518651
haitian : 26.8069860979
sanctioning : 26.8069860979
center : 26.8069860979
regulate : 25.4923775621
hashing : 25.0818400138
haitians : 25.0818400138
stimulus : 24.5089608603
united : 24.1102094531
successful : 21.8091735308
financial : 21.7481087661
key : 21.6791751296
caught : 21.1648006228
eases : 21.0970376283
bankruptcy : 21.0970376283
rates : 21.0105869453
kind : 20.8040324729
cited : 20.6246470912
backs : 19.9877139071
borrowing : 19.9877139071
crimes : 19.5817617075
countries : 19.5490491082
essentially : 19.334521352
fiscal : 19.1532240523

Collocations Generated (collocation: #occurrences)

[interest rates] : 5
[financial crisis] : 4
[wall street] : 3
[nineteen nineties] : 3
[credit crunch] : 3
[british government] : 3

Word Cloud (using Wordle)


Conclusion

We should note that we are generating keywords from automatic transcriptions and not from human transcriptions. Therefore, alongside relevant keywords and bigrams, we obtain some that are less relevant or simply off topic. However, through the SPINDLE project we have automatically generated thousands of relevant keywords and bigrams for our collection of podcasts, which will increase its discoverability and accessibility in the near future.

Posted in oerri, Spindle, ukoer

SPINDLE Frequently Asked Questions

What is Spindle?

SPINDLE has been a project funded by JISC as part of their “Rapid Innovation in Open Educational Resources” programme. The project experimented with speech-to-text technologies to automatically create transcripts of Open Educational Resources (OER), and developed new tools to generate better keywords to help with the indexing and description of OER.

How do I transcribe automatically from speech to text?

We investigated three options for automatic transcription of podcasts:

Adobe Premiere Pro is excellent for video editing, but not for transcribing thousands of podcasts automatically. If you require the automatic transcription of one or more audio/video podcasts, then the Speech Analysis tool of Adobe Premiere Pro can be helpful, but it cannot be used for batch processing of audio or video transcriptions. On the other hand, CMU Sphinx allowed us to run the batch transcriptions of thousands of podcasts efficiently.

How accurate is automatic Speech to Text?

It depends. The key factor in our experience is the quality of the recording: a professional recording using a good tie-clip microphone gives the best results. A microphone far away from the speaker in a noisy room with echoes gives the worst results. It also depends of course on the clarity and accent of the speaker. In the very best situation we have had results where 6 out of every 10 words are transcribed correctly. In this case the gist of the lecture is obvious. This can drop to much lower results of, say, 3 out of 10 words with poor quality recordings. In this case the results are probably too confused to read as normal English and are too poor to generate a good range of keywords. It’s important to realise that all automatic transcripts will need significant editing and checking, particularly to insert correct punctuation in order to make them readable for human users.
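
One way to put a number on “how many words out of 10” is to compare an automatic transcript with a hand-made one. The sketch below is an approximation built on Python’s standard `difflib` module and is not necessarily how the SPINDLE figures were calculated; the function name `percent_correct_words` is an assumption for the example.

```python
from difflib import SequenceMatcher


def percent_correct_words(reference, hypothesis):
    """Percentage of reference words that are matched, in order, in the hypothesis."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0
    # SequenceMatcher gives an approximate alignment of the two word lists.
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(ref)


score = percent_correct_words(
    "the banks caused the crisis",   # manual transcript
    "the bank caused a crisis",      # automatic transcript
)
print(score)
```

Here three of the five reference words survive, so the example transcript scores 6 out of 10.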

How do I generate keywords automatically from transcriptions?

We used two methods:

Antconc is a desktop application (which works on Windows, Mac and Linux), and generating keywords involves starting the programme, loading the text and the reference word-list, and manually running the function to generate the keywords. The user has the opportunity to adjust various parameters, and change the reference corpus, so we found this useful when we were investigating the best ways to generate relevant keywords. But the nature of this interactive application meant that it couldn’t be deployed in an automated workflow to generate keywords from multiple podcasts.

So, instead, we wrote a script to generate the keywords, which could be inserted into our automated workflow, and could be invoked and run programmatically without human intervention.

How does the algorithm for keyword filtering work?

We compared the words in the automatic transcription with the speech transcribed in a large corpus of English called the British National Corpus. Words that are repeated much more often than in normal speech are likely to be keywords.

Where can I download the code generated during the SPINDLE project?

The code is available from https://github.com/ox-it/spindle-code/.

How can I align audio and transcription automatically?

We used the Penn Phonetics Lab Forced Aligner (P2FA), an application which has emerged from academic research in phonetics. The staff at the Phonetics Laboratory at the University of Oxford had identified this in an earlier JISC-funded research project as the state of the art for the automatic alignment of everyday contemporary English speech, and had gained expertise in using it. P2FA is free to download and use, and doesn’t have any licence conditions attached to it.

P2FA is a Python script which interfaces with the Hidden Markov Model Toolkit (HTK) aligner, and with a set of good quality acoustic models. It is necessary to install HTK, and to use it according to the HTK End User Licence Agreement, which is not restrictive in terms of how the software is used. HTK is usually available from http://htk.eng.cam.ac.uk, but the site was not accessible on 12-09-2012.

What formats did you use for caption work?

We used WebVTT – this is a simple to understand HTML5 web format for presenting groups of words over a video.
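
To give a feel for the format, here is a minimal sketch that turns a list of timed text chunks into a WebVTT file body. It is not the SPINDLE exporter; the function names `format_timestamp` and `to_webvtt` and the sample cues are made up for the illustration.

```python
def format_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp, hh:mm:ss.mmm."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(cues):
    """Build WebVTT text from (start_seconds, end_seconds, text) tuples."""
    lines = ["WEBVTT", ""]  # the file must begin with the WEBVTT header line
    for start, end, text in cues:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(text)
        lines.append("")  # a blank line separates cues
    return "\n".join(lines)


vtt = to_webvtt([
    (0.0, 2.5, "Welcome to the lecture."),
    (2.5, 6.0, "Today we look at the financial crisis."),
])
print(vtt)
```

Each cue is simply a time range followed by the words to show during that range, which is what makes the format easy to read and edit by hand.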

Posted in oerri, Spindle, ukoer

SPINDLE Project Outputs

  • SPINDLE set up and documented a workflow to generate the automatic transcription of future open access audio and video podcasts using an online platform, concentrating on automatic keyword extraction for better cataloguing.
  • SPINDLE tested and documented this workflow by:
    • developing a method to generate keywords and relevant word pairs automatically
    • generating in a batch process automatic speech-to-text keywords and timecoded transcripts from a database of over 3,400 podcasts
    • documenting the problems of accuracy in automatic transcriptions by testing and reporting the results of using two commonly used speech to text tools and services against baseline hand-transcribed transcripts
    • investigating the use of Automatic Speech-to-Phoneme alignments for our existing manual transcriptions that did not already include time-code information
  • SPINDLE also successfully designed and documented a filtering program for automatically extracting keywords and relevant word pairs from uncorrected time-coded transcripts by selecting non-common words.
  • SPINDLE extended the functionality of the keyword extraction tool by creating an online web application to manage the transcription of online media podcasts. The main functionality of this online platform is:
  • Caption editor:
    • to edit time-coded transcripts whilst reviewing against the original online media file
    • to allow registered users to transcribe in parallel, with support for crowd-sourcing corrections
    • to import into the Caption editor time-coded transcriptions in XMP, srt or CMU Sphinx formats
    • to edit transcriptions to provide corrections, punctuation, caption length chunking, speaker labels, etc.
  • Batch converter:
    • Create automatic transcriptions from an online media file using a CMU Sphinx installation
    • Create batches of media for automatic transcription
    • Create a list of automatic keywords with relevance statistics
  • Export Tool
    • Support for media metadata and Open Educational Resource (OER) licences
    • Support for exporting time-coded transcriptions in multiple formats:
      • human readable: plain text and HTML
      • HTML5 compatible captions: online media caption format (WebVTT)
      • XML format suitable for archiving and preservation
  • Data feed in RSS format to facilitate online visibility

All SPINDLE code is available from the open repository https://github.com/ox-it/spindle-code/

Posted in oerri, Spindle, ukoer

SPINDLE project: Lessons Learnt

The SPINDLE project is wrapping up and will end in September 2012. Please find below some of the lessons learnt during the project.

  • We can obtain good keywords even if the automatic transcription has got lots of errors.
  • You do not need perfect automatic transcription to implement word search for your Open Educational Resources.
  • Timecoded transcriptions are important for creating captions, chapters or marks for your Open Educational Resources. Automatic Speech-to-Text alignment can help you if you already have a manual transcription.

  • Adobe Premiere Pro is excellent for video editing, but not for automatically transcribing thousands of podcasts. If you need the automatic transcription of one or more audio or video podcasts, then the Speech Analysis tool of Adobe Premiere Pro can be helpful, but not for batch processing. In contrast, CMU Sphinx allowed us to run the batch transcriptions of thousands of podcasts efficiently.
  • The Pareto principle (or 80/20 rule) applies to the automatic keyword generation from automatic transcriptions. We will need to dedicate 80% extra time to generate automatic keywords accurately for 20% of our podcasts (difficult recording conditions, long distance microphones, multiple speakers, specialised vocabulary, multiple accents, etc). We were able to generate keywords accurately for a majority of our podcasts without having to deal with those issues. The podcasts that are difficult to transcribe automatically could be transcribed manually in the future, or could wait for further funding.
  • The use of a High Throughput Computing cluster (Condor) was extremely beneficial for the project. We could submit all the transcription jobs to the cluster and get the results in a timely manner. Usually there were up to 60 transcription jobs running in parallel in the cluster.
  • The combination of skills of the project members was an important factor to the success of this short project. We had a diversity of skills in our team, from open educational resources to natural language processing, automatic speech recognition and web development.
  • The variety of representations of timecoded transcripts was also a subject of discussion during the project. In the end, we decided on a TEI/XML representation of the automatic/manual transcription, including the time information and the automatic keywords. In addition, a transcription can be exported into a variety of formats (text, HTML, srt, WebVTT, XML) in the developed online caption editor platform.

Posted in oerri, podcasting, Spindle, ukoer