SPINDLE Speech to text caption engine for media producers

The SPINDLE project is proud to announce that we have extended the keyword extraction tool into a much larger set of tools built around a captioning editor. The idea is that the tools hide some of the complexity of moving media between various research tools and allow cataloguers to work together on correcting online media podcasts. We’re calling this prototype toolset the SPINDLE caption engine, but it does much more than captions.

Many thanks to our programmers Sergio Grau and Jonathan Oddie for many weeks of solid coding in this new area over the summer. It has been a huge learning curve for everyone.

The main functionality of this online toolkit is threefold:

  • Batch Speech to Text Conversion
  • Caption Editor
  • Export Tool

Batch Conversion

The first area the toolkit covers is letting a manager send batches of files to a server instance of the CMU SPHINX speech to text application. Converting audio to text within SPHINX is relatively slow, roughly 1 to 2 times real time, so it’s best to batch-convert files overnight. The SPINDLE system supports this by reading in a list of files and then letting the manager select which files should be prioritised for conversion to time-coded text.
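As a rough illustration of the prioritising step, here is a minimal Python sketch. It is not the SPINDLE code itself: the function names `order_batch` and `estimated_hours` and the file names are hypothetical, and the default factor of 2.0 simply takes the upper end of the "1 to 2 times real time" figure above.

```python
def order_batch(files, prioritised):
    """Return files in processing order: prioritised ones first.

    sorted() is stable, so files keep their original list order
    within the prioritised group and within the remaining group.
    """
    prioritised = set(prioritised)
    # False sorts before True, so prioritised files come first.
    return sorted(files, key=lambda name: name not in prioritised)


def estimated_hours(durations_seconds, realtime_factor=2.0):
    """Estimate conversion time for a batch, in hours.

    realtime_factor is an assumption: 2.0 means one hour of audio
    takes about two hours to convert.
    """
    return sum(durations_seconds) * realtime_factor / 3600.0


if __name__ == "__main__":
    batch = ["lecture1.mp3", "lecture2.mp3", "interview.mp3"]
    print(order_batch(batch, ["interview.mp3"]))
    print(estimated_hours([3600, 1800]))
```

An estimate like this is mainly useful for deciding how many recordings will fit into one overnight run.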

Caption Editor

As we’ve discussed in detail throughout the project, there will always be a need for a correction tool when using Automatic Speech to Text. The project has therefore developed a simple editor to correct mistakes in the text and, importantly, to add punctuation. The editor keeps the original time information. It allows captions to be made from scratch via a movie or audio URL, or from the output file of the SPHINX batch process. The toolkit also supports simple manual importing of the two types of mark-up text files the project has been testing:

  • CMU SPHINX format
  • Adobe Premiere XMP format

The caption editor has a list of chunked text entry slots, each synced to a time period in the video. The idea is that you use the play/pause buttons under the recording, perhaps using the HTML5 button to play the recording 75% slower, and start correcting the automatic speech to text. It’s taken quite a while to find a good compromise in the UI that lets a cataloguer break the text up into chunks suitable for overlay captions yet still navigate quickly up and down the file. We still find correction slow and time-consuming, so there is the ability to stop and come back later, with an indicator showing the percentage of the file transcribed.
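To make the idea of time-synced chunks and a progress indicator concrete, here is a small Python sketch. It is an assumption about how such an indicator could work, not a description of the SPINDLE editor: the `Chunk` class, its `reviewed` flag and the `percent_transcribed` function are hypothetical, and progress is counted per chunk rather than per second of audio.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """One caption slot tied to a time span in the recording."""
    start: float            # seconds from the start of the recording
    end: float              # seconds from the start of the recording
    text: str               # automatic or corrected transcript text
    reviewed: bool = False  # hypothetical flag set once a cataloguer has checked it


def percent_transcribed(chunks):
    """Return the percentage of chunks marked as reviewed."""
    if not chunks:
        return 0.0
    done = sum(1 for chunk in chunks if chunk.reviewed)
    return 100.0 * done / len(chunks)


if __name__ == "__main__":
    chunks = [
        Chunk(0.0, 2.0, "Welcome to the lecture.", reviewed=True),
        Chunk(2.0, 4.5, "today we look at speech to text"),
    ]
    print(percent_transcribed(chunks))
```

Counting by chunk is the simplest choice; weighting by chunk duration would give a fairer figure when chunk lengths vary a lot.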

Features include:

  1. Allows user corrections and punctuation
  2. Generates time-coded captions in the HTML5-supported WebVTT format
  3. Generates a keyword list from the transcript text
  4. Exports to caption format (WebVTT), plain text or HTML transcript, and archive XML
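For readers who have not seen WebVTT, the sketch below shows roughly what a caption export involves. It is a minimal Python illustration rather than the toolkit's exporter: the function names `vtt_timestamp` and `to_webvtt` are hypothetical, and cues are assumed to be simple `(start_seconds, end_seconds, text)` tuples.

```python
def vtt_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp, hh:mm:ss.mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def to_webvtt(cues):
    """Build a WebVTT document from (start, end, text) tuples.

    Assumes cue text contains no blank lines and no "-->" sequence,
    both of which would break the file format.
    """
    lines = ["WEBVTT", ""]  # required header followed by a blank line
    for start, end, text in cues:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(text)
        lines.append("")    # a blank line ends each cue
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_webvtt([(0.0, 2.5, "Welcome to the lecture."),
                     (2.5, 5.0, "Today we look at speech to text.")]))
```

The resulting file can be attached to an HTML5 video through a `track` element.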

Automatic keywords

The caption toolset builds on the core work done on automatic keyword extraction. Keyword quality improves as you tidy up mistakes in the captions, but as we’ve seen over the course of the project, the algorithm still appears to give real benefits on uncorrected text.
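This post does not describe the project's keyword algorithm, so the Python sketch below is only a stand-in to show the general shape of the step: a plain word-frequency count with a tiny stop-word list. The `keywords` function and the `STOPWORDS` set are hypothetical and are not the SPINDLE method.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list, for illustration only.
STOPWORDS = {"the", "and", "that", "this", "for", "with", "are", "was", "you"}


def keywords(text, top_n=5):
    """Return the top_n most frequent non-stop-words in a transcript."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(
        word for word in words
        if word not in STOPWORDS and len(word) > 2  # skip very short words
    )
    return [word for word, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    transcript = "The caption editor corrects the caption text. Caption tools help."
    print(keywords(transcript, top_n=3))
```

A frequency count like this tolerates some recognition errors, since a word that is misheard once is usually still recognised correctly elsewhere in the recording.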

Technical Details


  • Python (Django) Server
  • Twitter Bootstrap UI framework
  • Celery (celeryproject.org) for queuing
  • Also requires a message queue server (RabbitMQ)
  • Installation of CMU Sphinx (optional)

Transcript Editor:

  • AJAX-based
  • jQuery and Backbone.js
  • REST API through django-tastypie