SPINDLE Project Outputs

  • SPINDLE set up and documented a workflow for automatically transcribing future open access audio and video podcasts using an online platform, with a focus on automatic keyword extraction for better cataloguing.
  • SPINDLE tested and documented this workflow by:
    • developing a method to generate keywords and relevant word pairs automatically
    • batch-generating automatic speech-to-text keywords and time-coded transcripts from a database of over 3,400 podcasts
    • documenting the accuracy problems of automatic transcription by testing two commonly used speech-to-text tools and services against baseline hand-transcribed transcripts and reporting the results
    • investigating the use of Automatic Speech-to-Phoneme alignments for our existing manual transcriptions that did not already include time-code information
  • SPINDLE also successfully designed and documented a filtering program for automatically extracting keywords and relevant word pairs from uncorrected time-coded transcripts by selecting non-common words.
  • SPINDLE extended the functionality of the keyword extraction tool by creating an online web application to manage the transcription of online media podcasts. The main functions of this online platform are:
  • Caption editor:
    • Edit time-coded transcripts whilst reviewing them against the original online media file
    • Allow registered users to transcribe in parallel, with support for crowd-sourced corrections
    • Import time-coded transcriptions in XMP, SRT or CMU Sphinx formats
    • Edit transcriptions to add corrections, punctuation, caption-length chunking, speaker labels, etc.
  • Batch converter:
    • Create automatic transcriptions from an online media file using a CMU Sphinx installation
    • Create batches of media for automatic transcription
    • Create a list of automatic keywords with relevance statistics
  • Export tool:
    • Support for media metadata and Open Educational Resource (OER) licences
    • Support for exporting time-coded transcriptions in multiple formats:
      • human readable: plain text and HTML
      • HTML5 compatible captions: online media caption format (WebVTT)
      • XML format suitable for archiving and preservation
  • Data feed in RSS format to facilitate online visibility
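
To make the idea of extracting keywords "by selecting non-common words" more concrete, here is a minimal sketch in Python. It is not taken from the SPINDLE code base: the function names, the tiny COMMON_WORDS list and the use of raw frequency as the relevance statistic are all assumptions made for illustration. It takes a time-coded transcript as (start time, word) pairs, discards common words, and counts the remaining words and adjacent word pairs.

```python
from collections import Counter

# Hypothetical stand-in for a list of common English words. A real list
# would be far larger, for example one derived from a reference corpus.
COMMON_WORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is", "that", "it",
    "this", "we", "on", "for", "with", "as", "are", "was", "be",
}


def is_keyword(word, common):
    """Treat a word as a keyword candidate if it is not short or common."""
    return len(word) > 2 and word not in common


def extract_keywords(timed_words, common=COMMON_WORDS, top_n=5):
    """Return the most frequent non-common words and adjacent word pairs.

    timed_words: list of (start_seconds, word) tuples from a transcript.
    Returns two lists of (item, count) tuples, most frequent first.
    """
    # Normalise case and strip surrounding punctuation; time codes are
    # ignored here because only word order matters for counting.
    words = [w.lower().strip(".,;:!?\"'") for _, w in timed_words]

    keywords = Counter(w for w in words if is_keyword(w, common))

    # A pair counts only when both neighbouring words are non-common.
    pairs = Counter(
        (a, b)
        for a, b in zip(words, words[1:])
        if is_keyword(a, common) and is_keyword(b, common)
    )
    return keywords.most_common(top_n), pairs.most_common(top_n)
```

Because automatic transcripts are uncorrected, a frequency filter of this kind will also surface misrecognised words, so in practice the output would still need reviewing before it is used for cataloguing.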

All SPINDLE code is available from the open repository https://github.com/ox-it/spindle-code/
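
As an illustration of the WebVTT export listed above, the following Python sketch turns time-coded captions into a WebVTT document. It is written for this post rather than copied from the repository, so the function names and the (start, end, text) input shape are assumptions. The output follows the basic WebVTT layout: a "WEBVTT" header line, then one cue per caption with an "HH:MM:SS.mmm --> HH:MM:SS.mmm" timing line.

```python
def format_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(captions):
    """Build a WebVTT document from (start_seconds, end_seconds, text) tuples."""
    lines = ["WEBVTT", ""]  # header, then the blank line that must follow it
    for start, end, text in captions:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(text)
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)
```

The sketch assumes caption text contains no blank lines and no "-->" sequence, both of which are not allowed inside a WebVTT cue; a fuller exporter would need to check for or escape them.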
