As part of the SPINDLE project we are producing automatic transcriptions and automatic keywords for the university's podcasts to improve OER discoverability. In this post we analyse the variety of formats already in use to represent these transcriptions.
At the moment, we already have a small set of human transcriptions stored as .pdf and .xml files.
Please find below a snapshot of the .pdf file for the podcast "Globalisation and the effect on economies":
Please find below a snapshot of the .xml file for the same podcast:
We are also creating automatic speech-to-text transcriptions using Large Vocabulary Continuous Speech Recognition Software.
If we use Adobe Premiere Pro we obtain an XMP file. Please find below a snapshot of the XMP file (note the low word accuracy for the first sentence; the total word accuracy of the automatic transcription is 66.08%):
If we use Sphinx-4 we obtain a plain text output that can be post-processed into any other format (again, very low word accuracy in the first sentence; total word accuracy 46.56%).
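For readers wondering how the word accuracy figures above are obtained: word accuracy is conventionally 1 minus the word error rate, i.e. the word-level edit distance between the automatic transcription and the human reference, divided by the length of the reference. A minimal sketch (the two sentences below are invented examples, not the actual podcast transcript):

```python
# Word accuracy of a hypothesis transcription against a reference,
# computed as 1 - WER via Levenshtein distance over word tokens.

def word_accuracy(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming edit distance (substitutions, insertions, deletions)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return 1.0 - d[len(ref)][len(hyp)] / len(ref)

print(round(word_accuracy("globalisation and its effect on economies",
                          "globalisation and the effect economies"), 2))
# one substitution + one deletion over six reference words
```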
So far we have the following file formats: XML, PDF, TextGrid, XMP and TXT. We would like to obtain a unified representation of our transcriptions, including the time information. We are thinking of using TEI/XML, similar to the approach we suggested for linking transcriptions to the British National Corpus audio. This TEI/XML representation could then be exported to a variety of formats such as TTML, SRT or WebVTT. Should we use this TEI/XML representation, or should we use a video caption standard such as TTML or WebVTT directly? Pros and cons? We will report back with an answer soon. Meanwhile, any thoughts or suggestions are welcome. Stay tuned!
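To give a flavour of the export path we have in mind, here is a minimal sketch of turning a TEI-style transcription into WebVTT. Note the assumptions: the sample below puts start/end times in seconds directly on each &lt;u&gt; utterance, whereas real TEI spoken texts normally anchor utterances to a &lt;timeline&gt; of &lt;when&gt; points (and use the TEI namespace), so a production exporter would resolve those references first. The sample text and speaker IDs are invented for illustration.

```python
# Sketch: export a simplified TEI-style transcription to WebVTT.
# Assumption: each <u> carries start/end attributes in seconds.
import xml.etree.ElementTree as ET

TEI_SAMPLE = """<TEI>
  <text><body>
    <u start="0.0" end="3.2" who="#spk1">Welcome to this podcast.</u>
    <u start="3.2" end="7.9" who="#spk1">Today we discuss globalisation.</u>
  </body></text>
</TEI>"""

def to_timestamp(seconds: float) -> str:
    # WebVTT timestamps use the form HH:MM:SS.mmm
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

def tei_to_webvtt(tei_xml: str) -> str:
    root = ET.fromstring(tei_xml)
    lines = ["WEBVTT", ""]
    for u in root.iter("u"):
        start = to_timestamp(float(u.get("start")))
        end = to_timestamp(float(u.get("end")))
        lines.append(f"{start} --> {end}")
        lines.append(u.text.strip())
        lines.append("")  # blank line terminates each cue
    return "\n".join(lines)

print(tei_to_webvtt(TEI_SAMPLE))
```

The attraction of this direction is that the TEI side can carry speaker, keyword and provenance markup that caption formats cannot, while the WebVTT (or SRT/TTML) side stays trivially derivable.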