In the Studio: Speech to Text and 64-bit Performance in Adobe Production Premium CS4
This is a review of Adobe's Speech to Text feature, which is found in both Adobe Premiere Pro CS4 and Soundbooth CS4. I'm stating this up front because somewhere in the 4 months between being commissioned to test and write about this new feature and completing my review, I decided that I would be doing a disservice to my readers by spending the entire article discussing a feature that is just a bit ahead of its time while ignoring one that has been around for a few versions but is finally fully functional. More on the Speech to Text stuff in a bit.
The Missing Link (Until CS4)
In the 8 years I have been editing on Adobe Premiere and Premiere Pro, I've seen a lot of new features and improvements over seven different versions. Mind you, some of the versions were .5 and even a notable .51 version, but despite their fractional upgrade nomenclature, these half-steps have been some of the most significant upgrades, including Premiere 6.5, which introduced real-time previews, and Premiere Pro 1.51, which included HDV support. By far, the most exciting new feature with Adobe's latest NLE, Premiere Pro CS4, is the Dynamic Link functionality that allows the editor to send a Premiere timeline to Encore without first rendering to an intermediate or delivery codec.
Once you send an initial sequence to Encore, you cannot use the same Dynamic Link function to send additional sequences and form additional timelines. However, I discovered that as long as the Encore project has been created and remains open, Adobe now permits the drag and drop of actual Premiere Pro sequences straight into Encore! The time savings are attractive, but skipping an intermediate codec also improves the final encode quality and takes the guesswork out of bitrate calculations, because Encore can automatically select the optimal bitrate for the project.
And that isn't even the best part. Because Encore is linking to an active and editable Premiere Pro sequence, when you inevitably find that you need to make a small (or major) change to the original Premiere timeline(s), it automatically updates in Encore without any re-exporting, relinking, or redoing anything other than having Encore retranscode to your DVD, Blu-ray, or Flash delivery codec.
And it gets better still. The same Dynamic Link function also allows Encore menus to be edited in Photoshop, Premiere Pro audio files in Soundbooth, Premiere Pro video files in After Effects as a new composition, and, most excitingly, Premiere Pro sequences in the new stand-alone Adobe Media Encoder. Now I feel it is worth noting that having all of these individual applications rather than one larger "mega-app" is a benefit in the new era of 64-bit OS computing.
The difference between 32 and 64 is only 32, but when those numbers are exponents on a base of 2, the difference is staggering. On a 32-bit OS, the memory ceiling is 2^32 addresses, or 4 GB of RAM, that can be referenced, but a 64-bit OS has a theoretical memory ceiling of 2^64 addresses, enough for 16 exabytes of RAM.
Now, video editors might be used to using the prefixes "mega," "giga," and "tera" on an almost daily basis, but the next magnitudes are less common, so I figured I'd share them with you: "peta" comes next at 1,000^5, and then "exa" at 1,000^6, which means a 64-bit OS can address a cool 4 billion times more RAM than a 32-bit OS can. So now, while most applications are still written on 32-bit architecture (with notable exceptions being the 64-bit versions of Photoshop and Lightroom CS4), individually they can address less than 4 GB of RAM, regardless of how much RAM is paired with the 64-bit OS. So running multiple applications solves this limitation as each application can address its own RAM.
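The arithmetic above is easy to check. The following is a minimal sketch, assuming one byte per address and binary units (2^30 bytes per gigabyte, 2^60 bytes per exabyte); the function and constant names are invented for illustration.

```python
# Address-space arithmetic for 32-bit vs. 64-bit addressing.
# Assumption: each address refers to one byte of RAM.
ADDRESSES_32 = 2 ** 32
ADDRESSES_64 = 2 ** 64

GIB = 2 ** 30  # bytes in a binary gigabyte
EIB = 2 ** 60  # bytes in a binary exabyte


def address_space_summary():
    """Return the two memory ceilings and the ratio between them."""
    return {
        "gigabytes_32bit": ADDRESSES_32 // GIB,
        "exabytes_64bit": ADDRESSES_64 // EIB,
        "ratio": ADDRESSES_64 // ADDRESSES_32,
    }


if __name__ == "__main__":
    summary = address_space_summary()
    print(f"32-bit ceiling: {summary['gigabytes_32bit']} GB")
    print(f"64-bit ceiling: {summary['exabytes_64bit']} exabytes")
    print(f"Ratio: {summary['ratio']:,} times more addresses")
```

The ratio works out to 2^32, or roughly 4.3 billion. Note that the 16-exabyte figure uses binary units; counted in strict powers of 1,000, the same 2^64 bytes come to about 18.4 exabytes.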
Running programs in the background, such as Adobe Media Encoder or Encore, also allows an editor to continue working on one project while another is encoding.
Back to Speech to Text
Now, back to our regularly scheduled review of the Speech to Text feature. My Adobe review team left me with the impression that the idea behind this feature is to create more value in the videos that are being edited, especially with the movement of programming and advertising budgets from TV to online and portable devices. The reason why my review got buried so deep within this article is that this particular feature is at once a bit ahead of its time and not quite ready for prime time, much like Dynamic Link was only a few versions ago.
As the name of the feature implies, Speech to Text creates transcriptions from the audio track with individual words that are time indexed and linked to the original video. The transcription is performed by Adobe Media Encoder, so as long as you have a 64-bit OS on a new quad-core with plenty of RAM, you can start your edit while the transcription is happening in the background. The accuracy of the transcription itself depends on the combination of a clean audio track and a speaker who enunciates clearly with an accent that matches one of the four English language choices: U.S., Australian, U.K., and Canadian (or one of the six other languages that are offered). The choice of accents made this Canadian chuckle and at the same time beam with pride that his relatively flat accent deserves its own accent module while New Englanders, New Yorkers, and New Orleanians all share the same U.S. accent option. I'm sure a similar pride is being felt in Australia, a former British colony that for some reason earned its own accent module while England, the former seat of the British Empire, shares its module with the Scots, Irish, and Welsh.
I experienced transcription accuracy ranging from 60% to 90% through a variety of projects and speakers, but even the 90% accuracy figure wasn't high enough for me to use as a proper transcript without considerable corrections, which I found were more time-consuming than simply transcribing the audio the old-fashioned way. As each word is linked to a specific timecode, editing while preserving this link required double-clicking on the incorrect word and then typing in the correct word.
Doing this for the longer and important words is worth the effort, but the majority of the corrections were with the small words, and there's no way to select a group of words and simply retype them. You can copy and paste the entire transcription to any word processor, but because the transcription often gets the word completely wrong, you can't simply run a spell-check and expect the meaning to be accurate. In fact, the Speech to Text feature doesn't add any punctuation, so listening to the original audio is required anyway.
To illustrate my point that you can't rely on the transcription, in Table 1 (below) I've included some of the transcription mistakes from one interview I did with a professional coach that changed the meaning of his original dialogue so much that portions were unrecognizable.
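Table 1. Transcription mistakes from one interview
| What was said | What Speech to Text transcribed |
| crew talks | true toxin |
| it's a great experience | it's a great stress |
| Chile [said four times] | still, July, Chile [twice] |
| engagement piece [two times] | any given peace, gaze been peace |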
Linking the Transcript to the Video—Imagine the Possibilities
Despite these initial shortcomings, I do feel that the Speech to Text feature creates exciting possibilities. The purpose of linking text from the transcript to the video is so that you can search for specific words either in the editing process or in the finished video to jump to a specific part of the video that you are looking for. Right now the feature is the most useful during the editing process.
A week before I wrote this article, I finished editing a series of videos for a consulting company based around the themes of culture, difference, and recognition, featuring its awards ceremony and its proudest moment. Although I listened to each of the 30 15-minute interviews to discover these common themes and then put my sound-bite-filled clips on separate themed sequences, I could have saved a lot of time by simply searching the transcripts for these keywords.
Just to see if this would have been feasible, I searched the transcript after the fact and found four of the five keywords and clips that I ended up using in my edit from one speaker by simply searching the text in the metadata panel where the speech transcript is found. It did miss one of the clips I ended up using, but I couldn't fault Adobe on this one, as the speaker didn't actually use the keyword in his answer.
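To make the idea of searching a time-indexed transcript concrete, here is a minimal sketch. It does not use Adobe's metadata format or any Adobe API; the `TimedWord` structure, the sample words and times, and the 30 fps non-drop-frame timecode are all assumptions made up for illustration.

```python
from typing import List, NamedTuple


class TimedWord(NamedTuple):
    """One transcribed word and the time (in seconds) at which it starts.

    Hypothetical structure, not Adobe's actual metadata layout.
    """
    text: str
    start_seconds: float


def find_keyword(transcript: List[TimedWord], keyword: str) -> List[float]:
    """Return the start time of every word matching the keyword (case-insensitive)."""
    target = keyword.lower()
    return [
        word.start_seconds
        for word in transcript
        if word.text.lower().strip(".,!?") == target
    ]


def to_timecode(seconds: float, fps: int = 30) -> str:
    """Format seconds as HH:MM:SS:FF, assuming non-drop-frame timecode."""
    total_frames = int(round(seconds * fps))
    frames = total_frames % fps
    hours, remainder = divmod(total_frames // fps, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}:{frames:02d}"


# Invented sample data standing in for one interview clip.
SAMPLE = [
    TimedWord("our", 12.0),
    TimedWord("culture", 12.4),
    TimedWord("rewards", 13.0),
    TimedWord("difference", 13.5),
    TimedWord("and", 14.1),
    TimedWord("Culture", 75.5),
]

if __name__ == "__main__":
    for hit in find_keyword(SAMPLE, "culture"):
        print(to_timecode(hit))
```

A search like this only finds words that were transcribed correctly and actually spoken, which is why the one clip where the speaker never used the keyword could not be found.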
What I learned from this is that to facilitate future searches, I should put a microphone on the interviewer and record his or her audio on a separate audio track to allow a search of both the question and the answer to quickly find keywords.
One shortcoming of the speech transcript is that the entire transcript from each original clip is linked to the video every time it is used, even if you trim its in and out points, and it doesn't join itself with the transcripts from adjacent clips on the timeline. Nesting the sequence severs the link between the clip and the transcript, so as far as I could tell, the only way to get a transcript from an edited project is to export the audio track to a new file, import it into the project, retranscribe that new audio clip, and then re-edit any transcription errors you may have corrected in the original transcription of the raw footage.
That additional step might be time-consuming, but my Adobe review team tells me they are working with Google to ensure that this metadata can be read by the search engine when it is part of an SWF file. Currently, the only way for search engines to index video is by user-generated metadata, which limits its value, but once the metadata matches the actual content of video, it will suddenly become more valuable to advertisers who want to accurately target advertising with video like never before.
Until this happens, the next best thing is to publish an HTML version of the video transcript along with the video, which is exactly what Hal Landen of VideoUniversity.com did with my recent video review of the ILY Athena duplicating tower (www.videouniversity.com/vu-webcast-2).
So there you have it: the Adobe Speech to Text feature, a very promising work in progress. Here's hoping it someday grows up to be as useful as its fully matured and nigh-indispensable sibling, Dynamic Link.
Shawn Lam (video at shawnlam.ca) runs Shawn Lam Video, a Vancouver video production studio. He specializes in stage event and corporate video production and has presented seminars at WEVA Expo 2005–7 and 2009. He won a Silver Creative Excellence Award at WEVA Expo 2008 and an Emerald Artistic Achievement Award at Video 08.