Rethinking the Transcript Editing Experience
Check out the YouTube video that accompanies this article here.
Transcript editing is its own beast. Let me explain.
On the surface it looks similar to your typical essay or document. The only differences are the speaker labels (who says what utterance) and timestamps. That's pretty much it.
And if you're working with a small transcript and short media it's not difficult to edit to verbatim. Set the audio to play at half speed on VLC, minimize the window, and try to keep up.
But if you're editing anything longer than five minutes, with multiple speakers, and doing this multiple times with different transcripts, the aggravation starts bubbling real quick.
The timestamps, for one, don't change when you edit something down. Sometimes that's fine, sometimes it's not. Changing speakers is a pain. When you're dealing with video captions in SRT/VTT files, it's an even bigger annoyance. There are cascading effects that really get you mired.
Quick change of subject, and some background. All of 2023 our team at Wordcab was trying to find the next thing we were going to put energy and resources into. Wordcab is an API-as-a-Service, and has some (very) large competitors. We focus on what makes us special but it isn't enough.
We wanted to introduce a new product that could leverage our APIs, as well as any new AI advancements that, at this point, arrive at a daily cadence. Instead of swimming against the wave, we wanted to make a better surfboard to ride it. That meant focusing on user experience and a modular feature system.
We chose to build an editor that would act as a solid base to build on top of. Editors can be so much more, especially transcript editors.
As cliche as it is to say we built something from first principles, that's what we did. We started with a blank page, white on grey background in terms of CSS, and then asked a lot of people a lot of questions to figure out what will make editing transcripts fun. Not less painful, but fun and fast.
Wordflo, the editor we built, draws from my decade of experience as a copywriter, my lifelong gaming habit, minimalist brands I like, and gut instinct. It has three risky features.
The biggest and riskiest feature is elevating words to first-class citizens and relegating characters to the sidelines. Meaning, you don't edit, navigate, or redact characters, but entire words.
It's risky because a lot of people would find this unfamiliar and potentially even uncomfortable. It's closer to Notion than Docs, but it's even more demarcated. But discipline equals freedom, as Jocko Willink espouses.
My theory is that this approach will lead to a lot less frustration in the long run for transcript editors (I mean the people doing the editing).
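To make the word-first idea concrete, here is a minimal sketch in Python. It is not Wordflo's actual code; the names (Token, WordBuffer, move, replace_current, redact_current) are made up for illustration. The point is only that the cursor steps over whole words and every edit touches a whole word, never a character inside one.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One word in the transcript. Hypothetical structure, for illustration only."""
    text: str
    redacted: bool = False


@dataclass
class WordBuffer:
    """A transcript held as a list of words, with a cursor that points at a word."""
    tokens: list
    cursor: int = 0

    @classmethod
    def from_text(cls, text):
        # Split on whitespace so each word becomes its own unit.
        return cls([Token(t) for t in text.split()])

    def _clamp(self):
        # Keep the cursor on a valid word (or at 0 when the buffer is empty).
        self.cursor = max(0, min(len(self.tokens) - 1, self.cursor))

    def move(self, step):
        # Navigation is by word: +1 is the next word, -1 the previous one.
        self.cursor += step
        self._clamp()

    def replace_current(self, new_text):
        # Swap the whole word under the cursor for one or more new words.
        if not self.tokens:
            return
        self.tokens[self.cursor:self.cursor + 1] = [Token(t) for t in new_text.split()]
        self._clamp()

    def redact_current(self):
        # Mark the whole word as redacted rather than deleting characters.
        if self.tokens:
            self.tokens[self.cursor].redacted = True

    def render(self):
        return " ".join("[redacted]" if t.redacted else t.text for t in self.tokens)
```

Under these assumptions, fixing a typo means moving to the word and replacing it in one step, which is roughly what the highlight-and-edit popup does.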
The second risky feature is promoting editors to power users from the start. Keyboard shortcuts are made stupidly simple, with most being a single key (e.g. "T" for table of contents, "S" to show or hide edits and redactions, "C" to open comments, "F" for find and replace/strikethrough).
Usually you hold CTRL or ALT or something, but this just doesn't feel good when you're going a mile a minute, trying to edit a boring transcript as fast as possible. The compromise here is that you need to highlight words and press "R" to edit them in a little popup box. But my instinct is that automatic transcription and diarization will improve quickly enough that this will be a rarity.
The third (and less risky) feature is hard-synced media and text playback. Audio/video and text are essentially one. When you highlight something with your mouse, or hold SHIFT and use the arrow keys to select words, and then press BACKSPACE, it redacts the audio/video and changes every single subsequent timestamp.
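Here is a rough sketch of that ripple effect in Python. Again, this is not how Wordflo is implemented. The Word structure, the millisecond fields, and the redact function are assumptions for illustration, and I'm assuming the cut runs from the start of the first selected word to the end of the last one.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A transcribed word with its position in the media, in milliseconds."""
    text: str
    start_ms: int
    end_ms: int


def redact(words, first, last):
    """Remove words[first..last] (inclusive) and pull every later word earlier.

    Returns a new list plus the number of milliseconds cut from the media.
    The input list is left untouched.
    """
    if not (0 <= first <= last < len(words)):
        raise IndexError("selection is outside the transcript")

    # Assumption: the cut spans from the first selected word's start
    # to the last selected word's end.
    removed_ms = words[last].end_ms - words[first].start_ms

    kept = [Word(w.text, w.start_ms, w.end_ms) for w in words[:first]]
    # Every word after the selection shifts earlier by the removed duration.
    shifted = [
        Word(w.text, w.start_ms - removed_ms, w.end_ms - removed_ms)
        for w in words[last + 1:]
    ]
    return kept + shifted, removed_ms
```

For example, cutting a 400 ms "um" out of the middle of a sentence moves every following word 400 ms earlier, so the text and the media stay lined up without anyone retyping a timestamp.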
There are so many things we've added to make editing faster and more fun, but the three above are the most noticeable.
I made a video accompanying this piece that walks through the three features and some others, and gives a better sense of what we're aiming to create. Click the blue button below to check it out, or click this link.