Special Thanks: Statsig Team + Timothy Chan, James Kirk, Mike Vernal, Patrick Phipps, and Jessica Owen
OpenAI published a new artificial intelligence model that can transcribe speech with near-human accuracy. Known as Whisper, the model makes 50% fewer errors than its predecessors. We ran an A/B test on Statsig comparing the error rate of Whisper against Google's flagship Speech-to-Text API. The purpose of the test was to understand which platform produces fewer transcription errors on Captions’ production workload and how the change in AI model impacts the user experience.
The results show that Whisper is the clear winner in transcription accuracy.
Captions is a new mobile creator studio that uses AI to help creators through the entire process of content creation, from scripting to recording to editing and sharing. Our app offers a transcript-based video editing interface that makes video editing as simple as text editing. We also produce professional-looking, word-by-word captions that are synced to voice without needing expensive human transcription services.
Captions uses AI to transcribe videos in real-time and offers an intuitive transcript editor to make any desired updates. When Captions returns inaccurate results, it creates additional work for our users and decreases the overall quality of their experience. We strive to minimize the number of corrections our users need to make to transcripts produced by our AI. To that end, we’re always in pursuit of the most accurate Automatic Speech Recognition (ASR) model. We've been running Google's Speech-to-Text API in production for the last year, and it's been working well. But we wanted to see what Whisper could do.
The Journey to Production
Our journey kicks off with a proof-of-concept. OpenAI offers installation instructions for the Whisper Python library. We converted those instructions into a Dockerfile and used Google Cloud Console to provision an NVIDIA A100 machine to run our containerized application.
The application listens on a PubSub topic. Each message in the topic represents a transcription request with the audio file location and a request id. As messages stream in, the application runs Whisper to transcribe the audio files and writes its results to a database.
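As a rough illustration, the worker described above could be shaped like the sketch below. It is not the production code: the JSON field names (`request_id`, `audio_uri`), the `transcribe` callable standing in for the Whisper call, and the in-memory dict standing in for the database are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class TranscriptionRequest:
    request_id: str
    audio_uri: str


def parse_request(payload: bytes) -> TranscriptionRequest:
    """Decode one message body. The JSON field names are assumed."""
    data = json.loads(payload.decode("utf-8"))
    return TranscriptionRequest(
        request_id=data["request_id"],
        audio_uri=data["audio_uri"],
    )


def handle_message(
    payload: bytes,
    transcribe: Callable[[str], str],
    store: Dict[str, str],
) -> None:
    """Transcribe the referenced audio and record the text under its request id."""
    request = parse_request(payload)
    store[request.request_id] = transcribe(request.audio_uri)


# Stand-ins: a fake transcriber and a dict playing the role of the database.
results: Dict[str, str] = {}
message = json.dumps(
    {"request_id": "req-1", "audio_uri": "gs://example-bucket/audio.wav"}
).encode("utf-8")
handle_message(message, lambda uri: f"transcript of {uri}", results)
```

Passing the transcriber and the store in as arguments keeps the handler easy to test without a GPU or a live topic; in a real deployment the handler would be registered as the subscriber callback.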
Using Statsig’s feature gates, we targeted internal users for our initial rounds of testing. Statsig is a comprehensive statistical tool that provides real-time analytics and has a simple user interface. It is our go-to tool for running A/B tests and analyzing outcomes.
With the Whisper pipeline hooked up to the frontend, we rallied the team for a bug bash. We took inspiration from our customer inbox to identify videos that could test Whisper’s limits. Over the last year, we’d been using Google to create transcriptions for our customers, and kept a log of when its Speech-to-Text performed particularly poorly. These videos shared common characteristics:
- background noise, such as ambient room noise, outside noise or music playing in the background
- musical performance by the speaker (singing, rapping, spoken-word poetry)
- English spoken with an accent
- rapid speech
We assembled ~20 videos that displayed one or more of the listed characteristics and compared the performance between the contenders. We were blown away by how much Whisper outperformed our existing system in transcription accuracy. In one instance, we watched Whisper transcribe Eminem's "Godzilla" perfectly - a feat considering the song holds the Guinness World Record for the Fastest Rap in a No.1 Single, with 224 words packed into 31 seconds. Google’s Speech-to-Text was nowhere close to transcribing it. If you're doubting Whisper because Eminem's lyrics are publicly available, we've got a video for you. Whisper is able to accurately transcribe even the most complicated freestyle rap, as demonstrated by the following video of rapper Mac Lethal packing 400 words in a minute.
We declared Whisper the winner of the internal testing round. Our new goal was to ship it to production to see how it would perform under real-world conditions. However, two key challenges needed solving first: scalable infrastructure and word-level timestamps.
How We Made Whisper Work for Captions
Whisper's highest quality model is both GPU- and memory-intensive. Given that we aim to serve users their transcription in under a minute, our infrastructure needs to scale with traffic and serve those requests in a timely manner. We chose Kubernetes.
Kubernetes is a popular orchestration engine for automating deployment, scaling, and management of containerized applications. Google Kubernetes Engine (GKE) enables us to quickly deploy the Whisper model and share GPUs across instances with simple YAML. We were fortunate to time our move to GKE with their release of image streaming. GKE’s new system cuts the time to pull down image updates and roll out changes from minutes to seconds.
Captions allows users to style their transcriptions to best reflect their personal brand and message. As part of the customization options, users can choose to display spoken words one word at a time, or attach images, sounds, emojis and font colors to specific words. The challenge is that Whisper produces timestamps for segments, not individual words. To maintain feature parity, we implemented our own algorithm to compute word-level timestamps by leveraging a model that predicts the likelihood that a word in an audio clip corresponds to a particular timeframe. The algorithm performs the following steps:
- Split the audio track into multiple files, one per Whisper segment
- Apply the model to each audio snippet and Whisper text - the model generates a graph of probabilities for word-to-timeframe pairs
- Walk the graph to determine the most likely word-timestamp pair
- Return the most probable word-timestamps for all segments
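To make the "walk the graph" step more concrete, here is a minimal sketch of one common way to do it: a Viterbi-style dynamic program over a frames-by-words score matrix for a single segment. This is a stand-in technique, not necessarily the algorithm described above. The input format (`log_probs[t][w]` as the model's log-score that audio frame `t` belongs to word `w`), the fixed frame duration, and the rule that every word covers at least one frame are all assumptions for the example.

```python
import math
from typing import List, Tuple

NEG_INF = float("-inf")


def align_words(
    log_probs: List[List[float]], frame_seconds: float
) -> List[Tuple[float, float]]:
    """Return (start, end) in seconds for each word of one segment.

    log_probs[t][w] is the assumed model score that frame t belongs to word w.
    Words appear in order and each word covers at least one frame.
    """
    if not log_probs or not log_probs[0]:
        raise ValueError("need at least one frame and one word")
    n_frames, n_words = len(log_probs), len(log_probs[0])
    if n_frames < n_words:
        raise ValueError("need at least one frame per word")

    # best[t][w]: best total score of any in-order path that puts frame t in word w.
    best = [[NEG_INF] * n_words for _ in range(n_frames)]
    # advanced[t][w]: True if frame t is the first frame of word w on that path.
    advanced = [[False] * n_words for _ in range(n_frames)]
    best[0][0] = log_probs[0][0]
    for t in range(1, n_frames):
        for w in range(min(t + 1, n_words)):
            stay = best[t - 1][w]
            advance = best[t - 1][w - 1] if w > 0 else NEG_INF
            advanced[t][w] = advance > stay
            best[t][w] = log_probs[t][w] + max(stay, advance)

    # Walk back from the last frame of the last word to find each word's first frame.
    first_frame = [0] * n_words
    w = n_words - 1
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][w]:
            first_frame[w] = t
            w -= 1

    boundaries = first_frame + [n_frames]
    return [
        (boundaries[i] * frame_seconds, boundaries[i + 1] * frame_seconds)
        for i in range(n_words)
    ]


# Made-up scores: five 0.5-second frames, two words.
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
scores = [[math.log(p) for p in row] for row in probs]
word_times = align_words(scores, frame_seconds=0.5)
```

Because each segment is aligned on its own, the per-segment results would still need to be offset by the segment's start time before being merged into one transcript.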
We’ve been collecting user metrics with Statsig’s Custom Metrics over the course of the year. We hypothesized that Whisper would reduce the number of corrections a user needed to make to their transcription. We looked at the following events to test our hypothesis:
Statsig recommended that we use ratio metrics, which express the relationship between two values. In our case, the numerator is the total number of user corrections and the denominator is the total number of transcripts.
Without ratio metrics, we might think that a decrease in correction events means that users are not making corrections. However, it could just mean that there are fewer users on the app because they have left. The ratio metrics show that users are still on the app, and that the corrections are actually down.
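The difference between a raw count and a ratio metric is easy to see with a small calculation. The sketch below uses made-up totals, not our experiment data, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupTotals:
    corrections: int
    transcripts: int


def corrections_per_transcript(group: GroupTotals) -> float:
    """Ratio metric: total corrections divided by total transcripts."""
    if group.transcripts == 0:
        raise ValueError("no transcripts in this group")
    return group.corrections / group.transcripts


def relative_change(control: float, test: float) -> float:
    """Change of test relative to control, e.g. -0.25 means 25% lower."""
    return (test - control) / control


# Illustrative numbers only.
control = GroupTotals(corrections=1200, transcripts=400)
test = GroupTotals(corrections=600, transcripts=300)

raw_change = relative_change(control.corrections, test.corrections)
change = relative_change(
    corrections_per_transcript(control),
    corrections_per_transcript(test),
)
```

In this toy example the raw correction count is halved, but part of that drop comes from the test group simply producing fewer transcripts. The ratio drops by about a third, which is the change attributable to each transcript needing fewer corrections.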
The A/B Test
The experiment ran for 2 weeks on 50% of our production users. The A/B results showed Whisper overwhelmingly outperformed Google’s Speech-to-Text. The following screenshots from our Statsig console show the outcomes of the experiment.
23% fewer daily active users made any corrections
20% fewer additions of missing words
45% fewer corrections per transcription
Improvement is sustained N days since first exposure
When data scientists are conducting A/B tests, they may find that the results are affected by the "novelty effect." The novelty effect is when users try a new feature out of curiosity, even if the feature is not actually better. To check for a novelty effect, data scientists segment users into new and returning groups. If the feature is winning for returning users but not for new users, it's likely that the results are being affected by the novelty effect. The graph, "Days Since Exposure", shows that the reduction in corrections is sustained well past the initial exposure. This means that Whisper reduces the number of corrections users make over time, rather than just in the short term.
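A "days since exposure" view can be approximated by bucketing the ratio metric by how long each user has been in the experiment. The sketch below is a simplified illustration, not how Statsig computes its chart; the record layout and the group labels `"control"` and `"test"` are assumptions, and the numbers are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (group, days_since_first_exposure, corrections, transcripts)
Record = Tuple[str, int, int, int]


def change_by_day(records: Iterable[Record]) -> Dict[int, float]:
    """Relative change in corrections per transcript (test vs control) per day bucket."""
    totals: Dict[Tuple[str, int], List[int]] = defaultdict(lambda: [0, 0])
    for group, day, corrections, transcripts in records:
        totals[(group, day)][0] += corrections
        totals[(group, day)][1] += transcripts

    changes: Dict[int, float] = {}
    for day in sorted({day for (_, day) in totals}):
        control = totals.get(("control", day))
        test = totals.get(("test", day))
        # Skip buckets where either group has no transcripts to compare.
        if not control or not test or control[1] == 0 or test[1] == 0:
            continue
        control_rate = control[0] / control[1]
        test_rate = test[0] / test[1]
        if control_rate == 0:
            continue
        changes[day] = (test_rate - control_rate) / control_rate
    return changes


# Illustrative records only.
records = [
    ("control", 0, 30, 10),
    ("test", 0, 20, 10),
    ("control", 7, 30, 10),
    ("test", 7, 15, 10),
]
changes = change_by_day(records)
```

If the improvement were driven by novelty, the per-day values would drift back toward zero as days since exposure increase. A change that stays negative across buckets points to a lasting effect.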
The Surprising Insight around English Accents
Our team wondered how Whisper would perform with accented English, since we knew this was something Google's Speech-to-Text struggled with. Using Statsig’s Custom Queries, we looked into the same metrics split by locale.
62% fewer Australian-English speakers made any corrections
65% fewer British-English speakers made any corrections
35% fewer German-English speakers made any corrections
44% fewer Canadian-English speakers made any corrections
While we aim to deliver transcriptions in under a minute, the reality is that Whisper transcription times are a function of the audio duration and the length of our request queue. For a Whisper transcription, the median latency is 36 seconds and the 99th percentile (p99) is 124 seconds. On the flip side, the Google transcription median latency is 6 seconds and the p99 is 111 seconds. While users expressed frustration at the prolonged wait times, in practice, they still found the higher quality transcript worth the wait.
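For readers less familiar with percentile latencies, the sketch below shows one way to compute a median and a p99 from a list of per-request latencies using Python's standard library. The sample values are invented, and the choice of the `inclusive` interpolation method is an assumption; other tools may interpolate slightly differently.

```python
import statistics
from typing import Sequence, Tuple


def latency_summary(latencies_seconds: Sequence[float]) -> Tuple[float, float]:
    """Return (median, p99) for a collection of request latencies in seconds."""
    if len(latencies_seconds) < 2:
        raise ValueError("need at least two samples")
    median = statistics.median(latencies_seconds)
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    p99 = statistics.quantiles(latencies_seconds, n=100, method="inclusive")[98]
    return median, p99


# Invented samples: mostly quick requests plus a slow tail.
samples = [28.0, 31.0, 33.0, 35.0, 36.0, 38.0, 41.0, 47.0, 60.0, 120.0]
median, p99 = latency_summary(samples)
```

The median describes a typical request, while the p99 captures the slow tail that appears when long audio files or a deep request queue push wait times up.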
Stats aside, our users have written to us with positive feedback about the improved transcriptions.
We’re also discovering incredible unplanned benefits to using Whisper on Captions. Notably, though our app doesn’t support transcriptions in Romanian (yet!), we’ve found that Romanian speakers have still been able to use Captions to generate flawless translated transcriptions. Whisper is magic.
We tested Whisper and Google in a first-ever production head-to-head, and Whisper proved better in multiple categories. In particular, Whisper excelled in accuracy for videos featuring rapid speech and English accents, so we rolled it out to 100% of our customers. We also discovered an added bonus that we hadn’t planned for: auto-translation to English text.
We’re super excited about what else we can do with Whisper and future AI models as we continue to grow Captions.
The results shared in this post are based on 1-3 minute talking videos in English, mainly focused on a single speaker. There are many directions to take from here.
Additional Language Support
Since our experiment, we’ve released Whisper for French, German, Swedish, Italian and Dutch with similarly positive feedback. Captions offers transcriptions in 22 languages - many of which are still powered by Google’s Speech-to-Text. Though Whisper was trained on 96 languages outside of English, the team is working on evaluating the accuracy of Whisper on the remaining supported languages.
We called out that Whisper is auto-translating non-English speech into English text. This is a cool hidden feature, which could be refined into an officially supported feature.
The NVIDIA A100 is a top-of-the-line GPU designed to handle the most demanding AI workloads, with pricing that reflects its premium performance. Going forward, we’d like to experiment with different machine types and GPU setups to fine-tune our Whisper infrastructure cost and performance.
Support Longer Videos and Multi-Speaker Diarization
As we continue to expand the capabilities of our mobile creator studio, we want to support more use cases, including longer videos and multi-speaker videos.
This is just the beginning of our journey. If any of our work interests you, come join us at Captions. We're hiring!