Technology

OpenAI’s Whisper Model Crushes Google in AI Head-to-Head

Learn how Captions used Statsig to test the performance of OpenAI's new Whisper model against Google's Speech-to-Text.

by Kim Win · November 9, 2022 · 6 Min Read

https://www.getcaptions.app/blog-post/openai-beats-google

Special Thanks: Statsig Team + Timothy Chan, James Kirk, Mike Vernal, Patrick Phipps, and Jessica Owen

OpenAI recently published Whisper, a new artificial intelligence model that transcribes speech with near-human accuracy and makes 50% fewer errors than its predecessors. We ran an A/B test on Statsig comparing the error rate of Whisper against Google's flagship Speech-to-Text API. The goal was to understand which platform produces fewer transcription errors on Captions' production workload and how the change in AI model affects the user experience.

The results show that Whisper is the clear winner in transcription accuracy.

Introducing Captions

Captions is a new mobile creator studio that uses AI to help creators through the entire process of content creation, from scripting to recording to editing and sharing. Our app offers a transcript-based video editing interface that makes video editing as simple as text editing. We also produce professional-looking, word-by-word captions that are synced to voice without needing expensive human transcription services.

Captions uses AI to transcribe videos in real time and offers an intuitive transcript editor for making any desired updates. When Captions returns inaccurate results, it creates additional work for our users and decreases the overall quality of their experience. We strive to minimize the number of corrections our users need to make to transcripts produced by our AI. To that end, we're always in pursuit of the most accurate Automatic Speech Recognition (ASR) model. We've been running Google's Speech-to-Text API in production for the last year, and it's been working well. But we wanted to see what Whisper could do.

The Journey to Production

Our journey kicked off with a proof of concept. OpenAI provides installation instructions for the Whisper Python library; we converted those instructions into a Dockerfile and used Google Cloud Console to provision an NVIDIA A100 machine to run our containerized application.
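
For reference, the heart of that container is only a few lines of Python. The snippet below is a minimal sketch of the Whisper library as OpenAI documents it; the model size and audio path are illustrative, not our exact setup.

```python
# Minimal proof-of-concept of the Whisper Python library; the "large" model
# and the local audio path are illustrative choices, not our production setup.
import whisper

model = whisper.load_model("large")                  # downloads weights on first use
result = model.transcribe("sample_video_audio.mp3")  # expects a local audio file

print(result["text"])                                # the full transcript
for segment in result["segments"]:                   # segment-level timestamps
    print(segment["start"], segment["end"], segment["text"])
```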

The application listens on a PubSub topic. Each message in the topic represents a transcription request with the audio file location and a request id. As messages stream in, the application runs Whisper to transcribe the audio files and writes its results to a database.
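
A simplified sketch of that worker loop is shown below, assuming the google-cloud-pubsub client library, audio already fetched to local disk, and a hypothetical save_transcript helper for the database write; the project name, subscription name, and message schema are illustrative.

```python
# Sketch of the transcription worker. The project, subscription, message
# schema, and save_transcript helper are illustrative assumptions.
import json

import whisper
from google.cloud import pubsub_v1

model = whisper.load_model("large")

def save_transcript(request_id, result):
    """Placeholder for the database write, keyed by request_id."""
    ...

def handle_message(message):
    payload = json.loads(message.data)                # {"request_id": ..., "audio_path": ...}
    result = model.transcribe(payload["audio_path"])  # assumes the audio file is on local disk
    save_transcript(payload["request_id"], result)
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("captions-prod", "transcription-requests")
streaming_pull = subscriber.subscribe(subscription, callback=handle_message)
streaming_pull.result()  # block and process messages as they stream in
```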

Using Statsig’s feature gates, we targeted internal users for our initial rounds of testing. Statsig is a comprehensive statistical tool that provides real-time analytics and has a simple user interface. It is our go-to tool for running A/B tests and analyzing outcomes.
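
As a rough sketch of what that gating looks like on the backend, assuming Statsig's Python server SDK (the gate name, secret key, and user attributes below are placeholders, and the exact SDK surface may differ slightly):

```python
# Route transcription requests through a Statsig feature gate; the gate name
# and secret key are placeholders, not our production values.
from statsig import statsig
from statsig.statsig_user import StatsigUser

statsig.initialize("server-secret-key")

def should_use_whisper(user_id, email=None):
    user = StatsigUser(user_id, email=email)
    # During internal testing, the gate is configured in the Statsig console
    # to pass only for employee accounts.
    return statsig.check_gate(user, "whisper_transcription")

engine = "whisper" if should_use_whisper("user-123", "dev@captions.app") else "google_stt"
```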

With the Whisper pipeline hooked up to the frontend, we rallied the team for a bug bash. We took inspiration from our customer inbox to identify videos that could test Whisper's limits. Over the last year we had been using Google to create transcriptions for our customers, and we kept a log of the videos where Speech-to-Text performed particularly poorly. These videos shared common characteristics:

  • background noise, such as ambient room noise, outside noise, or music playing in the background
  • musical performances (singing, rapping, spoken-word poetry)
  • speakers with accented English
  • rapid speech

We assembled ~20 videos that displayed one or more of the listed characteristics and compared the performance of the two contenders. We were blown away by how much Whisper outperformed our existing system in transcription accuracy. In one instance, we watched Whisper transcribe Eminem's "Godzilla" perfectly, no small feat considering the song holds the Guinness World Record for the Fastest Rap in a No. 1 Single, with 224 words packed into 31 seconds. Google's Speech-to-Text was nowhere close. If you're doubting Whisper because Eminem's lyrics are publicly available, we've got a video for you: Whisper accurately transcribes even the most complicated freestyle rap, as demonstrated by rapper Mac Lethal packing 400 words into a minute.

We declared Whisper the winner of the internal testing round. Our new goal was to ship it to production to see how it would perform under real-world conditions. However, there were two key challenges to solve first: scalable infrastructure and word-level timestamps.


How We Made Whisper Work for Captions

Scalable Infrastructure

Whisper's highest-quality model is demanding in terms of both GPU and memory usage. Given that we aim to serve users their transcription in under a minute, our infrastructure needs to scale to handle traffic and serve those requests in a timely manner. We chose Kubernetes.

Kubernetes is a popular orchestration engine for automating the deployment, scaling, and management of containerized applications. Google Kubernetes Engine (GKE) lets us quickly deploy the Whisper model and share GPUs across instances with simple YAML. We were fortunate to time our move to GKE with the release of image streaming, which dramatically shortens the time to pull image updates and roll out changes, from minutes to seconds.

Architecture of Whisper’s production deployment

Word-level timestamps

Captions allows users to style their transcriptions to best reflect their personal brand and message. As part of the customization options, users can choose to display spoken words one at a time, or attach images, sounds, emojis, and font colors to specific words. The challenge is that Whisper produces timestamps for segments, not individual words. To maintain feature parity, we implemented our own algorithm to compute word-level timestamps, leveraging a model that predicts the likelihood that a word in an audio clip corresponds to a particular timeframe. The algorithm performs the following steps:

  • Split the audio track into multiple files, one per Whisper segment
  • Apply the model to each audio snippet and Whisper text - the model generates a graph of probabilities for word-to-timeframe pairs
  • Walk the graph to determine the most likely word-timestamp pair
  • Return the most probable word-timestamps for all segments
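
To make the "walk the graph" step concrete, here is a minimal sketch of one way to do it, assuming an acoustic model has already produced a matrix of word-versus-timeframe likelihoods; this illustrates the idea rather than reproducing our production implementation.

```python
import numpy as np

def align_words_to_frames(probs):
    """probs[w, t] = likelihood that word w is being spoken during frame t.

    Finds the monotonic segmentation (each word covers a contiguous, ordered
    run of frames) that maximizes the summed log-likelihood, and returns a
    list of (start_frame, end_frame) pairs, one per word.
    """
    n_words, n_frames = probs.shape
    log_p = np.log(probs + 1e-9)
    prefix = np.cumsum(log_p, axis=1)  # prefix[w, t] = sum of log_p[w, :t+1]

    def span_score(w, start, end):
        # log-likelihood of assigning frames [start, end] to word w
        return prefix[w, end] - (prefix[w, start - 1] if start > 0 else 0.0)

    # dp[w, t]: best score with word w ending exactly at frame t;
    # back[w, t]: the start frame of word w in that best solution.
    dp = np.full((n_words, n_frames), -np.inf)
    back = np.zeros((n_words, n_frames), dtype=int)
    for t in range(n_frames):
        dp[0, t] = span_score(0, 0, t)
    for w in range(1, n_words):
        for t in range(w, n_frames):
            for s in range(w, t + 1):  # word w occupies [s, t]; word w-1 ends at s-1
                cand = dp[w - 1, s - 1] + span_score(w, s, t)
                if cand > dp[w, t]:
                    dp[w, t], back[w, t] = cand, s

    # Backtrack from the final frame to recover each word's interval.
    spans, end = [], n_frames - 1
    for w in range(n_words - 1, -1, -1):
        start = back[w, end]
        spans.append((start, end))
        end = start - 1
    return spans[::-1]
```

Frame indices convert to timestamps by multiplying by the acoustic model's frame duration (for example, 20 ms per frame).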

The Hypothesis

We've been collecting user metrics with Statsig's Custom Metrics over the course of the year. We hypothesized that Whisper would reduce the number of corrections a user needs to make to their transcription, and we tracked user correction events to test that hypothesis.

Statsig recommended that we use ratio metrics, which express the relationship between two values. In our case, the numerator is the total number of user corrections and the denominator is the total number of transcripts.

Without a ratio metric, we might read a decrease in correction events as users making fewer corrections, when it could simply mean there are fewer users on the app because they have churned. The ratio confirms that users are still on the app and that corrections per transcript are genuinely down.
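
As a toy illustration of the difference (the numbers below are made up):

```python
# Made-up numbers showing why the raw count of corrections can mislead.
before = {"corrections": 1000, "transcripts": 2000}
after = {"corrections": 700, "transcripts": 1000}

raw_drop = 1 - after["corrections"] / before["corrections"]   # 30% fewer events
rate_before = before["corrections"] / before["transcripts"]   # 0.50 per transcript
rate_after = after["corrections"] / after["transcripts"]      # 0.70 per transcript

# Correction events fell 30% in absolute terms, yet corrections *per transcript*
# rose, because usage itself halved. The ratio metric exposes that.
print(f"raw drop: {raw_drop:.0%}, rate: {rate_before:.2f} -> {rate_after:.2f}")
```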

The A/B Test

The experiment ran for 2 weeks on 50% of our production users. The A/B results showed Whisper overwhelmingly outperformed Google’s Speech-to-Text. The following screenshots from our Statsig console show the outcomes of the experiment.

  • 23% fewer daily active users made any corrections
  • 20% fewer additions of missing words
  • 45% fewer corrections per transcription

Improvement is sustained N days since first exposure

When data scientists are conducting A/B tests, they may find that the results are affected by the "novelty effect." The novelty effect is when users try a new feature out of curiosity, even if the feature is not actually better. To check for a novelty effect, data scientists segment users into new and returning groups. If the feature is winning for returning users but not for new users, it's likely that the results are being affected by the novelty effect. The "Days Since Exposure" graph shows that the reduction in corrections is sustained well past the initial exposure. This means that Whisper reduces the number of corrections users make over time, not just in the short term.

The Surprising Insight around English Accents

Our team wondered how Whisper would perform with accented English, since we knew this was something Google's Speech-to-Text struggled with. Using Statsig’s Custom Queries, we looked into the same metrics split by locale.

  • 62% fewer Australian-English speakers made any corrections
  • 65% fewer British-English speakers made any corrections
  • 35% fewer German-English speakers made any corrections
  • 44% fewer Canadian-English speakers made any corrections

Transcription Latency

While we aim to deliver transcriptions in under a minute, the reality is that Whisper transcription times are a function of the audio duration and the length of our request queue. For Whisper, the median latency is 36 seconds and the p99 is 124 seconds; for Google, the median latency is 6 seconds and the p99 is 111 seconds. While some users expressed frustration at the longer wait times, in practice they still found the higher-quality transcript worth the wait.

User Feedback

Stats aside, our users have written to us with positive feedback about the improved transcriptions.

From a Dutch-speaking user: "Very annoying, but I must confess, the update works well. It was ready in no time!"

We're also discovering incredible unplanned benefits to using Whisper on Captions. Notably, though our app doesn't yet support transcriptions in Romanian, we've found that Romanian speakers have still been able to use Captions to generate flawless English translations of their speech. Whisper is magic.

Conclusion

We tested Whisper and Google in a first-ever production head-to-head, and Whisper proved better in multiple categories. In particular, Whisper excelled in accuracy for videos featuring rapid speech and accented English, so we rolled it out to 100% of our customers. We also discovered an added bonus we hadn't planned for: automatic translation of non-English speech into English text.

We’re super excited about what else we can do with Whisper and future AI models as we continue to grow Captions.

Next Steps

The results shared in this post are based on 1-3 minute English talking videos, mostly featuring a single speaker. There are many directions to take from here.

Additional Language Support

Since our experiment, we've released Whisper for French, German, Swedish, Italian, and Dutch with similarly positive feedback. Captions offers transcriptions in 22 languages, many of which are still powered by Google's Speech-to-Text. Whisper was trained on 96 languages beyond English, and the team is evaluating its accuracy on the remaining supported languages.

We called out above that Whisper auto-translates non-English speech into English text. This cool hidden capability could be refined into an officially supported feature.

Improved Infrastructure

The NVIDIA A100 is a top-of-the-line GPU designed to handle the most demanding AI workloads, with pricing that reflects its premium performance. Going forward, we'd like to experiment with different machine types and GPU setups to fine-tune our Whisper infrastructure for cost and performance.

Support Longer Videos and Multi-Speaker Diarization

As we continue to expand the capabilities of our mobile creator studio, we want to support more use cases, including longer videos and multi-speaker videos.

This is just the beginning of our journey. If any of our work interests you, come join us at Captions. We're hiring!

