Hannes Heikinheimo
Sep 19, 2023
1 min read
OpenAI has generated a lot of interest in its Whisper automatic speech recognition (ASR) system since launching the open source model in September 2022. However, there is little data about Whisper's in-the-wild performance across languages and models. To fill this gap, we tested several Whisper models against manually transcribed YouTube videos for 19 different languages.
While OpenAI has published Whisper accuracy numbers for some English open source data sets, there is relatively little information on performance for other languages. Furthermore, the most common open source benchmarks, such as Common Voice and LibriSpeech, consist of rather clean audio, captured in relatively good acoustic conditions, with well-articulated speech. Transcription in real-life use cases is typically messier: the audio often has poor acoustic conditions and articulation, thick accents, hesitation, overlapping speech, and so on. These factors all made it attractive to conduct a more robust analysis of Whisper performance across model sizes, languages, and audio quality.
To test the models, we manually transcribed 5 hours' worth of YouTube videos in different languages to establish the ground truth. YouTube videos naturally contain the aforementioned “messiness”, so the word error rates (WER) obtained with YouTube data are arguably a better proxy than an open source benchmark for what you might expect in typical in-the-wild transcription scenarios. We used the YouTube data to test different-sized Whisper multilingual speech recognition models, comparing their transcripts to the ground truth to calculate WER. We also computed the relative word error rate reduction between the small and medium Whisper models, denoted WERR: S → M.
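As a sketch of the scoring step: WER is the word-level edit distance between a hypothesis transcript and the reference, normalized by the reference length, and WERR: S → M is the relative reduction in WER when moving from the small to the medium model. A minimal Python illustration (function names are ours, not from our evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev_diag + cost)  # substitution (or match)
            prev_diag = cur
    return d[-1] / len(ref)


def werr_small_to_medium(wer_small: float, wer_medium: float) -> float:
    """Relative word error rate reduction from the small to the medium model."""
    return (wer_small - wer_medium) / wer_small
```

For example, the Thai row below follows this formula: (0.52 − 0.34) / 0.52 ≈ 0.34.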
The resulting word error rates are presented in the table below:
| Language | large | medium | small | base | tiny | WERR: S → M |
|---|---|---|---|---|---|---|
| English | 0.15 | 0.17 | 0.17 | 0.20 | 0.23 | 0.00 |
| Italian | 0.16 | 0.17 | 0.22 | 0.33 | 0.46 | 0.24 |
| German | 0.18 | 0.18 | 0.21 | 0.27 | 0.37 | 0.14 |
| Spanish | 0.19 | 0.19 | 0.20 | 0.28 | 0.37 | 0.07 |
| French | 0.26 | 0.26 | 0.29 | 0.37 | 0.47 | 0.09 |
| Portuguese | 0.25 | 0.28 | 0.28 | 0.39 | 0.48 | 0.02 |
| Japanese* | 0.29 | 0.30 | 0.34 | 0.44 | | 0.11 |
| Danish | 0.30 | 0.30 | 0.41 | 0.64 | 0.83 | 0.25 |
| Swedish | 0.29 | 0.31 | 0.38 | 0.51 | 0.64 | 0.19 |
| Indonesian | 0.31 | 0.31 | 0.38 | 0.52 | | 0.17 |
| Greek | 0.29 | 0.31 | 0.44 | 0.62 | 0.79 | 0.29 |
| Chinese* | 0.33 | 0.33 | 0.35 | 0.44 | | 0.06 |
| Thai* | 0.34 | 0.34 | 0.52 | 0.59 | 0.71 | 0.34 |
| Tagalog | 0.36 | 0.37 | 0.48 | 0.70 | 0.87 | 0.24 |
| Korean | 0.40 | 0.40 | 0.44 | 0.51 | | 0.09 |
| Norwegian | 0.42 | 0.42 | 0.46 | 0.75 | 0.93 | 0.09 |
| Finnish | 0.41 | 0.43 | 0.53 | 0.70 | 0.85 | 0.19 |
| Arabic | 0.52 | 0.53 | 0.61 | 0.75 | 0.88 | 0.14 |
| Hindi | 0.60 | 0.67 | 1.04 | 1.08 | | 0.35 |
* Character error rate (CER) instead of word error rate.
The top-performing languages for Whisper transcription accuracy are English, Italian, German, and Spanish. Mid-performing languages include French, Portuguese, and Japanese, while the worst-performing languages are Arabic and Hindi.
It is worth noting that the small model often offers the best value for money. There are only slight gains in running the large or medium models in most languages. However, there are some exceptions where the medium model does provide relevant accuracy gains. Languages such as Italian, Danish, Greek, Thai, Tagalog, and Finnish show a noticeable improvement in accuracy when using the medium model compared to the small model.
Additionally, the large model does not provide significant accuracy gains over the medium or small models for most languages. This suggests that, in general, the small and medium models offer the best balance between cost and performance.
* Actually, Whisper does offer Dutch, but we just couldn't resist the temptation 😎
Speechly is a YC-backed company building tools for speech recognition and natural language understanding. Speechly offers flexible deployment options (cloud, on-premise, and on-device), super-accurate custom models for any domain, and privacy and scalability for hundreds of thousands of hours of audio.