Hannes Heikinheimo
Sep 19, 2023
1 min read
Voice chat has become an essential feature in many games and social media platforms. Axlebolt Studios found that adding voice chat to Standoff 2 improved 90-day retention by over 60%.
Tod Bouris of Vivox shared even more dramatic figures at Unite Copenhagen in 2019. “The metrics show that people that use communications during their gaming, game more and more often than those who don’t. Voice is a social element that adds stickiness and retention to your games that you can’t get from anywhere else.” Voice chat users, he said, spent twice as much time playing as non-voice users and were five times more likely to still be playing after five weeks.
An Oxford Academic study from 2007 found that “voice chat leads to stronger bonds and deeper empathy than text chat.” As Subspace put it in 2021, “Voice deepens the immersive world, helps forge social bonds, and strengthens online play.”
Given the benefits and user expectations, it is no wonder that online games, social media platforms, and metaverses are providing voice chat. However, there is a downside.
We also know from research by Pew, the ADL, and others that voice chat is a primary source of toxic behavior and harassment in online games. The impact of these negative behaviors leads between a quarter and a third of users to reduce their gameplay or avoid specific games altogether.
The answer to this problem is, of course, moderation. But this is where the complications emerge. Voice chat is harder to moderate than text chat because you must first capture and transcribe the audio before you can apply text analysis tools. This is where companies make mistakes that undermine their key objectives.
There may always be a human element in voice chat moderation, but you simply cannot scale humans cost-effectively enough to monitor all chats or user-generated content at all times. One human-led approach is to record chats and review them only when a complaint is filed. This approach is time-intensive, provides no early warning system, introduces individual bias into the evaluation process, and hiring typically cannot keep up with usage growth.
Human-led moderation only seems to work at a very small scale with volunteer community monitoring. Smaller Discord servers are an example of this. Anything else requires automation of the moderation process.
Not all automated voice chat monitoring will give you the same results. The most common mistake with these solutions is poor transcription quality. Once you have transcribed the voice chat into text, you can rapidly and cost-effectively analyze the data for toxicity and harassment or validate complaints.
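In practice, the flow most teams follow is simple: transcribe first, then reuse the text moderation tooling they already run on text chat. Here is a minimal sketch of that pipeline in Python; `transcribe`, `toxicity_score`, and the flagging threshold are all hypothetical placeholders, not any specific vendor's API.

```python
FLAG_THRESHOLD = 0.8  # assumed tuning parameter, not a vendor default

def transcribe(audio_chunk: bytes) -> str:
    # Hypothetical ASR call: replace with your speech-to-text provider's SDK.
    return "example transcript"

def toxicity_score(text: str) -> float:
    # Hypothetical classifier: replace with the text moderation model
    # you already run on text chat.
    return 0.1

def moderate_voice_chat(audio_chunk: bytes) -> dict:
    # Step 1: capture and transcribe the voice data.
    transcript = transcribe(audio_chunk)
    # Step 2: apply the same text analysis tools used for text chat.
    score = toxicity_score(transcript)
    # Route only likely violations to human moderators.
    return {
        "transcript": transcript,
        "score": score,
        "flag_for_review": score >= FLAG_THRESHOLD,
    }

print(moderate_voice_chat(b"\x00\x01"))
```

The key design point is that everything after the transcription step is infrastructure you likely already have, which is why transcript quality becomes the weakest link.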
What happens when the transcriptions are poor? A poor transcription can produce errors of omission and errors of commission. Omission errors miss the bad behavior because the system misunderstands either the words or the context of the comments. Commission errors flag normal behavior as bad. Another way to describe these is false negatives and false positives.
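A toy example makes the distinction concrete; the message IDs and counts below are invented purely for illustration.

```python
# Ground truth from human review vs. what the automated system flagged.
actually_toxic = {"msg_1", "msg_2"}
flagged = {"msg_1", "msg_3"}

false_negatives = actually_toxic - flagged  # omission: toxic but missed
false_positives = flagged - actually_toxic  # commission: benign but flagged

print("missed (omission / false negatives):", false_negatives)          # {'msg_2'}
print("wrongly flagged (commission / false positives):", false_positives)  # {'msg_3'}
```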
Automatic speech recognition (ASR) is not easy to do well. Users often speak quickly, may have accents, and use industry- or company-specific jargon. The audio quality is not always good, there may be background noise, and conditions vary widely across users and environments. All of these factors make it hard to produce a high-quality transcript, which in turn undermines the automated analysis of the data.
These same challenges can lead to errors of commission, identifying problems where they don't exist. This is also a big issue. If you incorrectly flag a user or take action against a user without cause, it undermines their loyalty and often leads to negative comments both within and outside of the game or community. And inadvertently sending a benign comment to human moderators for review adds unnecessary cost.
Few ASR solutions are designed for all of these challenges. Producing high-fidelity transcripts and analyses often requires training the speech recognition model for a particular use case. Off-the-shelf cloud speech recognition models typically offer limited customization, and they come with one more big problem.
It is clear that human-led moderation practices don’t scale very well in terms of cost. The same is true for many cloud-based ASRs. One moderately sized metaverse with 50,000+ daily active users (DAUs) saw a bill of nearly $15,000 for one day of cloud-based voice chat transcription from a popular provider. And this pricing rate doesn’t decline significantly as scale rises.
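A back-of-envelope calculation shows how quickly this adds up. Only the roughly $15,000-per-day figure comes from the example above; the per-user talk time and per-minute rate below are assumptions chosen to illustrate how such a bill can arise.

```python
# Illustrative cost model. Only the ~$15,000/day total is from the article;
# minutes-per-user and price-per-minute are assumed values.
daily_active_users = 50_000
voice_minutes_per_user = 20       # assumption: voice chatters are heavy talkers
price_per_minute = 0.015          # assumption: typical cloud ASR list price, USD

daily_minutes = daily_active_users * voice_minutes_per_user
daily_cost = daily_minutes * price_per_minute
print(f"{daily_minutes:,} minutes/day -> ${daily_cost:,.0f}/day")  # 1,000,000 -> $15,000
print(f"annualized: ${daily_cost * 365:,.0f}")                     # ~$5.5M/year
```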
Relying solely on cloud-based transcription can become prohibitively expensive very quickly. People who use voice chat are active users, which means they talk a lot, and that means your transcription costs rise rapidly. One company we spoke with found that its efforts to “optimize” cost led to poorer results and other hidden expenses, so it abandoned moderation altogether. We know that is not a good idea. So, the question is: how do you provide high-quality automated moderation tools that minimize errors and cost while also providing a healthy voice chat experience?
Speechly offers solutions that run in the cloud, on device, or in a hybrid model. In addition, our accuracy is higher than that of leading ASR solutions such as Google's, even before custom model training, which we also offer. These elements combine to deliver a dramatic improvement in both cost and accuracy.
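One way to picture the hybrid model is as a per-session routing decision: keep the bulk of transcription on device to cap cloud spend, and fall back to the cloud only where local inference isn't possible. The sketch below is illustrative and does not reflect Speechly's actual API.

```python
def transcribe_on_device(audio: bytes) -> str:
    # Hypothetical on-device model: no per-minute cloud bill, lower latency.
    return "on-device transcript"

def transcribe_in_cloud(audio: bytes) -> str:
    # Hypothetical cloud model: fallback when local inference isn't available.
    return "cloud transcript"

def transcribe(audio: bytes, device_supports_local_asr: bool) -> str:
    # Route most traffic on device to control cost; use the cloud only for
    # devices that can't run the model locally.
    if device_supports_local_asr:
        return transcribe_on_device(audio)
    return transcribe_in_cloud(audio)
```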
Many games, social networks, and metaverses have learned hard lessons about text chat moderation. Our hope is that you can skip the equivalent lessons in voice chat moderation by avoiding these three common mistakes.
How did we learn about these? Earlier this year we met with over 20 professionals working in online gaming, the metaverse, and social media to learn about their challenges. Many were interested in using Speechly for other features, and it turned out to be an optimal solution for automating voice chat moderation. They also liked that we can feed data into their existing moderation software without duplicating or replacing those systems.
Let us know if you have run into these mistakes in the past or have any questions. You can reach us through our Contact Us form or try our API for free by signing up for an account.
Photo by George Becker from Pexels
Speechly is a YC-backed company building tools for speech recognition and natural language understanding. Speechly offers flexible deployment options (cloud, on-premise, and on-device), highly accurate custom models for any domain, and the privacy and scalability to handle hundreds of thousands of hours of audio.