Hannes Heikinheimo
Sep 19, 2023
1 min read
For the past 5 years at Speechly, we have been researching and developing tools to easily add Fast, Accurate, and Simple Voice User Interfaces (Voice UIs) to Mobile, Web, and Ecommerce experiences. In this article, we’ll introduce the concepts and guidelines we’ve found effective in creating Multi-Modal Voice experiences that enable users to complete tasks efficiently and effectively.
At Speechly, we approach Voice as an Interface. We believe Voice UIs should blend alongside existing modalities - like typing, tapping, and swiping - and take advantage of a visual display for providing real-time feedback to the user. As a result, a Speechly-powered website or app can be controlled with both the Voice UI and the Graphical User Interface (GUI), allowing the user to choose the best input method for the occasion. You can also think of a Voice UI as a controller for app actions, which makes it retrofittable to an existing application.
We contrast the “Speechly Model” with the popular “Voice Assistant Model” for Voice UIs seen in products like Apple’s Siri, Google’s Assistant, and Amazon’s Alexa. All of these experiences are conversational in nature, optimized for hands-free use with voice, and they overlook the best uses of a Voice UI in a Multi-Modal context.
Voice Assistants are digital assistants built for “Conversational Experiences” - where the user speaks a Voice Command and the system typically utters back a Voice Response. Certain hands-free scenarios can be a good fit for the Voice Assistant model, such as IVR within Contact Centers, but it is not the best model when the user has access to a screen.
Instead of back-and-forth “Conversational Experiences”, Multi-Modal voice experiences should be based on real-time visual feedback. As the user speaks, the user interface should update instantly.
When humans talk with each other, we do more than transmit information by using words. We use different tones and emotions to give different meanings to our words depending on the context of a situation. This is very human-like, but not the way we want to communicate with a computer.
With a Multi-Modal Voice UI, speech has only one function: Command and Control the system to do what the user wants. Make it clear that the user is talking with a computer; don’t try to imitate a human. In most cases, the application should not answer in natural language. It should react by updating the user interface, just like when clicking a button or making a search.
An issue commonly described by users of Voice UIs is uncertainty about which commands are supported. Within the Voice Assistant context, this arises from the mission of General Voice Assistant platforms to create an all-knowing Assistant.
Understanding the supported functionality is less of a problem with traditional GUIs. Placing a button in the user’s shopping cart that reads “Proceed to Checkout” is a very strong signal that checkout is supported and that pressing the button will indeed take the user to the checkout process. This aspect is missing from Voice-Only solutions and is a strong benefit of Multi-Modal Voice UIs.
Good design is about providing the user with the easiest tools for completing a task.
Voice works great for use cases such as Voice Search – “Show me the nearest seafood restaurants with three or more stars”, Voice Input – “Add milk, bread, chicken and potatoes”, and Voice Command & Control - “Show sports news” or “Turn off all lights except the bedroom”.
On the other hand, touch is often the better option for quickly selecting from a couple of options.
There’s no need to replace your current user interface with an Assistant-based Voice UI. A Multi-Modal Voice UI should blend in as a UI feature alongside existing modalities like typing, tapping, or swiping. Instead, evaluate which tasks in your application are the most tedious today and the easiest to complete by voice.
When a user sees a Voice UI for the first time, they will need some guidance on how to use it.
Guidance tips should be placed close to where the visual feedback will appear. You can hide the tips after the user has tried the Voice UI.
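As a rough sketch of how this could look on the web - the element, storage key, and function names below are illustrative, not part of any particular SDK - the tips stay visible until the user’s first voice interaction has been recorded:

```typescript
// Minimal sketch: show onboarding tips until the user has tried the Voice UI once.
// The element and storage key are illustrative, not tied to any particular SDK.
const TIPS_SEEN_KEY = "voiceTipsSeen";

function maybeShowVoiceTips(tipsElement: HTMLElement): void {
  // Hide the tips if the user has already completed a voice interaction.
  tipsElement.hidden = localStorage.getItem(TIPS_SEEN_KEY) === "true";
}

function onFirstVoiceUse(tipsElement: HTMLElement): void {
  // Call this once the user completes their first voice interaction.
  localStorage.setItem(TIPS_SEEN_KEY, "true");
  tipsElement.hidden = true;
}
```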
While voice assistants use a wake word so they can be activated from a distance, your mobile or desktop application doesn’t need one. The hands-free scenario is less relevant than you might initially think, as the user is already holding or within close proximity to a device. Wake Words also carry inherent privacy risks that are avoided altogether.
Push-to-Talk (a button on screen or a physical key/button on the device) is the best way to operate the microphone in an application with a Multi-Modal Voice UI. When the user is required to press a button while talking, it’s completely clear when the application is listening. This also decreases latency by making endpointing explicit, eliminating both endpoint false positives (the system stops listening prematurely) and false negatives (the system does not finalize the request after the user has finished the command).
On desktop, you can use the spacebar to activate the microphone.
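A minimal sketch of spacebar Push-to-Talk, assuming startListening and stopListening are provided by whatever speech backend you use (the Web Speech API, a vendor SDK, or your own streaming client):

```typescript
// Sketch of desktop Push-to-Talk: hold the spacebar to listen, release to stop.
// startListening/stopListening stand in for your speech backend of choice.
function bindSpacebarPushToTalk(
  startListening: () => void,
  stopListening: () => void
): void {
  let listening = false;

  window.addEventListener("keydown", (event) => {
    // Ignore key repeats, and don't hijack the spacebar while the user is typing.
    if (event.code !== "Space" || event.repeat) return;
    const target = event.target;
    if (target instanceof HTMLInputElement || target instanceof HTMLTextAreaElement) return;
    event.preventDefault();
    listening = true;
    startListening();
  });

  window.addEventListener("keyup", (event) => {
    if (event.code !== "Space" || !listening) return;
    listening = false;
    // Releasing the key is an explicit endpoint: no premature or missed cut-offs.
    stopListening();
  });
}
```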
You can also add a slide gesture to lock the microphone open for a longer period of time. WhatsApp’s voice messaging has a good implementation of this design.
To make sure the user knows that the application is listening, signal clearly when the microphone button is pushed down. This is especially important when using the Push-to-Talk pattern.
You can use sound, animation, tactile feedback (vibration), or a combination of these to signal the activation. On a handheld touch-screen device, make sure the activated microphone state stays visible even when the user’s thumb covers the button during Push-to-Talk.
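A short sketch of this kind of activation feedback, where the class name, element, and haptic pattern are illustrative choices rather than a prescribed implementation:

```typescript
// Sketch of multi-channel "listening" feedback: a CSS state class drives the
// visual animation, and navigator.vibrate adds a tactile pulse where supported.
function setMicActive(micButton: HTMLElement, active: boolean): void {
  micButton.classList.toggle("mic-active", active); // Style the active state in CSS.
  if (active && "vibrate" in navigator) {
    navigator.vibrate(30); // Brief pulse when listening starts.
  }
}
```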
Non-interruptive modalities include haptic, non-linguistic auditory, and, perhaps most importantly, visual feedback. Using these modalities, the application can react fast without interrupting the user. For instance, in the case of “I’m interested in t-shirts,” the UI would swiftly show the most popular t-shirt products, instantly enabling the user to continue with a refining utterance such as “do you have Boss.” This narrows down the displayed products to show only the Boss-branded t-shirts.
Using a voice response, on the other hand, makes this experience complicated for the user, as any ongoing user utterance will be abruptly interrupted. Voice Response is also a slow channel for transmitting information, and for returning users, hearing the same speech synthesis over and over can lead to a worse user experience.
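To make the t-shirt example concrete, here is a minimal sketch of how successive utterances could refine one shared filter state with purely visual feedback; the Filters shape, the entity values, and renderProducts are hypothetical placeholders, with entity extraction assumed to come from your NLU:

```typescript
// Sketch: each utterance merges new entities into a shared filter state and the
// product list re-renders immediately, with no spoken response.
interface Filters {
  category?: string;
  brand?: string;
}

let filters: Filters = {};

function applyVoiceEntities(
  entities: Partial<Filters>,
  renderProducts: (f: Filters) => void
): void {
  // "I'm interested in t-shirts" -> { category: "t-shirts" }
  // "do you have Boss"           -> { brand: "Boss" } merged on top, category kept
  filters = { ...filters, ...entities };
  renderProducts(filters); // Visual feedback instead of a voice response.
}
```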
One important part of user experience is the perceived responsiveness of the application. Designers use tricks such as lazy loading, background processing, visual illusions, and preloading of content to make their applications seem faster, and the same should be done with Voice UIs.
In Voice-Enabled applications, an immediate UI reaction is even more important. It encourages the user to use longer utterances and to continue the voice experience. In case of an error, it enables the user to recover quickly.
When using voice effectively, the user can control the UI an order of magnitude faster than by tapping and clicking. This means there can be a lot of visual activity happening in the UI. It is important that the user can keep up with these UI reactions and understand the feedback.
Typically, UI reactions manifest as visual cues, micro-animations, and transitions. The human visual cognition system has an instinctive inclination to move visual focus to where movement is happening.
It is therefore an antipattern to scatter UI reactions all over the user’s visual field, e.g. a streaming transcription animation at the top of the screen and other UI reactions at the bottom. This results in the user’s gaze bouncing back and forth across the screen, making it nearly impossible to understand what is happening in the UI.
For this reason, either centralize all visual UI reactions near one focal point, so that both the transcript and the visual transitions resulting from Voice Commands are shown very close to each other, or steer the user’s gaze linearly across the screen with a cascade of animations running either top to bottom or left to right.
Also, while a Voice UI needs to be as close to real-time as possible, you need to minimize flicker and visual unrest. You can use placeholder images and elements to make sure the application looks smooth and reacts fast.
A text transcription of the user’s voice input is the most important piece of feedback in case of an error. Lack of action tells the user their input was not correctly understood, but when the Speech Recognition makes an error, the transcript helps them quickly understand what went wrong.
Transcripts are also valuable when everything goes right. They tell the user they are being understood and encourage them to continue with longer utterances. If you are using Speechly, you can use the tentative transcript to minimize feedback latency.
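As one illustration of the pattern - shown here with the browser’s Web Speech API rather than any particular vendor SDK - tentative (interim) results can be rendered as soon as they arrive; any engine that emits tentative and final transcripts can be wired up the same way:

```typescript
// Sketch of rendering a live transcript, including tentative (interim) results,
// using the browser's Web Speech API as a stand-in speech engine.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

function attachLiveTranscript(transcriptElement: HTMLElement) {
  const recognition = new SpeechRecognitionImpl();
  recognition.interimResults = true; // Emit tentative results while the user is still speaking.
  recognition.continuous = true;

  recognition.onresult = (event: any) => {
    let text = "";
    for (let i = 0; i < event.results.length; i++) {
      text += event.results[i][0].transcript;
    }
    // Render immediately, even for tentative results, to minimize feedback latency.
    transcriptElement.textContent = text;
  };

  return recognition; // Call .start() and .stop() from your Push-to-Talk handlers.
}
```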
Natural Language Understanding is hard for many reasons. In addition to the Speech Recognition failing, the user can hesitate or mix up their words. This can lead to errors, just like a misclick can lead to errors with a GUI.
While there are multiple ways to reduce the number of errors, the most important thing is to offer the user an opportunity to correct themselves quickly. Produce the best guess for the correct action as quickly as possible and let the user refine that selection by either voice or touch.
When users give long Voice Commands, they are more likely to make an error in their speech. This is not a problem if they get real-time feedback and can correct themselves naturally.
Multimodality lets users correct themselves through the GUI, but make sure to include an intent for verbal corrections as well. This makes it possible for users to say something like “Show me green, sorry I mean red t-shirts” without “failure”.
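One simple way to support such corrections, sketched here with a hypothetical entity format, is to let the latest entity for each slot overwrite earlier mentions within the same utterance:

```typescript
// Sketch: resolve a spoken self-correction such as
// "Show me green, sorry I mean red t-shirts" by letting the most recent entity
// for each slot win. The Entity shape is hypothetical.
interface Entity {
  type: string;  // e.g. "color"
  value: string; // e.g. "green", then "red"
}

function resolveCorrections(entities: Entity[]): Record<string, string> {
  const resolved: Record<string, string> = {};
  for (const entity of entities) {
    resolved[entity.type] = entity.value; // Later mentions overwrite earlier ones.
  }
  return resolved;
}

// resolveCorrections([{ type: "color", value: "green" }, { type: "color", value: "red" }])
// -> { color: "red" }
```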
Another way to make corrections is with touch or click. These corrections work best when the user is offered a short list of viable options based on what they have said or done earlier.
If your user is filling out a form by voice, for example, they might only need to correct one field. It can be most intuitive to tap that field and make the correction by touch. Make sure you support both ways of making corrections!
The big issue with voice assistants is that they are hard to use by touch. While voice is a great UI for many use cases, sometimes it’s not feasible. This is why all features in your application should be usable with both voice and touch. For example, you can use traditional search filtering with dropdown menus and include a microphone for using the filters by voice. This enables users to choose the modality that is best for the task at hand.
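A minimal sketch of this dual binding - with a hypothetical dropdown element, filter keys, and search function - funnels both modalities into the same state and the same code path:

```typescript
// Sketch: both the GUI (a dropdown) and the Voice UI update the same filter state
// and trigger the same search, so the user can freely mix modalities.
type FilterState = Record<string, string>;

const state: FilterState = {};

function runSearch(filters: FilterState): void {
  // Placeholder: in a real app this would query products and re-render the results.
  console.log("searching with", filters);
}

function updateFilter(key: string, value: string): void {
  state[key] = value;
  runSearch(state); // One code path regardless of input modality.
}

// GUI path: a traditional dropdown change.
document.querySelector<HTMLSelectElement>("#size-select")?.addEventListener("change", (e) => {
  updateFilter("size", (e.target as HTMLSelectElement).value);
});

// Voice path: an entity emitted by the speech backend.
function onVoiceEntity(type: string, value: string): void {
  updateFilter(type, value);
}
```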
Originally published November 27, 2020, updated November 10, 2021
Speechly is a YC-backed company building tools for speech recognition and natural language understanding. Speechly offers flexible deployment options (cloud, on-premise, and on-device), super-accurate custom models for any domain, privacy, and scalability for hundreds of thousands of hours of audio.