Improving Intelligibility with Spatial Audio
Jayson DeLancey

We strive to close the gap between the experience of face-to-face and online communications. Bringing years of research in human perception to bear on the human experience of connection while physically separated is at the heart of what we do. At a recent technical conference, Chief Architect Paul Boustead shared insights from voice communications research on how spatial rendering can improve the intelligibility of video conferencing.

Nuisances and Affirmations

A nuisance is any audio perceived in a communications context that is unwanted and distracting. A police siren, the hum of a fan or air conditioner, and the barking of a dog are all examples of a nuisance. To reduce background noise like this, many media servers take the simple approach of sending only the few loudest audio streams detected.

This approach has the unwanted side effect of removing affirmations. An affirmation is the backchannel sound that lets others know how people are reacting. If you tell somebody a joke, you hope for the laughter to be contagious, so that when one person laughs others join in. Without the affirmation of a laugh, an “um”, or an “uh-huh”, the speaker may wonder what’s going on in the minds of those participating.
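To make the trade-off concrete, here is a minimal sketch of the "loudest few streams" approach described above. The stream names, the dBFS levels, and the `select_loudest` helper are all hypothetical and chosen for illustration; this is not how any particular media server is implemented.

```python
from typing import Dict, List


def select_loudest(levels_db: Dict[str, float], max_streams: int = 3) -> List[str]:
    """Return the ids of the `max_streams` loudest streams, loudest first."""
    ranked = sorted(levels_db, key=lambda stream_id: levels_db[stream_id], reverse=True)
    return ranked[:max_streams]


# Hypothetical per-participant levels (dBFS) for a single audio frame.
frame_levels = {
    "presenter": -18.0,
    "barking_dog": -20.0,
    "keyboard": -25.0,
    "quiet_laugh": -38.0,
    "uh_huh": -42.0,
}

forwarded = select_loudest(frame_levels, max_streams=3)
dropped = sorted(set(frame_levels) - set(forwarded))
print("forwarded:", forwarded)
print("dropped:", dropped)
```

In this made-up frame the loud nuisances are forwarded while the quiet laugh and the "uh-huh" are dropped, which is the side effect the paragraph above describes: ranking by loudness alone cannot tell a nuisance from an affirmation.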

In a 2001 study, Shriberg et al. (1) found that in business meetings with 4-8 participants, the majority of the conversation is made up of overlapping talk spurts. Of these, a significant share consists of verbal affirmations.

By contrast, in gaming communications such as MMOs, players rarely speak: less than 5% of the time. When they do speak, though, it is often overlapping speech during game action with concurrent talkers. This is one of the original use cases that led to the work done at Dolby to render audio spatially based on the location of players in a virtual world.

Spatial Release from Masking

In the real world, we handle overlapping audio well. Though our ears pick up the linear sum of all the sounds we’re hearing at any given time, our brain can pull those sounds apart to concentrate and listen. This ability to concentrate on a voice and filter out other sounds is referred to as Spatial Release from Masking.

Through auditory scene analysis, we’re able to separate voices from other sounds. Because a sound reaches each ear at a slightly different time and volume, our brain can use these inter-aural differences to tell sources apart and to recognize key words and phrases. The cocktail party effect of hearing our name from across the room is an example of this. Human perception is adept at understanding overlapping audio.
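For a rough sense of scale, the inter-aural time difference can be estimated with the textbook spherical-head (Woodworth) approximation, ITD = (r / c)(θ + sin θ). This is a sketch under stated assumptions: the formula, the 8.75 cm head radius, and the 343 m/s speed of sound are standard illustrative values and are not taken from Dolby's model.

```python
import math

HEAD_RADIUS_M = 0.0875       # assumed average head radius
SPEED_OF_SOUND_M_S = 343.0   # assumed speed of sound in air at room temperature


def itd_seconds(azimuth_deg: float) -> float:
    """Approximate inter-aural time difference for a source at the given
    azimuth (0 = straight ahead, 90 = directly to one side), using the
    Woodworth spherical-head formula. Valid for 0 to 90 degrees."""
    if not 0.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth_deg must be between 0 and 90")
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (theta + math.sin(theta))


for angle in (0, 15, 45, 90):
    print(f"{angle:>2} degrees: {itd_seconds(angle) * 1e6:.0f} microseconds")
```

Under these assumptions a source 15 degrees off center already arrives at the two ears roughly 130 microseconds apart, and a source directly to one side arrives about 650 microseconds apart.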

Spatial Rendering

Dolby’s Head Related Transfer Function (HRTF) models how a sound from a particular location would sound when it reaches your ear canal, taking into account the shape of the head and ears, reverb in the room, and so on. The implementation uses an accurate ML-based Voice Activity Detector (VAD) and renders every stream detected as speech, for a good experience with headphones.

This incorporates key findings from the research. For example, the larger the angular separation between talkers, the better, although even 15 degrees is sufficient. Talkers can be laid out automatically, placing the first in front and adding the others alternately to the left and right. Most people find somebody talking from behind disconcerting.
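The layout rule above can be sketched in a few lines. This is an illustrative sketch, not the SDK's actual algorithm: the `layout_azimuths` name, the choice to fill the left side first, the convention that negative angles mean left, and the shrinking of the spacing to keep everyone in the front half are all assumptions.

```python
from typing import List


def layout_azimuths(num_talkers: int,
                    separation_deg: float = 15.0,
                    max_azimuth_deg: float = 90.0) -> List[float]:
    """Assign an azimuth to each talker: the first straight ahead (0),
    the rest alternating left (negative) and right (positive).

    If the requested separation would push talkers past max_azimuth_deg
    (that is, behind the listener), the spacing is reduced to fit."""
    if num_talkers <= 0:
        return []
    per_side = num_talkers // 2  # most talkers that land on one side
    step = separation_deg
    if per_side * step > max_azimuth_deg:
        step = max_azimuth_deg / per_side
    azimuths = [0.0]
    for i in range(1, num_talkers):
        rank = (i + 1) // 2                   # 1, 1, 2, 2, 3, 3, ...
        sign = -1.0 if i % 2 == 1 else 1.0    # odd index -> left, even -> right
        azimuths.append(sign * rank * step)
    return azimuths


print(layout_azimuths(5))  # [0.0, -15.0, 15.0, -30.0, 30.0]
```

With the default 15-degree spacing, up to 13 talkers fit in the front half before the sketch starts narrowing the gaps; a real renderer might instead cap the number of rendered talkers.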


By incorporating conferencing directly into your applications with the Interactivity SDK, you benefit from many years of psychoacoustic research deployed at scale with large video conferencing providers. Rendering voices spatially allows us to better understand natural communication, including overlapping speech and sounds, for the best end-user experience.

  1. Shriberg, Elizabeth, et al. “Observations on overlap: findings and implications for automatic processing of multi-party conversation.” INTERSPEECH, 2001.