Improving Audio Quality

Audio quality directly shapes how media is experienced. It has the power to make an audience feel immersed in the content or to distract from understanding the message. This tutorial explores some of the common concepts and metrics associated with audio quality to explain how best to make use of the Dolby.io Media Processing APIs.

Noise

Noise is unwanted, unpleasant, loud, or disruptive audio that is not part of the creative intent when audio is captured. These sounds frequently have no structure and disrupt the perception of media content.

It is common to categorize noise as:

  • stationary or static noise
  • non-stationary noise

Stationary noise is often characterized by a steady, low volume background noise. This white noise can come from heating systems, air conditioners, fans circulating air or running in nearby computer equipment, or even the recording equipment itself such as a microphone hiss or electric hum of a power-line frequency.

This type of noise can be detected by analyzing sound characteristics such as variations within frequency bands over time. Various digital signal processing techniques exist to remove or subtract this type of noise.
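The band-variance idea can be sketched in a few lines. This is a minimal illustration, not the DSP used by the Dolby.io APIs: it measures how much each frequency band's energy fluctuates over time, which stays low for stationary noise and spikes for transient sounds.

```python
import numpy as np

def band_energy_variance(samples, rate, frame_len=1024, bands=8):
    """Estimate per-band energy variance over time; stationary noise
    shows low variance, transient sounds show high variance."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    # Group FFT bins into coarse frequency bands and track their energy.
    band_edges = np.linspace(0, spectra.shape[1], bands + 1, dtype=int)
    energies = np.stack(
        [spectra[:, a:b].sum(axis=1) for a, b in zip(band_edges, band_edges[1:])],
        axis=1,
    )
    # Variance of each band's energy across frames, normalized by its mean.
    return energies.var(axis=0) / (energies.mean(axis=0) ** 2 + 1e-12)

rate = 16000
t = np.arange(rate) / rate
steady_hum = 0.1 * np.sin(2 * np.pi * 60 * t)        # stationary hum
burst = np.where((t > 0.5) & (t < 0.52), 0.8, 0.0)   # non-stationary click
print(band_energy_variance(steady_hum, rate)[0] < band_energy_variance(burst, rate)[0])
```

A stationary hum yields near-constant band energy (low normalized variance), while a brief click concentrates its energy in a few frames (high variance), so the comparison prints True.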

Non-stationary noise is often irregular and appears intermittently or at cyclical points in media. Examples of this type of noise are more varied:

  • a dog barking
  • birds chirping
  • keyboard clicks
  • an ambulance driving by
  • a book falling off a desk
  • speech such as babble, hubbub, or a cacophony of voices
  • throbbing or vibration sounds that are cyclical

All of these non-stationary sounds are considered unwanted noise too, but they are harder to distinguish from the desired signal. One approach inverts the stationary case: instead of detecting the noise, machine learning algorithms elevate the desired sounds by isolating speech in certain types of media.

Measuring Noise with Analyze API

How much noise is present in a media file? When building a quality control application with user-submitted content, or when processing a large collection of audio, it can be useful to use the Media Processing Analyze API to learn which files have more noise than others. The Analyze Quick Start provides steps you can follow for analyzing your own media.

When inspecting the response for noise data, there are a few data points of interest.

    "result": {
        "audio": {
            ...,
            "noise": {
                "level_average": -64.72,
                "snr_average": 41.99
            },
            ...
        }
    }

The level_average value gives the average level of stationary noise (in dBFS) measured across the length of the file. The snr_average value gives the stationary noise measured as a signal-to-noise ratio (in dB). Generally, a high SNR is the mark of a healthy recording; a low SNR indicates a problematic amount of noise relative to the signal.
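A response like the one above can be screened programmatically. The flag_noisy helper and the 20 dB cutoff below are illustrative assumptions for a quality control pass, not a Dolby.io recommendation.

```python
def flag_noisy(analyze_result, min_snr_db=20.0):
    """Return True when the file's average SNR falls below a chosen
    threshold (20 dB is an illustrative cutoff, not an official one)."""
    noise = analyze_result["result"]["audio"]["noise"]
    return noise["snr_average"] < min_snr_db

# Structure mirrors the Analyze response shown above.
response = {
    "result": {
        "audio": {"noise": {"level_average": -64.72, "snr_average": 41.99}}
    }
}
print(flag_noisy(response))  # 41.99 dB is comfortably above the 20 dB cutoff
```

In a batch workflow, files that trip the flag could be routed to the Enhance API or queued for manual review.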

Reducing Noise with Enhance API

If you have unwanted noise in your media, how do you get rid of it? The Media Processing Enhance API can help with an intelligent approach to noise management. There are two elements to the algorithm:

  • noise reduction
  • speech isolation

With noise reduction, stationary background noise is suppressed. With speech isolation, speech is made louder relative to non-speech sounds. The presence of dialog is brought forward to make it more pronounced. In the Enhancing Media Quick Start there are steps for processing your own media. The default parameters try to balance noise reduction and speech isolation while also identifying other prominent impurities.

By default, a medium amount of noise reduction is applied automatically based on the level of noise relative to the signal. If you don't like the result you can adjust the noise reduction amount parameter to a fixed setting. For example, you can apply very aggressive stationary noise reduction by setting the amount to max. You can also dial it back if you discover artifacts, ghosting, or constrained vocal tones.

{
  "input": "s3://dolbyio/public/shelby/airplane.original.mp4",
  "output": "dlb://out/airplane.max-noise-reduction.mp4",
  "content": {
    "type": "mobile_phone"
  },
  "audio": {
    "noise": {
      "reduction": {
        "amount": "max"
      }
    }
  }
}
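A request body like the one above can also be assembled programmatically, which is handy when enhancing many files. The enhance_job helper below is hypothetical and only builds the JSON payload; the actual HTTP submission (endpoint URL, API-key header) is omitted, so consult the Enhance API reference for those details.

```python
import json

def enhance_job(input_url, output_url, content_type, noise_amount="max"):
    """Assemble an Enhance request body with a fixed noise reduction amount."""
    return {
        "input": input_url,
        "output": output_url,
        "content": {"type": content_type},
        "audio": {"noise": {"reduction": {"amount": noise_amount}}},
    }

job = enhance_job(
    "s3://dolbyio/public/shelby/airplane.original.mp4",
    "dlb://out/airplane.max-noise-reduction.mp4",
    "mobile_phone",
)
print(json.dumps(job, indent=2))
# This body would then be POSTed to the Enhance endpoint with your API key.
```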

In some cases you may want speech to be amplified and all other sounds reduced, but without impacting the vocal performance. In this next example, several parameters are disabled to show the speech isolation with minimal other processing. Using speech isolation can have unintended effects such as attenuating non-speech sounds like music, so you may adjust the value accordingly.

{
  "input": "s3://dolbyio/public/shelby/airplane.original.mp4",
  "output": "dlb://out/airplane.max-speech-isolation.mp4",
  "content": {
    "type": "voice_recording"
  },
  "audio": {
    "loudness": { "enable": false },
    "dynamics": { "range_control": { "enable": false } },
    "filter": { "high_pass": { "enable": false } },
    "noise": { "reduction": { "enable": true } },
    "speech": {
      "sibilance": { "reduction": { "enable": false } },
      "isolation": { "enable": true, "amount": 100 }
    }
  }
}

Clipping

Clipping is a waveform distortion that occurs when the signal exceeds the maximum value the recording can represent and is cut off unnaturally. It is this "clipping" of the top or bottom of a waveform that introduces an undesirable and unnatural sound. The Media Processing Analyze API identifies these quality-reducing events so they can be addressed, either by re-recording or by precise adjustment to minimize their sonic effect.

The Analyzing Media Quick Start provides step-by-step instructions for how you can process your own media. When inspecting the response you may see results like this:

    "result": {
        "audio": {
            "bandwidth": 10687,
            "clipping": {
                "num_sections": 1,
                "sections": [
                  {
                    "section_id": "cl_1",
                    "start": 31.54,
                    "duration": 0.12000000000000455,
                    "channels": [
                        "ch_0"
                    ]
                  }
                ]
            }
        }
    }

The bandwidth metric indicates a file's audio frequency range. This can identify files with reduced high-frequency content, such as incorrectly transcoded audio or perhaps the wrong source file used as part of a downmix process. As part of a quality control process, this can help identify whether a piece of media is suitable to be used in a particular use case.

The number of clipping events is a common smoke test when inspecting audio quality. The rationale for reporting num_sections is to indicate how prevalent clipping is within a file. Files will often still be acceptable with only a few clipping events, but if there are tens or hundreds of them there may be a problem.

Inspecting the first section of clipping is useful for determining whether a clapperboard was recorded at the beginning of a take. The list of sections helps identify clipping events quickly to review whether they are intentional or whether additional processing is needed to remove them.
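To build intuition for what a clipping section is, the sketch below scans raw samples for runs of consecutive full-scale values. The threshold and minimum run length are illustrative choices, not the Analyze API's actual detector.

```python
import numpy as np

def clipped_sections(samples, rate, threshold=0.999, min_run=3):
    """Find runs of consecutive samples at (or beyond) full scale.
    Returns (start_seconds, duration_seconds) per run."""
    hot = np.abs(samples) >= threshold
    sections, run_start = [], None
    for i, flag in enumerate(hot):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start >= min_run:
                sections.append((run_start / rate, (i - run_start) / rate))
            run_start = None
    if run_start is not None and len(hot) - run_start >= min_run:
        sections.append((run_start / rate, (len(hot) - run_start) / rate))
    return sections

rate = 1000
# 0.5 s of silence, 0.12 s pinned at full scale, then silence again.
sig = np.concatenate([np.zeros(500), np.ones(120), np.zeros(380)])
print(clipped_sections(sig, rate))  # → [(0.5, 0.12)]
```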

Silence

Silence is the absence of content such as music or speech. Some silence is natural in speech but too much silence may indicate a problem where the intended signal was lost. For many applications, it may be desirable to remove silence entirely. Use of the Media Processing Analyze API can help identify how much of the total file is silent and the number of occurrences or sections with silence.

The definition of silence can be controlled using the threshold and duration configurations. The minimum value for duration is 0.5 seconds, so the Analyze API identifies silent sections that are at least 0.5 seconds long.
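The threshold and duration semantics can be illustrated with a toy detector. This is not the Analyze API's implementation: it flags 50 ms frames whose RMS level falls below a dBFS threshold and reports runs lasting at least min_duration seconds, mirroring the 0.5 second floor.

```python
import numpy as np

def silent_sections(samples, rate, threshold_db=-60.0, min_duration=0.5):
    """Report (start_seconds, duration_seconds) for stretches whose
    short-term level stays below threshold_db (dBFS)."""
    frame = int(0.05 * rate)  # 50 ms analysis frames
    n = len(samples) // frame
    rms = np.sqrt((samples[: n * frame].reshape(n, frame) ** 2).mean(axis=1))
    level_db = 20 * np.log10(rms + 1e-12)
    quiet = level_db < threshold_db
    sections, start = [], None
    for i, q in enumerate(list(quiet) + [False]):  # sentinel closes a final run
        if q and start is None:
            start = i
        elif not q and start is not None:
            dur = (i - start) * frame / rate
            if dur >= min_duration:
                sections.append((start * frame / rate, dur))
            start = None
    return sections

rate = 8000
rng = np.random.default_rng(0)
# One second of noise followed by one second of silence.
audio = np.concatenate([rng.uniform(-0.5, 0.5, rate), np.zeros(rate)])
print(silent_sections(audio, rate))  # one 1 s silent section starting at 1.0 s
```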

Measuring Silence with Analyze API

The Analyzing Media Quick Start provides step-by-step instructions for how you can run your own media. The response might look like this:

    "result": {
        "audio": {
            "silence": {
                "percentage": 5.3,
                "num_sections": 2,
                "sections": [
                    {
                        "section_id": "si_1",
                        "start": 3.08,
                        "duration": 2.08,
                        "channels": [
                            0
                        ]
                    },
                    {
                        "section_id": "si_2",
                        "start": 8.3,
                        "duration": 3.2,
                        "channels": [
                            0,
                            1
                        ]
                    }
                ]
            }
        }
    }

Dynamics

Media content can have dialog with uneven talker volumes depending on factors such as how loud an individual speaks or their positioning relative to a recording device. These dynamics can result in an unbalanced listening experience.

A leveling algorithm can identify speech sections and apply a time-varying amplification or attenuation as needed so that speech levels are brought closer together within a desired dynamic range. This means the soft-spoken person across the room and the booming voice close to a microphone can both be fixed. Some talkers may be inconsistent or very dynamic when narrating, such that there are fluctuations in their volume. This can happen when a talker is in motion or changes position relative to a microphone. The leveling algorithm can smooth this out in a natural and pleasant way.
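A much-simplified picture of leveling: apply a per-frame gain that pulls each frame's RMS level toward a target, capped so silence is not amplified. A real leveler also detects speech and smooths gains over time; everything below, including the target and gain limits, is illustrative.

```python
import numpy as np

def level_speech(samples, rate, target_db=-20.0, frame_s=0.4, max_gain_db=12.0):
    """Per-frame gain toward a target RMS level, capped to avoid
    amplifying silence; a crude stand-in for a real leveler."""
    frame = int(frame_s * rate)
    out = samples.copy()
    for start in range(0, len(samples) - frame + 1, frame):
        chunk = samples[start : start + frame]
        rms = np.sqrt((chunk ** 2).mean())
        if rms < 1e-4:  # skip near-silent frames
            continue
        gain_db = np.clip(target_db - 20 * np.log10(rms), -max_gain_db, max_gain_db)
        out[start : start + frame] = chunk * 10 ** (gain_db / 20)
    return out

rate = 8000
t = np.arange(rate) / rate
quiet_talker = 0.02 * np.sin(2 * np.pi * 200 * t)  # far from the mic
loud_talker = 0.6 * np.sin(2 * np.pi * 200 * t)    # close to the mic
leveled = level_speech(np.concatenate([quiet_talker, loud_talker]), rate)
```

After leveling, the two talkers' levels sit much closer together than the original 30:1 amplitude gap.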

An equalization or dynamic eq algorithm can analyze the spectral profile relative to a target such as professionally recorded speech. Adjustments can be made dynamically to filter and apply this equalization so that inputs resemble a typical profile for that type of media. This can help compensate for recordings made with home or consumer devices, like mobile phones or low-quality off-the-shelf microphones, and for recordings made in less optimal environments where the room affects the overall balance of the sound.

Dynamic Range Control with Enhance API

The Media Processing Enhance API applies dynamic leveling and dynamic eq to help produce better, more professional sounding recordings. The Enhancing Media Quick Start provides step-by-step instructions for processing your own media. You can tune your results by changing the amount of range control being applied. Levels for each speaker are measured using the short-term loudness as defined in EBU R 128. The target range is typically between 6 dB and 9 dB, with speech frames given more subtle gains. The amount can be changed to further constrain or maximize the range depending on the type of material you have or the sound desired.

{
  "input": "s3://dolbyio/public/shelby/airplane.original.mp4",
  "output": "dlb://out/airplane.max-dynamic-range.mp4",
  "content": {
    "type": "meeting"
  },
  "audio": {
    "dynamics": {
      "range_control": {
        "enable": true,
        "amount": "max"
      }
    }
  }
}

Disabling Dynamic EQ

In some cases such as with pre-processed audio or professional recording equipment the dynamic eq adjustments may not have the desired outcome. This may be the case with musical performances for example. In those use cases, the dynamic eq processing can be disabled.

{
  "input": "s3://dolbyio/public/shelby/indoors.original.mp4",
  "output": "dlb://out/indoors.no-dynamic-eq.mp4",
  "content": {
    "type": "music"
  },
  "audio": {
    "filter": { "dynamic_eq": { "enable": false } }
  }
}

Sibilance

Sibilance is the harsh consonant sound of letters like "s", "sh", "x", "ch", "t", and "th" that originates from a talker's pronunciation of words. A sibilance reduction algorithm detects these sounds by analyzing frequency regions for onsets of energy. When identified, these sounds can be attenuated to create audio closer to a studio-recorded sound, compensating for non-professional recording equipment. Sibilance is typically found in an upper frequency range (5 kHz to 8 kHz) but varies with the specific vocal range of the talker. It is a speech artifact, in contrast to plosives, which occur in lower frequency ranges.
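The band-energy idea behind sibilance detection can be sketched as follows. This is a crude indicator, not the Enhance API's detector: it measures the fraction of spectral energy falling in the 5 kHz to 8 kHz band.

```python
import numpy as np

def sibilance_ratio(samples, rate, band=(5000.0, 8000.0), frame_len=2048):
    """Fraction of total spectral energy inside the sibilance band."""
    n = len(samples) // frame_len
    frames = samples[: n * frame_len].reshape(n, frame_len)
    spectra = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / rate)
    in_band = (freqs >= band[0]) & (freqs < band[1])
    return spectra[:, in_band].sum() / (spectra.sum() + 1e-12)

rate = 44100
t = np.arange(rate) / rate
vowel = np.sin(2 * np.pi * 220 * t)   # low-frequency voiced tone
hiss = np.sin(2 * np.pi * 6500 * t)   # energy in the "s" region
print(sibilance_ratio(vowel, rate) < sibilance_ratio(hiss, rate))
```

A detector like this would then drive attenuation only where the ratio spikes, rather than filtering the whole recording.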

Maximize Sibilance Attenuation with Enhance API

The Media Processing Enhance API has built-in sibilance reduction. The Enhancing Media Quick Start provides step-by-step instructions for processing your own media. You can adjust the amount of attenuation applied to suppress sibilance as needed for your material.

{
  "input": "s3://dolbyio/public/shelby/airplane.original.mp4",
  "output": "dlb://out/airplane.max-sibilance-attenuation.mp3",
  "content": {
    "type": "podcast"
  },
  "audio": {
    "speech": {
      "sibilance": {
        "reduction": {
          "amount": "max"
        }
      }
    }
  }
}

Plosive

Plosives are "pops" caused by sounds like "p" and "b" spoken too close to the microphone. You will often see a pop filter placed in front of a microphone in a studio to minimize plosives. A plosive reduction algorithm detects these low-frequency speech artifacts and applies dynamic processing to suppress the pop sound while preserving speech clarity. Unmanaged plosives can be very distracting to the listener and can also negatively affect other processing.

Maximize Plosive Attenuation with Enhance API

The Media Processing Enhance API has built-in plosive reduction. The Enhancing Media Quick Start provides step-by-step instructions for processing your own media. You can adjust the amount of attenuation applied to suppress plosives as needed for your material.

{
  "input": "s3://dolbyio/public/shelby/airplane.original.mp4",
  "output": "dlb://out/airplane.max-plosive-attenuation.mp3",
  "audio": {
    "speech": {
      "plosive": {
        "reduction": {
          "amount": "max"
        }
      }
    }
  }
}