
Improving Audio Quality

The quality of audio has a direct correlation with how media is experienced. It has the power to make an audience feel immersed in the content or distract from understanding the message. This tutorial explores some of the common concepts and metrics associated with audio quality to explain how best to make use of the Dolby.io Media Processing APIs.

Noise

Noise is unwanted, unpleasant, loud, or disruptive audio that is not part of the creative intent when audio is captured. These sounds frequently have no structure and make media content harder to perceive.

It is common to categorize noise as either:

  • stationary or static noise
  • non-stationary noise

Stationary noise is often characterized by a steady, low volume background noise. This white noise can come from heating systems, air conditioners, fans circulating air or running in nearby computer equipment, or even the recording equipment itself such as a microphone hiss or electric hum of a power-line frequency.

Detecting this type of noise can be done through an analysis of sound characteristics such as variations within frequency bands over time. Various digital signal processing techniques exist to remove or subtract this type of noise.
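As a rough illustration of the idea (not the algorithm the APIs use), a stationary noise floor can be estimated by looking at the quietest analysis frames, since steady background noise sets a level that low-energy frames settle at. The sketch below assumes float samples normalized to [-1.0, 1.0]; the frame size and quantile are arbitrary choices:

```python
import math

def frame_rms_db(samples, frame_size=1024):
    """Split samples into frames and return each frame's RMS level in dBFS."""
    levels = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / frame_size)
        levels.append(20 * math.log10(max(rms, 1e-10)))
    return levels

def estimate_noise_floor_db(samples, frame_size=1024, quantile=0.1):
    """Estimate stationary noise as the average level of the quietest frames.

    Steady background noise sets a floor that the lowest-energy frames
    approach, even while speech raises the level of the louder frames."""
    levels = sorted(frame_rms_db(samples, frame_size))
    cutoff = max(1, int(len(levels) * quantile))
    return sum(levels[:cutoff]) / cutoff
```

A sweep over the quietest 10% of frames is a common simplification; production noise estimators track the floor adaptively over time instead.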

Non-stationary noise is often irregular and appears infrequently or at a cyclical time in media. Some examples of this type of noise can be more varied:

  • a dog barking
  • birds chirping
  • keyboard clicks
  • an ambulance driving by
  • a book falling off a desk
  • etc.

Speech characterized as babble, hubbub, or a cacophony of voices can also fall into this category. Certain throbbing or vibration sounds fall here as well because they can be cyclical in nature.

All of these non-stationary sounds are unwanted noise too, but they are not easy to distinguish from the desired sound profile itself. One way to think about it is as the inverse of detecting stationary sounds: instead of detecting the noise, machine learning algorithms elevate the desired sounds by isolating spoken words in certain types of media.

Measuring Noise with Analyze API

How much noise is present in a media file? When building a quality control application with user-submitted content, or when processing a large collection of audio, it can be useful to use the Media Processing Analyze API to learn which files have more noise than others. The Analyze Quick Start provides steps you can follow for analyzing your own media.

When inspecting the response for noise data there are a few data points of interest.

    "result": {
        "audio": {
            ...,
            "noise": {
                "level_average": -64.72,
                "snr_average": 41.99
            },
            ...
        }
    }

The level_average gives you the average level of noise measured (in dBFS) across the length of the file. The snr_average gives you the noise measured as a signal-to-noise ratio (in dB). Generally, a high SNR indicates a healthy amount of signal relative to noise.
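For example, a QC script might parse the Analyze response and flag files whose SNR falls below a threshold. The helper below and the 40 dB cutoff are illustrative assumptions, not part of the API:

```python
import json

def flag_noisy(analyze_response: str, min_snr_db: float = 40.0) -> dict:
    """Pull the noise metrics out of an Analyze API response body and
    flag the file if its SNR falls below the chosen threshold."""
    noise = json.loads(analyze_response)["result"]["audio"]["noise"]
    return {
        "noise_level_dbfs": noise["level_average"],
        "snr_db": noise["snr_average"],
        "needs_cleanup": noise["snr_average"] < min_snr_db,
    }

response = '{"result": {"audio": {"noise": {"level_average": -64.72, "snr_average": 41.99}}}}'
report = flag_noisy(response)
```

The right threshold depends on your content; spoken-word material generally tolerates less noise than field recordings.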

Reducing Noise with Enhance API

If you have unwanted noise in your media, how do you get rid of it? The Media Processing Enhance API can help with an intelligent approach to noise management. There are two elements to the algorithm:

  • noise reduction
  • speech isolation

With noise reduction, stationary background noise is suppressed. With speech isolation, speech is made louder relative to non-speech sounds. The presence of dialog is brought forward to make it more pronounced. In the Enhancing Media Quick Start there are steps for processing your own media. The default parameters try to balance noise reduction and speech isolation while also identifying other prominent impurities.

By default, a medium amount of noise reduction is applied automatically based on the level of noise relative to the signal. If you don't like the result you can adjust the noise reduction amount parameter to a fixed setting. For example, you can apply very aggressive stationary noise reduction by setting the amount to max. You can also dial it back if you discover artifacts, ghosting, or constrained vocal tones.

{
  "input": "s3://dolbyio/public/shelby/airplane.original.mp4",
  "output": "dlb://out/airplane.max-noise-reduction.mp4",
  "content": {
    "type": "mobile_phone"
  },
  "audio": {
    "noise": {
      "reduction": {
        "amount": "max"
      }
    }
  }
}
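If you are sweeping settings across a batch of files, a small helper can build the request body shown above for different noise reduction amounts. This helper is a hypothetical convenience for illustration, not part of any Dolby.io SDK; the example paths are placeholders:

```python
def enhance_request(input_url, output_url, amount="medium",
                    content_type="mobile_phone"):
    """Build an Enhance API request body with a fixed noise
    reduction amount, matching the JSON shape shown above."""
    return {
        "input": input_url,
        "output": output_url,
        "content": {"type": content_type},
        "audio": {"noise": {"reduction": {"amount": amount}}},
    }

body = enhance_request("s3://bucket/in.mp4", "dlb://out/out.mp4", amount="max")
```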

In some cases you may want speech to be amplified and all other sounds reduced, but without impacting the vocal performance. In this next example, several parameters are disabled to show the speech isolation with minimal other processing. Using speech isolation can have unintended effects such as attenuating non-speech sounds like music, so you may adjust the value accordingly.

{
  "input": "s3://dolbyio/public/shelby/airplane.original.mp4",
  "output": "dlb://out/airplane.max-speech-isolation.mp4",
  "content": {
    "type": "voice_recording"
  },
  "audio": {
    "loudness": { "enable": false },
    "dynamics": { "range_control": { "enable": false } },
    "filter": { "high_pass": { "enable": false } },
    "noise": { "reduction": { "enable": true } },
    "speech": {
      "sibilance": { "reduction": { "enable": false } },
      "isolation": { "enable": true, "amount": 100 }
    }
  }
}

Clipping

Clipping is a waveform distortion that occurs when a signal's peaks are cut off unnaturally. The Media Processing Analyze API can give insight into this type of artifact, which can reduce the quality of media.

The Analyzing Media Quick Start provides step-by-step instructions for how you can process your own media. When inspecting the response you may see results like this:

    "result": {
        "audio": {
            "bandwidth": 10687,
            "clipping": {
                "first_event": 31.54,
                "num_sections": 1
            }
        }
    }

The bandwidth metric indicates whether a file has a restricted frequency range. This could happen for a number of reasons, such as the audio being incorrectly transcoded or the wrong source file being used as part of a downmix process. As part of a QC effort, this can help identify whether a piece of media is suitable for a particular use case.

The number of clipping events is a common smoke test when inspecting audio quality. The num_sections value indicates how prevalent clipping is within a file. Files are often still acceptable with only a few clipping events, but tens or hundreds of them may indicate a problem.

The location of the first_event of clipping is useful for determining whether a clapboard was recorded at the beginning of the audio. It also lets you seek to the first event for a closer review of the effect.
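To build intuition for these metrics, a naive clipping detector can scan normalized PCM samples for runs of consecutive samples pinned at full scale. This is a simplification of what a real analyzer does; the threshold and minimum run length below are assumptions:

```python
def clipping_sections(samples, rate, threshold=0.999, min_run=3):
    """Find runs of consecutive samples at (or beyond) full scale,
    a simple heuristic for clipped peaks.

    Returns (first_event_seconds, num_sections), loosely mirroring
    the Analyze metrics; first_event is None when nothing clips."""
    sections = []
    run = 0
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            run += 1
        else:
            if run >= min_run:
                sections.append((i - run) / rate)
            run = 0
    if run >= min_run:
        sections.append((len(samples) - run) / rate)
    first = sections[0] if sections else None
    return first, len(sections)
```

Requiring a minimum run avoids flagging a single legitimate full-scale peak as clipping.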

Silence

Silence is the absence of content such as music or speech. Some silence is natural in speech but too much silence may indicate a problem where the intended signal was lost. For many applications, it may be desirable to remove silence entirely. Use of the Media Processing Analyze API can help identify how much of the total file is silent and the number of occurrences or sections with silence. A section is a discrete block of at least half a second without any speech or music.

Measuring Silence with Analyze API

The Analyzing Media Quick Start provides step-by-step instructions for how you can run your own media. The response might look like this:

    "result": {
        "audio": {
            "silence": {
                "percentage": 5.3,
                "num_sections": 0
            }
        }
    }
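These metrics can be approximated locally to sanity-check results. The sketch below computes a silence percentage and counts sections of at least half a second below a threshold; the -60 dBFS threshold and frame size are assumptions, not the API's actual parameters:

```python
import math

def silence_stats(samples, rate, threshold_db=-60.0,
                  min_section_s=0.5, frame_s=0.05):
    """Return (percentage, num_sections): how much of the signal is
    silent, and how many silent stretches last at least half a second."""
    frame = max(1, int(rate * frame_s))
    silent = []
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        silent.append(20 * math.log10(max(rms, 1e-10)) < threshold_db)
    # Count runs of silent frames long enough to qualify as a section.
    num_sections, run = 0, 0
    min_frames = round(min_section_s / frame_s)
    for flag in silent + [False]:
        if flag:
            run += 1
        else:
            if run >= min_frames:
                num_sections += 1
            run = 0
    percentage = 100.0 * sum(silent) / len(silent)
    return round(percentage, 1), num_sections
```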

Dynamics

Media content can have dialog with uneven talker volumes depending on factors such as how loud an individual speaks or their positioning relative to a recording device. These dynamics can result in an unbalanced listening experience.

A leveling algorithm can identify speech sections and apply a time-varying amplification or attenuation as needed so that speech levels are brought closer together within a desired dynamic range. This means the soft-spoken person across the room and the booming voice close to a microphone can both be fixed. Some talkers may be inconsistent or very dynamic when narrating, so that there are fluctuations in their volume. This can happen when a talker is in motion or changes position relative to a microphone. The leveling algorithm can smooth this out in a natural and pleasant way.
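The core idea can be sketched as a per-frame gain toward a target level. A real leveler operates only on detected speech and smooths its gains over time; this simplified version, with assumed target and gain-cap values, just illustrates the concept:

```python
import math

def level_speech(samples, rate, target_db=-20.0,
                 frame_s=0.05, max_gain_db=12.0):
    """Measure each frame's RMS and nudge it toward a target level,
    capping the gain so quiet background noise isn't blown up."""
    frame = max(1, int(rate * frame_s))
    out = []
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        level_db = 20 * math.log10(max(rms, 1e-10))
        gain_db = max(-max_gain_db, min(max_gain_db, target_db - level_db))
        gain = 10 ** (gain_db / 20)
        out.extend(s * gain for s in chunk)
    return out
```

Applying gain per frame without smoothing would pump audibly in practice, which is why production levelers interpolate gains across frames.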

An equalization or dynamic eq algorithm can analyze the spectral profile relative to a target, such as professionally recorded speech. Adjustments are made dynamically to filter and apply this equalization. By analyzing the audio and calculating energy per band, adjustments are made so the input resembles the target frequency profile and conforms the sound. This can compensate for recordings made on home or consumer devices like mobile phones or low-quality off-the-shelf microphones rather than in a professional studio.

Dynamic Range Control with Enhance API

The Media Processing Enhance API applies dynamic leveling and dynamic eq to produce better and more professional-sounding recordings. The Enhancing Media Quick Start provides step-by-step instructions for processing your own media. You can tune your results by changing the amount of range control being applied. Levels for each speaker are measured using the short-term loudness defined in EBU R 128. The target range is typically between 6 dB and 9 dB, with speech frames given more subtle gains. The amount can be changed to further constrain or maximize the range depending on the type of material you have or the sound desired.

{
  "input": "s3://dolbyio/public/shelby/airplane.original.mp4",
  "output": "dlb://out/airplane.max-dynamic-range.mp4",
  "content": {
    "type": "meeting"
  },
  "audio": {
    "dynamics": {
      "range_control": {
        "enable": true,
        "amount": "max"
      }
    }
  }
}

Disabling Dynamic EQ

In some cases, such as with pre-processed audio or professional recording equipment, the dynamic eq adjustments may not have the desired outcome. This may be the case with musical performances, for example. In those use cases, the dynamic eq processing can be disabled.

{
  "input": "s3://dolbyio/public/shelby/indoors.original.mp4",
  "output": "dlb://out/indoors.no-dynamic-eq.mp4",
  "content": {
    "type": "music"
  },
  "audio": {
    "filter": { "dynamic_eq": { "enable": false } }
  }
}

Sibilance

Sibilance is a harsh consonant sound such as "s", "sh", "x", "ch", "t", and "th" that originates from a talker's pronunciation of words. A de-esser algorithm detects these sounds by analyzing frequency regions for onsets of energy. When identified, these sounds can be attenuated to create audio closer to a studio-recorded sound, compensating for non-professional recording equipment. Sibilance is typically found in an upper frequency range (5 kHz to 8 kHz) but varies with the specific vocal range of the talker. It is a mouth artifact, in contrast to plosives, which occur in lower frequency ranges.
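A toy version of the detection step can be sketched with a first-difference filter, which emphasizes high frequencies: frames where the differenced energy dominates the total are treated as sibilant and attenuated. Real de-essers filter a specific band instead, and all thresholds here are assumptions:

```python
import math

def deess(samples, rate, ratio_threshold=0.25,
          attenuation_db=-6.0, frame_s=0.02):
    """Attenuate frames whose high-frequency energy (estimated via the
    first difference of the signal) dominates the total frame energy."""
    frame = max(2, int(rate * frame_s))
    gain = 10 ** (attenuation_db / 20)
    out = []
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        total = sum(s * s for s in chunk) or 1e-12
        high = sum((b - a) ** 2 for a, b in zip(chunk, chunk[1:]))
        g = gain if high / total > ratio_threshold else 1.0
        out.extend(s * g for s in chunk)
    return out
```

A differencing filter boosts energy roughly in proportion to frequency, so a high differenced-to-total ratio is a crude stand-in for energy concentrated in the sibilant band.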

Maximize Sibilance Attenuation with Enhance API

The Media Processing Enhance API has built-in sibilance reduction. The Enhancing Media Quick Start provides step-by-step instructions for processing your own media. You can adjust the amount of attenuation applied to suppress sibilance, which may be necessary for particularly sibilant recordings.

{
  "input": "s3://dolbyio/public/shelby/airplane.original.mp4",
  "output": "dlb://out/airplane.max-sibilance-attenuation.mp3",
  "content": {
    "type": "podcast"
  },
  "audio": {
    "speech": {
      "sibilance": {
        "reduction": {
          "amount": "max"
        }
      }
    }
  }
}