From October 28 to 30, Dolby.io was proud to sponsor and participate in PyData Global 2021. PyData, an annual data convention for analysts, scientists, and developers alike, focuses on fostering and connecting the community through educational resources and exposure to new tools and best practices.
For the Dolby.io team specifically, we wanted to engage with the audio data space and connect with both experts and beginners to demonstrate the utility and power of audio analysis and research. To do this we participated in two webinars: a short lightning talk and a full-length 90-minute tutorial.
Briefly, in the lightning talk and over the course of the 90-minute workshop, we covered a variety of audio data creation tools and how they can be used to supplement and enhance your catalog of data. These tools included:
- pyAudioAnalysis: An open-source audio signal analysis package by Theodoros Giannakopoulos that worked great for exploratory data analysis, clustering, and audio classification.
- Transcriptions and the Natural Language Tool Kit: A combination of two tools. The first is a transcription tool, in this case Azure Cognitive Services Speech-to-Text, and the second is the NLTK, which performs natural language analysis on the transcribed text. Together, these two tools were great for content analysis and clustering.
- Dolby.io Analyze and Analyze Speech: The Dolby.io media processing analysis suite created interesting data points from the underlying signal that could be used for speaker analysis and diarisation, quality analysis, and understanding the composition of defects in the audio.
As part of the workshop, we wanted to tie these three tools together in an overarching project that showed why extracting data from audio is useful and what kinds of projects a user could explore with the available tools. Given that the theme of the workshop was analyzing sports podcast audio, the idea was to improve advertisement insertion by using a combination of podcast content and speaker diarisation to pick a natural spot where an advertisement could be placed. For this goal, we laid out some criteria for what makes inserting an advertisement more natural:
- The advertisement should be relevant to the conversation occurring near it.
- The conversation should be positive toward whatever is being advertised.
- The advertisement shouldn’t interrupt the flow of the conversation.
With the criteria defined, we set about creating a tool that could satisfy our use case.
Criterion #1
We started with criterion #1, content relevance. To do this, we created a transcript of our target podcast episode with Azure Cognitive Services Speech-to-Text and loaded it in. If you are interested in learning how to create transcriptions with Speech-to-Text, check out this helpful guide here.
# Python package for opening JSON files
import json

with open("transcription_1.json") as f:
    load_file = json.load(f)
# Azure Speech-to-Text outputs several versions of the transcription; we select only the lexical one.
filtered_text = load_file["combinedRecognizedPhrases"][0]["lexical"]
Once the JSON containing our transcribed text was loaded, we used the Natural Language Tool Kit (NLTK) to label every word in the transcription by part of speech (noun, verb, etc.) and then labeled the nouns by whether they refer to a person, organization, place, and so on. If you are interested in a more detailed description of how the NLTK is able to do this, read this article here.
import nltk
#We first "tokenize" the words by separating them into individual strings.
tokenized_text = nltk.word_tokenize(filtered_text)
#We then tag the words by what part of speech they correspond to.
pos_text = nltk.pos_tag(tokenized_text)
# We then use "Named Entity" recognition to split nouns into subgroups such as organizations.
ne_tree = nltk.ne_chunk(pos_text)
With all the words labeled, we iterate over the tagged words and save the ones labeled as organizations.
orgs = []
for labeled_word in ne_tree:
    # Named entities are nltk.tree.Tree nodes; ordinary tokens are (word, tag) tuples.
    if isinstance(labeled_word, nltk.tree.Tree) and labeled_word.label() == 'ORGANIZATION':
        # Keep the first word of the entity.
        orgs.append(labeled_word[0][0])
# We convert the list to a set to remove duplicates.
set(orgs)
With a list of relevant organizations, we then picked one to use as our advertisement insertion guinea pig. In this case, we picked the Milwaukee Bucks basketball team.
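If several organizations come up, counting how often each one is mentioned is one simple way to shortlist candidates. The sketch below assumes `orgs` is the list built in the loop above, before de-duplication. The `rank_organizations` helper and the sample values are hypothetical and only there for illustration.

```python
from collections import Counter


def rank_organizations(orgs, top_n=3):
    """Return the top_n most frequently mentioned organizations with their counts."""
    return Counter(orgs).most_common(top_n)


# Hypothetical sample; in the workshop this list would come from the NLTK loop.
sample_orgs = ["Bucks", "NBA", "Bucks", "Lakers", "Bucks", "NBA"]
ranked = rank_organizations(sample_orgs, top_n=2)
print(ranked)  # [('Bucks', 3), ('NBA', 2)]
```

A frequently mentioned organization is not automatically the best advertising target, so this ranking is only a starting point for the manual choice described above.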
Criterion #2
Now that we knew the podcast mentioned the Bucks at some point in its 90-minute run time, we wanted to make sure we could satisfy criterion #2, positive sentiment relating to the organization or product. This criterion is very important, as it could be detrimental to play an advertisement right after negative comments about the product: users are less likely to click on an advertisement for a product that their favorite host has just spent time talking negatively about. To ensure this criterion is met, we use some basic sentiment analysis to check that the hosts are talking about the organization or product positively.
To do this we use a VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis model. VADER is a lexicon- and rule-based approach to sentiment analysis, meaning each word in its lexicon is pre-assigned a positive or negative score. For example, “bad” on its own gets a clearly negative score. A handful of built-in rules adjust for simple cases such as negation (“not bad”) and intensifiers, but the model has no deeper understanding of context, so sarcasm or longer-range relationships between words are missed. This type of model is very rudimentary but a good place to start as a proof of concept.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# The analyzer needs the VADER lexicon, available via nltk.download('vader_lexicon').
sid = SentimentIntensityAnalyzer()
target_word = 'Bucks'
with open("transcription_1.json") as f:
    transcript = json.load(f)["recognizedPhrases"]
# Now that we have our target word, we find where it is mentioned and score each mention.
for text_data in transcript:
    text = text_data['nBest'][0]['display']
    if target_word in text:
        print(text)
        print(sid.polarity_scores(text))
Using VADER we can score the sentence mentioning the Bucks to check that the overall sentiment is positive.
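To make "overall sentiment is positive" concrete, one option is to threshold the `compound` value that `polarity_scores` returns. The VADER authors suggest treating a compound score of 0.05 or higher as positive. The sketch below also converts each phrase's offset to seconds, assuming the transcription JSON carries an `offsetInTicks` field in 100-nanosecond units, as Azure batch transcription output typically does. The `positive_mentions` helper, the stand-in scorer, and the sample phrases are all hypothetical, so the example runs without NLTK or a real transcript.

```python
TICKS_PER_SECOND = 10_000_000  # assumption: offsets are in 100-nanosecond ticks


def positive_mentions(phrases, target_word, score_fn, threshold=0.05):
    """Return (start_second, text) for phrases that mention target_word positively.

    score_fn is expected to behave like SentimentIntensityAnalyzer.polarity_scores,
    i.e. return a dict with a 'compound' value between -1 and 1.
    """
    hits = []
    for phrase in phrases:
        text = phrase["nBest"][0]["display"]
        if target_word not in text:
            continue
        if score_fn(text)["compound"] >= threshold:
            start_second = int(phrase["offsetInTicks"] // TICKS_PER_SECOND)
            hits.append((start_second, text))
    return hits


# Made-up phrases and a stand-in scorer, purely for illustration.
sample_phrases = [
    {"offsetInTicks": 125_000_000, "nBest": [{"display": "The Bucks looked great tonight."}]},
    {"offsetInTicks": 300_000_000, "nBest": [{"display": "The Bucks were awful on defense."}]},
    {"offsetInTicks": 450_000_000, "nBest": [{"display": "What a great game overall."}]},
]


def fake_scores(text):
    """Stand-in for sid.polarity_scores; not a real sentiment model."""
    return {"compound": 0.6 if "great" in text else -0.5}


mentions = positive_mentions(sample_phrases, "Bucks", fake_scores)
print(mentions)  # [(12, 'The Bucks looked great tonight.')]
```

With the real analyzer, `sid.polarity_scores` would be passed as `score_fn`. The first element of each tuple is the second at which the mention begins, which is the kind of value the `start` variable in the windowing step needs.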
Criterion #3
Although we know that the podcast frames the Bucks positively, we can’t just insert an advertisement right when the team name is mentioned and interrupt the flow of the conversation. That would be jarring to the listener and potentially annoying, as the podcast would be interrupted mid-sentence or at an unnatural point.
To achieve more natural advertisement insertion, we can use the Dolby.io Analyze Speech API to break down the podcast by speaker and then algorithmically find a natural point to add the advertisement. This can be done by submitting the media to the REST API and saving the output JSON. A full guide for using the Analyze Speech API can be found here.
Once we have our Analyze Speech output JSON, we create an is_speaking matrix filled with zeros (called empty in the code below). This matrix has a row for each unique speaker and a column for each second of the podcast. If a speaker is talking during a given second, we replace the zero in that speaker's row with a one. We can see this process executed in the code below.
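import json
import numpy as np

with open("podcast_analyze_speech.json") as f:
    load_file = json.load(f)
# There is other data; however, we are only interested in the speech regions.
region = load_file["processed_region"]["audio"]["speech.details"]
# Assumption: talker_id values run from 1 to the number of speakers, and the podcast
# length can be taken from the end of the last speech section.
max_people = max(talker["talker_id"] for talker in region)
max_duration = int(max(s["start"] + s["duration"] for t in region for s in t["sections"])) + 1
# Create a matrix with one row per speaker and one column per second of the podcast.
empty = np.zeros((max_people, max_duration))
# Get each talker in the processed region.
for talker in region:
    # Get each section in which that talker speaks.
    for sect in talker["sections"]:
        # Mark every second of the section as spoken in that talker's row.
        c = 0
        while c < sect["duration"]:
            empty[talker["talker_id"] - 1][int(sect["start"]) + c] = 1
            c = c + 1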
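The same matrix can also be filled with slice assignment instead of a counter loop. The sketch below is a minimal variant under the same assumptions as the loop above: integer `talker_id` values starting at 1, and `start` and `duration` given in seconds. The `build_speaking_matrix` helper and the sample `region` data are hypothetical.

```python
import math

import numpy as np


def build_speaking_matrix(region, n_speakers, n_seconds):
    """Return an (n_speakers, n_seconds) array with 1 where a speaker is talking."""
    is_speaking = np.zeros((n_speakers, n_seconds), dtype=int)
    for talker in region:
        row = talker["talker_id"] - 1
        for sect in talker["sections"]:
            first = int(sect["start"])
            # Exclusive end; covers the same seconds as the while loop would.
            last = first + math.ceil(sect["duration"])
            is_speaking[row, first:last] = 1
    return is_speaking


# Made-up speech sections for two talkers in an 8-second clip.
sample_region = [
    {"talker_id": 1, "sections": [{"start": 0.0, "duration": 3.0}]},
    {"talker_id": 2, "sections": [{"start": 2.0, "duration": 2.0},
                                  {"start": 6.5, "duration": 1.2}]},
]
matrix = build_speaking_matrix(sample_region, n_speakers=2, n_seconds=8)
print(matrix.sum(axis=0))  # [1 1 2 1 0 0 1 1]
```

Summing over the speaker axis gives the number of people talking in each second, so the zeros at seconds 4 and 5 mark a pause in this made-up clip.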
Now that we know who is speaking and when, we can jump to the point where the podcast host mentions the Bucks favorably. We select a window of 120 seconds starting at that point: we want to find a natural spot to insert the advertisement within two minutes of the organization being brought up, since after that the conversation will likely have shifted in another direction. Once we have this window, we can transpose the matrix and sum along the rows to find segments where no one is speaking.
# start is the second of the podcast at which the favorable mention occurs.
window = empty.T[start : start + 120]
# By summing along axis 1 we find the seconds where silence occurs (a sum of zero).
np.sum(window, axis=1)
Looking at this array for the 120 seconds following the mention of the Bucks, there are ones when one person is talking, twos when two people are talking, and zeros when no one is talking. These silent segments may last only a few seconds; however, a pause in conversation signals a break in the flow of the podcast and an opportunity to insert an advertisement.
Now that we know where there is a pause, we can insert our advertisement for Bucks merchandise, concluding the improved advertisement insertion pipeline. Using all three of these criteria, we have created a tool that can more naturally insert advertisements into a podcast by looking at the content, sentiment, and flow of the conversation.
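To turn that array into a concrete insertion point, one option is to look for the first run of consecutive zeros that lasts a minimum number of seconds. The sketch below accepts the per-second totals from `np.sum(window, axis=1)` (a plain list works too). The `first_silence` helper, its `min_length` parameter, and the sample values are hypothetical.

```python
def first_silence(totals, min_length=2):
    """Return the index where the first run of >= min_length zeros starts, or None."""
    run_start = None
    for i, count in enumerate(totals):
        if count == 0:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_length:
                return run_start
        else:
            run_start = None
    return None


# Made-up per-second speaker counts for a short window.
sample_totals = [1, 1, 0, 1, 2, 0, 0, 0, 1]
offset = first_silence(sample_totals, min_length=2)
print(offset)  # 5
```

The returned index is relative to the window, so adding it to `start` gives the second of the podcast at which the pause begins. Requiring a minimum length skips one-second gaps that are more likely breaths than real breaks in the conversation.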
This code is available in a Jupyter Notebook on GitHub, along with other code exploring pyAudioAnalysis, the Natural Language Tool Kit, and the Dolby.io Analyze and Analyze Speech APIs.
Final Thoughts
PyData Global was a fantastic opportunity to showcase some of the fun and exciting tools available in the audio space. Used in conjunction, these tools open the door to some cool and creative innovations. Examples like context-aware advertisement insertion are just the tip of the iceberg, with so much data available to explore and utilize. If you are interested in seeing Dolby.io explore audio data and these tools in more depth, check out this recording of the presentation and get building some awesome audio solutions.