AV1 real-time encoding has been available in standalone WebRTC and on the Millicast platform since April 2020, as we covered in previous posts. However, anyone who wanted to use it in their own web app had to recompile Chrome. While we provided pre-compiled binaries for the community and the happy few brave enough to test it that early, it was a single-layer implementation and did not support SVC.
As the usage of codecs evolves from closed circuits and dedicated lines towards the public internet, codecs themselves evolve and adopt features that make the media experience over the public internet better. Originally a late annex of H.264 (Annex G), SVC has become a mandatory feature for any modern codec. AV1 is the first codec to support SVC by default. For those interested in more details about how SVC helps, Dr Alex E. wrote a great explanatory blog post back in 2016. Although it was written about VP9, most of its points remain valid for AV1.
For all of last year, the Real-Time Group at AOMedia (part of the codec group) was hard at work finishing the RTP payload specification, which allows RTP endpoints to leverage all the SVC features of the codec, and also allows intermediate SFUs to be better, stronger and faster. Cosmo Software spearheaded the implementation of all the tests and a reference SFU. The AV1 RTP payload specification is now almost final, tested up to 95%+.
Now is a good time to revisit why AV1 matters to real-time media beyond improved compression efficiency. We will also provide details about what to expect in terms of performance.
I. Innovation does not happen in a vacuum
In our fast paced world, it is too easy to focus on small things as they happen, and to ignore the big picture. However, no innovation happens in a vacuum, and looking at the trends and analyzing the trajectory is even more fascinating.
Dr Alex Eleftheriadis (a.k.a. the other Dr Alex) did an extremely good job of documenting the evolution of communications as a whole in a recent post:
It is well written, extremely well documented by someone who not only lived the evolution from within, but also had to teach it to university students, and created one of the most technically creative companies in the field: Vidyo. I strongly recommend reading it.
In less than two weeks, two major technologies became standards and one major technology was made available in Chrome:
- On Wednesday January 20th, all of the IETF RTCWEB drafts finally became Standards (or informative references) and received an RFC number. That represents more than a decade of work, by more than a hundred of the brightest minds in the world, on the protocols that serve as a base for WebRTC. Dozens of new standards that emerged from that hard work have been made public, and available through the web platform!
- On the 26th of January, W3C announced the availability of WebRTC 1.0 as a standard as well, solidifying the standard and making it safer for people to jump in and start implementing it.
- On January 21st, Google finally enabled AV1 SVC real-time encoding in Chrome, with the feature becoming available in canary builds a few hours or days later.
It is not serendipity. Real Time Media requires integration of several elements to work well, and all those elements are being worked on and evolving in parallel:
- A codec with SVC (Scalable Video Coding)
- A media Engine (coupling of codec, media and network transport)
- SFUs (Selective Forwarding Units) instead of MCUs (Multipoint Conferencing Units)
This is the classic example of the whole being more than the sum of its parts. Using these technologies in parallel offers amazing advantages:
- better network resiliency
- faster adaptation
- no media processing on the back end
- enabling next generation features like end-to-end encryption.
II. The RTC innovation trajectory
Bernard Aboba, Chief Architect at Microsoft, once wrote this about AV1 (links added by us):
Steve Jobs once said: “I’ve always been attracted to the more revolutionary changes. I don’t know why. Because they’re harder.” AV1 was designed to integrate with the next wave of WebRTC video innovation: E2E encryption, SVC and codec-independent forwarding. So it’s not about the video codec, but rather the next generation architecture.
1. With WebRTC now incorporating E2E encryption via Insertable Streams (and SFrame), and NSA now recommending E2E security, conferencing systems need an RTP header extension to forward packets since the payload may be opaque. So if a browser and codec doesn’t support Insertable Streams or a forwarding header extension integrated with the next generation codec, it will not meet NSA requirements, and conferencing vendors won’t be able to provide full functionality.
2. SVC support is important for conferencing. AV1 has SVC built-in; in HEVC it is an extension. The Dependency Descriptor (defined in the AV1 RTP payload specification) is superior to the Framemarking RTP header extension for spatial scalability modes. If a browser (and next generation codec) doesn’t support SVC along with a forwarding header extension, it won’t be competitive.
3. AV1 includes screen coding tools as a basic feature, not an extension as in HEVC. This is a major competitive advantage for conferencing.
A. Screen Sharing
AV1 is extremely efficient when encoding screen content, for both textual content and very high motion content. It is so superior, in fact, that AV1 real-time may be deployed for that single use case alone, as Cisco has done in Webex.
Transmitting AV1 is supported when sharing screens or applications with “Optimize for motion and video” selected, and when the machine you are on has at least four cores. Receiving AV1 is supported for any machine with at least two cores. AV1 will automatically be used for sharing this type of screen content whenever all participants in a meeting support it, otherwise it will automatically revert to H.264.
It is interesting to note here the constraints of 4 and 2 cores respectively. Cisco made the same statement during their live demo at the Big Apple conference in June 2019.
We are going to keep the performance discussion for a later section, but to provide context, the MacBook Air has had Intel Core 2 chips with 2 cores since 2008, and MacBook Pros have had Intel i7 or better with 4 cores or more since 2011. So, as far as laptops and desktops are concerned, expecting to have 4 cores is not a big ask.
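The selection rule quoted above can be sketched as a small function. This is only an illustration of the rule as Cisco describes it, not Webex code: the `Participant` type, the function names and the two constants are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Iterable

# Core-count thresholds taken from the Cisco statement quoted above.
MIN_SEND_CORES = 4
MIN_RECV_CORES = 2


@dataclass(frozen=True)
class Participant:
    """Hypothetical description of one meeting participant."""
    name: str
    cores: int
    supports_av1: bool
    is_sender: bool = False


def can_use_av1(p: Participant) -> bool:
    """A sender needs at least 4 cores, a receiver at least 2."""
    required = MIN_SEND_CORES if p.is_sender else MIN_RECV_CORES
    return p.supports_av1 and p.cores >= required


def pick_screen_share_codec(participants: Iterable[Participant]) -> str:
    """Use AV1 only if every participant can handle it, else fall back."""
    return "AV1" if all(can_use_av1(p) for p in participants) else "H.264"
```

A single participant below the threshold is enough to move the whole meeting back to H.264, which is why the core-count requirement matters in practice.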
B. End-To-End Encryption
E2EE is the next big thing. Maybe because it was an original WebRTC promise. Maybe because it became an overused marketing term and Zoom got burned. Maybe because most people still claim to have it and actually don’t.
With respect to E2EE, one of the best responses on the subject is this presentation by Emil Ivov:
While many people think E2EE is only a video conferencing or chat application feature, it is used throughout the media industry under the acronym “DRM” (Digital Rights Management). However, the traditional implementations of DRM in the browser and in the media player are not truly end-to-end, and only cover delivery. People uploading their content to a platform still need to trust the platform (and anybody who can access it, legally or not) with their raw content.
True E2EE requires the media to be encrypted at the source when it is encoded, and only decrypted at playback. It means content providers no longer have to trust the platform.
The WebRTC Insertable Streams API proposal received a lot of coverage, because it can be used for many things. It is an API that allows you to access the media, and a necessary step to enable E2EE. However, it has no encryption capacity or crypto key management capacity on its own.
The closest thing to WebRTC-compatible media encryption for E2EE is the proposed IETF SFrame standard. It still requires an external system to provide secure key management. On that point, Apple reported on the 18th of January, at the monthly WebRTC interim meeting, that they had added an early secure implementation of SFrame in Safari. This received good feedback from Firefox, whose team is usually very attached to security features and protecting internet users. Progress is being made on the web platform side as well.
The subtle message here is that the design of SFrame was forward looking. Where its predecessor PERC forced users into legacy RTP media transport and was restricted to the video conferencing use case, SFrame was designed to be:
- use-case agnostic (i.e. usable for streaming)
- protocol agnostic (RTP today, QUIC tomorrow)
- bandwidth-efficient (less overhead than SRTP or PERC)
- compatible with SVC codecs.
C. SVC with SFUs and modern media infrastructure
Most people focus on the coding efficiency of new codecs:
- the bandwidth usage reduction that results from using a new codec
- with the same input resulting in the same quality on the viewer side.
Used within a next generation media architecture, SVC provides more than that.
Remove the need for ABR
SVC provides the capacity to generate multiple layered resolutions from a single encoder, and within a single bitstream. In other words, SVC makes server-side transcoding and ABR obsolete (although there are still other reasons to transcode server-side for VOD).
It is also more efficient to encode one layered bitstream rather than 3, 5 or 7 independent layers like simulcast or ABR currently do, as the low resolution content is only encoded once. In modern media delivery systems where adaptation is the norm, it makes a huge difference to the bottom line.
Better Network Resiliency
For those not familiar with the notion of media transport and partial reliability, we recommend reading our previous post on the subject.
There are three main ways to deal with network glitches: retransmission, redundancy and forward error correction (FEC). Each is a compromise:
- Retransmission (RTX) assumes that you have time to send the packet again, and that you keep a packet cache for each ongoing stream. The benefit is that it is actually pretty simple to implement.
- Redundancy (RED) assumes that you can afford to use (much) more bandwidth. If your packet loss is due to congestion (a quantity problem) rather than a quality problem, it is not going to help.
- FEC reduces the bandwidth overhead and avoids waiting for a retransmission. However, it increases complexity on both the sender and the receiver side.
In a layered codec, only the base layer is critical to the call. Losing other layers would only reduce the resolution of a single frame on the receiving side.
As a result, you do not have to protect the entire stream, but only the base layer. That makes FEC more interesting, because the complexity is automatically reduced. The relative bandwidth overhead is also only a fraction of what it would have been if RED or FEC were used on all the packets.
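To make the overhead argument concrete, here is a small back-of-the-envelope sketch. The layer bitrates and the 50% protection ratio are made-up illustrative numbers, not measurements, and `protection_overhead` is a hypothetical helper written for this post.

```python
from typing import Dict, Iterable, Tuple


def protection_overhead(
    layer_bitrates_kbps: Dict[str, float],
    protection_ratio: float,
    protected_layers: Iterable[str],
) -> Tuple[float, float]:
    """Return (extra kbps, extra as a fraction of the whole stream).

    protection_ratio is the amount of repair data added per unit of
    protected media (0.5 means 50% extra on the protected layers).
    """
    total = sum(layer_bitrates_kbps.values())
    extra = sum(layer_bitrates_kbps[name] * protection_ratio
                for name in protected_layers)
    return extra, extra / total


# Illustrative three-layer stream (values are assumptions).
layers = {"base": 300.0, "mid": 700.0, "top": 1000.0}

everything = protection_overhead(layers, 0.5, layers.keys())
base_only = protection_overhead(layers, 0.5, ["base"])
```

With these assumed numbers, protecting every layer costs 1000 kbps extra (50% of the stream), while protecting only the base layer costs 150 kbps (7.5% of the stream).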
Also, base layer packets are sent at only a fraction of the rate of the full stream, meaning that you have much more time than you would otherwise to deal with a missing packet. That makes RTX much more appealing as well.
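The timing argument can be illustrated with a short calculation. The sketch below assumes a dyadic temporal structure (each extra temporal layer doubles the frame rate), a 30 fps stream and a 50 ms round-trip time; all of these are assumptions chosen for illustration, and the function names are hypothetical.

```python
def base_layer_interval_ms(fps: float, temporal_layers: int) -> float:
    """Time between two consecutive base-layer frames.

    Assumes a dyadic temporal structure: with N temporal layers,
    one frame out of 2 ** (N - 1) belongs to the base layer.
    """
    frames_per_base_frame = 2 ** (temporal_layers - 1)
    return 1000.0 * frames_per_base_frame / fps


def round_trips_available(interval_ms: float, rtt_ms: float) -> int:
    """How many NACK/retransmission round trips fit in the interval."""
    return int(interval_ms // rtt_ms)


single_layer = base_layer_interval_ms(fps=30, temporal_layers=1)
three_layers = base_layer_interval_ms(fps=30, temporal_layers=3)
```

Under these assumptions, a single-layer stream leaves about 33 ms before the next frame, which is less than one 50 ms round trip, while the base layer of a three-temporal-layer stream leaves about 133 ms, enough for two round trips.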
Whatever your approach to network resiliency, SVC is only going to make it more efficient.
Faster (than light) Adaptation
Again, the role of an SFU is relatively simple: get incoming packets, check which ones should be proxied to a given destination, and push them to that destination. To decide which packets should be proxied to a specific destination, you first need to decide which resolution/layer to proxy, and then execute the change.
The decision is often made following heuristics based in part on the viewer's bandwidth capacity, screen size and device hardware. Executing the change of resolution/layer is where the difference is made.
If using simulcast, one can determine the resolution of a stream depending on its source ID (SSRC). Provided with the streamID <=> resolution mapping, the SFU then decides to stop sending a given stream, and to send another one with the same content at a different resolution. For the viewer decoder to be able to follow the change without artefacts, it needs to wait a full frame before switching.
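The simulcast switch described above can be sketched as follows. This is a simplified model, not code from any real SFU: the `Packet` and `SimulcastSelector` names are hypothetical, and the sketch assumes the point at which the decoder can follow the new stream is signalled as the first packet of a keyframe.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Packet:
    """Hypothetical, minimal view of an incoming RTP packet."""
    ssrc: int
    is_keyframe_start: bool = False


class SimulcastSelector:
    """Forwards exactly one simulcast stream to a given viewer."""

    def __init__(self, ssrc_to_height: Dict[int, int], initial_height: int):
        self._ssrc_by_height = {h: s for s, h in ssrc_to_height.items()}
        self._current = self._ssrc_by_height[initial_height]
        self._pending: Optional[int] = None

    def request(self, height: int) -> None:
        """Ask for another resolution; the switch happens later."""
        target = self._ssrc_by_height[height]
        self._pending = target if target != self._current else None

    def should_forward(self, pkt: Packet) -> bool:
        # Only switch once the new stream offers a clean entry point.
        if (self._pending is not None
                and pkt.ssrc == self._pending
                and pkt.is_keyframe_start):
            self._current, self._pending = self._pending, None
        return pkt.ssrc == self._current
```

Between `request()` and the arrival of a suitable frame on the new stream, the viewer keeps receiving the old resolution, which is the waiting period the paragraph above refers to.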
SVC codecs have a special structure known as the Scalability Structure that defines the dependencies between the different layers. This is a codec and bitstream feature. One of the extremely smart advances of the past few years was to duplicate and extend this scalability structure at the media transport level.
The ultimate goal is Instant Decodability decision making at the decoder!
Thanks to those extra structures, the SFU can decide upon receiving any packet, given a target decoding resolution, whether the packet should be dropped or not. This is an extremely powerful feature to:
- Reduce the feedback bandwidth usage towards the sender by not sending NACKs for packets that are not critical,
- Reduce the forward media bandwidth usage by not sending superfluous packets that the decoder would eventually drop,
- Change the resolution from one packet to the next (in the single digit millisecond range), instead of waiting for a full frame.
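As a rough illustration of this per-packet decision, here is a deliberately simplified sketch. It assumes each packet exposes a spatial and a temporal layer index and that every layer depends on all lower ones; the real Dependency Descriptor carries richer information than this, and `LayerInfo` and `should_forward` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerInfo:
    """Hypothetical per-packet layer indices (0 is the base layer)."""
    spatial_id: int
    temporal_id: int


def should_forward(pkt: LayerInfo, target: LayerInfo) -> bool:
    """Forward a packet only if the target decoding layer needs it.

    Assumes full dependency on lower layers, so anything above the
    target in either dimension can be dropped without being decoded.
    """
    return (pkt.spatial_id <= target.spatial_id
            and pkt.temporal_id <= target.temporal_id)
```

Because the check is stateless, the target can be lowered from one packet to the next. Moving the target up would in practice also require a frame that does not depend on previously dropped ones, which this sketch does not track.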
This is a very technical and very new feature, and I’m not going to go into the details in this post. You can refer to the upcoming blog post by our Media Server Tech Lead, Sergio Murillo, for the technical details.
III. Adoption, Performance and Expectation
What is new today, is the availability of a software ENCODER that is fast and good enough to work in Chrome, as well as the availability of the RTP payload which supports all the scalability modes of the codecs.
B. Performance of AV1 codec in RealTime mode
In mid-2020 we did a study dedicated to real-time codecs which shows that, even on very limited hardware, AV1 RT performed well and fast enough, and certainly better than its predecessors in the same conditions. It was peer-reviewed and published in an IEEE conference, and an extended version has been made available on arXiv. In the spirit of reproducible science and open data, the command lines used to test each encoder are provided in the paper linked below for everybody to test by themselves.
As far as we know, this is the only benchmark and comparison of codecs used in their real-time configuration and with a real-time setting (paced input). There is a test suite by Phoronix that seems to test libaom realtime mode at speed 6 and 8, but we have not checked exactly which command lines are used (i.e. how many cores, multithreading, etc.), and whether the input is paced or read from file. If read from file, the results would be artificially faster than in a real-world setting.
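To show what "paced input" means here, the following sketch releases frames on a fixed schedule instead of as fast as they can be read. It is an illustration only and not the harness used in the study; the `paced` function and its injectable `clock` and `sleep` parameters are hypothetical.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def paced(
    frames: Iterable[T],
    fps: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield frames no faster than `fps`, like a live capture would.

    Frame i is released at start + i / fps. If the consumer (the
    encoder) is slower than real time, no sleeping happens and the
    delay becomes visible instead of being hidden by file reads.
    """
    interval = 1.0 / fps
    start = clock()
    for index, frame in enumerate(frames):
        delay = start + index * interval - clock()
        if delay > 0:
            sleep(delay)
        yield frame
```

An encoder fed from `paced(frames, fps=30)` cannot run ahead of the source, whereas one reading straight from a file can, which is why unpaced numbers look better than a live setting would allow.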
C. Performance of AV1 Real Time encoding in Chrome
The performance target in Chrome, according to Google, is 720p at 2 Mbps for a normal desktop/laptop machine. The speed configuration of the encoder in libwebrtc is chosen based on the input resolution and number of cores. It uses the same thresholds as Cisco did: 2 cores as the minimum acceptable value and 4 cores as the maximum value. In practice we have been able to reach much higher resolutions than 720p using only 4 cores.
It makes sense for Google to choose that target, which covers the vast majority (in volume) of the web use cases for Real Time Media. It also aligns with their goal of providing a better experience to the next billion internet users, who will not have access to more than 2 Mbps of bandwidth.
For real-time streaming platforms like Millicast, no limit is put on resolution, frame rate, bit depth, etc. Native apps (like Millicast Studio) replace Chrome to provide deeper support for features not available in-browser; for example 10-bit and 4:4:4 colour Broadcast Quality for Color Grading.
The SVC mode, as expected, takes more resources (30 to 40% more for now, depending on the scalability mode chosen), and a few performance bugs are still to be ironed out with SVC support. There is a known regression in the multithreading code of libaom in WebRTC which Google is working on. We have provided some patches (*) and everything should be ready in time for m90.
So now, the real question that everybody has should be: When?
It should be in Chrome Canary already, and you can start using it and reporting any bugs you find. Unfortunately the commit missed the m89 cut, so unless they backport it to m89 (very rare, but under discussion), it should only be available in m90 stable.
WebRTC Systems: now Harder, Better, Faster, Stronger