WebRTC — The Future for Encoders

SUMMARY

Content providers can optimize their streaming workflows using WebRTC


Broadcasters, content owners and OTT providers are slowly but surely redefining their streaming workflows, especially as it relates to real-time streaming.

These content providers used to be confined to the playback experience in the viewer, at the end of the streaming pipeline. With WebRTC, they can now manage the entire end-to-end workflow and gain greater control over the user experience.

Given the mass adoption of browsers, and of operating systems like Android that are ubiquitous in mobile devices, Smart TVs, set-top boxes, and other IoT devices, you can now reach up to 80% of the market with WebRTC, with even greater adoption to come.

WebRTC is the streaming technology that the web and the internet have specifically designed, chosen and standardized for this very purpose. It is critical that encoders adapt to that vision, or they will be left behind.

The web and the internet have a slow but steady innovation model, with the biggest companies innovating first within native apps, then learning from that experience to propose new features for standardization. Cisco uses Webex, Google uses Duo, Microsoft uses Teams, etc.

This presents an opportunity for vendors of native apps and hardware devices to differentiate themselves through a two-tier approach:

  • a base product that is on par with what web browsers provide, and
  • a premium product that is ahead of what web browsers provide for added value and differentiation (as is the case with Webex, Duo and Teams).

As long as the design and evolution are aligned with the web and the internet, this keeps solution vendors ahead of the curve. When a new feature is added to the web platform, that feature moves from the premium offer to the base offer, keeping the base offer at the best capacity possible when interacting with a browser, and protecting the value of the premium offer.

Traditional Streaming

The traditional streaming model is a pipeline, where each filter has an input and an output, and the media only flows one way:

  1. The uplink/ingest is assumed to be done over either a perfect network or a reliable network transport (at the cost of latency).
  2. The adaptation is done on the server-side, by decoding and re-encoding several resolutions.
  3. The encryption is done only on the delivery side through DRM.

To achieve real-time streaming over the public internet, this model had to change:

1. RTP media engine

RTP implements a media engine, as opposed to a simple encoder. The encoder, the media transport, and the network transport are tied together, and implement feedback loops. There is a bandwidth estimation algorithm which probes the network capacity in real time and provides a budget to a “congestion control algorithm”, the real brain of the media engine.

Where TCP would try to send all content at any cost until successful, the RTP media engine uses UDP for the network transport and moves reliability management up to the media transport layer. This allows for partial reliability: smart, media- and codec-aware management of the content. If you receive feedback that a packet was lost, but your RTT means a retransmission would arrive too late to be played out, you simply ignore the loss.
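
As a rough illustration, here is a minimal TypeScript sketch of that retransmission decision. The names and the latency model are hypothetical, not taken from any real RTP stack; a real media engine also accounts for jitter buffers and NACK scheduling:

```typescript
// Hypothetical sketch of codec-aware partial reliability (illustrative only).
interface LostPacket {
  sequenceNumber: number;
  captureTimeMs: number; // when the frame this packet belongs to was captured
}

function shouldRetransmit(
  pkt: LostPacket,
  rttMs: number,          // current round-trip time estimate
  playoutDelayMs: number, // receiver's playout buffer budget
  nowMs: number,
): boolean {
  // A retransmission requested now arrives roughly one RTT later.
  const estimatedArrivalMs = nowMs + rttMs;
  // It is only worth sending if it beats the playout deadline;
  // otherwise the engine just ignores the loss and moves on.
  return estimatedArrivalMs < pkt.captureTimeMs + playoutDelayMs;
}
```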

The first step for media encoders adopting WebRTC is to implement an RTP media engine. Note that this breaks pure pipeline designs.

For example:

  • FFmpeg refuses to implement RTP feedback,
  • GStreamer implemented WebRTC years ago but only added the feedback mechanisms in summer 2020, and
  • OBS's plugin design is still incompatible with feedback mechanisms.

In practice, today people can use e.g. GStreamer or libwebrtc to provide the RTP engine. In both libraries, the encoders are “injectable” through a Factory design pattern, which makes them easier to integrate on top of existing devices or solutions.

Those are still single stream solutions, which do not provide client-side adaptation capacity.
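
To make the “injectable encoder” idea concrete, here is a hypothetical sketch of that Factory pattern in TypeScript; the interfaces below are illustrative and do not match the actual libwebrtc or GStreamer APIs:

```typescript
// Illustrative only: a vendor plugs its hardware encoder into the media
// engine by implementing a factory; the engine owns the feedback loops
// and calls encode() with the budget produced by congestion control.
interface VideoEncoder {
  encode(frame: ArrayBuffer, targetBitrateKbps: number): ArrayBuffer;
}

interface VideoEncoderFactory {
  supportedCodecs(): string[];
  createEncoder(codec: string): VideoEncoder;
}

class HardwareEncoderFactory implements VideoEncoderFactory {
  supportedCodecs(): string[] {
    return ["H264", "VP8"];
  }
  createEncoder(codec: string): VideoEncoder {
    return {
      encode(frame: ArrayBuffer, _targetBitrateKbps: number): ArrayBuffer {
        // Hand the raw frame to the device's encoding ASIC here (stubbed).
        return frame;
      },
    };
  }
}
```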

2. ABR vs Simulcast vs Layered codecs (SVC)

ABR, or server-side transcoding, is not part of the web/internet real-time streaming model. There are mainly two reasons for that: it doubles or sometimes triples the latency, which is the main metric of interest, and it forces you to trust the real-time streaming platform with your content.

So adaptation must be done sender-side, to keep a single encode/decode media path with no transcoding. The WebRTC spec already has a provision for this client-side encoding, called “Simulcast”.

It’s exactly the same as traditional ABR: one high-quality input stream is piped into several encoders with different resolution and bitrate targets.

WebRTC supports an arbitrary number of streams, as long as they all use the same codec. (An extension is being proposed to allow different codecs per resolution/bitrate target.)

That approach works with any existing and past codec. Implicitly, it assumes that you have at least one media server between the sender and the receivers, and that this server decides which stream to relay to a given receiver. The speed at which the adaptation can happen depends on the frequency of I-frames.
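
For reference, this is what sender-side simulcast looks like with the standard WebRTC API available in browsers today (the bitrate targets below are just example values):

```typescript
// Three encodings of one capture, all with the same codec; the media
// server (SFU) decides which one to relay to each receiver.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const pc = new RTCPeerConnection();

pc.addTransceiver(stream.getVideoTracks()[0], {
  direction: "sendonly",
  sendEncodings: [
    { rid: "high", maxBitrate: 2_500_000 },                         // full resolution
    { rid: "mid", maxBitrate: 800_000, scaleResolutionDownBy: 2 },  // half
    { rid: "low", maxBitrate: 250_000, scaleResolutionDownBy: 4 },  // quarter
  ],
});
```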

More recent codecs like AV1 are layered codecs by default. This means you achieve the same effect, with multiple resolutions of the high-quality input stream to choose from, but with a single encoder and in a single stream.

Not only does this provide adaptation capacity, but the reaction time is much faster (per packet, which translates into milliseconds instead of seconds), and resilience to poor network conditions is enhanced.
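
With the WebRTC-SVC extension (already shipping in Chrome for AV1), requesting a layered encode is a one-line change compared to simulcast; `L3T3` below asks for 3 spatial and 3 temporal layers in a single stream:

```typescript
// One encoder, one stream, multiple layers the SFU can drop per packet.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const pc = new RTCPeerConnection();
pc.addTransceiver(stream.getVideoTracks()[0], {
  direction: "sendonly",
  // scalabilityMode comes from the WebRTC-SVC extension; older TypeScript
  // DOM typings may not know it yet, hence the cast.
  sendEncodings: [{ scalabilityMode: "L3T3" } as any],
});
```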

Real-time AV1 encoding is available in Cisco Webex, Google Duo, and CoSMo’s Millicast today. It was enabled by default in Chrome M90 on April 13th, 2021:

https://youtu.be/DZlsk7BQ5Cs

3. E2EE and DRM

From the web/internet point of view, E2EE is a much better approach. It requires zero trust in your platform, and it provides the same protection as DRM along the entire media path. Of course it has an impact on secondary features, but solutions already exist for encrypted recording, encrypted SSAI, etc.

Here there is a specific opportunity for hardware encoder and device manufacturers. Secure crypto key management is a nightmare for the internet (where you cannot trust JavaScript), quite difficult for software, and relatively easy for hardware vendors. All CPU and chipset vendors have some kind of crypto capacity and secure vault in their designs.

SFrame (Secure Frames): a Media Frame encryption and authentication scheme for WebRTC

While the Media encryption itself is being standardized (see IETF SFrame), and the key exchange protocol is likely to be MLS (see Cisco implementation), storing the keys locally and securely is an open challenge.
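
As a sketch of where such a cipher plugs in, Chrome's Insertable Streams API exposes each encoded frame to application code before it is packetized. The XOR below is a toy stand-in for the real SFrame cipher (never ship XOR as encryption), and the key handling is exactly the hard part discussed above:

```typescript
// Frame-level E2EE insertion point (Chrome-specific API; illustrative only).
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const pc = new RTCPeerConnection({
  // Chrome-only flag enabling Insertable Streams; not in standard typings.
  encodedInsertableStreams: true,
} as any);
const sender = pc.addTrack(stream.getVideoTracks()[0], stream);

const { readable, writable } = (sender as any).createEncodedStreams();
const key = 0x5a; // placeholder; real keys would come from an MLS exchange

readable
  .pipeThrough(
    new TransformStream({
      transform(frame: any, controller) {
        const data = new Uint8Array(frame.data);
        for (let i = 0; i < data.length; i++) data[i] ^= key; // toy cipher
        frame.data = data.buffer;
        controller.enqueue(frame);
      },
    }),
  )
  .pipeTo(writable);
```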

Final note on RTMP vs WebRTC

RTMP is easy to support. You can implement it once and forget about it, passing a URL to the service with all the necessary tokens.

The main risk so far for hardware vendors is that there is no standard signaling for WebRTC. In other words, while the media stack is standard, each service requires some proprietary implementation.

The WHIP protocol, supported by CoSMo, Google, Tencent, Cisco, caffeine.tv, LLNW and others, is the answer to that. Implement it once, and use it with any platform.
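
The whole protocol fits in a few lines: the client POSTs its SDP offer to the WHIP endpoint and gets the SDP answer back. A minimal sketch (the endpoint URL and bearer token below are placeholders):

```typescript
// Minimal WHIP publish flow: one HTTP POST, one SDP answer.
async function whipPublish(endpoint: string, token: string) {
  const stream = await navigator.mediaDevices.getUserMedia({
    video: true,
    audio: true,
  });
  const pc = new RTCPeerConnection();
  stream.getTracks().forEach((track) => pc.addTrack(track, stream));

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  const res = await fetch(endpoint, {
    method: "POST",
    headers: {
      "Content-Type": "application/sdp",
      Authorization: `Bearer ${token}`,
    },
    body: offer.sdp,
  });
  // The Location header holds the resource URL (DELETE it to stop publishing).
  const resourceUrl = res.headers.get("Location");
  await pc.setRemoteDescription({ type: "answer", sdp: await res.text() });
  return { pc, resourceUrl };
}
```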

Practically, what should we do?

From our customers’ point of view, hardware encoders are solving a convenience problem rather than a performance problem. One can run software encoders on a normal desktop machine and achieve good results.

However, capturing professional cameras through SDI or HDMI, as well as other audio equipment and external displays, is a pain.

We already have many of our heavy users using dedicated servers (PC or Mac), with multiple BlackMagic Decklink cards, and our Millicast Studio software for encoding and WebRTC delivery to the Millicast platform.

BlackMagic Decklink capture cards

But what the industry needs now is a simple appliance, similar to a Teradek Cube, Videon Edgecaster or AJA HELO, with a single SDI/HDMI input and a LAN connection, which can be configured through a web page.

Hardware Encoders: Teradek Cube, Videon Edgecaster, AJA HELO

CoSMo/Millicast can provide a C/C++ WHIP SDK, a reference WHIP server to test against, and other support.

WHIP-ing WebRTC (Source: Meetecho Blog)

Below are non-mutually exclusive options to add WebRTC to these encoders:

1. Ground Zero

Implement a super-low-latency version of your RTMP encoder. It’s really just playing with RTMP parameters, nothing fancy, minimum overhead. You will not get ABR, you will not get E2EE, and you will not get a better codec than H.264, but it will work with the Millicast platform today as an RTMP ingest. Frankly speaking, that’s investing in the past.

2. Implement WebRTC+WHIP with H.264 or VP8 (4:2:0, 8-bit)

This is the most sensible first step. It’s simple to implement, it will work with all existing browsers today, and it will shave off half of the latency you have with RTMP, already a night-and-day difference for latency-sensitive workflows. It allows you to implement and validate a full WebRTC stack before you move on to further integration. You will not get ABR, and you will not get E2EE.
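
On the WebRTC side, pinning the negotiated codec to H.264 (or VP8) can be done with the standard `setCodecPreferences` API, for example:

```typescript
// Put H.264 first in the negotiated codec list (keeping RTX/FEC entries).
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const pc = new RTCPeerConnection();
const transceiver = pc.addTransceiver(stream.getVideoTracks()[0], {
  direction: "sendonly",
});

const codecs = RTCRtpReceiver.getCapabilities("video")?.codecs ?? [];
const h264 = codecs.filter((c) => c.mimeType === "video/H264");
const others = codecs.filter((c) => c.mimeType !== "video/H264");
transceiver.setCodecPreferences([...h264, ...others]);
```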

3. Intermediary: WebRTC+WHIP with VP9 mode 2 (10-bit 4:2:0 HDR)

An interesting intermediate step if your hardware supports VP9 encoding (Intel, Qualcomm and Samsung do, for example). This provides you with 10-bit HDR10 capability out of the box, supported by Chrome, Edge and Safari today.
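
Selecting VP9 mode 2 uses the same `setCodecPreferences` pattern as above, keying off the `profile-id=2` fmtp entry that browsers expose for the 10-bit profile:

```typescript
// Prefer VP9 profile 2 (10-bit) when negotiating.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const pc = new RTCPeerConnection();
const transceiver = pc.addTransceiver(stream.getVideoTracks()[0], {
  direction: "sendonly",
});

const codecs = RTCRtpReceiver.getCapabilities("video")?.codecs ?? [];
const vp9Mode2 = codecs.filter(
  (c) => c.mimeType === "video/VP9" && c.sdpFmtpLine?.includes("profile-id=2"),
);
const others = codecs.filter((c) => !vp9Mode2.includes(c));
transceiver.setCodecPreferences([...vp9Mode2, ...others]);
```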

4. Intermediary: WebRTC+WHIP with H.265

This is more of a niche play. Among existing browsers, only Apple’s Safari will be able to receive and display your stream (and that is not likely to change). However, many hardware devices can also decode H.265. Risky and not very practical, but it is low-hanging fruit, and god knows Apple owners love their devices.

5. Same as #2 but with simulcast

This gives you the best quality possible today, while being future-proof for when E2EE becomes available. In our opinion, this is the best configuration for the base offer (on par with browsers today). It requires the capacity for multiple concurrent encodings. Note that only one stream is high resolution; all other streams are lower resolutions. In this context, the Qualcomm approach, with some CPUs/GPUs more capable than others, makes a lot of sense. The magic number is 3 encoders in parallel for optimum quality.
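
Those three encodings can also be controlled at runtime through `setParameters`, for example to deactivate the top layer when the device is thermally or CPU constrained (the `rid` values follow the earlier simulcast sketch):

```typescript
// Disable the highest-resolution simulcast encoding without renegotiating.
async function dropTopLayer(sender: RTCRtpSender): Promise<void> {
  const params = sender.getParameters();
  const high = params.encodings.find((e) => e.rid === "high");
  if (high) {
    high.active = false; // the other encodings keep flowing
    await sender.setParameters(params);
  }
}
```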

6. Real-Time AV1 SVC, or other high-end codecs (e.g. Dolby Atmos)

There will always be a demand for the best quality possible: 12-bit, 4:4:4, lossless (no quantization, etc.). This will be our premium offer. AV1 is very interesting because of its widespread adoption on the decoder side, and because encoders will find their way into browsers very soon. Also, there are already many very good libraries implementing the codec, making adoption easier.

We believe this will be in very high demand for content production, especially for the latest stages of post-production (Color Grading). That being said, Millicast is codec agnostic on the platform side, and any codec could be added in principle.


Ryan Jespersen

Director of Product Strategy

Ryan Jespersen has extensive experience in the digital video, broadcast and streaming industries, specializing in live streaming workflows, protocols, transcoding, cloud computing, and complex video delivery systems at companies including Wowza, Millicast and now Dolby.
