Check out this part 1 notebook and this part 2 notebook that accompany this post!

Streaming video is a major part of how users consume information across a variety of applications, and as more users turn to mobile devices, screen sizes continue to increase as well.

At the same time, consumers expect high-quality video without lag or distortion. This poses an engineering challenge: optimizing how video is streamed to consumers across a wide variety of hardware.

Most streaming video services employ some form of adaptive bitrate encoding, whereby video is chunked into GOPs (groups of pictures) and encoded at different bitrates so that client devices can dynamically switch among them to accommodate changing network conditions.

[Figure: adaptive bitrate streaming]
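To make this concrete, a bitrate ladder is just an ordered set of resolution/bitrate rungs for clients to switch among. Here's a minimal sketch; the rung values are hypothetical, not a recommendation:

# A hypothetical bitrate ladder: each rung pairs a resolution (width:height)
# with a target bitrate in kbps; clients move up or down the ladder as
# network conditions change.
BITRATE_LADDER = [
    ("640:360", 800),
    ("1280:720", 2400),
    ("1920:1080", 4500),
    ("3840:2160", 12000),
]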

The perceived quality of streaming video can vary considerably depending upon bitrate and resolution. To minimize the presence of issues like compression artifacts, we would like to find the video encoding which optimizes for perceived quality.

We can measure the distortion of a signal with respect to a reference using PSNR. Perceptual metrics like SSIM, MOS, and VMAF are popular choices for video.
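For instance, PSNR is derived from the mean squared error between the distorted signal and the reference, where $\mathrm{MAX}_I$ is the maximum possible pixel value (255 for 8-bit video):

$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right)$$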

Groups like Netflix realized that encodings can be specialized to each title. They extended this to chunk-level optimization before using scene detection to apply the idea at the shot level.

[Figure: shot-level optimization]

The economics of this kind of optimization depend on the nature of the streaming content. For content which will be streamed over and over again, this can have tremendous impact!

Some characteristics of the content determine how much it benefits from these optimized encodings. For example, simple animated sequences are easy to compress compared to fast-motion, colorful ones.

[Figure: compression comparison]

Groups like Facebook and Twitter have also employed these kinds of optimizations to scale streaming services to mobile users more efficiently.

However, these gains are not free! Encoding video at various bitrate/resolution combinations before computing VMAF against a reference video is computationally expensive.

Ultimately, convex hull optimization over these rate-quality measurements is used to determine the optimal bitrate ladder.

[Figure: bitrate-ladder-1]

This plot illustrates the inefficiency of a grid search: many encoding combinations fall well below the Pareto front.
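To sketch the idea, here's a minimal stand-in for the full convex hull construction: filter a set of (bitrate, VMAF) probes down to their upper envelope, discarding dominated encodings. The probe values are hypothetical:

def upper_envelope(points):
    # points: (bitrate_kbps, vmaf) pairs. Keep only encodings that are not
    # dominated by a cheaper-or-equal, higher-quality encoding.
    front, best = [], float("-inf")
    for bitrate, vmaf in sorted(points, key=lambda p: (p[0], -p[1])):
        if vmaf > best:
            front.append((bitrate, vmaf))
            best = vmaf
    return front

probes = [(500, 55.0), (500, 60.2), (1000, 68.0), (1000, 71.5), (2000, 80.1)]
print(upper_envelope(probes))  # [(500, 60.2), (1000, 71.5), (2000, 80.1)]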

Much of the recent work in this space is directed at probing more efficiently than an exhaustive search.

[Figure: bitrate-ladder-2]

However, without a priori knowledge of the video, it’s difficult to guess the optimal bitrate range for each resolution. Instead, we need to extrapolate effectively.

[Figure: bitrate-ladder-3]

Generally, the distortion exhibits logarithmic decay as bitrate increases. To accommodate scaling and shifting of both the argument and the dependent variable, we choose to fit a curve of the form:

$$f(x \mid a, b, c) = a \log(x + b) + c$$

for learnable parameters $a$, $b$, and $c$.

This allows us to extrapolate beyond the range of bitrates we probe at each resolution, and thus to probe fewer points.
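As a sketch, such a fit takes only a few lines with scipy; the probe values below are hypothetical:

import numpy as np
from scipy.optimize import curve_fit

def rate_quality(x, a, b, c):
    # logarithmic rate-quality model: f(x | a, b, c) = a * log(x + b) + c
    return a * np.log(x + b) + c

# hypothetical (bitrate in kbps, VMAF) probes at one resolution
bitrates = np.array([500.0, 1000.0, 2000.0, 4000.0])
scores = np.array([61.8, 69.0, 76.8, 84.8])

(a, b, c), _ = curve_fit(rate_quality, bitrates, scores, p0=(10.0, 1.0, 0.0))
# extrapolate beyond the probed range, e.g. to 8000 kbps
print(rate_quality(8000.0, a, b, c))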

We’ve put together a Databricks notebook to show how you can apply Spark with FFmpeg to optimize the bitrate ladder of a sample 4K video.

Here’s an example wrapping an FFmpeg command in a custom UDF to extract VMAF scores, applied over a lattice of resolutions and bitrates. The command also avoids writing intermediate videos to disk by using pipe:.

import subprocess

from pyspark.sql import types as T
from pyspark.sql.functions import udf

@udf(returnType=T.DoubleType())
def rate_distortion(video, bitrate, resolution):
    # Encode the video at the target bitrate/resolution, pipe the raw output
    # into a second FFmpeg process, and score it against the reference with
    # libvmaf.
    width, height = map(int, resolution.split(":"))
    cmd = (
        "/home/ubuntu/ffmpeg/ffmpeg -i {} -vf scale={}:{} -c:v libx264 -tune psnr "
        "-x264-params vbv-maxrate={}:vbv-bufsize={} -f rawvideo pipe: | "
        "/home/ubuntu/ffmpeg/ffmpeg -i pipe: -i {} -filter_complex "
        '"[0:v]scale=1920x1080:flags=bicubic[main]; '
        "[1:v]scale=1920x1080:flags=bicubic,format=pix_fmts=yuv420p,fps=fps=30/1[ref]; "
        '[main][ref]libvmaf=psnr=true:log_path=vmaflog.json:log_fmt=json" -f null -'
    ).format(video, width, height, bitrate, bitrate, video)
    ps = subprocess.Popen(
        cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    output = ps.communicate()[0]
    # FFmpeg reports a line of the form "VMAF score: <value>"; parse it out.
    return float(output.decode().split("VMAF score: ")[-1].split("\n")[0])
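To give a sense of how such a UDF might be applied (the column names, grid values, and file path here are illustrative, not taken from the notebook; spark is the ambient SparkSession in a Databricks notebook):

from itertools import product

# One row per (bitrate, resolution) probe point for the sample video;
# Spark fans the FFmpeg work out across the cluster.
bitrates = [500, 1000, 2000, 4000, 8000]            # kbps caps
resolutions = ["640:360", "1280:720", "1920:1080"]

lattice = spark.createDataFrame(
    [("/dbfs/videos/sample_4k.mp4", b, r) for b, r in product(bitrates, resolutions)],
    schema=["video", "bitrate", "resolution"],
)

scores = lattice.withColumn("vmaf", rate_distortion("video", "bitrate", "resolution"))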

As the notebook shows, we can combine the power of MapReduce with that of FFmpeg to extract structured information from video at scale.

In a follow-up post, we’ll consider smarter sampling strategies which reduce computations further. Stay tuned!