A Modest Proposal for AV2

Occasionally I look at the experiments in the AV1 repository that should become the base for AV2 (unless Baidu rolls out VP11 from its private repository to replace it entirely). A year ago they added an intra mode predictor based on a neural network, and in August they added a neural-network-based loop filter experiment as well. So, both to make AV2 simpler to implement in hardware and to improve its compression efficiency, I propose switching all possible coding tools to misapplied statistics. This way it can also attract more people from the corresponding field to compensate for the lack of video compression experts. Considering the number of pixels in a modern video (let alone the ways to encode them), it is BigData™ indeed.

Anyway, here is what I propose specifically:

  • expand the intra mode prediction neural networks to also predict the block subdivision mode and the coding mode for each part (including transform selection);
  • replace plane intra prediction with a trained neural network that reconstructs the block from its neighbours;
  • switch motion vector prediction to a neural network that predicts from neighbouring blocks in the current and reference frames (the schemes in modern video codecs have become too convoluted anyway);
  • come to think of it, the neural network can simply output some weights for mixing several references in one block;
  • maybe even make a leap and ditch all the transforms, letting the model reconstruct the block from coefficients directly as well.
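
To make the reference-mixing item a bit more concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the "network" is a single hypothetical linear layer followed by a softmax, the context features and block values are made up, and none of the names come from the AV1 or AV2 code.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixing_weights(features, layer):
    """Hypothetical one-layer 'network': one row of coefficients per reference."""
    scores = [sum(c * f for c, f in zip(row, features)) for row in layer]
    return softmax(scores)


def blend(references, weights):
    """Weighted per-pixel average of equally sized blocks, rounded and clipped to 8 bits."""
    height, width = len(references[0]), len(references[0][0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            value = sum(w * ref[y][x] for w, ref in zip(weights, references))
            row.append(min(255, max(0, int(value + 0.5))))
        out.append(row)
    return out


# Two made-up 2x2 reference blocks and made-up context features.
ref_a = [[100, 100], [100, 100]]
ref_b = [[200, 200], [200, 200]]
layer = [[0.0, 0.0], [0.0, 0.0]]  # an all-zero layer yields equal weights
weights = mixing_weights([1.0, 2.0], layer)
predicted = blend([ref_a, ref_b], weights)
```

With the all-zero layer both references get the same weight, so the predicted block is their plain average. A trained layer would skew the weights towards whichever reference the context favours, which is all the bullet point asks for.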

As a result we’ll have a rather simple codec in which most blocks are neural networks doing specific tasks, plus an arithmetic coder to provide input values, some logic to connect those blocks together, and some leftover DSP routines, though I’m not sure we’ll need them at this stage. This will also greatly simplify the encoder, as it will be more a matter of producing fitting model weights than of trying some limited set of encoding combinations. And it may also be the first true next-generation video codec after H.261, paving the road to radically different video codecs.

From a hardware implementation point of view this will be a win too: you just need some ROM and RAM for the models plus a generic tensor accelerator (which are becoming common these days), with no need to design those custom DSP blocks.

P.S. Of course it may initially be slow and work in the range of thousands of FPS (frames per season), but I’m not going to use AV1, let alone AV2, so why should I care?

3 Responses to “A Modest Proposal for AV2”

  1. Peter says:

    Videos are getting that big nowadays, might as well ship the decoder neural network with the video. It will make the reverse engineering task much easier.

    This article further demonstrates a theory about “peak codec”. The analogy applies not only to oil, but vehicles themselves. Remember Japanese cars in the 1980s and early 1990s? They were the optimum balance of everything: usability, efficiency, serviceability, restorability, comfort too. Then the engineers got bored.

    Cheers

  2. Kostya says:

    You should remember Kolmogorov complexity: you might end up with NN weights occupying more space than an H.264 video with a full decoder.

    As for peak codec, we have an article on this theory in The Wiki. And the cars probably suffered mostly from the lack of fierce competition; rising demands (ecological and marketing ones) and diminishing returns from innovation made them boring soapboxes until somebody decided to bring electric cars into the mainstream.

    The same can be applied to codecs: at first you had rather fierce competition, with essentially every ecosystem (or multimedia fiefdom, as Mike called them) having its own codecs and container formats; then you reached the point where you gain compression either by having loads of tools, each improving compression by 0.0001%, or by applying transforms to larger blocks. As a result all codecs look about the same (like H.264/VP8/AVS/RV4 or H.265/VP9/AVS2/RV6) and the whole field awaits some disruptive idea (maybe one from the past) to challenge the current streamlined design.

  3. […] over each other plus adaptive weights also used in residuals coding. Of course that reminded me of AV2 and more specifically about neural networks. And what do you know, Monkey's Audio actually calls […]