Occasionally I look at the experiments in AV1 repository that should be the base for AV2 (unless Baidu rolls out VP11 from its private repository to replace it entirely). A year ago they added intra modes predictor based on neural network and in August they added a neural network based loop filter experiment as well. So, to make AV2 both simpler to implement in hardware and improve its compression efficiency I propose to switch all possible coding tools to use misapplied statistics. This way it can also attract more people from the corresponding field to compensate the lack of video compression experts. Considering the amount of pixels (let alone the ways to encode them) in a modern video it is BigData™ indeed.
Anyway, here is what I propose specifically:
- expand intra mode prediction neural networks to predict block subdivision mode and coding mode for each part (including transform selection);
- replace plane intra prediction with a trained neural network to reconstruct block from neighbours;
- switch motion vector prediction to use neural network for prediction from neighbouring blocks in current and reference frames (the schemes in modern video codecs become too convoluted anyway);
- come to think about it, neural network can simply output some weights for mixing several references in one block;
- maybe even make a leap and ditch all the transforms for reconstructing block from coefficients directly by the model as well.
In result we’ll have a rather simple codec with most blocks being neural networks doing specific tasks, an arithmetic coder to provide input values, some logic to connect those blocks together, and some leftover DSP routines but I’m not sure we’ll need them at this stage. This will also greatly simplify the encoder as well as it will be more of a producing fitting model weights instead of trying some limited encoding combinations. And it may also be the first true next generation video codec after H.261 paving the road to radically different video codecs.
From hardware implementation point of view this will be a win too, you just need some ROM and RAM for models plus a generic tensor accelerator (which become common these days) and no need to design those custom DSP blocks.
P.S. Of course it may initially be slow and work in a range of thousands FPS (frames per season) but I’m not going to use AV1 let alone AV2 so why should I care?