A new/old video coding scheme

Now that AV2 has officially been released it’s time to remind that these codec advances are not for free. Here it is claimed that AV2 offers about 25% improvements in compression efficiency compared to AV1 with decoding being five times slower (and that’s for an extremely optimised decoder; no idea how much slower encoding is).

Back in the day I proposed to go full AI with the codec design but it was not a fully serious idea, here I’d like to propose something more realistic that would use “AI” hardware that we should have plenty of by now—and with no AVn codecs in mind.

The basic idea is that video compression is being improved mostly by trying more and more different approaches on varying blocks of data that depend on previously coded blocks of data. For modern mainline video codecs the structure is simple: divide frame into, say, 128×128 tiles, split those tiles into e.g. 64×64 blocks, decide whether you want intra or inter prediction there, split those blocks further and try some specific intra/inter coding mode on them, maybe split them further… One of the problems is combinatorial explosion as the number of coding approaches tried correlates with the product of coding alternatives at each stage, so adding just one new method (e.g. previously there was just one 8×8 DCT or 4×4 DCT approximation being employed in all cases, now you can select one out of several transforms let alone the fact that transform size is no longer fixed either) may increase the coding time by tens of percents (of course It Depends™ on the settings, heuristics used and so on but it always increases encoding time, and decoding time may be affected as well)—and now remember that there’s usually at least several new coding tools being added for each new codec iteration. The other problem is data interdependency: you can’t work on a new block before you encoded its neighbours since you may need its data for intra (spatial) prediction.

The answer to that so far was to add more coding tools, better heuristics and hope for Moore’s law to take care of the rest. What I propose is an alternative approach: split encoding into a phase that’s easy to do per block and the second stage that encodes data from the previous stage using context-dependent arithmetic coder or a variation of it. Approximately it’s like coding 8×8 with either intra DCT or DCT of the difference to the previous frame; zeroth stage would calculate those differences (or perform motion compensation), first stage would apply DCT on both intra and inter versions of the blocks, and the final stage will select either version to encode). Of course such scheme is too simple to produce good compression, so here are two methods from the past may come in handy. They did not gain much popularity for being too slow back in the day but are perfect for the modern hardware.

I’m talking about vector quantisation and matching pursuit. To oversimplify it, vector quantisation keeps a codebook of possible image pieces and either selects the most fitting one (like in Cinepak) or constructs image using several entries from its codebook (like mean-removed VQ used in SVQ1, I really should write an encoder for it). Matching pursuit is a similar idea where you construct an image from several pieces in its codebook. To help visualise that idea, look at 8×8 2D DCT bases (here’s a link to Wickedpedia) and think that any 8×8 block can be represented as a sum of some of those bases scaled by a certain number (and the transform itself produces a matrix of those weights for each of the bases—the fewer of them you use the better is compression). Now vector quantisation (one of the varieties) would be having, say, sixteen 8×8 bases and sixteen 4×4 bases for the refinement details (so you code it as one 8×8 base plus four 4×4 base applied as a difference to each quadrant). Matching pursuit means you may have 256 various bases and you select an image as a set of, say, 1-3 of them. It may look excessive at the first glance but it should work better on representing e.g. diagonal and curved lines.

It should be obvious that a scheme employing VQ/MP plus coding of its results offers both scalability and flexibility: as with an ordinary transform you may decide to discard or transmit less important details separately. And the requirement for searching for the best candidate (which made it impractical back in the day) should be no problem for GPUs or TPUs (and with enough parallelism CPUs as well).

So the final scheme should look like this: motion search, VQ or MP for intra and inter blocks, context-dependent coding for indices and motion vectors (and maybe loop filtering). The main challenges here are designing good codebooks for intra blocks and inter differences (which probably just needs collecting a data on encoding a lot of content in different genres and sizes; maybe it makes sense to have several codebook sets for e.g. QCIF, FullHD and UHD content) and devising a scheme for effective compression of the codebook indices (of course they should be coded in a similar manner as generic integer values are coded in H.26x or VPy but maybe they can be predicted as well?). At least both the encoder and the decoder for such format should be much simpler than the state of the art video codecs so more efforts can be dedicated to improving existing few things instead of finding a way to bolt on a new coding tool and then research when actually use it.

P.S. For rather obvious reasons I’m not going to even attempt to design such format. But if somebody wants to do that and name the resulting codec Sleet, Photon2 or Nadd—be my guest.

4 Responses to “A new/old video coding scheme”

  1. Paul says:

    Bink Video 3.0 format

  2. Kostya says:

    Bink2 author Fabian Giesen has his own blog, petition him instead.

  3. It has been a long time since I have heard of AI applied to video compression. When I did, it essentially boiled down to using neural networks to perform optimal vector search in a VQ scheme, and that was like 20+ years ago. There are much better opportunities now, and I assume the big boys already leverage the techniques.

    I suspect if you asked a modern LLM to compress a video optimally, it would say, “Sure! Here are some FFmpeg command lines you can use…”

    And how did I never know about that Bink2 author’s blog?

  4. Kostya says:

    I suspect that like fashion it works in cycles. I still remember virtual reality being a hot new thing in mid-90s (and AI was a hot research topic in 1960s and 1980s apparently). Maybe it’s time to re-discover some forgotten techniques and put them to work again. I don’t think that fractal compression or wavelets would ever be good, but VQ can be optimal and there’s more to it than representing small blocks with two colours and a mask.

    As for the blog, I learned about it when somebody shared a link to https://fgiesen.wordpress.com/2013/11/04/bink-2-2-integer-dct-design-part-1/ in the comments to my post about something Bink2.

Leave a Reply