VP8: dubious decisions, deficiencies and outright idiocy

I’ve finally finished the VP8 decoder for NihAV (done mainly by hacking the already existing VP7 decoder) and I have some unpleasant words to say about VP8. If you want to read praise for the first modern open-source patent-free video codec (and essentially the second one since VP3/Theora) then go and read any piece of news from 2011. Here I present my experience implementing the format and what I found not so good or outright bad about the “specification” and the format itself.

Dubious format decisions

Here I list (in no particular order) the format peculiarities that I found strange.

First of all, Y2 block prediction. Y2 is a special block containing the DC coefficients of all Y blocks in a macroblock. In H.264 the same thing happens only for the special Intra16x16 macroblock type, in VP7 it is present in all macroblock types but one, and in VP8 in all macroblock types but two. That is fine, but the decision to use the last macroblock in the same row/column that has a Y2 block as the coding context is strange and annoying, as it essentially forces you to implement it the same way the codec does—by keeping a cache that is updated only when a macroblock has a valid Y2 block. The same can be said about top-right prediction for an unavailable block, which relies on the same trick (instead of e.g. simply replicating the last available top-right pixel).
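
Here is what that forces on a decoder, as a minimal sketch (the names and layout are mine, not from libvpx or NihAV): the coefficient context for Y2 comes from caches that are touched only when a macroblock actually has a Y2 block.

    // A cache of "was the last Y2 block non-zero" per row and per column;
    // it is deliberately left stale for macroblocks without a Y2 block.
    struct Y2Context {
        left_nz: bool,      // last macroblock with Y2 in the current row
        top_nz:  Vec<bool>, // per column: last macroblock with Y2 in that column
    }

    impl Y2Context {
        fn new(mb_width: usize) -> Self {
            Self { left_nz: false, top_nz: vec![false; mb_width] }
        }
        // reset the row cache at the start of each macroblock row
        fn start_row(&mut self) { self.left_nz = false; }
        // context index for the boolean decoder, derived from the cached neighbours
        fn context(&self, mb_x: usize) -> usize {
            (self.left_nz as usize) + (self.top_nz[mb_x] as usize)
        }
        // update only when the current macroblock actually has a Y2 block
        fn update(&mut self, mb_x: usize, has_y2: bool, y2_nonzero: bool) {
            if has_y2 {
                self.left_nz = y2_nonzero;
                self.top_nz[mb_x] = y2_nonzero;
            }
        }
    }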

The whole “B-frames but not as effective” business. While VPx codecs have no frame reordering or real B-frames, starting with VP8 those can be emulated. In order to do that you can create a special hidden reference frame (there’s a special bit in the frame header telling the decoder not to show the frame) and use it as a forward reference. And there are special sign bias flags for the golden and altref frames that invert a candidate motion vector when it is borrowed from a block using a reference frame with a different sign bias than the current one (which allows emulating forward/backward MVs better). This sucks compared to real B-frames for several reasons: you don’t display the frame (unless you use a specially crafted frame that essentially tells the decoder to copy it in full, macroblock by macroblock), you don’t have bidirectional prediction, and instead of reordering you now have to deal with invisible frames. It gets uglier in VP9 with its superframes (aka storing two frames together like it’s DivX in AVI) and AV1 with its multitude of non-B reference frames.
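
For illustration, here is roughly how that sign bias trick can look in a decoder (a sketch with my own naming, not a quote from libvpx or the RFC; the default bias for the last frame is undocumented, as complained about below):

    // A candidate MV borrowed from a block that uses a different reference
    // frame gets negated when its frame's sign bias flag differs from the
    // current reference's (sign_bias is indexed by last/golden/altref).
    fn borrow_candidate(mv: (i16, i16), cand_ref: usize, cur_ref: usize,
                        sign_bias: &[bool; 3]) -> (i16, i16) {
        if sign_bias[cand_ref] != sign_bias[cur_ref] {
            (-mv.0, -mv.1)
        } else {
            mv
        }
    }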

Speaking of a related topic, frame management is questionable as well: you have three flags for setting the last/golden/altref frame to the currently decoded frame, plus two additional update modes for setting the golden and altref frames to one of the previous reference frames in case they are not updated directly. Add the flags for not preserving the updated probabilities for later and you get droppable frames. So what H.265 did by introducing multiple frame types (which is not perfect either), VP8 does with lots of flags in the frame header.
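
Collected in one place, this is roughly the bookkeeping a decoder ends up with (a sketch with my own names and types, not the actual header layout; the copy modes are the “set golden/altref from another reference” options mentioned above):

    // A sketch of the reference update flags; "Keep" means the buffer stays as is.
    enum CopyMode { Keep, FromLast, FromOtherRef }

    struct RefUpdateFlags {
        refresh_last:    bool,     // replace the last-frame buffer with the current frame
        refresh_golden:  bool,     // same for the golden frame...
        refresh_altref:  bool,     // ...and the altref frame
        copy_to_golden:  CopyMode, // used when refresh_golden is not set
        copy_to_altref:  CopyMode, // used when refresh_altref is not set
        refresh_entropy: bool,     // whether the updated probabilities are kept for later
    }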

Another fine point is related to the frame header bits and not keeping the probabilities. For some reason the flag for that (called refresh_entropy_probs) is stored after the segmentation map information that may update the segment ID tree probabilities. And considering that one of the conformance samples exploits this, you need to save the old probabilities in advance so they can be restored when the flag you read later tells you the updates should not be kept.
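
A minimal sketch of the workaround, assuming refresh_entropy_probs set means “keep the updated probabilities for the following frames” (the types and the boolean coder trait here are mine, just to keep the example self-contained):

    // Hypothetical container for the entropy state (token/MV/segment tree probs).
    #[derive(Clone)]
    struct EntropyState { /* probability tables go here */ }

    // Hypothetical boolean coder interface for the sketch.
    trait BoolCoder { fn read_bool(&mut self) -> bool; }

    // The snapshot has to be taken before the segmentation data is parsed,
    // because refresh_entropy_probs is only read later in the header.
    fn parse_frame_header<B: BoolCoder>(bc: &mut B, entropy: &mut EntropyState) {
        let saved = entropy.clone();
        // ... parse segmentation data (may update the segment ID tree probs
        //     stored inside `entropy`) ...
        // ... parse the other header fields ...
        let refresh_entropy_probs = bc.read_bool();
        // ... parse the rest of the header and decode the frame ...
        if !refresh_entropy_probs {
            // the updates applied to this frame only, so restore the old state
            *entropy = saved;
        }
    }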

And it’s worth mentioning the whole story with version 3. It should’ve been called a profile, but I’ve already mentioned that in my specification review. While the document claims this version should use no reconstruction filter (meaning only fullpel motion vectors), in reality only chroma MVs are adjusted while luma still has halfpel precision (interpolated with bilinear filters). I’d blame it on a design flaw rather than a documentation flaw, since the intent clearly was not to use interpolation.

Bitstream specification deficiencies

There are many things that are not well documented (if at all) in the text, and often you have to refer to the included libvpx source code to find out how it actually works:

  • I find it funny that, despite months of work on the RFC, the information about which bit value corresponds to intra frames (it is zero) is only mentioned in the errata;
  • reference frame sign bias (you have two flags but three reference frames, and the default sign bias value for the last frame is not mentioned);
  • the whole motion vector prediction process (more on it below);
  • data partitioning lacks the crucial bit of information: which data each partition contains. It turns out the partitions hold interleaved macroblock rows (e.g. with two partitions one contains all even rows and the other all odd rows; see the sketch after this list), but you won’t find that in the text;
  • the actual reference frame update order is not documented (and it’s not obvious either);
  • loop filter adjustments—you read those for each reference frame and macroblock type, but the actual order in which they are stored is not documented (and the macroblock type order is hard to guess).
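
As a tiny illustration of the data partitioning point, the row-to-partition mapping I ended up with is simply this (a sketch; num_partitions is the coefficient partition count from the frame header):

    // Which coefficient partition holds the data for a given macroblock row;
    // this is the undocumented interleaving described in the list above.
    fn partition_for_mb_row(mb_row: usize, num_partitions: usize) -> usize {
        mb_row % num_partitions
    }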

Now, the whole motion vector prediction process. The “specification” says:

Much of the process is more easily described in C than in English.

And that’s only because copy-pasting C code is much easier than thinking about what you’re actually doing and describing it. Especially if your code is equally confusing.

For starters, it does not mention the fact that predicted vectors should be clipped so that they don’t reference an area farther than 16 pixels outside the frame edge (well, the code has it). Then there’s an inane rambling about macroblocks:

As is the case with many contexts used by VP8, it is possible for
macroblocks near the top or left edges of the image to reference
blocks that are outside the visible image. VP8 provides a border of
1 macroblock filled with 0x0 motion vectors left of the left edge,
and a border filled with 0,0 motion vectors of 1 macroblocks above
the top edge.

In reality those non-existing macroblocks (just like intra-coded macroblocks) should simply not be taken into account when predicting motion vectors.

Then there’s another uncertainty not resolved in the document (and not apparent from the code either): you can have three motion vector candidates with weights 2, 2 and 1; weights for identical motion vectors are summed together; the predicted MV is the one with the largest weight, and the nearest MV is the non-zero MV with the highest weight. Now, I got one zero MV with weight 2 and one non-zero MV with weight 2, so which one should I pick for the prediction? The answer turned out to be the non-zero one.
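
Here is how I ended up interpreting the whole selection (a sketch in my own words, not a quote from any decoder): merge identical candidates, take the heaviest one, and on a weight tie prefer the non-zero vector.

    // Merge identical candidate vectors, then pick the heaviest one; on a
    // weight tie the non-zero vector wins (the undocumented part).
    fn predict_mv(candidates: &[((i16, i16), u32)]) -> (i16, i16) {
        let mut merged: Vec<((i16, i16), u32)> = Vec::new();
        for &(mv, w) in candidates {
            match merged.iter_mut().find(|(m, _)| *m == mv) {
                Some(entry) => entry.1 += w,
                None => merged.push((mv, w)),
            }
        }
        let mut best: ((i16, i16), u32) = ((0, 0), 0);
        for &(mv, w) in &merged {
            if w > best.1 || (w == best.1 && best.0 == (0, 0) && mv != (0, 0)) {
                best = (mv, w);
            }
        }
        best.0
    }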

Outright idiocy

There are not that many such issues, but as the saying goes, even once is one time too many.

First of all, there’s their wonderful decision to set unavailable left pixels to 129 and unavailable top pixels to 127. It can only be explained by some flaw in their encoder producing slightly biased images that they compensated for with this trick (RealVideo 3 and 4 had the same problem). And unless you do it the way they do (by adding invisible edges to the frame and filling them with the desired values), it’s rather annoying to implement (for instance, you need to decide the value of the top-left pixel depending on which edge you’re on).
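
A minimal sketch of those rules, assuming a per-macroblock prediction buffer instead of libvpx’s extended frame edges (the function, the array sizes and the exact top-left handling are my reading of it, not a quote from the spec):

    // Fill the prediction edges when neighbours are missing: 127 for the top
    // row, 129 for the left column; the top-left sample follows whichever
    // edge is unavailable (array sizes here are just illustrative).
    fn fill_unavailable_edges(top: &mut [u8; 21], left: &mut [u8; 16], tl: &mut u8,
                              has_top: bool, has_left: bool) {
        if !has_top {
            top.fill(127);
        }
        if !has_left {
            left.fill(129);
        }
        if !has_top {
            *tl = 127;      // a missing top row wins for the corner
        } else if !has_left {
            *tl = 129;      // top row present but left column missing
        }
    }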

And as a cherry on top you have the test vectors. With odd dimensions. In YUV420 subsampling. Two such files, actually (vp80-00-comprehensive-006.ivf and vp80-00-comprehensive-014.ivf). And even if you ignore the chroma issues, the encoded data is completely distorted by the encoder (I even checked vpxdec output to be sure). And yet somebody thought it was a good idea to have these files and somebody else approved it. Sigh.

Conclusion and useless rant

I implemented a decoder for this format mainly to have complete support for the Duck family of codecs. Of course there are VP9 and VP10 (aka AV1), but those were done by a new owner (even if most of the people are still the same) and they’ve changed some approaches (“open-source” development from the start, dropping even the pretence that they care about documenting it, tying it all to a new ecosystem and so on), so I can’t consider them to be Duck codecs.

And since it deserves mentioning somewhere (and I’m not going to do VP9): there’s an alternative semi-legendary software VP9 encoder from Two Orioles which gives better compression than libvpx at the same speed (and somebody might even use it instead of libvpx). I call it semi-legendary because I’ve never seen any evaluations of it, so you just have to believe it exists and works as well as claimed. But I don’t call it fully legendary because I know both Ronald (who is quite capable of delivering such a product) and the original code (which is not that good to begin with, so beating it is not as hard as making an H.264 encoder faster and better than x264). And it gets even funnier with AV1, where even AOM members would prefer SVT-AV1 as the main codebase instead of libaom, and libgav1 is being developed while the much better dav1d exists.

The closer you look at those formats the less you want to do that. And the more I look at how it all works, the less I believe in their development process. ITU video codecs may not be perfect but they’re made in an open way with an intent to produce a standard (and some reference code to verify your implementations against). VPx is just a source codebase with some optional documentation, and probably no one can explain the rationale behind some decisions or how some parts of the code really work. And if you wonder whether it can be worse: yes, it can: AVS (the Chinese standard rip-off of H.264) had communication problems and ended up in complete disarray, with the standard, the reference code and the reference test samples not fully agreeing with each other.

I still believe that the only proper way for a standard to work is to have the working group produce a text specification, have two independent parties produce an encoder and a decoder using just that specification, and compare them all against the reference code (i.e. how well the encoded samples from both encoders can be decoded with both decoders)—and then you make amendments to the specification if the results disagree or if the implementers have objections to some parts of it. This way you can be sure that your specification is understandable and complete. Sadly I think the only times it worked outside ITU and MPEG were with VC-1 (the reference decoder for this micro and soft codec was written by ARM!) and FFV1. And (at least if you believe Chiariglione) MPEG is no more, so I expect decent formats with good specifications to appear even less often than before.

Mind you, I expect more codecs to be standardised in the upcoming years, but I also expect them to be more or less fast-tracked, similarly to VP8, where various experts from the open-source community were allowed to evaluate it before the official open-sourcing event and whatever issues they found were ignored in order to “keep compatibility with the already released binary version of the codec”. And if you think I blame only On2 for producing a sub-standard format description, you can also look at Xiph. While their audio codecs are objectively good, only Vorbis is well documented, FLAC is mostly documented, Speex is undocumented (besides several papers on the codec theory) and Opus is essentially done the same way as VP8 (an RFC containing source code). And the results are similar: Vorbis decoders have been written independently in several programming languages, FLAC decoders even more so (even I have written one), Speex has a single implementation (sometimes translated into another programming language by a tool), and Opus has only one alternative implementation based on the original C code (plus a lot of failed attempts to write a decoder in Rust; even Luca has tried and failed at that, and hopefully I won’t have to go there). And I think I mentioned before that audio codec specifications outside MPEG often lack crucial details as well (especially the ones from DT$ with their “see the code we commercially license to you for details” attitude, though IIRC some ITU G.72x speech codecs were essentially a source dump with an accompanying note).

There are many ways to look at codecs: patent freedom, open-source implementations, compression ratio, speed and so on. I don’t care about patents (given a disassembler and enough time all codecs are open source), I don’t encode stuff myself—so I care mostly about originality and how easy it is to implement a decoder. VPx codecs have some originality in them (VP8 as well, even if it feels like a regression from VP7), but the VP8 decoder was horrible to implement and that counts too. Hopefully I’ll never have to deal with the post-Duck codecs.

6 Responses to “VP8: dubious decisions, deficiencies and outright idiocy”

  1. Paul says:

    … and enough time all codecs are open source – Except bink2.

  2. Kostya says:

    Well, just take more time, I believe in you.

  3. lu says:

    Oh well, Opus is two decoders in one: one decoder is easier to implement and write tests for (and I did), the other has documentation that is less simple to follow, as it sort of refers to the reference code for some details, and making tests is much less straightforward.

    I will complete it once I find the will to extract specific test vectors and use them to validate what I wrote for the CELT side.

  4. Paul says:

    Any codec can be “bullied”, except maybe raw video.

  5. Ed says:

    Part of the reason I still hope VVC wins the battle. But it is increasingly hard to paint a decent picture.

  6. Kostya says:

    Well, you have a choice between the patent-clean AV1 and the very patent-clean VVC (I expect several patent pools to form around it, as was the case with its predecessor).

    Greed—ruining multimedia ecosystems since MP3.
