VP8: specification analysis

In a recent post titled Is VP8 a Duck codec? the majority (both commenters) decided it’s a Duck codec after all so I’ll have to implement a decoder for it in NihAV. Back in the day Jason from x264 looked at it from his perspective and found it inferior in most parts to H.264 (and rightfully so). That post was the most popular on multimedia.cx ever since Steve Jobs replied with a link to it once. But since those days too many things have changed, there’s no Jobs, there’s no Jason, his blog is deleted and all you can find is an archived copy. And now it’s my turn to look at VP8 and see how it fares against other codecs I know.

And of course I start with its specification.

The main thing that has happened since the 2010 release of VP8 as the next greatest opensource codec since VP3 was the fact that we’ve managed to find VP6 and VP7 specifications on pudn.com (another site that is no longer with us). It is worth noting that there were requests to make older On2 code public but they were denied since “the developers would rather work on improving new codec” (and the fact that there were textual specifications was not disclosed either). Anyway, let’s see how the old VP7 specification fares against RFC 6386.

Overall document structure

VP7 specification structure:

  1. Introduction
  2. Uncompressed Frame Format
  3. Compressed Frame Types
  4. Overview of Compressed Data Format
  5. Overview of the Decoding Process
  6. Description of Algorithms
  7. Boolean Entropy Decoder
  8. Basic Data Components
  9. Frame Header
  10. Macroblock Features
  11. Key Frame Macroblock Prediction Records
  12. Luma Modes
  13. Intra Prediction Process
  14. DCT Coefficient Decoding
  15. DCT Inversion and Macroblock Reconstruction
  16. Loop Filter
  17. Interframe Macroblock Prediction Records
  18. Motion Vector Decoding
  19. Inter-prediction Buffer Calculation
  20. Golden Frame Update
  21. Document Revision History

VP8 specification structure:

  1. Introduction
  2. Format Overview
  3. Compressed Frame Types
  4. Overview of Compressed Data Format
  5. Overview of the Decoding Process
  6. Description of Algorithms
  7. Boolean Entropy Decoder
  8. Compressed Data Components
  9. Frame Header
  10. Segment-Based Feature Adjustments
  11. Key Frame Macroblock Prediction Records
  12. Intraframe Prediction
  13. DCT Coefficient Decoding
  14. DCT and WHT Inversion and Macroblock Reconstruction
  15. Loop Filter
  16. Interframe Macroblock Prediction Records
  17. Motion Vector Decoding
  18. Interframe Prediction
  19. Annex A: Bitstream Syntax
  20. Attachment One: Reference Decoder Source Code
  21. Security Considerations
  22. References

As you see, nothing in common.

Introduction

It seems to be almost the same except for minor wording changes and the usual RFC disclaimer about modal verbs added in VP8 specification.

The only major difference is the first paragraph.

VP7:

This document describes the VP7 compressed video data created by On2 Technologies Inc. together with a discussion of the decoding procedure for this format. It is intended to be used in conjunction with, and as a guide to, the reference decoder provided by On2 Technologies.

VP8:

This document describes the VP8 compressed video data format, together with a discussion of the decoding procedure for the format. It is intended to be used in conjunction with, and as a guide to, the reference decoder source code provided in Attachment One (Section 20). If there are any conflicts between this narrative and the reference source code, the reference source code should be considered correct. The bitstream is defined by the reference source code and not this narrative.

And I find it funny how they changed “contemporary video compression schemes” to “modern video compression schemes” in the next paragraph.

Chapters 2-8 (various overviews)

Those chapters are more or less the same except that VP8 introduces WHT for macroblock DC coefficients and altref frame (which is essentially the second golden frame). And the code is snake_case instead of camelCase.

Additionally VP8 decided to move frame tag description to frame header and not mention it at all in chapter 4.

Also I’d like to note that bool coder chapter in VP6 specification was shorter, faster to the point and did not have bool encoder source code (which is rather strange to see in the decoder description outside some annex describing how encoding should work).

Frame header

That’s where the real differences start.

VP8 introduces profiles (called versions because VP version 8 version 0 sounds great) that restrict motion interpolation and loop filter. So e.g. version 3 means no sub-pel precision and no loop filter (and if you want to call it simple profile instead, what’s wrong with you?).

Also it’s where the real ugliness starts, just look at these pieces of code:

  #if defined(__ppc__) || defined(__ppc64__)
  # define swap2(d)  \
    ((d&0x000000ff)<<8) |  \
    ((d&0x0000ff00)>>8)
  #else
    # define swap2(d) d
  #endif

  pc->Width      = swap2(*(unsigned short*)(c+3))&0x3fff;

I try not to be a perfectionist but this is idiotic. Simply reading the 16-bit value as two bytes shifted to their positions would be portable and free from the potential alignment and aliasing issues. And do not try to convince me the speed would matter in reading the frame header (as opposed to decoding bools and performing DSP routines). This is bad code and it makes me suspect the rest of libvpx code is not of high quality either.
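For illustration, here is a minimal sketch of the byte-wise approach. The names read_le16, KeyFrameDims and parse_dims are made up for this example; the assumption (taken from the quoted line and the neighbouring lines of the RFC code) is that c points at the three-byte key frame start code, so the width field sits at c + 3 and the height field at c + 5, each with a 2-bit scale in the top bits.

```c
#include <stdint.h>

/* Read a 16-bit little-endian value one byte at a time:
 * no unaligned access, no aliasing tricks, no per-platform #ifdef. */
static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Hypothetical container for the key frame dimension fields. */
typedef struct {
    unsigned width, hscale;
    unsigned height, vscale;
} KeyFrameDims;

/* Assumes c points at the 3-byte start code, as in the quoted snippet. */
static KeyFrameDims parse_dims(const uint8_t *c)
{
    uint16_t w = read_le16(c + 3);
    uint16_t h = read_le16(c + 5);
    KeyFrameDims d;

    d.width  = w & 0x3fff;   /* low 14 bits: dimension */
    d.hscale = w >> 14;      /* top 2 bits: scaling hint */
    d.height = h & 0x3fff;
    d.vscale = h >> 14;
    return d;
}
```

The same code works unchanged on big-endian and little-endian machines, which is the whole point.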

There’s also a concept of segments and frame partitioning mentioned here. I’d say that those segments look suspiciously like renamed macroblock features (but with golden frame update flag and interlaced mode dropped from it).

Intra prediction

The main differences here are mentioning mb_skip_coeff (why here?!) and the fun change so that pixels outside the coded boundaries are not 128 as in sane formats (or even VP7) but it’s 129 for the pixels behind the left edge and 127 for the pixels above the top edge. Except in certain prediction modes. I hope that saved them a lot of bits in encoding.
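A rough sketch of what that edge rule means for a decoder, using made-up helper names and a plain 8-bit plane with a stride. It deliberately ignores the top-left corner and the prediction modes that handle missing edges differently, so treat it as an illustration of the 127/129 rule only.

```c
#include <stdint.h>

/* Substitute values for pixels outside the coded area (names are hypothetical). */
enum { EDGE_ABOVE = 127, EDGE_LEFT = 129 };

/* Sample directly above (x, y); x is assumed to be inside the plane. */
static uint8_t sample_above(const uint8_t *plane, int stride, int x, int y)
{
    return (y > 0) ? plane[(y - 1) * stride + x] : EDGE_ABOVE;
}

/* Sample directly to the left of (x, y); y is assumed to be inside the plane. */
static uint8_t sample_left(const uint8_t *plane, int stride, int x, int y)
{
    return (x > 0) ? plane[y * stride + (x - 1)] : EDGE_LEFT;
}
```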

Overall methods description became worse. In VP7 specification it was like this:

12.1.5 B_LD_PRED
This intra 4x4 prediction process is invoked when Block Prediction Mode is set to B_LD_PRED.

The values of prediction samples are derived by:

Predictor [ 0] = (A[0] + A[1] * 2 + A[2] +2)>>2;
Predictor [ 1] =
Predictor [ 4] = (A[1] + A[2] * 2 + A[3] +2)>>2;
Predictor [ 2] =
Predictor [ 5] =
Predictor [ 8] = (A[2] + A[3] * 2 + A[4] +2)>>2;
Predictor [ 3] =
Predictor [ 6] =
Predictor [ 9] =
Predictor [12] = (A[3] + A[4] * 2 + A[5] +2)>>2;
Predictor [ 7] =
Predictor [10] =
Predictor [13] = (A[4] + A[5] * 2 + A[6] +2)>>2;
Predictor [11] =
Predictor [14] = (A[5] + A[6] * 2 + A[7] +2)>>2;
Predictor [15] = (A[6] + A[7] * 2 + A[7] +2)>>2;

in VP8 specification it’s

 case B_LD_PRED:    /* southwest (left and down) step =
                       (-1, 1) or (1,-1) */
     /* avg3p(A + j) is the "smoothed" pixel at (-1,j) */
     B[0][0] = avg3p(A + 1);
     B[0][1] = B[1][0] = avg3p(A + 2);
     B[0][2] = B[1][1] = B[2][0] = avg3p(A + 3);
     B[0][3] = B[1][2] = B[2][1] = B[3][0] = avg3p(A + 4);
     B[1][3] = B[2][2] = B[3][1] = avg3p(A + 5);
     B[2][3] = B[3][2] = avg3p(A + 6);
     B[3][3] = avg3(A[6], A[7], A[7]); /* A[8] does not exist */
     break;

Each version has its advantages and drawbacks but VP7 version is clearer since you don’t need to remember how that avg3p() works (and it looks like they cared to document it instead of copy-pasting the source code).
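To show that the two descriptions are the same thing, here is a small sketch with both. avg3() and avg3p() follow the definitions given in RFC 6386; ld_pred_flat and ld_pred_rfc are names invented for this example, and the VP7 table is collapsed into a loop over the diagonal index.

```c
#include <stdint.h>

/* 3-tap smoothing with rounding, as defined in RFC 6386. */
static uint8_t avg3(int x, int y, int z)
{
    return (uint8_t)((x + y + y + z + 2) >> 2);
}

/* Same filter centred on *p. */
static uint8_t avg3p(const uint8_t *p)
{
    return avg3(p[-1], p[0], p[1]);
}

/* VP7-document style: 16 predictors in raster order, formula spelled out.
 * A holds the 8 pixels above and above-right of the block. */
static void ld_pred_flat(const uint8_t A[8], uint8_t pred[16])
{
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) {
            int i = r + c;                            /* diagonal index, 0..6 */
            int right = (i + 2 < 8) ? A[i + 2] : A[7]; /* A[8] does not exist */
            pred[r * 4 + c] = (uint8_t)((A[i] + A[i + 1] * 2 + right + 2) >> 2);
        }
    }
}

/* VP8-document style, as quoted above. */
static void ld_pred_rfc(const uint8_t A[8], uint8_t B[4][4])
{
    B[0][0] = avg3p(A + 1);
    B[0][1] = B[1][0] = avg3p(A + 2);
    B[0][2] = B[1][1] = B[2][0] = avg3p(A + 3);
    B[0][3] = B[1][2] = B[2][1] = B[3][0] = avg3p(A + 4);
    B[1][3] = B[2][2] = B[3][1] = avg3p(A + 5);
    B[2][3] = B[3][2] = avg3p(A + 6);
    B[3][3] = avg3(A[6], A[7], A[7]);
}
```

Both functions fill the block with identical values; the VP7 form simply does not make the reader look up a helper first.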

DCT coefficient decoding

The coding itself is conceptually still the same as it was in VP5 except that since blocks contain just 16 coefficients, there are no explicitly coded zero runs.

Also while the VP7 specification lacks a description of actual block decoding, the VP8 document provides a very pseudocode-like description with items marked like **this** to denote things that are done somewhat differently in the actual source code but should mean the same thing, like:

     if ( **token_has_extra_bits(token)** )
     {
         extraBits = DCTextra( token );
         absValue =
             categoryBase[**token_to_cat_index(token)**] +
       extraBits;
     }
     else
     {
         absValue = **token_to_abs_value(token)**;
     }

Why not simply declare it all pseudocode instead of doing something not unlike what the reference code does but in a different way?
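For comparison, the same step can be written as plain compilable C without any special markers. This is only a sketch: the token order and category base values are the ones I know from RFC 6386, the function names are made up, and the extra bits are passed in already decoded so that the bool decoder can be left out.

```c
#include <stdint.h>

/* Token order as in RFC 6386 (identifier spelling is mine). */
typedef enum {
    DCT_0, DCT_1, DCT_2, DCT_3, DCT_4,
    DCT_CAT1, DCT_CAT2, DCT_CAT3, DCT_CAT4, DCT_CAT5, DCT_CAT6,
    DCT_EOB
} DctToken;

/* Smallest absolute value covered by each category token. */
static const int category_base[6] = { 5, 7, 11, 19, 35, 67 };

static int token_has_extra_bits(DctToken t)
{
    return t >= DCT_CAT1 && t <= DCT_CAT6;
}

/* extra: value of the extra bits, assumed to be decoded by the caller.
 * Returns -1 for end-of-block since it carries no coefficient. */
static int token_abs_value(DctToken t, int extra)
{
    if (t == DCT_EOB)
        return -1;
    if (token_has_extra_bits(t))
        return category_base[t - DCT_CAT1] + extra;
    return (int)t;  /* DCT_0..DCT_4 encode their value directly */
}
```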

Macroblock reconstruction

VP8 finally has a sub-section dedicated to dequantisation with the actual quantisers (that’s what VP7 was missing).

And now it has a new integer approximation of DCT plus WHT for the blocks of luma DCs (VP7 applied the same transform to all blocks).

And finally, it seems that VP7 had a special meta-DC prediction mode (which predicted the DC of the block of transformed DCs for luma blocks) but looks like it was dropped (probably for too little bitrate improvement).

Chapters 16-18 (loop filter, MVs, motion compensation)

While the actual algorithms and data remain the same, the VP8 specification is much wordier and quotes source code instead of explaining how it’s done and giving just a minimal piece of clear pseudocode, as VP7 did.

The only substantial difference is that motion vectors now can be two bits longer so the probability tables for them were changed.

Annex A: Bitstream Syntax

This was not present in VP7 specification and generally it’s a good idea. The only irritating thing is using the table format from H.264 specification and presenting it in text form which may give some people NUT flashbacks.

Security Considerations

Another boilerplate thing added for RFC. I have an older draft of the specification in PDF form from the WebM Project site and it ends with the annex A and the references section is put before it.


Conclusions

This comparison had two goals: to see how VP8 differs from VP7 and how good its specification really is.

I remember VP8 being a version of VP7 with some of its features stripped out. And it turns out to be mostly true. Partial golden frame update, variable pitch mode (good riddance!) and some other things are gone. DCT was replaced with a DCT and WHT combination, another golden frame is added and now there’s coefficient data partitioning. Not an equal exchange in my opinion.

What about the specification itself and its usefulness? I’d say that previous releases were slightly more useful since they accompanied the decoder source code and tried to explain how it all works. Now, as it went public, they had essentially the same document but had to make it into a stand-alone format specification, which was done by embedding source code and replacing some high-level descriptions with excerpts from it. That did not help much though.

As I said above, the quality of that code does not look good and it was not written for clarity (and comments can’t fix it all). The specification is lacking. For example, it mentions that data partitions contain DCT coefficients that can be decoded in parallel but it fails to mention how exactly those coefficients are partitioned (and the answer can be found only in the source code). I’d say formally it’s enough to write your own decoder but only because it comprises the reference decoder source.

Similarly it is no surprise that there’s no VP9 specification: in that codec they’ve switched from H.264 to H.265 as the “inspiration” source and could not be bothered to write any specification for it. The only sensible version appeared about three years after the codec release; it was written by Argon Design engineers (under the supervision of some On2 guy of course) and got stuck at version 0.6.

All signs are telling this is not going to be fun. But I’ve had worse and VP8 is sufficiently close in design to VP7 for which I’ve written a decoder already, so it should be feasible.

4 Responses to “VP8: specification analysis”

  1. Paul says:

    Duck codecs are very inspirational, not.

  2. Kostya says:

    They can inspire you to make your own rip-off codec.

  3. Ksec says:

    And yet almost every single person on the internet are cheering for Baidu’s video codec and hope MPEG and their parties fails. And when Baidu stopped looking at MPEG for “inspiration” AV1 is the kind of codec that we end up with.

  4. Kostya says:

    I doubt they’ve stopped looking at it—there are still lots of things that looks suspiciously the same in H.26x and AV1/AV2.

    The problem is that (at least if you believe Chiariglione) that MPEG is no more so you should not expect ITU H.26x codecs to be ratified under MPEG names. Where would AV3 takes its inspiration from?
