VP3-VP6: the Golden (Frame) Age of Duck Codecs

Dedicated to Peter Ross, who wrote an opensource VP4 decoder (that is not committed to CEmpeg yet at the time of writing).

The codecs from VP3 to VP6 form a single codec family that is united not merely by the design but even by the header: every frame in this codec (sub)family has the same header format. And the leaked VP6 format specification still calls the version field there Vp3VersionNo (versions 0-2 are used by VP3, 3 is used by VP4, 5 is for VP5 and 6-8 are for VP6). VP7 changed both the coding principles (to mimic H.264) and the header format. And you can call it the golden age for Duck because it spans from the time it gained popularity with VP3 being donated to the open-source community (and xiphed to Theora, which was the only patent-free(ish) opensource video codec with decent performance back then) to its greatest success found in VP6, employed both in games and in Flash video (remember when BaidUTube still used .flv with VP6 and N*llyMos*r ADPCM or Speex?). And today, having gathered enough material, I want to give an overview of these codecs. Oh, and NihAV can decode VP30 and VP31 now.
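The version mapping quoted above can be written down as a tiny lookup. This is only a sketch: the function name is made up for illustration, and version 4 is treated as unknown simply because the text does not mention it.

```python
def codec_family(vp3_version_no: int) -> str:
    """Map the shared header version field to a codec name (per the list above)."""
    if 0 <= vp3_version_no <= 2:
        return "VP3"
    if vp3_version_no == 3:
        return "VP4"
    if vp3_version_no == 5:
        return "VP5"
    if 6 <= vp3_version_no <= 8:
        return "VP6"
    # Anything else (including 4) is not covered by the description above.
    raise ValueError(f"unknown version {vp3_version_no}")
```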

VP3

It is hard to say how this codec was designed but it looks like it was mostly good for network transmission with the risk of packet loss (but then why does it not have CRCs?) and for doing stuff not like other codecs do.

The overall format can be described as a very weird variation on progressive JPEG. The image (all three planes) is represented as a sequence of 32×32 pixel super-blocks (the part of a super-block outside the plane boundaries is not coded), each of them contains 2×2 macroblocks, and each of those contains 2×2 of the usual 8×8 blocks. The block metadata (coding type and motion vectors) is coded as single chunks of data for all blocks before the coefficients. Coefficients are also coded in sequence: DCs for all blocks first, then the 1st AC for all blocks (where present), then the 2nd AC … up to the 63rd AC. There are two kinds of frames: intra and inter. Inter frames can use either the previous frame or the last intra frame (called the golden frame) as the reference (VP6 finally allowed any inter frame to become the new golden frame but let’s talk about it later).
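To make the layout a bit more tangible, here is a rough sketch of the super-block grid and of the order in which coefficients appear in the stream. The names and the `coefs_per_block` list (how many coefficients each block has before its end-of-block) are assumptions for illustration; a real decoder learns that from the tokens as it goes.

```python
def superblock_grid(width, height, sb_size=32):
    """Number of super-blocks per row and column, rounding partial ones up."""
    return ((width + sb_size - 1) // sb_size,
            (height + sb_size - 1) // sb_size)

def coefficient_stream_order(coefs_per_block):
    """List (coefficient index, block index) pairs in stream order.

    All DCs come first, then AC 1 for every block that still has it, and so on.
    """
    order = []
    for coef_idx in range(64):
        for block_idx, count in enumerate(coefs_per_block):
            if coef_idx < count:
                order.append((coef_idx, block_idx))
    return order
```

For three blocks with 3, 1 and 2 coefficients, the stream carries three DCs, then the first AC of blocks 0 and 2, then the second AC of block 0.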

As you can see, it’s robust against truncated frames but I cannot imagine such a scenario being important for a codec not used in video calls (and that was VP7, way later than VP3). Also that new block ordering required a new way to walk them so that there are no jumps between the last block in the current super-block and the first block in the next super-block. Hence the Hilbert curve, suitable for macroblocks and blocks alike.
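For those who have not met the Hilbert curve before, the textbook index-to-coordinate conversion looks like this. It is a generic sketch: the orientation and starting corner that VP3 actually uses are not checked here, only the property that matters (consecutive blocks are always neighbours).

```python
def hilbert_d2xy(n, d):
    """Convert position d along a Hilbert curve into (x, y) on an n x n grid.

    n must be a power of two; this is the standard iterative construction.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the quadrant before rotating it.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Walk order for the 4x4 blocks of one super-block.
walk = [hilbert_d2xy(4, d) for d in range(16)]
```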

IMO the codec was designed not for efficiency but rather for being sufficiently different from the rest. And while I don’t like the result, I must admit that at least it’s quite original in some parts and I appreciate the variety.

Anyway, beside golden frames, the peculiar block walk pattern and data partitioning, the other notable thing is block coefficient coding. While the method remains the same (you essentially need to code a sparse array), instead of using the standard run-length codebook VP3 uses a semi-fixed scheme with entries grouped in categories (a fixed number of end-of-block flags, a longer run of EOB flags, a fixed coefficient, a small coefficient in a fixed range, a zero run of one plus a small coefficient and such). As a result you need to read one of 32 possible tokens using one of several codebooks and then unpack the token by following the fixed scheme. For reference, the H.263 coefficient codebook contains over a hundred entries (and still needs escape value handling).
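The sparse-array idea itself can be shown with a toy expander. The token kinds below (`"coef"`, `"zero_run"`, `"eob"`) are hypothetical stand-ins for the categories, not the real 32 token values, and EOB runs spanning several blocks are left out.

```python
def expand_tokens(tokens, size=64):
    """Expand already-unpacked tokens into a full coefficient array."""
    coefs = [0] * size
    pos = 0
    for token in tokens:
        kind = token[0]
        if kind == "eob":
            break  # the rest of the block stays zero
        if kind == "zero_run":
            _, run, value = token
            pos += run  # skip the zeroes, then store the coefficient
        else:  # "coef"
            _, value = token
        coefs[pos] = value
        pos += 1
    return coefs
```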

Now let’s review VP3 flavours.

VP30 is obviously the zeroth version (it does not even have a version in the header and stores dimensions in macroblocks there instead). Block coded flags are coded hierarchically: first there are bit-run coded flags for super-blocks telling whether they contain coded macroblocks, then for all non-empty super-blocks there are bit-run coded flags for macroblocks that contain coded blocks, and finally there are bit-run coded flags for blocks in coded macroblocks (obviously in intra frames all blocks are coded and are of intra type so such metadata is not present).

After that you have macroblock types (just for coded macroblocks of course) which are read with a fixed codebook. There are seven macroblock types: intra, skip, copy with MV, copy using the motion vector from the last inter block, four MVs per macroblock, copy from the golden frame from the same position, copy from the golden frame using MV. The WTFiest moment for me is that an uncoded intra block in VP3 (which may happen only in an inter frame) actually means a skipped block.

After that you have motion vector information for the macroblocks that require it using fixed coding scheme for motion vector components. As you can see above, the only kind of MV prediction is re-using motion vector from last inter block with coded MV (VP5 will improve this).

Coefficients are coded using three codebooks (one for DCs, one for intra block ACs and one for inter block ACs) with codebook set selected from one of five possible depending on quantiser. DC prediction is simple: just add last DC from the block of the same type (intra for intra, any inter for inter block). Also it has been using bit-exact DCT that remained until VP7 decided to switch to H.264-like transforms.
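The VP30 DC prediction described above boils down to two running values. This is a minimal sketch under a couple of assumptions not stated in the text: the predictors start at zero, and any per-plane or per-frame reset is ignored.

```python
def undo_dc_prediction(blocks):
    """Reconstruct DCs from (is_intra, dc_delta) pairs given in coding order."""
    last_dc = {True: 0, False: 0}  # one predictor for intra, one for any inter
    result = []
    for is_intra, delta in blocks:
        dc = last_dc[is_intra] + delta
        last_dc[is_intra] = dc
        result.append(dc)
    return result
```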

Loop filter seems to be the same and survives even in VP8 (under the name of simple loop filter) but it’s not used on output picture since VP4…

Fun fact: VP30 decoder was written in C++ and binary VP31 (and maybe even VP4) decoder still contains it along with C code for decoding VP31.

And as I’ve mentioned in the beginning, NihAV can decode it now while libavcodec/vp3.c still can’t.

VP31 is a small improvement on the above and it’s the version people know the most (if you include Theora based on it). Block coding information is now coded in the following layers: first it codes information for partially coded super-blocks, then it classifies the rest into fully coded/uncoded super-blocks and finally it signals which blocks in partially coded super-blocks are actually coded.
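The three layers can be sketched like this, assuming the flag lists have already been run-length decoded from the bitstream and that every super-block holds 16 blocks (super-blocks cropped at plane edges are ignored). All names here are illustrative.

```python
def classify_superblocks(partial_flags, full_flags):
    """Layers 1 and 2: mark each super-block as partial, full or uncoded."""
    full_iter = iter(full_flags)  # only present for non-partial super-blocks
    states = []
    for is_partial in partial_flags:
        if is_partial:
            states.append("partial")
        else:
            states.append("full" if next(full_iter) else "uncoded")
    return states

def coded_blocks(states, block_flags, blocks_per_sb=16):
    """Layer 3: per-block flags are read only for partially coded super-blocks."""
    flags = iter(block_flags)
    result = []
    for state in states:
        if state == "partial":
            result.append([bool(next(flags)) for _ in range(blocks_per_sb)])
        else:
            result.append([state == "full"] * blocks_per_sb)
    return result
```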

Macroblock types are coded using either raw indices or unary coding plus one of the predefined sets of values (or a custom one).

Motion vectors now can be decoded using fixed coding scheme (slightly different one) or raw 5-bit value plus sign (even for zeroes). Also now there’s an inter block type that uses second last MV.
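The raw mode is simple enough to sketch. The bit reader below is a hypothetical helper, and both the MSB-first bit order and reading the magnitude before the sign are assumptions taken from the wording "5-bit value plus sign".

```python
class BitReader:
    """Minimal MSB-first bit reader (an assumption, for illustration only)."""
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # position in bits

    def read(self, nbits: int) -> int:
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos >> 3]
            bit = (byte >> (7 - (self.pos & 7))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value

def read_raw_mv_component(br: BitReader) -> int:
    magnitude = br.read(5)
    negative = br.read(1)  # the sign bit is read even when magnitude is zero
    return -magnitude if negative else magnitude
```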

Coefficients are now coded using a set of five codebooks (DC, ACs 1-5, ACs 6-14, ACs 15-27 and ACs 28-63) for luma and for chroma. Codebooks are selected from a set of sixteen using two indices read from the bitstream for DCs and (after DCs are decoded) two indices read from the bitstream for ACs. DC prediction is now complex and uses weighted DCs from neighbours depending on which of them of the same kind are available (intra, last frame reference or golden frame reference).
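Picking the codebook group for a coefficient is a plain range check. The function name and the 0-4 group numbering are made up for this sketch; only the ranges come from the description above.

```python
def codebook_group(coef_idx: int) -> int:
    """Return 0 for DC and 1..4 for the four AC ranges."""
    if not 0 <= coef_idx <= 63:
        raise ValueError("coefficient index out of range")
    if coef_idx == 0:
        return 0  # DC
    if coef_idx <= 5:
        return 1  # ACs 1-5
    if coef_idx <= 14:
        return 2  # ACs 6-14
    if coef_idx <= 27:
        return 3  # ACs 15-27
    return 4      # ACs 28-63
```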

And there is an elusive VP33 (aka VP3 version 2) which has some leftover code in the VP31 opensource dump including the tables and block coded flags.

While most of the code is the same there are two main changes: the coding of block coded flags is completely different now and there’s a completely new set of tables. Block coded flags are decoded on a macroblock basis: first you decode flags denoting fully coded macroblocks, then you decode flags for partially coded macroblocks and finally you decode the coded block pattern using the previously decoded block pattern and 3-5 bits of data. The rest goes exactly as in VP31.

Either this was an experimental codec or simply a leftover they forgot to remove but while binary VP3 decoder supports it, there’s nothing known about any samples or means to encode them. Call it Bink-d of VP codecs.

VP4

Now enters VP4. Essentially it’s VP33 with some tweaks and some changes in frame reconstruction. The main differences are motion vector coding scheme (it’s more complex now) and the fact that loop filter is not applied to the decoded frame but used on motion block source instead (i.e. it copies block from source frame to the temporary buffer and if it contains edge pixels then it applies loop filter on it). The same scheme was employed in VP5 and VP6 as well. Also now it stores table indices before all coefficients (instead of DC luma table index, DC chroma table index, all DCs, AC luma tables index, AC chroma tables index, AC1, AC2, …) and has some more information in the frame header but that’s all I can think of. Also DC prediction has been changed once again (seems to be simply an average of neighbouring DCs from the same kind blocks).

The only fun facts about it are that some of its code is already present in the opensource VP31 dump (being mostly VP33) and that the binary VP4 decoder contains code and tables for bilinear and bicubic motion interpolation employed in VP5/6. I’m yet to look at VP5 and VP6 binary specifications so I don’t know whether they contain hints on VP7 but that seems unlikely.

VP5

This is a codec that radically changed two things: data is now coded using their bool coder (a variant of arithmetic coder for coding bits that uses fixed probabilities instead of an adaptive model) that is used to decode bits for values and codes in fixed Huffman trees (also that means that now there are twelve DCT token categories instead of 32 like before). This scheme will survive until VP9 and has some chances to reappear in AV2 (just a guess—we know who develops it and nothing else beside that fact).
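Decoding a token through a fixed tree with bool-coded bits can be sketched as below. The array layout (positive entries are node indices, non-positive entries are negated leaf values) is borrowed from how later VP codecs describe their trees and is not confirmed for VP5; `read_bool` stands in for the real bool decoder and here simply replays prepared bits.

```python
def read_tree(tree, probs, read_bool):
    """Walk a binary tree, asking the bool decoder for one bit per node."""
    i = 0
    while True:
        bit = read_bool(probs[i >> 1])  # each node has its own fixed probability
        i = tree[i + bit]
        if i <= 0:
            return -i  # reached a leaf

# Toy three-token tree: bit 0 -> token 0; bits 1,0 -> token 1; bits 1,1 -> token 2.
TOY_TREE = [0, 2, -1, -2]
TOY_PROBS = [200, 128]  # illustrative values only

def make_stub_decoder(bits):
    """Hypothetical stand-in that ignores the probability and replays bits."""
    it = iter(bits)
    return lambda prob: next(it)
```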

As a result the codec could return to having ordinary 16×16 macroblocks without super-blocks (and to coding blocks instead of coefficient slices). Also now motion vector prediction relies on using so-called nearest and near MVs that come from neighbouring blocks visited in a certain order.

The only major drawback I can name is that now inter frames may rely on previous ones since they can transmit probability updates and thus a dropped frame will result in complete garbage instead of a damaged frame converging back to a normal image.

VP6

And now VP5 can be tweaked for performance. So the notable changes include: signalling that this inter frame is a new golden frame and new data coding and partitioning modes. Now if you want to have lower latency you can split your data into two parts and start decoding coefficients as soon as you decode corresponding macroblock information. Or if you really care about speed you can switch to old way of coding by constructing Huffman trees from model probabilities and having special handling for zero/EOB runs. Also now there’s an interlaced mode and alpha support. No wonder the codec was popular on the Web until H.264 came to power.
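As for building Huffman trees from model probabilities, the generic construction looks like this. How VP6 actually turns its model probabilities into per-token weights is not covered here, so the `weights` input is an assumption and this is just ordinary Huffman code length assignment.

```python
import heapq

def huffman_code_lengths(weights):
    """Return a code length per symbol for the given weights."""
    # Each heap entry: (weight, tie-breaker, symbols under this subtree).
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    next_id = len(weights)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1  # every merge adds one bit to these symbols
        heapq.heappush(heap, (w1 + w2, next_id, syms1 + syms2))
        next_id += 1
    return lengths
```

The more probable a token is, the shorter its code, which is what makes plain bit reading a workable (if less efficient) replacement for the bool coder.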


From VP30 to VP6.2 it was a steady evolution of the codec (beside the brief return from bool coding to Huffman trees in VP6) with most concepts remaining the same and the biggest changes being between VP4 and VP5: new bool coder, qpel MC and the switch from super-blocks to ordinary macroblocks (and the new MV prediction scheme to some extent). But other things like data partitioning, DCT tokens, DCT itself and loop filter remained the same. In the same way the change from VP6 to VP7 involves switching coding blocks to 4×4 ones, new transforms and spatial prediction, but many other things like the bool coder or MC interpolation still remain there.

Maybe I’ll revisit it later if I ever get to VP7 decoder but for now this should give you an idea how Duck codecs changed over the time and why it ended like that (maybe one day somebody will manage to answer a question “Why platypus? Just why?”). Meanwhile I’m looking sideways for AV2 aka VP11 and its possible collision course with H.266 aka VVC.

One Response to “VP3-VP6: the Golden (Frame) Age of Duck Codecs”

  1. Peter says:

    Great write up.

    VP33 was never released, and I wonder if the increasing popularity of MPEG-4 at the time encouraged the marketing team to name it VP4.

    No mention of interlacing capabilities. Good!