Archive for the ‘NihAV’ Category

NihAV: released!

Monday, July 27th, 2020

NihAV was a fine joke that had been running for far too long. But today, on no particular date at all, I release it for the public to ignore or to briefly look at and forget immediately. Some decoders (Bink2, ClearVideo and Vivo 2) are still far from perfect, some features have simple or sketchy implementations, but despite all of that here it is.

The official website is here, source code is here.

Many thanks to people from former Libav project for hosting.

Some words about NihAV tools

Saturday, July 11th, 2020

Since the work on NihAV is nearing the point when I can release it to the public without that much shame (the main features I wanted to implement are there and I even have documentation for all public interfaces plus some overview, you can’t ask for more than that) I want to give $title.

nihav-tool

This is the oldest tool, oriented mostly to testing decoder functionality. By default it will try to decode every stream in a file and output it either into a wave file or a sequence of images (PPM for RGB video, PGMYUV for YUV). Besides that it can also not decode a stream (and if you choose to decode neither then it tests demuxer or dumps raw frames).

Here is the list of switches it understands:

  • -noout makes it decode data but not produce any output (good for testing decoding process if you don’t currently care about decoder output);
  • -an/-vn makes it ignore audio or video streams correspondingly;
  • -nm=count/pktpts/frmpts make nihav-tool write frame numbers as a sequence or using PTS from input packet or decoded frame correspondingly;
  • -skip=key/inter tells video codec (if it is willing to listen) to skip less significant frames and decode only keyframes or intra- and interframes but no B-frames;
  • -seek time tells the tool to seek to the given position before decoding;
  • -apfx/-vpfx prefix specify the prefix for output filename(s) which comes useful when decoding files in a batch;
  • -ignerr tells nihav-tool to keep decoding ignoring errors the decoders report;
  • -dumpfrm tells nihav-tool to dump raw frames. This is both useful for obtaining raw audio frames (I could not make avconv do that) and because of the way it is implemented (it dumps packet contents first and then tries to decode it) if you use it along with the decoder and it errors out you’ll have raw frame on which it errored out.

Additionally you can specify end time after giving input name if you don’t need to decode the whole file.

As you can see this is not the most feature-rich tool but it works well enough for the declared goal (hence I mostly use a debug build of it).

nihav-player

This is another quick and dirty tool that appeared when I decided that looking at long sequences of images is not the best way to ensure that decoding goes right. So I wrote something that can pass in a bad light for a player since it can show moving pictures and play sound that sometimes even goes in sync instead of deadlocking audio playback thread.

Currently it’s written using a patched SDL1 crate (removing dependencies on num and rand and adding YUV overlay support and an audio interface that you can actually use from Rust; patches will be available in the same repository) because my primary development system is too old and I don’t want to mess with various libraries or finding which version of sdl2 crate would compile using my current version of Rust (1.31 or 1.33).

In either case it’s a temporary solution used mostly for visual debugging and I want to write a proper media player based on SDL2 that would play audio-only files just as fine (so I can move to dogfooding). After all, can you really call yourself a multimedia developer if you haven’t written a single player?

nihav-encoder

And finally the tool that appeared out of need to debug encoders instead of decoders. Hopefully it will become more useful than that one day but at least its interface should give you the idea what it does and what it will do in the future.

I still consider one of the main problems with ffmpeg (the tool) and later avconv the positional order of arguments. Except when the order does not matter. If you’ve never been annoyed by the fact you should put some arguments before -i infile in order for them to take effect on input and the rest of arguments should be put before output file name—well, in this case you’re luckier than me. So I’ve decided to have it in a more free-form format.

nihav-encoder command line looks like a list of options in no particular order; some of them take complex arguments, in which case you provide a comma-separated list in the form --options-list option1,option2=value,option3=.... Here is the list of recognised options:

  • --list-{decoders,encoders,demuxers,muxers} obviously lists the corresponding category and quits after listing all requested lists and options (see the next item);
  • --query-{decoder,encoder,demuxer,muxer}-options name prints the list of options supported by the corresponding codec or (de)muxer. Of course you can request options for several different things to be listed by adding this option several times;
  • --input inputfile and --output outputfile;
  • --input-format format and --output-format format force (de)muxer to use the provided format when autodetection fails;
  • --demuxer-options options takes a comma-separated list of options for demuxer (BTW you can also force input format with e.g. --demuxer-options format=avi);
  • --muxer-options options takes a comma-separated list of options for muxer (BTW you can also force output format with e.g. --muxer-options format=avi);
  • --no-audio and --no-video tell nihav-encoder to ignore all audio or video streams correspondingly;
  • --start time and --end time tell nihav-encoder to start decoding at given time and end at given time. The times are absolute so --start 1:10:00 --end 1:11:00 will process just a second of data;
  • --istreamX options and --ostreamX options set options for input and output streams with given numbers (starting with zero of course). More about them below.
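The comma-separated form used by --istreamX, --ostreamX and the (de)muxer options can be illustrated with a minimal sketch. The function name parse_option_list is hypothetical and this is not the actual nihav-encoder parser, just one way such a list might be split into names and optional values:

```rust
/// Splits a string like "encoder=cinepak,nstrips=4,drop" into
/// (name, optional value) pairs. Empty items are skipped.
/// Hypothetical helper, not taken from nihav-encoder itself.
fn parse_option_list(list: &str) -> Vec<(String, Option<String>)> {
    list.split(',')
        .filter(|item| !item.is_empty())
        .map(|item| match item.find('=') {
            // "name=value": everything after the first '=' is the value
            Some(pos) => (item[..pos].to_string(), Some(item[pos + 1..].to_string())),
            // bare flag such as "drop"
            None => (item.to_string(), None),
        })
        .collect()
}
```

With this sketch, "encoder=cinepak,quant_mode=mediancut,nstrips=4" yields three named values while a bare "drop" yields a single flag without a value.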

nihav-encoder has two modes of operation: query mode, in which you specify which e.g. demuxers or codec options you want listed, and the program quits after listing them; and transcode mode, in which you specify input and output file and what you want to do with them. Maybe I’ll add a probe mode but I’ve never cared much about it before.

So what happens when you specify input and output? nihav-encoder will try to see which streams can be output (e.g. when transcoding from AVI to WAV there’s no point to even attempt to do anything with video stream), then it will try to copy input streams to the output unless anything else is specified. Of course you can specify that you want to discard some input stream with e.g. --istream0 drop. And for output streams you can also specify encoder and its parameters. For example my command line for testing Cinepak encoding looks like this:

./nihav-encoder --input laser05.avi --output cinepak.avi --no-audio --ostream0 encoder=cinepak,quant_mode=mediancut,nstrips=4

It takes input file laser05.avi, discards audio stream, encodes remaining video stream with Cinepak encoder that has options quant_mode and nstrips set explicitly, and writes the result to cinepak.avi.

As you can see, this tool has enough features to serve as a daily base transcoder but no complex features like taking input from several files, arbitrary mapping input streams from them to output streams and maybe applying some effects while at it. In my opinion that’s the task for some more complex application that builds a complex processing graph probably using a domain-specific language to specify inputs and outputs and what to do with them (and it should be a proper command file instead of command line that is impossible to type correctly even from the eighth try). Since I never had interest in GStreamer I’m definitely not going even to play with that. But a simple transcoder should serve my needs just fine.

Another reason for NihAV

Saturday, July 4th, 2020

So instead of doing something productive like adding missing functionality bits and writing documentation I wasted my time on adding some QuickTime decoders. And while wasting time on adding SVQ1, SVQ3, QDMC and QDM2 decoders it became apparent why NihAV is a good thing to exist.

Implementing two of them was not a very big deal but implementing SVQ3 and QDM2 decoders took more than a week each because there are only two specifications available for them and both are equally hard to comprehend: the first one is the official binary specification, the second one is source code in libavcodec which is derived from the former.

The problem arises when somebody wants to understand how it works and/or reimplement the code and both SVQ3 and QDM2 decoder demonstrate two different aspects of that problem.

SVQ3 decoder is based on some draft of H.264 (or ex-MPEG/AVC if you’re from Piedmont) with certain extensions mostly related to motion compensation. Documentation for it was scarce and because of optimisations and integration with common H.264 decoder bits it’s hard to understand some of the things. One of those is intra prediction with two modes having SVQ3-specific hacks hidden in libavcodec/h264pred.c (those are 16×16 plane prediction mode giving transposed result and 4×4 diagonal down prediction being simplified and not relying on pixels not immediately top/left from the block) and another one is block coefficients decoding function. It took me quite a while to realize that it actually decodes three different kinds of blocks: single 4×4 block with zigzag scan, 4×4 block divided into two parts with interlaced scan, and 2×2 block. I’ve documented most of that in The Wiki (before that nobody had touched that page for almost ten years; sometimes I feel like I’m the only person contributing there).

QDM2 is horrible in a different way. It is a slightly improved translation of the original binary specification with hardly any idea how it works (there are still names like local_int_8 in the code). Don’t get me wrong, back in 2003-2005 when reverse engineering was done the only tools you had were debugger, disassembler (you’re lucky if it’s not the one provided by debugger) and no decompilers at all (IIRC rec appeared much later and was of limited usefulness, especially on multi-megabyte QT monolith, and that’s assuming you’re not doing that on Mac with even fewer tools available). I did some of such work back then as well so I understand how hard it is and how you’re happy that it works somehow and you can ship it and forget about it.

Another thing is that now it’s clear that QDMC and QDM2 are predecessors of DT$ LBR (aka Express) and use the same principles (QDMC simply coded noise and tones, QDM2 is almost like LBR but without some features like LPC or multichannel audio and with different chunk structure), but back in the day there was no documentation on LBR (or LBR itself for that matter).

But the main problem is that nobody has tried to understand the code since. It became a so-called category killer i.e. its existence prevents others from doing something similar. At least until some idiot tried to do another implementation in NihAV.

And here we have the reason for NihAV to exist: it advances the understanding of codecs for me (and I document results in The Wiki) resulting in different implementations that are (hopefully) easier to understand and sometimes even fix long-standing bugs. I hope this shall convince you that sometimes it’s good to have reimplementation of the decoder even if an existing implementation is good enough (as far as I remember the only time a decoder was rewritten in FFmpeg was when a reverse-engineered Indeo 3 decoder that crashed on damaged content almost every time was replaced with a reverse-engineered Indeo 3 decoder where a guy had the idea how it works).

But back to QDM2: while my decoder is not finished yet and I probably won’t bother with inter-frames in it (I’ve never seen any samples with those), it still decodes sweeps much better. That’s mostly because of the various bugs I’ve uncovered (also while discovering that Ghidra effectively does not allow editing a decoder context about a megabyte large). Since I have no incentive to produce a patch and people who created the decoder are long gone from the project, here are some spotted bugs: wrong coarse quantiser band selection (resulting in noise generated in wrong frequency range), reading bits past the chunk end (because in some cases checks are missing), ignoring group 4 tones because of the wrong conditions, some initial variables are set in the wrong way too. Nevertheless it mostly works and it was very useful for mapping the functions in the binary specification (fun fact: QDM2 decoder is located in QuickTime.qts while QDMC is located in QuickTimeInternetExtras.qtx).

NihAV: Conceptually Done!

Sunday, June 7th, 2020

I’m happy to announce that NihAV has finally taken more or less complete form. Sure there are some concepts I wanted to play with (like raw streams handling) but I had no need for them so far so it can wait until much much later. But all major features required to build a transcoder are there as well as working transcoder itself.

As I wrote in the previous post I wanted to play with vector quantisation so first I implemented image palettisation but since that was not enough I implemented two encoders using vector quantisation: 15-bit MS Video 1 and Cinepak. I have no doubts that Tomas Härdin has written a much better encoder but why should that stop me from NIHing? Of course such encoder is not very useful by itself (and it was useless to begin with) so I needed a muxer to represent encoder output in some form. And then simply fiddling with parameters and recompiling became boring so I finally introduced generic options and in order to use those options without recompiling the binary every time I had to write a transcoder as well. But that means that now I can use NihAV to recode media into something else even if it’s just two crappy video encoders, MS ADPCM and PCM encoder with the large variety of supported output containers (AVI and WAV!). I called it conceptually done because all the essential concepts are there, not because there’s nothing left to do.

Now about video encoders. I’ll describe the NihAV design and how it works on a separate page, for now I just mention that while decoders are working on “frame in-picture/audio out” principle, encoders accept single picture or audio buffer for encoding and then may output a series of encoded packets. Why such asymmetry in design? Because decoders are expected to produce single output for single input (and frame reordering is handled externally) while most encoders are expected to have at least a single audio frame or couple of pictures of lookahead to make decisions about coding of a current input. For modern video codecs it may be a decision what frame type to assign or where to start a new scene, for audio codecs like AAC you may need to change current frame type if the following frame type has transients and previous frame type didn’t have them.
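The asymmetry between the two interfaces can be sketched roughly as below. All names here (Frame, Packet, Decoder, Encoder, LookaheadEncoder) are hypothetical stand-ins invented for illustration and are not the real NihAV types or traits; the toy encoder merely delays output by one frame to mimic lookahead.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-ins for real frame and packet types.
#[derive(Clone, Debug, PartialEq)]
pub struct Frame(pub u32);
#[derive(Clone, Debug, PartialEq)]
pub struct Packet(pub u32);

/// Decoder side: one packet in, one picture or audio buffer out.
pub trait Decoder {
    fn decode(&mut self, pkt: &Packet) -> Frame;
}

/// Encoder side: feed one input, then drain zero or more packets.
pub trait Encoder {
    fn encode(&mut self, frame: &Frame);
    fn get_packet(&mut self) -> Option<Packet>;
    fn flush(&mut self);
}

/// Toy encoder that needs to see the next frame before coding the current one.
pub struct LookaheadEncoder {
    pending: VecDeque<Frame>,
    ready: VecDeque<Packet>,
}

impl LookaheadEncoder {
    pub fn new() -> Self {
        Self { pending: VecDeque::new(), ready: VecDeque::new() }
    }
}

impl Encoder for LookaheadEncoder {
    fn encode(&mut self, frame: &Frame) {
        self.pending.push_back(frame.clone());
        // A frame is only "coded" once its successor is known.
        if self.pending.len() > 1 {
            let f = self.pending.pop_front().unwrap();
            self.ready.push_back(Packet(f.0));
        }
    }
    fn get_packet(&mut self) -> Option<Packet> {
        self.ready.pop_front()
    }
    fn flush(&mut self) {
        // At the end of input there is nothing left to wait for.
        while let Some(f) = self.pending.pop_front() {
            self.ready.push_back(Packet(f.0));
        }
    }
}
```

In this sketch the first encode() call produces no packet at all, which is exactly why a caller has to poll get_packet() in a loop instead of expecting one output per input.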

Anyway, back to the technical details about the encoders. MS Video 1 operates on 4×4 blocks that can be coded as skipped, filled with single colour, filled with two colours in a pattern, or split into 2×2 sub-blocks each filled with its own two colours in a pattern. Sounds perfect for median cut. Cinepak is much more complex. It splits frame into several strips and each strip is also split into 4×4 blocks that may be coded as skipped, single 2×2 YUV codeword (2×2 Y block and single U and V values) scaled twice or four YUV codewords from different codebook. Essentially for a good encoding you need to determine how to partition frame into strips optimally, split blocks into single and four-vector ones and find optimal codebooks for them separately. Since I wanted to write a working encoder mostly to check whether vector quantisation is working, I simply have a fixed number of strips and add every block as a candidate for both coding schemes without any subsequent refining steps.
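To make the “two colours in a pattern” mode more concrete, here is a minimal sketch that approximates a 4×4 block with two values and a 16-bit mask. It is a simplification under stated assumptions: it works on single 8-bit samples instead of 15-bit RGB, it splits pixels around the block mean rather than using median cut, and the function name two_colour_block is hypothetical.

```rust
/// Approximates a 4x4 block of 8-bit samples with two values and a mask.
/// Bit i of the mask is set when pixel i should use the `hi` value.
/// Illustration only: splits around the mean, not by median cut.
fn two_colour_block(pix: &[u8; 16]) -> (u8, u8, u16) {
    let mean = pix.iter().map(|&p| u32::from(p)).sum::<u32>() / 16;
    let (mut lo_sum, mut lo_cnt, mut hi_sum, mut hi_cnt) = (0u32, 0u32, 0u32, 0u32);
    let mut mask = 0u16;
    for (i, &p) in pix.iter().enumerate() {
        if u32::from(p) > mean {
            mask |= 1 << i;
            hi_sum += u32::from(p);
            hi_cnt += 1;
        } else {
            lo_sum += u32::from(p);
            lo_cnt += 1;
        }
    }
    // lo_cnt is never zero: the smallest sample cannot exceed the mean.
    let lo = (lo_sum / lo_cnt) as u8;
    // A flat block has no "high" pixels, so both colours collapse into one.
    let hi = if hi_cnt > 0 { (hi_sum / hi_cnt) as u8 } else { lo };
    (lo, hi, mask)
}
```

A block whose mask comes out as zero is a natural candidate for the single-colour fill mode instead.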

Here are some numbers if you really care about those. Input is laser05.avi (320×240 Indeo2 file with 196 video frames from the standard samples place). Encoding with MS Video 1 encoder takes about 4 seconds. Encoding Cinepak with median cut takes six seconds. Encoding Cinepak with ELBG and randomly-generated codebooks takes 36 seconds and result looks bad (but recognizable). Encoding Cinepak with ELBG that takes codebooks produced with median cut as the initial ones takes 68 seconds but the quality is higher than merely median cut and the output file is slightly smaller too.


Now with all of this done I should probably fix the knowingly bad decoders (RV6 and Bink2), add whatever missing decoders and features I see fit and start documenting it all. I have strong doubts about VDD this year but maybe I’ll be able to present my stuff at FOSDEM 2021.

NihAV: Now with Palette Support

Sunday, May 31st, 2020

While NihAV had support for paletted formats before, now it has more use cases covered. Previously I could only decode paletted format and convert picture into some other format. Now it can handle palette in standard containers like AVI and MOV and even palette change in AVI (it’s done via NASideData which is essentially the same thing I NIHed more than nine years ago). In addition to that it can convert image into paletted format as well and below I’d like to give a brief review of methods employed.

NihAV: Toying with VivoActive

Tuesday, May 5th, 2020

Before moving to improving parts of NihAV not related to decoding I decided to implement some small family of formats and I picked VivoActive since somebody complained some of it was unsupported.

This family consists of one custom container format and three codecs based on ITU standards. Container format is simple, intended just for one video and one audio stream with video frame most likely split into 128-byte chunks (probably for better streaming), the only interesting thing is that it stores header in text form which is too flexible compared to the rest of format.

The first audio codec is ITU G.723.1 and it was painful to implement it. As a proper speech codec it has a lot of proper speech codec math like “multiply 32-bit value A by 16-bit value B and shift result by 15 bits” which requires explicit casts in Rust. On the other hoof it has saturating_add() and friends which help in many other cases. There are places where functions take the same data as input and output while in other places the same functions have different input and output arrays. Plus I wanted to have a slightly better design structure so there are functions inherent to subframes, some functions belong to the decoder instance and some are used by both. And then I had to debug it. To put it in perspective, G.723.1 decoder takes 110 kB in source form and code part is 37 kB; for Siren the numbers are 45 kB and 15 kB respectively; Vivo video decoder is merely 19 kB because most of the decoding is done by base H.263 decoder in nihav-codec-support.

Siren (or more officially Polycom Siren 7) is a codec that served as a base for ITU G.722.1. Since RealAudio Cook is based on G.722.1 and I’ve written a decoder for it already, this one was quite easy to implement. Especially considering that some guy wrote an opensource decoder and encoder for it back in early 2000s. Also this might be the case when having 5*2^N FFT finally paid off since Siren frames are 320 samples long so I still can use my standard IMDCT implementation here (it outputs samples in reverse order but that’s no problem).

And finally Vivo Video. It’s yet another codec based on H.263 (but with slightly different headers) and notable mostly for how it represents codebooks. The codebooks are stored as a single set (but not in order e.g. codebook definition number two is used for codebook number fourteen), each codebook can represent codes up to eight bits long (for longer codes you have escape prefix which means that e.g. codes starting with 0000 10 have their tails defined in another codebook set). Another interesting feature is that the codes are stored as text strings with ones, zeroes, and spaces (yes, the decoder parses them to get the actual code). Additionally it has a weird decoding mode where you keep a state ID, there’s a special table to map it to the actual codebook number, and codebook tells you how to change state ID when you decoded a new code. This mode can be used to decode the whole stream or just macroblock coefficients.
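Turning such a textual code into something a bitstream reader can use is straightforward. The following is a minimal sketch, assuming codes of at most 32 bits and spaces used purely as separators; parse_code is a hypothetical name and not the function from the actual decoder.

```rust
/// Parses a textual code such as "0000 10" into (bits, length).
/// Returns None for an empty code, a stray character or more than 32 bits.
fn parse_code(text: &str) -> Option<(u32, u8)> {
    let mut bits = 0u32;
    let mut len = 0u8;
    for ch in text.chars() {
        let bit = match ch {
            '0' => 0,
            '1' => 1,
            ' ' => continue, // spaces only separate groups of digits
            _ => return None,
        };
        if len == 32 {
            return None;
        }
        bits = (bits << 1) | bit;
        len += 1;
    }
    if len == 0 {
        None
    } else {
        Some((bits, len))
    }
}
```

The length has to be kept next to the value because leading zeroes are significant: "0000 10" is a six-bit code with value 2, not the two-bit code "10".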

As for the codec itself, there are two flavours of it: Vivo/1.0 (or Vivo/0.90) and Vivo/2.0. The first version is plain H.263 that does not use any special features, the second version has PB-frames (i.e. frames where B-frame macroblock data is stored together with P-frame macroblock data) and it employs AIC (advanced intra coding mode). It’s probably the only codec I’ve seen that actually has AIC in P-frames and not just in I-frames. Reconstruction of P-frames because of this AIC mode is not perfect but as with G.723.1 decoder it’s good enough to demonstrate that it works and I don’t want to waste more time on it.

All in all it was a meh-y experiment with mediocre results and I should move on.

Better VMD Support in NihAV

Thursday, April 16th, 2020

As the certain doctor from ScummVMTrek reminded me, the VMD format (developed by Coktel Vision that was bought by Sierra) was used in their own games too (and even more so, there VMDs were used for many animations where simple sprite would suffice as well) and they kept making some educational games way into 2000s. So I decided to look at those as well.

The Last Dynasty (which I’ve never played and am unlikely to ever play) features VMD that can be decoded mostly fine except that in 320×161 video you sometimes have sudden 640×322 frames near the end of video.

Some education game from Adi 4.0 generation. This one has some videos in 15-bit RGB format plus IMA ADPCM compressed audio track. But it can’t beat the weirdness of…

Urban Runner. Here we have a mix of 15-bit RGB VMD, 24-bit RGB VMD and VMDs with Indeo 3 video. And if you thought this was simple enough here’s another fun trick for you. Despite having different depths, all non-Indeo3 VMDs use the same compression methods, just in some cases the buffer should be interpreted as bytes, sometimes as 16-bit little-endian words and sometimes like triplets. So far so good. But they had a bright idea of sometimes storing the image dimensions in pixels and sometimes in bytes. In result I look at VMD header to see what flavour it has there to see if I need to scale frame dimensions by bytes per pixel before decoding or not. And this game also features some videos that have 312×136 resolution except that the last frame is 624×272 (I had to allow my Indeo 3 decoder to change dimensions to handle that particular case).

At least it could be done mostly by guesswork (except for audio, I had to look into ADI4.exe using Ghidra to find out that it has IMA ADPCM now) and all files can be decoded fine now.

If somebody can provide me with some samples and binary for their latest generation I’d look at it as well.

NihAV: Progress Report

Monday, April 13th, 2020

Since we all in Europe have to suffer from the lockdown until things get better I can’t travel for now and in result I have to spend free time in different ways (mind you, I still work despite it’s work from home so it’s not that much free time added). Nevertheless I’ve spent a significant part of that time working on NihAV and in result I have some progress to report.

NihAV: Janitoring

Saturday, February 22nd, 2020

For the last couple of weeks I’ve been working on documenting and restructuring NihAV. In result I’ve documented every public thing in my crates (except H.263 decoder skeleton but I need to debug and maybe rework it anyway) and NihAV has its final crate structure.

Speaking about crate structure, modern languages often suffer from npm.js syndrome—when almost any trivial action has a separate package and most of the packages consist of imports from other packages. The other extremity would be to have two or three monolithic libraries with everything. I don’t think there’s a perfectly balanced solution so I split features using a few principles and I’ll stick to the scheme:

  1. nihav-core—the basis structure definitions like frame, packet, demuxer and decoder interfaces etc etc and utility code that should be used by both crates implementing NihAV format support and various users (like my own decoding tool and player);
  2. nihav-registry contains essentially three things: codec descriptions, codec mapping from e.g. FOURCC to codec name used by NihAV (IMO it’s better to use a string as codec identifier instead of arbitrary number that may or may not be recognized by the different version of the library) and container detection code (i.e. something like what file utility on UNIX does). This functionality can belong to nihav-core but it’s expected to be updated way more often than the base code so I decided to finally split it out;
  3. nihav-codec-support contains various pieces of code and data that are reused by many various decoders. It is intended just for decoders and has such bits as functions for testing decoder on some file, the skeleton for H.263 decoder (just add some functions for parsing headers and your new decoder is ready), motion compensation code, audio DSP bits (including FFT) and more;
  4. various crates that cover codec families and related containers: nihav-commonfmt for AVI and codecs like AAC; nihav-duck, nihav-indeo, nihav-rad and nihav-realmedia for supporting corresponding codec families with e.g. Bink or RealMedia demuxers as well; and nihav-game for supporting various codecs from various games with their unique demuxers;
  5. and finally nihav-allstuff that simply re-exports decoder and demuxer registrations in single nihav_register_all_codecs() and nihav_register_all_demuxers(). Also it has a test to check that all registered decoders have codec description in nihav-registry but nobody beside me should care about that.

Now with all of this done at last I can return to polishing other decoders which I still find more pleasant than documenting.

General overview of Duck codecs and their design

Saturday, February 15th, 2020

I’ve finally finished polishing out decoders for all Duck codecs (before it was bought by Baidu) and now they all seem to work fine (except AVC, that one can wait for later—much much later). And while I moved to even hairier and more painful tasks (reorganising nihav-core and even documenting it) now, as I have full understanding how those codecs work, I can give an overview of their design (not the bit-by-bit description of the format, we have The Wiki for that but rather most notable features and similarities to other codecs) and form my opinion on them.

TrueMotion 1

Somehow this might be their most original codec. While it’s simple codec with delta prediction I can’t remember any other codec that used a variable-length codebook with byte indices. Also this is the only codec in the family that works with RGB (16- and 24-bit modes even; the rest of codecs use YUV).

TrueMotion RT

This one is a trivial codec for real-time video capturing (hence the name) that codes deltas with fixed quantisation scheme (2, 3 or 4 bits deltas with predefined step sizes).

TrueMotion 2

This codec is still based on delta coding but now instead of working with individual pixels it works with 4×4 blocks that can have different amount of deltas and even employ motion compensation (instead of coding deltas). Also the data is separated into different streams and each of them is Huffman coded.

The approach with coding different kinds of information in separate chunks will be used in later codecs as well.

TrueMotion 2X

TrueMotion 2X is some weird amalgamation of TrueMotion 1 and TrueMotion 2. It works with 8×8 blocks that may have different amount of deltas like TM2 and information is grouped into chunks like TM2 but it uses variable codebook approach from TM1.

The main distinguishing features of this codec though are having multiple chunk variants for holding the same data and obfuscating data using XORing with 32-bit key derived from a key stored in a frame by passing it through LFSR a couple of times. IIRC frame data also contains the name of person owning the copy of the program so it might be some kind of protection scheme but it looks dubious at best.

3- and 4-bit ADPCM

As you can guess these codecs are based on DVI ADPCM (4-bit variant is essentially IMA ADPCM with different block header), 3-bit variant simply expands three deltas into four samples by interpolating coded differences (which has been done by other formats as well but I don’t remember which ones).

VP3-VP4

Starting with this format Duck moved to the codec design approach which I can describe as “make an equivalent of some existing codec but with some crazy thing replacing some less important stage”. It’s not like they are the only company doing this but it’s probably the only one leaving you with “how did they manage to come up with that idea?” question and VP3 is a very good example of that.

First of all, VP3 has an unusual block clustering: 8×8 blocks are grouped into 16×16 macroblocks and into 32×32 superblocks; blocks in superblocks are walked in Hilbert pattern but macroblocks in superblocks use zigzag pattern. Except that when you have four motion vectors in a macroblock they are stored also in zigzag pattern. Oh, and superblocks are walked in raster format plane after plane. Macroblock having data for both luma and chroma? Leave that to other codecs.
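For those who have not met a Hilbert walk before, the sketch below generates one over a 4×4 grid using the textbook index-to-coordinate conversion. This is a generic curve: the starting corner and orientation used by the actual VP3 bitstream may differ, so treat the output as an illustration of the pattern and not as the codec’s block order. The name hilbert_d2xy is hypothetical.

```rust
/// Converts position `d` along a Hilbert curve into (x, y) coordinates
/// on an n x n grid, where n is a power of two (n = 4 for a superblock
/// of 4x4 blocks). Generic textbook conversion, not VP3-specific.
fn hilbert_d2xy(n: u32, d: u32) -> (u32, u32) {
    let (mut x, mut y, mut t) = (0u32, 0u32, d);
    let mut s = 1;
    while s < n {
        let rx = 1 & (t / 2);
        let ry = 1 & (t ^ rx);
        // Rotate or flip the quadrant so consecutive cells stay adjacent.
        if ry == 0 {
            if rx == 1 {
                x = s - 1 - x;
                y = s - 1 - y;
            }
            std::mem::swap(&mut x, &mut y);
        }
        x += s * rx;
        y += s * ry;
        t /= 4;
        s *= 2;
    }
    (x, y)
}
```

Calling it with d from 0 to 15 visits every cell of the 4×4 grid exactly once, and each step moves to a horizontally or vertically adjacent cell, which is the property that makes the pattern so different from raster or zigzag order.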

Then we have another feature familiar from TM2 times: data is grouped by type. First you have superblock information (intra/skip/inter), then macroblock information (which kind of motion it uses), then motion vectors and finally block coefficients.

Speaking of motion vectors, there are four features related to them that make these codecs different. First, motion vector prediction uses last/second last motion vector (in the order of decoding) as the base instead of median prediction in other codecs (this scheme will live up until VP9 with some modifications; I guess it’s done so because of the scan order but who knows). Second, motion interpolation is done as averaging two pixels—and for (½,½) case you average pixels on diagonal, which one of two depends on motion vector direction (averaging all four pixels? who would do that?!). Third, the introduction of golden frame as an alternative reference frame (don’t confuse it with altref frame introduced in VP8). This one is probably done to avoid B-frames that were patented at the time (at least that’s what people think). Fun fact: in VP31-VP5 golden frame is selected as last intra frame, in later codecs it can be selected with a special bit or even partially updated but in VP30 any frame with low enough quantiser automatically becomes new golden frame. And fourth, VP4 moved the loop filtering to motion compensation process so the reference picture does not have its edges filtered but when you perform motion compensation you apply it on source block edges using the current strength. This scheme remained until VP7 where they moved to the usual in-loop deblocking again (also it’s fun to encounter blocky intra frame image that gets smoothed with the following frames).

Now the block coefficients coding. VP3-VP9 used essentially the same scheme: decode special token that tells you what you have—a run of end-of-block flags, a run of zeroes, some small non-zero value or a larger value falling into certain range. Then you decode trailing bits if needed and expand token to form coefficient block. For some (error resiliency?) reasons VP3 had those tokens stored by coefficient number for all blocks (with some skips if zero run was coded) while VP4 had them grouped by block.

I should also mention DC prediction here. For obvious reasons it’s not median predicted either but rather calculated as weighted sum of neighbour block DCs in VP3 or “if you have two neighbour values available take their average, otherwise use the last predicted value” in VP4.
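
The VP4 rule quoted above can be sketched in a few lines of Rust. This is only an illustration of the shape of the rule: the function name, the use of Option for unavailable neighbours and the rounding of the average are my assumptions, and it says nothing about which neighbours the real codec looks at.

```rust
/// Hypothetical helper illustrating the VP4-style rule from the text.
/// `a` and `b` are the DC values of two neighbouring blocks (None = unavailable),
/// `last` is the last predicted DC value.
fn predict_dc_vp4(a: Option<i16>, b: Option<i16>, last: i16) -> i16 {
    match (a, b) {
        // Both neighbours are available: take their average.
        // Widened to i32 so the sum cannot overflow; rounding is an assumption.
        (Some(x), Some(y)) => ((i32::from(x) + i32::from(y)) / 2) as i16,
        // Anything else: fall back to the last predicted value.
        _ => last,
    }
}
```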

And final pet peeve is the DCT they used in VP3-VP6. While it’s good to have clearly defined integer DCT instead of a mess with different DCT implementations in H.263 / MPEG-4 ASP era, they decided to use transform coefficients in range 12785-64277 so essentially you have to multiply signed 16-bit input coefficient by unsigned 16-bit transform coefficient (and discard low 16 bits immediately). Now realize you have SIMD instruction for either signed*signed->take high or unsigned*unsigned->take high operations and not for this case. Sigh.
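
For the record, the usual way around it is to reinterpret the constant as signed and add the input back when the constant does not fit into 15 bits. Here is a scalar sketch of that trick in Rust; the function names are made up for illustration and mul_hi_signed merely models what one lane of a signed "multiply, take high half" instruction would compute.

```rust
/// What we actually want: signed 16-bit input times unsigned 16-bit constant,
/// low 16 bits discarded.
fn mul_hi_ref(x: i16, c: u16) -> i32 {
    (i32::from(x) * i32::from(c)) >> 16
}

/// Model of one lane of a signed*signed -> take high half operation.
fn mul_hi_signed(x: i16, c: i16) -> i16 {
    ((i32::from(x) * i32::from(c)) >> 16) as i16
}

/// The same result obtained with the signed-only operation.
fn mul_hi_emulated(x: i16, c: u16) -> i32 {
    // Reinterpreting c as i16 turns any c >= 32768 into c - 65536.
    let hi = i32::from(mul_hi_signed(x, c as i16));
    if c >= 0x8000 {
        // x * (c - 65536) >> 16 is short by exactly x, so add it back.
        hi + i32::from(x)
    } else {
        hi
    }
}
```

So every multiplication by one of the larger constants costs an extra addition, which is exactly the annoyance.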

VP5

The main difference of VP5 from VP4 is the support for interlaced coding mode. And maybe also the new binary range coder (named bool coder) that remained in use all the way up to VP9.

So now all non-binary data in the frame is coded using trees with fixed probabilities (i.e. you read a bit with the probability stored in the node and if it’s zero you take the left branch, otherwise you take the right branch). Those probabilities might be constant or set to some new values at the beginning of the frame.
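
A rough Rust sketch of such a tree walk is below. The BoolSource trait stands in for the bool coder and is a hypothetical interface, as are the Node layout and the ScriptedBits helper (which ignores probabilities and just replays given bits so the walk can be tried without a real bitstream).

```rust
/// Stand-in for the bool coder (hypothetical interface): returns one bit,
/// given the probability (out of 256) of that bit being zero.
trait BoolSource {
    fn read_bool(&mut self, prob: u8) -> bool;
}

/// Tree node: a leaf carrying the decoded value, or an inner node with its
/// own probability and the indices of its two children.
#[derive(Clone, Copy)]
enum Node {
    Leaf(u8),
    Branch { prob: u8, zero: usize, one: usize },
}

/// Walks the tree from the root (index 0) until a leaf is reached.
fn read_tree<B: BoolSource>(src: &mut B, tree: &[Node]) -> u8 {
    let mut idx = 0;
    loop {
        match tree[idx] {
            Node::Leaf(val) => return val,
            Node::Branch { prob, zero, one } => {
                // Zero bit takes the left (zero) branch, one bit the right one.
                idx = if src.read_bool(prob) { one } else { zero };
            }
        }
    }
}

/// Test double: ignores the probability and replays scripted bits.
struct ScriptedBits {
    bits: Vec<bool>,
    pos: usize,
}

impl BoolSource for ScriptedBits {
    fn read_bool(&mut self, _prob: u8) -> bool {
        let bit = self.bits[self.pos];
        self.pos += 1;
        bit
    }
}
```

Updating probabilities at the beginning of the frame would then simply mean rewriting the prob fields before decoding.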

Frame data still contains macroblock information first and coefficient data last.

Motion vectors are predicted using nearest and second nearest (called simply near) motion vectors from already decoded macroblocks scanned in certain order. Also the information about found prediction candidates is used as one of the context variables used to select some probability set in decoding process.

DC prediction is a bit weird and it’s easier to describe it in the form “you have a special cache for top/left DC values and you use them for prediction” except that you have an additional special case for chroma in the first macroblock.

VP6

There are several things that got changed from VP5, mainly coefficient data location and coding method and motion compensation. Also now you can signal that you want this particular inter frame to become new golden frame. And you can enjoy new alpha mode which is coded essentially as a separate frame after the first one but with just one plane.

First, now there are two coefficient ordering modes: the old “MB info first, coefficients later” and the mode where macroblock information interleaves coefficient data.

Second, now you have Huffman coding for coefficient data. You take the original tree with probabilities, calculate weights for each leaf and construct new Huffman tree that might be completely different from the original. And then you decode data by reading macroblock information with bool coder from one place and variable-length codes for DCT tokens from another.
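
To illustrate how the resulting tree can differ from the original, here is a small Rust sketch under my own assumptions: the Node layout, the 16-bit weight scale and the clamping of weights to at least one are all made up for the example and not taken from the actual codec. It computes a weight for every leaf from the branch probabilities and then runs a textbook Huffman construction that returns code lengths.

```rust
/// Hypothetical tree layout: `prob` is the chance (out of 256) of the zero branch.
#[derive(Clone, Copy)]
enum Node {
    Leaf(usize),
    Branch { prob: u8, zero: usize, one: usize },
}

/// Gives every leaf a weight proportional to the chance of reaching it.
fn leaf_weights(tree: &[Node], idx: usize, weight: u32, out: &mut [u32]) {
    match tree[idx] {
        // Clamp to 1 so that no symbol ends up with zero weight (an assumption).
        Node::Leaf(sym) => out[sym] = weight.max(1),
        Node::Branch { prob, zero, one } => {
            let p = u32::from(prob);
            leaf_weights(tree, zero, (weight * p) >> 8, out);
            leaf_weights(tree, one, (weight * (256 - p)) >> 8, out);
        }
    }
}

/// Textbook Huffman construction returning the code length of each symbol.
fn huffman_lengths(weights: &[u32]) -> Vec<u8> {
    let mut lengths = vec![0u8; weights.len()];
    // Each group is (total weight, symbols belonging to that subtree).
    let mut groups: Vec<(u64, Vec<usize>)> = weights
        .iter()
        .enumerate()
        .map(|(sym, &w)| (u64::from(w), vec![sym]))
        .collect();
    while groups.len() > 1 {
        // Sort in descending order so the two lightest groups are at the end.
        groups.sort_by(|a, b| b.0.cmp(&a.0));
        let (w1, s1) = groups.pop().unwrap();
        let (w2, mut s2) = groups.pop().unwrap();
        s2.extend(s1);
        // Every symbol of the merged subtree moves one level deeper.
        for &sym in &s2 {
            lengths[sym] += 1;
        }
        groups.push((w1 + w2, s2));
    }
    lengths
}
```

With a lopsided tree where the first two symbols sit near the root but have low probabilities, those two end up with the longest codes while the deepest leaves get the shortest ones.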

Third, motion interpolation now uses either a special set of bicubic filter coefficients or simple bilinear interpolation. Also there’s a special mode for switching between interpolation methods depending on source block variance (i.e. if it’s greater than certain threshold then use bicubic interpolation, otherwise use bilinear interpolation). I don’t think this feature has been used after VP6 though.
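
The switching logic itself is trivial, as this Rust sketch shows. How exactly the codec measures variance and what the threshold is are not covered here, so block_variance, select_interp and the plain population variance they use are illustrative assumptions.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Interp {
    Bilinear,
    Bicubic,
}

/// Integer population variance of a block of 8-bit pixels:
/// E[x^2] - E[x]^2 computed as (n * sum_sq - sum^2) / n^2.
fn block_variance(pixels: &[u8]) -> u64 {
    let n = pixels.len() as u64;
    if n == 0 {
        return 0;
    }
    let sum: u64 = pixels.iter().map(|&p| u64::from(p)).sum();
    let sum_sq: u64 = pixels.iter().map(|&p| u64::from(p) * u64::from(p)).sum();
    (n * sum_sq - sum * sum) / (n * n)
}

/// Busy source blocks get the bicubic filter, flat ones get bilinear.
fn select_interp(src_block: &[u8], threshold: u64) -> Interp {
    if block_variance(src_block) > threshold {
        Interp::Bicubic
    } else {
        Interp::Bilinear
    }
}
```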

Also it’s worth noting that now VP6 can change block scan per frame (probably it improves compression a bit by eliminating or shortening some zero runs).

Another fun fact is that depending on container (AVI or FLV) VP6 picture might be coded upside-down or downside-up.

AVC

My favourite audio codec. Essentially it’s a simplified AAC-LC rip-off (just bands and coefficients, no noise codebooks or pulses or TNS) except for the special frame mode where you can have half of the frame or the whole frame coded with a special mode which is essentially some arbitrarily selected subbands that should be merged together in a certain order to reconstruct audio. I have an idea of how it all works but I don’t want to debug my decoder yet.

VP7

The codec is not like H.264 at all: H.264 has plane prediction mode and VP7 has TrueMotion prediction mode. There is one thing though introduced in VP7 and dropped in VP8 (and resurrected in some form in VP9) called features (there’s also special frame fading mode but hardly anybody cares about that). Features is an alternative mode that may be present for some macroblocks: different quantiser, different deblocking strength, a flag to signal this macroblock should be used to update golden frame and special block drawing mode (related to interlacing but not quite). There are up to four possible feature values where it makes sense (i.e. not for golden frame update flag).

Last feature (called pitch) defines how block coefficients should be put and how motion compensation should be performed. So you can put decoded coefficients in interlaced mode or even doubly interlaced mode (i.e. using every fourth line instead of every second). Motion compensation has these modes too and more: you can get 4×4 block from 16×1 line or from a slanted block (i.e. every next line starts one pixel earlier/later than the previous one).
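
To make the addressing more tangible, here is a Rust sketch that fetches a 4×4 block using the layouts mentioned above. The Layout enum, its variant names and the exact pixel order inside the 16×1 and slanted cases are my assumptions for illustration, not the actual VP7 definitions.

```rust
/// Hypothetical set of block layouts named after the description in the text.
#[derive(Clone, Copy)]
enum Layout {
    Normal,           // consecutive lines
    Interlaced,       // every second line
    DoublyInterlaced, // every fourth line
    Line16x1,         // 16 pixels of a single line split into four rows
    Slanted(isize),   // each next line starts `step` pixels later (or earlier)
}

/// Fetches a 4x4 block whose top-left pixel is at (x, y).
/// The caller must make sure all addressed pixels lie inside the plane.
fn fetch_4x4(plane: &[u8], stride: usize, x: usize, y: usize, layout: Layout) -> [[u8; 4]; 4] {
    let mut blk = [[0u8; 4]; 4];
    for row in 0..4 {
        for col in 0..4 {
            let (sx, sy) = match layout {
                Layout::Normal => (x + col, y + row),
                Layout::Interlaced => (x + col, y + row * 2),
                Layout::DoublyInterlaced => (x + col, y + row * 4),
                Layout::Line16x1 => (x + row * 4 + col, y),
                Layout::Slanted(step) => {
                    let shifted = x as isize + col as isize + step * row as isize;
                    (shifted as usize, y + row)
                }
            };
            blk[row][col] = plane[sx + sy * stride];
        }
    }
    blk
}
```

Putting decoded coefficients back would use the same address calculation in the opposite direction.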

Another characteristic of VP7 is being evolved rather than designed. There are several places in the codec where you can safely claim they simply wrote some code (maybe with some bugs) and relied on its behaviour instead of making the code follow some principle. Below are some examples.

Motion vector candidates search may get wrong macroblock coordinates. Here are the words of Peter Ross from his VP7 decoder:

The vp7 reference decoder uses a padding macroblock column (added to right edge of the frame) to guard against illegal macroblock offsets. The algorithm has bugs that permit offsets to straddle the padding column.

Inter DC prediction for DC superblock that says “if three previously decoded DCs were the same then you should use it for prediction” is fine but why should you keep the history from the last frame? I understand it might improve compression if you have the same value for the whole previous frame but it still looks a bit strange.

Spatial (intra) prediction also behaves counter-intuitively. In 4×4 prediction mode when top right block is not available the bottom of macroblock right above should be used instead. And when it’s the last block in row then top right prediction is the replicated pixel from the top macroblock as well. This is hard to explain from codec design perspective but easy from implementer’s point of view: you have top pixels line cached and you update it after you decode the block (so if the data is unavailable you use last decoded data here instead of replicating last available pixel like in H.264).

Conclusion and final thoughts

I hope I was able to demonstrate in this post that Duck codecs have an element of originality but quite often they go so far in originality that you start wondering why they were doing it like that. While some of it might be down to patent workarounds, other things show that in some cases they were fiddling with the code instead of trying proper ideas first and implementing the codec after the idea (no, the idea “let’s use codec X as the base” does not count).

Also while I’m not going to deal with VP8 and VP9 unless I really have to, I can say that the people behind Duck codecs developing AV1 is both a good and a terrible thing. Good because they know how to propose stuff that looks different but still works similarly to some conventional codec. Terrible because they still don’t know how to design a codec properly: not by writing some ad hoc code that does something but rather by gathering ideas, evaluating them and only after that implementing them. I heard the story that shortly before releasing VP8 to the public Baidu actually showed it to some opensource multimedia people and asked for their opinions and input; somebody (from Xiph IIRC) found a design flaw but it was left unfixed because the encoder relied on it as well and they were reluctant to change it.

AV1.0 Errata 1 shows similar design problems partly for the same reasons and I don’t expect AV2 to be conceptually better. Especially after hearing rumours that Baidu is working on it already probably to force mostly complete work on AOM so the codec is ready by the same time as H.266 (or MPEG/VVC as they say it in Italy). And since most opensource multimedia people are working on AV1 nowadays, the chances of some competitor appearing are slim. So don’t ask questions, just consume AV1 and then get excited for AV2.