Reviewing AV1 Features

March 21st, 2020

Since we have this wonderful situation in Europe and I need to stay at home why not do something useless and comment on the features of AV1 especially since there’s a nice paper from (some of?) the original authors is here. In this post I’ll try to review it and give my comments on various details presented there.

First of all I’d like to note that the paper has 21 author for a review that can be done by a single person. I guess this was done to give academic credit to the people involved and I have no problems with that (also I should note that even if two of fourteen pages are short authors’ biographies they were probably the most interesting part of paper to me).
Read the rest of this entry »

NihAV: Janitoring

February 22nd, 2020

For last couple of weeks I’ve been working on documenting and restructuring NihAV. In result I’ve documented every public thing in my crates (except H.263 decoder skeleton but I need to need to debug and maybe rework it anyway) and NihAV have final crate structure.

Speaking about crate structure, modern languages often suffer from npm.js syndrome—when almost any trivial action has a separate package and most of the packages consist of imports from other packages. The other extremity would be to have two or three monolithic libraries with everything. I don’t think there’s a perfectly balanced solution so I split features using a few principles and I’ll stick to the scheme:

  1. nihav-core—the basis structure definitions like frame, packet, demuxer and decoder interfaces etc etc and utility code that should be used by both crates implementing NihAV format support and various users (like my own decoding tool and player);
  2. nihav-registry contains essentially three things: codec descriptions, codec mapping from e.g. FOURCC to codec name used by NihAV (IMO it’s better to use a string as codec identifier instead of arbitrary number that may or may be not recognized by the different version of the library) and container detection code (i.e. something like what file utility on UNIX does). This functionality can belong to nihav-core but it’s expected to be updated way more often than the base code so I decided to finally split it out;
  3. nihav-codec-support contains various pieces of code and data that are reused by many various decoders. It is intended just for decoders and has such bits as functions for testing decoder on some file, the skeleton for H.263 decoder (just add some functions for parsing headers and your new decoder is ready), motion compensation code, audio DSP bits (including FFT) and more;
  4. various crates that cover codec families and related containers: nihav-commonfmt for AVI and codecs like AAC; nihav-duck, nihav-indeo, nihav-rad and nihav-realmedia for supporting corresponding codec families with e.g. Bink or RealMedia demuxers as well; and nihav-game for supporting various codecs from various games with their unique demuxers;
  5. and finally nihav-allstuff that simply re-exports decoder and demuxer registrations in single nihav_register_all_codecs() and nihav_register_all_demuxers(). Also it has a test to check that all registered decoders have codec description in nihav-registry but nobody beside me should care about that.

Now with all of this done at last I can return to polishing other decoders which I still find more pleasant than documenting.

General overview of Duck codecs and their design

February 15th, 2020

I’ve finally finished polishing out decoders for all Duck codecs (before it was bought by Baidu) and now they all seem to work fine (except AVC, that one can wait for later—much much later). And while I moved to even more hairier and painful tasks (reorganising nihav-core and even documenting it) now, as I have full understanding how those codecs work, I can give an overview of their design (not the bit-by-bit description of the format, we have The Wiki for that but rather most notable features and similarities to other codecs) and form my opinion on them.

TrueMotion 1

Somehow this might be their most original codec. While it’s simple codec with delta prediction I can’t remember any other codec that used a variable-length codebook with byte indices. Also this is the only codec in the family that works with RGB (16- and 24-bit modes even; the rest of codecs use YUV).

TrueMotion RT

This one is a trivial codec for real-time video capturing (hence the name) that codes deltas with fixed quantisation scheme (2, 3 or 4 bits deltas with predefined step sizes).

TrueMotion 2

This codec is still based on delta coding but now instead of working with individual pixels it works with 4×4 blocks that can have different amount of deltas and even employ motion compensation (instead of coding deltas). Also the data is separated into different streams and each of them is Huffman coded.

The approach with coding different kinds of information in separate chunks will be used in later codecs as well.

TrueMotion 2X

TrueMotion 2X is some weird amalgamation of TrueMotion 1 and TrueMotion 2. It works with 8×8 blocks that may have different amount of deltas like TM2 and information is grouped into chunks like TM2 but it uses variable codebook approach from TM1.

The main distinguishing features of this codec though are having multiple chunk variants for holding the same data and obfuscating data using XORing with 32-bit key derived from a key stored in a frame by passing it through LSFR a couple of times. IIRC frame data also contains the name of person owning the copy of the program so it might be some kind of protection scheme but it looks dubious at best.

3- and 4-bit ADPCM

As you can guess these codecs are based on DVI ADPCM (4-bit variant is essentially IMA ADPCM with different block header), 3-bit variant simply expands three deltas into four samples by interpolating coded differences (which has been done by other formats as well but I don’t remember which ones).

VP3-VP4

Starting with this format Duck moved to the codec design approach which I can describe as “make an equivalent of some existing codec but with some crazy thing replacing some less important stage”. It’s not like they are the only company doing this but it’s probably the only one leaving you with “how did they manage to come up with that idea?” question and VP3 is a very good example of that.

First of all, VP3 has an unusual block clustering: 8×8 blocks are grouped into 16×16 macroblocks and into 32×32 superblocks; blocks in superblocks are walked in Hilbert pattern but macroblocks in superblocks use zigzag pattern. Except that when you have four motion vectors in a macroblock they are stored also in zigzag pattern. Oh, and superblocks are walked in raster format plane after plane. Macroblock having data for both luma and chroma? Leave that to other codecs.

Then we have another feature familiar from TM2 times: data is grouped by type. First you have superblock information (intra/skip/inter), then macroblock information (which kind of motion it uses), then motion vectors and finally block coefficients.

Speaking of motion vectors, there are four features related to them that make these codecs different. First, motion vector prediction uses last/second last motion vector (in the order of decoding) as the base instead of median prediction in other codecs (this scheme will live up until VP9 with some modifications; I guess it’s done so because of the scan order but who knows). Second, motion interpolation is done as averaging two pixels—and for (½,½) case you average pixels on diagonal, which one of two depends on motion vector direction (averaging all four pixels? who would do that?!). Third, the introduction of golden frame as an alternative reference frame (don’t confuse it with altref frame introduced in VP8). This one is probably done to avoid B-frames that were patented at the time (at least that’s what people think). Fun fact: in VP31-VP5 golden frame is selected as last intra frame, in later codecs it can be selected with a special bit or even partially updated but in VP30 any frame with low enough quantiser automatically becomes new golden frame. And fourth, VP4 moved the loop filtering to motion compensation process so the reference picture does not have its edges filtered but when you perform motion compensation you apply it on source block edges using the current strength. This scheme remained until VP7 where they moved to the usual in-loop deblocking again (also it’s fun to encounter blocky intra frame image that gets smoothed with the following frames).

Now the block coefficients coding. VP3-VP9 used essentially the same scheme: decode special token that tells you what you have—a run of end-of-block flags, a run of zeroes, some small non-zero value or a larger value falling into certain range. Then you decode trailing bits if needed and expand token to form coefficient block. For some (error resiliency?) reasons VP3 had those tokens stored by coefficient number for all blocks (with some skips if zero run was coded) while VP4 had them grouped by block.

I should also mention DC prediction here. For obvious reasons it’s not median predicted either but rather calculated as weighted sum of neighbour block DCs in VP3 or “if you have two neighbour values available take their average, otherwise use the last predicted value” in VP4.

And final pet peeve is the DCT they used in VP3-VP6. While it’s good to have clearly defined integer DCT instead of a mess with different DCT implementations in H.263 / MPEG-4 ASP era, they decided to use transform coefficients in range 12785-64277 so essentially you have to multiply signed 16-bit input coefficient by unsigned 16-bit transform coefficient (and discard low 16 bits immediately). Now realize you have SIMD instruction for either signed*signed->take high or unsigned*unsigned->take high operations and not for this case. Sigh.

VP5

The main difference of VP5 from VP4 is the support for interlaced coding mode. And maybe also new binary range coder (named bool coder) that’s been in use even in VP9.

So now all non-binary data in the frame is coded using trees with fixed probabilities (i.e. you read bit with probability stored in the node and it’s zero take left branch, otherwise take right branch). Those probabilities might be constant or set to some new values at the beginning of the frame.

Frame data still contains macroblock information first and coefficient data last.

Motion vectors are predicted using nearest and second nearest (called simply near) motion vectors from already decoded macroblocks scanned in certain order. Also the information about found prediction candidates is used as one of the context variables used to select some probability set in decoding process.

DC prediction is a bit weird and it’s easier to describe it in the form “you have a special cache for top/left DC values and you use them for prediction” except that you have an additional special case for chroma in the first macroblock.

VP6

There are several things that got changed from VP5, mainly coefficient data location and coding method and motion compensation. Also now you can signal that you want this particular inter frame to become new golden frame. And you can enjoy new alpha mode which is coded essentially as a separate frame after the first one but with just one plane.

First, now there are two coefficient ordering modes: the old “MB info first, coefficients later” and the mode where macroblock information interleaves coefficient data.

Second, now you have Huffman coding for coefficient data. You take the original tree with probabilities, calculate weights for each leaf and construct new Huffman tree that might be completely different from the original. And then you decode data by reading macroblock information with bool coder from one place and variable-length codes for DCT tokens from another.

Third, motion interpolation now uses either a special set of bicubic filter coefficients or simple bilinear interpolation. Also there’s a special mode for switching between interpolation methods depending on source block variance (i.e. if it’s greater than certain threshold then use bicubic interpolation, otherwise use bilinear interpolation). I don’t think this feature has been used after VP6 though.

Also it’s worth noting that now VP6 can change block scan per frame (probably it improves compression a bit by eliminating or shortening some zero runs).

Another fun fact is that depending on container (AVI or FLV) VP6 picture might be coded upside-down or downside-up.

AVC

My favourite audio codec. Essentially it’s simplified AAC-LC rip-off (just bands and coefficients, no noise codebooks or pulses or TNS) except for the special frame mode where you can have half of the frame or the whole frame coded with special mode which is essentially some arbitrarily selected subbands that should be merged together in certain order to reconstruct audio. I have the idea how it all works but I don’t want to debug my decoder yet.

VP7

The codec is not like H.264 at all: H.264 has plane prediction mode and VP7 has TrueMotion prediction mode. There is one thing though introduced in VP7 and dropped in VP8 (and resurrected in some form in VP9) called features (there’s also special frame fading mode but hardly anybody cares about that). Features is an alternative mode that may be present for some macroblocks: different quantiser, different deblocking strength, a flag to signal this macroblock should be used to update golden frame and special block drawing mode (related to interlacing but not quite). There are up to four possible feature values where it makes sense (i.e. not for golden frame update flag).

Last feature (called pitch) defines how block coefficients should be put and how motion compensation should be performed. So you can put decoded coefficients in interlaced mode or even doubly interlaced mode (i.e. using every fourth line instead of every second). Motion compensation has these modes too and more: you can get 4×4 block from 16×1 line or from a slanted block (i.e. every next line starts one pixel earlier/later than the previous one).

Another characteristic of VP7 is being evolved rather than designed. There are several places in the codec where you can safely claim they simply have written code (maybe with some bugs) and relied on its behaviour instead of making the code follow some principle. Below are some examples.

Motion vector candidates search may get wrong macroblock coordinates. Here are the words of Peter Ross from his VP7 decoder:

The vp7 reference decoder uses a padding macroblock column (added to right edge of the frame) to guard against illegal macroblock offsets. The algorithm has bugs that permit offsets to straddle the padding column.

Inter DC prediction for DC superblock that says “if three previously decoded DCs were the same then you should use it for prediction” is fine but why should you keep the history from the last frame? I understand it might improve compression if you have the same value for the whole previous frame but it still looks a bit strange.

Spatial (intra) prediction also behaves counter-intuitively. In 4×4 prediction mode when top right block is not available the bottom of macroblock right above should be used instead. And when it’s the last block in row then top right prediction is the replicated pixel from the top macroblock as well. This is hard to explain from codec design perspective but easy from implementer’s point of view: you have top pixels line cached and you update it after you decode the block (so if the data is unavailable you use last decoded data here instead of replicating last available pixel like in H.264).

Conclusion and final thoughts

I hope I was able to demonstrate in this post that Duck codecs have an element of originality but quite often they go so far in originality that you start wondering why they were doing it like that. While some of it might be because of the patent workarounds some things are showing that in some cases they were fiddling with the code instead of trying proper ideas first and implementing codec after the idea (no, idea “let’s use codec X as the base” does not count).

Also while I’m not going to deal with VP8 and VP9 unless I really have to, I can say that the people behind Duck codecs developing AV1 is both good and terrible thing. Good because they know how to propose stuff that looks different but still works similarly to some conventional codec. Terrible because they still don’t know how to design a codec properly—not writing some ad hoc code that does something but rather gather ideas, evaluate them and only after that implementing it. I heard the story that shortly before releasing VP8 to the public Baidu actually showed it to some opensource multimedia people and asked for their opinions and input; somebody (from Xiph IIRC) found a design flaw but it was left unfixed because the encoder relied on it as well and they were reluctant to change it.

AV1.0 Errata 1 shows similar design problems partly for the same reasons and I don’t expect AV2 to be conceptually better. Especially after hearing rumours that Baidu is working on it already probably to force mostly complete work on AOM so the codec is ready by the same time as H.266 (or MPEG/VVC as they say it in Italy). And since most opensource multimedia people are working on AV1 nowadays, the chances of some competitor appearing are slim. So don’t ask questions, just consume AV1 and then get excited for AV2.

Om marsipangrisorna

February 9th, 2020

Since I have nothing better to do (obviously) I want to talk about marzipan pig situation in Sweden.
Read the rest of this entry »

NihAV: Boring Details

December 14th, 2019

As I mentioned in the previous post, I’m polishing NihAV and improving some bits here and there. In this post I’d like to describe what has been done and what should be done (but not necessarily will be done).
Read the rest of this entry »

The Unfinished Puzzle from Monty Python and the Quest for the Holy Grail

November 27th, 2019

In the previous post I’ve mentioned that one of the levels contain an unfinished witch puzzle probably based on Match Two that also contains various photos of the (presumably team and their families (about a dozen of them actually). I also wanted to tell a bit more about it and since I’m working on boring parts of NihAV and it will take a while, why not tell the story now?
Read the rest of this entry »

Railways of Baden and Württemberg

November 17th, 2019

First of all, unlike Rheinland Palatine railways I’ve not travelled on all local railways here because of certain geographic peculiarities it takes less time to get to the farthest corners of Rheinland-Pfalz from here than to places around Bodensee. And I can’t visit all historic railways either because one of them is located in the middle of nowhere with the only connection being a bus that comes there maybe twice a day.

Nevertheless, I’ve seen most of them and here I’d like to describe my impressions about them.
Read the rest of this entry »

NihAV: the Last Quack

October 31st, 2019

Finally NihAV got full-feature VP7 decoding support (well, except one very exotic case for a very exotic mode) so now I can move to other things like actually making various decoders bit-exact, fixing other bugs in them, adding missing pieces of code for player and even documenting stuff. I hope to give a presentation of my work on VDD 2020 or FOSDEM 2021 (whichever accepts it) and I want to have something decent to present by then.

Anyway, here’s a review of VP7.
Read the rest of this entry »

Going to the 7th Level for the Grail!

September 28th, 2019

I’ve spent two days on something that I wanted to do long time ago but postponed because of various reasons including complexity. But now thanks to Ghidra I’ve finally managed to pry into resources of Monty Python and the Quest for the Holy Grail and decode videos (and more!) stored there.

Probably it was VP7 (again) that made me look into it, but since Ghidra has decompiler for 16-bit code as well, I’ve tried it on libraries of surprisingly small game engine (pity that ScummVM does not even plan to support it but I’m in a similar situation with codecs). And what do you know, they have a special library dealing with resource files that included unpacking images (but not audio—there’s audio library for that).

First, let’s talk about resource file organisation and why it baffled me for too long. The files are organised into several chunks: header that contains offsets of the other chunks at the end of it (around bytes 0xE2..0x141), then you have actual table of contents (more about it later), then some small binary data chunk, then list of strings related to file names, then some chunk that looks like game script (in binary form), then palette and finally another list of strings that looks like variables list. It is no wonder I could not do anything useful with it since actual table of contents is hidden somewhere near the end of file but not exactly at the end. Also each game scene corresponds to its own archive which makes it clearer what to expect there (previous version used in Monty Python’s Complete Waste of Time even has a description right in the header and table of contents stored right after it, I might look at it later but no promises).

Now, files in the resource archive. The catalogue has only file type, its size and offset in the archive and nothing else. There are more than a dozen file types, 1 being images (both background and sprites), 2 being movies (the thing I was hunting), 4 is for MIDI tracks, 9 is for digitised audio, the rest I don’t care about.

Let’s move to simple things like music and audio. Music is stored in its custom format with four bytes per command except when high bit is set (then it’s delta time for next command). Why so? Probably because it’s being played by sending single MIDI commands and at least on Windows it requires packing them into 32-bit word anyway. Audio is stored in files with some header and either raw form or with IMA ADPCM compression. The notable thing is that it does not use any multiplications in any form—instead it has precomputed table for all eight possible values for each step.

Images are another interesting topic. First of all, palette is stored as 944-byte file with 32-bit entries (so it’s just 236 colours) but somehow decoded files use it with bias of ten, i.e. decoded index 13 corresponds to palette entry 3, index 27 is for entry 17 etc etc. I have no idea what it does with first colours in the palette (though sometimes images are not colorised properly so maybe they need different palette bias or a different palette at all).
Second, images employ two different compression schemes: RLE and LZW. RLE is rather trivial except that it uses opcode zero to signal end of line (and double zero for end of image):

Obvious words skipped

LZW-compressed image splits image into chunks that may be uncompressed or compressed with LZW using 10, 11 or 12 bits per index. And after all these years I was still able to correctly guess it was LZW and write a decoder that worked fine without resorting to any documentation or even studying the decompiled code (I just needed to realize that it reads sequences of n-bits indexes and does something with them to output bytes).

And finally the movie format. Those are actually stored in two files—movie data and companion frame index (i.e. frame type, size and offset). After extracting frames I could not realize what to do with them until I noticed that frame type 2 all have the constant size of 11025 except for first few. Obviously they turned out to be IMA ADPCM compressed audio frames (and since they were the first frames in every movie it was hard to guess the format looking at semi-random data without header). Frame type 0 turned out to be the same image format as normal pictures.

Frame type 1 turned out to be inter-frames that use index 0 for pixels that should not be updated.

Unfortunately even if I now have a way to decode the files I cannot add direct support for them in NihAV since it requires at least three different files to play it (palette, frame index and frame data). At least now I know I can transcode them into something when the need arises.

Now let’s talk about Ghidra a bit. Without it this probably would have not happened since I’m not that young to spend time on translating tons of disassembly (and REC failed to do a good job). Ghidra support for 16-bit code is far from perfect with all that mess with far pointers consisting of two 16-bit words, external DLL functions linked by ordinals (so I had to match them by hoof and rename where possible) and general confusion with int treated as 4-byte in some contexts but 2-byte in another. And yet it did the job and helped me to understand how the game libraries work and that’s what really counts.

And as a bonus here’s a team photo extracted from concentration mini-game related to the witch scene (not the Simon Says clone with fires but probably Match Two clone that I’ve never seen in-game before):

MidiVid codec family

September 26th, 2019

VP7 is such a nice codec that I decided to distract myself a little with something else. And that something else turned out to be MidiVid codec family. It turned out to be quite peculiar and somehow reminiscent of Duck codecs.

The family consists of three codecs:

  1. MidiVid — the original codec based on LZSS and vector quantisation;
  2. MidiVid Lossless — exactly what is says on a tin, based on LZSS and bunch of other technologies;
  3. MidiVid 3 — a codec based on simplified integer DCT and single codebook for all values.

I’ve actually added MidiVid decoder to NihAV because it’s simple (two hundred lines including boilerplate and tests) and way more fun than working on VP7 decoder. Now I’ll describe them and hopefully you’ll understand why it reminds me of Duck codecs despite not being similar in design.

MidiVid

This is a simple hold-and-modify video codec that had been used in some games back in PS2/Xbox era. The frame data can be stored either unpacked or packed with LZSS and it contains the following kinds of data: change mask for 8×8 blocks (in case of interframe—if it’s zero then leave block as is, otherwise decode new data for it), 4×4 block codebook data (up to 512 entries), high bits for 9-bit indices (if we have 257-512 various blocks) and 8-bit indexes for codebook.

The interesting part is that LZSS scheme looked very familiar and indeed it looks almost exactly like lzss.c from LZARI author (remember that? I still do), the only differences is that it does not use pre-filled window and flags are grouped into 16-bit word instead of single byte.

MidiVid Lossless

This one is a special best as it combines two completely different compression methods: the same LZSS as before and something used by BWT-based compressor (to the point that frame header contains FTWB or ZTWB IDs).

I’m positively convinced it was copied from some BTW-based compressor not just because of those IDs but also because it seems to employ the same methods as some old BTW-based compressor except for the Burrows–Wheeler transform itself (that would be too much for the old codecs): various data preprocessing methods (signalled by flags in the frame header), move-to-front coding (in its classical 1-2 coding form that does not update first two positions that much) plus coding coefficients in two groups: first just zero/one/large using order-3 adaptive model and then values larger than one using single order-1 adaptive model. What made it suspicious? Preprocessing methods.

MVLZ has different kinds of preprocessing methods: something looking like distance coding, static n-gram replacement, table prediction (i.e. when data is treated as series of n-bit numbers and the actual numbers are replaced with the difference between previous ones) and x86 call preprocessing (i.e. that trick when you change function call address from relative into absolute for better compression ratio and then undo it during decompression; known also as E8-preprocessing because x86 call opcode is E8 <32-bit offset> and it’s easy to just replace them instead of adding full disassembler to the archiver). I had my suspicions as n-gram replacement (that one is quite stupid for video codecs and it only replaces some values with some binary values that look more related to machine code than video) but the last item was a dead give-away. I’m pretty sure that somebody who knows open-source BWT compressors of late 1990s will probably recognize it even from this description but sadly I’ve not been following it that closely being more attracted to multimedia.

MidiVid 3

This codec is based on some static codebook for packing all values: block types, motion vectors and actual coefficients. Each block in macroblock can be coded with one of four modes: empty (fill with 0x80 in case of intra), DC only, just few coefficients DCT, and full DCT. As usual various kinds of data are grouped and coded as single array.

Motion compensation is full-pixel and unlike its predecessor it operates in YUV420 format.


This was an interesting detour but I have to return back to failing to start writing VP7 decoder.

P.S. I’ll try to document them with more details in the wiki soon.
P.P.S. This should’ve been a post about railways instead but I guess it will have to wait.