Archive for April, 2019

Bink2: some words about loop filter

Sunday, April 14th, 2019

Since obviously I have nothing better to do, here’s a description of loop filter in Bink2 as much as I understand it (i.e. not much really).

First, the loop filter makes decision on two factors: motion vector difference between adjacent blocks is greater than two or it selects filter strength depending on number of coefficients coded in the block (that one I don’t remember seeing before). The filter is the same in all cases (inter/intra, luma/chroma, edge/inside macroblock), only the number of pixels filtered varies between zero and two on each side (more on that later). This is nice and elegant design IMO.

Second, filtering is done after each macroblock, horizontal edges first, vertical edges after that—but not necessarily for all macroblocks. Since `BIKi` or `BIKj` encoder can signal “do not deblock macroblocks in these rows and columns” by transmitting set of flags for columns and rows.

Third, in addition to normal filtering decoder can do something that I still don’t understand but it looks like whole-block overlapping in both directions (and it is performed in actual decoding but I don’t know what happens with the result of it).

And the filter itself is not that interesting (assuming we filter `buf[0] buf[1] | buf[2] buf[3]`:

```    diff0 = buf[2] - buf[1] + 8 >> 4;
diff1 = diff0 * 4 + 8 >> 4;
if (left_strength >= 2)
buf[0] = clip8(buf[0] + diff0);
if (left_strength >= 1)
buf[1] = clip8(buf[1] + diff1);
if (right_strength >= 1)
buf[2] = clip8(buf[2] - diff1);
if (right_strength >= 2)
buf[3] = clip8(buf[3] - diff0);
```

Strength is determined like this: 0 — more than 8 coefficients coded, 1 – MV difference or 4-7 coefficients coded, 2 — 1-3 coefficients coded, 3 — no coefficients coded.

Overall, the loop filter is nice and simple if you ignore the existence of some additional filter functions and very optimised implementation that is not that much fun to untangle.

Update: the alternative function seems to be some kind of block reconstruction based on DCs. In case it’s intra block with less than four coefficients coded it will take all neighbouring DCs, select those not differing by more than a frame-defined threshold and smooth the differences. I still don’t understand its purpose in full though.

NihAV: now with TM2X support!

Thursday, April 11th, 2019

I’m proud to say that NihAV got TrueMotion 2X support. For now only intra frames are supported but 75% of the samples I have (i.e. three samples) have just intra frames. At least I could check that it works as supposed.

First, here’s codec description after I managed to write a working decoder for it. TrueMotion 2X is another of those codecs that’s closer to TrueMotion 1 in design. It still uses the same variable-length codebook instead of Huffman coding (actually only version 5 of this codec uses bit reading for anything). It also uses “apply variable amount of deltas per block” approach but instead of old fixed scheme it now defines twenty-something coding approaches and tells decoder which ones to use in current frame. That is done because block size now can be variable too (but it’s always 8 in all files I’ve seen). And blocks are grouped in tiles (usually equivalent to one row of blocks but again, it may vary). The frame data obfuscation that XORs chunks inside the frame with a 32-bit key derived in a special way is not worth mentioning.

Second, the reference is quite peculiar too. It decodes frame data by filling an array of pointers to the functions that decode each line segment with proper mode, move to the next line and repeat. And those functions are in handwritten assembly—they use stack pointer register for decoder context pointer (that has original `ESP` saved somewhere inside), which also means they do not use stack space for anything and instead of returning they simply jump to the next routine until the final one restores the stack and returns properly. Thankfully Ghidra allows to assign context argument to `ESP` and while decompile still looks useless, assembly has proper references in the form `mov EDX, dword ptr [ctx->luma_pred + ESP]`.

And finally, I could not check what binary specification really does because MPlayer could not run it. At first I tried running working combination of WMP+Win98 under OllyDbg in QEMU but it was painfully slow and even more painful to look at the memory state. In result I’ve managed to run TM2X decoder in MPlayer which then served as a good reference. The trick is that you should not try to run `tm2X.dll` (it’s really hopeless) but rather to take `tm2Xdec.ax` (or deceptively named `tm20dec.ax` from the same distribution that can handle TM2X unlike its earlier versions), patch one byte for check in DLL init and it works surprisingly well after that.

So what’s next? Probably I’ll just add missing features for the second TM2X sample (the other two samples are TM2A), maybe add Bink2 deblocking feature—since I’d rather have that decoder complete—and move to improving overall `NihAV` design. Frame management needs proper rework before I add more codecs—I want to change into a thread-safe version before I add more decoders. Plus I’ll need to add some missing bits for a player. There’s a lot of work still to do but I’m pleased that I still managed to do something.

BMV: Complete!

Thursday, April 4th, 2019

So `NihAV` finally got Discworld Noir BMV support and I’ve tested it on all samples from the game to see if it works correctly. Here’s a sample frame:

(I still remember the song Samael plays there and have it somewhere ripped in its full MPEG audio layer II glory).

Now I want to talk about the format since it’s quite different from anything else. BMV used in Discworld II was simple but with two quirks: it employed integer coding using variable amount of nibbles (that were read as bytes but a nibble could be saved and used later) and it could decode frame either from the beginning to end or from end to beginning (reading frame data from the end too!). DW3 BMV is even stranger and let’s start with audio part. Audio codec is very simple: you have 41-byte block with one byte signalling which quantised values tables should be used for both channels and 32 indices for each channel packed into 16-bit words. The main peculiarity is that data is aligned to 16-bit and mode byte can be either in the beginning or at the end of the block. That’s a bit unusual but not strange. Well, it turns out it aligns for the absolute position in a file so my demuxer has to signal whether audio data was at even or odd position. And video is even stranger.

As I wrote previously, video codec is 16-bit now and still employs nibble variable integer coding and copy/repeat/new pixels mode. Luckily there is no backwards decoding mode yet the codec is tricky without it. First of all, where previously there were just three plain modes now we have combinations of those with bytes or nibbles signalling what should be done (i.e. copy/repeat/put new pixels fixed amount of times and then do the other operation another fixed amount of times). And they have different meaning depending on what was the last operation (copy/repeat/put for fixed amount or with arbitrary large one). And if there’s a nibble left unread after last operation or not. But that’s not all! While previously reading new pixels meant just reading a byte, 16-bit pixels can be compressed a bit more. In result we read 1-3 bytes per pixel: first read index byte, remap it, if it’s in range `00..F7` then return pixel in an array, if it’s in range `F8..FE` then read another byte and use it as an index in one of seven secondary “palettes”; if it’s `FF` then simply read explicit 2-byte value from the stream. The reference simply used an array pointing to the various functions performing this. And of course palettes can be updated in the beginning of each frame.

Surprisingly, decoder implementation takes about 28kB with a quarter of it being tables. That’s for both audio and video decoder. This is on par with other game decoders (GDV, Smacker and VMD) and feels significantly smaller than the reference (which is about 30kB in a stripped assembled `.o` file and over 200kB as assembly).

Overall it was hard to comprehend and tricky to debug too. Nevertheless now it’s over and I can probably move to TrueMotion 2X. Or whatever I decide to do when I’m bored enough.