Bink-b: Encoder

May 31st, 2019

Recently I’ve been contacted by some guy working on a mod campaign for Heroes of Might and Magic III. The question was about the encoder for videos there. And since the original one is not likely to be available, I just wrote a simple one that takes a PGMYUV image sequence and encodes it. Here’s the gzipped source.

It took a couple of evenings to do that mostly because I still have weak symptoms of creeping perfectionism (thankfully it’s treated with my laziness). BIKb does not have Huffman-coded bundles, so the simplest straightforward encoding would be: write the block type bundle (13-bit size and 4-bit elements), write the other bundles empty, write several bundles containing pixels and you’re done. There’s a proper approach: write a full-featured encoder that takes input in several formats and that encodes using all possible features, selecting the best quality for the target bitrate. There’s a hacky approach: translate later versions of Bink into BIKb (and then you remember that it has a different motion compensation scheme so this approach won’t work). I’ve chosen something simple yet with some effectiveness: write an encoder that employs only vector quantisation and motion compensation for non-overlapped blocks, plus add a quality setting so users can play with output size/quality if they really need it.
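To make the "13-bit size and 4-bit elements" layout a bit more concrete, here is a minimal Rust sketch of a bit writer that emits one such bundle. It is not taken from the encoder source: the names `BitWriter` and `write_bundle4` are made up, and the LSB-first bit order is an assumption of this sketch.

```rust
/// Bit writer that fills bytes starting from the least significant bit.
/// The bit order is an assumption of this sketch, not a fact from the post.
pub struct BitWriter {
    buf: Vec<u8>,
    acc: u64,
    nbits: u32,
}

impl BitWriter {
    pub fn new() -> Self {
        Self { buf: Vec::new(), acc: 0, nbits: 0 }
    }

    /// Appends the low `bits` bits of `val` (intended for `bits` <= 24).
    pub fn write(&mut self, val: u32, bits: u32) {
        self.acc |= u64::from(val) << self.nbits;
        self.nbits += bits;
        // Flush every complete byte.
        while self.nbits >= 8 {
            self.buf.push(self.acc as u8);
            self.acc >>= 8;
            self.nbits -= 8;
        }
    }

    /// Pads the last partial byte with zero bits and returns the data.
    pub fn finish(mut self) -> Vec<u8> {
        if self.nbits > 0 {
            self.buf.push(self.acc as u8);
        }
        self.buf
    }
}

/// Writes a bundle as a 13-bit element count followed by 4-bit elements
/// (hypothetical helper following the description in the text).
pub fn write_bundle4(bw: &mut BitWriter, elems: &[u8]) {
    bw.write(elems.len() as u32, 13);
    for &e in elems {
        bw.write(u32::from(e & 0xF), 4);
    }
}
```

With this helper an "empty" bundle is just a zero count, i.e. `write_bundle4(&mut bw, &[])`.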

So how does the block encoding work? Block truncation coding, the fast and good way to quantise a block into two colours (many video codecs back in the day used it and only some dared to use vector quantisation for more than two different values per block). Essentially you just calculate the average pixel value and select two values depending on how many pixels in the block are larger than the average and by how much they deviate. And here’s where the quality parameter comes into play: depending on it the encoder sets the threshold above which a block is coded as is (aka full mode) instead of as two colours and the pattern in which they occur (of course if it’s a solid-colour fill it’s always coded as such). As I said, it’s simple but quite effective. Motion compensation is currently lossless, i.e. the encoder will try to find only a block that matches exactly (again, it can be improved but that would only lead to longer implementation times and even longer debugging times). This makes me appreciate the work on Smacker and Bink 1 video codecs and encoders for them even more.
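The decision described above can be sketched roughly like this in Rust. This is an illustration under stated assumptions, not the code from the encoder: it assumes an 8x8 block, uses the plain means of the "below or equal to average" and "above average" groups as the two colours (a common simplification of block truncation coding), and measures the error as a sum of squared differences. `BlockCoding`, `choose_coding` and `max_err` are made-up names.

```rust
/// Possible outcomes for one 8x8 block (hypothetical names).
#[derive(Debug, PartialEq)]
pub enum BlockCoding {
    /// All pixels are equal.
    Fill(u8),
    /// Two colours plus a 64-bit pattern; bit i is set when pixel i uses `hi`.
    TwoColour { lo: u8, hi: u8, mask: u64 },
    /// Error is above the threshold, so the block is stored as is.
    Full,
}

pub fn choose_coding(pix: &[u8; 64], max_err: u32) -> BlockCoding {
    if pix.iter().all(|&p| p == pix[0]) {
        return BlockCoding::Fill(pix[0]);
    }
    let avg = pix.iter().map(|&p| u32::from(p)).sum::<u32>() / 64;

    // Split pixels into two groups around the average.
    let (mut lo_sum, mut hi_sum, mut hi_cnt, mut mask) = (0u32, 0u32, 0u32, 0u64);
    for (i, &p) in pix.iter().enumerate() {
        let p = u32::from(p);
        if p > avg {
            hi_sum += p;
            hi_cnt += 1;
            mask |= 1u64 << i;
        } else {
            lo_sum += p;
        }
    }
    // Both groups are non-empty because the block is not a solid fill.
    let lo = lo_sum / (64 - hi_cnt);
    let hi = hi_sum / hi_cnt;

    // Sum of squared errors of the two-colour approximation.
    let err: u32 = pix
        .iter()
        .enumerate()
        .map(|(i, &p)| {
            let q = if (mask >> i) & 1 != 0 { hi } else { lo };
            let d = u32::from(p).abs_diff(q);
            d * d
        })
        .sum();

    if err > max_err {
        BlockCoding::Full
    } else {
        BlockCoding::TwoColour { lo: lo as u8, hi: hi as u8, mask }
    }
}
```

In such a scheme a quality setting would simply map to `max_err`: a lower threshold sends more blocks to full mode and increases the output size.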

Overall, it was a nice diversion from implementing Duck decoders for NihAV but I should probably return to it. The sooner it’s done the sooner I can move to something more exciting like finally experimenting with vector quantisation, or trying to write a player, or something else entirely. I avoid making plans but there are many possibilities at hoof so I just need to pick one.

VP3-VP6: the Golden (Frame) Age of Duck Codecs

May 24th, 2019

Dedicated to Peter Ross, who wrote an opensource VP4 decoder (that is not committed to CEmpeg yet at the time of the writing).

The codecs from VP3 to VP6 form a single codec family that is united not merely by the design but even by the header: every frame in this codec (sub)family has the same header format. And the leaked VP6 format specification still calls the version field there Vp3VersionNo (versions 0-2 are used by VP3, 3 is used by VP4, 5 is for VP5 and 6-8 are for VP6). VP7 changed both the coding principles (to mimic H.264) and the header format. And you can call it the golden age for Duck because it runs from the popularity it gained with VP3 being donated to the open-source community (and xiphed to Theora which was the only patent-free(ish) opensource video codec with decent performance back then) to its greatest success, found in VP6, employed both in games and in Flash video (remember when BaidUTube still used .flv with VP6 and N*llyMos*r ADPCM or Speex?). And today, having gathered enough material, I want to give an overview of these codecs. Oh, and NihAV can decode VP30 and VP31 now.

NihAV: rust-clippy experience

May 18th, 2019

As I’ve mentioned in the previous post, I’ve finally tried rust-clippy to see what issues and suggestions it will have on my code. The results are not disappointing if you take the tool name seriously.

NihAV: after clean-up

May 17th, 2019

Since the clean-up work on NihAV is done and I progress with Truemotion VP3 decoder, it’s a good time to talk about what I’ve actually done—there’s even more material to write waiting in the queue.

The intent was to make all frame-related stuff thread-safe and improve efficiency a bit. In order to do the former I had to replace most of the references from Rc<RefCell<T>> to Arc<T> and while doing it I introduced aliases like type NAFrameRef = Arc<NAFrame> and .into_ref() methods to convert object into ref-counted version. This helped when I tried switching from one implementation of reference counter to another and will make it easy to switch again if I ever need that (hopefully not). Now about improved efficiency and how it’s related to the ref-counting.

There’s a straightforward way of dealing with frames: you allocate the picture, fill it, dispose of it, allocate a new one, and so on. And there’s a more effective way: you allocate several pictures at once, select an unused one, fill it, and return it to the pool when it’s not needed any more. That is where reference counting comes into play and where Rust default structures don’t help. The frame pool owns the reference and the decoder gets a second copy. And Rust Arc gives write access only to a sole owner: when you try to modify a shared object (e.g. via Arc::make_mut) it will simply clone it so you end up working with a copy (which defeats the purpose). So I had to NIH my own NABufferRef<T> which keeps reference counts and still allows shared access even for writing (currently it does that in all cases but if I need to add some guards the API won’t have to be changed for that). The implementation is very simple: the structure contains a raw pointer to a structure that contains the actual object and an AtomicUsize counter. The whole implementation is ~2.2kB relying just on the std crate.
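For readers who want to picture the design, below is a minimal sketch of such a reference: a raw pointer to a structure holding the object and an AtomicUsize counter, plus a pool lookup that treats "only the pool holds it" as "unused". It is not the actual NABufferRef<T> code; `BufRef`, `Shared` and `take_free` are invented names, and the sketch deliberately offers unguarded write access, so it is only sound as long as callers never keep overlapping references alive (thread-safety markers are left out as well).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Heap-allocated payload plus its reference counter.
struct Shared<T> {
    data: T,
    refs: AtomicUsize,
}

/// Ref-counted handle that allows writing even when shared (no guards!).
pub struct BufRef<T> {
    ptr: *mut Shared<T>,
}

impl<T> BufRef<T> {
    pub fn new(data: T) -> Self {
        let shared = Box::new(Shared { data, refs: AtomicUsize::new(1) });
        Self { ptr: Box::into_raw(shared) }
    }

    fn shared(&self) -> &Shared<T> {
        // The allocation lives until the last handle is dropped.
        unsafe { &*self.ptr }
    }

    pub fn ref_count(&self) -> usize {
        self.shared().refs.load(Ordering::SeqCst)
    }

    pub fn get(&self) -> &T {
        &self.shared().data
    }

    /// Unguarded write access: the caller must not hold other references.
    pub fn get_mut(&mut self) -> &mut T {
        unsafe { &mut (*self.ptr).data }
    }
}

impl<T> Clone for BufRef<T> {
    fn clone(&self) -> Self {
        self.shared().refs.fetch_add(1, Ordering::SeqCst);
        Self { ptr: self.ptr }
    }
}

impl<T> Drop for BufRef<T> {
    fn drop(&mut self) {
        // The handle that brings the counter to zero frees the allocation.
        if self.shared().refs.fetch_sub(1, Ordering::SeqCst) == 1 {
            unsafe { drop(Box::from_raw(self.ptr)); }
        }
    }
}

/// Returns a second handle to the first buffer that only the pool holds.
pub fn take_free<T>(pool: &[BufRef<T>]) -> Option<BufRef<T>> {
    pool.iter().find(|b| b.ref_count() == 1).cloned()
}
```

The pool keeps one handle per buffer forever; a buffer counts as free again as soon as the decoder (or whoever took it) drops its copy.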

And finally I’ve made a picture pool. The difference between picture and frame is all the additional metadata a picture should not care about (like timestamps, stream information and such). Because of the design decisions I have three different picture formats (implemented for 8-, 16- and 32-bit element sizes, Rust does not like aliasing after all), which means I need to provide the decoder with all three picture pools because we can’t say in advance which one the codec will use (if at all, since the option to allocate new non-pooled pictures is still there). Also I want to keep those pools external in case the code around it wants to keep more pictures in them (e.g. 2-3 pictures required by the decoder and 25 pictures pre-buffered for the display). This resulted in a structure called NADecoderSupport that contains picture pools and may have something else added later. Of course people might argue that it’s much better to have AVCodecContext with a myriad of fields you can set directly or via utility functions but I’d rather not have one single structure. Though it might be a good place to put various decoder options (so that the decoder can ignore them at its leisure).

Since I said I did it to increase efficiency I should probably give some numbers too: RealVideo 3/4/6 decoders now use buffer pool (for three frames obviously) and reallocate it on format change. Decoding time got reduced by 4-5% from using the pool. Currently I don’t care about speed much but I may convert more decoders to it if the need arises.

In conclusion I want to say that even though I did not enjoy doing that work much, it was needed and gave me some experience plus some improvements in code and design. So it was not a wasted effort.

P.S. I also installed rust-clippy since it’s in stable now and tried to fix errors and warnings it reported. But that is a story for another post.

Zähringerstädte

May 6th, 2019

Today I want to talk about a local dynasty that was rather short-lived but left quite an impressive legacy.

Bink2: some words about loop filter

April 14th, 2019

Since obviously I have nothing better to do, here’s a description of loop filter in Bink2 as much as I understand it (i.e. not much really).

First, the loop filter bases its decision on two factors: whether the motion vector difference between adjacent blocks is greater than two, and the number of coefficients coded in the block, which selects the filter strength (that one I don’t remember seeing before). The filter is the same in all cases (inter/intra, luma/chroma, edge/inside macroblock), only the number of pixels filtered varies between zero and two on each side (more on that later). This is nice and elegant design IMO.

Second, filtering is done after each macroblock, horizontal edges first, vertical edges after that, but not necessarily for all macroblocks, since BIKi or BIKj encoder can signal “do not deblock macroblocks in these rows and columns” by transmitting a set of flags for columns and rows.

Third, in addition to normal filtering decoder can do something that I still don’t understand but it looks like whole-block overlapping in both directions (and it is performed in actual decoding but I don’t know what happens with the result of it).

And the filter itself is not that interesting (assuming we filter buf[0] buf[1] | buf[2] buf[3]):

    /* rounded shifts; in C the additions are evaluated before >> */
    diff0 = (buf[2] - buf[1] + 8) >> 4;
    diff1 = (diff0 * 4 + 8) >> 4;
    if (left_strength >= 2)
        buf[0] = clip8(buf[0] + diff0);
    if (left_strength >= 1)
        buf[1] = clip8(buf[1] + diff1);
    if (right_strength >= 1)
        buf[2] = clip8(buf[2] - diff1);
    if (right_strength >= 2)
        buf[3] = clip8(buf[3] - diff0);

Strength is determined like this: 0 for more than 8 coefficients coded, 1 for MV difference or 4-7 coefficients coded, 2 for 1-3 coefficients coded, 3 for no coefficients coded.
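The same filter written as a small Rust function, mainly to make the integer types and the clipping explicit. This is a direct transcription of the pseudocode above and nothing more; `filter_edge` is a made-up name, the strengths are simply passed in as parameters, and `>>` on `i32` is an arithmetic shift, which is assumed to match what the pseudocode intends for negative differences.

```rust
/// Clamps a value to the 0..=255 range of an 8-bit pixel.
fn clip8(v: i32) -> u8 {
    v.clamp(0, 255) as u8
}

/// Filters four pixels across an edge: buf[0] buf[1] | buf[2] buf[3].
pub fn filter_edge(buf: &mut [u8; 4], left_strength: u8, right_strength: u8) {
    // Both differences are computed from the unfiltered pixels.
    let diff0 = (i32::from(buf[2]) - i32::from(buf[1]) + 8) >> 4;
    let diff1 = (diff0 * 4 + 8) >> 4;

    if left_strength >= 2 {
        buf[0] = clip8(i32::from(buf[0]) + diff0);
    }
    if left_strength >= 1 {
        buf[1] = clip8(i32::from(buf[1]) + diff1);
    }
    if right_strength >= 1 {
        buf[2] = clip8(i32::from(buf[2]) - diff1);
    }
    if right_strength >= 2 {
        buf[3] = clip8(i32::from(buf[3]) - diff0);
    }
}
```

Passing zero for a strength leaves that side of the edge untouched, which is how the "zero to two pixels on each side" behaviour falls out of a single function.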

Overall, the loop filter is nice and simple if you ignore the existence of some additional filter functions and very optimised implementation that is not that much fun to untangle.

Update: the alternative function seems to be some kind of block reconstruction based on DCs. If it’s an intra block with fewer than four coefficients coded it will take all neighbouring DCs, select those not differing by more than a frame-defined threshold and smooth the differences. I still don’t understand its purpose in full though.

NihAV: now with TM2X support!

April 11th, 2019

I’m proud to say that NihAV got TrueMotion 2X support. For now only intra frames are supported but 75% of the samples I have (i.e. three samples) have just intra frames. At least I could check that it works as it is supposed to.

First, here’s codec description after I managed to write a working decoder for it. TrueMotion 2X is another of those codecs that’s closer to TrueMotion 1 in design. It still uses the same variable-length codebook instead of Huffman coding (actually only version 5 of this codec uses bit reading for anything). It also uses “apply variable amount of deltas per block” approach but instead of old fixed scheme it now defines twenty-something coding approaches and tells decoder which ones to use in current frame. That is done because block size now can be variable too (but it’s always 8 in all files I’ve seen). And blocks are grouped in tiles (usually equivalent to one row of blocks but again, it may vary). The frame data obfuscation that XORs chunks inside the frame with a 32-bit key derived in a special way is not worth mentioning.

Second, the reference is quite peculiar too. It decodes frame data by filling an array of pointers to the functions that decode each line segment with the proper mode, moving to the next line and repeating. And those functions are in handwritten assembly: they use the stack pointer register for the decoder context pointer (that has the original ESP saved somewhere inside), which also means they do not use stack space for anything and instead of returning they simply jump to the next routine until the final one restores the stack and returns properly. Thankfully Ghidra allows assigning the context argument to ESP and while the decompiler output still looks useless, the assembly has proper references in the form mov EDX, dword ptr [ctx->luma_pred + ESP].

And finally, I could not check what the binary specification really does because MPlayer could not run it. At first I tried running the working combination of WMP+Win98 under OllyDbg in QEMU but it was painfully slow and even more painful to look at the memory state. In the end I’ve managed to run the TM2X decoder in MPlayer, which then served as a good reference. The trick is that you should not try to run tm2X.dll (it’s really hopeless) but rather take tm2Xdec.ax (or the deceptively named tm20dec.ax from the same distribution that can handle TM2X unlike its earlier versions), patch one byte for a check in DLL init and it works surprisingly well after that.

So what’s next? Probably I’ll just add missing features for the second TM2X sample (the other two samples are TM2A), maybe add Bink2 deblocking feature—since I’d rather have that decoder complete—and move to improving overall NihAV design. Frame management needs proper rework before I add more codecs—I want to change into a thread-safe version before I add more decoders. Plus I’ll need to add some missing bits for a player. There’s a lot of work still to do but I’m pleased that I still managed to do something.

BMV: Complete!

April 4th, 2019

So NihAV finally got Discworld Noir BMV support and I’ve tested it on all samples from the game to see if it works correctly. Here’s a sample frame:


(I still remember the song Samael plays there and have it somewhere ripped in its full MPEG audio layer II glory).

Now I want to talk about the format since it’s quite different from anything else. BMV used in Discworld II was simple but with two quirks: it employed integer coding using a variable amount of nibbles (that were read as bytes but a nibble could be saved and used later) and it could decode a frame either from the beginning to the end or from the end to the beginning (reading frame data from the end too!). DW3 BMV is even stranger and let’s start with the audio part. The audio codec is very simple: you have a 41-byte block with one byte signalling which quantised value tables should be used for both channels and 32 indices for each channel packed into 16-bit words. The main peculiarity is that data is aligned to 16 bits and the mode byte can be either at the beginning or at the end of the block. That’s a bit unusual but not strange. Well, it turns out it aligns to the absolute position in the file so my demuxer has to signal whether audio data was at an even or odd position. And video is even stranger.
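A tiny Rust sketch of what that parity signalling might translate to on the decoder side. The mapping is an assumption made for this example (the text does not spell it out): if the block starts at an odd file offset the mode byte is taken to come first, so that the 40 bytes of 16-bit data start on an even offset; otherwise the data comes first and the mode byte last. `split_audio_block` is a hypothetical helper.

```rust
/// Splits a 41-byte audio block into its mode byte and 40 bytes of packed
/// indices. Which parity puts the mode byte first is an assumption here.
pub fn split_audio_block(block: &[u8; 41], starts_at_odd_offset: bool) -> (u8, &[u8]) {
    if starts_at_odd_offset {
        // Mode byte first, so the 16-bit words begin on an even offset.
        (block[0], &block[1..])
    } else {
        // Data is already aligned; the mode byte trails the block.
        (block[40], &block[..40])
    }
}
```

How the 32 indices per channel are laid out inside the 16-bit words is not covered by this sketch.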

As I wrote previously, the video codec is 16-bit now and still employs nibble variable integer coding and copy/repeat/new pixels modes. Luckily there is no backwards decoding mode, yet the codec is tricky without it. First of all, where previously there were just three plain modes now we have combinations of those with bytes or nibbles signalling what should be done (i.e. copy/repeat/put new pixels a fixed amount of times and then do the other operation another fixed amount of times). And they have different meaning depending on what the last operation was (copy/repeat/put for a fixed amount or with an arbitrarily large one). And on whether there’s a nibble left unread after the last operation or not. But that’s not all! While previously reading new pixels meant just reading a byte, 16-bit pixels can be compressed a bit more. As a result we read 1-3 bytes per pixel: first read an index byte, remap it, if it’s in the range 00..F7 then return the pixel from an array, if it’s in the range F8..FE then read another byte and use it as an index in one of seven secondary “palettes”; if it’s FF then simply read an explicit 2-byte value from the stream. The reference simply used an array pointing to the various functions performing this. And of course palettes can be updated in the beginning of each frame.
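The 1-3 byte pixel reading can be written down as a short Rust sketch. The structure and field names (`PixelReader`, `remap`, `primary`, `secondary`) are invented for illustration, and the little-endian order of the explicit 2-byte value is an assumption; only the three index ranges come from the description above.

```rust
/// Reads 16-bit pixels coded with 1-3 bytes each (hypothetical layout).
pub struct PixelReader<'a> {
    src: &'a [u8],
    pos: usize,
    /// Index remapping table applied to the first byte.
    remap: [u8; 256],
    /// Pixels addressed directly by remapped indices 00..F7.
    primary: [u16; 0xF8],
    /// Seven secondary "palettes" selected by remapped indices F8..FE.
    secondary: [[u16; 256]; 7],
}

impl<'a> PixelReader<'a> {
    fn next_byte(&mut self) -> Option<u8> {
        let b = *self.src.get(self.pos)?;
        self.pos += 1;
        Some(b)
    }

    /// Returns the next pixel or `None` when the input runs out.
    pub fn read_pixel(&mut self) -> Option<u16> {
        let idx = self.remap[usize::from(self.next_byte()?)];
        match idx {
            0x00..=0xF7 => Some(self.primary[usize::from(idx)]),
            0xF8..=0xFE => {
                let sub = usize::from(self.next_byte()?);
                Some(self.secondary[usize::from(idx - 0xF8)][sub])
            }
            0xFF => {
                // Explicit value; little-endian order is assumed here.
                let lo = u16::from(self.next_byte()?);
                let hi = u16::from(self.next_byte()?);
                Some(lo | (hi << 8))
            }
        }
    }
}
```

A single `match` replaces the array of function pointers used by the reference; the tables themselves would be refilled whenever a frame updates the palettes.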

Surprisingly, decoder implementation takes about 28kB with a quarter of it being tables. That’s for both audio and video decoder. This is on par with other game decoders (GDV, Smacker and VMD) and feels significantly smaller than the reference (which is about 30kB in a stripped assembled .o file and over 200kB as assembly).

Overall it was hard to comprehend and tricky to debug too. Nevertheless now it’s over and I can probably move to TrueMotion 2X. Or whatever I decide to do when I’m bored enough.

BMV: moving forward

March 30th, 2019

I’ve made some significant progress on REing Discworld Noir BMV.

First, I put the opcode meanings into a table (it’s probably the only case when I had to use a spreadsheet for REing) to figure out the meaning.

Normal mode of operation has these opcodes:

  • 00**00*0 — perform extra-long copy;
  • 00**00*1 — perform extra-long invoking of pixel functions;
  • 00**xxx0 — copy xxx-1 pixels;
  • 00**xxx1 — invoke pixel function xxx-1 times;
  • xxxx000y — copy xxxx+3+y pixels;
  • xxxx001y — invoke pixel function xxxx+3+y times;
  • xxx0yyy0 — copy yyy-1 pixels and then invoke pixel function xxx-3 times;
  • xxx0yyy1 — invoke pixel function yyy-1 times and then repeat last value xxx-3 times;
  • xxx1yyy0 — copy yyy-1 pixels and then repeat last value xxx-3 times;
  • xxx1yyy1 — invoke pixel function yyy-1 times and then copy xxx-3 pixels.

Then depending on last operation performed mode is changed to: something special for 00****** opcodes, no change for repeat, “after copy mode” and “after pixel func mode” for obvious cases.

After copy mode opcodes:

  • xxxxxxx0 — invoke normal mode opcode xxxxxxx1;
  • 00**00** — extended repeat;
  • 00**xxx1 — repeat last value xxx-1 times;
  • xxxx00y1 — repeat last value xxxx+3+y times;
  • xxx0yyy0 — repeat last value yyy-1 times and copy xxxx-3 pixels;
  • xxx0yyy1 — repeat last value yyy-1 times and copy xxxx-3 pixels;
  • xxx1yyy1 — repeat last value yyy-1 times and invoke pixels function xxxx-3 times.

After pixel function mode is simple: xxxxxxx0 opcode is after copy mode xxxxxxx1 opcode and xxxxxxx1 opcode is normal mode xxxxxxx0 opcode.
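The redirections between the three modes boil down to flipping the lowest bit and switching tables, which a few lines of Rust can show. This is only a sketch: it assumes the rightmost character of the patterns above is the least significant bit, it ignores the special 00****** opcodes entirely, and `Mode` and `resolve` are made-up names.

```rust
/// Opcode interpretation modes (names are made up for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Mode {
    Normal,
    AfterCopy,
    AfterPixelFunc,
}

/// Maps an opcode to the mode and opcode that actually handle it.
pub fn resolve(mode: Mode, op: u8) -> (Mode, u8) {
    match (mode, op & 1) {
        // After pixel function: xxxxxxx0 acts as after-copy xxxxxxx1.
        (Mode::AfterPixelFunc, 0) => (Mode::AfterCopy, op | 1),
        // After pixel function: xxxxxxx1 acts as normal-mode xxxxxxx0.
        (Mode::AfterPixelFunc, _) => (Mode::Normal, op & !1),
        // After copy: xxxxxxx0 acts as normal-mode xxxxxxx1.
        (Mode::AfterCopy, 0) => (Mode::Normal, op | 1),
        // Everything else is handled by the current mode's own table.
        _ => (mode, op),
    }
}
```

A real decoder would check for the special opcodes before applying such a mapping.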

The special modes may have secondary opcodes that usually boil down to the same thing: either do some more of the same and proceed normally or calculate next opcode instead of reading it from the stream.

And now the second thing: I’ve managed to rip the relevant code from the disassembly, fix it up for NASM to handle, add some fixes to make it possible to invoke externally and link it against my own small program that parses a BMV file, decodes the first video frame and dumps it into a file. That approach works so I have something to test my new implementation against. Also since NASM makes all labels visible it’s easy to make a debugger report each opcode as it gets called.

To summarise, I have good understanding of the algorithm and I have a working binary specification. This should be enough to finish it soon (unless I get distracted by something else of course).

Moving with REing game codecs

March 23rd, 2019

As I’ve mentioned before, NihAV now can decode Bink2 files more or less decently. I don’t have many samples but I can decode all samples I could find from KB2f to KB2j quite well (the only exception is KB2a — there’s only one partial sample known with no indication which game uses it and no version of RAD tools understands it either).

I’ve omitted support for full-resolution Bink2 files (no samples) and reconstruction is not perfect because there’s an in-loop deblocking filter with some additional crazy functions invoked in some cases. It’s messy and does not affect actual bitstream decoding so I’m not going to work on it now. Maybe if I get some inspiration later…

Anyway, I moved to REing the Discworld Noir BMV format instead. While the game is nice, it’s hard to run on any modern OS and there’s almost no hope for its engine being reimplemented. So maybe I’ll be able to re-watch cutscenes from it… I figured out the container and audio a long time ago, video is not that easy. It seems to be an upgrade of Discworld II BMV that outputs 16-bit video but the way it’s implemented is baffling.

While locating the functions responsible for the decoding was easy, understanding them was hard. For the disassembler. No, the code was recognized fine but its flow makes even disassembler freak out. Here’s how it works:

  1. The frame decoding function reads pixel values, fills certain arrays and patches functions that return pixel values (nice start, isn’t it?) and then the real frame decoding starts;
  2. Frame decoding is done like a state machine (which complements coroutines used elsewhere in the engine) with several tables for handling 256 opcodes (or less in some cases);
  3. As a result you read a byte, jump to one of the labels, perform the operation, read the next opcode, jump using the same or a different table;
  4. Except when it’s opcodes 00-3F, then you usually have to construct a length word, perform some pixel output loop and then jump to another opcode handler which performs some operation and then jumps to the address calculated by the previous operation;
  5. Of course pixel functions are some permuted array of 256 pointers to the functions of three different kinds: return fixed value (set by the decoding function in the beginning), read byte and return pixel value from the corresponding array or read new pixel value from the stream;
  6. And to make it all even better, all those operations are obviously not actual functions but small(er) chunks of assembly code that use fixed registers as arguments and they’re located both before and after decoding function “body”.
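The "read an opcode, jump through a table, maybe switch tables" flow from the list above can be imitated in safe Rust with tables of function pointers. The following is a toy sketch only: the two operations and their opcode meanings are invented and have nothing to do with the real BMV opcodes, and `State`, `Handler`, `op_put`, `op_repeat` and `run` are hypothetical names.

```rust
/// Decoder state shared by all handlers (hypothetical layout).
pub struct State<'a> {
    src: &'a [u8],
    pos: usize,
    out: Vec<u16>,
    last: u16,
}

/// A handler performs its operation and returns the index of the table
/// that should interpret the next opcode.
pub type Handler = fn(&mut State, u8) -> usize;

/// Toy operation: emit the opcode value as a pixel, continue with table 1.
pub fn op_put(st: &mut State, op: u8) -> usize {
    st.last = u16::from(op);
    st.out.push(st.last);
    1
}

/// Toy operation: repeat the last pixel `op & 3` times, back to table 0.
pub fn op_repeat(st: &mut State, op: u8) -> usize {
    for _ in 0..(op & 3) {
        st.out.push(st.last);
    }
    0
}

/// Main loop: read a byte, dispatch through the current table, and let the
/// handler choose which table handles the following opcode.
pub fn run(st: &mut State, tables: &[[Handler; 256]]) {
    let mut cur = 0;
    while let Some(&op) = st.src.get(st.pos) {
        st.pos += 1;
        cur = tables[cur][usize::from(op)](st, op);
    }
}
```

The original does the same with jumps between assembly labels instead of calls, which is exactly what confuses the disassembler.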

Anyway, I’ve made some progress and I reckon it will be possible to support this format in NihAV though maybe not soon enough.