Today I’ll try to explain the principles behind the bool coder in VP6 (actually VP5-VP9) and how it all should work in the encoder. As usual, let’s start with the theory.
In 1948 Claude Shannon published his fundamental theory of communication (which is just one of several fundamental theories he produced) where, among other things, he demonstrated the relation between the probability of a message and the number of bits required to code that message: bits = -log2(p).
Of course some people wonder how you can encode e.g. one and a half bits, and the answer is simple: multiplication. For example, if you have three equiprobable messages (codes, events; it’s just a number to us) then you need -log2(1/3) ≈ 1.585 bits to code one of them. Of course you can spend two full bits to code one, but you can also pack five of them into a base-3 number (i.e. m0 + m1 * 3 + m2 * 9 + m3 * 27 + m4 * 81). It is easy to see that the resulting number lies in the 0-242 range, which fits into eight bits (instead of ten), meaning it’s only 1.6 bits per single message. Similarly you can pack three messages with five equiprobable variants each into just seven bits (using 2⅓ bits per message instead of the ideal 2.32 bits). We can mix various probabilities as well, e.g. we can combine one message with three equiprobable variants and one message with five equiprobable variants and code them in 4 bits. But what if we want to code a message with different probabilities for each outcome?
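Just to make the packing concrete, here is a tiny sketch (the function names are mine, purely for illustration) of stuffing five ternary messages into a single byte and getting them back:

```rust
// Pack five base-3 messages (each 0..=2) into one byte: the result lies in the
// 0..=242 range, i.e. roughly 1.6 bits per message instead of 2.
fn pack5_base3(m: [u8; 5]) -> u8 {
    m.iter().rev().fold(0u16, |acc, &d| acc * 3 + u16::from(d)) as u8
}

// Undo the packing by peeling off base-3 digits one by one.
fn unpack5_base3(mut v: u8) -> [u8; 5] {
    let mut out = [0u8; 5];
    for d in out.iter_mut() {
        *d = v % 3;
        v /= 3;
    }
    out
}
```

E.g. pack5_base3([2, 1, 0, 2, 1]) gives 140 and unpacking it returns the original digits.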
And from that we get to the idea of arithmetic coding (or ANS if you prefer tables to multiplications). We can represent all possible outcomes using cumulative probabilities (i.e. the sum of all probabilities up to and including the current one), e.g. if you have probabilities 0.2, 0.3 and 0.5 then their cumulative probabilities will be 0.2, 0.5 and 1.0 and you can use the ranges [0; 0.2), [0.2; 0.5) and [0.5; 1.0) to signal the proper outcome. Any number inside the range would do – both 0.6 and 0.75 point to the same outcome #2. So how do you code the next number? You take the range you got at the previous step and multiply it by the range of the next event (this is done by multiplying the original range length by the new range points and adding the original range start). E.g. if we had event #2 and then event #0 we multiply [0.5; 1.0) by [0; 0.2) and get a new range, namely [0.5; 0.6). Similarly, we’d get [0.1; 0.2) if we had event #0 and then event #2 – the same range size but a different range position. And it’s worth noting that you can represent a range as its start (or end) point plus the range size, and that’s how you get range coding.
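Here is a toy floating-point illustration of that narrowing for the example above (real coders use integer arithmetic, of course):

```rust
// Narrow the current interval by the cumulative range of the next event.
fn narrow(range: (f64, f64), event: (f64, f64)) -> (f64, f64) {
    let len = range.1 - range.0;
    (range.0 + len * event.0, range.0 + len * event.1)
}

fn main() {
    // cumulative ranges for outcome probabilities 0.2, 0.3 and 0.5
    let events = [(0.0, 0.2), (0.2, 0.5), (0.5, 1.0)];
    let mut range = (0.0, 1.0);
    range = narrow(range, events[2]); // event #2 -> [0.5; 1.0)
    range = narrow(range, events[0]); // event #0 -> [0.5; 0.6)
    println!("{:?}", range);          // any number from this interval codes the sequence
}
```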
It is easy to see that if you code all those events this way you get a single insanely large number representing a point inside the final range. It can be proven that this is as close to the ideal lower bound as it can get, but how can you work with such a number of indefinite length? The answer is easy: you need just a couple of digits at any given time since the top digits won’t change (the only exception is a possible overflow like [0.39999; 0.40001) which can have an indefinite number of nines or zeroes, but you can simply record such a situation and output 39999 or 40000 later; or design a carry-less coder to prevent it from happening entirely). The same goes for decoding: you work with finite precision and handle only a certain number of digits at a time.
While this is theoretically optimal, in practice people want coding to be faster, and thus a lot of effort was put into researching binary coding, where you have a simplified case and thus need fewer operations to decode a bit (it can be done without any multiplications, to give one example).
And one of those methods is the VP6 bool coder. It codes bits (or “bool(ean)s” in VP terminology) using 8-bit probabilities. The idea is as simple as it can get: first you calculate the split point (range multiplied by the probability) below which it’s a zero bit and above which it’s a one bit, pick the appropriate sub-range for the input bit and add the split value to the output value. During decoding you compare the input value to the split point, select the proper sub-range and subtract the split value from the input value if the bit is set. But what do you do when the range gets too small (like one) so you can’t split it properly? That’s why you need normalisation (so the range always stays in 128-255). Normalisation is done by doubling the range (and shifting new bits into the input value) until the range is 128 or more. This corresponds to the process of discarding the top bits of the fraction I talked about previously. Side note: I’m not sure that overflow can happen in the bool coder since its range is limited to 255 instead of 256, but I’ll ignore that for now and deal with it if I ever manage to catch such an overflow.
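To make it more concrete, here is a minimal sketch of the decoding side. The split calculation follows the usual 1 + (((range - 1) * prob) >> 8) form, but the byte-refill scheme is my own simplification, so don’t expect it to be bit-exact with the real VP6 bitstream:

```rust
// Simplified bool (range) decoder: split the range according to the probability,
// compare the coded value against the split, then renormalise until the range
// is back in 128..=255. Assumes a well-formed stream from a matching encoder.
struct BoolDecoder<'a> {
    data:   &'a [u8],
    pos:    usize,
    value:  u32, // 16-bit window of the coded number
    range:  u32, // current range, kept in 128..=255 between calls
    shifts: u32, // bits shifted in since the last byte refill
}

impl<'a> BoolDecoder<'a> {
    fn new(data: &'a [u8]) -> Self {
        let mut dec = Self { data, pos: 0, value: 0, range: 255, shifts: 0 };
        dec.value = (dec.next_byte() << 8) | dec.next_byte();
        dec
    }

    fn next_byte(&mut self) -> u32 {
        let b = self.data.get(self.pos).copied().unwrap_or(0);
        self.pos += 1;
        u32::from(b)
    }

    fn read_bool(&mut self, prob: u8) -> bool {
        // split point: the part of the range assigned to a zero bit
        let split = 1 + (((self.range - 1) * u32::from(prob)) >> 8);
        let bit = self.value >= (split << 8);
        if bit {
            self.range -= split;
            self.value -= split << 8;
        } else {
            self.range = split;
        }
        // renormalisation: double the range (shifting fresh bits into the value)
        // until the range is 128 or more
        while self.range < 128 {
            self.range <<= 1;
            self.value <<= 1;
            self.shifts += 1;
            if self.shifts == 8 {
                self.value |= self.next_byte();
                self.shifts = 0;
            }
        }
        bit
    }
}
```

The encoder does the mirror operations: it picks the sub-range for the bit it wants to code, adds the split to the output value for a one bit, and renormalises by emitting the top bits (with carry handling), which I won’t show here.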
Now about the probabilities themselves. VP6 uses context-dependent probabilities (i.e. different probabilities may be chosen depending on the value of previously coded elements of the same size or on the position of the coded coefficient) but they are not adaptive (i.e. they are set once at the beginning of the frame and do not change in the process). This makes decoding VP6 faster than H.264 (because you don’t need to update the probabilities in the process), but you can’t encode a frame on the fly since you need to calculate and transmit the probabilities first (or use some defaults) – not that it’s a big issue anyway.
Generating the probabilities is very easy: you just count the total number of bits coded and how many of them were zero, and rescale that at the end to an 8-bit probability. And since I played with size estimation a bit I should mention it too. If you want to know the encoded size you can simply use the same counters to calculate the probability and compute zeroes * -log2(prob/256) + ones * -log2(1 - prob/256). In my experiments I pre-calculated a table with -8 * log2(prob/256) for all 256 probabilities to make the process faster while remaining precise enough (if you wonder, this will come in handy during rate control and rate-distortion optimisation but I’ll get to that much later).
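A sketch of what that bookkeeping might look like (the type and names here are mine, not taken from any real encoder):

```rust
// Count zero/one bits for one context, derive an 8-bit probability and estimate
// the coded size from a precomputed table of -8*log2(p/256) values.
struct BitCounter {
    zeroes: u32,
    total:  u32,
}

impl BitCounter {
    fn new() -> Self { Self { zeroes: 0, total: 0 } }

    fn add(&mut self, bit: bool) {
        self.total += 1;
        if !bit { self.zeroes += 1; }
    }

    // 8-bit probability of a zero bit, clamped so it never reaches 0 or 256
    fn prob(&self) -> u8 {
        if self.total == 0 { return 128; }
        (u64::from(self.zeroes) * 256 / u64::from(self.total)).clamp(1, 255) as u8
    }

    // estimated size in eighths of a bit: zeroes cost -8*log2(p/256) each,
    // ones cost -8*log2(1 - p/256) = -8*log2((256 - p)/256) each
    fn estimated_size(&self, bit_cost: &[u32; 256]) -> u64 {
        let p = usize::from(self.prob());
        let ones = self.total - self.zeroes;
        u64::from(self.zeroes) * u64::from(bit_cost[p])
            + u64::from(ones) * u64::from(bit_cost[256 - p])
    }
}

// Build the cost table once: bit_cost[p] is roughly -8 * log2(p / 256).
fn build_cost_table() -> [u32; 256] {
    let mut tab = [0u32; 256];
    for p in 1..256 {
        tab[p] = (-8.0 * (p as f64 / 256.0).log2()).round() as u32;
    }
    tab[0] = tab[1]; // probability 0 should never occur, keep a sane value anyway
    tab
}
```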
Another thing I should probably mention is what is done with those bits. H.264 usually represents possible values as universal codes (often with insane rules like “this bit should use this context unless it’s in this code, then it should use a different context also used to code that bit in another code”) while VPx has fixed trees for coding values, with a unique context assigned to each tree node, so it’s much clearer there (and tails of long coefficients are coded using a fixed set of probabilities depending on the coefficient value). It is also worth noting that those trees are not exactly Huffman trees since in a Huffman tree you take the right or left branch with equal probability while here each node has its own probability. But the format defines a way to convert those probabilities into variable-length codes if you want to use them for coding coefficients. Essentially you just calculate the probability of each leaf by multiplying the probabilities taken on the way to it and build a completely new tree using those probabilities as weights.
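For illustration, here is roughly how that leaf-weight calculation could look (the tree representation is my own, not the actual VP6 data structures):

```rust
// Walk a probability tree where each node stores an 8-bit probability of taking
// the "zero" branch, multiplying the probabilities on the way down to the leaves.
enum Node {
    Leaf(u32), // symbol value
    Branch { prob: u8, zero: Box<Node>, one: Box<Node> },
}

fn leaf_weights(node: &Node, weight: f64, out: &mut Vec<(u32, f64)>) {
    match node {
        Node::Leaf(sym) => out.push((*sym, weight)),
        Node::Branch { prob, zero, one } => {
            let p0 = f64::from(*prob) / 256.0; // probability of the "zero" branch
            leaf_weights(zero, weight * p0, out);
            leaf_weights(one, weight * (1.0 - p0), out);
        }
    }
}
```

The resulting (symbol, weight) pairs can then be fed into an ordinary Huffman tree construction to get the variable-length codes.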
Anyway, with DCT and bool coder already implemented I should move to making a simple intra-frame encoder. This will take a while.