It is not that hard to write a simple encoder (as I’m going to demonstrate); the problem is making it good (and that’s where I’ll fail). Until that time I’m going to explain what I’m doing and how/why it should be done.
VP7 is definitely not an H.264 rip-off, it just borrows the overall codec design and prediction methods (i.e. what we’d call a rip-off in other cases, but this one is definitely not it; neither is VP8 for that matter). And the frame coding ideas are rather simple: you have frames composed of 16×16 macroblocks and those macroblocks can be either intra (and use 4 possible 16×16 luma prediction methods or 10 possible prediction methods for each 4×4 luma sub-block; and of course 4 possible prediction methods used for both chroma planes) or inter (and have 1-16 motion vectors; those vectors can refer to either the previous frame or a special “golden” frame). Plus there is a thing called macroblock feature which allows coding a macroblock with a different quantiser (or signalling some special properties); there can be up to four of those as well. As you can see, trying all possible combinations for each macroblock may be too tedious. This is usually called combinatorial explosion. There’s a reason why very large numbers are called astronomical: they are useful only for measuring things outside Earth scale, like the distances between galaxies. And numbers above a googol (that’s 10^100 if you forgot) are called combinatorial because they exceed the number of particles in the whole Universe, yet some combinatorial problems need much higher numbers still (and special notations to express them, just look at Graham’s number for example).
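To make that structure a bit more concrete, here is a rough sketch of the per-macroblock coding choices expressed as data types. All type and variant names are mine, invented purely for illustration; they are not taken from the VP7 specification or from any real encoder source.

```rust
// Illustrative sketch of the coding choices described above (made-up names).
enum LumaIntraMode16 { DC, Vertical, Horizontal, TrueMotion }      // 4 whole-block luma modes
enum LumaIntraMode4  { DC, V, H, TM, LD, RD, VR, VL, HD, HU }      // 10 per-sub-block luma modes
enum ChromaIntraMode { DC, Vertical, Horizontal, TrueMotion }      // 4 modes shared by both chroma planes

enum RefFrame { Previous, Golden }

struct MotionVector { x: i16, y: i16 }

enum MacroblockCoding {
    Intra16 { luma: LumaIntraMode16, chroma: ChromaIntraMode },
    Intra4  { luma: [LumaIntraMode4; 16], chroma: ChromaIntraMode },
    // one to sixteen motion vectors depending on how the macroblock is partitioned
    Inter   { frame: RefFrame, mvs: Vec<MotionVector> },
}

struct MacroblockInfo {
    coding:  MacroblockCoding,
    feature: Option<u8>, // up to four "features", e.g. an alternative quantiser
}
```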
So, to give an estimate, we can code a macroblock in the following ways:
- intra: four chroma prediction methods multiplied by four 16×16 luma prediction methods plus 10^16 combinations of 4×4 luma prediction methods (that doesn’t sound practical already; even 10×16 evaluations for selecting just the best prediction for each sub-block independently may be a bit too much);
- inter: sixteen sub-blocks, each with ~2×255² possible source locations (two reference frames and a motion vector whose components can each take 255 possible values);
- now sum those together and throw in four macroblock features (that can change the quantiser for residue coding);
- and then you should consider that the frame can be coded with different quantisers for DC and AC coefficients in different block types (luma, luma DCs and chroma) and there are loop filter settings (which affect the frames that use this one as a reference).
And if you think it’s not enough you can repeat the process for the next macroblock. And the next one. And all the following ones too.
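As a sanity check on that estimate, here is a small back-of-the-envelope calculation. It is just arithmetic following the numbers above; the assumption that the inter case means exhaustively combining a full-range motion search over all sixteen sub-blocks is my reading of the estimate, not anything a real encoder would do.

```rust
// Rough count of per-macroblock coding choices, following the text above.
fn main() {
    // 4 chroma modes x 4 whole-block luma modes
    let intra16 = 4.0_f64 * 4.0;
    // 4 chroma modes x 10 modes for each of the 16 luma sub-blocks
    let intra4 = 4.0_f64 * 10f64.powi(16);
    // 2 reference frames x 255x255 motion vector positions per sub-block
    let per_subblock = 2.0_f64 * 255.0 * 255.0;
    // exhaustive combination over all 16 sub-blocks
    let inter = per_subblock.powi(16);
    // up to four macroblock features (alternative quantisers etc.)
    let total = (intra16 + intra4 + inter) * 4.0;

    println!("intra 16x16 combinations: {intra16:e}");
    println!("intra 4x4 combinations:   {intra4:e}");
    println!("inter combinations:       {inter:e}");
    println!("total per macroblock:     {total:e}");
}
```

The inter term alone comes out around 10^81, which is exactly why nobody searches that way.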
That is why various heuristics are used to reduce the number of combinations. The most effective method in most cases is stopping the search when the result found is good enough (e.g. the block distortion is below some threshold). And I’ve reviewed the ways to save time on motion search already. Now, what to do with intra 4×4 prediction mode selection? I’ve done a quick search and essentially there are two methods proposed: try just a few base modes first (e.g. horizontal, vertical and diagonal), pick the best one and then additionally try the two directions adjacent to it; or select the direction based on some neighbouring pixels. I’ll probably use some variation of the former approach.
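A minimal sketch of that first approach is below. The mode numbers, the adjacency table and the `eval` closure (which is supposed to return a distortion score for a given prediction mode applied to the current sub-block) are all placeholders, not the real VP7 mode ordering.

```rust
// Coarse-to-fine 4x4 mode selection: evaluate a few base modes, then refine
// only around the best of them. Mode numbering here is purely illustrative.
fn select_4x4_mode(eval: impl Fn(usize) -> u32) -> (usize, u32) {
    const BASE_MODES: [usize; 4] = [0, 1, 2, 3]; // e.g. DC, vertical, horizontal, one diagonal
    const NEIGHBOURS: [&[usize]; 4] = [
        &[],     // DC has no directional neighbours
        &[6, 7], // directions adjacent to vertical (illustrative)
        &[8, 9], // directions adjacent to horizontal (illustrative)
        &[4, 5], // directions adjacent to the diagonal (illustrative)
    ];

    // pick the best of the base modes
    let (mut best_mode, mut best_score) = (0, u32::MAX);
    for &mode in BASE_MODES.iter() {
        let score = eval(mode);
        if score < best_score {
            best_mode = mode;
            best_score = score;
        }
    }
    // refine: try only the directions adjacent to the winning base mode
    for &mode in NEIGHBOURS[best_mode] {
        let score = eval(mode);
        if score < best_score {
            best_mode = mode;
            best_score = score;
        }
    }
    (best_mode, best_score)
}
```

This way a sub-block needs at most six evaluations instead of ten, at the cost of occasionally missing the true best mode.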
Another thing is that you have too many options to select from, so various decisions should be made earlier. In the VP6 encoder I could get away with first making intra and inter versions of a macroblock and deciding which one is better to code; in the inter case I additionally tried four motion vectors per macroblock to see if it improved coding. In VP7 comparing all the possible macroblock codings may be too costly (you have two different ways to code luma prediction in an intra MB and five different ways to partition motion vectors in an inter MB).
That is why you should have RDO built into the coding process so you can terminate the search as early as possible, e.g. instead of calculating prediction modes for all 16 luma sub-blocks you can stop after the tenth sub-block if you see that the RDO metric is already worse than for the other MB coding modes. The same is true for motion search: instead of searching for a motion vector for each sub-block you can sub-divide the macroblock first into quarters and only then into sixteenths (and only if it promises an RDO metric improvement). And the motion search itself can be terminated early if coding a too distant motion vector is too costly (the EPZS paper mentions that already).
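The early-termination idea itself is trivial; here is a minimal sketch, assuming a hypothetical per-sub-block RD evaluation function (distortion plus lambda times estimated bits) and a running best cost from the macroblock coding modes already tried:

```rust
// Try intra 4x4 coding of a macroblock, giving up as soon as the accumulated
// RD cost exceeds the best complete macroblock coding found so far.
// `rd_cost_for_subblock` is a placeholder for the encoder's own RD evaluation.
fn try_intra4x4(best_so_far: u64, rd_cost_for_subblock: impl Fn(usize) -> u64) -> Option<u64> {
    let mut cost = 0u64;
    for blk in 0..16 {
        cost += rd_cost_for_subblock(blk);
        if cost >= best_so_far {
            return None; // already worse than another coding mode, stop here
        }
    }
    Some(cost) // a new best candidate for this macroblock
}
```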
As you can see, there are a lot of things to try and implement. I’ll write about the specific details when I get to them. Meanwhile I’ve managed to make an extremely simple intra-frame coder with mixed 16×16 and 4×4 intra prediction modes and that’s all.