I’ve finally managed to implement a more or less working RealVideo 4 encoder with all the main features (yeah, I’m also surprised that I got to this stage this fast). As usual, it’s the small details that will take a lot of time to make decent, let alone good.
So, what can my encoder actually do now? It can encode video with I/P/B-frames using the provided frame order, it can encode all possible macroblock types, and it has some kind of rate control.
What does it not have, then? First of all, I don’t know yet how it would fare with the original RealPlayer (I also need to modify the RMVB muxer to output improper B-frame timestamps and maybe write the additional streaming-related information). Then there’s the question of having proper rate control. And finally there are a lot of questions related to B-frames.
Currently my rate control is implemented as a system that keeps statistics on the average encoded frame size for each frame type and quantiser, and tries to find the best-fitting quantiser for the target size. If there are no statistics yet (or not enough of them) I resort to simpler quantiser guessing, adjusting the quantiser depending on how much the projected and actual frame sizes differ. Of course it can be tuned to behave better (the question is how, though). And I’m not going to touch two-pass encoding (theoretically it’s rather simple: you log various encoder information in the first pass and use it to select quantisers better in the second pass; in practice it means messing with text data and doing additional guesstimates, so pass).
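As a rough illustration of that scheme, here is a minimal Rust sketch. It is not the actual encoder code: the names (`RateControl`, `SizeStat`, `MIN_SAMPLES`), the assumed quantiser range of 0..31, the sample threshold and the 25% tolerance in the fallback are all made up for the example.

```rust
// Assumptions for this sketch: 32 quantiser values, and at least 3 samples
// before an average is trusted. Both numbers are illustrative.
const NUM_QUANTS: usize = 32;
const MIN_SAMPLES: u32 = 3;

#[allow(dead_code)]
#[derive(Clone, Copy)]
enum FrameType { I, P, B }

/// Accumulated size of frames coded with one (frame type, quantiser) pair.
#[derive(Clone, Copy, Default)]
struct SizeStat { total_bits: u64, count: u32 }

impl SizeStat {
    /// Average frame size, or `None` while there are too few samples.
    fn average(&self) -> Option<u64> {
        if self.count >= MIN_SAMPLES {
            Some(self.total_bits / u64::from(self.count))
        } else {
            None
        }
    }
}

struct RateControl {
    stats: [[SizeStat; NUM_QUANTS]; 3],
    /// Quantiser suggested by the simple fallback guess.
    quant: usize,
}

impl RateControl {
    fn new(start_quant: usize) -> Self {
        Self {
            stats: [[SizeStat::default(); NUM_QUANTS]; 3],
            quant: start_quant.min(NUM_QUANTS - 1),
        }
    }

    /// Pick the quantiser whose recorded average size is closest to the
    /// target; fall back to the guessed quantiser when nothing is known.
    fn pick_quant(&self, ftype: FrameType, target_bits: u64) -> usize {
        self.stats[ftype as usize]
            .iter()
            .enumerate()
            .filter_map(|(q, st)| st.average().map(|avg| (q, avg.abs_diff(target_bits))))
            .min_by_key(|&(_, diff)| diff)
            .map_or(self.quant, |(q, _)| q)
    }

    /// Record the real frame size and nudge the fallback quantiser by one
    /// step when the frame missed the target by more than 25%.
    fn update(&mut self, ftype: FrameType, quant: usize, target_bits: u64, actual_bits: u64) {
        let st = &mut self.stats[ftype as usize][quant];
        st.total_bits += actual_bits;
        st.count += 1;
        self.quant = if actual_bits > target_bits + target_bits / 4 {
            (quant + 1).min(NUM_QUANTS - 1)
        } else if actual_bits < target_bits - target_bits / 4 {
            quant.saturating_sub(1)
        } else {
            quant
        };
    }
}
```

One obvious weakness of this toy version is that it only ever chooses among quantisers it has already seen, so a real implementation would probably need to interpolate between neighbouring quantisers or keep exploring.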
With B-frames there are two main issues to deal with: which frames to select and how to perform motion estimation. From what I’ve read, the first can be achieved by performing motion compensation against neighbouring frames and calculating SATD (often done on scaled-down frames to be faster). The second question is how to search for bidirectional block vectors. Currently I have a very simple approach: I search for forward and backward motion vectors independently and check which combination of them works best. I suspect there may be an approach specifically for weighted bidirectional search but I could not find anything (and I’m not desperate enough to dive into the codebase of MPEG-4 ASP/AVC encoders).
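To make both pieces more concrete, here is a small Rust sketch of a 4x4 SATD (sum of absolute Hadamard-transformed differences) and of picking between forward, backward and bidirectional prediction once the two vectors have been found independently. It is only an illustration under several assumptions: blocks are 4x4 and stored in raster order, SATD is used as the cost (the metric is my choice for the example), the bidirectional prediction is a plain rounded average with no weighting, and the names `satd4x4`, `PredMode` and `choose_mode` are hypothetical.

```rust
/// Unnormalised SATD of two 4x4 blocks stored in raster order.
fn satd4x4(a: &[u8; 16], b: &[u8; 16]) -> u32 {
    let mut d = [0i32; 16];
    for i in 0..16 {
        d[i] = i32::from(a[i]) - i32::from(b[i]);
    }
    // Horizontal 4-point Hadamard transform on every row.
    for row in d.chunks_exact_mut(4) {
        let (s0, s1) = (row[0] + row[2], row[1] + row[3]);
        let (d0, d1) = (row[0] - row[2], row[1] - row[3]);
        row[0] = s0 + s1;
        row[1] = s0 - s1;
        row[2] = d0 + d1;
        row[3] = d0 - d1;
    }
    // Vertical 4-point Hadamard transform on every column.
    for x in 0..4 {
        let (s0, s1) = (d[x] + d[x + 8], d[x + 4] + d[x + 12]);
        let (d0, d1) = (d[x] - d[x + 8], d[x + 4] - d[x + 12]);
        d[x] = s0 + s1;
        d[x + 4] = s0 - s1;
        d[x + 8] = d0 + d1;
        d[x + 12] = d0 - d1;
    }
    d.iter().map(|v| v.unsigned_abs()).sum()
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum PredMode { Forward, Backward, Bidir }

/// Given the source block and the blocks fetched with the independently
/// found forward and backward vectors, return the cheapest mode and its cost.
/// On ties the earlier candidate (forward, then backward) wins.
fn choose_mode(src: &[u8; 16], fwd: &[u8; 16], bwd: &[u8; 16]) -> (PredMode, u32) {
    // Assumption: unweighted average with rounding.
    let mut avg = [0u8; 16];
    for i in 0..16 {
        avg[i] = ((u16::from(fwd[i]) + u16::from(bwd[i]) + 1) >> 1) as u8;
    }
    let candidates = [
        (PredMode::Forward, satd4x4(src, fwd)),
        (PredMode::Backward, satd4x4(src, bwd)),
        (PredMode::Bidir, satd4x4(src, &avg)),
    ];
    candidates
        .iter()
        .copied()
        .min_by_key(|&(_, cost)| cost)
        .expect("candidate list is never empty")
}
```

The same `satd4x4` cost, summed over a downscaled frame after motion compensation against its neighbours, could serve as the frame-type selection measure mentioned above. A real mode decision would also have to account for the bits spent on the motion vectors, which this sketch ignores.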
And finally there’s the whole question of quality. I suspect that my encoder is far from being good because it should not merely transform-quantise-code blocks but also perform some masking (i.e. set some higher-frequency coefficients to zero instead of hoping that they’ll be quantised to zero).
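One crude way to picture such masking is the Rust sketch below, which zeroes small high-frequency coefficients of a 4x4 block before quantisation. Everything here is an assumption for illustration: the function name `mask_high_freq`, the raster coefficient layout, using `x + y` as the frequency measure, and the cut-off and threshold parameters. A proper psychovisual model would be considerably more involved.

```rust
/// Zero every non-zero coefficient whose position satisfies `x + y > max_freq`
/// and whose magnitude is below `threshold`. Returns how many were zeroed.
/// Coefficients are assumed to be in raster order (index = y * 4 + x).
fn mask_high_freq(coeffs: &mut [i16; 16], max_freq: usize, threshold: u16) -> usize {
    let mut zeroed = 0;
    for (idx, c) in coeffs.iter_mut().enumerate() {
        let (x, y) = (idx % 4, idx / 4);
        if x + y > max_freq && *c != 0 && c.unsigned_abs() < threshold {
            *c = 0;
            zeroed += 1;
        }
    }
    zeroed
}
```

For example, with `max_freq = 3` and `threshold = 2`, a block full of ones with a large DC value loses the six coefficients in its bottom-right corner, while the DC and the low-frequency coefficients stay untouched.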
So this will be long and boring work…