As you might know (but still not care), I’m working on adding full RealMedia support for NihAV starting with video. So I’ve made it to decoding RealVideo 2 and I have some not so nice words to say about H.263 and MPEG-4 ASP.
First, the creeping featuritis in the standards: MPEG-4 part 2 from 2001 has annexes A-O (the version from 2004 has only annexes A-M for some reason) while ITU H.263 (version from 2005) has annexes A-X plus two appendices. For comparison, ITU H.264 from 2017 has annexes A-J, same for MPEG-4 part 10 😉 Mind you, some annexes are for informative stuff (e.g. how an encoder should work or a list of patent claims) but others add new coding features. So, for MPEG-4 part 2 (2001) we have 15 annexes, 7 of them are informative and only a couple of the normative annexes add new features. For ITU H.263, out of 24 annexes about 15 introduce new coding modes and other enhancements (different treatment of motion vectors, loop filter, an alternative macroblock coding mode, PB-frame type and a lot more). The features are actually grouped into baseline(-ish) H.263 and H.263+.
Second, neither of them is really suitable for video coding. I know, it might sound strange, but each of these standards is an unholy mix of various codecs. H.263 mixes several codecs from different generations together (initial H.263 did not have B-frames, later they’ve added PB-frames and finally B-frames too, there are at least two different ways to code macroblocks etc.), while MPEG-4 part 2 is for coding 3D video and merely also specifies a method to code video texture on those 3D shapes (there are no actual frames there, just VOPs, i.e. Video Object Planes). And yet, because the compression methods there provided an improvement over H.262 (aka MPEG-2 Video), they were used in various forms with various hacks in many multimedia formats. There we have a very wide gamut from RealVideo 1 and Sorenson Spark (aka FLV1) with just I- and P-frames to Intel I.263 that had PB-frames to RealVideo 2 with many features of H.263+ (including B-frames) to M$ MPEG-4 decoders to WMV2.
And here we have the problem: both formats grew from the joint effort known as H.262 or MPEG-2 Video so obviously it was a good idea to abuse the same decoder structure to handle all possible variations of H.263 and video texture coding from MPEG-4 part 2 and then add all decoder-specific hacks. As a result you get a mess that’s hard to comprehend because it usually depends on many various context variables set in a specific manner for a specific codec. Hence the post title.
To demonstrate this I’ll show how the same feature is handled in different H.263/MP4p2-based codecs.
Sequence and frame headers
Obviously it differs for every codec. Some rely on container-provided width and height, some have dimensions coded for GOP or for individual frames, some codecs have only meaningful bits in the frame header, others store all feature bits and error out on unsupported configurations.
Frame types
- Intel I.263: I, P, PB
- RealVideo 1: I, P
- RealVideo 2: I, P, B
- Sorenson Spark: I, P, droppable P
- WMV1: I, P
- WMV2: I, P, X8 (alternative I-frame coding)
- H.263 in general: I, P, PB, B, EI, EP (last two are enhancement layer picture types for scalable coding)
- MPEG-4: I, P, B and S (last one is sprite-coded picture)
Block coding
- Intel I.263: H.263 codes
- RealVideo 1: H.263 codes with special codes for I-frame DCs
- RealVideo 2: H.263+ AIC mode (advanced I-frame coding) plus H.263 P- or B-frames
- Sorenson Spark: H.263 codes with a custom handling of AC escapes
- WMV1/2: M$MPEG-4 codes
Motion vector reconstruction
- H.263: simply add predictor vector
- H.263 UMV: depending on predictor value and difference range wrap it or not (see ITU H.263 D.2 for proper explanation)
- MPEG-4:
if (mv < low) mv += range; if (mv > high) mv -= range;
- M$MPEG-4:
if (mv <= -64) mv += 64; if (mv >= 64) mv -= 64;
(And there are different ways to predict motion vectors too!)
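To make the difference more tangible, here is a minimal Rust sketch of the variants listed above. All the names (`MvMode`, `reconstruct_mv`) are made up for illustration and this is not the actual NihAV code; the UMV case is deliberately left out since its wrapping rules really need the explanation from ITU H.263 D.2, and the `low`, `high` and `range` values are assumed to be supplied by the caller.

```rust
/// Hypothetical mode selector: one variant per reconstruction rule above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum MvMode {
    /// Plain H.263: simply add the predictor.
    H263,
    /// MPEG-4 style: wrap by `range` when outside [low, high].
    Mpeg4 { low: i16, high: i16, range: i16 },
    /// M$MPEG-4 style: fixed wrap at +-64.
    MsMpeg4,
}

/// Reconstructs one motion vector component from its predictor
/// and the difference decoded from the bitstream.
fn reconstruct_mv(mode: MvMode, pred: i16, diff: i16) -> i16 {
    let mut mv = pred + diff;
    match mode {
        MvMode::H263 => {}
        MvMode::Mpeg4 { low, high, range } => {
            if mv < low {
                mv += range;
            }
            if mv > high {
                mv -= range;
            }
        }
        MvMode::MsMpeg4 => {
            if mv <= -64 {
                mv += 64;
            }
            if mv >= 64 {
                mv -= 64;
            }
        }
    }
    mv
}
```

The point is that the caller does the same thing for every codec and only the mode value changes, which is about the only sane way to keep such quirks contained.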
There are even more quirks than I listed here but it should give you an idea what a fine mess these formats are and why the code that supports them all tends to turn into a huge mess. I tried to solve it in NihAV by having a template decoder for H.263 that calls a bitstream parser for the actual codec-specific parsing and keeps some quirks inside specific structures (like an MV that adds vectors differently depending on the current mode). I still have more features to take into account (like slices, AC prediction and B-frames) so I’ll have to redesign it before I can support RealVideo 2 properly.
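The split between a shared decoder and a codec-specific parser could look roughly like the sketch below. Everything here (`BitstreamParser`, `H263Decoder`, `PicInfo`, `ToyParser`) is a hypothetical name invented for this example and not the real NihAV interface; the toy parser does not read any bits and only exists to show who calls whom.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum FrameType { I, P }

#[derive(Clone, Copy, Debug, PartialEq)]
struct PicInfo { ftype: FrameType, width: usize, height: usize }

/// Codec-specific part: everything that touches the bitstream.
trait BitstreamParser {
    fn parse_picture_header(&mut self) -> Result<PicInfo, String>;
    /// Returns `true` if the macroblock is coded (i.e. not skipped).
    fn parse_macroblock(&mut self, info: &PicInfo) -> Result<bool, String>;
}

/// Shared part: walks over macroblocks and would do the reconstruction.
struct H263Decoder { frames_decoded: usize }

impl H263Decoder {
    fn new() -> Self { Self { frames_decoded: 0 } }

    /// Decodes one frame, returning its header and the number of coded MBs.
    fn decode_frame<P: BitstreamParser>(&mut self, parser: &mut P)
        -> Result<(PicInfo, usize), String>
    {
        let info = parser.parse_picture_header()?;
        let mb_w = (info.width + 15) / 16;
        let mb_h = (info.height + 15) / 16;
        let mut coded = 0;
        for _ in 0..mb_w * mb_h {
            if parser.parse_macroblock(&info)? {
                coded += 1;
            }
        }
        self.frames_decoded += 1;
        Ok((info, coded))
    }
}

/// Toy parser standing in for a real codec: fixed QCIF I-frame header,
/// every second macroblock reported as coded.
struct ToyParser { mb_idx: usize }

impl ToyParser {
    fn new() -> Self { Self { mb_idx: 0 } }
}

impl BitstreamParser for ToyParser {
    fn parse_picture_header(&mut self) -> Result<PicInfo, String> {
        self.mb_idx = 0;
        Ok(PicInfo { ftype: FrameType::I, width: 176, height: 144 })
    }
    fn parse_macroblock(&mut self, _info: &PicInfo) -> Result<bool, String> {
        let coded = self.mb_idx % 2 == 0;
        self.mb_idx += 1;
        Ok(coded)
    }
}
```

With this shape a new codec means writing another `BitstreamParser` implementation instead of adding yet another context variable to the common decoder.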
But then maybe I’ll add Vivo Media format support for old times’ sake (it’s the funniest one with codebooks stored as strings of ones and zeroes like “0000 0011 110” inside the binary with “End” signalling the codebook end).
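Just for fun, turning such textual codebooks into something usable could be sketched like this. It assumes the strings have already been extracted from the binary into a slice, that spaces are only visual separators, and that each code fits into 32 bits; the function name `parse_codebook` is made up and the real layout may well differ.

```rust
/// Parses textual codewords like "0000 0011 110" into (code, length) pairs,
/// stopping at the "End" marker. Spaces inside an entry are ignored.
fn parse_codebook(entries: &[&str]) -> Result<Vec<(u32, u8)>, String> {
    let mut codes = Vec::new();
    for entry in entries {
        if *entry == "End" {
            return Ok(codes);
        }
        let mut code = 0u32;
        let mut len = 0u8;
        for ch in entry.chars() {
            match ch {
                '0' | '1' => {
                    if len == 32 {
                        return Err(format!("code too long: {:?}", entry));
                    }
                    // Append the next bit, most significant bit first.
                    code = (code << 1) | u32::from(ch == '1');
                    len += 1;
                }
                ' ' => {}
                _ => return Err(format!("unexpected character {:?}", ch)),
            }
        }
        codes.push((code, len));
    }
    Err("codebook has no End marker".to_string())
}
```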