So, for the last month or even more (it feels like an eternity anyway) I was mostly trying to force myself to write an MPEG-4 ASP decoder. Why? Because I still have some content around that I’d like to play with my own player. Of course I have not implemented a lot of the features (nor am I going to) but even what I had to deal with made me write this rant.
First of all, MPEG is overrated.
Curiously enough, it reminds me a lot of Xiph.org—both organisations have (had?) professionals working on various multimedia-related technologies (or unrelated ones—remember MPEG-G?), both have released audio codecs widely used around the world, both are known for giving catchy names (not always: consider the rather confusing group of HEVC, EVC and LCEVC), and both are known for sucking at video codecs. And before you start naming video codecs bearing the MPEG name, I ask you: how many of the popular ones were not a joint effort between ITU and MPEG (and ITU released its first popular video codec before MPEG was even formed)? I can name only MPEG-1 Video. Similarly with Xiph: their only success was a modified TrueMotion VP3 codec, while both Tarkin and Daala failed (but here I’m not talking about them).
And nothing demonstrates it better than MPEG’s magnum opus, ISO/IEC 14496 aka MPEG-4. Some parts of it do not suck: the rebranded MOV specification for which you have to pay insane money (because ISO), ITU H.264 aka ISO/IEC 14496-10 which is good, and some sub-sub-parts of MPEG-4 Audio (i.e. what we call AAC, while ignoring all the additional coding tools it may have, like TwinVQ, let alone the more exciting stuff: several speech codecs, a text-to-speech codec, and Structured Audio, a programming language for describing the scene and the instruments or other ways to produce sounds, as well as the actual music score using them). The rest was ambitious and went in a completely wrong direction. I understand they were betting on the upcoming virtual reality (it seems to become promising every decade or two) but that bet lost, leaving MPEG-4 in a laughably sad state.
And MPEG-4 Visual is the perfect demonstration of everything that’s wrong with MPEG-4. ITU H.263 describes a bitstream suitable for transferring one video stream; MPEG-4 Visual describes a bitstream that can hold a multitude of different objects: 3D meshes, synthetic faces, and (probably by mistake) a video sequence that can be mapped onto those objects as a texture (for when an ordinary wavelet-based still texture is not enough for your needs). As a result, people care only about the video coding and try to ignore the other bits (and BIFS). And since the video coding is defined in a way that accommodates all possible uses (like non-rectangular video or even binary masks), there are many features that nobody who is not seriously insane would use.
As for the video coding itself, I can characterise it as perverted H.263: besides reusing codebooks and coding concepts like that shitty AC prediction mode (more about it later), which appeared in the second edition of ITU H.263 a year before ISO 14496-2 was released, there’s a “short header mode” which is essentially an attempt to repack an H.263 stream as an MPEG-4 one.
There are enough rants about the stupidity of not having a bit-exact transform while allowing rather unlimited chains of dependent frames, resulting in various artefacts caused by mismatches between the different FDCT/IDCT implementations used by encoders and decoders. There are other stupidities, like not having a loop filter but intending OBMC to be the default. There’s the stupid decision of using the same codebook but with different meanings for intra coefficients; it made sense for H.263 AIC, as it allowed already existing decoders to keep the same decoding routine, but here it’s a different set of meanings anyway, so they could have introduced a completely different codebook better suited for intra coefficients. Actually, Microsoft did exactly that, throwing out a lot of useless pieces as well (and then their codecs got hacked and renamed to DivX ;-) 3, and the rest is history). And of course there are the attempts to fit B-frames into AVI (some lumped a B-frame together with the preceding P-frame in one chunk and used a following empty frame to signal when to display it, others simply coded the frame sequence as is and hoped that the player would either introduce a delay or read frames ahead and rearrange them).
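The first of those AVI hacks (the so-called “packed bitstream”) is easy to sketch. The helper below is my own illustration, not code from any real demuxer: it just looks for every VOP start code (00 00 01 B6) inside a chunk and splits the chunk at those offsets, so a chunk carrying a P-VOP plus a B-VOP yields two pieces.

```rust
/// Return the offsets of every VOP start code (00 00 01 B6) in a chunk.
fn vop_offsets(chunk: &[u8]) -> Vec<usize> {
    chunk.windows(4)
        .enumerate()
        .filter(|(_, w)| *w == [0x00, 0x00, 0x01, 0xB6])
        .map(|(off, _)| off)
        .collect()
}

/// Split a "packed" AVI chunk into individual VOPs;
/// a normal chunk comes back as a single piece.
fn split_packed_vops(chunk: &[u8]) -> Vec<&[u8]> {
    let offsets = vop_offsets(chunk);
    let mut vops = Vec::new();
    for (n, &start) in offsets.iter().enumerate() {
        let end = offsets.get(n + 1).copied().unwrap_or(chunk.len());
        vops.push(&chunk[start..end]);
    }
    vops
}

fn main() {
    // A fabricated chunk with two VOP start codes and one payload byte each.
    let chunk = [0x00, 0x00, 0x01, 0xB6, 0xAA,
                 0x00, 0x00, 0x01, 0xB6, 0xBB];
    for vop in split_packed_vops(&chunk) {
        println!("VOP of {} bytes", vop.len());
    }
}
```

A real decoder also has to recognise the empty N-VOP in the following chunk and drop it, otherwise the frame count gets out of sync.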
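And to go back to the transform mismatch mentioned above, here is a toy model (entirely mine, far cruder than real FDCT/IDCT rounding differences) of why the lack of a bit-exact transform matters: two decoders that only differ in how they round diverge further with every dependent frame, because each frame predicts from the previous reconstruction.

```rust
// Toy model of IDCT mismatch drift: each "frame" reconstructs
// prediction + 0.5, and the two decoders round that value differently.
// The one-level difference per frame accumulates over the dependent chain.
fn drift_after(frames: u32) -> i32 {
    let mut pix_a: i32 = 0; // decoder A's reconstructed pixel
    let mut pix_b: i32 = 0; // decoder B's reconstructed pixel
    for _ in 0..frames {
        let recon_a = pix_a as f64 + 0.5; // hypothetical transform output
        let recon_b = pix_b as f64 + 0.5;
        pix_a = (recon_a + 0.5).floor() as i32; // rounds 0.5 up: +1 per frame
        pix_b = recon_b as i32;                 // truncates:     +0 per frame
    }
    (pix_a - pix_b).abs()
}

fn main() {
    // After 100 dependent frames the two decoders differ by 100 levels.
    println!("drift after 100 frames: {}", drift_after(100));
}
```

This is exactly why H.264 and later standards mandate an exact integer transform instead.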
But to my taste the shittiest thing is DC and AC prediction (also a part of H.263 Advanced Intra Coding). The idea is simple: predict the DC from the neighbouring blocks, and then use the same block as the source of prediction for the first row or column of ACs. The problem is that AC prediction mode also changes the block scan order depending on the prediction direction. So while normally one can choose when to read the bitstream and when to decode the data into an actual picture, here we have a fixed sequence of steps: decode the DC, predict the DC, set the scan order depending on the AC prediction mode and the DC prediction direction, decode and dequantise the ACs, predict the ACs, and save them for the next possible prediction. Of course you can postpone it until later, but then you’ll have to reorder the coefficients. I can’t remember any other block-based video codec with the same problem: normally you should be able to decode all the data in one pass and reconstruct it in the next pass if that’s what you want, or combine operations in any way you like (e.g. combine inverse quantisation and IDCT). The only worse offender I can think of is X8 coding (a completely alternative intra frame coding mode) in WMV2 and WMV3, where some bit-reading decisions depended on already reconstructed pixels (I vaguely remember it also using spatial intra prediction, somewhat like later in H.264 but for 8×8 blocks and with a two-pixel-wide strip for the neighbour context).
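That forced ordering can be sketched like this (all names are mine, the “block” is shrunk to four coefficients, and which alternate scan pairs with which prediction direction is glossed over). The point is that the scan used to place the ACs read from the bitstream depends on the DC prediction, which in turn needs the neighbouring DCs to be decoded already.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum PredDir { FromLeft, FromTop }

/// Simplified DC prediction rule of the same shape as the real one:
/// compare the gradients between the neighbour DCs
/// (a = left, b = top-left, c = top) and pick the smoother direction.
fn dc_pred_dir(a: i32, b: i32, c: i32) -> PredDir {
    if (a - b).abs() < (b - c).abs() { PredDir::FromTop } else { PredDir::FromLeft }
}

/// With AC prediction enabled, the *reading* scan depends on that direction
/// (the two arrays are toy stand-ins for the real alternate scans).
fn ac_scan(dir: PredDir) -> [usize; 4] {
    match dir {
        PredDir::FromTop  => [0, 1, 2, 3],
        PredDir::FromLeft => [0, 2, 1, 3],
    }
}

/// The forced pipeline: neighbour DCs must be decoded before the ACs
/// can even be placed, because they select the scan order.
fn decode_intra_block(bitstream_acs: &[i32; 4], a: i32, b: i32, c: i32) -> [i32; 4] {
    let scan = ac_scan(dc_pred_dir(a, b, c));
    let mut block = [0i32; 4];
    for (i, &coef) in bitstream_acs.iter().enumerate() {
        block[scan[i]] = coef; // placement depends on the DC prediction
    }
    block
}

fn main() {
    // Gradients pick "from top": the toy identity scan leaves ACs in order.
    println!("{:?}", decode_intra_block(&[1, 2, 3, 4], 0, 0, 10));
    // Flipped gradients pick "from left": the other scan reorders them.
    println!("{:?}", decode_intra_block(&[1, 2, 3, 4], 10, 0, 0));
}
```

So a decoder cannot simply slurp all the coefficients first and do the prediction later without also buffering and reordering them, which is the annoyance described above.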
In either case, MPEG-4 Part 2 is a rich source of stupid and questionable things; everybody can look into it and find something of their own to dislike. I listed what I encountered during the decoder development, but I know there’s still more to hate. For example, there’s the GMC sprite mode which some encoders abused for better motion compensation, there’s the Studio Profile which only a few Kierans have heard of, there’s the consistent approach of predicting motion vectors differently for each macroblock type, there’s… But enough for now. I hope the existing examples were enough to demonstrate why I spent most of the time avoiding working on it (side note: I hope to read a similar post from Luca about Opus one day).
Additionally, it should explain why there were lots of codecs based on H.263 (including the original Flash Video, RealVideo 1 and 2, Vivo Video and so on) while MPEG-4 Part 2 got its popularity essentially thanks to a hacked version of a codec not really compatible with it. At least nobody has cared about it for a very long time now, while ITU H.264 is still here and going strong (partly because of the format and software advantages, partly because its success inflamed the greed of the various IP rights holders, making the successor codecs too expensive to adopt).
As for the NihAV support, I’ve tested it on a couple of videos I care about and it seems to work passably. Hopefully I’ll never encounter other features that I’d need to implement (like GMC sprites). There’s still some work to be done though: even though the playback is fast as it is, I still want to try implementing multi-threaded decoding (it should not be that hard, as I’ve made all the preparations in the decoder already and I can reuse code from my multi-threaded H.264 decoder), and I still need to deal somehow with MP3 in AVI. So it’ll probably take another week or two.
For 3D multimedia, including Web3D, Apple had the QuickDraw 3D Metafile (3DMF), Microsoft had MetaStream 3D, and Adobe had Shockwave, but all of them failed.
Even Blender’s company developed the Blender web plugin with game engine features, but it also failed, and the company went bankrupt.
I remember learning VRML back in school in the late 1990s. And hearing about virtual reality technologies that were supposedly the future a year or two earlier (like the VFX1 helmet).
And a decade before that we had cyberpunk novels being all the rage…
I vaguely recall that the MPEG-4 ASP reference encoder/decoder was very messy. The H.264 one was leaps and bounds ahead in terms of code organisation.
Suggested post: compare the quality of reference codec implementations.
Uhm, for a codebase where “designed by committee” is not merely a pejorative term but an accurate description of it, I’d rather not touch it.
Also, it probably requires somebody more familiar with conventional source code practices rather than somebody familiar with video coding. I can only poke fun at modern reference implementations having to resort to SIMD in order to be of any non-theoretical interest.