H.264 specification sucks

So it has come to a stage where I have nothing better to do so I tried to write H.264 decoder for NihAV (so I can test the future nihav-player with the content beside just sample files and cutscenes from various games). And while I’ve managed to decode at least something (more about that in the end) the specification for H.264 sucks. Don’t get me wrong, the format by itself is not that bad in design but the way it’s documented is far from being good (though it’s still serviceable—it’s not an audio codec after all).

And in the beginning to those who want to cry “but it’s GNU/Linux, err, MPEG/AVC”. ITU H.264 was standardised in May 2003 while MPEG-4 Part 10 came in December 2003. Second, I can download ITU specification freely and various editions too while MPEG standard still costs money I’m not going to pay.

I guess the main problems of H.264 come from two things: dual coding nature (i.e. slice data can be coded using variable-length codes or binary arithmetic coder) and extensions (not as bad as H.263 but approaching it; and here’s a simple fact to demonstrate it—2003 edition had 282 pages, 2019 edition has 836 pages). Plus the fact that is codified the wrong name for Elias gamma’ codes I ranted on before.

Let’s start with the extensions part since most of them can be ignored and I don’t have much to say about them except for one thing—profiles. By itself the idea is good: you have certain set of constraints and features associated with the ID so you know in advance if you should be able to handle the stream or not. And the initial 2003 edition had three profiles (baseline/main/extended) with IDs associated with them (66, 77 and 88 correspondingly). By 2019 there have been a dozen of various profiles and even more profile IDs and they’re not actually mapped one to one (e.g. constrained baseline profile is baseline profile with an additional constraint_set1_flag set to one). In result you have lots of random profile IDs (can you guess what profile_idc 44 means? and 86? or 128?) and they did not bother to make a table listing all known profile IDs so you need to search all specification is order to find out what they mean. I’d not care much but they affect bitstream parsing, especially sequence parameter set where they decided to insert some additional fields in the middle for certain high profiles.

Now the more exciting part: coding. While I understand the rationale (you have simpler and faster or slower but more effective (de)coding mode while using the same ways to transform data) it created some problems for describing it. Because of that decision you have to look at three different places in order to understand what and how to decode: syntax tables in 7.3 which present in which order and under which conditions elements are coded, semantics in 7.4 telling you what that element actually means and what limitations or values it has, and 9.2 or 9.3 for explanations on how certain element should be actually decoded from the bitstream. And confusingly enough coded block pattern is put into 9.1.2 while it would be more logical to join it with 9.2, as 9.1 is for parsing generic codes used not just in slice data but various headers as well and 9.2 deals with parsing custom codes for non-CABAC slice data.

And it gets even worse for CABAC parsing. For those who don’t know what it is, that abbreviation means context-adaptive binary arithmetic coding. In other words it represents various values as sequences of bits and codes each bit using its own context. And if you ask yourself how the values are represented and which contexts are used for each bit then you point right at the problem. In the standard you have it all spread in three or four places: one table to tell you which range of contexts to use for a certain element, some description or separate table for the possible bit strings, another table or two to tell you which contexts should be used for each bit in various cases (e.g. for ctxIdxOffset=36 you have these context offsets for following bits: 0, 1, (2 or 3), 3, 3, 3), and finally an entry that tells you how to select a context for the first bit if it depends on already decoded data (usually by checking if top and left (macro)blocks have the same thing coded or not). Of course it’s especially fun when different bit contexts are reused for different bit positions or the same bit positions can have different contexts depending on previously decoded bit string (this happens mostly for macroblock types in P/SP/B-slices but it’s still confusing). My guess is that they tried to optimise the total number of contexts and thus merged the least used ones. In result you about 20 pages of context data initialisation in the 2019 edition (in initial edition of both H.264 and H.EVC it’s just eight pages)—compare that to almost hundred pages of default CDFs in AV1 specification. And CABAC part in H.265 is somehow much easier to comprehend (probably because they made the format less dependent on special bit strings and put some of the simpler conditions straight into binarisation table).

To me it seems that people describing CABAC coding (not the coder itself but rather how it’s used to code data) did not understand it well themselves (or at least could not convey the meaning clearly). And despite the principle of documenting format from decoder point of view (i.e. what bits should it read and how to act on them in order to decode bitstream) a lot of CABAC coding is documented from encoder point of view (i.e. what bits you should write for syntax element instead of what reading certain bits would produce). An egregious example of that is so-called UEGk binarisation. In addition to the things mentioned above it also has rather meaningless parameter name uCoff (which normally would be called something like escape value). How would I describe decoding it: read truncated unary sequence up to escape_len, if the read value is equal to escape_len then read an additional escape value as exp-Golomb code shifted by k and trailing k-bit value, otherwise escape value is set to zero. Add escape value to the initial one and if the value is non-zero and should be signed, read the sign. Section 9.2.3.2 spends a whole page on it with a third of it being C code for writing the value.

I hope I made it clear why H.264 specification sucks in my opinion. Again, the format itself is logical but comprehending certain parts of the specification describing it takes significantly more time than it should and I wanted to point out why. It was still possible to write a decoder using mostly the specification and referring to other decoders source code only when it was completely unclear or worked against expectations (and JM is still not the best codebase to look at either, HM got much better in that aspect).

P.S. For those zero people who care about NihAV decoder, I’ve managed to decode two random videos downloaded from BaidUTube (funny how one of them turned out to be simple CAVLC-coded video with no B-frames) without B-frames and without apparent artefacts in first hundred frames. There’s still a lot of work to make it decode data correctly (currently it lacks even loop filter and probably still has bugs) plus beside dreaded B-frames with their co-located MVs there are still some features like 8×8 DCTs or high-bitdepth support I’d like to have (but definitely no interlaced support or scalable/multiview shit). It should be good enough to play content I care about and that’s all, I do not want to waste extremely much time making it a perfect software that supports all possible H.264/AVC features and being the fastest one too.

Leave a Reply