On H.264 Coding Schemes Names

June 3rd, 2016

Continuing the theme set by the previous post, let’s talk more about confusing names introduced by H.264. I mean CAVLC and CABAC.

CAVLC stands for Context-based Adaptive Variable Length Coding. While technically true because it employs variable-length codes and the code set is selected based on context it’s nothing special (and I’ve not spotted anything there that would make it “adaptive”). Again, it’s a trivial thing less exercised before because they had less ROM for codebooks. The idea of “let’s select codebook depending on top and/or left decoded values” it too trivial to get an own name IMO.

CABAC stands for Context-based Adaptive Binary Arithmetic Coding and the name is partly stupid and partly misleading. But before I explain why I want to present some history and terminology.

Arithmetic coding was developed in late sixties to early seventies but mostly known by work of Rissanen and Langdon that resulted in many IBM patents. The idea is that you can assign probabilities to various symbols, send them to the coder and the coding result is a long fraction belonging to the range obtained by multiplying the ranges in sequence. I.e. if we have probabilities for A, B and C as ranges [0; 1/3), [1/3; 2/3) and [2/3; 1) then AB is coded in [1/9; 2/9) range and BA is coded in [1/3; 4/9) range. It’s the ideal coding method since it codes probabilities in the minimum possible amount of bits (unless you remember it’s real world and we don’t have infinite-precision arithmetic; still, the losses are very small and there’s no better coding method).

And in 1979 G.*.*. Martin (no, not the writer known for Tuf Voyaging) introduced range coding. Which is absolutely the same thing except that (de)coder maintains low and range values instead of low and high values in a conventional arithmetic coder (hence the name). Since it was kinda not covered by patents it got more popularity over the years.

And because dealing with arbitrary probabilities usually involves division by an arbitrary integer (and maybe increased coder precision) the further improvements were for sacrificing efficiency for speed until it boiled down to coding just two symbols and creating more elaborate models to code input that takes more than one bit. Arithmetic mode in JPEG seems to be simply feeding bits from Huffman codes and such to Q-coder (patented by IBM and thus extremely popular in the wild) that squeezes a bit more entropy out of them. Then you create an advanced version (MQ-coder) and push it into JPEG-2000 until binary coding is popular in image and video coding.

So, CABAC is:

  1. Context-based — yes, static coding would be a tad more effective than Huffman coding applied to bits (hint: it gives no savings). The problem that it’s the first step of the classical scheme: modelling and providing probability to the entropy coder;
  2. Adaptive — see above;
  3. Binary — true (just remember it codes not bits but most and least probable symbols);
  4. Arithmetic — actually it uses range coding;
  5. Coding — nothing to argue with here.

In general, the naming is a lot like USSR which was hardly a union, probably not soviet (whatever that word means—literal meaning is “belonging to the councils”), republics were just provinces (or local despoties) but it was more or less socialistic (to the point its ideology can be called international socialism and it was founded by SDAPR(B) too).

And I’d like to point out that CABAC should refer to the whole process of binarisation+context selection plus coding the result, not just the exact implementation used in ITU H.264 and ITU H.EVC (even if it’s called CBAC in AVS and “we have completely different coding” in VPx). And if you want an example of context-based adaptive binary non-arithmetic coding look at ELS used in G2M2 (and if you drop binary coding then you have examples in every other advanced lossless audio codec).

On Variable-Length Codes Names

May 31st, 2016

After seeing the recent commit in Libav this rant simply wrote itself.

People, Solomon Wolf Golomb was a genius whose work influenced various areas of science (I’ve read about his work in Martin Gardner’s books plus some of his papers) but please stop attributing to him stuff he did not invent. I’m talking about universal variable-length codes for integers (or Xine for short).

He has introduced (in late sixties) a specific kind of Xine for optimal coding of integers with certain distributions (I’ve recommended to read the paper introducing them before, it’s awesome and I wish more papers were written like that one). Those codes have a parameter k that is distribution parameter and also it’s used to split code into two parts—an unary prefix coding N/k and log2k bits coding N%k (for some values it’s rounded down, for another it’s rounded up). Later Robert Rice had a similar idea and independently introduced codes that were Golomb codes with parameter 2^k (and thus they’re often called Rice codes and they’re used more because they are easier to manipulate on computer). And that’s all—there are no other Golomb codes.

Yet thanks to ITU H.264 standard (aka GNUMPEG-4 AVC) we have exp-Golomb codes and interleaved exp-Golomb codes. I don’t know who decided on the name but it’s misleading and wrong (but because it’s in the standard people insist on using it; that also shows how much people designing codecs know about general compression methods). Maybe if some other Xines were rediscovered they’d go under equally ridiculous names like geo(metric)-Golomb or norm(al)-Golomb or quad-Golomb or recursive Golomb codes (because people have never heard of Levenstein coding).

Again, back in seventies Peter Elias proposed a scheme for coding: let’s call unary code (i.e. the one that codes an integer as a series on one value terminated with the other value like 000001 or 1111110) alpha code and fixed bit representation of an integer beta code then we can arrive to gamma code that combines both.

So “exp-Golomb” code is really Elias gamma code? NOPE! Like with TV interlacing came first and actual Elias gamma code is what is incorrectly called “interleaved exp-Golomb” (i.e. first you have flag bit telling whether the code is over or there are more bits left, then data bit, then flag bit again, rinse, repeat). And progressive version is Elias gamma prime has unary prefix i.e. alpha code for the length of the second part concatenated with beta code for that part (I’ve rechecked the original paper freely available at sci-hub—because the only time I paid for IEEE papers access was when my scientific supervisor sent me with money to pay for IEEE membership renewal for our chair). Then you can construct delta code that codes integer value in three parts (actual bits, their length and length of the length part) and jump to omega codes that code mostly lengths of the following length part (very meta!).

So there’s another thing to dislike in H.264 standard beside all interlaced modes and scalable and multi-view coding, it’s forcing a wrong terminology on the world (feel free to correct me if there were earlier uses). The same way there are confusions between arithmetic and range coding, various binary coders are not free from it too. But that’s a rant for another day.

On Italian Literature

May 28th, 2016

One cannot be called a true reverse engineer unless he tried (and failed) REing Italian literature collection. I’ve finally tried it (and, obviously, failed).

What’s so special about it? Here is Mike’s description. From what I’ve seen on the first CD videos occupy 280MB out of total maybe 300MB (and over 200MB of it is a single tutorial video). While the actual library data occupies about five megabytes there.

The main library application is written in Visual Basic 4 (16-bit version) and it’s not a compiled version but rather P-code and I’ve failed to find a decompiler for that exact version (32-bit? seems no problem; 16-bit or even 32-bit Visual Basic 3? also no problem; 16-bit VB3? keep searching). There are some utility apps of unknown purpose there written in Borland Delphi (also 16-bit and I’m pretty sure it was simply Borland Delphi then, no additional versions needed). And while those are in sane machine code (well, 16-bit x86 machine code is hardly sane but manageable) there’s a lot of Delphi cruft compiled in with TThis and TThat and TOtherThing and such (plus additions in Italian).

Despite files having extension .LZ[1-3] I doubt they employ any kind of Lempel-Ziv compression, I’d expect some different dictionary-based scheme (you have an index with all possible words after all). And looks like they’ve licensed some DBT thing (obviously it stands for Text DataBase in Italian) from some Italian Institute of Computational Linguistics and this DBT is responsible for the file formats but I’m too lazy to RE those half-megabyte .dlls without a decompiler (written in Delphi too).

A Quick Look on DLI

May 26th, 2016

So yesterday I had a quick look on DLI image format. It turns out to be somewhat related to video codecs (and JPEG of course): there’s 8×8 fast integer DCT approximation, quantisation and bit coding of the block. And bit coding is the most interesting part really—this format employs binary model with old-school arithmetic coding and context selection for the model; coefficients are coded as first an array of coded coefficients flags (plus a flag for last coded coefficient), each non-zero coefficient has an additional flag to signal whether it’s larger than one and in that case it’s coded as unary code (bits coded with arithmetic coder of course).

And I still don’t like the notion of “let’s make our video codec I-frames into an image codec” (the reverse of “motion <image format>” is not much better but at least it makes sense for intermediate formats). Images and video codecs have different use cases and required features but I think I ranted about it once.

When Old Will Beat New Again?

May 19th, 2016

Since my previous post hasn’t brought me answers I sought here’s another philosophical (i.e. no answers again) post on question that bothers me.

The concept is rather simple: some old tricks and methods become more appealing over the time when other more competitive methods lose their traction. So I often wonder when those old methods, approaches and tricks will become relevant again.

For instance, quadtree coding was not popular some time ago and yet we see it again in codecs where it handles blocks of smaller sizes inside some coding unit (ITU H.EVC, VP9, AOMv1—you name the codec). There’s similar story with vector quantisation—it still lives in some GPU-assisted form and is interesting again.

Now let’s talk about classical arithmetic coding. In the flow of time it was mostly supplanted by some variation of binary coding. But with the time binary coding becomes more and more unwieldy since you have to code bits with different contexts and you often don’t code bits per se but rather bits for some variable-length code for integers. So I wonder if the classical arithmetic coding may come to use again and return saner coding while still being faster? Of course one could point me to One Xiphophorus, the company that made the best VP3 encoder, since they’ve found this approach worked fine in Celt and should work fine in Daala (unrelated to them: is FFA1 still a thing?). But really, is CABAC/boolean coder still the coder(s) of future or we’ll see more interesting things from the past? And yes, I’m aware how rANS can be used for faster coding of probabilities and that ANS is used in VP10 experiments. But what about, say, better modelling with, say, order-10 contexts (or the ones that take parameters from both neighbour blocks and blocks upper in hierarchy into account)?

And another one is not related to my usual stuff but is still quite interesting. Will raytracing return again? From what I know the current way is to have lots of triangles, lots of textures, lots of crazy additional maps and lots of even crazier shaders. I believe it went this way:

  • let’s approximate everything by triangles and draw them;
  • simple colours are not good enough—let’s add textures;
  • not good enough—let’s add shading (like Gouraud or Phong);
  • not good enough—let’s introduce bump maps for better realism;
  • not good enough—let’s introduce light maps;
  • not good enough—let’s introduce computable shaders;
  • still not good enough—let’s render scene once, calculate different parameters from it, create new light/shadow/whatever maps, add them to the scene and rerender again;
  • you know what, it’s still not good enough—let’s …

(I don’t know much about computer graphics since our university course didn’t went much farther than Bresenham’s line algorithms and simple image formats)

With all this trickery you still have not achieved realistic picture especially when it comes to dynamic light, shadows and reflections. Yet during all this time there was raytracing which is simple as hell (and equally slow): you have a scene and for each pixel you simply trace its path until you end in some light source or simply give up. With massive parallelism of GPUs and complex shaders it looks to me that switching to raytracing might be easier (sure, there’s a problem of legacy, making all those developers switch from Magma and Vulcan to a new approach etc etc) but I still wonder if it makes sense from technical point of view or will make in the near future.

And as usual—I hope for the answers but I don’t expect that I receive any.

Schizophrenia in Open-source Projects

May 14th, 2016

Disclaimer: the word “schizophrenia” is used here as it’s perceived by majority, not to denote a certain psychological condition. Feel free to be offended.

I’ve wasted about a decade working at two multimedia projects (plus a patch or two to unrelated projects) and what I’ve seen there leads me to the conclusion they both suffer from schizophrenia, albeit in different forms.

FFmpeg

FFmpeg features two forms of schizophrenia—developers and code.

Developers schizophrenia can be seen in how some developers believe they are also Libav developers. Mostly they brag because they’ve sent a patch or two to Libav and now can use it as a free review service. While I dislike Carl Eugen he’s at least honest and acts to his beliefs (here, an amended Elenril’s Law fulfilled; in case you’ve forgotten it says “Every FFmpeg-related discussion ends up mentioning Michael. Or Carl Eugen.”).

Code schizophrenia is more celebrated. The most prominent example is ProRes support—they offer two decoders and two encoders for it. There are two ASF demuxers as well. And two audio resampling libraries. And there are talks about adding second libswscale (*shudder*). The best part is that if you ask why it will probably go like this:

— Why do you have feature X in two versions?
— Because Libav has it.
— But why do you take it if you have your own version?
— To make merging Libav codebase easier.
— But why do you need to merge it at all?
— To make merging Libav codebase easier.

Please please tell me I’m wrong and provide proper reasons why FFmpeg keeps merging Libav stuff and keeps several versions of the same feature.

Libav

Here it’s somewhat more interesting—you have developers with physical multiple personalities disorder. Unlike the case in FFmpeg here you have people that work solely on Libav but as several different people. Most prominent examples are Luca (known as lu_zero and koda on IRC) who is really several instances not always agreeing with themselves (and if you subscribe to Lu_zianism then he’s also both Michael and Carl Eugen too, if you don’t believe it—you should because it annoys him/them). And there’s also Alexandra (aka beta elenril) and Anton (aka sasshka 2.0). And that’s the majority of core Libav developers anyway.

But at least the project seems to be more happy with itself and probably has a dream similar to the Ukrainian dream (which is “bugger off you damned Russians” in case you didn’t know).


It was somewhat fun to watch the fate of proposed bitstream reader replacement. Alexandra (she’s also Top Libav Blogger ?2 by the way—simply because she blogs) proposed a new bitstream reader to replace the old horror (which is a good idea), and that new bitstream reader turned out to be faster than the old mess too, and what was the result? If I’d be British I’d call it sheep-worrying.

Those developers from FFmpeg that believe they also should have a say in Libav process started to express their opinions. While there were independent benchmarking proving the new implementation is indeed faster (which is a good thing to provide) those benchmarks were also run on the decoder not present in Libav, with badly converted functions for code reading, and that turned out to have some problems because of the encoder used (also not in Libav) that produced nonconforming stream and screwed multi-threaded decoding benchmarks (that one can be seen as both trollish and arrogant—kinda like judging The Beatles performance from an excerpt sung by your not very talented neighbour).

But mostly it was bikeshedding and asking why it was not using old get_bits interface. The answer for the latter is simple—because it was built from horrible macros used in half of the places directly so you should either to make everything follow those macros design or convert the old UPDATE_CACHE(); LAST_SKIP_BITS(); ... CLOSE_READER(); into saner get_bits(); skip_bits(); anyway. And Libav developers decided it’s better to have fully new interface anyway and to make it consistent with bytestream reading while at it.

So why did people who should have nothing to do with it bikeshedded that much? Probably because they know in their hearts that as soon as it hits Libav the work on copying it into FFmpeg starts and sooner or later it will reside in FFmpeg codebase probably along with old get_bits.h with most decoders switched to the new bitstream reader anyway. Why? See the theoretical conversation above. I’d like to know the answer why merges are really done but I guess I’ll get it no sooner than this bitstream reader is accepted into Libav master (i.e. never).

On QuickTime Codecs

May 7th, 2016

The amount of interesting codecs is dangerously low so I’ll probably stop writing about them at all (and that rises a question whether this blog should be kept alive at all).

So, scraping bottom of the barrel I come to QuickTime codecs.

There are two codecs from the standard QuickTime package that are yet to be implemented in opensource: QDesign Music and Apple Pixlet. The former is (obviously) an audio codec with simple tones+noise coding, I hope to document it soon. The latter is an intermediate codec based on wavelets, so it should not be that hard to RE. The main problem is that I don’t know where to find a decoder (and I’m too lazy to search for one actively). It’s said that the only version of QuickTime being able to decode it was on Mac OS X Panther (yes, not just when it was called Mac OS X but also when it was purely PowerPC only). I estimate this codec would be rather simple—on par with SMPTE VC-5 (and probably even without codebooks but rather with generic variable-length codes like in Pear Intermediate Codec and AmateurRes). And PowerPC assembly is not that bad after you get hold of rlwinm instruction, I’ve REd most of AIC from PowerPC binary after all.

And there are some third-party extensions even Compn doesn’t know about like NewTek SpeedHQ or Digital Anarchy Microcosm codec. The former is an ordinary DCT-based intermediate codec any koda can RE, the latter is somewhat funny lossless codec (funny because it uses range coder just to decode bytes and use them in 8- or 16-bit RLE) that is better left to Derek to RE. SheerVideo has been documented long time ago, ZyGo video was just another DiVX, VP3 and Indeo 4 have other decoders etc etc.

Life is boring.

Update: so there is a more modern Pixlet decoder. I’ve looked at it. There’s per-plane wavelet compression, parametrised Rice codes, everything rather trivial. The only interesting things are coding of the zeroeth subband (it’s splitted into first coefficient, top row, left column and all other coefficients coded with top+left prediction) and the fact they have subband header with magic 0xDEADBEEF. Nice touch!

Life is still boring though.

Some Thoughts on Reuniting

April 28th, 2016

Before I move to the point I’d like to give some historical examples based on countries.

Ukraine

Well, as you remember in 1917-1918 there were several Ukrainian republics, most known are Ukrainian People’s Republic, West Ukrainian People’s Republic and Ukrainian Soviet Socialistic Republic. There were some other small states like anarchic republic but they are not relevant here.

So, Ukrainian People’s Republic and West Ukrainian People’s Republic willingly united in 1919 and that day is a national holiday now (later it was obviously occupied by Soviet Russia, Poland, Romania and Czechoslovakia). But why did that union happen? Because people wanted it and there was a dream for the united Ukraine since ages.

Germanies

You should’ve learned about it at school (or witnessed if you’re old enough). Why did the unification happened? Because people from both sides wanted it and the Soviet union could not prevent it any more.

Moldova and Romania

These countries share common history, have the same language and people like the idea of single country. While unification has not happened yet it might happen even in this century.

Chinese Republics

Here the situation is funnier. People’s Republic of China doesn’t recognize Republic of China yet they somehow co-exist and probably in distant future they will be the one again. Why? Because PRC is changing and it’s not what it had been during Chairman ? times.

Korea

Here the situation is even funnier. There are two governments who think they are the only True UpstreamKorean state, it’s just half of it is still occupied. And while there are constant talks about reunification, neither state really wants it. One country has suffered under homebrew Socialism (just look up what ‘juche’ means) for too long so it will take an enormous amount of time and money to make both parts equal (even funnier if you consider that before 1960s North Korea was industrially developed and South Korea was an underdeveloped agrarian region). Germanies got it easier (as a person paying Solidaritätszuschlag I know that). So will the reunion ever happen? I wouldn’t bet on it.


And now, to the our favourite projects.

Time from time somebody outside projects or from FFmpeg side asks about projects reunification. There are talks about it at VDDs. And yet there are no results. Have you noticed that I mentioned no such talks initiated by Libav. Why? Probably because Libav does not want to merge back. And there you have it—reunification cannot happen peacefully because you don’t have majority on both sides wanting it.

And that raises two questions: why FFmpeg wants reunification and why Libav doesn’t want it (or as a single question—what prevents it).

It seems that for some reason not clear to me FFmpeg keeps merging all stuff from Libav (feel free to enlighten me, otherwise FFmpeg developers themselves might forget it and it’ll turn into tradition) and having both projects together will solve two problems: the need for merge and the lack of skilled developers (it’s always the issue).

What does Libav gain from the merging? Relief from constant merges? Unlikely since it’s not being done there. More developers? That’s nice but Libav project seems to be happy as is. Return to the known brand and distributions? See above and here.

Let’s assume the projects decided to play nice out of nowhere and please people who’d want them to reunite. What would happen then? Multiple discussions about development process (that lead to the split in the first place), including but not limited to: reviewing process (relaxed and not applicable to some people or mandatory for any change), code standards (especially formatting), what features to have in the united tree (flat history or merges, one native decoder for certain format or two, use the code snippet like it was done in FFmpeg or in Libav). And on this stage it will all start to fall apart again.

So there you have it: clash of different development ideologies and more benefits for one side than the other. Also it’s rather hard to force people to work on the project they don’t like (and now they can choose at least).

And since this discussion cannot avoid certain names, here it is: I believe that Carl Eugen Hoyos deserves to be the next FFmpeg leader. Obviously my opinion doesn’t matter there and I could not convince anybody at VDD’15 but I firmly believe so. He’s the one with passion for the project, he cares for codec support (even fringe formats), he likes to follow guidelines, he respects Michael and is unlikely to go and ruin what he created. And at VDD he looked kinda like the most responsible adult too, so he can be the project face. Again, this is merely my opinion that won’t change anything.

Sincerely yours, NihAV project developer (it’s still vapourware, thanks for asking).

Some Information on Micronas SC4 and VoxWare MetaSound

April 24th, 2016

So I’ve looked at them.

Micronas SC4 seems to be rather unusual as it seems to bring elements of LPC to ADPCM. So it’s not just the old conventional “get nibble, multiply by step, output prediction, update index and step values”—it keeps a history of last 6 decoded samples and predictions and use them to calculate a new prediction value. Details might appear in the Wiki one day.

VoxWare MetaSound is three families of 2-3 codecs bundled under the same brand. I’ve not looked at technical details but they seem to have lots and lots of tables with floating point numbers (or just a bit of tables if you’ve looked at MetaSound first).
Here are the codecs:

  • RT24 2400bps “Real-Time” codec (ID is VOXa)
  • RT28 2844bps “Real-Time” codec (ID is VOXh)
  • RT29 2978bps “High Quality” codec (ID is VOXg)
  • VR12 1260bps Variable Rate codec (ID is VOXb)
  • VR15 1537bps Variable Rate codec(ID is VOXc)
  • SC3 3200bps “Embedded” codec (no ID)
  • SC6 6400bps “Embedded” codec (no ID)

Ask for support by grabbing j-b and demanding it to be supported. I know there are other players beside VLC but that’s the only project advertising that it “plays it all” even on T-shirts. It’s time to be responsible for your own words. And ask for Bink2 too while at it.

General Thoughts about Reverse Engineering Speech Codecs

April 23rd, 2016

Spoiler: they are not nice.

Speech codecs are probably the worst from my REing point of view. Why? Not because they are particularly hard to RE but rather because they are unpleasant. Here’s my list of reasons:

  • They are math-heavy. Of course you need to know mathematics to understand most of the codecs but I had no DSP courses at the university yet I can understand how video codecs work even in fine details, the same with many audio codecs. With speech codecs I have only general ideas about how they work.
  • Even worse, there are hardly any conventions on how to do things and in result codecs are built in a process that puts designing ARM SoCs to shame.
  • Even worse, because codecs are math heavy they have to be implemented with efficiency in mind which results in horrible fixed-point math usually in 16-bit variables.
  • And because all of this wasn’t enough, if codec supports several bitrates it might have additional postprocessing functions for lower bitrates in the best case. In the usual case it’s a different codec.

And that all is the source of REing problems. Bitstream format is usually easy to find and parse, the functions that do something with it are not easy to understand at all. I often end up not understanding what the function does let alone what concept it implements. I might recognize only some of them like LPC filter.

So why are speech codecs so badly designed? In my opinion it comes from the design decision. The initial idea was to use as little bits as possible by having a synthetic model and transmitting only its parameters. It worked great (at least compared to MPEG-4 with its synthetic scene and audio description aka the key parts of standard that people pretend do not exist at all). So you have human throat which is basically a variable tube and vowels are modulated tone, consonants are modulated noise. Transmit filter coefficients (original LPC, LSF, parcor form or something else), noise flag and pitch frequency and you’re done, right? It works fine for some sounds but the quality is not that great and it fails with some sounds completely (the sounds often used by French for example, so there’s still a need for French Speech Codec or j-bc for short). How to improve it? By adding impulses to “excite” the model (i.e. tell when to start/stop the sounds). Not good enough? Add pitch tilting! Still not good enough? Add postprocessing filter (and it was mandatory there long before video codecs). And what if we want to code not just voice and higher frequency range audio too? Well, add…

And thus it became pile of hacks over pile of hacks with a side dish of hacks. And each stage can be done in several different ways which only adds to confusion. That’s not even starting to talk about smart ways to save bits by splitting frame into several subframes and omitting coding some information for selected subframes (it can be interpolated from other subframes information after all). Or using codebooks and vector quantisation. Or how to generate noise for silent frames. Or using better coding than just writing fixed amount of bits for every element. Or…


I’ve finished looking at Lernout&Hauspie CELP+SBC codecs (as usual I don’t understand most of the things they do there but maybe I’ll still document them) and this plus my past experience made me write this post. Next is VoxWare MetaVoice and maybe Micronas SC4. And something saner afterwards. Or maybe it will be the usual nothing.