A look at various video codecs from the 90s

January 18th, 2021

Since I had nothing better to do during Christmas “vacation” (it is the first time I’m in Germany at this time of year so of course I had nothing better to do) I looked at various codecs, mostly from the last century, and wrote some notes about the more interesting ones. Here I’d like to give some information about the rest lest I forget it completely.

  • Affinity Video—JPEG rip-off;
  • Lsvx—H.263 rip-off with possible raw frames;
  • Morgan TVMJ—I should’ve noticed “MJPEG” in the description sooner;
  • VDOWave 2—an unholy mix of H.263 and wavelets. It uses the coding scheme from H.263 (8×8 blocks, loop filter, halfpel motion compensation and even something suspiciously resembling OBMC) but blocks are coded as three 4×4 blocks that should be recombined using a Haar transform into one 8×8 block. Plus there might be an additional enhancement layer for the whole frame based on the same Haar transform as well.
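
In case the Haar recombination sounds cryptic, here is a minimal sketch of a one-level inverse 2D Haar transform that combines four 4×4 subbands into an 8×8 block. This only illustrates the general idea: the actual VDOWave 2 routine, its normalisation, and how it obtains the missing fourth subband remain unclear to me.

    // One-level inverse 2D Haar: combine four 4x4 subbands (average plus
    // horizontal/vertical/diagonal details) into one 8x8 block.
    // Illustration only -- not the actual VDOWave 2 code; the scaling by 4
    // assumes an unnormalised forward transform.
    fn inverse_haar_8x8(ll: &[i32; 16], lh: &[i32; 16], hl: &[i32; 16], hh: &[i32; 16]) -> [i32; 64] {
        let mut out = [0i32; 64];
        for y in 0..4 {
            for x in 0..4 {
                let a = ll[y * 4 + x]; // average
                let h = hl[y * 4 + x]; // horizontal detail
                let v = lh[y * 4 + x]; // vertical detail
                let d = hh[y * 4 + x]; // diagonal detail
                // each quadruple of subband samples maps to a 2x2 area of the output
                out[(y * 2)     * 8 + x * 2    ] = (a + h + v + d) >> 2;
                out[(y * 2)     * 8 + x * 2 + 1] = (a - h + v - d) >> 2;
                out[(y * 2 + 1) * 8 + x * 2    ] = (a + h - v - d) >> 2;
                out[(y * 2 + 1) * 8 + x * 2 + 1] = (a - h - v + d) >> 2;
            }
        }
        out
    }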

And I should mention VSS Codec Light. While it is hard to get through all those levels of C++ abstractions, it looks like it has arithmetic coding with static models, 4×4/8×8/16×16 blocks, 8×8 DCT, and five different wavelet variants to boot. At least it’s not another JPEG or H.263 rip-off.

Overall it feels like back in those days you had mostly JPEG rip-offs, H.263 rip-offs and wavelet-based codecs. I tried to look at more of the latter but one of the codecs turned out to be an impenetrable mess of deeply nested calls that seem to add stuff to lists to be processed later somehow, and another codec demonstrated that the Ghidra disassembler has bugs in handling certain kinds of instructions involving the FS register (IIRC): as a result it thinks the instruction should be a byte or two longer than it really is. So unless this is fixed I can’t look at it. There are still plenty of old codecs I’ve not looked at.

A look at ACT-L2

January 9th, 2021

This is yet another video codec from the 90s used for streaming and completely forgotten now. But since I had nothing better to do I decided to look at it as well.

Essentially it is another H.263 rip-off with a twist. From H.263 it took the overall codec design (I/P-frames, 8×8 DCT, DC prediction, OBMC) but the data coding is special. For starters, they don’t use any codebooks but rather rely on fixed-width bitfields. And those bit values are not written as they occur but rather packed together into separate arrays. There’s a way to improve compression though: those chunks can be further compressed using a binary arithmetic coder with an adaptive model to code bytes (i.e. you have 256 states and you select a state depending on which bits you have already decoded).
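
If I understood that part correctly, it is the classic binary-tree byte model where the context for every bit is the prefix of bits already decoded. A minimal sketch of the decoding side (BinDecoder here is a hypothetical adaptive binary arithmetic decoder, not the actual ACT-L2 one):

    // Classic binary-tree byte model: a byte is decoded as eight binary
    // decisions and the context of every decision is the prefix of bits
    // decoded so far (255 used states plus one unused, i.e. 256 in total).
    // Purely illustrative -- BinDecoder stands in for the real decoder.
    trait BinDecoder {
        // decode one bit using the adaptive probability stored in context ctx
        fn decode_bit(&mut self, ctx: usize) -> u8;
    }

    struct ByteModel {
        base_ctx: usize, // offset of the 256 contexts reserved for this model
    }

    impl ByteModel {
        fn decode_byte(&self, dec: &mut dyn BinDecoder) -> u8 {
            let mut node = 1usize; // tree node index, doubles with every decoded bit
            for _ in 0..8 {
                let bit = dec.decode_bit(self.base_ctx + node);
                node = (node << 1) | (bit as usize);
            }
            (node - 256) as u8 // strip the leading marker bit
        }
    }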

Additionally it has a somewhat different method of coding block coefficients. Instead of the usual (zero-run, level, end-of-block) triplets assigned to a single code, it uses bit flags to signal that certain block areas (coefficients 0-3, 4-7, 8-11 and 12-63) are coded, and for the first three areas it also transmits bit flags to signal which coefficients are coded. Only the last area uses zero-run + level coding (with explicit bitfields for each).
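
In sketch form that scheme should look roughly like this (my understanding only: the bit reader interface, the field widths and the way the last area terminates are all assumptions, not the actual ACT-L2 layout):

    // Hypothetical bit reader interface used only for this illustration.
    trait Bits {
        fn read_bit(&mut self) -> bool;
        fn read(&mut self, nbits: u8) -> u32;
        fn read_signed(&mut self, nbits: u8) -> i32;
    }

    // Decode one 8x8 block of coefficients using per-area coded flags.
    fn decode_block_sketch<B: Bits>(br: &mut B, blk: &mut [i32; 64]) {
        // areas 0-3, 4-7 and 8-11: an "area coded" flag plus per-coefficient flags
        for &(start, end) in [(0usize, 4usize), (4, 8), (8, 12)].iter() {
            if br.read_bit() {
                for idx in start..end {
                    if br.read_bit() {
                        blk[idx] = br.read_signed(8); // fixed-width level, width assumed
                    }
                }
            }
        }
        // area 12-63: zero-run + level pairs coded as explicit bitfields
        if br.read_bit() {
            let mut idx = 12;
            while idx < 64 {
                idx += br.read(6) as usize;   // zero-run field, width assumed
                if idx >= 64 { break; }
                blk[idx] = br.read_signed(8); // level field, width assumed
                idx += 1;
            }
        }
    }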

Overall it’s an interesting idea and reminds me of TM2 or TM2X since those codecs also used data partitioning (and in the case of TM2 data compression as well).

Bink going from RAD to Epic

January 8th, 2021

So in the recent news I read about Epic Games acquiring RAD Game Tools and I guess I have to say something about that.

RAD was an example of a good niche company: it has been providing game makers with essential tools for more than a quarter of a century and offering new things too. Some people might remember their Miles Sound System and Smacker format from DOS times, some have heard about Bink video, others care about their other tools or the recent general data compression library that even got hardware decoding support on PS5. And one of the under-appreciated things is that the developers are free to publish their research so you can actually read how their things were developed and improved. If that does not convince you of their competence I don’t know what can. (Side note: considering that you usually get useless whitepapers that evade describing how things work, the posts from Charles or Fabian are especially outstanding.)

Since I’m no expert in business matters and lack inside knowledge I don’t know if it’s a good or bad step for the company and its products. Nevertheless I wish them good luck and a prosperous and interesting future, even if we have Electronic Arts to show us an example of what happens when a small company gets bought by a large game developer and publisher.

P.S. I would be grateful if they filled in the missing details about Bink2 video but this is unlikely to happen, so probably somebody who cares enough about it will have to finish the reverse engineering.

A look at a weird audio codec

January 7th, 2021

Since I still have nothing better to do I decided to look at ALF2CD audio codec. And it turned out to be weird.

The codec is remarkable because while it seems to be simple transform + coefficient coding, it does that in its own unique way: the transform is some kind of integer FFT approximation and the coefficient coding is done with a CABAC-like approach. Let’s review the decoder details as far as I understood them (so not much).

Framing. Audio is split into sub-frames for the mid and side channels with 4096 samples per sub-frame. Sub-frame sizes are fixed for each bitrate: for 512kbps it’s 2972 bytes each, for 384kbps it’s 2230 bytes each, for 320kbps it’s 2230/1486 bytes, for 256kbps it’s 1858/1114 bytes. Each sub-frame has the following data coded in it: the first and last 16 raw samples, the DC value, and the transform coefficients.

Coding. All values except the transform coefficients are coded in this sequence: non-zero flag, sign, absolute value coded using an Elias gamma code. Transform coefficients are coded in bit-slicing mode: you transmit the length of the region that may have bit 0x100000 set in its values plus bit flags to tell which entries in that region actually have it set, then the additional length of the region that may have 0x80000 set, and so on. The rationale is that larger coefficients come first, so only the first N coefficients may be that large, then the first N+M coefficients may have the next bit set, and so on down to bit 0. Plus this way you can have a coarse or fine approximation of the coefficients to fit the fixed frame size without special tricks to change the size.
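
My interpretation of the reconstruction side, as a minimal sketch (how the region lengths and flags are actually entropy-coded is abstracted away behind closures, and sign handling is left out):

    // Bit-slicing reconstruction: for every bit plane from the top one down,
    // read how much the "may have this bit set" region grows, then read a
    // flag for each entry in that region and set the bit where told to.
    fn decode_bitplanes<RL, RF>(coeffs: &mut [u32], top_bit: u32, mut read_len: RL, mut read_flag: RF)
    where
        RL: FnMut() -> usize, // additional entries that may have the current bit set
        RF: FnMut() -> bool,  // whether a particular entry actually has it set
    {
        let mut region = 0usize; // entries 0..region may have the current bit set
        let mut bit = top_bit;   // e.g. 0x100000 for the first plane
        while bit != 0 {
            region = (region + read_len()).min(coeffs.len()); // the region only grows
            for coef in coeffs[..region].iter_mut() {
                if read_flag() {
                    *coef |= bit;
                }
            }
            bit >>= 1;
        }
    }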

Speaking of the coder itself, it is a context-adaptive binary range coder, but not exactly the CABAC you see in ITU H.26x codecs. It has some changes, especially in the model, which is actually a combination of several smaller models in the same state space; at the beginning of each sub-model you have to flip the MPS value and maybe transition to some other sub-model. I.e. a single model is a collection of fixed probabilities of one/zero appearing, and depending on what bit we decoded we move to another probability that suits it better (more zeroes to expect or more ones to expect). In H.26x there’s a single such model; in ALF2CD there are several of them, so when you hit the edge state aka “expect all ones or all zeroes” you don’t simply remain in that state but may transition to another sub-model with different probabilities for the expected ones and zeroes. A nice trick I’d say.
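
The general shape of such a model might look like this (a sketch of the idea only: the table contents are invented and the MPS flipping is not shown, just the part where transition indices can jump into a different sub-model instead of saturating):

    // Finite-state probability model: each state holds a fixed probability and
    // the states to move to after decoding a zero or a one. Since the "next"
    // fields are just indices into one big table, an edge state of one
    // sub-model can transition into another sub-model rather than staying put.
    #[derive(Clone, Copy)]
    struct ModelState {
        prob_one:  u16, // fixed probability of decoding a one, in 1/65536 units
        next_zero: u16, // state to use after decoding a zero
        next_one:  u16, // state to use after decoding a one
    }

    struct Model {
        states: Vec<ModelState>, // all sub-models concatenated into one table
        cur:    usize,
    }

    impl Model {
        // probability fed to the binary range decoder for the next bit
        fn prob(&self) -> u16 {
            self.states[self.cur].prob_one
        }
        // adapt by walking to the next state once the bit is known
        fn update(&mut self, bit: bool) {
            let st = self.states[self.cur];
            self.cur = if bit { st.next_one as usize } else { st.next_zero as usize };
        }
    }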

The coder also maintains around 30 bit states: state 0 is for coding non-zero flags, state 1 is for coding the value sign, states 2-25 are for coding the value exponent and state 26 is for coding the value mantissa (or states 2-17 for the exponent and state 18 for the mantissa bits when coding the lengths of transform coefficient regions).

Reconstruction. This is done by performing the inverse integer transform (which looks like an FFT approximation but I’ve not looked at it that closely), replacing the first and last 16 samples with the previously decoded ones (probably to deal with the effects of windowing or imperfect reconstruction), and finally undoing mid/side coding for both sub-frames.
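
For the last step, I assume the usual mid/side reconstruction formula; a tiny sketch (unverified against the actual ALF2CD code):

    // Undo mid/side coding for a pair of sub-frames -- the standard
    // l = m + s, r = m - s variant, which is an assumption here.
    fn undo_mid_side(mid: &mut [i32], side: &mut [i32]) {
        for (m, s) in mid.iter_mut().zip(side.iter_mut()) {
            let left  = *m + *s;
            let right = *m - *s;
            *m = left;  // mid sub-frame becomes the left channel
            *s = right; // side sub-frame becomes the right channel
        }
    }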

Overall it’s an interesting codec since you don’t often see arithmetic coding employed in lossy audio codecs unless they’re very recent ones, or BSAC. And even then I can’t remember any audio codec using a binary arithmetic coder instead of multi-symbol models. Who knows, maybe this approach will be used once again as something new. Most of those new ideas in various codecs have been implemented before, after all (e.g. spatial prediction in H.264 is just a simplified version of the spatial prediction in WMV2 X8-frames, and quadtrees were used quite often in the 90s before reappearing in H.265; the same way Opus is not so modern if you know about ITU G.722.1 and have heard that WMA Voice could have WMA Pro-coded frames in its stream).

ClearVideo briefly revisited

December 31st, 2020

Since I had nothing better to do for the rest of this year (I expect the next year to begin in the same fashion) I decided to take a look at the problem where some files were decoded with inter-frames becoming distorted as if some sharpening filter was constantly applied. And what do you know, there is some smoothing involved in certain cases.

A quick look at Rududu

December 27th, 2020

Since I had nothing better to do I decided to look at the Rududu codec. It is one of the old, more exotic codecs that nobody remembers.

I did not want to look that deep into its details (hence it’s just a quick look) so here are the principles it seems to employ:

  • it seems to employ some integer approximation of a wavelet transform (instead of e.g. the LeGall 5/3 transform employed by lossless JPEG-2000);
  • it probably has intra- and inter-frames but it does not employ motion compensation, just coefficient updating;
  • DWT coefficients are quantised (and a common bias is removed) with the scale and bias calculated for the whole frame;
  • coefficients are coded using a quadtree (i.e. some parts of a band can be left uncoded in addition to skipping whole DWT subbands; see the sketch after this list);
  • and finally, data is coded using adaptive models for the absolute values and bit models for the signs and “region coded” flags, with the probabilities from these models fed to the range coder.
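
The quadtree part could work roughly like this (a minimal sketch of the general idea only: the flag coding, the minimum region size and what happens at the leaves are my placeholders, not the actual Rududu scheme):

    // Quadtree band coding sketch: a flag tells whether a region contains any
    // coefficients at all; coded regions above a minimum size are split into
    // four quadrants, so large empty parts of a band (or a whole subband)
    // can be skipped with a single flag.
    fn decode_quadtree<F>(read_flag: &mut F, x: usize, y: usize, w: usize, h: usize,
                          coded: &mut [bool], stride: usize)
    where
        F: FnMut() -> bool,
    {
        if !read_flag() {
            return; // the whole region is left uncoded
        }
        if w > 4 && h > 4 {
            // split into four quadrants and recurse
            let (hw, hh) = (w / 2, h / 2);
            decode_quadtree(&mut *read_flag, x,      y,      hw,     hh,     coded, stride);
            decode_quadtree(&mut *read_flag, x + hw, y,      w - hw, hh,     coded, stride);
            decode_quadtree(&mut *read_flag, x,      y + hh, hw,     h - hh, coded, stride);
            decode_quadtree(&mut *read_flag, x + hw, y + hh, w - hw, h - hh, coded, stride);
        } else {
            // leaf: mark the area as coded (actual coefficients would be read here)
            for row in y..y + h {
                for col in x..x + w {
                    coded[row * stride + col] = true;
                }
            }
        }
    }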

So while this codec is nothing outstanding it’s still a nice change from the mainstream video coding approach defined by ITU H.26x codecs.

Vivo2 revisited

December 22nd, 2020

Since I have nothing better to do (after a quick glance at the H.264 decoder—yup, nothing) I decided to look at Vivo 2 again to see if I can improve it from the “decoding and somewhat recognizable” stage to the “mostly okay” one.

To make a long story short, Vivo 2 turned out to be an unholy mix of H.263 and MPEG-4 ASP. On one hoof you have the H.263 codec structure, H.263 codebooks and even the unique H.263 feature called PB-frames. On the other hoof you have coefficient quantisation like in MPEG-4 ASP and coefficient prediction done on unquantised coefficients (H.263 performs DC/AC prediction on already dequantised coefficients while MPEG-4 ASP re-quantises them for the prediction).

And the main weirdness is the IDCT. While the older standards give just the ideal transform formula, multiplying by a matrix is slow and thus most implementations use some (usually fixed-point integer) approximation that also exploits internal symmetry for faster calculation (and hence one of the main problems with various H.263 and DivX-based codecs: if you don’t use exactly the same transform implementation as the reference you’ll get artefacts because those small differences will accumulate). Actually ITU H.263 Annex W specifies a bit-exact transform but nobody cares by this point. And Vivo Video has a different approach altogether: it generates a set of matrices for each coefficient and thus, instead of performing the IDCT directly, it simply sums one or two matrices for each non-zero coefficient (one matrix for the coefficient value modulo 32, another one for the part of the value that is a multiple of 32). Of course it accounts for that being too coarse by multiplying the matrices by 64 before converting them to integers (and so the resulting block should be scaled down by 64 as well).
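
In sketch form the summing step might look like this (an illustration of my reading of it, not the actual Vivo code: the table layout, the clamping and the sign handling are assumptions, and generating the tables from the DCT basis functions is omitted):

    const MAT_VARIANTS: usize = 32;

    // IDCT by summing precomputed matrices: for every non-zero coefficient add
    // one or two 8x8 matrices (generated pre-multiplied by 64) and scale the
    // accumulated block back down at the end.
    fn idct_by_matrix_sums(
        coeffs:   &[i32; 64],
        low_tab:  &[[[i32; 64]; MAT_VARIANTS]; 64], // per position, per (value mod 32)
        high_tab: &[[[i32; 64]; MAT_VARIANTS]; 64], // per position, per (value / 32)
    ) -> [i32; 64] {
        let mut acc = [0i32; 64];
        for (pos, &coef) in coeffs.iter().enumerate() {
            if coef == 0 { continue; }
            let val  = coef.unsigned_abs() as usize;     // sign handling glossed over here
            let low  = val & 31;                         // value modulo 32
            let high = (val >> 5).min(MAT_VARIANTS - 1); // multiple-of-32 part, clamped for the sketch
            if low != 0 {
                for (dst, &src) in acc.iter_mut().zip(low_tab[pos][low].iter()) {
                    *dst += src;
                }
            }
            if high != 0 {
                for (dst, &src) in acc.iter_mut().zip(high_tab[pos][high].iter()) {
                    *dst += src;
                }
            }
        }
        // undo the x64 scaling applied when the tables were generated
        for v in acc.iter_mut() {
            *v = (*v + 32) >> 6;
        }
        acc
    }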

In either case it seems to work well enough, so I’ve finally enabled nihav-vivo in the list of default crates and can finally forget about it like the rest of the world did.

NihAV: frame reordering

December 18th, 2020

Since I have nothing better to do I’d like to talk about how NihAV handles output frames.

As you might remember I decided to make decoders output frames on a synchronous basis, i.e. if a frame comes to the decoder it should be decoded and output, and in case the codec supports B-frames the reordering might happen later in a special frame reorderer. And the reorderer for a concrete decoder is selected based on the codec capabilities (if the format has no frame reordering then don’t do it).

Previously I had just two of them, NoReorderer (it should be obvious for which cases it is intended) and IPBReorderer for codecs with I/P/B-frames. The latter simply holds the last seen reference frame (I- or P-frame) and outputs B-frames until the next reference frame comes. This worked as expected until I decided to implement an H.264 decoder and hit the famous B-pyramid (i.e. when B-frames serve as a reference for other B-frames or even P-frames). To illustrate that, imagine an input sequence of frames I0 P4 B2 B1 B3 which should be output as I0 B1 B2 B3 P4. The approach from IPBReorderer would output it as I0 B2 B1 B3 P4 which is not quite correct. So I had to add a so-called ComplexReorderer which keeps an array of frames sorted by display timestamp and marks the frames up to a reference I- or P-frame as available for output when the next reference frame comes. Here’s a step-by-step example:

  • I0 comes and is stored in the queue;
  • P4 comes and is stored in the queue, I0 is marked as being ready for output;
  • B2 comes and is stored in the queue right before P4;
  • B1 comes and is stored in the queue right before B2 so the queue now is B1 B2 P4;
  • B3 comes and is stored in the queue between B2 and P4;
  • then the next reference frame comes, we store it and mark B1 B2 B3 P4 as ready for output.

Of course one can argue that this waits for longer than needed and we should be able to output B1 and B2 even before B3 arrives (or even better, we could output B1 immediately as it appears). That is true but it is rather hard to do in the general case. Real-world DTS values depend on the container timebase, so how do you know there are no additional frames in the sequence 0 1000 333 667 (plus the decoder can be told to stop outputting unreferenced frames)? Relying on frame IDs generated by the decoder? H.264 has three different modes of generating picture IDs, with one of them assigning even numbers to frames (and odd numbers to the second frame field if those are present). While it can be resolved, that would complicate the code for no good reason. So as usual I picked the simplest working solution, trading theoretically lower latency for clarity and simplicity.
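
For illustration, here is a minimal sketch of the idea behind ComplexReorderer (simplified and not the actual NihAV code: Frame is just a stand-in with a display timestamp and a reference flag, and frames displayed before a newly arrived reference frame are considered safe to output):

    // Frames are kept sorted by display timestamp; when a new reference frame
    // arrives, everything that should be displayed before it becomes ready.
    struct Frame {
        pts:    u64,  // display timestamp
        is_ref: bool, // true for I-/P-frames (or reference B-frames in a pyramid)
    }

    #[derive(Default)]
    struct Reorderer {
        queue: Vec<Frame>, // sorted by pts
        ready: usize,      // how many frames from the head can be output
    }

    impl Reorderer {
        fn add_frame(&mut self, frm: Frame) {
            if frm.is_ref {
                // release everything displayed before the new reference frame
                let released = self.queue.iter().take_while(|f| f.pts < frm.pts).count();
                self.ready = self.ready.max(released);
            }
            let pos = self.queue.iter().position(|f| f.pts > frm.pts)
                          .unwrap_or(self.queue.len());
            self.queue.insert(pos, frm);
        }
        fn get_frame(&mut self) -> Option<Frame> {
            if self.ready > 0 {
                self.ready -= 1;
                Some(self.queue.remove(0))
            } else {
                None
            }
        }
    }

Feeding it the sequence from the example above yields I0 once P4 arrives and then B1 B2 B3 P4 once the next reference frame after them arrives, just like in the step-by-step list.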

NihAV: optimisation potential

December 13th, 2020

Today I can say what I’ve wasted about two months on: it was an H.264 decoder. For now it’s the only entry in the nihav-itu crate but I might add G.7xx decoders there, or even the standard H.263 decoder in addition to all those decoders based on it.

Performance-wise it is not very good, about 2.5-3 times slower than the libavcodec one without SIMD optimisations on random BaidUTube 720p videos, but I’ve not tried to make it the fastest one and prefer clarity over micro-optimisations. Still, it has a lot of optimisation potential as the title says. I suspect that even simply making the motion interpolation functions work on constant-size blocks would make it significantly faster, let alone adding SIMD. In either case it is fast enough to decode 720p in 2x realtime on my laptop, so if I ever finish a proper video player I can use it to watch content besides game cutscenes and a few exotic files.

As for the features, it’s limited but it should be able to play conventional files just fine plus a limited subset of High profile (just 8-bit 4:2:0 YUV without custom scaling lists). A lot of features that I don’t care about were ignored (proper loop filtering across slice edges—nope, weighted prediction—maybe later, high bit depth or different chroma subsampling support—quite unlikely, interlaced formats—no in principle).

While developing that decoder I also gained a better knowledge of H.264 internals, for which I’m not that grateful, but that’s to be expected from a codec designed by a committee with features being added to it afterwards.

In either case hopefully I’ll not be that bored to do optimisations unless I have to, so the potential will remain the potential and I’ll do some more interesting stuff instead. And there’s always Settlers II as the ultimate time consumer 😉

Hamburger as the symbol of modern IT terminology

November 25th, 2020

As anybody knows, this American dish of non-American origin is named after the Hamburger Frikadelle, which means (minced meat) patty from Hamburg. And because Americans are known for their deep knowledge of other languages, somebody decided that the first syllable is a separate word, and so words like cheeseburger and simply burger were born (you can call it the American wasei-eigo if you like). Anyway, the same process of maiming words and giving them new meanings happens in IT as well, irritating those few who still remember the original word and its meaning.