A quick look at Rududu

December 27th, 2020

Since I had nothing better to do I decided to look at the Rududu codec. It is one of those old, more exotic codecs that nobody remembers.

I did not want to look that deep into its details (hence it’s just a quick look) so here are the principles it seems to employ:

  • it seems to employ some integer approximation of a wavelet transform (instead of e.g. the LeGall 5/3 transform employed by lossless JPEG-2000);
  • it probably has intra- and interframes but it does not employ motion compensation, just coefficient updating;
  • DWT coefficients are quantised (and a common bias is removed) with the scale and bias calculated for the whole frame (see the sketch after this list);
  • coefficients are coded using a quadtree (i.e. some parts of the bands can be left uncoded in addition to skipping whole DWT subbands);
  • and finally, the data is coded using adaptive models for the absolute values and bits for both signs and “region coded” flags, with the probabilities from these models being fed to the range coder.
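
For the quantisation step mentioned above, here’s a minimal sketch of how I imagine it could look; this is only my reading of the scheme and not actual Rududu code, and how the scale and bias are derived for a frame is left out entirely:

```rust
// Sketch: quantise DWT coefficients with a single scale and bias shared by
// the whole frame. Not actual Rududu code; scale is assumed to be positive.
fn quantise_frame(coeffs: &[i32], scale: i32, bias: i32) -> Vec<i32> {
    coeffs
        .iter()
        .map(|&c| {
            let mag = c.abs() - bias; // remove the common bias
            if mag <= 0 {
                0
            } else {
                c.signum() * (mag / scale) // quantise what remains
            }
        })
        .collect()
}
```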

So while this codec is nothing outstanding it’s still a nice change from the mainstream video coding approach defined by ITU H.26x codecs.

Vivo2 revisited

December 22nd, 2020

Since I have nothing better to do (after a quick glance at the H.264 decoder—yup, nothing) I decided to look at Vivo 2 again to see if I can improve it from the “decoding and somewhat recognizable” stage to the “mostly okay” one.

To cut a long story short, Vivo 2 turned out to be an unholy mix of H.263 and MPEG-4 ASP. On one hoof you have the H.263 codec structure, H.263 codebooks and even the unique H.263 feature called PB-frames. On the other hoof you have coefficient quantisation like in MPEG-4 ASP and coefficient prediction done on unquantised coefficients (H.263 performs DC/AC prediction on already dequantised coefficients while MPEG-4 ASP re-quantises them for the prediction).

And the main weirdness is IDCT. While the older standards give just the ideal transform formula, multiplying by a matrix is slow and thus most implementations use some (usually fixed-point integer) approximation that also exploits internal symmetry for faster calculation (and hence one of the main problems with various H.263 and DivX-based codecs: if you don’t use exactly the same transform implementation as the reference you’ll get artefacts because those small differences accumulate). Actually ITU H.263 Annex W specifies a bit-exact transform but nobody cares by this point. And Vivo Video has a different approach altogether: it generates a set of matrices for each coefficient and thus, instead of performing the IDCT directly, it simply sums one or two matrices for each non-zero coefficient (one matrix for the coefficient value modulo 32, another one for the part that is a multiple of 32). Of course it accounts for this being too coarse by multiplying the matrices by 64 before converting them to integers (so the resulting block should be scaled down by 64 as well).
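
To make the idea more concrete, here’s a rough sketch of such a table-driven IDCT; the table layout and the split into a low part (value modulo 32) and a high part (multiples of 32) are my guesses at the general shape, not the actual Vivo tables:

```rust
// Sketch only: IDCT done by summing precomputed basis matrices per
// non-zero coefficient, with the tables pre-multiplied by 64. The table
// layout here is a guess at the general shape, not the real Vivo data.
fn idct_by_tables(
    coeffs: &[i32; 64],
    low_tables: &[[i32; 64]],  // indexed by pos * 32 + (|value| & 31)
    high_tables: &[[i32; 64]], // indexed by pos * 64 + (|value| >> 5)
) -> [i32; 64] {
    let mut out = [0i32; 64];
    for (pos, &coef) in coeffs.iter().enumerate() {
        if coef == 0 {
            continue;
        }
        let sign = coef.signum();
        let aval = coef.abs() as usize;
        let (low, high) = (aval & 31, aval >> 5);
        // one or two matrix additions per non-zero coefficient
        if low != 0 {
            let mat = &low_tables[pos * 32 + low];
            for (o, &m) in out.iter_mut().zip(mat.iter()) {
                *o += sign * m;
            }
        }
        if high != 0 {
            let mat = &high_tables[pos * 64 + high];
            for (o, &m) in out.iter_mut().zip(mat.iter()) {
                *o += sign * m;
            }
        }
    }
    // the tables were pre-multiplied by 64, so scale the result back down
    for v in out.iter_mut() {
        *v >>= 6;
    }
    out
}
```

A nice side effect of this scheme is that sparse blocks with only a couple of non-zero coefficients are transformed with just a couple of matrix additions.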

In either case it seems to work well enough, so I’ve finally enabled nihav-vivo in the list of default crates and can finally forget about it like the rest of the world did.

NihAV: frame reordering

December 18th, 2020

Since I have nothing better to do I’d like to talk about how NihAV handles output frames.

As you might remember, I decided to make decoders output frames on a synchronous basis, i.e. if a frame comes to the decoder it should be decoded and output, and in case the codec supports B-frames the reordering may happen later in a special frame reorderer. The reorderer for a concrete decoder is selected based on the codec capabilities (if the format has no frame reordering then don’t do it).

Previously I had just two of them, NoReorderer (it should be obvious for which cases it is intended) and IPBReorderer for codecs with I/P/B-frames. The latter simply holds the last seen reference frame (an I- or P-frame) and outputs B-frames until the next reference frame comes. This worked as expected until I decided to implement the H.264 decoder and hit the famous B-pyramid (i.e. when B-frames serve as a reference for other B-frames or even P-frames). To illustrate that, imagine an input sequence of frames I0 P4 B2 B1 B3 which should be output as I0 B1 B2 B3 P4. The approach from IPBReorderer would output it as I0 B2 B1 B3 P4 which is not quite correct. So I had to add the so-called ComplexReorderer which keeps an array of frames sorted by display timestamp and marks the frames up to a reference I- or P-frame as available for output when the next reference frame comes. Here’s a step-by-step example (with a rough sketch of such a reorderer after it):

  • I0 comes and is stored in the queue;
  • P4 comes and is stored in the queue, I0 is marked as being ready for output;
  • B2 comes and is stored in the queue right before P4;
  • B1 comes and is stored in the queue right before B2 so the queue now is B1 B2 P4;
  • B3 comes and is stored in the queue between B2 and P4;
  • then the next reference frame comes; we store it and mark B1 B2 B3 P4 as ready for output.
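
And here’s a rough sketch of how such a reorderer can be expressed in Rust. The Frame type and its fields are simplified stand-ins and not the actual NihAV structures, so treat it as an illustration of the idea rather than the real code:

```rust
// Sketch: keep frames sorted by display timestamp and release everything
// queued before a reference frame once the next reference frame arrives.
// `Frame` here is a placeholder, not the real NihAV frame type.
struct Frame {
    pts: u64,     // display timestamp
    is_ref: bool, // I- or P-frame (or anything else used as a reference)
}

#[derive(Default)]
struct ComplexReorderer {
    queue: Vec<Frame>, // sorted by pts
    ready: usize,      // number of frames at the head ready for output
}

impl ComplexReorderer {
    fn add_frame(&mut self, frame: Frame) {
        if frame.is_ref {
            // everything queued before this reference frame can be output now
            self.ready = self.queue.len();
        }
        let pos = self.queue.partition_point(|f| f.pts <= frame.pts);
        self.queue.insert(pos, frame);
    }

    fn get_frame(&mut self) -> Option<Frame> {
        if self.ready > 0 {
            self.ready -= 1;
            Some(self.queue.remove(0))
        } else {
            None
        }
    }
}
```

Feeding it the sequence above gives exactly the behaviour from the step-by-step example: I0 becomes available once P4 arrives, and B1 B2 B3 P4 become available once the following reference frame shows up.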

Of course one can argue that this waits longer than needed and we should be able to output B1 and B2 even before B3 arrives (or even better, we could output B1 immediately as it appears). That is true but it is rather hard to do in the general case. Real-world DTS values depend on the container timebase, so how do you know there are no additional frames in the sequence 0 1000 333 667 (plus the decoder can be told to stop outputting unreferenced frames)? Relying on frame IDs generated by the decoder? H.264 has three different modes of generating picture IDs, with one of them assigning even numbers to frames (and odd numbers to the second frame field if those are present). While it can be resolved, that would complicate the code for no good reason. So as usual I picked the simplest working solution, trading theoretically lower latency for clarity and simplicity.

NihAV: optimisation potential

December 13th, 2020

Today I can say what I’ve wasted about two months on: it was an H.264 decoder. For now it’s the only entry in the nihav-itu crate but I might add G.7xx decoders there or even the standard H.263 decoder in addition to all those decoders based on it.

Performance-wise it is not very good, about 2.5-3 times slower than the libavcodec one without SIMD optimisations on random BaidUTube 720p videos, but I have not tried to make it the fastest one and prefer clarity over micro-optimisations. Still, it has a lot of optimisation potential, as the title says. I suspect that even simply making the motion interpolation functions work on constant-size blocks would make it significantly faster, let alone adding SIMD. In either case it is fast enough to decode 720p in 2x realtime on my laptop, so if I ever finish a proper video player I can use it to watch content besides game cutscenes and a few exotic files.
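
To give an idea of what I mean by constant-size blocks, here’s a sketch of the approach with made-up function names (and a simple halfpel average instead of the 6-tap filter H.264 actually uses): with const generics the compiler knows the block size at compile time and can unroll and vectorise the inner loop, while the runtime-sized variant has to stay generic.

```rust
// Sketch: the same horizontal halfpel interpolation written over const
// generic block dimensions (so mc_hpel_hor::<16, 16>() and ::<8, 8>() get
// separately optimised loops) versus a runtime-sized variant.
fn mc_hpel_hor<const W: usize, const H: usize>(
    dst: &mut [u8], dstride: usize,
    src: &[u8], sstride: usize,
) {
    for y in 0..H {
        let s = &src[y * sstride..];
        let d = &mut dst[y * dstride..][..W];
        for x in 0..W {
            // average of the current and the next pixel, rounded up
            d[x] = ((u16::from(s[x]) + u16::from(s[x + 1]) + 1) >> 1) as u8;
        }
    }
}

fn mc_hpel_hor_dyn(
    dst: &mut [u8], dstride: usize,
    src: &[u8], sstride: usize,
    w: usize, h: usize,
) {
    for y in 0..h {
        for x in 0..w {
            let a = u16::from(src[y * sstride + x]);
            let b = u16::from(src[y * sstride + x + 1]);
            dst[y * dstride + x] = ((a + b + 1) >> 1) as u8;
        }
    }
}
```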

As for the features, it’s limited but it should be able to play conventional files just fine plus some limited subset of High profile (just 8-bit 4:2:0 YUV without custom scaling lists). A lot of features that I don’t care about were ignored (proper loop filtering across slice edges—nope, weighted prediction—maybe later, high-bitdepth or different chroma subsampling support—quite unlikely, interlaced formats—no, as a matter of principle).

While developing that decoder I also got a better knowledge of H.264 internals, for which I’m not that grateful, but that’s to be expected from a codec designed by a committee with features being added to it afterwards.

In either case hopefully I won’t be bored enough to do those optimisations unless I have to, so the potential will remain just that and I’ll do some more interesting stuff instead. And there’s always Settlers II as the ultimate time consumer 😉

Hamburger as the symbol of modern IT terminology

November 25th, 2020

As anybody knows, this American dish of non-American origin is named after Hamburger Frikadelle which means (minced meat) patty from Hamburg. And because Americans are known for their deep knowledge of other languages somebody decided that the first syllable is a separate word, so words like cheeseburger and simply burger were born (you can call it the American wasei-eigo if you like). Anyway, the same process of maiming words and giving them new meanings happens in IT as well, irritating those few who still remember the original word and its meaning.

An upcoming image format war?

November 19th, 2020

So this week libwebp2 appeared in a public repository. From a quick glance it looks like the lossy format is based on AV1 coding blocks and the lossless format is largely the same as the original WebP lossless, but both now use ANS coding. And (of course) there’s a hint at experimental lossy encoding using neural networks.

Let’s pretend that JPEG has finally died (again) and GIF and PNG are both gone. So what modern image formats intended for a general audience are out there?

Of course there’s Nokia HEIF which is picture(s) split into tiles, coded with H.EVC and stored in MP4. Because of the wonderful patent situation around it, it probably won’t be used outside the iEcosystem.

AVIF—same container, different codec (AV1 in this case).

WebP/WebP2—Baidu image format with lossy compression based on Baidu VPx codec (VP8 or VP10) and lossless compression from the French division of Baidu.

JPEG XL—a joint project between Cloudinary and the Swiss division of Baidu responsible for Baidu Chrömli (in case you did not know, that’s a Swiss word for various small sweet things like Guetsli, Brunsli and such; Brötli/Gipfeli/Zöpfli/Grittibänzli are related to bread though, especially Brötli). Anyway, that’s a different format with a different set of features that include lossless JPEG recompression (and hopefully the best practical lossless image compression, as one would expect from the creators of FLIF).

So my point is that if you have to choose between all those formats, essentially you have to pick some format from Baidu (either directly from it or one using its codec). Somehow this future does not excite me much so I’d rather stick to the old formats for which a single programmer can write a standalone decoder in reasonable time.

Also for some reason this reminds me of the Soviet space program where there were three main construction centres (led by Korolyov, Chelomey and Yangel) producing different missiles and spaceships, many of which are still in use. But the competition was also hurtful for the general progress. As you remember, there were three heavy rockets proposed and none of them was really successful: Korolyov’s N1 had failures because of the engines, Yangel’s R-56 was cancelled early in favour of the N1, Chelomey’s UR-700 was never realised either, and Glushko’s Energia had two launches (both successful) but it came too late and there was no payload for it beside the equally successful Buran program. So on one hand you have variety and on the other hand a lot of wasted resources and efforts.

I see parallels with AV1 here as well. Why would the company controlling libaom develop libgav1 too?

And while speaking about AV1 I should mention that it reminds me of another kind of project, namely the Olympic games.

Originally the Olympics were a competition between people from various city-states, held for both religious and entertainment reasons. Later they were resurrected as a means to promote sports and unity, but just a couple of decades later the games became more of a political instrument promoting national teams instead of being just a competition of individuals from various places (partly because all the training became too costly for a non-professional sportsman, partly because countries want the prestige). And a bit later it became a business project, as the 2004 Summer games in Athens demonstrated best.

So you have a committee that holds the rights to the symbols, logos, mascots and everything else. The receiving party has to build large infrastructure to host various competitions and hope that the guests will bring enough money to compensate at least some of those costs (and maybe those buildings will be useful later, but quite often they are not). Various companies pay a lot of money to become sponsors in the hope that such status will work as effective advertisement, broadcasting companies pay a lot of money for broadcasting rights in the hope of getting more viewers (and money from ads). So before the games a lot of parties pay a lot of money and afterwards they might make a profit or not. And the host country is left with huge expenses for constructing stadiums and such—and with those rather useless constructions that are too big for regular events or training. And of course the prestige. Where the money goes and which Olympics were profitable for the host country is left as an exercise for the reader.

In a similar way AV1 feels like such a project: it drew resources from different companies and people from different opensource multimedia projects to build something huge that is not really useful (I know that in theory it should trade bandwidth for CPU heat, but how many customers will be AV1-ready before AV2 is released and the cycle repeats?). The people involved in libaom, svt-av1, dav1d and rav1e would be better off doing something else: better multimedia frameworks (I work on NihAV mostly because the alternatives are even worse), new codecs, or even a decent video editor, so that people making videos for BaidUTube won’t have to rely on expensive proprietary solutions that tend to crash anyway or on suspicious Chinese or Russian programs that rip off opensource libraries (I’ve seen one using mencoder compiled as a .dll).

Anyway, just as the Olympics were intended to promote sport and healthy living but became business projects that are a financial loss for most parties, AV1, while being positioned as the saviour of opensource multimedia, looks like a project that essentially benefits just a small group of organisations. And as with many other things I say, I’d be happy to be proven wrong.

P.S. In case you say that I’m inconsistent and dislike both competing groups inside one company and uniting efforts (for the sake of the same company): well, I’d prefer different entities (companies or opensource projects or whatever) to produce a single solution each while there’s more than one entity doing it. To return to the space analogies, I’d rather see many private companies each developing their own line of spaceships (for various purposes too) instead of ULA producing several kinds of radically different spaceships without any outside competition.

H.264 specification sucks

November 14th, 2020

So it has come to a stage where I have nothing better to do, so I tried to write an H.264 decoder for NihAV (so I can test the future nihav-player with content besides just sample files and cutscenes from various games). And while I’ve managed to decode at least something (more about that at the end), the specification for H.264 sucks. Don’t get me wrong, the format itself is not that badly designed, but the way it’s documented is far from good (though it’s still serviceable—it’s not an audio codec after all).

And first, to those who want to cry “but it’s GNU/Linux, err, MPEG/AVC”: ITU H.264 was standardised in May 2003 while MPEG-4 Part 10 came in December 2003. Second, I can download the ITU specification freely (various editions too) while the MPEG standard still costs money I’m not going to pay.

I guess the main problems of H.264 come from two things: its dual coding nature (i.e. slice data can be coded using variable-length codes or a binary arithmetic coder) and the extensions (not as bad as H.263 but approaching it; here’s a simple fact to demonstrate it—the 2003 edition had 282 pages, the 2019 edition has 836 pages). Plus the fact that it codified the wrong name for Elias gamma codes, which I ranted on before.

Let’s start with the extensions part since most of them can be ignored and I don’t have much to say about them except for one thing—profiles. By itself the idea is good: you have a certain set of constraints and features associated with an ID so you know in advance whether you should be able to handle the stream or not. And the initial 2003 edition had three profiles (baseline/main/extended) with the IDs associated with them (66, 77 and 88 respectively). By 2019 there were a dozen various profiles and even more profile IDs, and they’re not actually mapped one to one (e.g. the constrained baseline profile is the baseline profile with an additional constraint_set1_flag set to one). As a result you have lots of random profile IDs (can you guess what profile_idc 44 means? and 86? or 128?) and they did not bother to make a table listing all known profile IDs, so you need to search the whole specification in order to find out what they mean. I’d not care much but they affect bitstream parsing, especially the sequence parameter set where they decided to insert some additional fields in the middle for certain high profiles.
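
To illustrate that last point, here’s roughly how the profile-dependent part of sequence parameter set parsing looks. This is a sketch from memory of the specification, so the list of “high” profile IDs below may well be incomplete, and the bit reader is a bare-bones stand-in rather than the NihAV one:

```rust
// Minimal MSB-first bit reader, just enough for the sketch below.
struct BitReader<'a> { data: &'a [u8], pos: usize }

impl<'a> BitReader<'a> {
    fn read_bit(&mut self) -> u32 {
        let bit = (self.data[self.pos >> 3] >> (7 - (self.pos & 7))) & 1;
        self.pos += 1;
        u32::from(bit)
    }
    fn read(&mut self, nbits: u32) -> u32 {
        (0..nbits).fold(0, |acc, _| (acc << 1) | self.read_bit())
    }
    fn read_ue(&mut self) -> u32 { // unsigned Exp-Golomb code
        let mut zeroes = 0;
        while self.read_bit() == 0 { zeroes += 1; }
        (0..zeroes).fold(1, |acc, _| (acc << 1) | self.read_bit()) - 1
    }
}

// Sketch of the start of SPS parsing with the fields inserted in the middle
// for the high profiles; the set of profile IDs checked here is incomplete.
fn parse_sps_start(br: &mut BitReader) -> (u32, u32) {
    let profile_idc = br.read(8);
    let _constraint_flags_and_reserved = br.read(8);
    let _level_idc = br.read(8);
    let _sps_id = br.read_ue();
    let mut chroma_format_idc = 1; // inferred to be 1 (4:2:0) when not coded
    if matches!(profile_idc, 100 | 110 | 122 | 244 | 44 | 83 | 86 | 118 | 128) {
        chroma_format_idc = br.read_ue();
        if chroma_format_idc == 3 {
            let _separate_colour_plane_flag = br.read_bit();
        }
        let _bit_depth_luma_minus8 = br.read_ue();
        let _bit_depth_chroma_minus8 = br.read_ue();
        let _qpprime_y_zero_transform_bypass_flag = br.read_bit();
        if br.read_bit() == 1 {
            // seq_scaling_matrix_present_flag: scaling lists follow (omitted)
        }
    }
    let _log2_max_frame_num_minus4 = br.read_ue();
    // ...and the rest of the SPS continues as in the baseline/main case
    (profile_idc, chroma_format_idc)
}
```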

Now the more exciting part: coding. While I understand the rationale (you get either a simpler and faster or a slower but more effective (de)coding mode while using the same ways to transform data), that decision created some problems for describing it. Because of it you have to look at three different places in order to understand what and how to decode: the syntax tables in 7.3 which show in which order and under which conditions elements are coded, the semantics in 7.4 telling you what each element actually means and what limitations or values it has, and 9.2 or 9.3 for explanations of how a certain element should actually be decoded from the bitstream. And confusingly enough the coded block pattern is put into 9.1.2 while it would be more logical to join it with 9.2, as 9.1 is for parsing generic codes used not just in slice data but in various headers as well, and 9.2 deals with parsing custom codes for non-CABAC slice data.

And it gets even worse for CABAC parsing. For those who don’t know what it is, that abbreviation means context-adaptive binary arithmetic coding. In other words it represents various values as sequences of bits and codes each bit using its own context. And if you ask yourself how the values are represented and which contexts are used for each bit then you are pointing right at the problem. In the standard you have it all spread over three or four places: one table to tell you which range of contexts to use for a certain element, some description or a separate table for the possible bit strings, another table or two to tell you which contexts should be used for each bit in various cases (e.g. for ctxIdxOffset=36 you have these context offsets for the following bits: 0, 1, (2 or 3), 3, 3, 3), and finally an entry that tells you how to select a context for the first bit if it depends on already decoded data (usually by checking whether the top and left (macro)blocks have the same thing coded or not). Of course it’s especially fun when different bit contexts are reused for different bit positions or the same bit positions can have different contexts depending on the previously decoded bit string (this happens mostly for macroblock types in P/SP/B-slices but it’s still confusing). My guess is that they tried to optimise the total number of contexts and thus merged the least used ones. As a result you get about 20 pages of context data initialisation in the 2019 edition (in the initial editions of both H.264 and H.EVC it’s just eight pages)—compare that to the almost hundred pages of default CDFs in the AV1 specification. And the CABAC part in H.265 is somehow much easier to comprehend (probably because they made the format less dependent on special bit strings and put some of the simpler conditions straight into the binarisation table).

To me it seems that the people describing CABAC coding (not the coder itself but rather how it’s used to code data) did not understand it well themselves (or at least could not convey the meaning clearly). And despite the principle of documenting a format from the decoder’s point of view (i.e. what bits it should read and how to act on them in order to decode the bitstream), a lot of CABAC coding is documented from the encoder’s point of view (i.e. what bits you should write for a syntax element instead of what reading certain bits would produce). An egregious example of that is the so-called UEGk binarisation. In addition to the things mentioned above it also has a rather meaningless parameter name, uCoff (which normally would be called something like escape value). Here’s how I would describe decoding it: read a truncated unary sequence of up to uCoff ones; if the read value equals uCoff then read an additional escape value as an exp-Golomb-style code of order k (a unary-coded shift plus a trailing k-bit value), otherwise the escape value is zero. Add the escape value to the initial one and if the result is non-zero and should be signed, read the sign. Section 9.3.2.3 spends a whole page on it, with a third of it being C code for writing the value.
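
For comparison, here’s how the same thing looks as decoder-side code. This is only a sketch of my description above, not the spec’s pseudo-code: next_bin() stands in for “decode the next CABAC bin” (context-coded for the prefix and bypass-coded for the suffix in the real thing, a distinction I drop here for brevity):

```rust
// Sketch of UEG(k) decoding as described above. `next_bin` is a stand-in
// for the actual CABAC bin decoding, `u_coff` is the escape threshold
// (uCoff in the spec) and `k0` is the initial Exp-Golomb order.
fn decode_uegk(next_bin: &mut impl FnMut() -> u32, k0: u32, u_coff: u32, signed: bool) -> i32 {
    // truncated unary prefix: count ones, at most u_coff of them
    let mut val = 0u32;
    while val < u_coff && next_bin() == 1 {
        val += 1;
    }
    if val == u_coff {
        // escape part: k-th order Exp-Golomb-style suffix
        let mut k = k0;
        while next_bin() == 1 {
            val += 1 << k;
            k += 1;
        }
        while k > 0 {
            // trailing k-bit value
            k -= 1;
            val += next_bin() << k;
        }
    }
    if signed && val != 0 && next_bin() == 1 {
        -(val as i32)
    } else {
        val as i32
    }
}
```

For reference, motion vector differences use this with k=3 and uCoff=9, while coefficient levels use k=0 and uCoff=14 (with the sign coded separately).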

I hope I made it clear why the H.264 specification sucks in my opinion. Again, the format itself is logical but comprehending certain parts of the specification describing it takes significantly more time than it should, and I wanted to point out why. It was still possible to write a decoder using mostly the specification and referring to other decoders’ source code only when it was completely unclear or worked against expectations (and JM is still not the best codebase to look at either; HM got much better in that aspect).

P.S. For those zero people who care about the NihAV decoder: I’ve managed to decode two random videos downloaded from BaidUTube without B-frames and without apparent artefacts in the first hundred frames (funny how one of them turned out to be a simple CAVLC-coded video with no B-frames anyway). There’s still a lot of work to make it decode data correctly (currently it lacks even the loop filter and probably still has bugs), plus beside the dreaded B-frames with their co-located MVs there are still some features like 8×8 DCTs or high-bitdepth support I’d like to have (but definitely no interlaced support or scalable/multiview shit). It should be good enough to play the content I care about and that’s all; I do not want to waste an enormous amount of time making it a perfect piece of software that supports all possible H.264/AVC features while being the fastest one too.

MOV — Matroska of its time

October 24th, 2020

Disclaimer: all container formats suck, either by being too simple and tied to certain (types of) codecs, too ineffective (by wasting too many bytes of frame metadata and headers compared to other formats), or too flexible and complicated to implement in full. And there’s Ogg.

Let’s start our story with old times. Back in the day Electronic Arts created probably the only two good things it’s ever made. I’m talking of course about Deluxe Paint and the IFF container format (and that happened 35 years ago; it’s a pity the company still exists). The chunked approach to storage was reused by certain other companies. Remember RIFF that was used for storing audio (RMI or WAV), video (AVI) or pictures (WebP) among other things? Remember RealMedia? Remember QuickTime MOV? That’s the thing we’re going to talk about.

As you can guess all these containers are just a series of chunks, some chunks being a container for another list of chunks. MOV and MP4 are the exception because they have atoms and boxes respectively. Anyway, this structure allows you to put virtually anything inside, and while some formats used that in moderation (WAV is essentially a top-level RIFF chunk with a header chunk and a data chunk; there may be some user metadata present and that’s about it) some others abused the possibility as much as possible (no points for guessing).

Let’s quickly review the MOV structure: essentially you have the moov atom with the description of the container data (which of course includes atoms for each individual track description) and the mdat chunk with the actual data if you’re lucky. Sounds simple? The devil hides in the deeply-nested atoms.

First, MOV has a lot of features for specific groups of things like:

  • streaming—tracks can be joined into groups to signal that only one of them should be played depending on language/quality/decoding capabilities (kinda like what RMVB was famous for);
  • mastering—all those atoms for matte, kropping! clipping and notorious edit lists;
  • DVD-like playback—the less said about track referencing feature the better.

But why did I say that you have the mdat chunk with the actual data only “if you’re lucky”? Because of the wonderful feature called data reference which tells you where the actual data is stored: it may be the same file, the same file but a different resource fork (for classic MacOS), or a different file; you can even get some URL to the resource with the data. And of course different tracks may be stored in different data locations. Flexibility!

Then you get to the actual frames (or “samples” as they’re called). For certain reasons frames of the same track can be clustered together in a block (or “chunk” as the specification calls it). And to make things better frames can have different durations stored in another table in run-length form, i.e. “the first N frames have duration X, then the next M frames have duration Y, then the next K frames…”. So in order to extract a proper frame with a correct timestamp you just need to find out in which block it resides and at which offset, sum the durations of the previous frames, apply the information from the track metadata (don’t forget the edit list!) and you’re done. Except when you have audio in PCM format or compressed with some standard fixed compression scheme (A-law/μ-law, IMA ADPCM, MACE 3:1/6:1)—then you need to calculate the duration from the size. And that’s one of the things I really hate in containers: they should be codec-agnostic, in other words you should be able to seek to a certain time point (with a given precision of course), extract frame data and feed it to a decoder that does not know which container it came from either. And of course Matroska is the worst violator as it employs codec-specific ways to shave off some of the frame payload (so if you don’t know how to reconstruct the frame data the decoder won’t recognize it either).
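
To show what the duration bookkeeping actually involves, here’s a rough sketch of computing a sample timestamp from such a run-length duration table (this mirrors what the stts atom stores; the types are simplified and not the actual NihAV or QuickTime structures):

```rust
// Sketch: run-length encoded sample durations, as in the stts atom:
// "the first `count` samples last `duration` ticks each, then the next run...".
struct DurationRun {
    count: u32,
    duration: u32,
}

/// Returns the timestamp (in track timebase ticks) of sample `sample_no`
/// by summing the durations of all preceding samples, or None if the
/// sample index lies beyond the table.
fn sample_timestamp(runs: &[DurationRun], sample_no: u32) -> Option<u64> {
    let mut ts = 0u64;
    let mut remaining = sample_no;
    for run in runs {
        if remaining < run.count {
            return Some(ts + u64::from(remaining) * u64::from(run.duration));
        }
        ts += u64::from(run.count) * u64::from(run.duration);
        remaining -= run.count;
    }
    None
}
```

And on top of this you still need the sample-to-chunk mapping, the chunk offsets and the edit list to arrive at the final frame position and presentation time, which is exactly where the fun with seeking starts.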

Anyway, being such a nice format with so many features not needed by anybody 99% of the time, it was a perfect fit for the MPEG-4 container format. So MOV was adopted, some of its terminology was changed (as mentioned before, atoms->boxes) and lots of new features were added as well, both in the metadata and in the file structure (like those segments for streaming).

And if by this point the title is still not clear to you: the QuickTime MOV format appeared long before Matroska but conceptually it had many of the features present in the latter as well. To put it crookedly, if you replace atoms with EBML entities and add ways to reduce the payload (e.g. by generating missing frame headers) you can rename MOV to MKV and nobody will notice the difference.

P.S. It’s just that I’m writing a second iteration of the NihAV-based video player which should not just play a single file continuously (and maybe deadlock in the process because of some SDL audio issue) but be a proper player that can play multiple files, pause and seek. And while testing seeking in MOV files I’ve discovered some of those wonderful issues that inspired me to write this post.

Hacking to solve adventure game puzzles…

October 20th, 2020

I love adventure games (or simply quests as they’re known where I come from) but sometimes the best way to get past some moment in them is to cheat.

One of such instances is The Legend of Kyrandia – Book One, a nice game I replay sometimes, but it has the infamous maze that is not fun. And in the old times I could work around it by hex-editing a savegame to give myself the ever-glowing fireberry, stones and some other quest items so I did not have to explore the full maze.

And I thought this was a thing of the past until I tried to play Galador, also known as The Prince and the Coward, since it’s been supported by ScummVM (or should it be called CabalVM now?) for a couple of years and I hadn’t played it yet. Mostly it’s fine but there are three moments which I could not pass at all: in two instances you must quickly pick up an object and in the third one you need to throw a stone at a certain place while the cursor jitters (and repeat that three times).

My reflexes were never that great to begin with but playing the game with a touchpad instead of a mouse made it impossible: by the time a menu with actions appears after the game reacts to your click it’s already too late to select an action and click it. So one solution would be to connect a mouse and try until you pass. But I got lazy over the years (back in the day I could do all those timed sequences in Space Quest II and IV, while SQ3 required a utility to slow down the computer for the escape-from-pirates sequence). So I did something different: I hacked the source code to show what action happens when I actually select that “pick up object” action and added a handler so that when I press the 'p' key it performs the action without bothering with menus (or crashes if you don’t move the cursor to the proper object). And similarly for the stone-throwing scene I removed the jitter and mapped 'p' to picking up a new stone.

It’s not something I’m proud of but it should be a good demonstration of how you can work around certain game limitations if you have access to the engine source code—even if it’s not something trivial like maximum ammo/thousands of resources/infinite health.

NihAV: audio player done

October 7th, 2020

As I wrote in my previous post, I had a functioning audio player nearing completion. And now I’ve finally added all the features I wanted to add and can call it done.

While previously I mostly ranted about the bloat introduced by the component authors, here I’d like to describe the design and the reasoning behind it.