NihAV: feature complete

June 13th, 2021

A year ago I wrote a post about NihAV being conceptually done, now it’s even closer to perfection since I’ve finished writing a useful video player.

Writing it (and especially debugging it) was surprisingly hard so in most cases after looking at it I thought that I’d rather do something else—RE some game format, work on deflate support, just anything but that. But at last it’s done and working adequately. The previous player could just play video until it’s over (or deadlock occasionally), this one supports pausing and seeking plus it seems not to deadlock.

Conceptually it is as simple as any other player; it’s just that various features and limitations complicate the design. You simply open input, read packets, discard them or decode with a corresponding decoder and present the result (draw the frame or send audio to the sound card). Now what to do in order to make it offer all the features I wanted from it?

First, interactivity and low(er) response latency. The latter is easy to achieve: you just put audio and video decoding into separate threads so the main loop just does the demuxing and checking for user commands. So now you have user commands that might affect decoding threads (for example, seeking or going to play the next file). I implemented it by setting a special flag that the decoding thread checks: while it is set, the thread discards all queued input packets until some command is received.

Another fun thing was volume control. In my audio player I simply queue samples and let SDL deal with them; here I switched to a callback and fill the requested buffer with samples adjusted by the current volume. This way you have volume change applied almost immediately instead of it taking effect in a second or more depending on the audio queue fill.
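A minimal sketch of the callback idea, assuming signed 16-bit samples and a volume given in percent. The function name `fill_audio` and the `VecDeque` used as the sample queue are hypothetical and not taken from the actual player:

```rust
use std::collections::VecDeque;

/// Fills `out` with samples taken from `queue`, scaled by `volume` (percent).
/// Returns the number of samples consumed; missing samples become silence.
fn fill_audio(out: &mut [i16], queue: &mut VecDeque<i16>, volume: u8) -> usize {
    let mut used = 0;
    for dst in out.iter_mut() {
        match queue.pop_front() {
            Some(sample) => {
                // Scale in 32 bits and clamp so that volumes above 100% cannot wrap.
                let scaled = i32::from(sample) * i32::from(volume) / 100;
                *dst = scaled.clamp(i32::from(i16::MIN), i32::from(i16::MAX)) as i16;
                used += 1;
            }
            None => *dst = 0, // underrun: output silence
        }
    }
    used
}
```

Since the volume is read at the moment the buffer is filled, a change affects the very next callback instead of waiting for already queued samples to play out.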

And speaking about queues, that was also a fun thing to manage. In a default implementation (demux-send-repeat) you either end up sending all packets before the decoding thread processes them all (and this will consume a lot of memory) or you block on sending via a limited message channel (which either requires a separate thread with more complicated interaction between them all or you can forget about interactivity). The answer is obviously to keep track of how full the channel is and not to demux new packets unless you’re sure you can send them. I keep a queue for the packets or events that should be sent but for which there’s no space in the communication channel yet. Additionally I make the audio part report the current buffer fill so I know when there’s no reason to send more packets yet. Similarly for video I keep a queue of ready to display frames (in SDL textures, so drawing them is just one SDL blit call).
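The "queue in front of a limited channel" approach could look roughly like this. It is only a sketch built on the standard `sync_channel`; the `PacketSender` type and its method names are made up for illustration:

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{SyncSender, TrySendError};

/// Hypothetical helper: a bounded channel sender plus a local backlog.
struct PacketSender<T> {
    tx: SyncSender<T>,
    pending: VecDeque<T>,
}

impl<T> PacketSender<T> {
    fn new(tx: SyncSender<T>) -> Self {
        Self { tx, pending: VecDeque::new() }
    }

    /// Queues an item and pushes as much of the backlog as currently fits.
    fn send(&mut self, item: T) {
        self.pending.push_back(item);
        self.flush();
    }

    /// Moves backlog items into the channel without ever blocking.
    /// Returns `false` if the receiving thread is gone.
    fn flush(&mut self) -> bool {
        while let Some(item) = self.pending.pop_front() {
            match self.tx.try_send(item) {
                Ok(()) => {}
                Err(TrySendError::Full(item)) => {
                    // No space yet: keep the item for the next attempt.
                    self.pending.push_front(item);
                    return true;
                }
                Err(TrySendError::Disconnected(_)) => return false,
            }
        }
        true
    }

    /// The main loop should demux a new packet only when this is true.
    fn can_accept(&self) -> bool {
        self.pending.is_empty()
    }
}
```

The main loop would call `flush()` on every iteration and read the next packet only when `can_accept()` returns true, so it never blocks and stays responsive to user commands.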

Overall, nothing particularly tricky but debugging it was still not fun. I actually ended up adding a compile-time debugging feature that will dump a lot of internal playback information into debug.log so I can figure out what actually happened there (i.e. NIHing a logger but you should not be surprised by that).

Of course it still lacks a lot of features for a serious player like proper synchronisation (with automatic framedropping and audio underrun corrections), more features (like playback at different speed, taking screenshots and switching between different input streams) and GUI. But I’m fine with the current state of my player and may enhance it later if the need arises.

Fun thing is that my player seemed to stall on MP4s. As it turned out, the problem was in MP4 demuxer producing packets in interleaved order (one for audio stream, one for video stream, one for audio stream again, one for video stream…) while it should’ve output packets for various streams for more or less the same time position (which means usually two AAC frames between video frames instead of just one). After the change my player works as expected on MP4s as well. And if I ever get to fixing and optimising H.264 decoder it should be good enough to serve as an everyday video player (I use nihav-sndplay to listen to my music collection already).
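To illustrate the difference between strict interleaving and output by time position, here is a small sketch that always picks the stream whose next packet has the lowest timestamp. The `StreamQueue` structure and the millisecond timestamps are assumptions for the example, not the actual demuxer code:

```rust
/// Hypothetical per-stream state: timestamps (in milliseconds) of the
/// packets in stream order and the index of the next one to output.
struct StreamQueue {
    timestamps: Vec<u64>,
    pos: usize,
}

/// Picks the stream whose next packet has the lowest timestamp
/// (ties go to the stream with the lower index).
fn next_stream(streams: &[StreamQueue]) -> Option<usize> {
    streams
        .iter()
        .enumerate()
        .filter_map(|(idx, s)| s.timestamps.get(s.pos).map(|&ts| (ts, idx)))
        .min()
        .map(|(_, idx)| idx)
}

/// Returns the output order as (stream index, timestamp) pairs.
fn interleave(streams: &mut [StreamQueue]) -> Vec<(usize, u64)> {
    let mut order = Vec::new();
    while let Some(idx) = next_stream(streams) {
        let s = &mut streams[idx];
        order.push((idx, s.timestamps[s.pos]));
        s.pos += 1;
    }
    order
}
```

With 25 fps video (a frame every 40 ms) and audio frames roughly every 21 ms this produces two audio packets between consecutive video packets, which is the pattern described above.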

I guess there’s just one of the original ideas (of what I wanted to try at the time of public NihAV release) left that I haven’t tried yet, namely to experiment on writing some DCT-based video encoder with rate control and such (maybe even for VP6). This should keep me occupied for a couple of years. Or at least inspire me to do something else instead.

On the Origin of Bloatware

June 11th, 2021

This is inspired by both a private discussion on why modern computing is so complex and my migration from Ubuntu 12.04LTS to systemd 20.04LTS.

Since I’ve finally changed from my less than ten years old operating system to something more modern I’ve noticed that it became noticeably slower (not irritatingly slower though but slower nevertheless) except for Firefox (which is probably not because of JS engine improvements but rather because of native execution of now supported APIs instead of polyfills). And trying various desktop environments before settling on Cinnamon I’m horrified by how bloated and unusable (to me) they are. My friends complain about modern technology demanding more effort to maintain because of complexity and weird interdependencies, while it’s supposed to make your life easier. So why is it like that?

For a keen reader the title of this post contains the answer. For the rest I’ll elaborate it below.

Revisiting legendary Q format

May 30th, 2021

Since I had nothing better to do I decided to look again at Q format and try to write a decoder for NihAV while at it.

It turns out there are three known versions of the format: the one in Death Gate (version 3), the one in Shannara (version 4) and the one in Mission Control (not an adventure game this time; it’s version 5 obviously). Versions 4 and 5 differ only in minor details, the compression is the same. Version 3 uses the same principles but some coding details are different.

The main source of confusion was the fact that you have two context-dependent opcodes, namely 0xF9 and 0xFB. The first one either repeats the previous block several times or reuses the motion vector from that block. The second one is even trickier. For version 5 frames and for version 4 with mode 7 signalled it signals a series of blocks with 3-16 colours in each. But for mode 6 it signals a series of blocks with the same type as the previous one but with some parameters changed. If it was preceded by a fill block, these blocks will have a fill value. After a block with patterns you have a series of blocks with patterns reusing the same colours as the original block. For a motion-compensated block you have the same kind of motion information transmitted.

But the weirdest thing IMO is the interlaced coding in version 5. For some reason (scalability? lower latency?) they decided to code a frame in two parts, so frame type 9 codes even rows of the frame and frame type 11 codes odd rows; in this case rows are four pixels high as that is one block height. That is definitely not something I was expecting.
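The row mapping can be sketched as follows, assuming block rows are counted from zero (so the topmost row counts as even). The function name is hypothetical; only the frame types 9 and 11 and the four-pixel block height come from the description above:

```rust
/// Maps a block row inside a version 5 field frame to the top pixel row
/// of that block in the full picture. Returns `None` for other frame types.
fn field_row_to_y(frame_type: u8, field_row: usize) -> Option<usize> {
    const BLOCK_HEIGHT: usize = 4;
    let parity = match frame_type {
        9 => 0,  // even block rows
        11 => 1, // odd block rows
        _ => return None,
    };
    Some((field_row * 2 + parity) * BLOCK_HEIGHT)
}
```

So the second block row of a type 9 frame lands at pixel row 8 while the first block row of a type 11 frame fills the gap at pixel row 4.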

All in all, the format turned out to be even weirder than I expected it to be.

Why I still like C and strongly dislike C++

May 26th, 2021

This comes up in my conversations surprisingly often so I thought it’s worth writing my thoughts down instead of repeating them again and again.

As is common with C programmers, C was not my first nor my last language, but I still like it and when I have to write programs I do it in C. Meanwhile I try to be aware of modern (and not so modern) programming languages and their trends and write my own multimedia-related hobby project in Rust. So why have I not moved to anything else yet and how does C++ come into all this?

ZMBV support in NihAV and deflate format fun

May 22nd, 2021

As I said in the previous post, I wanted to add ZMBV support to NihAV, mostly because it is a rather simple codec (which means I can write a decoder and an encoder for it without spending too much time), it’s lossless and supports various bit-depths too (which means I can encode various content into it preserving the original format).

I still had to improve my deflate support (both decompressing and compressing) a bit to support the way the data is stored there. At least now I mostly understand what various flags are for.

First of all, by itself deflate format specifies just a bitstream split into blocks of data that may contain any amount of coded data. And these blocks start at the next bit after the previous block has ended, no byte aligning except by chance or after a copy block (which aligns bitstream before storing length and block contents).

Then, there is the raw format used in various formats (like Zip or gzip) and there’s the zlib format used in most cases where data is stored as part of some other format (that means you have two initial bytes like 0x78 0x5E and 2×2 bytes of checksum in the end).
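For reference, those two initial bytes can be validated as described in RFC 1950: the low nibble of the first byte is the compression method (8 for deflate), the high nibble gives the window size, and the two bytes read as a big-endian number must be divisible by 31. A small sketch (the function name is mine, and the preset dictionary flag is ignored here):

```rust
/// Checks the two-byte zlib header and returns the window size in bytes.
fn parse_zlib_header(cmf: u8, flg: u8) -> Option<u32> {
    let method = cmf & 0x0F; // 8 means deflate
    let cinfo = cmf >> 4;    // log2(window size) - 8
    let check = (u16::from(cmf) << 8) | u16::from(flg);
    if method != 8 || cinfo > 7 || check % 31 != 0 {
        return None;
    }
    Some(1 << (cinfo + 8))
}
```

Both 0x78 0x5E and the more common 0x78 0x9C pass this check and signal a 32kB window.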

So, ZMBV uses unterminated stream format: first frame contains zlib header plus one or several blocks of data padded with an empty copy block to the byte boundary, next frame contains continuation of that stream (also one or more blocks padded to the byte boundary) and so on. This is obviously done so you can decode frames one after another and still exploit the redundancy from the previously coded frame data if you’re lucky.

Normally you would start decoding data and keep decoding it until the final block (there’s a flag in block header for that) has been decoded—or error out earlier for insufficient data. In this case though we need to decode data block, check if we are at the end of input data and then return the decoded data. Similarly during data compression we need to encode all current data and pad output stream to the byte boundary if needed.
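The padding on the compression side can be sketched with a tiny bit writer. This is an illustration of the empty copy (stored) block from RFC 1951, not the NihAV code; the `BitWriter` type is hypothetical:

```rust
/// Minimal LSB-first bit writer of the kind deflate needs.
struct BitWriter {
    out: Vec<u8>,
    acc: u32,
    nbits: u8,
}

impl BitWriter {
    fn new() -> Self {
        Self { out: Vec::new(), acc: 0, nbits: 0 }
    }

    /// Appends the low `count` bits of `value` (count <= 16, higher bits must be zero).
    fn write(&mut self, value: u16, count: u8) {
        self.acc |= u32::from(value) << self.nbits;
        self.nbits += count;
        while self.nbits >= 8 {
            self.out.push(self.acc as u8);
            self.acc >>= 8;
            self.nbits -= 8;
        }
    }

    /// Pads the stream to a byte boundary with an empty stored block.
    fn sync_flush(&mut self) {
        self.write(0, 3); // BFINAL = 0, BTYPE = 00 (stored)
        if self.nbits > 0 {
            self.out.push(self.acc as u8); // remaining bits, zero-padded
            self.acc = 0;
            self.nbits = 0;
        }
        // LEN = 0 followed by its complement NLEN = 0xFFFF
        self.out.extend_from_slice(&[0x00, 0x00, 0xFF, 0xFF]);
    }
}
```

The resulting 00 00 FF FF tail is the same marker that zlib emits for a sync flush, and the decoder can resume at the next byte with all its history intact.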

This is not hard or particularly tricky but it demonstrates that deflated data can be stored in different ways. At least now I really understand what that Z_SYNC_FLUSH flag is for.

Adding deflate support to NihAV

May 18th, 2021

Since I wanted to do something different I decided to finally implement deflate support for NihAV—by which I mean compression support in addition to decompression. Here is how well it went.

As usual, my goal was to implement it in mostly straightforward way but with reasonable speed instead of having something completely on par with zlib or better.

At first I implemented the simplest form of compression: copying data without compression (but with the proper headers and Adler-32 checksum at the end). Then I added a simple encoding with fixed codes that simply output symbols as is (no compression yet but at least it tests how well the bitstream is written). Then I moved to dynamic codes. Then I added a brute force search and started encoding matches. So by the end of the weekend I had something working already and then I could make it faster and/or better.
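The copy-only stage is small enough to show in full. The sketch below follows RFC 1950 and RFC 1951 and handles only inputs that fit into a single stored block; the function names are mine and not necessarily those used in NihAV:

```rust
/// Adler-32 checksum as defined in RFC 1950.
fn adler32(data: &[u8]) -> u32 {
    const MOD: u32 = 65521;
    let (mut a, mut b) = (1u32, 0u32);
    for &byte in data {
        a = (a + u32::from(byte)) % MOD;
        b = (b + a) % MOD;
    }
    (b << 16) | a
}

/// Wraps `data` (at most 65535 bytes) into a zlib stream with one stored block.
fn zlib_store(data: &[u8]) -> Option<Vec<u8>> {
    if data.len() > 0xFFFF {
        return None; // would need to be split into several blocks
    }
    let len = data.len() as u16;
    let mut out = vec![0x78, 0x01]; // zlib header: deflate, 32kB window
    out.push(0x01); // BFINAL = 1, BTYPE = 00, padded to a full byte
    out.extend_from_slice(&len.to_le_bytes());
    out.extend_from_slice(&(!len).to_le_bytes());
    out.extend_from_slice(data);
    out.extend_from_slice(&adler32(data).to_be_bytes());
    Some(out)
}
```

The overhead is 11 bytes (2 for the header, 1 for the block header, 4 for the lengths and 4 for the checksum), which agrees with the "copy" result in the benchmarks at the end of this post.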

Of course the first thing to remember is that you can reduce search time by using some structure for a faster text search. I think suffix trie is now popular but I settled for an old-fashioned hash by three bytes. Initially it was twice as slow since while the number of string comparisons decreased hundredfold, updating the hash table on each step consumed too much time. So I switched to a linked-list hash that resembles FAT somewhat (i.e. for each position in the input you have a pointer to the next location of the same three-letter hash plus an additional table pointing to the start of the chain for each hash key). And I calculated it once per large block, just discarding matches outside of the desired range. Of course this can be done better but it worked fast enough for me.
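A rough sketch of such a FAT-like structure, built once for a whole block. The hash function, the table size and all names here are assumptions chosen for the example:

```rust
const HASH_BITS: u32 = 15;
const NONE: usize = usize::MAX;

/// Hash of the first three bytes of `b` (an arbitrary multiplicative hash).
fn hash3(b: &[u8]) -> usize {
    let v = (u32::from(b[0]) << 16) | (u32::from(b[1]) << 8) | u32::from(b[2]);
    (v.wrapping_mul(0x9E37_79B1) >> (32 - HASH_BITS)) as usize
}

/// Builds the tables: `head[h]` is the first position with hash `h`,
/// `next[pos]` is the next position that has the same hash as `pos`.
fn build_chains(data: &[u8]) -> (Vec<usize>, Vec<usize>) {
    let mut head = vec![NONE; 1 << HASH_BITS];
    let mut next = vec![NONE; data.len()];
    if data.len() < 3 {
        return (head, next);
    }
    // Walk backwards so that every chain lists positions in increasing order.
    for pos in (0..=data.len() - 3).rev() {
        let h = hash3(&data[pos..]);
        next[pos] = head[h];
        head[h] = pos;
    }
    (head, next)
}

/// Returns earlier positions that start with the same three bytes as `pos`.
fn earlier_matches(data: &[u8], head: &[usize], next: &[usize], pos: usize) -> Vec<usize> {
    let mut found = Vec::new();
    if pos + 3 > data.len() {
        return found;
    }
    let mut cur = head[hash3(&data[pos..])];
    while cur != NONE && cur < pos {
        // Different triplets may share a hash, so compare the actual bytes.
        if data[cur..cur + 3] == data[pos..pos + 3] {
            found.push(cur);
        }
        cur = next[cur];
    }
    found
}
```

A real encoder would also skip candidates that are farther away than the 32kB window, which corresponds to discarding matches outside of the desired range.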

Now the compression. There are three main strategies I tried: naïve aka greedy one (you simply output the longest match you can find at the current step), lazy (you also check the next position if it produces an even better match and use it if possible; surprisingly enough it gives a significant benefit) and theoretically optimal (you construct a trellis and see which combination of matches and literals can give you the best coding; it has issues but theoretically it’s the best one).
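The difference between the greedy and the lazy strategy fits into a few lines when a brute-force match search is used. This is a simplified sketch (one-step lookahead, no hash tables), with all names invented for the example:

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Literal(u8),
    Match { len: usize, dist: usize },
}

const MIN_MATCH: usize = 3;
const MAX_MATCH: usize = 258;
const MAX_DIST: usize = 32768;

/// Brute-force search for the longest earlier match; returns (length, distance)
/// or (0, 0) when nothing of at least MIN_MATCH bytes is found.
fn longest_match(data: &[u8], pos: usize) -> (usize, usize) {
    let mut best = (0, 0);
    for cand in pos.saturating_sub(MAX_DIST)..pos {
        let len = data[cand..]
            .iter()
            .zip(&data[pos..])
            .take(MAX_MATCH)
            .take_while(|(a, b)| a == b)
            .count();
        // `>=` prefers the closest candidate among equally long ones.
        if len >= MIN_MATCH && len >= best.0 {
            best = (len, pos - cand);
        }
    }
    best
}

/// Splits the input into literals and matches, greedily or lazily.
fn parse(data: &[u8], lazy: bool) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < data.len() {
        let (len, dist) = longest_match(data, pos);
        // Lazy mode: emit a literal if the next position has a longer match.
        let defer = lazy
            && len >= MIN_MATCH
            && pos + 1 < data.len()
            && longest_match(data, pos + 1).0 > len;
        if len < MIN_MATCH || defer {
            tokens.push(Token::Literal(data[pos]));
            pos += 1;
        } else {
            tokens.push(Token::Match { len, dist });
            pos += len;
        }
    }
    tokens
}
```

On the input "abc_bcdef_abcdef" the greedy parse ends with two three-byte matches while the lazy parse emits a literal "a" followed by a single five-byte match, saving one distance code.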

So why is it “theoretically optimal” and not just optimal? Because it needs to calculate the accurate bit cost and you can’t know it until you produce all the symbols to be encoded and calculate the actual lengths for them. Of course you can do it in an iterative process or employ a heuristic to predict bit length somehow but I simply used “9 bits for the symbol and 5 bits plus escape bits for distance additionally if it’s present”. I think for some cases it even produced larger files than the lazy method.
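That fixed estimate might be written down like this. The number of extra ("escape") bits per distance follows the table in RFC 1951; the function names and the use of `Option` for "no distance" are my assumptions:

```rust
/// Number of extra bits that follow the distance code for `dist` (1..=32768).
fn dist_extra_bits(dist: u32) -> u32 {
    if dist <= 4 {
        return 0;
    }
    // Every pair of distance codes covers twice the range of the previous pair.
    let top = 31 - (dist - 1).leading_zeros(); // index of the highest set bit
    top - 1
}

/// Estimated cost in bits: 9 for the literal/length symbol plus,
/// for a match, 5 for the distance code and its extra bits.
fn estimate_bits(dist: Option<u32>) -> u32 {
    match dist {
        None => 9,
        Some(d) => 9 + 5 + dist_extra_bits(d),
    }
}
```

Note that this ignores the extra bits of long match lengths and the real code lengths, which is one reason why the trellis search driven by it is optimal only in theory.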

Here is a list from the top of my head of things that can be improved (but I guess anybody who has written a LZ77-based compressor knows it better than me):

  • method selection—sometimes copying data verbatim is better (in the case of noise) or using fixed codes (because the overhead from transmitting dynamic codes eats all the advantage);
  • partitioning—currently I use 64kB blocks but depending on content (usually detected by the symbol frequency variations) it’s better to cut block earlier or make it larger. I played a bit with the block size but changing it (in either direction) currently leads to compression ratio drops;
  • faster search for the matching strings;
  • heuristics for either faster optimal parsing or better-compressing other method.

Of course some of it can be sped up by simply using unsafe Rust so no checks on array access are performed but I don’t think it’s worth it now.

And finally here are some benchmarks for the curious ones performed on a source file of the program:

  • copy: 32156 bytes (from 32145 bytes)
  • fixed codes and greedy search: 7847 bytes, 80ms
  • dynamic codes and greedy search: 6818 bytes, 80ms
  • dynamic codes and lazy search: 6665 bytes, 100ms
  • dynamic codes and “optimal” search: 6529 bytes, 690ms
  • gzip -9 for the reference: 6466 bytes, <10ms

As you can see it’s not fast but it works. I also checked that the resulting compressed data is decoded fine (plus some tests on large files that will be split into several blocks). Now all that’s left is to implement ZMBV decoder and encoder.

Missing optimisation opportunity in Rust

May 12th, 2021

While I’m struggling to write a video player that would satisfy my demands I decided to see if it’s possible to make my H.264 decoder a bit faster. It turned out it can be done with ease and that also raises the question concerning the title of this post.

What I did cannot be truly called optimisations but rather “optimisations” yet they gave a noticeable speed-up. The main optimisation candidates were motion compensation functions. First I shaved a tiny fraction of a second by not zeroing temporary arrays as their contents will be overwritten before the first read.

And then I replaced the idiomatic Rust code for working with a block, like

    for (dline, (sline0, sline1)) in dst.chunks_mut(dstride).zip(tmp.chunks(TMP_BUF_STRIDE).zip(tmp2.chunks(TMP_BUF_STRIDE))).take(h) {
        for (pix, (&a, &b)) in dline.iter_mut().zip(sline0.iter().zip(sline1.iter())).take(w) {
            *pix = ((u16::from(a) + u16::from(b) + 1) >> 1) as u8;
        }
    }

with raw pointers:

    unsafe {
        let mut src1 = tmp.as_ptr();
        let mut src2 = tmp2.as_ptr();
        let mut dst = dst.as_mut_ptr();
        for _ in 0..h {
            for x in 0..w {
                let a = *src1.add(x);
                let b = *src2.add(x);
                *dst.add(x) = ((u16::from(a) + u16::from(b) + 1) >> 1) as u8;
            }
            dst = dst.add(dstride);
            src1 = src1.add(TMP_BUF_STRIDE);
            src2 = src2.add(TMP_BUF_STRIDE);
        }
    }

What do you know, the total decoding time for the test clip I used shrank from 6.6 seconds to 4.9 seconds. That’s just three quarters of the original time!

And here is the problem. In theory if the Rust compiler knew that the input satisfies certain parameters, i.e. that there’s always enough data to perform the full block operation in this case, it would be able to optimise the code as well as the one I wrote using pointers or even better. But unfortunately there is no way to tell the compiler that input slices are large enough to perform the operation the required number of times. Even if I added a mathematically correct check in the beginning it would not eliminate most of the checks.

Let’s see what happens with the iterator loop step by step:

  1. first all sources are checked to be non-empty;
  2. then in outer loop remaining length of each source is checked to see if the loop should end;
  3. then it is checked if the outer loop has run not more than requested number of times (i.e. just for the block height);
  4. then it checks line lengths (in theory those may be shorter than block width) and requested width to find out the actual length of the inner loop;
  5. and finally inside the loop it performs the averaging.

And here’s what happens with the pointer loop:

  1. outer loop is run the requested amount of times;
  2. inner loop is run the requested amount of times;
  3. operation inside the inner loop is performed.

Of course those checks are required to make sure you work only with the accessible data but it would be nice if I could either mark loops as “I promise it will run exactly this number of times” (maybe via .take_exact() as Luca suggested but I still don’t think it will work perfectly for 2D case) or at least put code using slices instead of iterators into unsafe {} block and tell compiler that I do not want boundary checks performed inside.

Update: in this particular case the input buffer size should be stride * (height - 1) + width, i.e. it is always enough to perform the operation in the way described above, but if you use .chunks_exact() the last line might not be handled, which is wrong.
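One way to keep the pointer version honest is to verify that size formula once before entering the unsafe part. The following is a sketch with hypothetical names (`min_len`, `avg_block`) and not the decoder's actual function. Row pointers are recomputed from the base on each line so that no pointer is ever moved past the end of a buffer of exactly the minimum size:

```rust
/// Smallest buffer length that holds a `w`x`h` block with the given stride.
fn min_len(stride: usize, w: usize, h: usize) -> usize {
    if w == 0 || h == 0 { 0 } else { stride * (h - 1) + w }
}

/// Averages two blocks with rounding; all sizes are checked once up front.
fn avg_block(dst: &mut [u8], dstride: usize, src1: &[u8], src2: &[u8],
             sstride: usize, w: usize, h: usize) {
    if w == 0 || h == 0 {
        return;
    }
    assert!(w <= dstride && w <= sstride);
    assert!(dst.len() >= min_len(dstride, w, h));
    assert!(src1.len() >= min_len(sstride, w, h));
    assert!(src2.len() >= min_len(sstride, w, h));
    // SAFETY: the largest offset used is stride * (h - 1) + (w - 1), which the
    // assertions above guarantee to be inside each of the three slices.
    unsafe {
        for y in 0..h {
            let s1 = src1.as_ptr().add(y * sstride);
            let s2 = src2.as_ptr().add(y * sstride);
            let d = dst.as_mut_ptr().add(y * dstride);
            for x in 0..w {
                let a = *s1.add(x);
                let b = *s2.add(x);
                *d.add(x) = ((u16::from(a) + u16::from(b) + 1) >> 1) as u8;
            }
        }
    }
}
```

The three assertions cost a few comparisons per call instead of several per row and per pixel, and the last line of a buffer that ends right after the block is still processed.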

The former is rather hard to implement for the common case so I don’t think it will happen anywhere outside Fortran compilers, the latter would cause conflicts with different Deref trait implementation for slices so it’s not likely to happen either. So doing it with pointers may be clunky but it’s the only way.

The Magic of Animation

May 1st, 2021

Since I had nothing better to do, I decided to re-play some old adventure games and one of them was King’s Quest VII (I don’t know why it gets Roberta Williams’s name attached to it: she’s behind all previous KQ games as well, it’s Mask of Eternity and the 2015-2016 re-imagining that deserve the Not Roberta Williams game title). And in my usual habit I also looked at intro/ending animations. As you all remember, there are DOS, Mac and Windows releases of the game and each of them uses its own format. Windows version uses 10-fps MS Video 1 in AVI, Mac version uses 8-fps Cinepak in MOV (with data in a separate resource fork as expected from Mac video), DOS version turned out to use 5-fps RBT. Thanks to Mike Melanson documenting it in the course of his experiment, I was able to write a quick and dirty program to unpack .rbt files (essentially it’s just raw frames compressed with LZStac so if you don’t care about handling errors or less common cases then a 3.5kB program in C is enough).

And while doing that I remembered that the animation for this game was partly done in-house but most of the work was outsourced to various animation studios including the infamous Animation Magic. In case you forgot, that is a studio with Russian origins that was mostly known for their unforgettable animation of CD-i games. Yes, those CD-i games. In their defence, it mostly came from them being inexperienced with computer animation, and slowly the animations in their games became better at the expense of them becoming less memorable (animated fairy tales books are interesting only when Dingo Pictures does them!). But between MAH BOI games that are refusing to be forgotten and the rather obscure Magic Tales series there were two edutainment first-person shooters, namely I.M. Meen and Chill Manor, that are still somewhat remembered for their wildly imaginative cutscenes. Of course somebody had to look at the format.

It turned out to be a custom format with intra-only RLE-packed video with the only interesting things about it being the use of up to 128 colours only, the fact it should be drawn over some external background (even the intro or ending), and that it uses run length 0 as “fill until the end of current line” mode. Audio is raw PCM, so nothing remarkable there.
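To show what the "run length 0 fills until the end of line" rule means in practice, here is a purely illustrative decoder for one line. The byte layout (a run length followed by a colour index) is an assumption made for the example and is not the documented layout of the format:

```rust
/// Decodes one line from hypothetical (run, colour) byte pairs.
/// Returns the number of input bytes consumed or `None` on malformed input.
fn decode_line(src: &[u8], line: &mut [u8]) -> Option<usize> {
    let mut x = 0;
    let mut used = 0;
    while x < line.len() {
        let run = usize::from(*src.get(used)?);
        let colour = *src.get(used + 1)?;
        used += 2;
        // Run length 0 is the special "fill until the end of the line" case.
        let end = if run == 0 { line.len() } else { x + run };
        line.get_mut(x..end)?.fill(colour);
        x = end;
    }
    Some(used)
}
```

The special case saves the encoder from having to split a long run of the same colour into several codes just because a one-byte run length cannot express it.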

For the comparison here’s the captured image from the intro playback (stolen from Mike’s review of the game):

and a decoded frame from intro.ani (not the same one but close enough):

You can find the missing background among the game files in PCX format though.

This is a bit crazy format but it was fun looking at it.

Le spam

May 1st, 2021

Sometimes I look inside Baidu Mail spam folder to see if there’s anything useful that got there by mistake (notifications from various shops with purchase confirmations end there quite often, to give one example). And there’s a weird tendency I’ve spotted recently.

In the last five days I’ve received 47 spam mails. 37 of them were in French. I’m used to receiving spam in various European languages (including but not limited to Bulgarian, German, Italian, Russian and Spanish) but before last year it was mostly in English. Additionally a good deal of them now are about some promotional actions from supermarket chains like Aldi, Carrefour or Lidl (and I’ve never considered any of them to be some luxury store).

What’s wrong with this world?

Looking at Q Format

April 24th, 2021

For the lack of anything better to do I took a second look at Shannara game from Legend Entertainment (yes, I was that bored). And while it failed to captivate me once again, at least I have discovered yet another video format.

Actually I like old adventure games of theirs, especially the fact that they use RealSound technology (even if it’s just a way to play PCM on PC Speaker). But Shannara is a hybrid game with all the map travelling and fighting monsters. And I could not get into Terry Brooks’s books either, the first Shannara book reminded me of Lord of the Rings but in post-apocalyptic setting with magic appearing for some reason so I dropped it halfway.

In either case, the game featured full-motion animations and of course I had to look at them. As one would expect, all of them could be found in FLICS subdirectory and some of them even were in FLIC format. The rest were sporting rather rare .q extension and I doubted those were Quantum archives. After looking closer it turned out to be quite interesting format.

Video is compressed by splitting frame into 4×4 blocks, usually coding those blocks either as a block filled with two colours using a pattern or by copying some previous block (it does not try to motion search up to a pixel precision but it can reuse any block from a frame). There is an additional coding mode for coding either raw 4×4 block or block filled with 3-8 colours in a pattern. And additionally 128 of the most commonly used patterns for a group of frames are transmitted in a separate chunk before those frames; as a result you can use just one byte to code that index instead of two bytes for a full pattern.
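A two-colour pattern block is easy to picture in code. In this sketch the bit order (bit 0 is the top-left pixel, rows go top to bottom) and the meaning of a set bit are assumptions, since the text above does not specify them:

```rust
/// Paints a 4x4 block from a 16-bit pattern and two colours.
/// Assumed convention: a set bit selects the second colour.
fn fill_pattern_block(pattern: u16, colours: [u8; 2]) -> [[u8; 4]; 4] {
    let mut block = [[0u8; 4]; 4];
    for (y, row) in block.iter_mut().enumerate() {
        for (x, pix) in row.iter_mut().enumerate() {
            let bit = (pattern >> (y * 4 + x)) & 1;
            *pix = colours[usize::from(bit)];
        }
    }
    block
}
```

A full pattern takes two bytes, so replacing it with a one-byte index into the table of common patterns halves the cost of the pattern part of such a block.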

Even if I haven’t managed to figure out all details from it and there may be other flavours of it in other games, it was a surprisingly original format and it was fun looking at it.