Archive for the ‘NihAV’ Category

NihAV: now with GIF support

Monday, September 11th, 2023

One would wonder why. The answer is simple: I merely wanted to play with LZW encoding and created a GIF encoder. Then I needed to debug it and thus ended up with a decoder as well. And while I was at it I added animated GIF support for both decoding and encoding.

I’m not going to repeat on how it works, it’s been written in all possible textbooks and tutorials on data compression (and then some), I’m more interested in implementing an efficient encoder for it. I should remind you that LZW compression was patented (at least by Sperry and IBM) and that patent introduced a lot of chaos in multimedia before MP3 or H.264 were a thing.

The algorithm simply tells you to find a match for the input string in the dictionary and add a new string to the dictionary. You can keep the actual strings in the dictionary (simple but not very effective), you can keep the reference to it in the input (simple and saves a lot on dictionary space but not very effective and you can’t feed it input by chunks), you can keep the same dictionary structure as the decoder and try to match input to the code sequence (the most compact way to store the dictionary but not so straightforward to use; more on a practical extension of this approach at the end of this post). Since I do not care about wasting whopping two megabytes of RAM on the dictionary, I implemented a simple trie: each node contains a code ending at that node and 256 pointers to the next next trie nodes (each code/pointer fits into 16-bit integer with zero as “invalid value” pointer). Of course it can be made more compact but I prefer simplicity and clarity.

Now, the question is whether the compression efficiency can be improved somehow. Since the decoder expects encoder to work in a certain manner we can’t change format. The tricks that work good for LZ77 like lazy matching do not work that great for LZ78: you have dictionary state changed at each step and you don’t know which entry will be more useful in the future. Also emitting shorter matches means we also litter the dictionary with duplicates, which does not help compression in the long run either.

Thus the only thing we can use is creative dictionary clearing. E.g. we can emit reset code after every N symbols encoded to keep the code length constant (that’s the trick various libraries used to produce the compliant stream without doing any compression and thus not infringing the patent), or you can keep the dictionary filled with what it was until the very end (and if data properties change your compression ratio hurts), or (as most encoders do) you simply reset the dictionary after it gets full. But can you do better?

I tried two heuristics. First, I tried segmenting data by having a sliding window and comparing the character occurrence in each half, if those differed by a lot then it was a good place to restart. From what I heard, this simple approach allows better compression for LZ77- and BWT-based compression schemes. In this case it helped only when compressing the data that could not be compressed at all (e.g. greyscale continuous-tone images), in other cases the results were worse than with the default strategy. The second approach was simple: keep the rolling average of last, say, sixteen match lengths and if it drops significantly when the dictionary is full then it’s time to clear it. Surprisingly enough it seemed to improve compression for several fractions of a percent in general cases (and sometimes worsened the ratio on hard-to-compress files by about the same amount) so I kept it.


And while talking about GIF and LZW, here are some words about FFmpeg support of GIF: it’s rather laughable. While encoding to GIF has been supported from the early days (from 2002, to be less vague), actual LZW compression support was introduced only in 2009 (using a work of some Summer of Code student from 2007; the notorious patent on LZW compression had expired by that time). A bit later in the same year the encoder was improved to work with paletted input instead of RGB24 (which was maimed to use fixed palette anyway). In the same time the original animated GIF encoder+muxer (not to be confused with now improved single-image GIF encoder) kept existing with all the original deficiencies (RGB24 input, fixed palette, no compression) so by encoding to .gif without specifying image2 muxer you’d get significantly worse results. I heard it’d been improved at last back in 2018 but I have not been caring before so why start now?

The interesting thing there is that LZW compression employed for both GIF and TIFF seems to be based on compress approach with a hash table. Essentially the dictionary is stored in the same form as in the decoder (a reference to the previous code plus suffix character) but indices are hash codes instead of sequential entries and the encoder in a decoder mirror mode: it traverses the table to find the code that corresponds to the current code plus a new character (and if it’s not found, the prefix code is output and new code pair is added to the dictionary). Also it seems to compress data slightly worse than the straightforward implementation in some cases (e.g. in my tests one 320×240 PGM file got expanded to 83159 bytes instead of 83132 bytes and another 112×288 PPM file got compressed to 10371 bytes instead of 10366 bytes; the difference is negligible but it’s there). I suspect this is caused by hash collisions i.e. in some rare cases a shorter code and a longer one have the same hash and the former gets selected even when the latter should be better.


In either case, this might have been not a very useful exercise but it was fun and somewhat educational too. Now only to find a thing to do next…

NihAV: adding SGA support

Saturday, September 2nd, 2023

Since I had nothing better to do this week I decided to finally add Digital Pictures SGA decoding support to NihAV. While there are many different formats described in The Wiki, I’ve decided to play only those not described there (namely $81/$8A, $85, $86 and $89).

In my previous post on this matter I mentioned that the formats I took interest in are using 8×8 tiles that may be subdivided into 8×4 or 4×4 parts and filled with several colours using a predefined pattern (or an arbitrary one for 8×8 tile if requested) plus some bits to select one of two possible colours for each tile pixel. The main difference between $81/$8A scheme and the others is that it codes all data in the same bitstream while the later versions split colours and opcode+pattern bits into two separate partitions (maybe they had plans for compressing it?) plus they store audio data inside the frame.

And here are some notes on the games (I think most of those are PC or Macintosh ports but it’s possible the same files were used in console versions of some of those games as well):

  • Double Switch—this one uses $81 compression (in still images, cutscenes embed them along with $A2 audio in $F1 chunks);
  • Quarterback Attack—this one uses $8A compression in $F9 chunks;
  • Night Trap$85 compression and megafiles (i.e. almost all cutscenes are stored in single NTMOVIE file that require some external index to access them). Also the PC release had a short documentary about the moral panic around that game (in the same format of course; in two resolutions even);
  • Corpse Killer$86 compression and one megafile for all cutscenes;
  • Supreme Warrior$89 compression, one megafile and no frame dimensions given. For most of the cutscenes it’s 256×160 but at the end (logo and maybe something else) it’s different. Additionally there are two audio tracks: some audio chunks contain twice as much data (and have high bit of size set), in that case the first half corresponds to English speech and the second half is Chinese; otherwise it’s the same for both versions (e.g. background music, fighters grunting, sound effects and so on).

Overall, it was an interesting experience even if I don’t care about the games themselves.

H.264 decoder postmortem

Sunday, August 27th, 2023

I mentioned before couple of times that NihAV has its own functioning H.264 decoder. And after my failed attempts to use hardware accelerated decoding instead, I spend some time trying to optimise it but eventually gave up. On one hand it’s fast enough for my needs, on the other hand it’s too tedious to optimise it further (even if I can spare time on it, I’d rather not).

To put it into perspective, initially it was about three times slower than libavcodec one without SIMD optimisations, now it’s only about two times slower (with SIMD turned on it’s about five times as slow, feel free to laugh at me). But at the same time playing 720p content (and I have next to no files with larger resolution) in multi-threading mode takes 20-25% of the core so it’s not that bad.

So how the cycles are wasted and is there a potential for serious optimisation?
(more…)

NihAV: giving up on hardware acceleration

Thursday, August 3rd, 2023

After having several attempts on trying to add hardware-accelerated decoding support for NihAV I’m giving up, the reason being the sorry state of it in general.

I’m aware of two major APIs for hardware-accelerated video decoding for Linux, those are VDPAU and VA-API. Plus there are some specific toolkits e.g. from Intel but from what I remember those are even more complicated.

So, VDPAU has only bare-bone documentation without actual explanation what is expected for each codec in order to decode it. VA-API turned out to be even worse: it points out to 01.org for documentation which no longer exists (and redirects to some Intel’s page blurbing how great they are at open source). And web.archive.org shows that that page essentially contained a link to libva and libva-utils repositories plus some references to the projects that have VA-API support implemented. “…so shut up and go away” was not written but implied.

At least VA-API has three crates implementing its bindings in Rust and not just one not updated in four years like VDPAU but how usable are those? There’s FeV that seems to support JPEG decoding only (and has a strict warning against AMD GPUs), there’s libva-sys that is a pile of auto-generated bindings and there’s cros-libva. The latter seems to be the cleanest one and most actively developed (too actively developed to my taste as it changes base APIs every couple of months). Unfortunately it’s still not exactly clear how to use it for H.264 decoding (and the cros-codecs crate provides equally confusing API). And the final straw is that it seems to be intended for single-thread use only by design, which means it’s not possible to use with my library (e.g. my player uses separate threads for audio and video decoding, so I can’t use the standard decoder interface for hardware-accelerated decoding without some ugly hacks).

Oh well, I’ll work on improving my own H.264 decoder performance—while it’s not much fun either at least it’s clear what I can do with it and how it can be done.

P.S. This reminds me of the situation with ALSA. From what I heard it’s one of the worst documented subsystems in Linux with too flexible interface, to the point that it took help from ALSA developers to make at least MPlayer support ALSA output. The most probable reason is that it’s common to smoke weed in Czechia (where ALSA was developed), but what is the excuse for the other libraries?

Why I work on NihAV

Sunday, July 30th, 2023

I started NihAV as a more or less toy project to play with different concepts and try new stuff like finding out how vector quantisation works or attempting to write an encoder. Having enough experience with libavcodec and libavformat, I did not want to touch them again (and still don’t) and there was a hope that rust-av will provide a viable albeit limited alternative for multimedia playback (it still hasn’t). In theory I’ve achieved my original goals—NihAV supports decoding a lot of exotic formats (some of which are not handled by any other open-source project), it even has some encoders and its own transcoder tool and there’s even two players (one for audio files, another one can also play videos). So I could relax and do something else entirely but yet I’m working on adding new features to NihAV that take a lot of effort and do not bring me joy. Why?

(more…)

NihAV: updated for Rust 1.69

Thursday, July 27th, 2023

Since I had nothing better to do I decided to optimise my H.264 decoder a bit more, and that required a rather recent version of rustc that supports sym construct in asm!{} (so I can reference data tables in the inline assembly). Why this specific version though? I picked whatever was both recent enough to support the aforementioned feature (and older version had multiple micro version releases which hints on some problems with them) and not too recent either (again, I’m no beta tester of the compiler and I don’t need other shiny features).

And while at it I decided to make the code a bit more up to date. cargo-clippy is still annoying with its default warning about all-caps names and some lints that changed names and their suppressors no longer work. Getting rid of some leftover hints for the old versions of the compiler (like explicit drop()s for the objects borrowing code and some type hints) was nice though. Inline assembly is still only halfway done, especially considering that using const in it won’t be possible in stable for a long time and sym sucks compared to GCC inline assembly (it provides just a symbol name and you should magically know for yourself how the target platform works in order to make it possible to load it correctly; on AMD64 it’s rather simple but on aarch64 and on 32-bit ARMs that depends on target OS and PIC mode). Who would’ve thought that assembly may be platform-dependent! Looks like the current solution to that problem is to expose current configuration to the user so it’s up to you to check all environment variables and write the appropriate code. And of course even that solution will be available some time in the future since the developers haven’t thought about it at all.

Anyway, now my H.264 decoder features some more assembly optimisations and decodes video even faster than before. Though I fear it still takes too much CPU for the comfortable playback of my typical content so I’ll have to dabble in the hardware video acceleration. NihAV is a learning project after all.

NihAV: another step to being self-sufficient

Tuesday, June 13th, 2023

I’ve mentioned previously that I played with my H.264 decoder trying to make it multi-threaded. Now I went a bit further and plugged it into my video player. So now instead of hopelessly lagging on 720p video it can play it in real time just fine—so after improving my player even further (and enabling assembly optimisations when Rust compiler is good enough for that) I can use it to play most of the videos I care about without resorting to the external decoders or players. And in theory using it more will lead to fixing and polishing it more thus forming a stable loop.

Anyway, the code is not public yet as I hacked this new encoder in a separate crate and I still need to merge it back and clean up a bit, but I’d like to describe the interfaces and my reasons behind them.

So, multi-threaded decoder has a separate interface (for obvious reasons). I thought about writing a wrapper for single-threaded decoders to behave like multi-threaded ones but decided against it (at least for now). NADecoderMT has the following methods:

  • init()—initialises the decoder. One of the parameters is number of threads to use. IMO it’s the caller that decides how many threads it can spare as the decoder does not know what will be done in parallel (maybe there’s another multi-threaded decoder or two are running);
  • can_take_input()—queries if the decoder is ready to queue the next frame for decoding. Of course you can call queue_pkt() and check if it accepted the input but it may not always be desired (e.g. if we need to retrieve an input packet and then hold it waiting until the decoder is ready to accept it);
  • queue_pkt()—tries to queue the next frame for decoding;
  • has_output()—checks if the decoder has produced some frames for the output. Since get_frame() is waiting for a frame to be decoded this function is necessary unless you want to block the thread calling the decoder;
  • get_frame()—waits until at least one frame is decoded and returns it (or a special error if there are no frames to be decoded);
  • flush()—stops decoding all frames and clears the state (e.g. after seek).

Another peculiarity of this decoder interface is that it operates on pairs of a frame and its sequential number. The reason is simple: you get decoded frames out of order so you need to distinguish them somehow (and in case of a decoding error we need to know which frame caused it).

This also leads to a special frame reorder mechanism for such codecs. I’ve created MTFrameReorderer that requires you to “register” frame for decoding (providing you with an ID that is fed to the decoder along with frame data) and to “unregister” frame on error (that’s one of the places where returned frame ID comes in handy). Unfortunately it’s not possible to create a generic reorderer that would a) work completely codec-agnostic b) not require a whole file (or an indefinitely long sequence of frames) to be buffered before output and c) produce monotone increasing sequence of frames. Considering how H.264 has no real concept of frames and can build a pyramid of referenced frames adding layer by layer (and mind you, some frames may have an error during decoding and thus not present in output). I simply gave up and made a heuristic that checks if we have enough initial frames decoded and outputs some of them if it’s possible. At least it seems to work rather fine on the conformance suite (except for a couple of specially crafter files but oh well).

Maybe in the future I’ll try more multi-threaded decoders but for now even one decoder is enough, especially such practical one. Still, I need to find something more interesting to do.

Further Cinepak experiments

Monday, June 5th, 2023

For having nothing better to do I kept experimenting with Cinepak encoder.

I considered implementing some variant of codebook decomposition scheme suggested by Tomas in the comments to the previous post but I’m still not sure if I should bother even if it looks promising. So I tried the old thresholds-based scheme instead.

And what do you know, it speeds things up considerably: my usual test sample gets encoded in 27-35 seconds (depending on thresholds) instead of 44 seconds in the usual mode. But since I don’t know what would be good thresholds I did the opposite and added a refinement mode: after deciding which codebook to use for which block I re-generate codebook using only those blocks that belong to it. Of course it increases processing time, for example that file it takes 75 seconds to encode with refinement—which is still 70% more time but still less than double (for comparison, in full ELBG mode it’s an increase from about 160 seconds to 270 seconds).

So by rough estimate selecting only relevant blocks for codebook generation shaves 20-40% off the encoding time. And splitting data into partitions and generating a codebook by parts made the process about three times faster. I suspect that with a proper approach to clustering vector quantisation can be made two-three times faster but I don’t think I want to experiment with that. I should call it a day and move to something else instead.

Quick experiments with Cinepak encoder vector quantisation

Saturday, June 3rd, 2023

Out of curiosity I decided to check how partitioning input before creating a codebook affects encoding speed. So I’ve added a mode to Cinepak encoder that partitions vectors by luma variance and creates a part of common codebook just for them. The other two modes are median cut (the simplest one but also with mediocre output) and ELBG (that uses median cut to create the initial codebook—also if it’s not full that means we have all possible entries and do not need to perform ELBG at all).

Here are rough results on encoding several different files (and using different number of strips): median cut worked for 11-14 seconds, ELBG took 110-160 seconds, new mode (I decided to call it fast) takes 43-62 seconds. I think even such approximate numbers speak for themselves. Also there’s an interesting side effect: because of the partitioning it tends to produce smaller codebooks overall.

And while we’re speaking about quantisation results, here’s the first frame of waterfall sample encoded in different modes:

median cut

fast

full ELBG

As you can see, median cut produces not so good images but maybe those artefacts will make people think about the original Cinepak more. Fast mode is much nicer but it still has some artefacts (just look at the left edge of the waterfall) but if you don’t pay too much attention it’s not much worse than full ELBG.

Are there ways to improve it even further? Definitely. For starters, the original encoder exploits the previous codebook to create a new one while my encoder always generates a new codebook from scratch (in theory I can skip median cut stage for inter strips but I suspect that ELBG will work much longer in this case). The second way is to fasten up the ELBG itself. From what I could see it spends most of the time determining to which cluster each of the points belong. By having some smarter structure (something like k-d tree and some caching to skip recalculating certain clusters altogether) it should be possible to speed it up in several times. Unfortunately in this case I value clarity more so I’ll leave it as is.

P.S. I may also try to see how using thresholds and block variance to decide its coding mode affects the speed and quality (as in this case we first decide how to code blocks and then form codebooks for them instead of forming codebooks first and then deciding which mode suits the current block better; and in this case we’ll have smaller sets to make codebooks from too). But I may do something different instead. Or nothing at all.

NihAV experiments: multi-threaded decoder

Thursday, June 1st, 2023

In my efforts to have an independent player (that relies on third-party libraries merely for doing input and output while the demuxing and decoding is done purely by NihAV) I had to explore the way of writing a multi-threaded H.264 decoder. And while it’s not working perfectly it’s a good proof of a concept. Here I’ll describe how I hacked my existing decoder to support multi-threading.
(more…)