Archive for the ‘NihAV’ Category

ZMBV support in NihAV and deflate format fun

Saturday, May 22nd, 2021

As I said in the previous post, I wanted to add ZMBV support to NihAV, mostly because it is a rather simple codec (which means I can write a decoder and an encoder for it without spending too much time), it’s lossless and it supports various bit-depths too (which means I can encode various content into it while preserving the original format).

I still had to improve my deflate support (both decompression and compression) a bit to handle the way the data is stored there. At least now I mostly understand what the various flags are for.

First of all, by itself the deflate format specifies just a bitstream split into blocks that may contain any amount of coded data. Each block starts at the next bit after the previous block has ended, with no byte alignment except by chance or after a copy block (which aligns the bitstream before storing the length and the block contents).

Then there is the raw format used in various containers (like Zip or gzip) and there’s the zlib format used in most cases where data is stored as part of some other format (that means you have two initial bytes like 0x78 0x5E and a four-byte Adler-32 checksum at the end).

So, ZMBV uses an unterminated stream format: the first frame contains the zlib header plus one or several blocks of data padded to the byte boundary with an empty copy block, the next frame contains a continuation of that stream (also one or more blocks padded to the byte boundary) and so on. This is obviously done so you can decode frames one after another and still exploit the redundancy from the previously coded frame data if you’re lucky.

Normally you would start decoding data and keep decoding it until the final block (there’s a flag in the block header for that) has been decoded—or error out earlier on insufficient data. In this case though we need to decode a data block, check if we are at the end of the input data and then return the decoded data. Similarly, during compression we need to encode all the current data and pad the output stream to the byte boundary if needed.
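As an illustration of this per-frame flushing, here is a sketch using the flate2 crate (NihAV uses its own deflate implementation, so this is only an analogy): one compressor instance lives for the whole stream and every frame is sync-flushed so its compressed data ends on a byte boundary.

    use flate2::{Compress, Compression, FlushCompress};

    fn compress_frames(frames: &[&[u8]]) -> Vec<Vec<u8>> {
        let mut enc = Compress::new(Compression::default(), true); // true = write the zlib header
        let mut packets = Vec::new();
        for frame in frames {
            // reserve enough space so a single call suffices;
            // a real encoder would loop until all input is consumed
            let mut out = Vec::with_capacity(frame.len() + frame.len() / 2 + 128);
            enc.compress_vec(frame, &mut out, FlushCompress::Sync)
                .expect("compression failed");
            packets.push(out); // the first packet starts with the zlib header, the rest continue the stream
        }
        packets
    }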

This is not hard or particularly tricky but it demonstrates that deflated data can be stored in different ways. At least now I really understand what that Z_SYNC_FLUSH flag is for.

Adding deflate support to NihAV

Tuesday, May 18th, 2021

Since I wanted to do something different I decided to finally implement deflate support for NihAV—by which I mean compression support in addition to decompression. Here is how well it went.

As usual, my goal was to implement it in a mostly straightforward way but with reasonable speed, instead of having something completely on par with zlib or better.

At first I implemented the simplest form of compression – copying data without compression (but with the proper headers and an Adler-32 checksum at the end). Then I added simple encoding with fixed codes that simply outputs symbols as is—no compression yet but at least it tests how well the bitstream is written. Then I moved to dynamic codes. Then I added a brute-force search and started encoding matches. So by the end of the weekend I had something working already and then I could make it faster and/or better.
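For reference, the Adler-32 checksum appended by the zlib wrapper is trivial to compute; this is just the textbook definition (the slow variant without deferred modulo), not NihAV’s actual code:

    fn adler32(data: &[u8]) -> u32 {
        const MOD: u32 = 65521;
        let (mut s1, mut s2) = (1u32, 0u32);
        for &b in data {
            s1 = (s1 + u32::from(b)) % MOD;
            s2 = (s2 + s1) % MOD;
        }
        (s2 << 16) | s1
    }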

Of course the first thing to remember is that you can reduce search time by using some structure for faster text search. I think a suffix trie is now popular but I settled for an old-fashioned hash over three bytes. Initially it was twice as slow: while the number of string comparisons decreased a hundredfold, updating the hash table on each step consumed too much time. So I switched to a linked-list hash that somewhat resembles FAT (i.e. for each position in the input you have a pointer to the next location with the same three-byte hash, plus an additional table pointing to the start of the chain for each hash key). And I calculate it once per large block, simply discarding matches outside the desired range. Of course this can be done better but it worked fast enough for me.
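Here is roughly what such a scheme looks like (a sketch with invented names, not the actual NihAV code): head[] points to the most recent position seen for a given three-byte hash and prev[] links every position to the previous one with the same hash, which is the FAT-like part.

    const HASH_BITS: u32 = 15;
    const NO_POS: u32 = u32::MAX;

    fn hash3(data: &[u8], pos: usize) -> usize {
        let h = (u32::from(data[pos]) << 16)
              | (u32::from(data[pos + 1]) << 8)
              |  u32::from(data[pos + 2]);
        (h.wrapping_mul(2654435761) >> (32 - HASH_BITS)) as usize
    }

    // both tables are built once for the whole block
    fn build_chains(data: &[u8]) -> (Vec<u32>, Vec<u32>) {
        let mut head = vec![NO_POS; 1 << HASH_BITS];
        let mut prev = vec![NO_POS; data.len()];
        for pos in 0..data.len().saturating_sub(2) {
            let h = hash3(data, pos);
            prev[pos] = head[h];  // remember the previous occurrence of this hash
            head[h] = pos as u32; // this position becomes the newest one
        }
        (head, prev)
    }

Searching for matches at some position then means starting at head[hash3(data, pos)] and walking prev[], skipping positions at or after the current one and stopping once they fall outside the allowed window.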

Now the compression. There are three main strategies I tried: the naïve aka greedy one (you simply output the longest match you can find at the current step), the lazy one (you also check whether the next position produces an even better match and use that instead—surprisingly enough this gives a significant benefit) and the theoretically optimal one (you construct a trellis and see which combination of matches and literals gives you the best coding; it has issues but in theory it’s the best one).
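The lazy strategy in sketch form (find_match is a hypothetical helper standing in for the hash-chain search, returning the best (length, distance) pair for a position with length 0 meaning “no match”; this is not the actual NihAV code):

    enum Token {
        Literal(u8),
        Match { len: usize, dist: usize },
    }

    fn lazy_parse<F: Fn(usize) -> (usize, usize)>(data: &[u8], find_match: F) -> Vec<Token> {
        let mut tokens = Vec::new();
        let mut pos = 0;
        while pos < data.len() {
            let (len, dist) = find_match(pos);
            if len >= 3 {
                // check whether deferring by one byte yields a longer match
                let next_len = if pos + 1 < data.len() { find_match(pos + 1).0 } else { 0 };
                if next_len > len {
                    tokens.push(Token::Literal(data[pos])); // emit a literal now...
                    pos += 1;                               // ...and take the longer match next round
                    continue;
                }
                tokens.push(Token::Match { len, dist });
                pos += len;
            } else {
                tokens.push(Token::Literal(data[pos]));
                pos += 1;
            }
        }
        tokens
    }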

So why is it “theoretically optimal” and not just optimal? Because it needs the accurate bit cost and you can’t know it until you have produced all the symbols to be encoded and calculated the actual code lengths for them. Of course you can do it in an iterative process or employ a heuristic to predict the bit length somehow, but I simply used “9 bits for the symbol plus, if a distance is present, 5 bits plus its extra bits”. I think in some cases it even produced larger files than lazy parsing.
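In code form that heuristic amounts to something like this (just the numbers quoted above, not a claim about the real cost model):

    // approximate cost in bits of coding one token
    fn estimate_bits(is_match: bool, dist_extra_bits: u32) -> u32 {
        // a literal or length symbol costs about 9 bits; a match additionally
        // needs a distance symbol (about 5 bits) plus its extra bits
        9 + if is_match { 5 + dist_extra_bits } else { 0 }
    }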

Here is a list, from the top of my head, of things that can be improved (but I guess anybody who has written an LZ77-based compressor knows them better than me):

  • method selection—sometimes copying data verbatim is better (in the case of noise) or using fixed codes (because the overhead from transmitting dynamic codes eats all the advantage);
  • partitioning—currently I use 64kB blocks but depending on the content (usually detected by symbol frequency variations) it may be better to cut a block earlier or make it larger. I played a bit with the block size but changing it (in either direction) currently leads to compression ratio drops;
  • faster search for the matching strings;
  • heuristics for either faster optimal parsing or for making the other methods compress better.

Of course some of it can be sped up by simply using unsafe Rust so no checks on array access are performed but I don’t think it’s worth it now.

And finally, here are some benchmarks for the curious, performed on a source file of the program:

  • copy: 32156 bytes (from 32145 bytes)
  • fixed codes and greedy search: 7847 bytes, 80ms
  • dynamic codes and greedy search: 6818 bytes, 80ms
  • dynamic codes and lazy search: 6665 bytes, 100ms
  • dynamic codes and “optimal” search: 6529 bytes, 690ms
  • gzip -9 for the reference: 6466 bytes, <10ms

As you can see it’s not fast but it works. I also checked that the resulting compressed data is decoded fine (plus some tests on large files that get split into several blocks). Now all that’s left is to implement the ZMBV decoder and encoder.

Missing optimisation opportunity in Rust

Wednesday, May 12th, 2021

While I’m struggling to write a video player that would satisfy my demands I decided to see if it’s possible to make my H.264 decoder a bit faster. It turned out it can be done with ease and that also raises the question concerning the title of this post.

What I did cannot truly be called optimisations, rather “optimisations”, yet they gave a noticeable speed-up. The main optimisation candidates were the motion compensation functions. First I shaved off a tiny fraction of a second by not zeroing temporary arrays, as their contents will be overwritten before the first read.

And then I replaced the idiomatic Rust code for working with a block, like

    for (dline, (sline0, sline1)) in dst.chunks_mut(dstride).zip(tmp.chunks(TMP_BUF_STRIDE).zip(tmp2.chunks(TMP_BUF_STRIDE))).take(h) {
        for (pix, (&a, &b)) in dline.iter_mut().zip(sline0.iter().zip(sline1.iter())).take(w) {
            *pix = ((u16::from(a) + u16::from(b) + 1) >> 1) as u8;
        }
    }

with raw pointers:

    unsafe {
        let mut src1 = tmp.as_ptr();
        let mut src2 = tmp2.as_ptr();
        let mut dst = dst.as_mut_ptr();
        for _ in 0..h {
            for x in 0..w {
                let a = *src1.add(x);
                let b = *src2.add(x);
                *dst.add(x) = ((u16::from(a) + u16::from(b) + 1) >> 1) as u8;
            }
            dst = dst.add(dstride);
            src1 = src1.add(TMP_BUF_STRIDE);
            src2 = src2.add(TMP_BUF_STRIDE);
        }
    }

What do you know, the total decoding time for the test clip I used shrank from 6.6 seconds to 4.9 seconds. That’s just three quarters of the original time!

And here is the problem. In theory, if the Rust compiler knew that the input satisfies certain constraints, i.e. in this case that there’s always enough data to perform the full block operation, it would be able to optimise the code as well as the one I wrote using pointers, or even better. But unfortunately there is no way to tell the compiler that the input slices are large enough to perform the operation the required number of times. Even if I added a mathematically correct check at the beginning it would not eliminate most of the checks.

Let’s see what happens with the iterator loop step by step:

  1. first all sources are checked to be non-empty;
  2. then in the outer loop the remaining length of each source is checked to see if the loop should end;
  3. then it is checked that the outer loop has run no more than the requested number of times (i.e. just the block height);
  4. then the line lengths (in theory those may be shorter than the block width) and the requested width are checked to find out the actual length of the inner loop;
  5. and finally inside the loop it performs the averaging.

And here’s what happens with the pointer loop:

  1. outer loop is run the requested number of times;
  2. inner loop is run the requested number of times;
  3. operation inside the inner loop is performed.

Of course those checks are required to make sure you work only with the accessible data, but it would be nice if I could either mark loops as “I promise it will run exactly this number of times” (maybe via .take_exact() as Luca suggested, though I still don’t think it will work perfectly for the 2D case) or at least put code using slices instead of iterators into an unsafe {} block and tell the compiler that I do not want bounds checks performed inside.

Update: in this particular case the input buffer size should be stride * (height - 1) + width, i.e. it is always enough to perform the operation in the way described above, but if you use .chunks_exact() the last line might not be handled, which is wrong.

The former is rather hard to implement for the common case so I don’t think it will happen anywhere outside Fortran compilers, and the latter would conflict with the different Deref trait implementations for slices so it’s not likely to happen either. So doing it with pointers may be clunky but it’s the only way.
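For completeness, here is roughly what the “slices without bounds checks” middle ground looks like with get_unchecked()/get_unchecked_mut(), using the same variables as the snippets above; it avoids the checks but is hardly less clunky than the raw-pointer version (and it is not what the decoder actually uses):

    unsafe {
        for y in 0..h {
            for x in 0..w {
                let a = *tmp.get_unchecked(y * TMP_BUF_STRIDE + x);
                let b = *tmp2.get_unchecked(y * TMP_BUF_STRIDE + x);
                *dst.get_unchecked_mut(y * dstride + x) =
                    ((u16::from(a) + u16::from(b) + 1) >> 1) as u8;
            }
        }
    }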

Fixing SVQ1 decoding bug

Saturday, March 6th, 2021

In the comments to the previous post a certain Paul B. pointed out that the SVQ1 decoder (either the one in libavcodec or mine) decodes certain files with visual artefacts. So I opened the old dreary QuickTime.qts with Ghidra to look at its contents once again (last time it was for QDesign Music details but luckily I had marked the SVQ1 decoder functions as well).

The official binary specification turned out to have a slightly different design, with just one block decoding function that gets intra or inter codebooks passed to it (so an intra block is essentially a residue added to a zero block using the intra codebooks). And, more curiously, the codec uses 16-bit values for pixels up to the very end of decoding.

As you can guess, the artefacts that look like white blocks are caused by the pixel value going out of the 8-bit range. I actually hooked a GDB script to mplayer2, which loads the QuickTime decoder (and presents some garbage instead of a properly decoded frame), to see what happens in a block showing such an artefact. It turned out that a pixel with the original value 0xCF got increased to 0x14F during the codebook additions and the reference decoder output it as 0x4F. So I changed clamping to discarding the top bits and it works much better.
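A tiny illustration of the difference (my own example, using the values from the debugging session above):

    fn clamp_pixel(val: i32) -> u8 {
        val.clamp(0, 255) as u8 // 0x14F -> 0xFF, i.e. the white-block artefact
    }

    fn wrap_pixel(val: i32) -> u8 {
        (val & 0xFF) as u8      // 0x14F -> 0x4F, matching the reference decoder
    }

    fn main() {
        assert_eq!(clamp_pixel(0x14F), 0xFF);
        assert_eq!(wrap_pixel(0x14F), 0x4F);
    }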

Considering that the codebooks are stored as a single .dll resource and the block decoding function works (for performance reasons) as a chain of block-modifying functions with a stackless calling convention, I call the results good enough and leave further digging to those who want more.

ClearVideo briefly revisited

Thursday, December 31st, 2020

Since I had nothing better to do for the rest of this year (I expect the next year to begin in the same fashion) I decided to take a look at the problem where some files were decoded with inter-frames becoming distorted, as if some sharpening filter were constantly applied. And what do you know, there’s some smoothing involved in certain cases.
(more…)

Vivo2 revisited

Tuesday, December 22nd, 2020

Since I have nothing better to do (after a quick glance at the H.264 decoder—yup, nothing) I decided to look at Vivo 2 again to see if I can improve it from the “decoding and somewhat recognizable” stage to “mostly okay”.

To cut a long story short, Vivo 2 turned out to be an unholy mix of H.263 and MPEG-4 ASP. On one hoof you have the H.263 codec structure, H.263 codebooks and even the unique H.263 feature called PB-frames. On the other hoof you have coefficient quantisation like in MPEG-4 ASP and coefficient prediction done on unquantised coefficients (H.263 performs DC/AC prediction on already dequantised coefficients while MPEG-4 ASP re-quantises them for the prediction).

And the main weirdness is the IDCT. While the older standards give just the ideal transform formula, multiplying by a matrix is slow and thus most implementations use some (usually fixed-point integer) approximation that also exploits internal symmetries for faster calculation (and hence one of the main problems with various H.263- and DivX-based codecs: if you don’t use exactly the same transform implementation as the reference you’ll get artefacts because those small differences accumulate). Actually ITU H.263 Annex W specifies a bit-exact transform but nobody cares at this point. And Vivo Video has a different approach altogether: it generates a set of matrices for each coefficient, so instead of performing the IDCT directly it simply sums one or two matrices for each non-zero coefficient (one matrix for the coefficient value modulo 32, another for the part of the value that is a multiple of 32). Of course it accounts for this being too coarse by multiplying the matrices by 64 before converting them to integers (so the resulting block should be scaled down by 64 as well).
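A sketch of that summation scheme (the table shapes and names are my guesses for illustration, and only positive coefficients are handled here for brevity):

    // For each non-zero coefficient one or two precomputed 8x8 matrices are
    // added: one selected by the value modulo 32 and one by the value divided
    // by 32. The matrices are assumed to be pre-multiplied by 64.
    fn pseudo_idct(
        coeffs: &[i32; 64],
        low_mats:  &[Vec<[i32; 64]>; 64], // [coef position][value % 32]
        high_mats: &[Vec<[i32; 64]>; 64], // [coef position][value / 32]
    ) -> [i32; 64] {
        let mut acc = [0i32; 64];
        for (pos, &val) in coeffs.iter().enumerate() {
            if val <= 0 { continue; } // negative values omitted in this sketch
            let low  = (val & 31) as usize;
            let high = (val >> 5) as usize;
            if low != 0 {
                for (a, &m) in acc.iter_mut().zip(low_mats[pos][low].iter()) { *a += m; }
            }
            if high != 0 {
                for (a, &m) in acc.iter_mut().zip(high_mats[pos][high].iter()) { *a += m; }
            }
        }
        for a in acc.iter_mut() {
            *a >>= 6; // undo the pre-multiplication by 64
        }
        acc
    }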

In either case it seems to work well enough, so I’ve finally enabled nihav-vivo in the list of default crates and can finally forget about it, as the rest of the world did.

NihAV: frame reordering

Friday, December 18th, 2020

Since I have nothing better to do I’d like to talk about how NihAV handles output frames.

As you might remember, I decided to make decoders output frames on a synchronous basis, i.e. if a frame comes to the decoder it should be decoded and output, and in case the codec supports B-frames the reordering may happen later in a special frame reorderer. The reorderer for a concrete decoder is selected based on the codec capabilities (if the format has no frame reordering then don’t do it).

Previously I had just two of them: NoReorderer (it should be obvious for which cases it is intended) and IPBReorderer for codecs with I/P/B-frames. The latter simply holds the last seen reference frame (an I- or P-frame) and outputs B-frames until the next reference frame comes. This worked as expected until I decided to implement the H.264 decoder and hit the famous B-pyramid (i.e. when B-frames serve as a reference for other B-frames or even P-frames). To illustrate, imagine an input sequence of frames I0 P4 B2 B1 B3 which should be output as I0 B1 B2 B3 P4. The approach from IPBReorderer would output it as I0 B2 B1 B3 P4, which is not quite correct. So I had to add the so-called ComplexReorderer which keeps an array of frames sorted by display timestamp and marks the frames up to a reference I- or P-frame as available for output when the next reference frame comes. Here’s a step-by-step example:

  • I0 comes and is stored in the queue;
  • P4 comes and is stored in the queue, I0 is marked as being ready for output;
  • B2 comes and is stored in the queue right before P4;
  • B1 comes and is stored in the queue right before B2 so the queue now is B1 B2 P4;
  • B3 comes and is stored in the queue between B2 and P4;
  • then the next reference frame comes; we store it and mark B1 B2 B3 P4 as ready for output.

Of course one can argue that this waits longer than needed and we should be able to output B1 and B2 even before B3 arrives (or, even better, output B1 immediately as it appears). That is true but it is rather hard to do in the general case. Real-world DTS values depend on the container timebase, so how do you know there are no additional frames in the sequence 0 1000 333 667 (plus the decoder can be told to stop outputting unreferenced frames)? Relying on frame IDs generated by the decoder? H.264 has three different modes of generating picture IDs, one of them assigning even numbers to frames (and odd numbers to the second frame field if present). While it can be resolved, that would complicate the code for no good reason. So as usual I picked the simplest working solution, trading theoretically lower latency for clarity and simplicity.
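And here is a much simplified sketch of the ComplexReorderer idea itself (names and details are mine, the actual NihAV code differs): frames are kept sorted by display timestamp and everything queued before a newly arrived reference frame becomes available for output.

    struct Frame; // stand-in for the real decoded frame type

    struct Reorderer {
        queue: Vec<(u64, Frame)>, // frames sorted by display timestamp
        ready: usize,             // how many frames from the front may be output
    }

    impl Reorderer {
        fn push(&mut self, pts: u64, frame: Frame, is_reference: bool) {
            if is_reference {
                // everything queued before this reference frame can now be output
                self.ready = self.queue.len();
            }
            // keep the queue sorted by display timestamp
            let pos = self.queue.partition_point(|&(ts, _)| ts < pts);
            self.queue.insert(pos, (pts, frame));
        }

        fn get_frame(&mut self) -> Option<(u64, Frame)> {
            if self.ready > 0 {
                self.ready -= 1;
                Some(self.queue.remove(0))
            } else {
                None
            }
        }
    }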

NihAV: optimisation potential

Sunday, December 13th, 2020

Today I can say what I’ve wasted about two months on: the H.264 decoder. For now it’s the only entry in the nihav-itu crate but I might add G.7xx decoders there or even the standard H.263 decoder in addition to all those decoders based on it.

Performance-wise it is not very good, about 2.5-3 times slower than the libavcodec one without SIMD optimisations on random BaidUTube 720p videos, but I haven’t tried to make it the fastest one and prefer clarity over micro-optimisations. Still, it has a lot of optimisation potential, as the title says. I suspect that even simply making the motion interpolation functions work on constant-size blocks would make it significantly faster, let alone adding SIMD. In either case it is fast enough to decode 720p at 2x realtime on my laptop, so if I ever finish a proper video player I can use it to watch content besides game cutscenes and a few exotic files.

As for the features, it’s limited but it should be able to play conventional files just fine plus a limited subset of the High profile (just 8-bit 4:2:0 YUV without custom scaling lists). A lot of features I don’t care about were ignored (proper loop filtering across slice edges—nope, weighted prediction—maybe later, high bit-depth or different chroma subsampling support—quite unlikely, interlaced formats—no in principle).

While developing that decoder I also gained a better knowledge of H.264 internals, for which I’m not that grateful, but that’s to be expected from a codec designed by a committee with features added to it afterwards.

In either case, hopefully I won’t be bored enough to do those optimisations unless I have to, so the potential will remain just potential and I’ll do some more interesting stuff instead. And there’s always Settlers II as the ultimate time consumer 😉

NihAV: audio player done

Wednesday, October 7th, 2020

As I wrote in my previous post, I had a functioning audio player nearing completion. And now I’ve finally added all the features I wanted and can call it done.

While previously I mostly ranted about the bloat introduced by the component authors, here I’d like to describe the design and the reasoning behind it.
(more…)

NihAV: towards an audio player

Sunday, October 4th, 2020

So after weeks of doing nothing and looking at lossless audio codecs (in no particular order) I’m going back to developing NihAV and more particularly an audio player.
(more…)