Archive for the ‘NihAV’ Category

Quick experiments with Cinepak encoder vector quantisation

Saturday, June 3rd, 2023

Out of curiosity I decided to check how partitioning input before creating a codebook affects encoding speed. So I’ve added a mode to Cinepak encoder that partitions vectors by luma variance and creates a part of common codebook just for them. The other two modes are median cut (the simplest one but also with mediocre output) and ELBG (that uses median cut to create the initial codebook—also if it’s not full that means we have all possible entries and do not need to perform ELBG at all).

Here are rough results on encoding several different files (and using different number of strips): median cut worked for 11-14 seconds, ELBG took 110-160 seconds, new mode (I decided to call it fast) takes 43-62 seconds. I think even such approximate numbers speak for themselves. Also there’s an interesting side effect: because of the partitioning it tends to produce smaller codebooks overall.

And while we’re speaking about quantisation results, here’s the first frame of waterfall sample encoded in different modes:

median cut

fast

full ELBG

As you can see, median cut produces not so good images but maybe those artefacts will make people think about the original Cinepak more. Fast mode is much nicer but it still has some artefacts (just look at the left edge of the waterfall) but if you don’t pay too much attention it’s not much worse than full ELBG.

Are there ways to improve it even further? Definitely. For starters, the original encoder exploits the previous codebook to create a new one while my encoder always generates a new codebook from scratch (in theory I can skip median cut stage for inter strips but I suspect that ELBG will work much longer in this case). The second way is to fasten up the ELBG itself. From what I could see it spends most of the time determining to which cluster each of the points belong. By having some smarter structure (something like k-d tree and some caching to skip recalculating certain clusters altogether) it should be possible to speed it up in several times. Unfortunately in this case I value clarity more so I’ll leave it as is.

P.S. I may also try to see how using thresholds and block variance to decide its coding mode affects the speed and quality (as in this case we first decide how to code blocks and then form codebooks for them instead of forming codebooks first and then deciding which mode suits the current block better; and in this case we’ll have smaller sets to make codebooks from too). But I may do something different instead. Or nothing at all.

NihAV experiments: multi-threaded decoder

Thursday, June 1st, 2023

In my efforts to have an independent player (that relies on third-party libraries merely for doing input and output while the demuxing and decoding is done purely by NihAV) I had to explore the way of writing a multi-threaded H.264 decoder. And while it’s not working perfectly it’s a good proof of a concept. Here I’ll describe how I hacked my existing decoder to support multi-threading.
(more…)

rv4enc: probably done

Saturday, May 13th, 2023

In one of the previous posts I said that this encoder will likely keep me occupied for a long time. Considering how bad was that estimation I must be a programmer.

Anyway, there were four main issues to be resolved: compatibility with the reference player, B-frame selection and performing motion estimation for interpolated macroblocks in them, and rate control.

I gave up on the compatibility. The reference player is unwieldy and I’d rather not run it at all let alone debug it. Nowadays the majority of players use my decoder anyway and the produced videos seem to play fine with it.

The question of motion vector search for interpolated macroblocks was discusses in the previous post. The solution is there but it slows down encoding by several times. As a side note, by omitting intra 4×4 mode in B-frames I’ve got a significant speed-up (ten to thirty percent depending on quantiser) so I decided to keep it this way by default.

The last two issues were resolved with the same trick: estimating frame complexity. This is done in a relatively simple way: calculate SATD (sum of absolute values of Hadamard-transformed block) of the differences between current and some previous frame with motion compensation applied. For speed reasons you can downsample those frames and use a simpler motion search (like with pixel-precision only). And then you can use calculated value to estimate some frame properties.

For example, if the difference between frames 0 and 1 is about the same as the difference between frames 1 and 2 then frame 1 should probably be coded as B-frame. I’ve implemented it as a simple dynamic frame selector that allows one B-frame between reference frames (it can be extended to allow several B-frames but I didn’t bother) and it improved coding compared to the fixed frame order.

Additionally there seems to be a correlation between frame complexity and output frame size (also depending on the quantiser of course). So I reworked rate control system to rely on those factors to select the quantiser for I- and P-frames (adjusting them if the predicted and the actual sizes differ too much). B-frames simply use P-frame quantiser plus constant offset. The system seems to work rather well except that it tends to assign too high quantisers for some frames, resulting in rather crisp I-frame followed by more and more blurry frames.

I suppose I’ll play with it for a week or two, hopefully improving it a bit, and then I shall commit it and move to something else.

P.S. the main goal of NihAV is to provide me with a playground for learning and testing new ideas. If it becomes useful beside that, that’s a bonus (for example, I’m mostly using nihav-sndplay to play audio nowadays). So RealVideo 4 encoder has served its purpose by allowing me to play more with various concepts related to B-frames and rate control (plus there were some other tricks). Even if its output makes RealPlayer hang, even if it’s slow—that does not matter much as I’m not going to use it myself and nobody else is going to use it either (VP6 encoder had some initial burst of interest from some people but none afterwards, and nobody cares about RV4 from the start).

Now the challenge is to find myself an interesting task, because most of the tasks I can think about involve improving some encoder or decoder or—shudder—writing a MOV/MP4 muxer. Oh well, I hope I’ll come with something regardless.

Starting yet another failure of an encoder

Thursday, April 6th, 2023

As anybody could’ve guessed from Cook encoder, I’d not want to stop on that and do some video encoder to accompany it. So here I declare that I’m starting working on RealVideo 4 encoder (again, making it public should prevent me from chickening out).

I can salvage up some parts from my VP7 encoder but there are several things that make it different enough from it (beside the bitstream coding): slices and B-frames. Historically RealMedia packets are limited to 64kB and they should not contain partial slices (grouping several slices or even frames in the packet is fine though), so the frame should be split during coding. And while Duck codecs re-invent B-frames to make them still be coded in sequence, RealVideo 4 has honest B-frames that should be reordered before encoding.

So while the core is pretty straightforward (try different coding modes for each macroblock, pick the best one, write bitstream), it gives me enough opportunity to try different aspects of H.264 encoding that I had no reason to care about previously. Maybe I’ll try to see if automatic frame type selection makes sense, maybe I’ll experiment with more advanced motion search algorithms, maybe I’ll try better heuristics for e.g. quantiser selection.

There should be a lot to keep me occupied (but I expect to spend even more time on evading that task for the lack of inspiration or a sheer amount of work to do demotivating me).

NihAV: towards RealMedia encoding support

Friday, March 10th, 2023

Since I have nothing better to do with my life, I’ve decided to add a rudimentary support for encoding into RealMedia. I’ve written a somewhat working RMVB muxer (it lacks logical stream support that is used to describe the streaming capabilities, it also has some quirks in audio and video support but it seems to produce something that other players understand) and now I’ve made a Cook audio encoder as well.

Somebody who knows me knows that I fail spectacularly at writing non-trivial lossy audio encoders but luckily here we have a format which even I can’t botch much. It is based on parametric bit-allocation derived from band energies. So all it takes is to perform slightly modified MDCT, calculate band energies, convert them to scale factors, perform bit allocation based on them, pack band contents depending on band categories and adjust the categories until all bands fit the frame. All of these steps are well-defined (including the order in which bands should be adjusted) so making it all work is rather trivial. But what about determining the frame size and coupling mode?

As it turns out, RealMedia supports only certain codec profiles (or “flavors” in its terminology), so Cook has about 32 different flavours defined for different combinations of sample rates, number of channels and bitrates. And each flavour has an internally bitrate parameters (frame size and which coupling parameters to use) for each channel pair so you just pick a fitting profile and go on with it. In theory it’s possible to add a custom profile but it’s not worth the hassle IMO.

And now here are some fun facts about it. Apparently there are three internal revisions of the codec: the original version for RealMedia G2, version 2 for RealMedia 8 and multichannel encoder (introduced in RealMedia 10, when they’ve switched to AAC already). Version 2 is the one supporting joint-stereo coding. The codec is based on G.722.1 (the predecessor of CELT even if Xiph folks keep denying that) but, because Cook frames are 256-1024 samples instead of fixed 320-sample frames, they’ve introduced gains to adjust better for volume changes inside the frame (but I haven’t bothered implementing that part). That is probably the biggest principal change in the format (not counting the different frame sizes and joint stereo support in the version 2). And finally I should probably mention that there are some flavours with the same bitrate that differ by the frequency range and where the joint stereo part starts (some of those are called “high response music” whatever that means).

Time to move to the video encoder…

Indeo 3 encoder: done

Thursday, February 16th, 2023

After fixing some bugs I think my encoder requires no further improvement (there are still things left out there to improve but they are not necessary). The only annoying problem is that decoding some videos with the original VfW binary gives artefacts. Looks like this happens because of its peculiar way to generate and then handle codebooks: corrector pairs and quads are stored as single 32-bit word and its top bit is also used to signal if it’s a dyad or a quad. And looks like for some values (I suspect that for large negative ones) the system does not work so great so while e.g. XAnim module does what is expected, here it mistakes a result for another corrector type and decodes the following bytestream in a wrong way. Of course I’m not going to deal with that annoyance and I doubt anybody will care.

Also I’ve pruned one stupid bug from my MS Video 1 encoder so it should be done as well. The third one, Cinepak, is in laughable state (well, it encodes data but it does not do that correctly let alone effectively); hopefully I’ll work on it later.

For now, as I see no interesting formats to support (suggestions are always welcome BTW), I’ll keep writing toy encoders that nobody will use (and hopefully return to improving Cinepak encoder eventually).

Revisiting MSVideo1 encoder

Wednesday, February 8th, 2023

Recently somebody asked me a question about my MS Video 1 encoder (not the one in NihAV though) and I’ve decided to look if my current encoder can be improved, so I took Ghidra and went to read the binary specification.

Essentially it did what I expected: it understands quality only, for which it calculates thresholds for skip and fill blocks to be used immediately, clustering is done in the usual K-means way and the only curious trick is that it used luminance for that.

So I decided to use that idea for improving my own encoder. I ditched the generic median cut in favour of specially crafted clustering in two groups (I select the largest cluster axis—be it luma, red, green or blue—split 4 or 16 pixels into two groups by being above average or not and calculate the average of those two groups). This made encoding about two times faster. I’ve also fixed a bug with 8-colour blocks so now it encodes data properly (previously it would result in a distorted block). And of course I’ve finally made quality affect encoding process (also by generating thresholds, but with a different formula—unlike the original my encoder uses no floating-point maths anywhere).

Also I’ve added palette mode support. The idea is simple: internally I operate on pixel quads (red, green, blue, luma) so for palette mode I just need to replace an actual pixel value with the index of the most similar palette entry. For that task I reused one of the approaches from my palettiser (it should be faster than iterating over the whole palette every time). Of course the proper way would be to map colour first to have the proper distortion calculated (because the first suitable colour may be far from perfect) but I decided not to pursue this task further, even if it results in some badly-coded features sometimes. It’s still not a serious encoder after all.

Now this member of the early 90s video codecs ruling triumvirate should be good enough. Cinepak encoder is still rather primitive so I’ll have to re-check it. Indeo 3 encoder seems to produce buggy output on complex high-motion scenes (I suspect it’s related to the number of motion vectors exceeding the limit) but it will have to wait until I rewrite the decoder. And then hopefully more interesting experiments will happen.

Indeo 3 encoder: done

Monday, February 6th, 2023

I’ve done what I wanted with the encoder, it seems to work and so I declare it to be finished. It can encode videos that other decoders can decode, it has some adjustable options and even a semblance of rate control.

Of course I’ll return to it if I ever use it and find some bugs but for now I’ll move to other things. For instance, Indeo 3 decoder needs to be rewritten now that I understand the codec better. Also I have some ideas for improving MS Video 1 encoder. And there’s TrueMotion 1 that I wanted to take a stab at. And there are some non-encoder things as well.

There’s a lot of stuff to keep me occupied (provided that I actually get myself occupied with it in the first place).

Starting yet another useless encoder

Thursday, January 26th, 2023

Even before I started to write my series of posts on FFhistory, I had another work in progress already which I’m now making public in order not to chicken out (as I did several times already). I’m talking about Indeo 3 encoder.

Why Indeo 3 of all possible things? It’s both not your conventional DCT-based codec and it’s widespread enough to be of some limited use for me (being present in AVI, MOV and VMD containers, only Cinepak is more ubiquitous). I’m not as good as Mike Melanson but I’m willing to try my hoof at it.

The funny thing is, there’s an opensource decoder for it and even a decent description in US patent 5,386,232 from 1995 (so it’s expired already and anybody can write an encoder for it). The problem is that those two sources don’t match between each other and somewhat disagree with the official binary specification (I’m pretty sure that both Indeo3 decoders were REd from XAnim module). And Ghidra does not like VfW binary (maybe it’ll like the version inside QT6 better) so I can’t easily refer to it either.

Anyway, I attempted and gave up writing an encoder for Indeo 3 several times because of its perceived combinatoric complexity. First you need to split frame recursively into blocks—how to select them? Then you need to select one of the coding modes (again, how?) and codebooks (same question). Trying to think of a reasonable way to implement it all made me shudder and give up until I finally read the format description and persisted enough to write at least something working (side note: I also have the same problem with TrueMotion 1 encoder which I also want to write one day, hopefully it’ll be easier now).

Also I tried to look into the encoder implementation and found it as a bunch of magic numbers at work. I’m not joking, during initialisation it seems to set several dozens of various integers and floats and use them for various coding decisions (at least what I could understand from it is that codebook selection is kinda tied to the internal quantiser parameter which is calculated depending on bitrate/quality—and various magic numbers).

So I want to document how this codec works, what differs in the different descriptions of it and how my encoder decides what to use in different situations. This should amount to another dozen of posts that nobody will read.

Rust inline assembly experience

Saturday, September 3rd, 2022

Since I need something less exciting than the series about IAEA willingly ignoring the terrorists occupying the largest nuclear power plant in Europe (even during its mission visit there) or the series called “what russia destroyed in my home city today”, I’ve tried inline assembly support in recent stable Rust compiler, here’s a short report.

Since I’m working on a multimedia framework, my primary interest about inline assembly is how well I can add SIMD code for various codecs. My previous attempt was optimising the adaptive filter in Monkey’s Audio decoder and while it worked it looked ugly because of the way Intel named its intrinsics (if you like the names like _mm_madd_epi16 then our tastes are very different) and the verbosity (constant need to cast vectors to different types. So I decided to wait until non-experimental inline assembly support is ready.

This time I’ve decided to look how easy it is to make SIMD optimisations for my own H.264 decoder (and it needs them in order to be usable when I finally switch to my own video player). Good things: I’ve managed to speed up overall decoding about 20%. Bad things: a lot of things can’t be made faster because of the limitations.

For those who are not familiar, H.264 decoder contains a lot of typical operations performed on blocks with sizes 2×2, 4×4, 8×8 or 16×16 (or rectangular blocks made by splitting those in half) with operations being copying, adding data to a block, averaging two blocks and so on.

Writing the code by itself is nice: you can have a function with a single unsafe{ asm!(..); } statement in it and you let the compiler to figure out the details (the rather famous x86inc.asm is mostly written to deal with the discrepancies between ABIs on different platforms and for templating MMX/SSE/AVX code). Even nicer is that you can specify arguments in a clear way (which is much better than passing constraints in three groups for GCC syntax) and used named arguments inside the code. Additionally it uses local labels in gas form which are a bit clearer to use and don’t clutter debug symbols.

Now for the bad things: inline assembly support (as of rustc 1.62.1) is lacking for my needs. Here’s my list of the annoyances with ascending severity:

  • the problem with sub-registers: I had to fill XMM register from a GPR one so I wrote movd xmm0, {val} with val being a 32-bit value. The compiler generated a warning and the actual instruction in the binary was movq xmm0, rdx (which is copying eight bytes instead of four). And it’s not immediately clear that you should write it as movd xmm0, {val:e} in that case (at least Luca has reported this on my behalf so it may be improved soon);
  • asm!() currently supports only registers as input/output arguments while in reality it should be able to substitute some things without using registers for them—e.g. when instruction can take a constant number (like shifts, it’s very useful for templated code) or a memory reference (there are not so many registers available on x86 so writing something like paddw xmm1, TABLE[{offset}] would save one XMM register for loading table contents explicitly). I’m aware there’s a work going on that so in the future we should be able to use constant and sym input types but currently it’s unstable;
  • and the worst issue is the lack of templating support. For instance, I have functions for faster averaging of two blocks—they simply load a certain amount of pixels from each line, average them and write back. For 16×16 case I additionally unroll the loop a bit more. It would be nice to put it into single macro that instantiates all the variants by substituting load/store instruction and enabling certain additional code inside the loop when block width is sixteen. Of course I work around it by copy-pasting and editing the code but this process is prone to introducing errors (especially when you confuse two nearly identical functions—and they tend to become long when written in assembly). And I can’t imagine how to use macro_rules!() to either construct asm!() contents from a pieces or to make it cut out some content out of it. Having several asm!() blocks one after another is not always feasible as nobody can guarantee you that the compiler won’t insert some code between them to juggle the registers used for the arguments.

All in all, I’d say that inline assembly support in Rust is promising but not yet fully usable for my needs.

Update. Luca actually tried to solve templating problem and even wrote a post about it. There’s a limited way to do that via concat!() instead of string substitution and a somewhat convoluted way to fit some blocks inside one assembly template. It’s not perfect but if you’re desperate enough it should work for you.