Archive for the ‘NihAV’ Category

Rust inline assembly experience

Saturday, September 3rd, 2022

Since I need something less exciting than the series about IAEA willingly ignoring the terrorists occupying the largest nuclear power plant in Europe (even during its mission visit there) or the series called “what russia destroyed in my home city today”, I’ve tried inline assembly support in recent stable Rust compiler, here’s a short report.

Since I’m working on a multimedia framework, my primary interest in inline assembly is how well I can add SIMD code for various codecs. My previous attempt was optimising the adaptive filter in the Monkey’s Audio decoder, and while it worked, the code looked ugly because of the way Intel named its intrinsics (if you like names like _mm_madd_epi16 then our tastes are very different) and the verbosity (the constant need to cast vectors to different types). So I decided to wait until non-experimental inline assembly support was ready.

This time I’ve decided to see how easy it is to make SIMD optimisations for my own H.264 decoder (and it needs them in order to be usable when I finally switch to my own video player). Good things: I’ve managed to speed up overall decoding by about 20%. Bad things: a lot of things can’t be made faster because of the limitations.

For those who are not familiar, an H.264 decoder performs a lot of typical operations on blocks with sizes 2×2, 4×4, 8×8 or 16×16 (or rectangular blocks made by splitting those in half), with the operations being copying, adding data to a block, averaging two blocks and so on.
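
To make it concrete, here is what one of those operations looks like as plain scalar code (the signature is made up for illustration, not NihAV's actual helper): exactly the kind of per-pixel loop that SIMD versions are meant to replace.

```rust
// Average `src` into `dst` with rounding, one pixel at a time,
// as done e.g. when averaging two predicted blocks.
// Hypothetical signature; real decoder helpers differ.
fn avg_block(dst: &mut [u8], src: &[u8], width: usize, height: usize, stride: usize) {
    for row in 0..height {
        for x in 0..width {
            let idx = row * stride + x;
            dst[idx] = ((dst[idx] as u16 + src[idx] as u16 + 1) >> 1) as u8;
        }
    }
}
```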

Writing the code itself is nice: you can have a function with a single unsafe{ asm!(..); } statement in it and let the compiler figure out the details (the rather famous x86inc.asm is mostly written to deal with the discrepancies between ABIs on different platforms and for templating MMX/SSE/AVX code). Even nicer is that you can specify arguments in a clear way (which is much better than passing constraints in three groups as in GCC syntax) and use named arguments inside the code. Additionally it uses local labels in gas form, which are a bit clearer to use and don’t clutter debug symbols.
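
For illustration, here is a made-up minimal example of that shape (not actual NihAV code): one function, one asm!() block with named operands instead of GCC-style constraint groups. x86_64 only.

```rust
use std::arch::asm;

// Compute a + b * 2 in a single `lea` instruction, with each operand
// referenced by name inside the template.
#[cfg(target_arch = "x86_64")]
fn add_twice(a: u64, b: u64) -> u64 {
    let result: u64;
    unsafe {
        asm!(
            "lea {res}, [{a} + {b} * 2]",
            a   = in(reg) a,
            b   = in(reg) b,
            res = out(reg) result,
        );
    }
    result
}
```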

Now for the bad things: inline assembly support (as of rustc 1.62.1) is lacking for my needs. Here’s my list of the annoyances with ascending severity:

  • the problem with sub-registers: I had to fill an XMM register from a GPR so I wrote movd xmm0, {val} with val being a 32-bit value. The compiler generated a warning and the actual instruction in the binary was movq xmm0, rdx (which copies eight bytes instead of four). And it’s not immediately obvious that you should write it as movd xmm0, {val:e} in that case (at least Luca has reported this on my behalf so it may be improved soon);
  • asm!() currently supports only registers as input/output arguments while in reality it should be able to substitute some things without using registers for them—e.g. when an instruction can take a constant (like shifts; this is very useful for templated code) or a memory reference (there are not that many registers available on x86, so writing something like paddw xmm1, TABLE[{offset}] would save an XMM register otherwise spent on loading the table contents explicitly). I’m aware there’s work going on in that area, so in the future we should be able to use const and sym input types, but currently they are unstable;
  • and the worst issue is the lack of templating support. For instance, I have functions for faster averaging of two blocks—they simply load a certain number of pixels from each line, average them and write the result back. For the 16×16 case I additionally unroll the loop a bit more. It would be nice to put this into a single macro that instantiates all the variants by substituting the load/store instructions and enabling certain additional code inside the loop when the block width is sixteen. Of course I can work around it by copy-pasting and editing the code, but this process is prone to introducing errors (especially when you confuse two nearly identical functions—and they tend to become long when written in assembly). And I can’t imagine how to use macro_rules!() to either construct asm!() contents from pieces or cut some content out of it. Having several asm!() blocks one after another is not always feasible either, as nobody can guarantee that the compiler won’t insert some code between them to juggle the registers used for the arguments.
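
To demonstrate the first point, here is a small x86_64-only helper (invented for this post) where the `:e` modifier forces the 32-bit sub-register, so the compiler emits a proper movd instead of silently producing movq:

```rust
use std::arch::asm;

// Move a 32-bit value into an XMM register and read it back.
// `{val:e}` selects the 32-bit sub-register (e.g. `edx` rather than `rdx`),
// which makes `movd` copy exactly four bytes, zero-extending the rest.
#[cfg(target_arch = "x86_64")]
fn gpr_to_xmm_low32(val: u32) -> u64 {
    let ret: u64;
    unsafe {
        asm!(
            "movd xmm0, {val:e}",
            "movq {ret}, xmm0",
            val = in(reg) val,
            ret = out(reg) ret,
            out("xmm0") _,
        );
    }
    ret
}
```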

All in all, I’d say that inline assembly support in Rust is promising but not yet fully usable for my needs.

Update. Luca actually tried to solve the templating problem and even wrote a post about it. There’s a limited way to do it via concat!() instead of string substitution, and a somewhat convoluted way to fit several blocks inside one assembly template. It’s not perfect but if you’re desperate enough it should work for you.
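
Here is my own toy reconstruction of the concat!() trick (not Luca's code): a macro_rules!() template pastes different load/store mnemonics into an otherwise identical asm!() body, giving a poor man's x86inc-style template. x86_64 only.

```rust
use std::arch::asm;

// Generate a function that averages 16 bytes of `src` into `dst`
// with rounding (pavgb), with the move instruction substituted
// via concat!() in the asm template.
#[cfg(target_arch = "x86_64")]
macro_rules! make_avg {
    ($name:ident, $mov:literal) => {
        fn $name(dst: &mut [u8; 16], src: &[u8; 16]) {
            unsafe {
                asm!(
                    concat!($mov, " xmm0, [{src}]"),
                    concat!($mov, " xmm1, [{dst}]"),
                    "pavgb xmm0, xmm1",
                    concat!($mov, " [{dst}], xmm0"),
                    src = in(reg) src.as_ptr(),
                    dst = in(reg) dst.as_mut_ptr(),
                    out("xmm0") _,
                    out("xmm1") _,
                );
            }
        }
    };
}

// Instantiate the unaligned variant; an aligned one would just pass
// "movdqa" (and then require 16-byte-aligned buffers).
#[cfg(target_arch = "x86_64")]
make_avg!(avg16_unaligned, "movdqu");
```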

Spectral Band Replication in various audio codecs

Thursday, July 7th, 2022

While the war continues I try to distract myself with something less unpleasant, recently it was implementing SBR support for AAC decoder in NihAV. And since I’ve done that why not talk about this technology and its varieties?

The idea of saving bits in a lossy codec by coding the less important high frequencies in an approximate way is not very new; the first case that comes to mind was the aacPlus codec from Coding Technologies that served as a base for AAC SBR (and MP3Pro if you remember such a thing). It codes the upper part of the spectrum very efficiently compared to plain AAC (I’d say it takes 4–12 kbps for that, compared to the 50–60 kbps used for the lower frequencies), which allowed cutting the audio bitrate significantly compared to AAC-LC of similar quality. And since it was a good idea, other codecs adopted it as well, but in different formats (because patent infringements are fun when both parties employ hordes of lawyers).

So, let’s look at how it’s done in the codecs I can remember (if you know more feel free to comment):

  • AAC SBR (also mp3PRO but who cares)—the original that inspired other implementations. It works by splitting a frame into a series of 64-band slots (using complex numbers, unless it’s the coarse low-power SBR that uses only real numbers), copying lower frequencies into the higher ones with a certain shape and adding scaled noise or tones (those two are mutually exclusive). For transmission efficiency most of those parameters are derived from the configuration (which is transmitted once per couple of frames), so essentially only the envelopes used to shape coefficients and noise plus some flags are coded. You have to generate a lot of tables (how QMF bands are grouped for the four modes of operation, what gains to apply to coefficients/noise/tones for each QMF band in each slot and so on). Eventually other variants were developed (because there are other AAC codecs that could use it) but the approach remains the same;
  • E-AC-3—this codec has SPectral eXtension, which divides the frame into fixed sub-bands, copies data from lower sub-bands, applies a specific scale to that data and blends it with noise scaled by another factor;
  • AC-4—this one has A-SPX that looks a lot like the original SBR (and considering that D*lby got the team behind it it’s not that surprising). I can’t be bothered to look at the finer details but from a quick glance it looks very similar (starting with all those FixVar and VarFix envelopes). If you want to know more about the implementation just ask Paul B. Mahol, it should be more fun than the usual questions about AC-4 he gets;
  • ATRAC9 (but not earlier)—this codec seems to split spectrum into four parts, fills them either with mirrored coefficients from below or with noise and applies coarser or finer scaling to those bands;
  • WMA9 (or was it WMAPro or WMA3?)—as usual, it’s the “we should overengineer AAC” approach. Not much is known about how it really functions, but it seems to split the higher frequencies into variable-length bands and code a motion vector for each band telling from which position to copy (and since audio frames in an MDCT-based codec are essentially P-frames, this is too close to being a video codec for my taste). There are three modes of operation for the bands too: copy data, fill with noise, or copy only the large coefficients and fill the rest with noise. I have the impression they tried to make it less computation-heavy than AAC SBR while having a similar or larger set of features.
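
As a toy illustration of the idea all of these share (and NOT any codec's actual algorithm; every name and constant below is invented for demonstration), the common core is: reconstruct the missing high band by copying scaled low-band coefficients and blending in pseudo-random noise.

```rust
// Fill spectrum[low_len..] from the decoded low band, applying a gain
// and mixing in noise from a cheap LCG stand-in for a real noise generator.
fn replicate_bands(spectrum: &mut [f32], low_len: usize, gain: f32, noise_level: f32) {
    let mut seed: u32 = 0x1234_5678;
    for i in low_len..spectrum.len() {
        // copy from the corresponding low-band coefficient
        let copied = spectrum[i - low_len] * gain;
        seed = seed.wrapping_mul(1664525).wrapping_add(1013904223);
        let noise = ((seed >> 16) as f32 / 32768.0) - 1.0; // roughly in [-1, 1)
        spectrum[i] = copied + noise * noise_level;
    }
}
```

Real codecs differ in how the bands are laid out, how the gains and noise levels are coded and shaped over time, and whether tones can be inserted, which is exactly where the variants above diverge.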

I guess you can see how these approaches are different yet alike at the same time, and why it was not much fun to implement. Yet I still don’t consider this time wasted as I gained more understanding of how it works (and why I didn’t want to touch it before). Now maybe it’s time to finally play with inline assembly in Rust.

Raw streams support in NihAV

Thursday, November 18th, 2021

Sadly there are enough MP3s in my music collection that I can’t keep ignoring the format, so I’ve finally implemented MP3 decoding support in NihAV. That involved introducing several new concepts which I’d like to review in this post.

Previously NihAV operated on a simple approach: there’s a demuxer that produces full packets, those packets are fed to the corresponding decoder and the decoded audio/video data is used somehow. With MP3 you have a raw stream of audio packets (sometimes with some additional metadata). While I could pretend to have a demuxer (one that simply reads data and forms packets) I decided to do it differently.

NihAV: now with Flash support

Tuesday, November 2nd, 2021

During my work on the VP6 encoder I was contacted by a Ruffle developer who was interested in it; one thing led to another and I licensed my decoder for use there (the main issues were cutting off all the NihAV interfaces it doesn’t need and selecting the license). But it’s over and they say it’s working fine. Meanwhile I got curious and decided to finally do what no other bit of open-source code could: encode VP6 to FLV without relying on any external software.

In addition to the FLV muxer I also implemented all known decoders, and that was an uneven load. One evening was enough to implement two and a half codecs: FLV1 (it’s just H.263 with a slightly different header and block format), Flash ADPCM (a slight variation of IMA ADPCM) and a bit of ASAO. Another day was spent on trying to make ASAO work properly (I dislike codecs with parametric bit allocation like this one; at least it’s not a typical speech codec). VP6 modifications took minutes, Flash Screen Video was done in less than an hour, Flash Screen Video 2 took the rest of a day (because I completely forgot how priming works there). I wasted another day on hacking barely enough support for onMetaData packet parsing and the other codec-specific bits in the FLV demuxer.

And now it’s ready and more or less working. It can even play H.264+AAC combination (remember when it was popular), the only codecs it does not support are Speex (I’m not sure if I ever want to touch it) and MP3 (this one I’ll deal with eventually and FLV will provide me with nicely split MP3 packets for testing before the infrastructure for handling raw streams is ready).

Now what to do next? It would be nice to have SANM/SMUSH support, maybe get to MP3 already (so nihav-sndplay is even more usable for me) or RE all those VoxWare codecs (I hope I can find the samples). There’s some interest in a barely functioning VP7 encoder too.

But who cares about that? I can encode VP6 into FLV now (even if I have no reasons to do so).

NihAV: now with lossless audio encoder

Tuesday, October 26th, 2021

Since I wanted to try my hoof at various encoding concepts it’s no wonder that after lossy audio encoder (IMA ADPCM with trellis encoding), lossless video encoder (ZMBV, using my own deflate implementation for compressing), lossy video encoder (Cinepak, for playing with vector quantisation, and VP6, for playing with many other concepts) it was time for a lossless audio encoder.

To remind you, there are essentially two types of lossless audio compressors—fast and asymmetric (based on LPC filters) and slow and symmetric (based on adaptive filters, usually long LMS ones). The theory behind them is rather simple and described below.
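
To give an idea of the "slow and symmetric" kind, here is a toy sign-sign LMS predictor (with made-up fixed-point scaling, far simpler than any production codec): the decoder runs the identical filter and adds the residual back, so no filter coefficients need to be transmitted.

```rust
// Predict each sample from the previous `order` samples using Q12
// fixed-point weights, output the prediction error, and adapt the
// weights from the signs of the error and the history (sign-sign LMS).
fn lms_residuals(input: &[i32], order: usize, mu: i32) -> Vec<i32> {
    let mut weights = vec![0i32; order];
    let mut history = vec![0i32; order];
    let mut residuals = Vec::with_capacity(input.len());
    for &sample in input {
        let pred = (weights.iter().zip(&history)
            .map(|(&w, &h)| w as i64 * h as i64)
            .sum::<i64>() >> 12) as i32;
        let err = sample - pred;
        residuals.push(err);
        // nudge each weight towards reducing the error
        for (w, &h) in weights.iter_mut().zip(&history) {
            *w += mu * err.signum() * h.signum();
        }
        history.rotate_right(1);
        history[0] = sample;
    }
    residuals
}
```

The LPC-based ("fast and asymmetric") family instead spends effort once per frame on computing fixed filter coefficients, transmits them, and decodes cheaply.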

VP8: dubious decisions, deficiencies and outright idiocy

Friday, October 15th, 2021

I’ve finally finished VP8 decoder for NihAV (which was done mainly by hacking already existing VP7 decoder) and I have some unpleasant words to say about VP8. If you want to read praises to the first modern open-source patent-free video codec (and essentially the second one since VP3/Theora) then go and read any piece of news from 2011. Here I present my experience implementing the format and what I found not so good or outright bad about the “specification” and the format itself.


VP6 encoding guide

Wednesday, October 6th, 2021

As I wanted to do before, I’ve written a short guide on how to encode VP6 to FLV. You can find it here, at NihAV site.

You should be able to encode raw video into VP6 in AVI or (with a slightly custom build) to VP6 in EA format (if you want to test if the encoder is good enough for modding purposes; but I guess even Peter Ross won’t care about that). As usual, it’s not guaranteed to work but it seems to work for me.

And that should be it. I might do VP7 encoder later (much later!) just for lulz but so far I can see way more interesting things to do (more formats to decode, lossless audio encoder and such).

VP6 encoder design

Saturday, October 2nd, 2021

This is the penultimate post in the series (there shall be another post, on how to use the encoder—but if there’s no interest I can simply skip it making this the last post in the series). As promised before, here I’ll present the layout and the details of my encoder.

Is VP8 a Duck codec?

Friday, October 1st, 2021

There’s a blog out there with posts dedicated to the history of On2 (née Duck). And one particular post (archived version) brought an unsettling thought that refuses to leave me. Does VP8 belong to Duck or Baidu (yes, I’ll keep calling this company by value) codecs?

Arguments for Duck theory:

  1. it was released in 2008, before acquisition (which happened in 2010);
  2. it can be seen as an improvement of VP7, which is definitely a Duck codec;
  3. its documentation is as lacking as for the previous codecs.

Arguments for Baidu theory:

  1. it became famous after the company was bought and the codec was open-sourced;
  2. as a follow-up from the previous item, there is an open-source library for decoding and encoding it (I think the previous source dump had an encoder just for TMRT and maybe it was an oversight);
  3. it has its own ecosystem (all previous codecs were stored in AVI, this one uses WebMKV);
  4. I don’t have to implement it in NihAV (because I wanted nihav_duck crate to contain decoders for all Duck formats and if VP8 is not really a Duck codec I don’t have to do anything).

So, what do you think?

VP6 — rate control and rate-distortion optimisation

Thursday, September 30th, 2021

First of all, I want to warn you that the “optimisation” part of RDO comes from mathematics, where it means selecting the element that best satisfies certain criteria. Normally we talk about optimisation as a way to make code run faster, but the term has a more general meaning and here’s one of those cases.

Anyway, while there is a lot of theory behind it, the concepts are quite simple (see this description from a RAD guy for a short concise explanation). To oversimplify, rate control is the part of an encoder that makes it output a stream with certain parameters (i.e. a certain average bitrate, a limited maximum frame size and such) and RDO is a way to adjust the encoded stream by deciding how much you want to trade bits for quality in each particular case.

For example, if you want to decide which kind of macroblock to encode (intra or one of several kinds of inter) you calculate how much the coded blocks differ from the original one (that’s distortion) and add the cost of coding those blocks (aka rate) multiplied by lambda (our weight parameter that tells how much to prefer rate over distortion or vice versa). So you want to increase bitrate? Decrease lambda so fidelity matters more. You want to decrease the frame size? Increase lambda so bits are more important. From a mathematical point of view the problem is solved; from an implementation point of view, that’s where the actual problems start.
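
The mode decision just described can be sketched in a few lines (the struct and numbers are invented for illustration): pick the candidate minimising the Lagrangian cost J = distortion + lambda * rate.

```rust
// One coding candidate with its measured distortion and bit cost.
struct Candidate {
    name: &'static str,
    distortion: u64, // e.g. sum of squared differences against the source
    bits: u64,       // estimated cost of coding this candidate
}

// Return the candidate with the lowest J = distortion + lambda * bits.
fn best_mode<'a>(cands: &'a [Candidate], lambda: u64) -> &'a Candidate {
    cands.iter()
        .min_by_key(|c| c.distortion + lambda * c.bits)
        .expect("no candidates")
}
```

With a small lambda the faithful-but-expensive candidate wins; raise lambda and the cheap one takes over, which is exactly the knob rate control turns.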