Archive for the ‘Audio’ Category

Looking at a-Pac

Monday, September 19th, 2022

Since currently there’s a preparation phase for the next Ukrainian counteroffensive and I don’t know what other features to add to NihAV (beside improving video player), I was bored enough to do this.

After a comment from Paul, I’ve looked at some random lossless audio codec from the 90s, namely a-Pac. While the codec by itself is very simple (no LPC or IIR filter, no variable-length codes except in the block header, no arithmetic coder either), as you can see from its description in the Wiki, it was a bit tricky to RE.

The main issues were that apac.exe is in NE format (which means 16-bit code, and Ghidra sucks at figuring out which segment registers should be used when) and, which is worse, that it was written in Delphi with all the quirks of its object model implementation. So only after some searching did I find the virtual table corresponding to the APAC format handler (both coding and decoding), and after trying one function after another I finally found the one responsible for encoding, after which finding the decoder function was easy.

The code itself (beside the weirdness introduced by Delphi compiler like using both positive and negative offsets in vtable calls) is very old-fashioned too: bit reading is done by keeping bitreader state as global variables, reading n-bit values by reading a single bit in a loop and refilling that bit buffer every byte (with a callback function).
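For illustration, that bit-at-a-time reading scheme can be sketched in Python roughly like this. It is not the actual a-Pac code: the original keeps its state in global variables while this sketch wraps it in a hypothetical BitReader class, and both the MSB-first bit order and the next_byte callback name are assumptions.

```python
class BitReader:
    """Reads bits one at a time, refilling a one-byte buffer through a callback."""

    def __init__(self, next_byte):
        self.next_byte = next_byte  # callback returning the next input byte
        self.buf = 0                # byte currently being consumed
        self.left = 0               # bits still unread in buf

    def read_bit(self):
        if self.left == 0:          # buffer exhausted: ask the callback for a byte
            self.buf = self.next_byte()
            self.left = 8
        self.left -= 1
        return (self.buf >> self.left) & 1   # MSB-first (assumed)

    def read(self, nbits):
        value = 0
        for _ in range(nbits):      # one bit per loop iteration
            value = (value << 1) | self.read_bit()
        return value


data = iter(b"\xA5\x3C")
reader = BitReader(lambda: next(data))
first = reader.read(4)    # upper nibble of 0xA5
second = reader.read(8)   # spans the byte boundary
```

Reading n bits costs n loop iterations here, which is exactly why modern bit readers keep a wider cache and extract whole fields with a shift and a mask instead.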

It’s still fun to see how we’ve moved from simple formats like this to overcomplicated formats like LA or OptimFrog while settling down on codecs of in-between complexity like FLAC, Monkey’s Audio and WavPack.

Spectral Band Replication in various audio codecs

Thursday, July 7th, 2022

While the war continues I try to distract myself with something less unpleasant; recently it was implementing SBR support for the AAC decoder in NihAV. And since I’ve done that, why not talk about this technology and its varieties?

The idea of saving bits in a lossy codec by coding less important high frequencies in an approximate way is not very new; the first case that comes to mind is the aacPlus codec from Coding Technologies that served as a base for AAC SBR (and MP3Pro if you remember such thing). It allows coding the upper part of the spectrum very efficiently compared to the normal AAC codec (I’d say it takes 4-12kbps for that compared to 50-60kbps used for the lower frequencies), which allowed cutting audio bitrate significantly compared to AAC-LC of similar quality. And since it was a good idea, other codecs used it as well but in a different format (because patent infringements are fun when both parties employ hordes of lawyers).

So, let’s look at how it’s done in the codecs I can remember (if you know more feel free to comment):

  • AAC SBR (also mp3PRO but who cares)—the original that inspired other implementations. It works by splitting a frame into a series of 64-band slots (using complex numbers unless it’s a coarse low-power SBR that uses only real numbers), copying lower frequencies into high ones using a certain shape and adding scaled noise or tones (those two are mutually exclusive). For transmission efficiency lots of those parameters are derived from the configuration (that is transmitted once for a couple of frames) and essentially only envelopes used to shape coefficients and noise plus some flags are coded. You have to generate a lot of tables (like how QMF bands are grouped for four modes of operation, what gains to use on coefficients/noise/tones for each QMF band in each slot and so on). Eventually there were other variants developed (because there are other AAC codecs that could use it) but the approach remains the same;
  • E-AC-3—this codec has SPectral eXtension which divides frame into fixed sub-bands and copies data from lower sub-bands, applies a specific scale to that data and blends it with noise scaled with another scale;
  • AC-4—this one has A-SPX that looks a lot like the original SBR (and considering that D*lby got the team behind it it’s not that surprising). I can’t be bothered to look at the finer details but from a quick glance it looks very similar (starting with all those FixVar and VarFix envelopes). If you want to know more about the implementation just ask Paul B. Mahol, it should be more fun than the usual questions about AC-4 he gets;
  • ATRAC9 (but not earlier)—this codec seems to split spectrum into four parts, fills them either with mirrored coefficients from below or with noise and applies coarser or finer scaling to those bands;
  • WMA9 (or was it WMAPro or WMA3?)—as usual it’s “we should overengineer AAC” approach. There’s not much known about how it really functions but it seems to split higher frequencies into variable-length bands and code motion vectors for each band telling from which position to copy (and since audio frames in MDCT-based codec are essentially P-frames, this is too close to being a video codec for my taste). There are three modes of operation for the bands too: copy data, fill with noise, or copy only the large coefficients and fill the rest with noise. I have an impression they tried to make it less computation-heavy than AAC SBR while having the similar or larger amount of features.
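The common core of all these schemes (copy lower coefficients upwards, scale them per band, blend in noise) can be shown with a toy sketch. This is not the algorithm of any codec listed above: the function name, the wrap-around copy, the linear blend and the uniform noise are all assumptions picked for brevity.

```python
import random


def extend_spectrum(low, n_high, band_gains, noise_mix, seed=0):
    """Build n_high upper coefficients from the lower ones plus noise (toy model)."""
    rng = random.Random(seed)                       # seeded so runs are repeatable
    band_len = max(1, n_high // len(band_gains))    # equal-width bands (assumed)
    high = []
    for i in range(n_high):
        band = min(i // band_len, len(band_gains) - 1)
        copied = low[i % len(low)]                  # patch taken from the low part
        noise = rng.uniform(-1.0, 1.0)
        blended = (1.0 - noise_mix) * copied + noise_mix * noise
        high.append(band_gains[band] * blended)     # per-band envelope scaling
    return list(low) + high


# With no noise the upper half is just a scaled copy of the lower half.
full = extend_spectrum([1.0, 2.0, 3.0, 4.0], 4, [0.5, 0.25], 0.0)
```

The only things a real bitstream has to carry for the upper half are the gains and the noise mix, which is where the bitrate saving comes from.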

I guess you can see how these approaches are different yet alike at the same time and why it was not much fun to implement it. Yet I still don’t consider this time wasted as I gained more understanding on how it works (and why I didn’t want to touch it before). Now maybe it’s time to finally play with inline assembly in Rust.

On Bluetooth codecs

Wednesday, December 15th, 2021

I got a strange request for LDAC decoder as it may help to “…verify (or debunk) Sony’s quality claims.” This made me want to write a post about the known BT codecs and what’s my opinion on them.

Bluetooth codecs can be divided into three categories: the standard ones defined in A2DP that nobody uses (MP3 and ATRAC), the standard ones that are widely used (AAC and SBC) and custom codecs supported by specific vendors.

So, let’s start with the codecs defined in A2DP (only SBC is mandatory there, the rest are optional):

  • SBC—a codec designed specifically for Bluetooth. It works like MPEG Audio Layer II but with 4 or 8 sub-bands and parametric bit allocation instead of fixed tables. This allows it to change bitrate at any frame (which allows it to adapt to changing transmission quality). I heard an opinion that it beats newer codecs at their bitrates in quality but the standard intentionally capped it to prevent that. I find that not that hard to believe;
  • MPEG-1,2 Audio—I’ve not heard that anybody actually uses them and it’s for the best;
  • MPEG-2,4 AAC—it should give better quality than SBC but for a much larger delay and decoding complexity;
  • ATRAC family—this feels like a proprietary replacement of AAC to me. I’ve not heard that anybody actually supports any of the codecs in their products (it’s not that I’ve heard much about BT in general though).

Here I should also mention a candidate codec named LC3 (and LC3plus). Whatever audio codec FhG IIS develops, it’ll be AAC. LC3 is no exception as at first glance it looks like AAC LC with arithmetic coding and some additional coding tools glued to it.

There’s CVSD codec for speech transmission over BT. It’s a speech codec and that’s enough about it.

Now let’s move to the proprietary codecs:

  • aptX—a rather simple codec with 4:1 compression ratio (four 16-bit samples into single 16-bit word). The codec works by splitting audio into four sub-bands, applying ADPCM and quantising to the fixed amount. Beside inability to adapt to bad channels it should produce about the same quality as SBC (at least from a feature comparison point of view);
  • aptX HD—the same as non-HD version but works on 24-bit samples (and probably the only honest high-res/high-definition codec here);
  • aptX other variants—they exist but there’s no solid information about them;
  • LDAC—will be discussed below in more detail. For now suffice to say it’s on MP2 level and hi-res claims are just marketing;
  • LHDC and LLAC—not much is known about the codecs but after seeing quality comparison picture (with a note) on the official website I don’t expect anything good;
  • Ultra Audio Transmission—there’s no information about it except for a name mentioned in Wikipedia list of BT codecs and some marketing materials on the page with smartphone description by the same vendor;
  • Samsung BT codecs—see above.
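To put the fixed 4:1 ratio of aptX into numbers, here is a small calculation. The helper name and the sample rates are assumptions picked for illustration; only the ratio and the 16-bit sample size come from the description above.

```python
def aptx_bitrate(sample_rate, channels=2, bits_per_sample=16, ratio=4):
    """Bits per second after fixed-ratio compression of PCM audio."""
    pcm_bits = sample_rate * channels * bits_per_sample
    return pcm_bits // ratio


cd_rate = aptx_bitrate(44100)    # stereo 44.1 kHz input
dat_rate = aptx_bitrate(48000)   # stereo 48 kHz input
```

Since the ratio is fixed, the bitrate depends only on the input format, which is the flip side of the inability to adapt to bad channels.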

Now let’s review LDAC specifically. I’m somewhat surprised nobody has written a decoder for it yet. It’s so easy to reconstruct the format from the open-source encoder that Paul B. Mahol could do it in a couple of days (before returning to Bink2 decoder hopefully). aptX has only binary encoder and yet people have managed to RE it. I’m not going to do it because I don’t care much about Bluetooth codecs in general and it’s not a good fit for NihAV either.

To the technical details. The codec frame is either one long MDCT or two interlaced half-size MDCTs (just like ATSC A/52B), coefficients are coded as pairs, quads or larger single values (which reminds me of MP3 and MP2, quantisation is very similar as well). Coefficients (in pairs and quads as well) are stored in bit fields, the only variable-length codebooks are used to code quantiser differences. There’s bit allocation information transmitted for each frame so different coefficients can have different bit sizes (and thus precision). Nevertheless the maximum it can have is just 15 bits per coefficient (plus sign), which makes it hardly any hi-resier than AAC LC or SBC. And the only excuse that can be said here is the one I heard about MP3 being hi-res: with the large scales and coefficients you can have almost infinite precision. Disproving it is left as an exercise to the reader.

I hope now it’s clear why I don’t look at the development of Bluetooth codecs much. Back to slacking.

Looking at Voxware MetaVoice

Monday, December 13th, 2021

Since there’s not much I’d like to do with NihAV, I decided to revisit one old family of codecs.

It seems that they had several families of codecs and most (all?) of them are licensed from some other company, sometimes with some changes (there are four codecs licensed from Lernout & Hauspie, MetaSound is essentially TwinVQ with a different set of codebooks, RT2x and VR1x are essentially different flavours of the same codec, SC3 and SC6 might be related to Micronas codec though Micronas SC4 decoder does not look similar at all).

So here’s a short review of those various codecs that I have some information about:

  • L&H CELP 4.8kbps—this is rather standard CELP codec with no remarkable features (and I’ve even managed to write a working decoder for it);
  • L&H SBC 8/12/16kbps—that one is a sub-band coder with variable frame size (and amount of bits allocated per band);
  • RT24/RT28/RT29HQ and VR12/VR18—all these codecs share the common core and essentially it’s a variable-bitrate LPC-based speech codec with four different frame modes with no information transmitted beside frame mode, pitch information and the filter coefficients (for CELP you’d also have pulse information);
  • SC3/SC6—this one seems to be more advanced and, by the look of it, it uses order 12 LPC filter (usually speech codecs use either LPC of order 10 or 16).

I’ll try to document it for The Wiki but don’t expect much. And I’m not going to implement decoders for these formats either (beside already implemented 4.8k CELP one): the codecs have variable bitrate so you need to decode a frame (at least partially) in order to tell how many bytes it will take—and I don’t want to introduce a hack in NihAV to support such mode (either the demuxer should serve variable-length frames or the decoder should expect fixed-size frames); and even worse thing is that they are speech codecs that I don’t understand well (and there’s a lot of obscure code there). It took me more than a week to implement and debug CELP decoder. Fun story: I could not use MPlayer2 binary loader because the codec was misdetected as MPEG Audio Layer II. The cause of that was libavformat and its “helpful” tag search: when twocc 0x0070 was not found, it tried upper-case 0x0050 which belongs to MP2. And after I’ve finally made it work I discovered a fun bug in the reference decoder: while calculating cosine, the difference can overflow and thus the resulting value is somewhat wrong (and it could be easily fixed by changing “less or equal” condition to “less” in table search refinement step).
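The tag mix-up is easy to reproduce: if the two bytes of a twocc are treated as ASCII text, 0x70 is a lowercase "p" and upper-casing it gives 0x50. The sketch below only demonstrates that effect with a hypothetical helper; it is not the actual libavformat code.

```python
import struct


def uppercase_tag(twocc):
    """Upper-case a 16-bit format tag as if its bytes were ASCII characters."""
    raw = struct.pack("<H", twocc)                  # 0x0070 -> b"p\x00"
    return struct.unpack("<H", raw.upper())[0]      # b"P\x00" -> 0x0050


mangled = uppercase_tag(0x0070)
```

Tags whose bytes are not lowercase letters pass through unchanged, so such a fallback only bites the unlucky few formats like this one.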

Anyway, it’s done and now I can forget about it.

NihAV: now with lossless audio encoder

Tuesday, October 26th, 2021

Since I wanted to try my hoof at various encoding concepts it’s no wonder that after lossy audio encoder (IMA ADPCM with trellis encoding), lossless video encoder (ZMBV, using my own deflate implementation for compressing), lossy video encoder (Cinepak, for playing with vector quantisation, and VP6, for playing with many other concepts) it was time for a lossless audio encoder.

To remind you, there are essentially two types of lossless audio compressors—fast and asymmetric (based on LPC filters) and slow and symmetric (based on adaptive filters, usually long LMS ones). The theory behind them is rather simple and described below.
(more…)

Playing with trellis and encoding

Sunday, August 8th, 2021

I said before I want to play with encoding algorithms within NihAV and here’s another step (a previous major step was vector quantisation and a simple Cinepak encoder using it). Now it’s time for trellis search for encoding.

The idea by itself is very simple: you want to encode a sequence optimally (or decode transmitted sequence with distortions), so you represent your data as a set of possible states for each sample and search a path from one state to another with the minimum error. Since each state of the sample is connected with all states of the previous samples, its graph looks like a trellis:

The search itself is performed by selecting for each state a transition from a previous state that gives minimal error, then selecting a state with the least error for the last sample and tracing back the path that led to it from the beginning. You just need to store the pointer to the previous state, error value and whatever decoder state you require.

I’ve chosen IMA ADPCM encoder as a test playground since it’s simple but useful. The idea of the format is very simple: you have a state consisting of current sample value and step size used as a multiplier for the transmitted 4-bit difference value; you reconstruct the difference, add it to the previous stored value, and correct step size (small delta—decrease step, large delta—increase step). You have 16 possible states for each sample, which keeps the search from taking too long.

There’s another tricky question of selecting initial step size (it will adapt to the samples but you need to start with something). I select it to be close to the difference between first and second samples and actually abuse first state to store not the index of the previous state but rather a step index. This way I start with (ideally) 16 different step sizes around the current one and can use the one that gives slightly lower error in the end.

And another fun fact: this way I can use just the code for decompression of single ADPCM sample and I don’t require actual code for compression—it traverses through all possible compressed codes already.
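A compact sketch of such a search is given below. It is not the NihAV encoder: it uses the standard IMA ADPCM tables and reconstruction, keeps one state per 4-bit code as described above, and takes the starting predictor and step index as plain parameters instead of doing the initial step size trick. The function names are made up for this example.

```python
STEPS = [
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31,
    34, 37, 41, 45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143,
    157, 173, 190, 209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544,
    598, 658, 724, 796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878,
    2066, 2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358, 5894,
    6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899, 15289, 16818,
    18500, 20350, 22385, 24623, 27086, 29794, 32767,
]
INDEX_ADJ = [-1, -1, -1, -1, 2, 4, 6, 8]


def decode_nibble(pred, idx, nib):
    """Standard IMA ADPCM reconstruction of one 4-bit code."""
    step = STEPS[idx]
    diff = step >> 3
    if nib & 4:
        diff += step
    if nib & 2:
        diff += step >> 1
    if nib & 1:
        diff += step >> 2
    if nib & 8:
        diff = -diff
    pred = max(-32768, min(32767, pred + diff))
    idx = max(0, min(88, idx + INDEX_ADJ[nib & 7]))
    return pred, idx


def decode_all(nibbles, pred, idx):
    """Decode a whole sequence of codes, returning the reconstructed samples."""
    out = []
    for nib in nibbles:
        pred, idx = decode_nibble(pred, idx, nib)
        out.append(pred)
    return out


def trellis_encode(samples, pred, idx):
    """Return (codes, total squared error) using one trellis state per code."""
    nodes = [(0, pred, idx)]      # (accumulated error, predictor, step index)
    back = []                     # back[t][state] -> best previous state
    for sample in samples:
        new_nodes, links = [], []
        for nib in range(16):
            best = None
            for prev, (err, p, i) in enumerate(nodes):
                new_p, new_i = decode_nibble(p, i, nib)   # decoder doubles as encoder
                cand = (err + (sample - new_p) ** 2, new_p, new_i, prev)
                if best is None or cand[0] < best[0]:
                    best = cand
            new_nodes.append(best[:3])
            links.append(best[3])
        nodes = new_nodes
        back.append(links)
    state = min(range(len(nodes)), key=lambda s: nodes[s][0])
    total = nodes[state][0]
    nibbles = []
    for links in reversed(back):  # trace the winning path back to the start
        nibbles.append(state)
        state = links[state]
    nibbles.reverse()
    return nibbles, total
```

Note that only decode_nibble() knows anything about the format, so swapping in another ADPCM flavour means replacing that one function. The cost is 256 trial decodes per sample, which is still cheap.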

I hope this demonstrates that it’s an easy method that improves quality significantly (I have not conducted proper testing but from a quick test it reduced mean squared error for me by 10-20%).

It should also come in handy for video compression but unfortunately rate distortion optimisation does not look that easy…

Why codecs are designed like this and why they are not very interchangeable

Monday, August 2nd, 2021

Sometimes I have to explain the role of various codecs and why it’s pointless in most cases to adapt compression tricks from image codecs to audio codecs (and vice versa) and even from lossy to lossless codecs in the same content. If you understand that already then you’ll find no new information here.

Yours truly
Captain Obvious
(more…)

A quick glance at Bink Audio

Tuesday, February 2nd, 2021

Since my attention was drawn to this format (and binary specification was provided as well) I’ve briefly looked at it—and a brief look should be enough.

From what I see it’s the same Bink Audio but in its own container instead of Bink. It has 24-byte header, a table of 16-bit audio block sizes and actual audio data (each frame may be prefixed with 0x99 0x99 but I’m not sure since I’ve not seen a single file in that format).

File header:

  • 1FCB magic;
  • one byte of version (version 2 groups audio frames together, previous one does not);
  • one byte with number of blocks per frame;
  • two-byte sampling rate;
  • four-byte variable, probably frame length in samples;
  • four-byte unknown variable (maybe suggested input buffer size?);
  • four-byte unknown variable;
  • four-byte variable, number of frames in seek table.
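The field list above adds up to exactly 24 bytes, so it can be unpacked in one go. The sketch below is untested against real files (there were none to test on): little-endian byte order and all the field names are assumptions, and the two unknown fields are simply carried along.

```python
import struct
from collections import namedtuple

# Field names are made up for this sketch; the meaning of some is only a guess.
BinkAudioHeader = namedtuple(
    "BinkAudioHeader",
    "magic version blocks_per_frame sample_rate frame_len "
    "unknown1 unknown2 seek_entries",
)
HEADER = struct.Struct("<4sBBHIIII")   # little-endian is an assumption


def parse_header(data):
    """Parse the 24-byte header and check its magic."""
    if len(data) < HEADER.size:
        raise ValueError("need at least 24 bytes")
    header = BinkAudioHeader._make(HEADER.unpack_from(data))
    if header.magic != b"1FCB":
        raise ValueError("bad magic")
    return header
```

The table of 16-bit block sizes would follow right after these 24 bytes, with seek_entries telling how many of them to read.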

So as expected it’s nothing special.

Looking at XVD

Saturday, January 30th, 2021

A week ago a certain XviD developer made a request to look at something more compressed called XVD and so I did.
(more…)

A look on weird audio codec

Thursday, January 7th, 2021

Since I still have nothing better to do I decided to look at ALF2CD audio codec. And it turned out to be weird.

The codec is remarkable since while it seems to be simple transform+coefficient coding it does that in its own unique way: transform is some kind of integer FFT approximation and coefficient coding is done with CABAC-like approach. Let’s review all details for the decoder as much as I understood them (so not much).

Framing. Audio is split into sub-frames for middle and side channels with 4096 samples per sub-frame. Sub-frame sizes are fixed for each bitrate: for 512kbps it’s 2972 bytes each, for 384kbps it’s 2230 bytes each, for 320kbps it’s 2230/1486 bytes, for 256kbps it’s 1858/1114 bytes. Each sub-frame has the following data coded in it: first and last 16 raw samples, DC value, transform coefficients.
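Those sub-frame sizes can be sanity-checked against the nominal bitrates with a bit of arithmetic. The 44100 Hz sampling rate is an assumption (the name of the codec hints at CD audio), and for the unequal pairs the first number is taken to be the mid sub-frame, though only the sum matters here.

```python
SAMPLES_PER_SUBFRAME = 4096
SAMPLE_RATE = 44100   # assumption: CD audio

# nominal kbps -> (mid, side) sub-frame sizes in bytes
SUBFRAME_BYTES = {
    512: (2972, 2972),
    384: (2230, 2230),
    320: (2230, 1486),
    256: (1858, 1114),
}


def effective_kbps(mid_bytes, side_bytes):
    """Bitrate implied by one mid plus one side sub-frame per 4096 samples."""
    bits = (mid_bytes + side_bytes) * 8
    return bits * SAMPLE_RATE / SAMPLES_PER_SUBFRAME / 1000


rates = {nominal: effective_kbps(*sizes) for nominal, sizes in SUBFRAME_BYTES.items()}
```

Under that assumption every pair lands within a fraction of a kbps of its nominal value, so the sizes are simply the bitrate budget for 4096 samples split between the two channels.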

Coding. All values except for transform coefficients are coded in this sequence: non-zero flag, sign, absolute value coded using Elias gamma code. Transform coefficients are coded in bit-slicing mode: you transmit the length of the region that may have 0x100000 set in its values plus bit flags to tell which entries in it actually have it set, then the additional length of the region that may have 0x80000 set, and so on. The rationale is that larger coefficients come first so only the first N coefficients may be that large, then N+M coefficients may have the next bit set, and so on down to bit 0. Plus this way you can have coarse or fine approximation of the coefficients to fit the fixed frame size without special tricks to change the size.
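The flag/sign/Elias gamma sequence for ordinary values looks roughly like this when written out. The sketch reads plain bits from an iterator, while the real codec passes every one of those bits through its adaptive binary coder; the polarity of the sign bit and of the gamma prefix are assumptions too.

```python
def read_gamma(bits):
    """Elias gamma: n zero bits, a one bit, then n more bits of the value."""
    n = 0
    while next(bits) == 0:        # unary prefix gives the exponent
        n += 1
    value = 1                     # the terminating one bit is the top bit
    for _ in range(n):            # the remaining n bits are the mantissa
        value = (value << 1) | next(bits)
    return value


def read_value(bits):
    """Non-zero flag, then sign, then gamma-coded magnitude."""
    if next(bits) == 0:           # non-zero flag
        return 0
    negative = next(bits) == 1    # sign bit (1 = negative is assumed)
    magnitude = read_gamma(bits)
    return -magnitude if negative else magnitude


minus_five = read_value(iter([1, 1, 0, 0, 1, 0, 1]))
```

The split into a unary exponent and a fixed-size mantissa matches how the coder states are assigned: separate contexts for exponent bits and one for mantissa bits.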

Speaking of the coder itself, it is a context-adaptive binary range coder but not exactly the CABAC you see in ITU H.26x codecs. It has some changes, especially in the model which is actually a combination of several smaller models in the same space, and in the beginning of each sub-model you have to flip MPS value and maybe transition to some other sub-model. I.e. a single model is a collection of fixed probabilities of one/zero appearing and depending on what bit we decoded we move to another probability that suits it better (more zeroes to expect or more ones to expect). In H.26x there’s a single model for it, in ALF2CD there are several such models so when you hit the edge state aka “expect all ones or all zeroes” you don’t simply remain in the state but may transition to another sub-model with different probabilities for expected ones-zeroes. A nice trick I’d say.

Coder also maintains around 30 bit states: state 0 is for coding non-zero flags, state 1 is for coding value sign, states 2-25 are for coding value exponent and state 26 is for coding value mantissa (or it’s states 2-17 for exponent and state 18 for mantissa bits when we code lengths of transform coefficient regions).

Reconstruction. This is done by performing inverse integer transform (which looks like FFT approximation but I’ve not looked at it that close), replacing first and last 16 coefficients with previously decoded ones (probably to deal with effects of windowing or imperfect reconstruction), and finally undoing mid/side stereo for both sub-frames.
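The last step is the usual conversion from middle and side channels back to left and right. A minimal sketch, assuming the plain sum/difference form (the exact scaling and rounding ALF2CD uses were not checked):

```python
def undo_mid_side(mid, side):
    """Rebuild left/right from mid/side, assuming M=(L+R)/2 and S=(L-R)/2."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


left, right = undo_mid_side([3, 5], [1, -2])
```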

Overall it’s an interesting codec since you don’t often see arithmetic coding employed in lossy audio codecs unless they’re very recent ones or BSAC. And even then I can’t remember any audio codec using binary arithmetic coder instead of multi-symbol models. Who knows, maybe this approach will be used once again as something new. Most of those new ideas in various codecs have been implemented before after all (e.g. spatial prediction in H.264 is just a simplified version of spatial prediction in WMV2 X8-frames and quadtrees were used quite often in the 90s before reappearing in H.265; the same way Opus is not so modern if you know about ITU G.722.1 and heard that WMA Voice could have WMA Pro-coded frames in its stream).