Archive for the ‘Audio’ Category

Looking at the (un)Original Sound Quality format

Tuesday, June 27th, 2023

I was asked to look at it and since I have nothing better to do, I finally did.

OSQ was a format used by the WaveLab mastering software for storing losslessly compressed audio (they seem to have dropped support for it in more recent versions). Internally it’s a clone of the ages-old Shorten, which does not even have an LPC mode, only fixed predictors.

Of course the format differs somewhat and there are more prediction modes, but they still use at most three previously decoded samples and the filter coefficients are constant. There are other original additions such as stereo decorrelation, adaptive Rice coding with the parameter selected from previous statistics, and some floating-point calculations that look more complex than they should be. Overall though it looks like somebody took Shorten and decided not to do the hard part (LPC), maybe in order to make files save faster, and tried to compensate for it with a couple of simple tricks.
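For reference, here is a minimal sketch of the Shorten-style fixed predictors such a format builds upon (my own illustration, not actual OSQ code): each predictor is just a small fixed polynomial over up to three previously decoded samples.

```rust
/// Shorten-style fixed predictors of orders 0-3: predict the next sample
/// from up to three previously decoded samples; only the residual
/// (actual minus predicted) gets coded.
fn fixed_predict(order: usize, prev: &[i32; 3]) -> i32 {
    // prev[0] is the most recent sample, prev[2] the oldest
    match order {
        0 => 0,                                   // no prediction
        1 => prev[0],                             // constant extrapolation
        2 => 2 * prev[0] - prev[1],               // linear extrapolation
        3 => 3 * prev[0] - 3 * prev[1] + prev[2], // quadratic extrapolation
        _ => unreachable!(),
    }
}

fn main() {
    let prev = [100, 90, 70]; // three previously decoded samples
    for order in 0..4 {
        println!("order {}: predicted {}", order, fixed_predict(order, &prev));
    }
}
```

An encoder simply tries the orders per block and keeps the one producing the smallest residuals, which is exactly the easy part of Shorten.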

I’ll probably document it at my leisure but overall it’s a rather silly format.

NihAV: towards RealMedia encoding support

Friday, March 10th, 2023

Since I have nothing better to do with my life, I’ve decided to add rudimentary support for encoding into RealMedia. I’ve written a somewhat working RMVB muxer (it lacks the logical stream support used to describe streaming capabilities, and it has some quirks in audio and video handling, but it seems to produce something that other players understand) and now I’ve made a Cook audio encoder as well.

Anyone who knows me knows that I fail spectacularly at writing non-trivial lossy audio encoders, but luckily here we have a format which even I can’t botch much. It is based on parametric bit allocation derived from band energies. So all it takes is to perform a slightly modified MDCT, calculate band energies, convert them to scale factors, perform bit allocation based on them, pack band contents depending on band categories and adjust the categories until all bands fit the frame. All of these steps are well-defined (including the order in which bands should be adjusted) so making it all work is rather trivial (a sketch of the loop follows below).
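To illustrate the shape of that loop, here is a heavily simplified sketch (the names and the cost model are my own assumptions for the demo, not the actual NihAV encoder): it takes already-transformed MDCT coefficients, derives scale factors from band energies, and keeps coarsening band categories until the frame fits.

```rust
/// Simplified Cook-like allocation: 20-coefficient bands, scale factors
/// as log2 of band energy, and a loop that raises band categories
/// (making bands cheaper but coarser) until the frame budget is met.
const BAND_SIZE: usize = 20;
const MAX_CATEGORY: u8 = 7;

fn band_scale_factors(coeffs: &[f32]) -> Vec<i32> {
    coeffs
        .chunks(BAND_SIZE)
        .map(|band| {
            let energy = band.iter().map(|c| c * c).sum::<f32>() / band.len() as f32;
            energy.max(1.0e-6).log2().round() as i32
        })
        .collect()
}

// Hypothetical cost model: a higher category means fewer bits for the band.
fn band_bits(category: u8) -> u32 {
    ((MAX_CATEGORY - category) as u32 + 1) * BAND_SIZE as u32
}

fn allocate(scf: &[i32], frame_bits: u32) -> Vec<u8> {
    let mut cats = vec![0u8; scf.len()]; // start with the finest category everywhere
    loop {
        let total: u32 = cats.iter().map(|&c| band_bits(c)).sum();
        if total <= frame_bits {
            return cats;
        }
        // coarsen the least energetic band first (the real codec defines
        // its own adjustment order)
        let idx = (0..scf.len())
            .filter(|&i| cats[i] < MAX_CATEGORY)
            .min_by_key(|&i| scf[i])
            .expect("frame too small even at the coarsest categories");
        cats[idx] += 1;
    }
}

fn main() {
    let coeffs = vec![0.5f32; 256]; // pretend these came from the MDCT
    let cats = allocate(&band_scale_factors(&coeffs), 1000);
    println!("categories: {:?}", cats);
}
```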

But what about determining the frame size and coupling mode? As it turns out, RealMedia supports only certain codec profiles (or “flavors” in its terminology), so Cook has about 32 different flavours defined for different combinations of sample rates, numbers of channels and bitrates. And each flavour has internal bitrate parameters (frame size and which coupling parameters to use) for each channel pair, so you just pick a fitting profile and go on with it. In theory it’s possible to add a custom profile but it’s not worth the hassle IMO.

And now here are some fun facts about it. Apparently there are three internal revisions of the codec: the original version for RealMedia G2, version 2 for RealMedia 8, and a multichannel version (introduced in RealMedia 10, when they had switched to AAC already). Version 2 is the one supporting joint-stereo coding. The codec is based on G.722.1 (the predecessor of CELT, even if Xiph folks keep denying that) but, because Cook frames are 256-1024 samples instead of fixed 320-sample frames, they’ve introduced gains to adjust better for volume changes inside the frame (I haven’t bothered implementing that part though). That is probably the biggest fundamental change in the format (not counting the different frame sizes and the joint-stereo support in version 2). And finally I should probably mention that there are some flavours with the same bitrate that differ by the frequency range and by where the joint-stereo part starts (some of those are called “high response music”, whatever that means).

Time to move to the video encoder…

A quick glance at WA

Saturday, January 21st, 2023

A certain Paul B. asked me to look at WA (aka WavArc) and it turned out to be rather interesting. For starters, it’s the only lossless audio archiver (as opposed to mere compressor) I’m aware of, the difference being the ability to store multiple files in a single archive. Of course there are things like RAR or WinZip with special multimedia compression modes, but those are general-purpose archivers with content-specific compression methods, not audio-only archivers.

There are two versions of the executable: a 32-bit DOS one and a 16-bit DOS one (where all 32-bit integer and FPU operations are emulated). The latter turned out to showcase what Ghidra lacks in supporting old DOS executables, so eventually I tried the 32-bit version. Even though I had to load it manually, it luckily turned out to have a PE-like header for the loader, so it was no problem figuring out the segment mapping. After that it was a piece of cake.

Essentially there are three compression modes: store (mode 0), Shorten-like fixed prediction with fixed Rice codes (modes 1-4), and mode 5 with LPC prediction and residue coding using either fixed Rice codes or an arithmetic coder (with a fixed model transmitted before the residues).
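Since fixed Rice codes do most of the work in modes 1-4, here is a minimal sketch of how one such code is read (my own illustration using a common convention, not WA’s actual bitstream layout): a unary-coded quotient followed by k raw bits of remainder.

```rust
/// Read one Rice-coded value with parameter k: count leading one-bits
/// as the quotient, then take k raw bits as the remainder (assuming
/// the common "ones terminated by a zero" unary form).
struct BitReader<'a> {
    data: &'a [u8],
    pos: usize, // bit position from the start of the buffer
}

impl<'a> BitReader<'a> {
    fn read_bit(&mut self) -> u32 {
        let bit = (self.data[self.pos >> 3] >> (7 - (self.pos & 7))) & 1;
        self.pos += 1;
        u32::from(bit)
    }

    fn read_rice(&mut self, k: u32) -> u32 {
        let mut q = 0;
        while self.read_bit() == 1 {
            q += 1;
        }
        let mut r = 0;
        for _ in 0..k {
            r = (r << 1) | self.read_bit();
        }
        (q << k) | r
    }
}

fn main() {
    // bits 1110 101x: quotient 3, remainder 0b101, k = 3 -> (3<<3)|5 = 29
    let mut br = BitReader { data: &[0b1110_1010], pos: 0 };
    println!("{}", br.read_rice(3)); // prints 29
}
```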

Overall, it’s a curious piece of software that was interesting to look at.

Looking at a-Pac

Monday, September 19th, 2022

Since currently there’s a preparation phase for the next Ukrainian counteroffensive and I don’t know what other features to add to NihAV (beside improving video player), I was bored enough to do this.

After a comment from Paul, I’ve looked at some random lossless audio codec from the 90s, namely a-Pac. While the codec by itself is very simple (no LPC or IIR filter, no variable-length codes except in the block header, no arithmetic coder either), as you can see from its description in the Wiki, it was a bit tricky to RE.

The main issues were: apac.exe is in NE format (which means 16-bit code, and Ghidra sucks at figuring out which segment registers should be used when) and, which is worse, it’s written in Delphi with all the quirks of its object model implementation. So only after some searching I found the virtual table corresponding to the APAC format handler (both coding and decoding), and after trying one function after another I finally found the one responsible for encoding—after which finding the decoder function was easy.

The code itself (beside the weirdness introduced by the Delphi compiler, like using both positive and negative offsets in vtable calls) is very old-fashioned too: bit reading is done by keeping the bitreader state in global variables, reading n-bit values by reading a single bit in a loop, and refilling the bit buffer every byte (via a callback function).
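For the curious, this is roughly what that style looks like (a reconstruction of the pattern in Rust, not a translation of the actual code; the original keeps this state in globals rather than in a struct):

```rust
/// Old-fashioned bit reading: a one-byte buffer, a bit counter, and a
/// refill callback invoked every time the byte runs out.
struct OldBitReader {
    bit_buf: u8,
    bits_left: u32,
    refill: fn() -> u8, // callback supplying the next input byte
}

impl OldBitReader {
    fn read_bit(&mut self) -> u32 {
        if self.bits_left == 0 {
            self.bit_buf = (self.refill)();
            self.bits_left = 8;
        }
        self.bits_left -= 1;
        u32::from((self.bit_buf >> self.bits_left) & 1)
    }

    /// n-bit values are assembled one bit at a time in a loop,
    /// exactly as described above.
    fn read_bits(&mut self, n: u32) -> u32 {
        let mut value = 0;
        for _ in 0..n {
            value = (value << 1) | self.read_bit();
        }
        value
    }
}

fn next_byte() -> u8 { 0xA5 } // stand-in for the real input callback

fn main() {
    let mut br = OldBitReader { bit_buf: 0, bits_left: 0, refill: next_byte };
    println!("{:04b}", br.read_bits(4)); // top nibble of 0xA5 -> 1010
}
```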

It’s still fun to see how we’ve moved from simple formats like this to overcomplicated ones like LA or OptimFROG, while settling on the in-between complexity of FLAC, Monkey’s Audio and WavPack.

Spectral Band Replication in various audio codecs

Thursday, July 7th, 2022

While the war continues I try to distract myself with something less unpleasant; recently that was implementing SBR support for the AAC decoder in NihAV. And since I’ve done that, why not talk about this technology and its varieties?

The idea of saving bits in a lossy codec by coding the less important high frequencies in an approximate way is not very new; the first case that comes to mind is the aacPlus codec from Coding Technologies that served as a base for AAC SBR (and MP3Pro if you remember such a thing). It allows coding the upper part of the spectrum very efficiently compared to normal AAC (I’d say it takes 4-12kbps for that compared to the 50-60kbps used for the lower frequencies), which allowed cutting the audio bitrate significantly compared to AAC-LC of similar quality. And since it was a good idea, other codecs used it as well, but in different formats (because patent infringements are fun when both parties employ hordes of lawyers).

So, let’s look at how it’s done in the codecs I can remember (if you know more feel free to comment):

  • AAC SBR (also mp3PRO but who cares)—the original that inspired the other implementations. It works by splitting the frame into a series of 64-band slots (using complex numbers, unless it’s the coarse low-power SBR that uses only real numbers), copying lower frequencies into higher ones with a certain shape, and adding scaled noise or tones (those two are mutually exclusive); see the sketch after this list for the general flavour. For transmission efficiency lots of those parameters are derived from the configuration (which is transmitted once per couple of frames) and essentially only the envelopes used to shape coefficients and noise, plus some flags, are coded. You have to generate a lot of tables (like how QMF bands are grouped for the four modes of operation, what gains to use on coefficients/noise/tones for each QMF band in each slot and so on). Eventually other variants were developed (because there are other AAC codecs that could use it) but the approach remains the same;
  • E-AC-3—this codec has SPectral eXtension which divides the frame into fixed sub-bands, copies data from lower sub-bands, applies a specific scale to that data and blends it with noise scaled by another factor;
  • AC-4—this one has A-SPX that looks a lot like the original SBR (and considering that D*lby got the team behind it, that’s not surprising). I can’t be bothered to look at the finer details but from a quick glance it looks very similar (starting with all those FixVar and VarFix envelopes). If you want to know more about the implementation just ask Paul B. Mahol; it should be more fun than the usual questions about AC-4 he gets;
  • ATRAC9 (but not the earlier ones)—this codec seems to split the spectrum into four parts, fill them either with mirrored coefficients from below or with noise, and apply coarser or finer scaling to those bands;
  • WMA9 (or was it WMAPro or WMA3?)—as usual, it’s the “we should overengineer AAC” approach. Not much is known about how it really functions but it seems to split the higher frequencies into variable-length bands and code motion vectors for each band telling from which position to copy (and since audio frames in an MDCT-based codec are essentially P-frames, this is too close to being a video codec for my taste). There are three modes of operation for the bands too: copy data, fill with noise, or copy only the large coefficients and fill the rest with noise. I have an impression they tried to make it less computation-heavy than AAC SBR while having a similar or larger set of features.
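To give the common flavour of all these schemes, here is a toy sketch of the shared core (my own simplification, not any codec’s actual algorithm): take the decoded low-frequency bands, copy them upwards, shape the copy with per-band envelope gains and blend in scaled noise.

```rust
/// Toy bandwidth extension: reconstruct high-frequency bands by copying
/// low-frequency content upwards, applying envelope gains and mixing in
/// noise (the core shared by the SBR-like schemes above).
fn extend_spectrum(
    coeffs: &mut [f32],   // full spectrum; only the low bands hold real data
    low_bands: usize,     // number of valid low-frequency bands
    band_size: usize,
    env_gain: &[f32],     // per-band envelope gains (transmitted in real codecs)
    noise_gain: &[f32],   // per-band noise levels (likewise transmitted)
    noise: &mut impl FnMut() -> f32,
) {
    let total_bands = coeffs.len() / band_size;
    for band in low_bands..total_bands {
        let src = (band - low_bands) % low_bands; // low band to copy from
        for i in 0..band_size {
            let copied = coeffs[src * band_size + i];
            coeffs[band * band_size + i] = copied * env_gain[band - low_bands]
                + noise() * noise_gain[band - low_bands];
        }
    }
}

fn main() {
    let mut coeffs = vec![0.0f32; 64];
    for (i, c) in coeffs.iter_mut().take(32).enumerate() {
        *c = 1.0 / (i + 1) as f32; // pretend low-band data
    }
    // a very poor man's noise source, good enough for a demo
    let mut seed = 0x1234_5678u32;
    let mut noise = move || {
        seed = seed.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
        (seed >> 16) as f32 / 65536.0 - 0.5
    };
    extend_spectrum(&mut coeffs, 4, 8, &[0.5, 0.3, 0.2, 0.1], &[0.01; 4], &mut noise);
    println!("{:?}", &coeffs[32..40]);
}
```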

I guess you can see how these approaches are different yet alike at the same time and why it was not much fun to implement it. Yet I still don’t consider this time wasted as I gained more understanding on how it works (and why I didn’t want to touch it before). Now maybe it’s time to finally play with inline assembly in Rust.

On Bluetooth codecs

Wednesday, December 15th, 2021

I got a strange request for an LDAC decoder as it may help to “…verify (or debunk) Sony’s quality claims.” This made me want to write a post about the known BT codecs and my opinion of them.

Bluetooth codecs can be divided into three categories: the standard ones defined in A2DP that nobody uses (MP3 and ATRAC), the standard ones that are widely used (AAC and SBC) and custom codecs supported by specific vendors.

So, let’s start with mandatory A2DP codecs:

  • SBC—a codec designed specifically for Bluetooth. It works like MPEG Audio Layer II but with 4 or 8 sub-bands and parametric bit allocation instead of fixed tables (see the sketch after this list). This allows it to change bitrate at any frame (which lets it adapt to changing transmission quality). I’ve heard an opinion that it would beat newer codecs at their bitrates in quality but the standard intentionally capped it to prevent that. I find that not that hard to believe;
  • MPEG-1,2 Audio—I’ve not heard of anybody actually using these and it’s for the best;
  • MPEG-2,4 AAC—it should give better quality than SBC but at the cost of a much larger delay and decoding complexity;
  • ATRAC family—this feels like a proprietary replacement for AAC to me. I’ve not heard of anybody actually supporting any of these codecs in their products (it’s not that I’ve heard much about BT in general though).
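Here is what parametric bit allocation means in practice, as a toy model (my illustration of the general idea, not SBC’s actual algorithm): instead of fixed tables, bits are handed out greedily to the sub-bands that need them most until the frame budget runs out.

```rust
/// Toy parametric bit allocation: repeatedly grant one more bit to the
/// sub-band with the highest remaining "need" (scale factor minus bits
/// already granted) until the bit pool is exhausted.
fn allocate_bits(scale_factors: &[i32], mut pool: u32, max_bits: u32) -> Vec<u32> {
    let mut bits = vec![0u32; scale_factors.len()];
    while pool > 0 {
        // the neediest band that can still accept bits
        let best = (0..bits.len())
            .filter(|&i| bits[i] < max_bits)
            .max_by_key(|&i| scale_factors[i] - bits[i] as i32);
        match best {
            Some(i) => { bits[i] += 1; pool -= 1; }
            None => break, // every band is saturated
        }
    }
    bits
}

fn main() {
    // louder bands (higher scale factors) end up with more bits
    let scf = [10, 3, 7, 1];
    println!("{:?}", allocate_bits(&scf, 16, 8));
}
```

Since the allocation is recomputed from the transmitted scale factors, the frame size can change on the fly, which is exactly what lets SBC adapt to the channel.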

Here I should also mention a candidate codec named LC3 (and LC3plus). Whatever audio codec FhG IIS develops, it’ll be AAC. LC3 is no exception, as at first glance it looks like AAC LC with arithmetic coding and some additional coding tools glued on.

There’s CVSD codec for speech transmission over BT. It’s a speech codec and that’s enough about it.

Now let’s move to the proprietary codecs:

  • aptX—a rather simple codec with a 4:1 compression ratio (four 16-bit samples into a single 16-bit word). The codec works by splitting audio into four sub-bands, applying ADPCM and quantising to a fixed amount of bits (see the sketch after this list). Beside the inability to adapt to bad channels it should produce about the same quality as SBC (at least from a feature comparison point of view);
  • aptX HD—the same as the non-HD version but working on 24-bit samples (and probably the only honest high-res/high-definition codec here);
  • other aptX variants—they exist but there’s no solid information about them;
  • LDAC—will be discussed below in more detail. For now suffice it to say it’s on MP2 level and the hi-res claims are just marketing;
  • LHDC and LLAC—not much is known about these codecs but after seeing the quality comparison picture (with a note) on the official website I don’t expect anything good;
  • Ultra Audio Transmission—there’s no information about it except for the name mentioned in the Wikipedia list of BT codecs and some marketing materials on a smartphone description page by the same vendor;
  • Samsung BT codecs—see above.
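As an illustration of aptX’s 4:1 packing idea mentioned above, here is a toy sketch (the bit widths and the trivial quantiser are my assumptions for the demo, not aptX’s actual tables): each sub-band sample is reduced to a few bits of ADPCM code and the four codes are packed into one 16-bit word.

```rust
/// Toy aptX-like packing: quantise one prediction error per sub-band to
/// a small code and pack the four codes into a single 16-bit word.
/// The widths are illustrative; the real codec's allocation differs.
const WIDTHS: [u32; 4] = [8, 4, 2, 2]; // bits per sub-band, summing to 16

fn quantise(diff: i32, step: i32, bits: u32) -> u16 {
    // trivial quantiser: clamp diff/step to the representable code range
    let levels = 1i32 << (bits - 1);
    let q = (diff / step).clamp(-levels, levels - 1);
    (q & ((1 << bits) - 1)) as u16
}

fn pack(subband_diffs: [i32; 4], steps: [i32; 4]) -> u16 {
    let mut word = 0u16;
    for i in 0..4 {
        word = (word << WIDTHS[i]) | quantise(subband_diffs[i], steps[i], WIDTHS[i]);
    }
    word
}

fn main() {
    // ADPCM prediction errors per sub-band (from some predictor)
    let diffs = [1000, -200, 30, -5];
    let steps = [16, 16, 16, 16];
    println!("{:#06x}", pack(diffs, steps)); // four samples in one u16
}
```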

Now let’s review LDAC specifically. I’m somewhat surprised nobody has written a decoder for it yet: it’s so easy to reconstruct the format from the open-source encoder that Paul B. Mahol could do it in a couple of days (before returning to the Bink2 decoder, hopefully). aptX has only a binary encoder and yet people have managed to RE it. I’m not going to do it because I don’t care much about Bluetooth codecs in general and it’s not a good fit for NihAV either.

Now to the technical details. The codec frame is either one long MDCT or two interlaced half-size MDCTs (just like ATSC A/52B), and coefficients are coded as pairs, quads or larger single values (which reminds me of MP3 and MP2; the quantisation is very similar as well). Coefficients (in pairs and quads as well) are stored as fixed bit fields; variable-length codebooks are used only to code quantiser differences. There’s bit allocation information transmitted for each frame so different coefficients can have different bit sizes (and thus precision). Nevertheless the maximum it can have is just 15 bits per coefficient (plus sign), which makes it hardly any more hi-res than AAC LC or SBC. And the only excuse that can be offered here is the one I heard about MP3 being hi-res: with large scales and coefficients you can have almost infinite precision. Disproving it is left as an exercise for the reader.
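The coefficient storage boils down to something like this sketch (a schematic reconstruction with my own names, not LDAC’s actual syntax): each band transmits its bit width, and then every coefficient is just a sign bit plus a fixed-width magnitude.

```rust
/// Schematic LDAC-style coefficient reading: per-band bit widths come
/// from the bit allocation info, then each coefficient is a plain
/// sign+magnitude bit field (at most 15 magnitude bits).
struct Bits<'a> { data: &'a [u8], pos: usize }

impl<'a> Bits<'a> {
    fn get(&mut self, n: u32) -> u32 {
        let mut v = 0;
        for _ in 0..n {
            v = (v << 1) | u32::from((self.data[self.pos >> 3] >> (7 - (self.pos & 7))) & 1);
            self.pos += 1;
        }
        v
    }
}

fn read_band(br: &mut Bits, width: u32, count: usize) -> Vec<i32> {
    (0..count)
        .map(|_| {
            let sign = br.get(1) == 1;
            let mag = br.get(width) as i32;
            if sign { -mag } else { mag }
        })
        .collect()
}

fn main() {
    // 1 | 000_0000_0101 -> sign set, magnitude 5 with width 11 -> -5
    let mut br = Bits { data: &[0b1000_0000, 0b0101_0000], pos: 0 };
    println!("{:?}", read_band(&mut br, 11, 1)); // prints [-5]
}
```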

I hope now it’s clear why I don’t look at the development of Bluetooth codecs much. Back to slacking.

Looking at Voxware MetaVoice

Monday, December 13th, 2021

Since there’s not much I’d like to do with NihAV, I decided to revisit one old family of codecs.

It seems that they had several families of codecs and most (all?) of them were licensed from some other company, sometimes with changes (there are four codecs licensed from Lernout & Hauspie, MetaSound is essentially TwinVQ with a different set of codebooks, RT2x and VR1x are essentially different flavours of the same codec, and SC3 and SC6 might be related to a Micronas codec, though the Micronas SC4 decoder does not look similar at all).

So here’s a short review of those various codecs that I have some information about:

  • L&H CELP 4.8kbps—this is a rather standard CELP codec with no remarkable features (and I’ve even managed to write a working decoder for it);
  • L&H SBC 8/12/16kbps—that one is a sub-band coder with variable frame size (and amount of bits allocated per band);
  • RT24/RT28/RT29HQ and VR12/VR18—all these codecs share a common core; essentially it’s a variable-bitrate LPC-based speech codec with four different frame modes, with no information transmitted beside the frame mode, pitch information and the filter coefficients (for CELP you’d also have pulse information);
  • SC3/SC6—this one seems to be more advanced and, by the look of it, uses an order-12 LPC filter (speech codecs usually use LPC of order 10 or 16).

I’ll try to document them for The Wiki but don’t expect much. And I’m not going to implement decoders for these formats either (beside the already implemented 4.8kbps CELP one): the codecs have variable bitrate, so you need to decode a frame (at least partially) in order to tell how many bytes it will take—and I don’t want to introduce a hack in NihAV to support such a mode (either the demuxer should serve variable-length frames or the decoder should expect fixed-size frames). An even worse thing is that they are speech codecs, which I don’t understand well (and there’s a lot of obscure code there). It took me more than a week to implement and debug the CELP decoder.

Fun story: I could not use the MPlayer2 binary loader because the codec was misdetected as MPEG Audio Layer II. The cause of that was libavformat and its “helpful” tag search: when twocc 0x0070 was not found, it tried the upper-case 0x0050 which belongs to MP2. And after I’d finally made it work, I discovered a fun bug in the reference decoder: while calculating a cosine, the difference can overflow and thus the resulting value is somewhat wrong (it could be easily fixed by changing the “less or equal” condition to “less” in the table search refinement step).

Anyway, it’s done and now I can forget about it.

NihAV: now with lossless audio encoder

Tuesday, October 26th, 2021

Since I wanted to try my hoof at various encoding concepts, it’s no wonder that after a lossy audio encoder (IMA ADPCM with trellis encoding), a lossless video encoder (ZMBV, using my own deflate implementation for compression) and lossy video encoders (Cinepak, for playing with vector quantisation, and VP6, for playing with many other concepts) it was time for a lossless audio encoder.

To remind you, there are essentially two types of lossless audio compressors—fast and asymmetric ones (based on LPC filters) and slow and symmetric ones (based on adaptive filters, usually long LMS ones). The theory behind them is rather simple.
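As a taste of the symmetric kind, here is a minimal sketch of a sign-sign LMS predictor (a generic textbook variant, not the exact filter used in NihAV): encoder and decoder run the same adaptation on decoded samples, which is precisely why such codecs are symmetric in speed.

```rust
/// Sign-sign LMS adaptive predictor: predict the next sample from the
/// last N samples, then nudge each weight by the product of the signs
/// of the stored sample and of the prediction error.
struct Lms<const N: usize> {
    history: [i64; N],
    weights: [i64; N],
}

impl<const N: usize> Lms<N> {
    fn predict(&self) -> i64 {
        let dot: i64 = self.history.iter().zip(&self.weights).map(|(h, w)| h * w).sum();
        dot >> 13 // fixed-point scale, an arbitrary choice for this demo
    }

    fn update(&mut self, sample: i64, err: i64) {
        for (w, h) in self.weights.iter_mut().zip(&self.history) {
            *w += h.signum() * err.signum(); // the sign-sign update step
        }
        self.history.rotate_left(1);
        self.history[N - 1] = sample;
    }
}

fn main() {
    let mut lms = Lms::<8> { history: [0; 8], weights: [0; 8] };
    for &sample in &[100i64, 110, 120, 130, 140, 150] {
        let pred = lms.predict();
        let residual = sample - pred; // the residual is what gets coded
        lms.update(sample, residual);
        println!("sample {:3} predicted {:3} residual {:3}", sample, pred, residual);
    }
}
```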

Playing with trellis and encoding

Sunday, August 8th, 2021

I said before that I want to play with encoding algorithms within NihAV and here’s another step (the previous major step was vector quantisation and a simple Cinepak encoder using it). Now it’s time for using trellis search in encoding.

The idea by itself is very simple: you want to encode a sequence optimally (or decode a transmitted sequence with distortions), so you represent your data as a set of possible states for each sample and search for a path from one state to another with the minimum error. Since each state of a sample is connected with all states of the previous sample, the graph looks like a trellis (hence the name).

The search itself is performed by selecting, for each state, the transition from a previous state that gives the minimal error, then selecting the state with the least error for the last sample and tracing back the path that led to it from the beginning. You just need to store a pointer to the previous state, the error value and whatever decoder state you require.

I’ve chosen the IMA ADPCM encoder as a test playground since it’s simple but useful. The idea of the format is very simple: you have a state consisting of the current sample value and the step size used as a multiplier for the transmitted 4-bit difference value; you reconstruct the difference, add it to the previously stored value, and correct the step size (small delta—decrease step, large delta—increase step). You have 16 possible states for each sample, which makes the search take not too long.

There’s another tricky question of selecting the initial step size (it will adapt to the samples but you need to start with something). I select it to be close to the difference between the first and second samples, and actually abuse the first state to store not the index of the previous state but rather a step index. This way I start with (ideally) 16 different step sizes around the current one and can use the one that gives a slightly lower error in the end.

And another fun fact: this way I only need the code for decoding a single ADPCM sample and don’t require any actual compression code—the search already traverses all possible compressed codes.
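Putting it all together, here is a condensed sketch of the whole search (my own reduction of the idea; the actual NihAV encoder differs in details such as the initial-step trick described above): sixteen candidate states per sample, each picking the predecessor that minimises the accumulated error, with the first sample assumed to be transmitted verbatim in the block header.

```rust
/// Condensed trellis search for IMA ADPCM: each sample has 16 states
/// (one per possible nibble); every new state selects the predecessor
/// with the lowest accumulated squared error, and the best final state
/// is traced back to recover the nibble sequence.
const STEPS: [i32; 89] = [
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31,
    34, 37, 41, 45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143,
    157, 173, 190, 209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544,
    598, 658, 724, 796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878,
    2066, 2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358, 5894,
    6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899, 15289, 16818,
    18500, 20350, 22385, 24623, 27086, 29794, 32767,
];
const INDEX_ADJUST: [i32; 8] = [-1, -1, -1, -1, 2, 4, 6, 8];

#[derive(Clone, Copy)]
struct State {
    sample: i32,
    step_idx: i32,
    err: u64,
    prev: usize, // index of the best predecessor state
}

// the standard IMA ADPCM single-sample decoder (the only codec code needed)
fn decode(st: &State, nibble: u8) -> (i32, i32) {
    let step = STEPS[st.step_idx as usize];
    let mut diff = step >> 3;
    if nibble & 4 != 0 { diff += step; }
    if nibble & 2 != 0 { diff += step >> 1; }
    if nibble & 1 != 0 { diff += step >> 2; }
    if nibble & 8 != 0 { diff = -diff; }
    let sample = (st.sample + diff).clamp(-32768, 32767);
    let step_idx = (st.step_idx + INDEX_ADJUST[(nibble & 7) as usize]).clamp(0, 88);
    (sample, step_idx)
}

fn encode(samples: &[i16], init_step_idx: i32) -> Vec<u8> {
    // the first sample is assumed to be stored verbatim in the header
    let start = State { sample: samples[0] as i32, step_idx: init_step_idx, err: 0, prev: 0 };
    let mut trellis: Vec<[State; 16]> = vec![[start; 16]];
    for &s in &samples[1..] {
        let prev_row = *trellis.last().unwrap();
        let mut row = [start; 16];
        for nib in 0..16u8 {
            let mut best = State { sample: 0, step_idx: 0, err: u64::MAX, prev: 0 };
            for (pi, p) in prev_row.iter().enumerate() {
                let (sample, step_idx) = decode(p, nib);
                let d = (i64::from(s) - i64::from(sample)).unsigned_abs();
                let err = p.err.saturating_add(d * d);
                if err < best.err {
                    best = State { sample, step_idx, err, prev: pi };
                }
            }
            row[nib as usize] = best;
        }
        trellis.push(row);
    }
    // pick the best final state and walk the path backwards
    let mut idx = (0..16usize).min_by_key(|&i| trellis.last().unwrap()[i].err).unwrap();
    let mut nibbles = Vec::with_capacity(trellis.len() - 1);
    for row in trellis.iter().skip(1).rev() {
        nibbles.push(idx as u8);
        idx = row[idx].prev;
    }
    nibbles.reverse();
    nibbles
}

fn main() {
    let samples: Vec<i16> = (0..16).map(|i| (i * 300) as i16).collect();
    println!("{:x?}", encode(&samples, 10));
}
```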

I hope this demonstrates that it’s an easy method that improves quality significantly (I have not conducted proper testing, but from a quick test it reduced the mean squared error for me by 10-20%).

It should also come in handy for video compression but unfortunately rate distortion optimisation does not look that easy…

Why codecs are designed like this and why they are not very interchangeable

Monday, August 2nd, 2021

Sometimes I have to explain the role of various codecs and why in most cases it’s pointless to adapt compression tricks from image codecs to audio codecs (and vice versa), or even from lossy to lossless codecs for the same content type. If you understand that already then you’ll find no new information here.

Yours truly
Captain Obvious