While the war continues I try to distract myself with something less unpleasant; recently it was implementing SBR support for the AAC decoder in NihAV. And since I’ve done that, why not talk about this technology and its varieties?
The idea of saving bits in a lossy codec by coding less important high frequencies in an approximate way is not very new. The first case that comes to mind is the aacPlus codec from Coding Technologies that served as a base for AAC SBR (and mp3PRO if you remember such a thing). It codes the upper part of the spectrum very efficiently compared to the normal AAC codec (I’d say it takes 4-12kbps for that compared to 50-60kbps used for the lower frequencies), which allows cutting the audio bitrate significantly compared to AAC-LC of similar quality. And since it was a good idea, other codecs used it as well but in a different format (because patent infringements are fun when both parties employ hordes of lawyers).
So, let’s look at how it’s done in the codecs I can remember (if you know more feel free to comment):
- AAC SBR (also mp3PRO but who cares): the original that inspired other implementations. It works by splitting a frame into a series of 64-band slots (using complex numbers unless it’s the coarse low-power SBR that uses only real numbers), copying lower frequencies into high ones using a certain shape and adding scaled noise or tones (those two are mutually exclusive). For transmission efficiency lots of those parameters are derived from the configuration (that is transmitted once for a couple of frames) and essentially only the envelopes used to shape coefficients and noise plus some flags are coded. You have to generate a lot of tables (like how QMF bands are grouped for four modes of operation, what gains to use on coefficients/noise/tones for each QMF band in each slot and so on). Eventually other variants were developed (because there are other AAC codecs that could use it) but the approach remains the same;
- E-AC-3: this codec has SPectral eXtension which divides a frame into fixed sub-bands, copies data from lower sub-bands, applies a specific scale to that data and blends it with noise scaled with another scale;
- AC-4: this one has A-SPX that looks a lot like the original SBR (and considering that D*lby got the team behind it, it’s not that surprising). I can’t be bothered to look at the finer details but from a quick glance it looks very similar (starting with all those FixVar and VarFix envelopes). If you want to know more about the implementation just ask Paul B. Mahol, it should be more fun than the usual questions about AC-4 he gets;
- ATRAC9 (but not earlier): this codec seems to split the spectrum into four parts, fill them either with mirrored coefficients from below or with noise, and apply coarser or finer scaling to those bands;
- WMA9 (or was it WMAPro or WMA3?): as usual it’s the “we should overengineer AAC” approach. There’s not much known about how it really functions but it seems to split higher frequencies into variable-length bands and code motion vectors for each band telling from which position to copy (and since audio frames in an MDCT-based codec are essentially P-frames, this is too close to being a video codec for my taste). There are three modes of operation for the bands too: copy data, fill with noise, or copy only the large coefficients and fill the rest with noise. I have an impression they tried to make it less computation-heavy than AAC SBR while having a similar or larger amount of features.
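To make the common pattern more concrete, here is a minimal Rust sketch of what all those schemes share: take coefficients from the lower part of the spectrum, scale them, and blend in scaled noise. It is not how any of the codecs above actually does it (no QMF, no envelopes, no bitstream parsing). The names `HighBand`, `Lcg` and `extend_spectrum` are made up for illustration, and the noise generator is a plain linear congruential generator chosen only to keep the example dependency-free.

```rust
/// Tiny deterministic noise source (a linear congruential generator).
/// This is an assumption for the sketch; real codecs define their own noise.
struct Lcg(u32);

impl Lcg {
    /// Returns a pseudo-random value in the range [-1.0, 1.0).
    fn next(&mut self) -> f32 {
        self.0 = self.0.wrapping_mul(1664525).wrapping_add(1013904223);
        // Keep the top 24 bits so the conversion to f32 is exact.
        ((self.0 >> 8) as f32) / ((1u32 << 23) as f32) - 1.0
    }
}

/// Hypothetical description of one band in the upper part of the spectrum.
struct HighBand {
    start: usize,    // first coefficient of the band to fill
    len: usize,      // number of coefficients in the band
    src: usize,      // position in the lower part to copy from
    gain: f32,       // scale applied to the copied coefficients
    noise_gain: f32, // scale applied to the generated noise
}

/// Fills each high band with scaled low-frequency coefficients plus scaled noise.
/// `noise_gain == 0.0` gives a pure copy and `gain == 0.0` gives a pure noise fill.
/// Panics if a band refers to positions outside of `coeffs`.
fn extend_spectrum(coeffs: &mut [f32], bands: &[HighBand], rng: &mut Lcg) {
    for band in bands {
        for i in 0..band.len {
            let copied = coeffs[band.src + i] * band.gain;
            let noise = rng.next() * band.noise_gain;
            coeffs[band.start + i] = copied + noise;
        }
    }
}
```

Calling `extend_spectrum(&mut spec, &bands, &mut Lcg(1))` on an eight-coefficient spectrum with one band of `noise_gain: 0.0` and another of `gain: 0.0` would give one copied band and one noise-filled band. In these terms the real codecs differ mostly in how the bands are laid out, where `src` comes from (fixed, mirrored or transmitted) and how the two gains are coded.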
I guess you can see how these approaches are different yet alike at the same time and why it was not much fun to implement. Yet I still don’t consider this time wasted as I gained more understanding of how it works (and why I didn’t want to touch it before). Now maybe it’s time to finally play with inline assembly in Rust.