RealAudio Cook aka RealOpus « Kostya's Boring Codec World

RealAudio Cook aka RealOpus

Let’s start with a bit of history since knowing how things developed often helps to understand how they ended up like they are.

There is an organisation previously called CCITT (phone line modem owners might remember it) that was later renamed to ITU-T. It is known for standardisation and for accepting various standards under the same confusing name: e.g. PCM with A-law/mu-law quantisation is the G.711 recommendation from 1972, while G.711.0 is a lossless audio compression scheme from 2009 and G.711.1 is a weird extension from 2008 that splits audio into two bands, compresses the low band with A- or mu-law and uses MDCT and vector quantisation on the top band.

And there is also a “family” of G.722 speech codecs: basic G.722 splits audio into subbands and applies ADPCM to them; G.722.1 is a completely different codec based on parametric bit allocation, VQ and MDCT, which we discuss later; G.722.2 is a traditional speech codec better known as AMR-WB.

So, what’s the deal with G.722.1? It comes from the PictureTel family of Siren codecs (which later served as a base for G.719 too). As I mentioned before, this codec employs MDCT, vector quantisation and parametric bit allocation. So you decode the envelope defined by quantisers, allocate bits to bands depending on those (no, it’s not a 1:1 mapping), unpack bands that are coded using vector quantisation dependent on the number of bits, and perform the inverse MDCT on them. You might not be familiar with it, but this is exactly how a certain RealAudio codec works. And I don’t think you can guess its name even if I mention that it was written by Ken Cooke. But you cannot say nothing was changed: the RealAudio codec works with different frame sizes (from 32 to 1024 IIRC), it has different codebooks, it has a joint stereo mode and finally it has a multichannel coding mode based on pairs. In other words, it has evolved from a niche speech codec into a general-purpose audio codec rivalling AAC, and it was indeed the codec of choice for RealMedia before they finally switched to AAC and HE-AAC years later (which was the first time they used an open standard verbatim instead of licensing a proprietary technology or adding their own touches to standards drafts as before; even DNET had a special low-bitrate mode).

Now let’s jump to 2012 and VideoLAN Dev Days ’12. I gave a talk there about reverse engineering codecs (of course) and it was a complete failure, so that was my first and last public talk, but that’s not important. Before me, Timothy Terriberry gave an overview of Opus. So I listened to how it combines a speech codec and a general audio codec (like USAC, which you might still not know under its commercial name xHE-AAC): boring. Then came how the speech codec works (it’s the Skype SILK codec they dumped to open source at some point and, like with Duck TrueMotion VP3 before, Xiph picked it up and adopted it for its own purposes): it looks like a typical speech codec whose functioning I can barely understand. And then the CELT part came up. CELT is a general audio codec developed by Xiph that is essentially what your Opus files will end up as (SILK is used only at extremely low bitrates in files produced by the reference encoder, or so I heard from a person implementing a decoder for it). And a couple of months before VDD12 I had actually bothered to enter technical details about Cook into MultimediaWiki (here’s edit history if you want to check that); I should probably RE some codec and write more pages there for old times’ sake. So Cook design details were still fresh in my mind when I heard about CELT details…

So CELT codes just single channels or stereo pairs: nothing unusual so far, many codecs do that. It also uses MDCT, and even more codecs do that. It codes the envelope, uses parametric bit allocation and vector quantisation... wait a bit, I definitely heard about this somewhere before (yes, it sounds suspiciously like ITU G.719). Actually I pointed that out to the Xiph guys (Monty was present as well) immediately, but it was dismissed as being nothing similar at all (“we transmit band energies instead of relying on quantisers”; right, and quantisers in audio are rarely chosen depending on energy).

Let’s compare the coding stages of the two codecs to see how they fail to match up:

CELT transmits band energy, while Cook transmits quantisers (that are still highly correlated with band energy) and a variable number of gains to shape the output frame in the time domain;
CELT transmits innovation (essentially coefficients for MDCT minus some predicted stuff), while Cook transmits MDCT coefficients;
CELT uses the transmitted band energy and the bits available for innovation after the rest of the frame is coded to determine the number of bits for each band and the mode in which coefficients are coded (aka parametric bit allocation), while Cook uses the transmitted quantisers and the bits available after the rest of the frame is coded to determine the number of bits for each band and the mode in which coefficients are coded;
CELT uses Perceptual Vector Quantization (based on the Pyramid Vector Quantizer; boy, that won’t cause any confusion at all), while Cook uses fixed vector quantisation based on the number of bits allocated to the band and a static codebook;
CELT estimates pitch gains and pitch period, which is speech codec stuff that Cook does not have;
CELT uses MDCT to restore the data, and Cook does the same.
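The third point is the heart of the similarity: in both designs the decoder re-derives the per-band bit split from data it has already decoded, so the split itself never has to be transmitted. Here is a minimal Rust sketch of that idea. The function name `allocate_bits` and the greedy rule (give the next bit to the band with the highest remaining level, then lower that level by one) are made up for illustration and are not the actual Cook, G.722.1 or CELT allocation.

```rust
/// Toy parametric bit allocation: splits `budget` bits between bands using
/// only the per-band levels (quantiser indices or log-energies) that both
/// the encoder and the decoder already know.
/// NOTE: the greedy rule is a hypothetical stand-in, not a real codec's rule.
fn allocate_bits(levels: &[i32], budget: u32) -> Vec<u32> {
    let mut remaining: Vec<i32> = levels.to_vec();
    let mut bits = vec![0u32; levels.len()];
    if levels.is_empty() {
        return bits;
    }
    for _ in 0..budget {
        // Pick the band with the highest remaining level
        // (the lowest index wins ties).
        let mut best = 0;
        for (i, &lvl) in remaining.iter().enumerate() {
            if lvl > remaining[best] {
                best = i;
            }
        }
        bits[best] += 1;
        // Every bit spent makes this band less urgent.
        remaining[best] -= 1;
    }
    bits
}
```

With levels `[5, 3, 1]` and a budget of 4 bits this returns `[3, 1, 0]` on the encoder and on the decoder alike, which is why only the levels (and the frame size) need to be in the bitstream.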

Some of you might say: “Hah! Even if it matches at some stages, the actual coefficient coding is completely different!! And you forgot that CELT uses a range coder too.” Well, I didn’t say those two formats were exactly the same, just that their design is very similar. To quote the immortal words from the Bell, Cleary and Witten paper on text compression, the progress in data compression is mostly defined by larger amounts of RAM available (and CPU cycles available). So back in the day hardly any audio codec could afford a range coder (invented in 1979) except for some slow lossless audio coders. Similarly, PVQ was proposed by Thomas Fischer in 1986 but wasn’t employed because it was significantly costlier than fixed-codebook vector quantisation. So while CELT is undeniably more advanced than Cook, the main gains come from using methods that do the same thing more effectively (at the expense of RAM and/or CPU) instead of coming up with a significantly different scheme. An obligatory car analogy: it is like claiming that a modern internal combustion engine car is a completely new invention compared to the Ford Model T or FIAT 124 because it has more ~~bells and whistles~~electronics even though the principal scheme remains the same, while a radically new car would be an electric one with no transmission or gearbox and with engines in each wheel (let’s forget that such a scheme is very old too: electric cars of such design roamed the Moon in the 1970s).
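To give a feeling for what “algorithmic instead of table-based” means for PVQ: the codebook is simply every integer vector with N components whose absolute values sum to K, so nothing has to be stored, and the number of such vectors follows the recurrence V(N,K) = V(N-1,K) + V(N,K-1) + V(N-1,K-1) with V(N,0) = 1 and V(0,K) = 0 for K > 0 (as far as I remember, the same recurrence appears in the Opus specification). A small Rust sketch that just counts the codewords; the function name is mine and the code is not taken from any real decoder:

```rust
/// Number of integer vectors of dimension `n` whose absolute values sum to
/// `k`, i.e. the size of a pyramid VQ codebook with `k` pulses.
/// Returns `None` if the count does not fit into `u64`.
fn pvq_codebook_size(n: usize, k: usize) -> Option<u64> {
    // row[j] holds V(i, j) for the current dimension i; start with i = 0,
    // where only the empty vector with zero pulses exists.
    let mut row = vec![0u64; k + 1];
    row[0] = 1;
    for _ in 0..n {
        let prev = row.clone();
        // V(i, 0) = 1 stays in row[0]; fill the rest left to right using
        // V(i, j) = V(i-1, j) + V(i, j-1) + V(i-1, j-1).
        for j in 1..=k {
            row[j] = prev[j]
                .checked_add(row[j - 1])?
                .checked_add(prev[j - 1])?;
        }
    }
    Some(row[k])
}
```

For example, `pvq_codebook_size(3, 2)` gives `Some(18)`: six vectors with a single ±2 and twelve with two ±1 entries. The count outgrows any practical table very quickly, so the sketch returns `None` once it no longer fits into 64 bits.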

So overall, Opus is almost synonymous with CELT, and CELT has a lot in common with Cook in design (but greatly improved), so this allows Cook to be called RealOpus or the Opus of its era.

BTW, when implementing the decoder for this format in Rust I encountered a problem: the libavcodec table for 6-bit stereo coupling was apparently never tested, because its definition is wrong (some codes repeat with the same bit lengths) and it looks like the first half of it got corrupted. Just compare for yourself.

libavcodec version (lengths array added for reference):

static const uint16_t ccpl_huffcodes6[63] = {
    0x0004,0x0005,0x0005,0x0006,0x0006,0x0007,0x0007,0x0007,0x0007,0x0008,0x0008,0x0008,
    0x0008,0x0009,0x0009,0x0009,0x0009,0x000a,0x000a,0x000a,0x000a,0x000a,0x000b,0x000b,
    0x000b,0x000b,0x000c,0x000d,0x000e,0x000e,0x0010,0x0000,0x000a,0x0018,0x0019,0x0036,
    0x0037,0x0074,0x0075,0x0076,0x0077,0x00f4,0x00f5,0x00f6,0x00f7,0x01f5,0x01f6,0x01f7,
    0x01f8,0x03f6,0x03f7,0x03f8,0x03f9,0x03fa,0x07fa,0x07fb,0x07fc,0x07fd,0x0ffd,0x1ffd,
    0x3ffd,0x3ffe,0xffff,
};

static const uint8_t ccpl_huffbits6[63] = {
    16,15,14,13,12,11,11,11,11,10,10,10,
    10,9,9,9,9,9,8,8,8,8,7,7,
    7,7,6,6,5,5,3,1,4,5,5,6,
    6,7,7,7,7,8,8,8,8,9,9,9,
    9,10,10,10,10,10,11,11,11,11,12,13,
    14,14,16,
};

NihAV corrected version (extracted from the reference of course):

const COOK_CPL_6BITS_CODES: &[u16; 63] = &[
    0xFFFE, 0x7FFE, 0x3FFC, 0x1FFC, 0x0FFC, 0x07F6, 0x07F7, 0x07F8,
    0x07F9, 0x03F2, 0x03F3, 0x03F4, 0x03F5, 0x01F0, 0x01F1, 0x01F2,
    0x01F3, 0x01F4, 0x00F0, 0x00F1, 0x00F2, 0x00F3, 0x0070, 0x0071,
    0x0072, 0x0073, 0x0034, 0x0035, 0x0016, 0x0017, 0x0004, 0x0000,
    0x000A, 0x0018, 0x0019, 0x0036, 0x0037, 0x0074, 0x0075, 0x0076,
    0x0077, 0x00F4, 0x00F5, 0x00F6, 0x00F7, 0x01F5, 0x01F6, 0x01F7,
    0x01F8, 0x03F6, 0x03F7, 0x03F8, 0x03F9, 0x03FA, 0x07FA, 0x07FB,
    0x07FC, 0x07FD, 0x0FFD, 0x1FFD, 0x3FFD, 0x3FFE, 0xFFFF
];
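The kind of check that catches such a table at init time is easy to sketch. The helper below is hypothetical (it is not the NihAV codebook code) and it assumes that codes are stored right-aligned with lengths between 1 and 16 bits, and that the corrected table uses the same lengths array as shown for libavcodec. It reports every pair of entries where one code is a prefix of another, identical codes of equal length being the simplest case. Only the first 12 entries of each table are included to keep it short.

```rust
/// Returns index pairs (i, j), i < j, where one code is a prefix of the
/// other. `codes[i]` is the right-aligned code value and `bits[i]` is its
/// length in bits (assumed to be in the range 1..=16).
fn find_prefix_conflicts(codes: &[u16], bits: &[u8]) -> Vec<(usize, usize)> {
    assert_eq!(codes.len(), bits.len());
    let mut conflicts = Vec::new();
    for i in 0..codes.len() {
        for j in (i + 1)..codes.len() {
            // Compare only the leading bits that both codes have.
            let common = bits[i].min(bits[j]);
            let a = codes[i] >> (bits[i] - common);
            let b = codes[j] >> (bits[j] - common);
            if a == b {
                conflicts.push((i, j));
            }
        }
    }
    conflicts
}

// First 12 entries of the two code tables quoted above.
const LAVC_CODES: [u16; 12] = [
    0x0004, 0x0005, 0x0005, 0x0006, 0x0006, 0x0007,
    0x0007, 0x0007, 0x0007, 0x0008, 0x0008, 0x0008,
];
const NIHAV_CODES: [u16; 12] = [
    0xFFFE, 0x7FFE, 0x3FFC, 0x1FFC, 0x0FFC, 0x07F6,
    0x07F7, 0x07F8, 0x07F9, 0x03F2, 0x03F3, 0x03F4,
];
// First 12 code lengths (assumed to be shared by both tables).
const BITS: [u8; 12] = [16, 15, 14, 13, 12, 11, 11, 11, 11, 10, 10, 10];
```

On these excerpts `find_prefix_conflicts(&LAVC_CODES, &BITS)` reports, among others, the pair `(5, 6)` (two 11-bit codes both equal to `0x0007`), while the NihAV excerpt produces no conflicts.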

This entry was posted on Sunday, October 14th, 2018 at 3:16 pm and is filed under Audio.

8 Responses to “RealAudio Cook aka RealOpus”

Paul says:

October 14, 2018 at 3:31 pm

So you have sample(s) of cook that do not decode correctly with lavc?

Luca Barbato says:

October 14, 2018 at 3:33 pm

Nice comparison (and thanks for the table ^^)

Kostya says:

October 15, 2018 at 1:03 am

@Paul No, it’s just a difference in design: libavcodec uses those sets only when they are needed, while NihAV, being written in Rust, prefers to initialise all codebooks at once. Hence my surprise at the decoder failing at the init stage because of wrong code definitions.

Michael says:

October 16, 2018 at 5:01 pm

>So while CELT is undeniably more advanced than Cook, the main gains are from using methods that do the same thing more effectively (at expense of RAM and/or CPU) instead of coming up with significantly different scheme.

I spent some time optimizing Cook for Rockbox and then some more for Opus, and while Opus is definitely more complex than Cook, it is not an especially complex codec.

Some comparisons of CPU frequency needed for real time decode on ARM9E:

cook_stereo_96.ra: 24.48 MHz
opus_128k.opus: 36.26 MHz
lame_096.mp3: 19.75 MHz
vorbis_096.ogg: 20.04 MHz

Even these numbers are misleading though, since Opus runs its MDCT in pure C, while everything else runs highly optimized ASM transforms (only written for power-of-2 transform sizes). I think if I ever had time to finish the Opus non-power-of-2 transform for ARMv5, I’d probably get Opus down to at least 28-30 MHz, maybe lower.

Kostya says:

October 17, 2018 at 12:54 am

Thanks for the benchmarks. Also I suspect Opus is a bit more asymmetrical than others (i.e. encoding/decoding time ratio) because of PVQ search (Vorbis encoders seem to use static codebooks to deal with such complexity) but I might be wrong.

derf says:

March 4, 2019 at 6:50 pm

As VQ searches go, the PVQ codebook search is pretty fast. You can really only search a trained, table-based codebook by iterating through and checking all of the codevectors. However, the PVQ codebook has some structure to it. You can use that structure to speed up the search. For a codebook with N dimensions and K “pulses”, the search complexity in our encoder is O(N*min(N,K)). Meanwhile the codebook size grows as O(K^N), and Opus can use an N larger than 200. PVQ is slower than scalar quantization (O(N)), but much faster than comparably-sized static codebooks (O(N*(codebook size))).

For this reason, codecs that use static codebooks tend to use a small set of small codebooks. Sure those are fast, but they also lose some coding efficiency. Because we do not have to store (and search) large static tables, we can have lots of codebooks of any size, up to four billion codewords (we even experimented with larger during the design process). That scale and flexibility is one of the things that contributes to the coding performance of Opus: when we allocate bits, we get pretty close to the number we ask for, and we do not lose much efficiency (in CELT) to splitting large vectors into smaller ones or resorting to multistage VQ to keep codebook size under control.

SILK is a slightly different story. Even though it uses (another variant of) PVQ for the excitation, SILK has a multistage VQ for the LPC parameters, and a few other static VQ tables for the pitch parameters. Those things are less amenable to an algorithmic VQ like PVQ, so that is what you have to do. SILK’s encoder/decoder asymmetry is also much larger than CELT’s. Even though SILK decoding is very fast (much faster than CELT), SILK encoding tends to be slower than CELT, mostly due to the noise-shaping quantization search.

As for transmitting band energies instead of quantizers… while I agree that the two are often highly correlated (that is the point!), the former actually tells you something about the signal while the latter is purely side information. If these codecs used the quantizers to determine the band energies, then you might have an argument. Instead, G.722.1 transmits both: an amplitude index, and then *four* bits per band to determine the “categorization”. That is a huge amount of side information for two things that are “highly correlated”. Cook also transmits both, though I have not looked closely at the average number of bits each uses.

Before Opus, the usual trend in general-purpose audio codecs (since MP3) was to have a sophisticated psychoacoustic model in the encoder that carefully allocates bits and transmits that allocation as side information. Without that sophisticated model, the results are decidedly subpar (see: the Xing MP3 encoder, or the old native FFmpeg Vorbis encoder). G.722.1 and Cook are no different.

We (by which I mean Monty) had already learned when writing a Vorbis encoder that you need to preserve the energy in each band when quantizing in order to sound good. Opus is certainly not the first codec to do that by explicitly coding that energy. However, Jean-Marc’s idea for CELT was that just using the band energies to derive the bit allocation for the rest of the signal is already so good that you can skip having the sophisticated psychoacoustic model entirely. Even if theoretically the model would come up with a slightly better allocation, you save so many bits by not transmitting two separate but highly correlated things that it doesn’t matter. CELT does have a few knobs you can use to tweak the allocation, and the more recent Opus releases do use them to avoid some specific artifacts (at an average side-information cost of much less than one bit per band, I believe). However, it still doesn’t have anything like the kind of psychoacoustic models that were popular in MP3’s heyday. We even had people (e.g., Gian-Carlo) try to add such a model during the CELT development process, without success. If your CELT encoder makes random decisions for every single piece of side information (except for the silence flag), the result still sounds better than MP3.

There are a few other things that CELT has that previous codecs like MP3 and Vorbis do not: folding, spreading, TF resolution switching, anti-collapse, short MDCT windows, crosstalk-free mid-side stereo, etc. Some of those (anti-collapse) are fixing problems created by the CELT design itself (coding a single per-band energy for a whole frame even when split into multiple short transform blocks), so there was no reason for other codecs to have them. Some (folding) were just different ways of solving problems others had solved (although folding is amazingly effective given its simplicity when compared to things like SBR). Others were original and unique, however.

I do not claim that Opus was single-handedly more innovative than the previous 30 years of audio coding research. Certainly many of the ideas had their seeds in other codecs, just like other codecs have now started to borrow ideas from Opus. There were a few good ideas in there, though. They were not just of the “we have more CPU and memory now, so let’s add bigger blocks and 30 extra prediction modes” variety like you see in video coding, either. I haven’t even mentioned any of my own ideas (pretty much everything above on the CELT side came from Jean-Marc).

Kostya says:

March 5, 2019 at 12:20 am

Thank you for the very insightful and informative comment, but you still got some minor details wrong.

> G.722.1 transmits both: an amplitude index, and then *four* bits per band to determine the “categorization” … Cook also transmits both.

Cook has up to 8 gains controlling overall frame amplitude (which is normal for a 1024-sample frame IMO) and then quantisers for each band that are also used for bit allocation. See https://wiki.multimedia.cx/index.php/RealAudio_cook for some details.

> Before Opus, the usual trend in general-purpose audio codecs (since MP3) was to have a sophisticated psychoacoustic model in the encoder that carefully allocates bits and transmits that allocation…

If you look at quite widespread codecs like AC-3, Indeo Audio/Music and Cook, they did not transmit the actual band coding method (i.e. how many bits to read per coefficient or which codebook to use) directly but rather stored much smaller bit allocation information. And I’d guess a psychoacoustic model was still needed to tell the coder which frequencies can be discarded or given fewer bits than the others. And yes, the fact that CELT could get rid of it is a great achievement.

> There are a few other things that CELT has that previous codecs like MP3 and Vorbis do not: … short MDCT windows

Uhm? The RFC describes short blocks as “Unlike other transform codecs, the multiple MDCTs are jointly quantized as if the coefficients were obtained from a single MDCT.” How is that different from AC-3 or AAC, which also group coefficients for short MDCTs and quantise them together?

> Some (folding) were just different ways of solving problems others had solved …

Well, most non-Fraunhofer codecs (like E-AC-3) use a similar technique for their SBR replacement. Of course it’s usually less complicated and does not have the same per-band control as CELT.

> Certainly many of the ideas had their seeds in other codecs, just like other codecs have now started to borrow ideas from Opus.

Has somebody failed to say “USAC”? I’d love to see a comparison between it and Opus.

NihAV: towards RealMedia encoding support « Kostya's Boring Codec World says:

March 10, 2023 at 12:34 pm

[…] AAC already). Version 2 is the one supporting joint-stereo coding. The codec is based on G.722.1 (the predecessor of CELT even if Xiph folks keep denying that) but, because Cook frames are 256-1024 samples instead of […]