Archive for the ‘Audio’ Category

AVC support in NihAV: semi-done

Saturday, September 14th, 2019

I’ve wasted enough time on AVC decoder for On2 family so while it’s not working properly for those special cases I’m moving to VP7 regardless.

For those who don’t know (or forgot; or never had a reason to care) On2 AVC is an AAC-LC rip-off with some creative reconstruction modes added to the usual long/short windows. I’ve failed to understand how it works before and I fail to understand how it works still. But at least some details are a bit clearer now that I’ve analysed the whole codec from scratch with less guesswork.

The codec has three IDs that it recognizes: 0x500, 0x501 and 0x1234. First two are different only in the aspect that one handles singular packets and another one handles several packets glued together prefixed with size. The last ID is simply recognized but it does not have any special handling.

The tricky part is some special modes that do some heavy processing of data. For most modes you invoke IMDCT and that’s all, here you do some QMF-like filtering (probably for transients extraction), then you perform RDFT (previously I thought it was plain FFT but after long investigation it turned out to be RDFT after all) on quarters, merge those quarters using filters that look like convolution filters for four sub-bands, perform RDFT again on the whole block and add some transients. And after that you still may need to reverse the data before using permuted window for overlap-add operation. In other words it’s not fun and I lack education for recognizing all those algorithms used, why they’re used and where it goes wrong.

So hopefully I’ll return to it some day to fix it for good but now VP7 awaits (so I can at least formally declare Duck codecs family done and move to implementing missing bits in the framework itself).

RealAudio Cook aka RealOpus

Sunday, October 14th, 2018

Let’s start with a bit of history since knowing how things developed often helps to understand how they ended up like they are.

There is an organisation previously called CCITT (phone line modem owners might remember it) later renamed to ITU-T. It is known for standardisation and accepting various standards under the same confusing name e.g. PCM and A-law/mu-law quantisation are G.711 recommendation from 1972 while G.711.0 is lossless audio compression scheme from 2009 and G.711.1 is a weird extension from 2008 that splits audio into two bands, compresses low band with A- or mu-law and uses MDCT and vector quantisation on top band.

And there is also a “family” of G.722 speech codecs: basic G.722 that employs splitting audio into subbands and applying ADPCM on them; G.722.1 is a completely different parametric bit allocation, VQ and MDCT codec we discuss later; G.722.2 is a traditional speech codec better known as AMR-WB.

So, what’s the deal with G.722.1? It comes from PictureTel family of Siren codecs (which later served as a base for G.719 too). Also as I mentioned before this codec employs MDCT, vector quantisation and parametric bit allocation. So you decode envelope defined by quantisers, allocate bits to bands depending on those (no, it’s not 1:1 mapping), unpack bands that are coded using vector quantisation dependent on amount of bits and perform MDCT on them. You might be not familiar but this is exactly how certain RealAudio codec works. And I don’t think you can guess its name even if I mention that it was written by Ken Cooke. But you cannot say nothing was changed: RealAudio codec works with different frame sizes (from 32 to 1024 IIRC), it has different codebooks, it has joint stereo mode and finally it has multichannel coding mode based on pairs. In other words, it has evolved from niche speech codec to general purpose audio codec rivalling AAC and it was indeed a codec of choice for RealMedia before they have finally switched to AAC and HE-AAC years later (which was the first time for them using open standard verbatim instead of licensing a proprietary technology or adding their own touches on standards drafts as before—even DNET had special low-bitrate mode).

Now let’s jump to 2012 and VideoLAN Dev Days ’12. I gave a talk there about reverse engineering codecs (of course) and it was a complete failure so that was my first and last public talk but that’s not important. And before me Timothy Terriberry gave an overview of Opus. So I listen how it combines speech and general audio codec (like USAC which you might still not know under its commercial name xHE-AAC)—boring, how speech codec works (it’s Skype SILK codec they dumped to open source at some point and like with Duck TrueMotion VP3 before, Xiph has picked it up and adopted for own purposes)—looks like typical speech codec that I can barely understand how it functions, and then CELT part comes up. CELT is general audio codec developed by Xiph that is essentially what your Opus files will end as (SILK is used only at extremely low bitrates in files produced by the reference encoder—or so I heard from the person implementing a decoder for it). And a couple of months before VDD12 I actually bothered to enter technical details about Cook into MultimediaWiki (here’s edit history if you want to check that)—I should probably RE some codec and write more pages there for the old times’ sake. So Cook design details were still fresh in my mind when I heard about CELT details…

So CELT codes just single channels or stereo pairs—nothing unusual so far, many codecs do that. It also uses MDCT—even more codecs do that. It codes envelope, uses parametric bit allocation and vector quantisation—wait a bit, I definitely heard about this somewhere before (yes, it sounds suspiciously like ITU G.719). Actually I pointed out that to Xiph guys (there was Monty present as well) immediately but it was dismissed as being nothing similar at all (“we transmit band energies instead of relying on quantisers”—right, and quantisers in audio are rarely chosen depending on energy).

Let’s compare the coding stages of two codecs to see how they fail to match up:

  1. CELT transmits band energy—Cook transmits quantisers (that are still highly correlated with band energy) and variable amount of gains to shape output frame in time domain;
  2. CELT transmits innovation (essentially coefficients for MDCT minus some predicted stuff)—Cook transmits MDCT coefficients;
  3. CELT uses transmitted band energy and bits available for innovation after the rest of frame is coded to determine number of bits for each band and mode in which coefficients are coded (aka parametric bit allocation)—Cook uses transmitted quantisers and bits available after the rest of frame is coded to determine number of bits for each band and mode in which coefficients are coded;
  4. CELT uses Perceptual Vector Quantization (based on Pyramid Vector Quantizer—boy, that won’t cause any confusion at all)—Cook uses fixed vector quantisation based on amount of bits allocated to band and static codebook;
  5. CELT estimates pitch gains and pitch period—that is a speech codec stuff that Cook does not have;
  6. CELT uses MDCT to restore the data—Cook does the same.

Some of you might say: “Hah! Even if it matches at some stages actual coefficient coding is completely different!! And you forgot that CELT uses range coder too.” Well, I didn’t say those two formats were exactly the same, just that their design is very similar. To quote the immortal words from Bell, Cleary and Witten paper on text compression, the progress in data compression is mostly defined by larger amounts of RAM available (and CPU cycles available). So back in the day hardly any audio codec could afford range coder (invented in 1979) except for some slow lossless audio coders. Similarly PVQ was proposed by Thomas Fischer in 1986 but wasn’t employed because it was significantly costlier than some fixed codebook vector quantisation. So while CELT is undeniably more advanced than Cook, the main gains are from using methods that do the same thing more effectively (at expense of RAM and/or CPU) instead of coming up with significantly different scheme. An obligatory car analogy: it is like claiming that a modern internal combustion engine car is a completely new invention compared to Ford Model T or FIAT 124 because it has more electronics even while the principal scheme remains the same—while a radically new car would be an electric one with no transmission or gearbox and engines in each wheel (let’s forget such scheme is very old too—electric cars of such design roamed Moon in 1970s).

So overall, Opus is almost synonymous with CELT and CELT has a lot in common in design with Cook (but greatly improved) so this allows Cook to be called RealOpus or Opus of its era.


BTW when implementing the decoder for this format in Rust I’ve encountered a problem: the table for 6-bit stereo coupling was never tested because its definition is wrong (some code definitions repeating with the same bit lengths) and looks like the first half of it got corrupted. Just compare for yourselves.

libavcodec version (lengths array added for the reference):

static const uint16_t ccpl_huffcodes6[63] = {
    0x0004,0x0005,0x0005,0x0006,0x0006,0x0007,0x0007,0x0007,0x0007,0x0008,0x0008,0x0008,
    0x0008,0x0009,0x0009,0x0009,0x0009,0x000a,0x000a,0x000a,0x000a,0x000a,0x000b,0x000b,
    0x000b,0x000b,0x000c,0x000d,0x000e,0x000e,0x0010,0x0000,0x000a,0x0018,0x0019,0x0036,
    0x0037,0x0074,0x0075,0x0076,0x0077,0x00f4,0x00f5,0x00f6,0x00f7,0x01f5,0x01f6,0x01f7,
    0x01f8,0x03f6,0x03f7,0x03f8,0x03f9,0x03fa,0x07fa,0x07fb,0x07fc,0x07fd,0x0ffd,0x1ffd,
    0x3ffd,0x3ffe,0xffff,
};

static const uint8_t ccpl_huffbits6[63] = {
    16,15,14,13,12,11,11,11,11,10,10,10,
    10,9,9,9,9,9,8,8,8,8,7,7,
    7,7,6,6,5,5,3,1,4,5,5,6,
    6,7,7,7,7,8,8,8,8,9,9,9,
    9,10,10,10,10,10,11,11,11,11,12,13,
    14,14,16,
};

NihAV corrected version (extracted from the reference of course):

const COOK_CPL_6BITS_CODES: &[u16; 63] = &[
    0xFFFE, 0x7FFE, 0x3FFC, 0x1FFC, 0x0FFC, 0x07F6, 0x07F7, 0x07F8,
    0x07F9, 0x03F2, 0x03F3, 0x03F4, 0x03F5, 0x01F0, 0x01F1, 0x01F2,
    0x01F3, 0x01F4, 0x00F0, 0x00F1, 0x00F2, 0x00F3, 0x0070, 0x0071,
    0x0072, 0x0073, 0x0034, 0x0035, 0x0016, 0x0017, 0x0004, 0x0000,
    0x000A, 0x0018, 0x0019, 0x0036, 0x0037, 0x0074, 0x0075, 0x0076,
    0x0077, 0x00F4, 0x00F5, 0x00F6, 0x00F7, 0x01F5, 0x01F6, 0x01F7,
    0x01F8, 0x03F6, 0x03F7, 0x03F8, 0x03F9, 0x03FA, 0x07FA, 0x07FB,
    0x07FC, 0x07FD, 0x0FFD, 0x1FFD, 0x3FFD, 0x3FFE, 0xFFFF
];
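
The "code definitions repeating with the same bit lengths" problem can be caught mechanically. Below is a minimal sketch (the function name `find_duplicate_codes` is made up for illustration and is not part of libavcodec or NihAV) that reports table entries sharing the same code and length. It is run on the first nine entries of each table quoted above.

```rust
use std::collections::HashMap;

/// Returns groups of table indices that share the same (code, length) pair.
/// A valid prefix code can never contain such a group.
fn find_duplicate_codes(codes: &[u16], bits: &[u8]) -> Vec<Vec<usize>> {
    assert_eq!(codes.len(), bits.len());
    let mut seen: HashMap<(u16, u8), Vec<usize>> = HashMap::new();
    for (idx, (&code, &len)) in codes.iter().zip(bits.iter()).enumerate() {
        seen.entry((code, len)).or_default().push(idx);
    }
    let mut dups: Vec<Vec<usize>> = seen
        .into_values()
        .filter(|indices| indices.len() > 1)
        .collect();
    // HashMap iteration order is arbitrary, so sort for stable output.
    dups.sort();
    dups
}

fn main() {
    // First nine entries of each table from the post; both use the same lengths.
    let bits = [16u8, 15, 14, 13, 12, 11, 11, 11, 11];
    let lavc = [
        0x0004u16, 0x0005, 0x0005, 0x0006, 0x0006, 0x0007, 0x0007, 0x0007, 0x0007,
    ];
    let nihav = [
        0xFFFEu16, 0x7FFE, 0x3FFC, 0x1FFC, 0x0FFC, 0x07F6, 0x07F7, 0x07F8, 0x07F9,
    ];
    println!("libavcodec: {:?}", find_duplicate_codes(&lavc, &bits));
    println!("NihAV:      {:?}", find_duplicate_codes(&nihav, &bits));
}
```

On the libavcodec excerpt the four 11-bit entries all carry code 0x0007, so they form one duplicate group, while the NihAV excerpt has none.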

NihAV, RealMedia, Rust and Everything Else

Saturday, October 13th, 2018

Looks like it’s been about two months since I last wrote anything about NihAV but that does not mean I did not have anything to write about. On the contrary, I’m glad to report about significant progress in RealAudio support.

Previously I’ve reported about RealVideo 3 and 4 support (as for RealVideo 1/2 and ClearVideo before), so video part was covered quite well but audio part was missing and I went on to rectify the situation.

Now NihAV supports RealAudio 1.0 (speech codec), RealAudio 2.0 (speech codec), RealAudio DNET (a bit about it later), RealAudio 4.0 (speech codec from Sipro), RealAudio Cook (this one deserves a separate post so the next one should be about this codec) and RealAudio Lossless. So there are only three codecs missing now: RealAudio 8 (ATRAC3), RealAudio 9/10 (AAC) and RealVideo 6 (HD). Of course I’m going to add support for those as well.

This is actually a good time to implement those. As you might know, there is a Holy Trinity of Licensors: D.vX, D*lby and DT$. They are famous for ‘nice’ licensing terms. While I’ve never had to deal with them, I’ve heard from people who did that they like licensing single product they’re most famous for at outrageous prices (i.e. it’ll cost you a magnitude more per unit using their technology than e.g. H.264 decoder) and it’s a viral license too because if you sell stuff not oriented for consumers then you have to force your customers into the same deal (it’s GPL—Greedy Private License) and you have to report your sales to them for obvious reasons. Funny how two of the companies were bought out already. Now let’s look at them in some details:

  • D.vX This one is remarkable since it licensed the product it had nothing to do with (aka M$MPEG-4 adapted for non-ASF containers and MPEG-4 ASP). At least it seems hardly relevant now unless I dig out some old movies.
  • D*lby This one is mostly known (outside cinema equipment) for codec with several names: ATSC A/52, RealAudio DNET, ETSI TS 102 366, D*lby Digital and even something you can make out of letters A C and 3 (I heard rumours that it does not like its trademarks mentioned so I’d better avoid directly naming it). At least the last patents for that format have expired and support for it can be implemented freely. And it also owns a company that manages licensing of AAC. Fun fact is that patents for MPEG2 NBC are expired so I can implement AAC-LC decoder just fine but that does not stop them from licensing it. How do they do it? By refusing to license the separate parts and forcing a whole package of AAC-LC, HE-AACv1, HE-AACv2 and xHE-AAC onto you. I guess if the situation won’t change in twenty years all current stuff will expire but they’ll still license it along with Ultra-Enhanced-Hyper-Expanded-Radically-Extended High-Efficiency AAC (which will have nothing to do with all those previous formats).
  • DT$ A company similar to D*lby and its (former?) prime competition. Also known for single format with many extensions making it essentially a homebrew AAC. At least it seems to be exclusively DVD/Blu-ray format and I’m satisfied with Xine for playing the former and avoiding the latter completely.

And I want to talk a bit more about my RealAudio DNET decoder. Internally it’s called ts102366 for obvious reasons and I have just a primitive implementation for it (i.e. it seems to work and should handle multichannel fine but no extended features). The extension for more than 5.1 channels also seems to be HD-DVD/Blu-ray only so I don’t care, it’s quite rare in RealMedia format and other containers seem to contain it as contiguous stream so I’d need to introduce support for NAElementaryStream in demuxing code and also proper parser to split it into frames. Not worth the effort for me at this moment. Another fun fact is that bitstream comes in 16-bit words that can have any endianness. In my case I just had to detect the proper endianness from first two bytes and simply initialise bitstream reader in BE or LE16 mode depending on it (again, it’s funnier with DT$ format where you have three different bitstream reading modes and you might need two modes simultaneously in some cases; again, good thing I don’t have to care about that stuff). Also it’s still one of two codecs I currently have that support multichannel audio (Cook is the second of course and AAC will be third).
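
The endianness check from the first two bytes could look roughly like the sketch below. It assumes the data starts at a frame boundary and relies on the 16-bit sync word 0x0B77 that opens every ETSI TS 102 366 frame. The names `WordOrder` and `detect_word_order` are invented here and are not NihAV API.

```rust
/// Byte order of the 16-bit words that make up the stream.
#[derive(Debug, PartialEq, Clone, Copy)]
enum WordOrder {
    BigEndian,
    LittleEndian,
}

/// Sync word at the start of every ETSI TS 102 366 frame.
const SYNC_WORD: u16 = 0x0B77;

/// Guesses the word order from the first two bytes.
/// Returns `None` when the data is too short or does not start with a sync word.
fn detect_word_order(data: &[u8]) -> Option<WordOrder> {
    let first = data.get(..2)?;
    let word = u16::from_be_bytes([first[0], first[1]]);
    if word == SYNC_WORD {
        Some(WordOrder::BigEndian)
    } else if word.swap_bytes() == SYNC_WORD {
        // Bytes arrive swapped, so the stream is made of little-endian words.
        Some(WordOrder::LittleEndian)
    } else {
        None
    }
}

fn main() {
    println!("{:?}", detect_word_order(&[0x0B, 0x77, 0x12, 0x34]));
    println!("{:?}", detect_word_order(&[0x77, 0x0B, 0x34, 0x12]));
}
```

The result would then select which mode the bitstream reader is initialised in.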

And finally some words about Rust issues I had to deal with.

Rust as a language is more or less fine but compiler sucks. I’ve run into several issues while writing code.

First, I had a fixed array of Codebooks to initialise in RALF decoder (one of 15 codebooks, another one of 125 codebooks and yet another one of 10×11 codebooks). If I use simply mem::uninitialized() with filling it up it works fine. In debug mode. In release mode it segfaults at the end. Probably I should’ve used ptr::write() instead of assigning and it would work fine but I gave up and used a vector instead of an array even if it’s not as efficient. Obviously it’s all my fault and not Rust issue but still that was weird.
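
For comparison, here is a sketch of how such fixed arrays can be filled without any uninitialised memory in newer Rust. It uses `std::array::from_fn`, which only became stable in Rust 1.63, so it was not an option when this was written. `Codebook` and `Tables` are hypothetical stand-ins, not the actual RALF decoder types.

```rust
/// Hypothetical stand-in for a real codebook: it owns heap memory,
/// so it is not `Copy` and must never be left uninitialised.
struct Codebook {
    codes: Vec<u16>,
}

impl Codebook {
    /// Builds a dummy codebook whose contents depend on its index.
    fn new(id: usize) -> Self {
        Codebook { codes: vec![id as u16; 4] }
    }
}

/// Two of the table shapes mentioned in the text: 15 codebooks and 10x11 codebooks.
struct Tables {
    small: [Codebook; 15],
    grid: [[Codebook; 11]; 10],
}

impl Tables {
    fn new() -> Self {
        Tables {
            // `from_fn` calls the closure once per index and never exposes
            // a partially initialised array.
            small: std::array::from_fn(Codebook::new),
            grid: std::array::from_fn(|row| {
                std::array::from_fn(|col| Codebook::new(row * 11 + col))
            }),
        }
    }
}

fn main() {
    let tables = Tables::new();
    println!("{} {}", tables.small.len(), tables.grid[9][10].codes[0]);
}
```

On older compilers the vector workaround described above remains the simplest safe choice.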

Second, when I tried to create a generic codebook reader that would accept table of codes of any primitive type (u8, u16 or u32) I ran into funnier issue of Rust compiler spewing weird errors like “cannot convert u16 to u32 because it’s not a primitive type”. Obviously it’s my mistake and it’s caught by a tool (that is still not in stable) so the developers don’t care (yes, Luca even bothered to file an issue on that). Still, I’d rather have a clearer error message in that case (e.g. “… because it’s X and not a primitive type”).
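
One way to write such a generic reader is to avoid `as` casts on the type parameter and ask for a lossless conversion through a trait bound instead. This is only a sketch with an invented function name (`build_code_table`), not the NihAV codebook code.

```rust
/// Collects (code, length) pairs from a table of any unsigned type that
/// widens losslessly into `u32` (`u8`, `u16` and `u32` all qualify).
fn build_code_table<T>(codes: &[T], bits: &[u8]) -> Vec<(u32, u8)>
where
    T: Copy + Into<u32>,
{
    assert_eq!(codes.len(), bits.len());
    codes
        .iter()
        .zip(bits.iter())
        // `c as u32` would be rejected here because `T` is generic;
        // `Into<u32>` expresses the same widening through the type system.
        .map(|(&c, &b)| (c.into(), b))
        .collect()
}

fn main() {
    let from_u8 = build_code_table(&[0x02u8, 0x03], &[2, 2]);
    let from_u16 = build_code_table(&[0xFFFEu16, 0x7FFE], &[16, 15]);
    println!("{:?} {:?}", from_u8, from_u16);
}
```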

And finally, an example that is definitely rustc stupidity and not mine. Again, developers don’t consider this to be an issue but I do (and Luca seemed to agree with me since he opened an issue about it). Essentially, there is a thing called DCE (dead code elimination), so when compilers see that certain block won’t be executed they might print a warning and just check inside code for syntactic validity. Current rustc might ignore condition value and optimise code inside even if it clearly makes no sense (to the point where it crashed because of that on some nightly version, see the issue for details). And while you argue that one should not write such code, I had quite plausible use case for it: a macro that took 2- or 3-element array and did something to its values so if third value was present it had to do something special with it. But of course compilation failed because you tried to do if ARR.len() > 2 { a = ARR[2]; } with two-element array. But when I tried to check whether I got indexing correct by using large constants as indices, cargo check passed just fine—probably because const propagation did not go that deep inside my code (it was in a function called from a long chain in some sub-sub-sub-module and standalone example errors out fine). This feels quite unpolished to me.
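
A possible workaround for the 2- or 3-element macro case is to fetch the optional element with `get()` so that no constant out-of-range index appears in the expansion. The macro `combine!` and the two constants below are made up for illustration.

```rust
/// Adds the first two elements and, only if it exists, the third one.
macro_rules! combine {
    ($arr:expr) => {{
        let arr = &$arr;
        let mut acc = arr[0] + arr[1];
        // `get(2)` yields `None` for a two-element array, so the expansion
        // never contains an index the compiler could reject.
        if let Some(&third) = arr.get(2) {
            acc += third;
        }
        acc
    }};
}

const PAIR: [i32; 2] = [1, 2];
const TRIPLE: [i32; 3] = [1, 2, 3];

fn main() {
    println!("{} {}", combine!(PAIR), combine!(TRIPLE));
}
```

This trades the explicit length check for an `Option`, which the optimiser can still resolve at compile time for constant arrays.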

Oh, and final final fun thing: the calls like foo.bar(foo.baz) would still fail borrow check probably because they can’t (I guess) formalise function calling convention i.e. “if function is called then first its arguments are evaluated and copied if needed in certain order, then function address is evaluated and called with the arguments”. BTW you still have the situation like this:

struct Foo { foo: u8 }
impl Foo {
    fn bar(&mut self) -> u8 { self.foo += 1; self.foo }
}

fn fee(a: u8, b: u8) {
    println!("{} {}", a, b);
}

fn main() {
    let mut foo = Foo { foo: 42 };
    fee(foo.bar(), foo.bar());
}

And if you don’t know what’s wrong here I’ll tell you: in C the order of argument evaluation is unspecified because back in the day there were very different calling conventions and thus a compiler might need to evaluate arguments from last to first to store them in order instead of the widespread pushing of arguments to the stack in order. So depending on ABI the function would be called either as fee(43, 44) or as fee(44, 43).

Now I see two ways out of it: either detect such situation where the same object is mutably called several times and give an error or, which is better IMO, make formal calling convention so the code won’t be undefined. And fix borrow checker while doing that.


Overall, Rust is a nice experience so far since it allows code to structure much better but sometimes you hit such silly issues that spoil all the fun.

Anyway, next post should be about RealAudio Cook, the Opus of its era.

Some Information on Micronas SC4 and VoxWare MetaVoice

Sunday, April 24th, 2016

So I’ve looked at them.

Micronas SC4 seems to be rather unusual as it seems to bring elements of LPC to ADPCM. So it’s not just the old conventional “get nibble, multiply by step, output prediction, update index and step values”—it keeps a history of last 6 decoded samples and predictions and uses them to calculate a new prediction value. Details might appear in the Wiki one day.

VoxWare MetaVoice is three families of 2-3 codecs bundled under the same brand. I’ve not looked at technical details but they seem to have lots and lots of tables with floating point numbers (or just a bit of tables if you’ve looked at MetaSound first).
Here are the codecs:

  • RT24 2400bps “Real-Time” codec (ID is VOXa)
  • RT28 2844bps “Real-Time” codec (ID is VOXh)
  • RT29 2978bps “High Quality” codec (ID is VOXg)
  • VR12 1260bps Variable Rate codec (ID is VOXb)
  • VR15 1537bps Variable Rate codec (ID is VOXc)
  • SC3 3200bps “Embedded” codec (no ID)
  • SC6 6400bps “Embedded” codec (no ID)

Ask for support by grabbing j-b and demanding it to be supported. I know there are other players beside VLC but that’s the only project advertising that it “plays it all” even on T-shirts. It’s time to be responsible for your own words. And ask for Bink2 too while at it.

OptimFROG

Saturday, March 26th, 2016

You know, the greatest reverse engineer I know is Derek B. He’s managed to RE such codecs as Canopus HQX and Cineform HD in the most efficient manner ever—saying he’ll do it and patiently waiting until somebody else does it.

So here are some words about his favourite lossless audio codec. The most interesting thing about it is that it was actively developed in 2001-2006 and then it was suddenly resurrected in 2015. Also it’s one of few non-standard codecs (i.e. not made into standard) that has several articles written about it.

The codec actually consists of two different formats, seemingly an old one and a newer one (that looks like it supports the whole range of sample types). The former is notable for having signal reconstruction stage using floating point math (a thing you don’t see in codecs every day), the latter seems to employ various parameter reading and reconstruction methods. Coding is done using low precision range coder (large values are decoded using chunks of 8 or 12 bits). So nothing really interesting there.

P.S. I’m definitely not going to write a decoder for it. There are too many lossless audio codecs already, let all proprietary ones (in custom containers too) die in peace.

A Call for Modern Audio Codec

Wednesday, February 11th, 2015

We need a proper audio codec to accompany state of the art video codecs, so here’s an outline of codec features that should be present:

  • audio codec should make more of its context, it should have a system of forward and backward reference frames like B-pyramid in H.264 or H.265;
  • it should employ tonal compensation with that — track the frequency changes from the references (e.g. it may be the same note continued or changing pitch);
  • time domain prediction via FIR or IIR filters;
  • flexible subdivision into subframes like binary tree;
  • raw (or non-transformed at least) coding mode for transients or noise;
  • integer only bitexact transform that passes for MDCT under bad light;
  • high-bitdepth sound support (up to 64 bits per sample).

The project name is transGhost (hopefully no Monty will be hurt by this).

And if you point out this is stupid — well, audio codecs should have the same rights as video codecs including PTS/DTS differences and employing similar coding methods.

Blåtand-Passande-X

Sunday, November 23rd, 2014

So, finally there’s a post about some codec.

It is a specialised codec from Oxford Germanium Television (all names are changed just in case) that has 4:1 compression ratio and very niche use. It’s hard to find even a decoder for it so this analysis was done on ARM version of encoder (maybe I’ll be able to RE something more useful next time like VX).

The codec itself is rather simple: you take 4 samples from one channel, compress them, output the 16-bit result and repeat the same for the second channel. Encoding is rather simple too:

  1. feed input to 4-band QMF (with filter looking a lot like D4 wavelet to me);
  2. perform ADPCM on each band (this varies a bit for each band but it’s the same approach);
  3. generate output word (7 bits for band 0, 4 bits for band 1, 2 bits for band 2 and 3 plus a parity bit for them all).
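
The last step can be sketched as follows. The field widths (7 + 4 + 2 + 2 bits plus parity) come from the list above, but the bit positions and the even-parity rule are assumptions made purely for illustration, as are the names `BandCodes`, `pack_word` and `unpack_word`.

```rust
/// Quantised ADPCM codes for the four sub-bands of one 4-sample group.
#[derive(Debug, PartialEq, Clone, Copy)]
struct BandCodes {
    b0: u8, // 7 bits
    b1: u8, // 4 bits
    b2: u8, // 2 bits
    b3: u8, // 2 bits
}

/// Packs the four codes and a parity bit into one 16-bit word.
/// Assumed layout: parity in bit 15, then band 0, band 1, band 2, band 3.
fn pack_word(c: BandCodes) -> u16 {
    let payload = (u16::from(c.b0 & 0x7F) << 8)
        | (u16::from(c.b1 & 0x0F) << 4)
        | (u16::from(c.b2 & 0x03) << 2)
        | u16::from(c.b3 & 0x03);
    // Even parity: the bit makes the number of set bits in the word even.
    let parity = (payload.count_ones() & 1) as u16;
    (parity << 15) | payload
}

/// Reverses `pack_word`, returning `None` when the parity check fails.
fn unpack_word(word: u16) -> Option<BandCodes> {
    if word.count_ones() % 2 != 0 {
        return None;
    }
    Some(BandCodes {
        b0: ((word >> 8) & 0x7F) as u8,
        b1: ((word >> 4) & 0x0F) as u8,
        b2: ((word >> 2) & 0x03) as u8,
        b3: (word & 0x03) as u8,
    })
}

fn main() {
    let codes = BandCodes { b0: 0x55, b1: 0xA, b2: 1, b3: 2 };
    let word = pack_word(codes);
    println!("{:#06x} -> {:?}", word, unpack_word(word));
}
```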

Since I have no samples of it don’t expect a decoder from me any time soon (and I don’t have enough motivation to hook Android encoder directly to make it produce data). Not that anyone cares about it either.

On Some Annoying Audio Codecs Family

Tuesday, July 29th, 2014

For the reasons I can’t disclose I really hate DTS codecs. For those who don’t know there are about three and a half codecs in this family:

  • DTS Core
  • DTS Core extensions (bitrate extension, two extensions for more channels and an extension for upsampling e.g. 48 kHz -> 96 kHz)
  • DTS Lossless (which might depend on core and extend/replace its channels)
  • DTS LBR (aka Express profile)

You need to be Jean-Baptiste Kempf to love these formats: DTS Core uses annoyingly large tables, DTS Lossless relies on DTS Core part being decoded properly for it, DTS LBR is a special beast that I’ll describe below. And the best part — all those formats are poorly documented (tables are missing for DTS Core, something was missing for DTS Core X96k extension too, bitexact core reconstruction and some other things needed for real lossless decoder implementation are not documented, LBR is not much better either).

So, what makes DTS LBR special? Its coding mode of course. This is a weird codec that employs MDCT (nothing special so far), codes tones separately (that’s not so common) and spreads it all among many chunks for different resolutions that make it “scalable” or whatever.

Nevertheless this post is not about how horrible all those codecs are (if you have ever worked with them it’s obvious and Jean-Baptiste Kempf won’t believe it anyway), it’s about obscure relations with other codecs.

When I looked at QDesign Music codec (unsupported by Libav currently) I found that it has suspiciously familiar coding scheme for tones (QDesign Music 1/2 also use tone detection in MDCT frames) — I’ve seen it in DTS LBR. And indeed, it seems the same guy created some codec called LBpack that was first to use that approach, then he was employed by QDesign and then by DTS. No wonder it looked similar.

Another piece of trivia — there was one guy working on so-called adaptive prediction and transform scheme. Later the prototype known as APT100 was turned into DTS Core. But looks like the same work gave birth to lesser-known codec APT-X (that I’m currently REing but that’s beside the point). And it’s not just the name — one codec employs QMF and ADPCM on subbands, another one employs QMF and optional ADPCM on subbands.

All that makes one wonder whether DTS Lossless is related to some lossless codec outside DTS (not necessarily APT Lossless but might be, no details are known about that one). Currently I cannot name any other lossless codec that employs the same coding approach (block coding with different coding for large and small coefficients plus non-adaptive filter). Of course such knowledge won’t change anything but it would be still interesting to know.

P.S. There are rumours that DTS LBR will be made scalable for adaptive streaming, what a fun that will be!
P.P.S. This post was written mainly to test how well new Mike’s setup works.

Voxware Codecs and Tags

Saturday, August 10th, 2013

If you look at the registry of WAV formats you can see this:


0x0069 WAVE_FORMAT_VOXWARE_BYTE_ALIGNED Voxware, Inc.
0x0070 WAVE_FORMAT_VOXWARE_AC8 Voxware, Inc.
0x0071 WAVE_FORMAT_VOXWARE_AC10 Voxware, Inc.
0x0072 WAVE_FORMAT_VOXWARE_AC16 Voxware, Inc.
0x0073 WAVE_FORMAT_VOXWARE_AC20 Voxware, Inc.
0x0074 WAVE_FORMAT_VOXWARE_RT24 Voxware, Inc.
0x0075 WAVE_FORMAT_VOXWARE_RT29 Voxware, Inc.
0x0076 WAVE_FORMAT_VOXWARE_RT29HW Voxware, Inc.
0x0077 WAVE_FORMAT_VOXWARE_VR12 Voxware, Inc.
0x0078 WAVE_FORMAT_VOXWARE_VR18 Voxware, Inc.
0x0079 WAVE_FORMAT_VOXWARE_TQ40 Voxware, Inc.
0x007A WAVE_FORMAT_VOXWARE_SC3 Voxware, Inc.
0x007B WAVE_FORMAT_VOXWARE_SC3 Voxware, Inc.
0x0081 WAVE_FORMAT_VOXWARE_TQ60 Voxware, Inc.

In reality there’s one codec with several variations (MetaSound) and a family of low-bitrate MetaVoice codecs. And it doesn’t really matter what ID you’ll use — codec extradata contains real tag used to distinguish one codec from another. That’s why we can have 0x0075 format reserved for Voxware RT29 speech codec but used by MetaSound instead.

Here’s the list of internal tags:

  • VOXa — MetaVoice RT24, 8 kHz, mono, 2.4kbps
  • VOXb — MetaVoice VR12, 8 kHz, mono, 1.2kbps (variable bitrate)
  • VOXc — MetaVoice VR15, 8 kHz, mono, 1.5kbps (variable bitrate)
  • VOXg — MetaVoice RT29HQ, 8 kHz, mono, 2.98kbps (called high-quality for some reason)
  • VOXh — MetaVoice RT28, 8 kHz, mono, 2.8kbps
  • VOXi — MetaSound AC08, 8 kHz, mono, 8kbps
  • VOXj — MetaSound AC10, 11 kHz, mono, 10kbps
  • VOXk — MetaSound AC16, 16 kHz, mono, 16kbps
  • VOXL — MetaSound AC24, 22 kHz, mono, 24kbps
  • VOXq-VOXz — MetaSound mono and stereo, various formats
  • VX01 — MetaVoice SC3, 8 kHz, mono, 3.2kbps (embedded)
  • VX02 — MetaVoice SC6, 8 kHz, mono, 6.4kbps (embedded)
  • VX03 — MetaSound, 8 kHz, mono, 6kbps
  • VX04 — MetaSound, 8 kHz, stereo, 12kbps
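
Since the internal tag and not the WAV format ID decides which codec is used, a decoder would dispatch on the tag. A rough sketch of that lookup, built only from the list above, is shown here. How the tag is located inside the extradata is not covered, and `Family` and `classify_tag` are invented names.

```rust
/// The two codec families hiding behind the Voxware format IDs.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Family {
    MetaVoice,
    MetaSound,
}

/// Maps a four-byte internal tag to its codec family, following the list above.
fn classify_tag(tag: &[u8; 4]) -> Option<Family> {
    match tag {
        b"VOXa" | b"VOXb" | b"VOXc" | b"VOXg" | b"VOXh" | b"VX01" | b"VX02" => {
            Some(Family::MetaVoice)
        }
        b"VOXi" | b"VOXj" | b"VOXk" | b"VOXL" | b"VX03" | b"VX04" => Some(Family::MetaSound),
        // VOXq-VOXz: MetaSound mono and stereo in various formats.
        [b'V', b'O', b'X', b'q'..=b'z'] => Some(Family::MetaSound),
        _ => None,
    }
}

fn main() {
    for tag in [b"VOXa", b"VOXk", b"VOXt", b"VOXm"] {
        println!("{} -> {:?}", String::from_utf8_lossy(tag), classify_tag(tag));
    }
}
```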

So, maybe RT29 does not exist and it should be RT28 instead; obviously RT29HW is a typo for RT29HQ and the second SC3 should be SC6 in the registry (and unfortunately there’s no information about TQ40/TQ60). But who is going to correct WAVE formats list because of facts?

P.S. It would be nice to receive samples for all MetaSound modes (encoder is still available and should work on older Windows systems).

A Quest Continues

Friday, June 28th, 2013

Well, after some distraction as writing semi-working On2 AVC decoder (it turned out that On2 has introduced some special modes there that differ only on signal reconstruction stage, too lazy to RE them) and recovering after heat wave I’ve returned to the VoxWare ElenrilSound decoder.

I hate parametric codecs — no matter how you screw calculations you’ll still get some output but it won’t be useful for debugging. At least I can use MPlayer2 + binary codec loader + gdb combination to extract runtime information from the reference decoder.

Now I’m trying to make at least one mode work properly, 16kHz@16kbps mono (aka VOXk) for now. Stereo reconstruction might be trickier so I’ll leave it for later but at least most modes differ only by the tables they use. So (in theory) I’ll need to make at least this mode work, add tables for other modes, fix stereo decoding, look at 8kHz@6kbps mode, curse and forget about it.

Good news — bit allocation works properly and bits are read exactly as in the reference decoder. Bad news — reconstructed output is not even close to the expected one, so the work continues…