General Thoughts about Reverse Engineering Speech Codecs

Spoiler: they are not nice.

Speech codecs are probably the worst from my REing point of view. Why? Not because they are particularly hard to RE but rather because they are unpleasant. Here’s my list of reasons:

  • They are math-heavy. Of course you need to know mathematics to understand most codecs, but even though I had no DSP courses at the university I can understand how video codecs work even in fine detail, and the same goes for many audio codecs. With speech codecs I have only a general idea of how they work.
  • Even worse, there are hardly any conventions on how to do things, and as a result codecs are built in a process that puts designing ARM SoCs to shame.
  • Even worse, because the codecs are math-heavy they have to be implemented with efficiency in mind, which results in horrible fixed-point math, usually in 16-bit variables (see the sketch after this list).
  • And as if all of this wasn’t enough, if a codec supports several bitrates it might have additional postprocessing functions for the lower bitrates in the best case. In the usual case it’s a different codec.
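To give a taste of the fixed-point part, here is a minimal sketch (in C, not taken from any particular codec, all names mine) of the kind of Q15 arithmetic those 16-bit implementations are built on: a value in [-1, 1) is stored scaled by 2^15 and every intermediate 32-bit result has to be saturated back to 16 bits.

    #include <stdint.h>

    /* Q15 fixed point: a 16-bit value represents x / 32768, i.e. a number in [-1, 1). */

    /* saturate a 32-bit intermediate result to the 16-bit range */
    static int16_t sat16(int32_t x)
    {
        if (x >  32767) return  32767;
        if (x < -32768) return -32768;
        return (int16_t)x;
    }

    /* Q15 multiplication: the product of two Q15 values is Q30, shift it back down */
    static int16_t mul_q15(int16_t a, int16_t b)
    {
        return sat16(((int32_t)a * b) >> 15);
    }

    /* Q15 addition with saturation */
    static int16_t add_q15(int16_t a, int16_t b)
    {
        return sat16((int32_t)a + b);
    }

Now picture a whole synthesis filter, codebook search and postfilter written in this style, with extra shifts and rounding constants sprinkled everywhere, and you get an idea of what the decompiled code looks like.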

And all of that is the source of REing problems. The bitstream format is usually easy to find and parse, but the functions that do something with it are not easy to understand at all. I often end up not understanding what a function does, let alone what concept it implements. I might recognize only some of them, like the LPC filter.

So why are speech codecs so badly designed? In my opinion it comes from the original design decision. The initial idea was to use as few bits as possible by having a synthetic model and transmitting only its parameters. It worked great (at least compared to MPEG-4 with its synthetic scene and audio description, aka the key parts of the standard that people pretend do not exist at all). So you have the human throat, which is basically a variable tube: vowels are modulated tone, consonants are modulated noise. Transmit the filter coefficients (original LPC, LSF, parcor form or something else), a noise flag and the pitch frequency and you’re done, right? It works fine for some sounds but the quality is not that great, and it fails completely with some sounds (the sounds often used by French for example, so there’s still a need for French Speech Codec or j-bc for short). How to improve it? By adding impulses to “excite” the model (i.e. tell it when to start/stop the sounds). Not good enough? Add pitch tilting! Still not good enough? Add a postprocessing filter (and it was mandatory there long before video codecs). And what if we want to code not just voice but higher frequency range audio too? Well, add…
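To make that basic model a bit more concrete, here is a minimal decoder-side sketch of it: build an excitation (a pulse train at the pitch period for voiced sounds, noise for unvoiced ones) and push it through an all-pole LPC synthesis filter. It is written in floating point for readability, everything in it is made up for illustration, and real codecs pile the stages listed above on top of it.

    #include <stdlib.h>

    #define LPC_ORDER 10

    /* voiced frames: a pulse train at the pitch period; unvoiced frames: noise */
    static void make_excitation(float *exc, int nsamples, int voiced, int pitch_period)
    {
        for (int n = 0; n < nsamples; n++) {
            if (voiced)
                exc[n] = (n % pitch_period == 0) ? 1.0f : 0.0f;
            else
                exc[n] = (float)rand() / RAND_MAX - 0.5f;
        }
    }

    /* all-pole synthesis filter: out[n] = exc[n] - sum(lpc[i] * out[n - 1 - i]) */
    static void lpc_synth(const float lpc[LPC_ORDER], const float *exc,
                          float *out, int nsamples, float hist[LPC_ORDER])
    {
        for (int n = 0; n < nsamples; n++) {
            float sum = exc[n];
            for (int i = 0; i < LPC_ORDER; i++)
                sum -= lpc[i] * hist[i];
            /* shift the filter history and store the new sample */
            for (int i = LPC_ORDER - 1; i > 0; i--)
                hist[i] = hist[i - 1];
            hist[0] = sum;
            out[n] = sum;
        }
    }

CELP-family codecs go further and pick the excitation from codebooks, with the encoder searching for the entry that sounds least wrong after synthesis.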

And thus it became a pile of hacks over a pile of hacks with a side dish of hacks. And each stage can be done in several different ways, which only adds to the confusion. That’s not even starting to talk about smart ways to save bits by splitting a frame into several subframes and omitting some information for selected subframes (it can be interpolated from the other subframes’ information after all). Or using codebooks and vector quantisation. Or how to generate noise for silent frames. Or using better coding than just writing a fixed number of bits for every element. Or…
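For the vector quantisation part at least the decoder side is trivial: it is just a codebook lookup, while the encoder has to search for the nearest entry. A toy sketch (dimension, size and the zeroed codebook are all made up; real codecs use trained and often split or multi-stage codebooks):

    #include <float.h>

    #define VEC_DIM       4
    #define CODEBOOK_SIZE 16

    /* zeroed here for brevity; a real codec ships trained entries */
    static const float codebook[CODEBOOK_SIZE][VEC_DIM] = {{ 0.0f }};

    /* decoder: index -> vector */
    static void vq_decode(int index, float out[VEC_DIM])
    {
        for (int i = 0; i < VEC_DIM; i++)
            out[i] = codebook[index][i];
    }

    /* encoder: find the codebook entry closest (in squared error) to the input */
    static int vq_encode(const float in[VEC_DIM])
    {
        int   best      = 0;
        float best_dist = FLT_MAX;

        for (int c = 0; c < CODEBOOK_SIZE; c++) {
            float dist = 0.0f;
            for (int i = 0; i < VEC_DIM; i++) {
                float d = in[i] - codebook[c][i];
                dist += d * d;
            }
            if (dist < best_dist) {
                best_dist = dist;
                best      = c;
            }
        }
        return best;
    }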


I’ve finished looking at the Lernout&Hauspie CELP+SBC codecs (as usual I don’t understand most of the things they do there, but maybe I’ll still document them), and this plus my past experience made me write this post. Next is VoxWare MetaVoice and maybe Micronas SC4. And something saner afterwards. Or maybe it will be the usual nothing.
