Why codecs are designed like this and why they are not very interchangeable

Sometimes I have to explain the role of various codecs and why it’s pointless in most cases to adapt compression tricks from image codecs to audio codecs (and vice versa) and even from lossy to lossless codecs for the same kind of content. If you understand that already then you’ll find no new information here.

Yours truly
Captain Obvious

Let’s start with the classification: codecs can be categorised by the content they work on (audio, image, video, text, some other kind of data) and by how well they preserve the original content (lossy and lossless). Here I’ll review only audio and video codecs (and image codecs a bit) but not text codecs. The only things worth mentioning about text are that its compression is predominantly lossless (I remember seeing only one lossy text compressor in my life: some IBM tool for forming a digest of the text by selecting only the most important lines) and that it follows the same principles as described below, except that the amount of data is rather small (compared to audio/video) and it is compressed as a whole instead of as (semi)independent pieces.

Now another fact: any compression scheme can be represented as modelling input data and coding the representation of what that model outputs. A model may have a preprocessing stage (which rearranges input data into a form that is more convenient to compress), a transform stage (where you represent input data in a different way) and a prediction stage (where your model takes some previously seen data into account and outputs the difference between the predicted and the actual input value). The stages of the model can be lossy or lossless.

Here are some examples:

you often have a preprocessing stage in general compressors and archivers, for example when they detect table data (e.g. what looks like a series of 32-bit integers) they replace absolute values with differences in the hope that compressing small deltas will be more effective;
in some old codecs you could see a “lossy RLE” effect where small variations of colour were replaced with the same colour value in order to achieve better compression;
the transform stage is very common in lossy audio codecs: you get input data in the time domain and you transform it into the frequency (spectral) domain (which can be compressed further);
similarly in video codecs you operate not on a whole image but on a series of blocks, nowadays mostly with an integer approximation of 2D DCT applied;
prediction stage in lossless audio codecs is straightforward: you take 10-1024 previous samples, calculate some value from them and output the difference between that calculated number and the actual input one. In the best case you’ll have zero difference and a sequence of zeroes is easy to compress;
in image coding you can predict pixel value from its neighbours, in video you can also use motion compensation to predict a whole block from some reference frame;
you can also call RLE a kind of prediction: it expects the next symbol to be the same as the previous ones.
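
The first and the last items of that list are simple enough to show in a few lines. What follows is only a toy sketch of my own (the function names are made up for illustration and not taken from any real archiver): delta preprocessing for table-like data, and RLE viewed as “predict that the next symbol repeats”.

```python
def delta_encode(values):
    """Replace absolute values with differences from the previous value."""
    prev = 0
    deltas = []
    for v in values:
        deltas.append(v - prev)
        prev = v
    return deltas


def delta_decode(deltas):
    """Undo delta_encode by accumulating the differences."""
    total = 0
    values = []
    for d in deltas:
        total += d
        values.append(total)
    return values


def rle_encode(symbols):
    """RLE as prediction: a run grows while the next symbol equals the previous one."""
    runs = []
    for s in symbols:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1
        else:
            runs.append([s, 1])
    return [(s, n) for s, n in runs]


# Hypothetical table column: large absolute values, constant step.
table = [1000, 1004, 1008, 1012, 1016, 1020]
deltas = delta_encode(table)      # [1000, 4, 4, 4, 4, 4]
runs = rle_encode(deltas[1:])     # [(4, 5)]
```

After the delta step the data is mostly one small repeated value, which is exactly the kind of output the following coding stage likes.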

And after you obtain the values from the model you code them using some conventional method: fixed bit-fields (sometimes bytes), variable-length codes from a fixed codebook (common in lossy codecs), variable-length adaptive universal codes (like Rice code, very common in lossless codecs), arithmetic coding or its equivalent (various binary coders, asymmetric numeral systems and such).
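
To give an idea of what a Rice code looks like, here is a minimal sketch that works on bit strings for readability. The bit convention (quotient as a run of ones ended by a zero, then k remainder bits) and the signed-to-unsigned mapping are my assumptions for the example; real codecs each define their own exact layout and pack bits properly.

```python
def zigzag(n):
    """Map a signed residual to a non-negative one: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return (n << 1) if n >= 0 else (-n << 1) - 1


def unzigzag(u):
    """Inverse of zigzag()."""
    return (u >> 1) if u % 2 == 0 else -((u + 1) >> 1)


def rice_encode(value, k):
    """Rice code of a non-negative integer with parameter k, as a bit string."""
    bits = "1" * (value >> k) + "0"        # unary quotient, terminated by a zero
    if k:
        bits += format(value & ((1 << k) - 1), "0{}b".format(k))  # k-bit remainder
    return bits


def rice_decode(bits, k, count):
    """Decode `count` Rice-coded non-negative integers from a bit string."""
    values = []
    pos = 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1                            # skip the terminating zero
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append((q << k) | r)
    return values


residuals = [0, -1, 2, -3, 1, 0, 7]         # made-up prediction errors
k = 1
stream = "".join(rice_encode(zigzag(r), k) for r in residuals)
decoded = [unzigzag(u) for u in rice_decode(stream, k, len(residuals))]
```

Small values get short codes and large ones get long codes, so the better the prediction, the fewer bits come out. An adaptive variant would keep adjusting k according to the recent residual magnitudes.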

Here are some examples of codec designs from that point of view:

FLAC (i.e. lossless audio compression) is done by applying either RLE (you can fill a whole block with the same value) or predicting samples using 1-32 previous samples and coding differences with fixed-parameter Rice codes (you can split it into several partitions to improve compression);
MP3 (i.e. lossy audio compression) is done by applying transforms (QMF and MDCT) that output lossy representation of a signal (with some frequencies masked and such), then those coefficients are quantised (another lossy stage) and compressed using static codebooks;
PNG (i.e. lossless image coding) first applies pixel prediction from up to three of its neighbours (left, top and top-left) and then runs deflate compression over the data (deflate is LZ77 parsing of the stream into literals and references to the previous data and coding those using fixed variable-length codes or codes generated specially for this block);
baseline JPEG (i.e. lossy image coding) splits image into 8×8 blocks, applies DCT and quantisation on them (obtaining transformed coefficients), applies lossless DC prediction and codes coefficients and runs of zeroes using some set of variable-length codes;
many lossless video codecs simply apply pixel prediction from 1-10 of their neighbours and code the difference using adaptive Rice codes;
baseline H.264 (lossy video codec) splits frame into 4×4 blocks (again, this is a simplified view), applies block prediction either by generating block contents from its neighbouring blocks (aka spatial or intra prediction) or by finding a good reference block in some previous frame (aka motion compensation), transforms and quantises the difference between base and input block, and finally codes coefficients using either context-adaptive codebooks or context-adaptive binary coding.
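
As an illustration of the “predict, then hand the residues to a generic coder” design, here is a rough sketch in the spirit of PNG. It is simplified on purpose and those simplifications are mine: 8-bit greyscale only, the Paeth predictor on every row (a real encoder picks a filter per row and stores a filter type byte), and no PNG container at all. Only `zlib` is a real library here, the other names are hypothetical.

```python
import zlib


def paeth(a, b, c):
    """Paeth predictor: pick whichever of left (a), top (b), top-left (c) is closest to a + b - c."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def filter_rows(rows):
    """Replace every pixel with its prediction error modulo 256."""
    out = []
    prev = [0] * len(rows[0])
    for row in rows:
        res = []
        for i, x in enumerate(row):
            a = row[i - 1] if i else 0
            c = prev[i - 1] if i else 0
            res.append((x - paeth(a, prev[i], c)) & 0xFF)
        out.append(res)
        prev = row
    return out


def unfilter_rows(filtered):
    """Rebuild pixels from prediction errors, row by row."""
    rows = []
    prev = [0] * len(filtered[0])
    for res in filtered:
        row = []
        for i, r in enumerate(res):
            a = row[i - 1] if i else 0
            c = prev[i - 1] if i else 0
            row.append((r + paeth(a, prev[i], c)) & 0xFF)
        rows.append(row)
        prev = row
    return rows


# Synthetic 64x64 gradient standing in for a continuous-tone image.
image = [[(x + 2 * y) % 256 for x in range(64)] for y in range(64)]
filtered = filter_rows(image)
flat = bytes(v for row in filtered for v in row)
packed = zlib.compress(flat)       # deflate does the actual entropy coding
```

The model (prediction) and the coder (deflate) are completely separate pieces here, which is the point of the whole classification above.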

This is a rather simplified scheme, but as you can see the main difference is how data is represented inside the codec. If the model gives as little error as possible (i.e. you can predict the input really well using your model) then you can code it very efficiently.

Now what drives the model design? Mostly it’s the characteristics of the input source (i.e. what range it has and how it behaves) and the requirements for the compressed data output (i.e. how large the chunks you compress are and whether they are allowed to reference previously encoded chunks; plus some practical requirements like being able to encode and decode data on the fly or some other constraints imposed by the transmission medium or hardware).

So what are the typical sources and requirements?

general audio: you can represent it either as a waveform coded in relatively large blocks (seconds of audio) for lossless codecs, or as fixed-size blocks of frequency coefficients where you can throw away the frequencies you don’t hear well, for a typical lossy psychoacoustic audio codec;
speech: you start with a model that modulates a pure tone or noise a lot like the human throat does and code parameters for that model. The frames are typically very short, just a couple of hundred samples;
continuous-tone images (aka photos and photorealistic images, not drawings done with few colours): lossless coding exploits the fact that a pixel rarely differs by much from its neighbours, lossy coding relies on the eye being less sensitive to high frequencies (so you can transform a whole image or blocks of it, quantise frequencies, throw some of them out and code the sparse results);
continuous-tone video: lossless video coding simply represents it as a series of images, lossy coding exploits the fact that subsequent frames usually don’t differ by much and you can track moving objects in them (hence the idea of motion compensation and predicting motion vectors from neighbouring blocks and previous frames).
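
The “transform, quantise, throw something out” part of lossy coding can be sketched in one dimension. This is only an illustration under my own assumptions: a floating-point orthonormal DCT on 8 samples with one uniform quantiser step, while real image and video codecs use 2D (mostly integer) transforms and per-frequency quantisers. The sample values are made up.

```python
import math

N = 8


def _scale(k):
    """Scaling that makes the DCT orthonormal."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def dct(block):
    """Orthonormal DCT-II of N samples."""
    return [
        _scale(k) * sum(x * math.cos(math.pi * (n + 0.5) * k / N)
                        for n, x in enumerate(block))
        for k in range(N)
    ]


def idct(coeffs):
    """Inverse of dct()."""
    return [
        sum(_scale(k) * c * math.cos(math.pi * (n + 0.5) * k / N)
            for k, c in enumerate(coeffs))
        for n in range(N)
    ]


def quantise(coeffs, step):
    """The lossy part: keep only the nearest multiple of `step`."""
    return [int(round(c / step)) for c in coeffs]


def dequantise(levels, step):
    return [level * step for level in levels]


pixels = [52, 55, 61, 66, 70, 61, 64, 73]   # one row of a hypothetical block
step = 10
levels = quantise(dct(pixels), step)        # small integers, what actually gets coded
restored = idct(dequantise(levels, step))   # close to the input but not identical
```

Without the quantiser the transform itself is reversible (up to rounding), so all the loss and most of the bitrate saving come from how coarsely you choose to quantise.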

And don’t forget that you have very different amounts of data to deal with. For speech you usually have 8-64kB per second, for general audio it’s 176-192kB per second, for 1080p video it’s over seventy megabytes per second! That means not only a different amount of data to process but also a different amount of already encoded data to use as a reference for encoding (which may improve compression drastically if the preceding data is similar to the current one). And while lossy video codecs can use up to dozens of reference frames (sometimes simultaneously), audio codecs are not permitted to do that and their frames should be coded more or less independently (the only exceptions are using the previous frame as transform history for QMF or MDCT, and some lossy codecs like Musepack or Opus that actually code frames as changes to the previous one).
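
For the curious, those raw data rates are easy to reproduce. The concrete formats below are my guesses picked to land on the figures quoted above (they are not spelled out in the text): 8-bit 8 kHz and 16-bit 32 kHz mono for speech, 16-bit stereo at 44.1 and 48 kHz for general audio, and 8-bit 4:2:0 video at 24 frames per second.

```python
def pcm_bytes_per_second(sample_rate, channels, bytes_per_sample):
    """Raw PCM data rate."""
    return sample_rate * channels * bytes_per_sample


def yuv420_bytes_per_second(width, height, fps):
    """Raw 8-bit 4:2:0 video: a full-size luma plane plus two quarter-size chroma planes."""
    bytes_per_frame = width * height * 3 // 2
    return bytes_per_frame * fps


rates = {
    "speech, 8 kHz mono 8-bit": pcm_bytes_per_second(8000, 1, 1),          # 8000
    "speech, 32 kHz mono 16-bit": pcm_bytes_per_second(32000, 1, 2),       # 64000
    "audio, 44.1 kHz stereo 16-bit": pcm_bytes_per_second(44100, 2, 2),    # 176400
    "audio, 48 kHz stereo 16-bit": pcm_bytes_per_second(48000, 2, 2),      # 192000
    "video, 1080p 4:2:0 at 24 fps": yuv420_bytes_per_second(1920, 1080, 24),  # 74649600
}

for name, rate in rates.items():
    print("{:32s} {:>12,d} bytes/s".format(name, rate))
```

So one second of raw video under these assumptions weighs about as much as several minutes of CD-quality audio, and the codec has correspondingly more past data to predict from.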

Now it should be clear why it’s pointless to bring a full design from one codec type to another (even if it’s just a lossy/lossless codec change): different models require different types of input source. Sure, you can make a lossless codec from a lossy one by preserving all information (if there are no bitrate constraints) or a lossy codec from a lossless one (by replacing some elements with those that are easier to code, e.g. replacing small differences with zeroes), but such codecs won’t outperform the native codecs in their new field, and for the same reason: their model is tuned for a certain kind of input. A lossy-turned-lossless codec requires more bits to code the same data after transformation/prediction, while a lossless-turned-lossy codec starts with a high enough bitrate (typical compression ratio for lossless coding is 2-4x, for the lossy case it can be hundreds) and can’t be brought down to typical lossy bitrates without serious distortions. Side note: they might be a match at so-called visually lossless or near-lossless quality but I have not seen any comparisons for that.
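
Here is what “making a lossy codec from a lossless one by replacing small differences with zeroes” may look like in its most primitive form. It is a toy sketch with made-up names and data and a previous-sample predictor, not a description of any real near-lossless codec. Note that the encoder has to predict from what the decoder will reconstruct, otherwise the errors accumulate.

```python
def near_lossless_encode(samples, tolerance):
    """Previous-sample prediction where residuals within +/-tolerance become zero."""
    residuals = []
    recon = 0                      # the value the decoder will have at this point
    for s in samples:
        diff = s - recon
        if abs(diff) <= tolerance:
            diff = 0               # cheaper to code, costs at most `tolerance` of error
        residuals.append(diff)
        recon += diff
    return residuals


def near_lossless_decode(residuals):
    """Accumulate residuals back into samples."""
    out = []
    recon = 0
    for d in residuals:
        recon += d
        out.append(recon)
    return out


samples = [100, 101, 99, 100, 120, 121, 119, 90]
lossy = near_lossless_encode(samples, 2)      # [100, 0, 0, 0, 20, 0, 0, -30]
decoded = near_lossless_decode(lossy)         # [100, 100, 100, 100, 120, 120, 120, 90]
lossless = near_lossless_encode(samples, 0)   # tolerance 0 keeps every difference
```

The error is bounded by the tolerance, but every large difference still has to be coded in full, so the bitrate can only go down so far. That is the “can’t be brought down to typical lossy bitrates” problem in miniature.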

And if you point out that modern video codecs have a lossless mode, then remember it’s added there mostly to make the codec universal (since the need for archival is non-negligible and you’d not want to have a separate codec for that). A quick fact: in the 2007 MSU lossless codec comparison FFV1 was found to compress video content both better and faster than x264 in lossless mode (and mind you, there were several codecs with even better compression there).

Of course some coding tricks may be reused from one method to another if you have a similar task, like coding residues after prediction in both audio and video compression. But reusing transforms does not work that well: you don’t use modulated noise for video coding (MPEG-5 LCEVC is the exception but it’s not a stand-alone video codec); you don’t normally use wavelet transform for audio coding (except for one Chinese lossless codec IIRC); even an attempt to bring PVQ from a speech codec to a video codec reminds me that the aforementioned video codec is named after a person with erratic behaviour and a head trauma.

And now you know why different codecs achieve different compression ratios and why you should have domain knowledge about the data you’re trying to compress if you want to create your own compression scheme, instead of trying to apply a coding scheme for a different kind of data as is.

This entry was posted on Monday, August 2nd, 2021 at 12:36 pm and is filed under Audio, Useless Rants.

2 Responses to “Why codecs are designed like this and why they are not very interchangeable”

Paul says:

August 3, 2021 at 4:59 pm

I want to find wavelet audio codec, optionally lossless, is there such beast?
Kostya says:

August 3, 2021 at 11:45 pm

Yes, IEEE 1857.2 also known as Chinese AAC has wavelet transform as part of its coding.

Plus there are various experiments based on wavelet compression not reaching a proper codec stage.