Why wavelets are a bad choice for image coding

I’ve been teasing it for too long and the time for this post has finally come. I have good reasons to believe that wavelets are not the best choice for image compression, so here are my arguments.

Let’s start from the basics: the Fourier transform and DCT essentially represent a signal (function, data set, whatever) as a sum of sine waves of increasing frequencies spanning the whole range; the wavelet transform represents a signal by splitting it into low- and high-pass parts (roughly speaking, averages of and differences between neighbouring signal points) and repeating the process recursively on each resulting band (i.e. split the low-pass part into low-low-pass and high-low-pass parts, split the high-pass part into low-high-pass and high-high-pass parts, then split each of those in the same way, and so on).
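
If that sounds abstract, here is a minimal sketch of a single split using the simplest Haar-style filter pair (averages as the low-pass part, differences as the high-pass part); real codecs use longer filters and lifting schemes, but the principle is the same:

```c
#include <stdio.h>

/* One level of the simplest (Haar-style) wavelet split:
 * low-pass  = averages of neighbouring sample pairs,
 * high-pass = differences between them.
 * A full transform would repeat this on the low-pass part. */
static void haar_split(const int *src, int *low, int *high, int pairs)
{
    for (int i = 0; i < pairs; i++) {
        int a = src[2 * i];
        int b = src[2 * i + 1];
        low[i]  = (a + b) / 2; /* coarse approximation */
        high[i] = a - b;       /* detail */
    }
}

int main(void)
{
    int signal[8] = { 10, 12, 11, 13, 200, 198, 15, 14 };
    int low[4], high[4];

    haar_split(signal, low, high, 4);
    for (int i = 0; i < 4; i++)
        printf("low=%4d high=%4d\n", low[i], high[i]);
    return 0;
}
```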

In theory this enables better signal analysis because you can localise transients, which is not possible with FFT/DCT (and this is the reason why WT was invented in the first place). I believe it makes a lot of sense for stochastic processes (like the Brownian motion it was initially applied to); maybe it is a good fit for audio (I don’t know much about DSP but QMF looks suspiciously related, there are at least two wavelet-based audio codecs too, and de-noising looked promising…) but images are a different thing.

Just look at the sample image from Wickedpedia. The first thing you’ll notice is that nobody bothers to transform the high-pass coefficients further. Wavelet filters are essentially used to split out the low-pass part recursively, while the high-pass parts are usually areas of zeroes plus rather incompressible noise-like coefficients. That’s the reason why nobody bothers to do much with those bands: just quantise them and code the zero regions as effectively as possible (insert the joke about true wavelet compression not having been tried yet if you like). Mind you, there’s nothing wrong with using it for scalability (as some formats did, like Indeo 4 and 5; and so-called mezzanine or intermediate codecs, where scalability matters more than compression ratio, are probably the main niche where wavelets are still alive), it’s just that the base (low-low-…-low-pass) image is better coded in a different way (to the point that sometimes wavelet-based formats store it in raw form instead, or do not bother with compression beyond basic Huffman coding).
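
To illustrate, here is a toy stand-in for what typically happens to a detail band (the coefficient values are made up; real formats use far fancier entropy coders): quantise it, watch most coefficients collapse to zero and count the zero runs.

```c
#include <stdio.h>

/* Quantise a high-pass band and run-length count the zeroes:
 * a toy version of what wavelet coders do to detail bands. */
int main(void)
{
    int band[16] = { 1, -2, 0, 1, 37, -1, 0, 0, 2, -1, 1, 0, 0, -28, 1, 0 };
    int q = 8; /* quantiser step */
    int zero_run = 0;

    for (int i = 0; i < 16; i++) {
        int coef = band[i] / q; /* most small coefficients become zero */
        if (coef == 0) {
            zero_run++;
            continue;
        }
        if (zero_run)
            printf("run of %d zeroes\n", zero_run);
        printf("coefficient %d\n", coef);
        zero_run = 0;
    }
    if (zero_run)
        printf("run of %d zeroes\n", zero_run);
    return 0;
}
```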

And why is it so? There are two main reasons for that: wavelets are not good for representing images and wavelets are not good for coding.

The second reason is easier to explain, so let’s start with it. Whether you’re using lossy or lossless compression, you want the results to be deterministic. We had enough fun with floating-point DCT implementations producing slightly different results and accumulating errors over time. Of course it’s more important for video, but in either case working with floating-point numbers may result in subtle but noticeable differences (unlike modern formats that use bit-exact integer operations to ensure the output is the same if decoded according to the specification). Everybody seems to use the same LGT 5/3 transform for lossless compression and whatever filter they fancy for lossy coding (with further lossy processing on top).
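
For the curious, here is a sketch of the forward pass of that integer lifting scheme (assuming an even-length input and symmetric extension at the edges): nothing but additions and shifts, which is why any two implementations can agree bit for bit.

```c
#include <stdio.h>

/* mirror an out-of-range index back into [0, n) (symmetric extension) */
#define MIRROR(i, n) ((i) < 0 ? -(i) : (i) >= (n) ? 2 * (n) - 2 - (i) : (i))

/* One level of the reversible LGT 5/3 transform in its lifting form, as
 * used for lossless coding in JPEG 2000 (the spec pins the shifts down
 * as floor divisions, which is what arithmetic >> gives you on common
 * compilers).  n is assumed even for brevity. */
static void lgt53_split(const int *x, int *low, int *high, int n)
{
    int half = n / 2;
    /* predict step: high-pass = odd samples minus a prediction from evens */
    for (int i = 0; i < half; i++)
        high[i] = x[2 * i + 1] - ((x[2 * i] + x[MIRROR(2 * i + 2, n)]) >> 1);
    /* update step: low-pass = even samples plus a correction from highs */
    for (int i = 0; i < half; i++)
        low[i] = x[2 * i] + ((high[i > 0 ? i - 1 : 0] + high[i] + 2) >> 2);
}

int main(void)
{
    int x[8] = { 10, 12, 11, 13, 15, 14, 16, 15 };
    int low[4], high[4];

    lgt53_split(x, low, high, 8);
    for (int i = 0; i < 4; i++)
        printf("low=%d high=%d\n", low[i], high[i]);
    return 0;
}
```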

Now, the first reason has to do with the theory behind data compression in general. Essentially, you need a model predicting your input data and a coding system for encoding the prediction error. The better the model predicts the input, the fewer bits you need to spend on encoding the error, and the better the compression ratio you get. For example, RLE has a simple model (“the next value will be the same as the previous value”) and thus codes runs of the same value quite effectively, unlike the situation when input bytes differ from their immediate predecessors. Similarly with image compression: the better the model predicts the image, the better compression we’ll have. PNG operates on two assumptions: that an image pixel will not be very different from its (already coded) neighbours, and that the differences will have repeating patterns (in other words, pixel filtering plus LZ77-based deflate). JPEG operates on the principles that the human eye perceives luminance better than chrominance (so the latter can be sub-sampled), that it is more sensitive to lower frequencies (so discarding high frequencies will not impact quality much while saving a lot of bits), and that it does not care about fine differences in coefficient values either (hence quantisation). Wavelets are not a good tool for that (mind you, they work, but they’re not the best solution).
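
As a quick demonstration of the model-plus-error idea, here is a toy version of PNG’s “Sub” filter: predict each byte from its left neighbour and code only how wrong the prediction was. On smooth data the residuals cluster around zero, which is exactly what the entropy coder that follows likes.

```c
#include <stdio.h>

/* Prediction plus residual coding, PNG "Sub" filter style: the model
 * says "this byte equals the one to its left" and we keep only the
 * prediction error (modulo 256, as PNG does). */
int main(void)
{
    unsigned char row[10] = { 100, 101, 101, 103, 104, 104, 105, 180, 181, 181 };
    unsigned char prev = 0; /* PNG assumes 0 to the left of the first pixel */

    for (int i = 0; i < 10; i++) {
        unsigned char residual = (unsigned char)(row[i] - prev);
        printf("pixel %3d -> residual %3d\n", row[i], residual);
        prev = row[i];
    }
    return 0;
}
```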

In its current form wavelet-based compression represents an image as a base version (downscaled, say, 8–16 times and then scaled back) overlaid with the scaled horizontal/vertical/diagonal differences between regions. And considering that humans are not picky, you can discard the smaller of those differences (and quantise the larger ones to a smaller set of values for easier coding) and nobody will complain about a smoother picture, at least in theory. In practice most images do not work like that: they have local features and round edges not well representable by this method. That’s why block-based image coding still works better for a recognisable picture: no matter how badly you quantise individual blocks, the error is still contained in those blocks instead of being smeared throughout the whole image. And despite what you may think, the Gibbs effect is still present in wavelet coding; it just manifests itself not in ringing artefacts but rather in “ants” and larger blurry blocks of different brightness. There is a different but related issue: when you want to code a certain region of the image with better quality than the rest, in the block-based case you merely need to signal that you use a different quantiser, while for wavelets you have to resort to tricks like multiplying pixels in the selected region before coding and scaling them back during decoding in order to preserve details (at least that’s how JPEG-2000 works; maybe you can also achieve it with some special combination of precincts, bands and sub-layers, but I cherish my ignorance). And of course it hurts overall compression in a rather unpredictable way.
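
Here is a rough sketch of that “multiply the region” trick (JPEG-2000’s maxshift method works along these lines, though on wavelet coefficients and bit-planes rather than in this simplified form): values inside the region of interest are shifted up before quantisation so they survive it with more precision, then shifted back afterwards.

```c
#include <stdio.h>

/* Region-of-interest coding by scaling: coefficients in the ROI are
 * shifted up before quantisation and shifted back after decoding, so
 * they keep their precision while the rest is quantised coarsely. */
int main(void)
{
    int coefs[8]  = { 3, 7, 2, 9, 4, 6, 1, 8 };
    int in_roi[8] = { 0, 0, 1, 1, 1, 0, 0, 0 };
    int shift = 3, q = 4; /* ROI up-shift and quantiser step */

    for (int i = 0; i < 8; i++) {
        int c = in_roi[i] ? coefs[i] << shift : coefs[i];
        int dec = (c / q) * q; /* quantise and dequantise */
        if (in_roi[i])
            dec >>= shift;     /* undo the scaling */
        printf("coef %d -> decoded %d%s\n", coefs[i], dec,
               in_roi[i] ? " (ROI)" : "");
    }
    return 0;
}
```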

Speaking of which, compression also depends somewhat on the filter shape, and most wavelet filters are unnatural (in the sense that they don’t look like anything a natural process would produce). Sine waves at least resemble gradients (again, look at the picture on Wickedpedia), while e.g. Daubechies wavelets look much weirder. There’s a reason why the wavelets most used in compression look more like the good old sinc function. Mind you, that does not mean they can’t work; it means they are not the most effective basis, with all the consequences for the compression efficiency versus image quality trade-off (here is a good opportunity to mention the theoretically optimal KLT, but it has the problem of being too computationally expensive and requiring the basis coefficients to be transmitted, so it’s still not the best practical solution).

And if you move from coding images to coding video, you have yet another fundamental problem. Of course you may cheat and call Motion JPEG-2000 a video codec (ultimately, any video is a sequence of pictures). But if you want to achieve better compression for a video stream, you need to exploit not merely spatial redundancy (i.e. what wavelet coding does) but temporal redundancy as well (i.e. dependencies between frames). That’s where wavelet compression hits the wall, walks back and pretends it simply does not want to move further. The current method of exploiting temporal redundancy is motion compensation: in essence, just finding bits of the previous image and saying “if we move this block from the previous frame here, it’s almost what we want to code” (nowadays several reference frames are searched and blocks may be transformed in some way before use too). This leaves you with a small residue to code, which obviously takes fewer bits than the whole block. Usually most of the frame can be replaced with such motion-compensated blocks, leaving isolated areas of changes to code. And guess exactly what kind of content is not good to code with wavelets. I’m aware of one attempt to work around it (using overlapped motion compensation, where a block gets averaged in some way with its neighbours, which makes the residue somewhat smoother and easier to compress, but at the same time there’s more of it, making it harder to compress; see how it all balances out?) and one alternative, namely 3-D wavelets (yes, just apply the wavelet transform across a group of frames too; I guess it worked about as well as 3-D DCT, and that’s why we hear about them both so often).
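
In case motion compensation is unfamiliar, here is a bare-bones sketch: an exhaustive search over the previous frame for the offset that minimises the sum of absolute differences (SAD) against the current block. Real encoders use smarter search patterns, several reference frames and sub-pixel precision, but this is the core idea.

```c
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

#define W 16 /* frame width and height */
#define B 4  /* block size */

/* sum of absolute differences between a block in the previous frame
 * at (px, py) and the current block at (cx, cy) */
static int sad(const unsigned char *prev, const unsigned char *cur,
               int px, int py, int cx, int cy)
{
    int sum = 0;
    for (int y = 0; y < B; y++)
        for (int x = 0; x < B; x++)
            sum += abs(prev[(py + y) * W + px + x] - cur[(cy + y) * W + cx + x]);
    return sum;
}

int main(void)
{
    unsigned char prev[W * W], cur[W * W];
    int cx = 6, cy = 4; /* block we want to code in the current frame */
    int best_sad = INT_MAX, best_dx = 0, best_dy = 0;

    /* synthetic frames: a bright square that moves two pixels right */
    for (int i = 0; i < W * W; i++)
        prev[i] = cur[i] = 16;
    for (int y = 4; y < 8; y++)
        for (int x = 4; x < 8; x++) {
            prev[y * W + x]    = 200;
            cur[y * W + x + 2] = 200;
        }

    /* exhaustive search in a +/-4 pixel window */
    for (int dy = -4; dy <= 4; dy++)
        for (int dx = -4; dx <= 4; dx++) {
            int s = sad(prev, cur, cx + dx, cy + dy, cx, cy);
            if (s < best_sad) {
                best_sad = s;
                best_dx = dx;
                best_dy = dy;
            }
        }
    printf("best motion vector: (%d, %d), SAD = %d\n", best_dx, best_dy, best_sad);
    return 0;
}
```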

You can consider the fact that the wavelet craze passed about twenty years ago and nobody adopts new wavelet formats as indirect proof of this approach not overcoming the aforementioned difficulties (again, there may be other explanations why we have AV1 and JPEG-XL instead of, say, Snow 2 and JPEG-3000, and I’m eager to hear them). Tangentially, I’d like to point out that JPEG-2000 is a shitty format and it is probably responsible for the decline of them all. I’m not talking about patents, I’m talking about its insane flexibility (where data can be organised in too many ways: I’ve mentioned layers, tiles and precincts already) and going all-in on compression while disregarding common sense. From what I know, it is essentially used only in PDF and DCP. And considering that it takes five to ten seconds to decode a single PDF page coded with JPEG-2000 on my laptop (let alone on an e-book reader), I’m inclined to believe that its use in digital cinema is an anti-piracy measure (if pirates manage to obtain and decrypt a DCP file, it will take them so much time to decode it that the Blu-ray release will have happened before the process finishes). If you disagree, then please explain why ITU-T T.814 (High-Throughput JPEG 2000) exists. Personally, when I download such PDFs (scans of some really old books), I immediately convert them to DjVu, a document format also based on wavelets but where they actually thought about both decoding speed and the inherent limitations; after all, they use a high-resolution binary layer for high-contrast details instead of trusting wavelets to code them effectively.

There you have it: I’ve given my reasons and I’m ready to hear counter-arguments. Bring on your popular wavelet-based codecs that compete with H.265 or AVIF, and I’ll gladly amend this post.

P.S. Next time I’ll describe the IFS Fractal format, as I’ve managed to make the decoder work properly at last.
