Why wavelets are a bad choice for image coding

I’ve been teasing it for too long and the time for this post has finally come. I have good reasons to believe that wavelets are not the best choice for image compression, so here are my arguments.

Let’s start from the basics: the Fourier transform and DCT essentially represent a signal (function, data set, whatever) as a sum of sine waves of increasing frequencies spanning the whole range; the wavelet transform represents a signal by splitting it into low- and high-pass parts (roughly speaking, averages of and differences between neighbouring signal points) and repeating the process recursively on each resulting band (i.e. split the low-pass part into low-low-pass and high-low-pass parts, split the high-pass part into low-high-pass and high-high-pass parts, then split each of those in the same way, and so on).
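
If that sounds abstract, here is a minimal sketch of a single split using the simplest Haar-style filter pair (averages as the low-pass part, differences as the high-pass part); real codecs use longer filters and lifting schemes, but the principle is the same:

```c
#include <stdio.h>

/* One level of the simplest (Haar-style) wavelet split:
 * low-pass  = averages of neighbouring sample pairs,
 * high-pass = differences between them.
 * A full transform would repeat this on the low-pass part. */
static void haar_split(const int *src, int *low, int *high, int pairs)
{
    for (int i = 0; i < pairs; i++) {
        int a = src[2 * i];
        int b = src[2 * i + 1];
        low[i]  = (a + b) / 2; /* coarse approximation */
        high[i] = a - b;       /* detail */
    }
}

int main(void)
{
    int signal[8] = { 10, 12, 11, 13, 200, 198, 15, 14 };
    int low[4], high[4];

    haar_split(signal, low, high, 4);
    for (int i = 0; i < 4; i++)
        printf("low=%4d high=%4d\n", low[i], high[i]);
    return 0;
}
```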

In theory this enables better signal analysis because you can localise transients, which is not possible with FFT/DCT (and this is the reason why WT was invented in the first place). I believe it makes a lot of sense for stochastic processes (like the Brownian motion it was initially applied to); maybe it is a good fit for audio (I don’t know much about DSP but QMF looks suspiciously related, there are at least two wavelet-based audio codecs too, and de-noising looked promising…) but images are a different thing.

Just look at the sample image from Wickedpedia. The first thing you’ll notice is that nobody bothers to transform the high-pass coefficients further. Wavelet filters are essentially used to split out the low-pass part recursively, while the high-pass parts are usually areas of zeroes plus rather incompressible noise-like coefficients. That’s the reason why nobody bothers to do much with those bands: just quantise them and code the zero regions as effectively as possible (insert the joke about true wavelet compression not having been tried yet if you like). Mind you, there’s nothing wrong with using it for scalability (as some formats did, like Indeo 4 and 5; and so-called mezzanine or intermediate codecs, where scalability matters more than compression ratio, are probably the main niche where wavelets are still alive), it’s just that the base (low-low-…-low-pass) image is better coded in a different way (to the point that sometimes wavelet-based formats store it in raw form instead, or do not bother with compression beyond basic Huffman coding).
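
To illustrate, here is a toy stand-in for what typically happens to a detail band (the coefficient values are made up; real formats use far fancier entropy coders): quantise it, watch most coefficients collapse to zero and count the zero runs.

```c
#include <stdio.h>

/* Quantise a high-pass band and run-length count the zeroes:
 * a toy version of what wavelet coders do to detail bands. */
int main(void)
{
    int band[16] = { 1, -2, 0, 1, 37, -1, 0, 0, 2, -1, 1, 0, 0, -28, 1, 0 };
    int q = 8; /* quantiser step */
    int zero_run = 0;

    for (int i = 0; i < 16; i++) {
        int coef = band[i] / q; /* most small coefficients become zero */
        if (coef == 0) {
            zero_run++;
            continue;
        }
        if (zero_run)
            printf("run of %d zeroes\n", zero_run);
        printf("coefficient %d\n", coef);
        zero_run = 0;
    }
    if (zero_run)
        printf("run of %d zeroes\n", zero_run);
    return 0;
}
```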

And why is it so? There are two main reasons for that: wavelets are not good for representing images and wavelets are not good for coding.

The second reason is easier to explain, so let’s start with it. Whether you’re using lossy or lossless compression, you want the results to be deterministic. We had enough fun with floating-point DCT implementations producing slightly different results and accumulating errors over time. Of course it’s more important for video, but in either case working with floating-point numbers may result in subtle but noticeable differences (unlike modern formats that use bit-exact integer operations to ensure the output is the same if decoded according to the specification). Everybody seems to use the same LGT 5/3 transform for lossless compression and whatever filter they fancy for lossy coding (with further lossy processing on top).
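
For the curious, here is a sketch of the forward pass of that integer lifting scheme (assuming an even-length input and symmetric extension at the edges): nothing but additions and shifts, which is why any two implementations can agree bit for bit.

```c
#include <stdio.h>

/* mirror an out-of-range index back into [0, n) (symmetric extension) */
#define MIRROR(i, n) ((i) < 0 ? -(i) : (i) >= (n) ? 2 * (n) - 2 - (i) : (i))

/* One level of the reversible LGT 5/3 transform in its lifting form, as
 * used for lossless coding in JPEG 2000 (the spec pins the shifts down
 * as floor divisions, which is what arithmetic >> gives you on common
 * compilers).  n is assumed even for brevity. */
static void lgt53_split(const int *x, int *low, int *high, int n)
{
    int half = n / 2;
    /* predict step: high-pass = odd samples minus a prediction from evens */
    for (int i = 0; i < half; i++)
        high[i] = x[2 * i + 1] - ((x[2 * i] + x[MIRROR(2 * i + 2, n)]) >> 1);
    /* update step: low-pass = even samples plus a correction from highs */
    for (int i = 0; i < half; i++)
        low[i] = x[2 * i] + ((high[i > 0 ? i - 1 : 0] + high[i] + 2) >> 2);
}

int main(void)
{
    int x[8] = { 10, 12, 11, 13, 15, 14, 16, 15 };
    int low[4], high[4];

    lgt53_split(x, low, high, 8);
    for (int i = 0; i < 4; i++)
        printf("low=%d high=%d\n", low[i], high[i]);
    return 0;
}
```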

Now, the first reason has to do with the theory behind data compression in general. Essentially, you need a model predicting your input data and a coding system for encoding the prediction error. The better the model predicts the input, the fewer bits you need to spend on encoding the error, and the better the compression ratio you get. For example, RLE has a simple model (“the next value will be the same as the previous value”) and thus codes runs of the same value quite effectively, unlike the situation when input bytes differ from their immediate predecessors. Similarly with image compression: the better the model predicts the image, the better compression we’ll have. PNG operates on two assumptions: that an image pixel will not be very different from its (already coded) neighbours, and that the differences will have repeating patterns (in other words, pixel filtering plus LZ77-based deflate). JPEG operates on the principles that the human eye perceives luminance better than chrominance (so the latter can be sub-sampled), that it is more sensitive to lower frequencies (so discarding high frequencies will not impact quality much while saving a lot of bits), and that it does not care about fine differences in coefficient values either (hence quantisation). Wavelets are not a good tool for that (mind you, they work, but they’re not the best solution).
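
As a quick demonstration of the model-plus-error idea, here is a toy version of PNG’s “Sub” filter: predict each byte from its left neighbour and code only how wrong the prediction was. On smooth data the residuals cluster around zero, which is exactly what the entropy coder that follows likes.

```c
#include <stdio.h>

/* Prediction plus residual coding, PNG "Sub" filter style: the model
 * says "this byte equals the one to its left" and we keep only the
 * prediction error (modulo 256, as PNG does). */
int main(void)
{
    unsigned char row[10] = { 100, 101, 101, 103, 104, 104, 105, 180, 181, 181 };
    unsigned char prev = 0; /* PNG assumes 0 to the left of the first pixel */

    for (int i = 0; i < 10; i++) {
        unsigned char residual = (unsigned char)(row[i] - prev);
        printf("pixel %3d -> residual %3d\n", row[i], residual);
        prev = row[i];
    }
    return 0;
}
```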

In its current form wavelet-based compression represents an image as a base version (downscaled, say, 8–16 times and then scaled back) overlaid with the scaled horizontal/vertical/diagonal differences between regions. And considering that humans are not picky, you can discard the smaller of those differences (and quantise the larger ones to a smaller set of values for easier coding) and nobody will complain about a smoother picture, at least in theory. In practice most images do not work like that: they have local features and round edges not well representable by this method. That’s why block-based image coding still works better for a recognisable picture: no matter how badly you quantise individual blocks, the error is still contained in those blocks instead of being smeared throughout the whole image. And despite what you may think, the Gibbs effect is still present in wavelet coding; it just manifests itself not in ringing artefacts but rather in “ants” and larger blurry blocks of different brightness. There is a different but related issue: when you want to code a certain region of the image with better quality than the rest, in the block-based case you merely need to signal that you use a different quantiser, while for wavelets you have to resort to tricks like multiplying pixels in the selected region before coding and scaling them back during decoding in order to preserve details (at least that’s how JPEG-2000 works; maybe you can also achieve it with some special combination of precincts, bands and sub-layers, but I cherish my ignorance). And of course it hurts overall compression in a rather unpredictable way.
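
Here is a rough sketch of that “multiply the region” trick (JPEG-2000’s maxshift method works along these lines, though on wavelet coefficients and bit-planes rather than in this simplified form): values inside the region of interest are shifted up before quantisation so they survive it with more precision, then shifted back afterwards.

```c
#include <stdio.h>

/* Region-of-interest coding by scaling: coefficients in the ROI are
 * shifted up before quantisation and shifted back after decoding, so
 * they keep their precision while the rest is quantised coarsely. */
int main(void)
{
    int coefs[8]  = { 3, 7, 2, 9, 4, 6, 1, 8 };
    int in_roi[8] = { 0, 0, 1, 1, 1, 0, 0, 0 };
    int shift = 3, q = 4; /* ROI up-shift and quantiser step */

    for (int i = 0; i < 8; i++) {
        int c = in_roi[i] ? coefs[i] << shift : coefs[i];
        int dec = (c / q) * q; /* quantise and dequantise */
        if (in_roi[i])
            dec >>= shift;     /* undo the scaling */
        printf("coef %d -> decoded %d%s\n", coefs[i], dec,
               in_roi[i] ? " (ROI)" : "");
    }
    return 0;
}
```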

Speaking of which, compression also depends somewhat on the filter shape, and most wavelet filters are unnatural (in the sense that they don’t look like anything a natural process would produce). Sine waves at least resemble gradients (again, look at the picture on Wickedpedia), while e.g. Daubechies wavelets look much weirder. There’s a reason why the wavelets most used in compression look more like the good old sinc function. Mind you, that does not mean they can’t work; it means they are not the most effective basis, with all the consequences for the compression efficiency versus image quality trade-off (here is a good opportunity to mention the theoretically optimal KLT, but it has the problem of being too computationally expensive and requiring the basis coefficients to be transmitted, so it’s still not the best practical solution).

And if you move from coding images to coding video, you have yet another fundamental problem. Of course you may cheat and call Motion JPEG-2000 a video codec (ultimately, any video is a sequence of pictures). But if you want to achieve better compression for a video stream, you need to exploit not merely spatial redundancy (i.e. what wavelet coding does) but temporal redundancy as well (i.e. dependencies between frames). That’s where wavelet compression hits the wall, walks back and pretends it simply does not want to move further. The current method of exploiting temporal redundancy is motion compensation: in essence, just finding bits of the previous image and saying “if we move this block from the previous frame here, it’s almost what we want to code” (nowadays several reference frames are searched and blocks may be transformed in some way before use too). This leaves you with a small residue to code, which obviously takes fewer bits than the whole block. Usually most of the frame can be replaced with such motion-compensated blocks, leaving isolated areas of changes to code. And guess exactly what kind of content is not good to code with wavelets. I’m aware of one attempt to work around it (using overlapped motion compensation, where a block gets averaged in some way with its neighbours, which makes the residue somewhat smoother and easier to compress, but at the same time there’s more of it, making it harder to compress; see how it all balances out?) and one alternative, namely 3-D wavelets (yes, just apply the wavelet transform across a group of frames too; I guess it worked about as well as 3-D DCT, and that’s why we hear about them both so often).
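
In case motion compensation is unfamiliar, here is a bare-bones sketch: an exhaustive search over the previous frame for the offset that minimises the sum of absolute differences (SAD) against the current block. Real encoders use smarter search patterns, several reference frames and sub-pixel precision, but this is the core idea.

```c
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

#define W 16 /* frame width and height */
#define B 4  /* block size */

/* sum of absolute differences between a block in the previous frame
 * at (px, py) and the current block at (cx, cy) */
static int sad(const unsigned char *prev, const unsigned char *cur,
               int px, int py, int cx, int cy)
{
    int sum = 0;
    for (int y = 0; y < B; y++)
        for (int x = 0; x < B; x++)
            sum += abs(prev[(py + y) * W + px + x] - cur[(cy + y) * W + cx + x]);
    return sum;
}

int main(void)
{
    unsigned char prev[W * W], cur[W * W];
    int cx = 6, cy = 4; /* block we want to code in the current frame */
    int best_sad = INT_MAX, best_dx = 0, best_dy = 0;

    /* synthetic frames: a bright square that moves two pixels right */
    for (int i = 0; i < W * W; i++)
        prev[i] = cur[i] = 16;
    for (int y = 4; y < 8; y++)
        for (int x = 4; x < 8; x++) {
            prev[y * W + x]    = 200;
            cur[y * W + x + 2] = 200;
        }

    /* exhaustive search in a +/-4 pixel window */
    for (int dy = -4; dy <= 4; dy++)
        for (int dx = -4; dx <= 4; dx++) {
            int s = sad(prev, cur, cx + dx, cy + dy, cx, cy);
            if (s < best_sad) {
                best_sad = s;
                best_dx = dx;
                best_dy = dy;
            }
        }
    printf("best motion vector: (%d, %d), SAD = %d\n", best_dx, best_dy, best_sad);
    return 0;
}
```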

You can consider the fact that the wavelet craze passed about twenty years ago and nobody adopts new wavelet formats as indirect proof of this approach not overcoming the aforementioned difficulties (again, there may be other explanations why we have AV1 and JPEG-XL instead of, say, Snow 2 and JPEG-3000, and I’m eager to hear them). Tangentially, I’d like to point out that JPEG-2000 is a shitty format and it is probably responsible for the decline of them all. I’m not talking about patents, I’m talking about its insane flexibility (where data can be organised in too many ways: I’ve mentioned layers, tiles and precincts already) and going all-in on compression while disregarding common sense. From what I know, it is essentially used only in PDF and DCP. And considering that it takes five to ten seconds to decode a single PDF page coded with JPEG-2000 on my laptop (let alone on an e-book reader), I’m inclined to believe that its use in digital cinema is an anti-piracy measure (if pirates manage to obtain and decrypt a DCP file, it will take them so much time to decode it that the Blu-ray release will have happened before the process finishes). If you disagree, then please explain why ITU-T T.814 (High-Throughput JPEG 2000) exists. Personally, when I download such PDFs (scans of some really old books), I immediately convert them to DjVu, a document format also based on wavelets but where they actually thought about both decoding speed and the inherent limitations; after all, they use a high-resolution binary layer for high-contrast details instead of trusting wavelets to code them effectively.

There you have it: I’ve given my reasons and I’m ready to hear counter-arguments. Bring on your popular wavelet-based codecs that compete with H.265 or AVIF, and I’ll gladly amend this post.

P.S. Next time I’ll describe the IFS Fractal format, as I’ve managed to make the decoder work properly at last.
