Rants on Data Compression

… When I was a young piglet I liked to read the rather famous paper by Bell, Cleary and Witten discussing general data compression and PPM. The best phrase there was that progress in data compression is mostly defined by the larger amounts of RAM available. I still believe those words to be true, and below I present my thoughts on the current state of data compression. It's probably trivial, well-known, obvious or wrong to anybody who knows a bit about data compression but well, it's my blog and my discarded thoughts dumpster.

General data compression

Let's start from the very end: entropy coding. There are two approaches: coding each symbol into an integer number of bits, or coding as close to Shannon's entropy limit as possible. For both we have had optimal coding methods for about half a century (Huffman coding from 1952, arithmetic coding from the mid-1970s). You cannot improve the compression ratio here, so the schemes that followed are mostly tradeoffs sacrificing a bit of compression for speed gains (especially in the form of (pseudo-)arithmetic coders operating only on binary decisions). The only outstanding thing is the so-called Asymmetric Numeral Systems, but I suspect they are isomorphic to traditional entropy coders.
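
To put numbers on the gap between the two approaches, here is a minimal Python sketch that compares the empirical Shannon entropy of a message with the average length of a Huffman code built for it. The function names are made up for this illustration, and the code only computes code lengths; it does not emit a bitstream.

```python
import heapq
from collections import Counter
from math import log2


def shannon_entropy(data):
    """Entropy, in bits per symbol, of the empirical distribution of `data`."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * log2(c / total) for c in counts.values())


def huffman_code_lengths(data):
    """Return {symbol: code length in bits} for a Huffman code built on `data`."""
    counts = Counter(data)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def average_huffman_bits(data):
    """Average Huffman code length in bits per symbol of `data`."""
    counts = Counter(data)
    lengths = huffman_code_lengths(data)
    return sum(counts[s] * lengths[s] for s in counts) / len(data)
```

For a skewed source such as seven "a"s and one "b", the Huffman code still spends a full bit per symbol while the entropy is roughly 0.54 bits; that difference is what arithmetic coding can claim. When all probabilities are powers of 1/2 (for example "aaaabbcd") the two numbers coincide.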

Now let's look at what feeds data to entropy coders. There are two main approaches (often combined): context modeling (PPM, probably the real foundation for the current highest-compression methods, was proposed in the mid-1980s) and LZ77 (guess the year yourselves). Are there improvements in this area? Yes! The principle is simple: the better you can predict the input, the better you can code it. So if you combine different methods to better handle your data you can get some gains.
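
The "better prediction means better coding" principle can be sketched by counting how many bits an ideal arithmetic coder would spend, which is the sum of -log2(p) over the probabilities the model assigned to each symbol. The adaptive order-N model below is a deliberately crude assumption (plain counts with every byte value starting at 1); it is not PPM, which adds escape symbols and blends several orders, and the function name is hypothetical.

```python
from collections import defaultdict
from math import log2


def ideal_code_length(data: bytes, order: int) -> float:
    """Bits an ideal arithmetic coder would need for `data` when each byte is
    predicted from the `order` bytes preceding it (adaptive counts)."""
    # counts[context][byte]; every byte starts with a count of 1 so that
    # nothing ever gets zero probability (crude Laplace smoothing).
    counts = defaultdict(lambda: [1] * 256)
    totals = defaultdict(lambda: 256)
    bits = 0.0
    for i, sym in enumerate(data):
        ctx = data[max(0, i - order):i]  # shorter context at the very start
        bits -= log2(counts[ctx][sym] / totals[ctx])
        # Update the model only after coding, so a decoder can stay in sync.
        counts[ctx][sym] += 1
        totals[ctx] += 1
    return bits
```

On long repetitive input an order-2 model needs noticeably fewer bits than an order-0 one. On short input the cost of learning many separate contexts can eat most of the gain, which is one reason practical context-modeling coders combine several orders.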

And yet the main compression gain here lies in proper preprocessing. It ranges from table or executable code preprocessing (table entries usually differ only slightly from each other, and for executables you can get some gains if you replace relative jump/call offsets with absolute addresses) to the Burrows–Wheeler transform plus move-to-front plus RLE if needed, etc.
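
As an illustration of the last chain, here is a small sketch of the Burrows–Wheeler transform followed by move-to-front. It assumes the naive "sort all rotations" formulation, which is fine for toy inputs only (real implementations use suffix sorting), and all function names are made up for the example.

```python
def bwt(s: bytes):
    """Naive Burrows-Wheeler transform.

    Returns the last column of the sorted rotation matrix and the row index
    at which the original string appears (needed for the inverse)."""
    n = len(s)
    rotations = sorted(range(n), key=lambda i: s[i:] + s[:i])
    last = bytes(s[(i - 1) % n] for i in rotations)
    return last, rotations.index(0)


def inverse_bwt(last: bytes, primary: int) -> bytes:
    """Undo bwt() using a stable sort of the last column."""
    n = len(last)
    # order[j] is the position in `last` of the byte that heads sorted row j.
    order = sorted(range(n), key=lambda i: last[i])
    out = bytearray()
    row = primary
    for _ in range(n):
        row = order[row]
        out.append(last[row])
    return bytes(out)


def move_to_front(data: bytes):
    """Replace each byte by its position in a recency list, then move it to
    the front; runs of equal bytes turn into runs of zeros."""
    alphabet = list(range(256))
    out = []
    for b in data:
        idx = alphabet.index(b)
        out.append(idx)
        alphabet.insert(0, alphabet.pop(idx))
    return out
```

For b"banana" the transform yields b"nnbaaa" (original at row 3), and move-to-front turns that into [110, 0, 99, 99, 0, 0]. Equal symbols get clustered and then mapped to small numbers, which is what the RLE and entropy coding stages profit from.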

Audio compression

You have four main targets here: general lossy compression, speech compression, lossless fast compression and lossless crazy compression.

General lossy compression follows the scheme established in the 1990s or earlier: transforming to the frequency domain, grouping frequencies into bands and coding those bands. Most of the methods are quite old and progress is defined mostly by how much RAM and CPU users are willing to sacrifice for it. For example, CELT (the main part of Opus; the other part, SILK, is an ordinary speech codec) is not that much different in design from G.722.1 from the late 1990s.
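
The first two steps of that scheme can be sketched in a few lines. This is only an illustration under loud assumptions: it uses a naive unnormalised DCT-II as a stand-in for the MDCT that real codecs use, the band edges are arbitrary, and nothing here does the actual quantisation or coding of the bands. The function names are hypothetical.

```python
from math import cos, pi


def dct2(block):
    """Naive O(N^2) unnormalised DCT-II of a list of samples."""
    n = len(block)
    return [
        sum(x * cos(pi / n * (i + 0.5) * k) for i, x in enumerate(block))
        for k in range(n)
    ]


def band_energies(coeffs, band_edges):
    """Energy (sum of squared coefficients) in each [start, end) band.

    `band_edges` is an increasing list of indices; widening bands towards
    high frequencies roughly mimics how codecs group coefficients."""
    return [
        sum(c * c for c in coeffs[a:b])
        for a, b in zip(band_edges, band_edges[1:])
    ]
```

A codec would then spend its bits on a coarse description of each band (its energy plus the shape of the coefficients inside it) rather than on individual samples.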

Speech coding follows canons from the 1980s too: performing LPC, then coding the filter coefficients and other information that enhances signal reconstruction (pulse position, pitch tilt etc.).

Lossless fast compression (aka for normal usage) follows suit too: you have LPC or some adaptive filters used for prediction plus residue coding (usually with Golomb/Rice codes from the 1960s-1970s; BTW the original Golomb paper is AWESOME, they don't write papers like that nowadays).
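
A minimal sketch of that "prediction plus residue coding" pipeline follows. It assumes a fixed second-order predictor instead of a real LPC analysis, writes Rice codes as strings of "0"/"1" characters for readability, and uses ones terminated by a zero for the unary part (some formats use the opposite convention). All names are made up for the example.

```python
def second_order_residuals(samples):
    """Residuals of the fixed predictor x[n] ~ 2*x[n-1] - x[n-2].

    Samples before the start of the signal are treated as zero."""
    res = []
    for n, x in enumerate(samples):
        p1 = samples[n - 1] if n >= 1 else 0
        p2 = samples[n - 2] if n >= 2 else 0
        res.append(x - (2 * p1 - p2))
    return res


def reconstruct(residuals):
    """Inverse of second_order_residuals(): the decoder side."""
    out = []
    for n, r in enumerate(residuals):
        p1 = out[n - 1] if n >= 1 else 0
        p2 = out[n - 2] if n >= 2 else 0
        out.append(r + 2 * p1 - p2)
    return out


def zigzag(n: int) -> int:
    """Map signed residuals to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1


def rice_encode(value: int, k: int) -> str:
    """Rice code of a non-negative integer: unary quotient, k-bit remainder."""
    q, r = value >> k, value & ((1 << k) - 1)
    bits = "1" * q + "0"
    if k:
        bits += format(r, f"0{k}b")
    return bits


def rice_decode(bits: str, k: int) -> int:
    """Decode a single Rice codeword produced by rice_encode()."""
    q = bits.index("0")
    r = int(bits[q + 1:q + 1 + k], 2) if k else 0
    return (q << k) | r
```

For a smooth signal like [0, 10, 20, 30, 41, 50] the residuals are [0, 10, 0, 0, 1, -2], so most samples cost only a few bits once they are zigzag-mapped and Rice-coded with a small k.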

Lossless crazy compression (aka spend hours compressing it and as much decompressing) employs the same approach, but with longer filters, usually even several filters of different sizes applied one after another, plus better residue coding schemes.

Image compression

Here you have more variety in coding methods but most of them are very old (just look at when the Haar wavelet was proposed). Especially funny is that JPEG is still holding strong despite being more than twenty years old. I still remember the so-called fourth generation image compression (separating the image into region borders and textures to fill them with, and coding those); it hasn't taken off yet despite being introduced in the late 1980s or so.

The only interesting developments happen in lossless image compression, but neither 2-D LZ77 (WebP) nor context modeling (FLIF) is a particularly new idea.

Video compression

Modern codecs are all so similar, and they are usually ripoffs of H.26x (there are two exceptions: Thor, which is not a ripoff only because it was designed while openly acknowledging that some parts are taken from H.265, and Daala, which is more original and is discussed below).

So nowadays you have a very limited subset of ideas that were present in video codecs from the 1990s: it's boring macroblocks (now with quadtree partitioning instead of a fixed size), motion compensation (now you have more reference frames to choose from though) and a binary entropy coder (except for Thor, which went the way of RealVideo 3/4 with context-adaptive VLCs). Even the trend of adding special coding tools for special content doesn't look that original (if you remember the countless screen codecs and MPEG-4 Audio, *barf*).
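
For readers who have not met it, the motion compensation part boils down to a search like the one sketched below: for each block of the current frame, find the displacement into a reference frame that gives the smallest difference, then code only that vector and the leftover residue. This is a toy full search with a sum-of-absolute-differences cost over frames stored as lists of rows; the names, block size and search range are all assumptions for the illustration, and real encoders use much smarter search strategies and sub-pixel vectors.

```python
def sad(ref, cur, bx, by, dx, dy, size):
    """Sum of absolute differences between the block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def best_motion_vector(ref, cur, bx, by, size=4, search=2):
    """Full search over displacements within +/- `search` pixels.

    Returns (cost, dx, dy) for the best candidate that stays inside the
    reference frame, or None if no candidate fits."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= by + dy and by + dy + size <= h
                      and 0 <= bx + dx and bx + dx + size <= w)
            if not inside:
                continue
            cost = sad(ref, cur, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Having more reference frames simply means running the same search against several `ref` candidates and also signalling which one was picked.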

The only exception for now is Daala, which uses more original ideas, but I fear it will end up as the same boring codec because it is not crazy enough to make a breakthrough. I believe it should do more crazy preprocessing at least and maybe better modeling, e.g. taking more than the nearest neighbours into account (maybe even using something PPM-like for element coding and not just mixing probabilities). Look at JBIG for inspiration maybe 😉


Don't expect miracles in data compression to happen anytime soon, but improvements of a couple of percent in specialised fields, at least over a decade, are possible and even expected.

2 Responses to “Rants on Data Compression”

  1. Peter says:

    That Golomb paper is gold, thanks!

  2. Marcus says:

    I completely agree, especially about Daala, people seem to expect such a big jump in performance, but chroma subsampling, then DCT transforming, then outputting with an entropy encoder just can’t get much better, we need a breakthrough.