LZ77-based compressors — a story similar to lossless codecs

What do LZ77 compressors and lossless codecs have in common? They are both perform lossless compression and there are too many of them because everyone tries to invent their own. And like lossless audio codecs — quite often in their own container too.

In case you don’t know (shame on you!) LZ77 scheme parses input into pieces like <literal> <copy> <literal> ... Literal means “copy these input bytes verbatim”, copy is “we had that substring some time ago, copy N bytes from the history at offset M”.

The idea by itself is rather simple and thus it’s easy to implement some LZ77 parsing with the following coding, slap your name on it and present as some new algorithm. There are three branches of implementation goals there — fast (but somewhat decent) compression, high (but not so fast) compression and experimental research that may lead to implementations in the first two branches.

Fast compression schemes usually pack everything into bytes so no time is wasted on bit reading. Usually format is like this — if top three bits of the next byte are something, then read literal copy length, otherwise determine offset size, read it and copy string from the dictionary. Quite often there are small tweaks to make compression faster (like using hashes) or slightly better (using escape values to code long values and coding small offsets/lengths into opcode etc.). There are so many implementations like that and they still keep appearing. LZO, LZF, FastLZ, snappy, chameleon… And lots of old games used such compression for their resources (including video) too.

High compression schemes use much better compressing of the data produced by LZ77 parsing and spending more cycles on finding the best parsing of the input. It all started essentially with LZHUF when someone decided to employ Huffman codes instead of writing values in a fixed amount of bits. If you’ve never heard about LHA/LZH you need your Amiga box confiscated. This approach reached its peak with Deflate — by modern standards it’s not the best format to compress (i.e. not fast enough, does not compress high enough etc etc.) but it’s the standard available everywhere and in any form. Deflate uses custom per-block Huffman codes with their definition stored in compressed form as well so there’s hardly anything to improve there radically. And thus (patent expiration helped greatly too) another form of LZ77-based compression started to bloom — LZA (using modelling and arithmetic coding on LZ77 parsing results). Current favourite LZMA (and main RAR compression scheme) uses this approach too albeit in very sophisticated form — preprocessors to increase compression ratio on some kinds of known data, Markov models, you name it.

And here’s my rant — leave Deflate alone! It’s like JPEG of data compression — old and seemingly not very effective but it’s ubiquitous, well-supported and still has some improvement potential (like demonstrated by e.g. 7-zip and zopfli). I hate it to have as many compression schemes to support as video codecs. Deflate and LZMA are enough for now and I doubt there will be something significantly more effective appearing soon. Work on something lossy — like H.265 encoder optimisations — instead.

Comments are closed.