A small rant about compression

The recent news about OpenZL made me think about a tangential issue.

The approach by itself is nothing new really: a lot of archivers include a pre-processing step for the data (I don’t know if there are earlier examples, but de-interleaving or delta-coding floating-point data might be only slightly younger than the geo file in the Calgary Corpus, LZX translates call addresses into absolute offsets for better compression, and so on); more advanced archivers implement flexible processing steps (e.g. RAR had its own custom VM for pre-processing data, essentially a cut-down 8086 instruction set and a security nightmare, while ZPAQ allows defining compression steps for data-specific compression that won’t require a new decoder, in other words something very similar to OpenZL). There’s nothing wrong with the approach and it’s probably useful outside, say, genomic data compression; it’s just that it raises two questions: what is the currently accepted compression/resources trade-off for general-purpose compression, and what would be a good candidate for an open-source archiver?
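To make that concrete, here’s a minimal sketch (mine, not taken from any actual archiver) of that kind of floating-point pre-processing: split 32-bit floats into byte planes and delta-code each plane, so that the general-purpose coder behind it sees longer runs and more repeated bytes. The function name and output layout are made up for illustration.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* De-interleave 32-bit floats into four byte planes and delta-code
     * each plane; dst must hold count * 4 bytes. The inverse transform
     * is just a per-plane prefix sum followed by re-interleaving. */
    void preprocess_floats(const float *src, uint8_t *dst, size_t count)
    {
        for (size_t plane = 0; plane < 4; plane++) {
            uint8_t prev = 0;
            for (size_t i = 0; i < count; i++) {
                uint8_t bytes[4];
                memcpy(bytes, &src[i], 4);   /* raw IEEE 754 bytes of one float */
                dst[plane * count + i] = (uint8_t)(bytes[plane] - prev);
                prev = bytes[plane];
            }
        }
    }

Real archivers use fancier models and usually decide per block whether the transform actually helps, but the principle is the same: a cheap reversible transform that makes the data friendlier for the general-purpose compressor behind it.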

The first question is obvious: as time goes by, the available CPU power and RAM grow along with the amounts of data to compress. Back in the day gzip was the gold standard and bzip2 was something that ate too much RAM and worked rather slowly. A bit later .tar.bz2 started to replace .tgz for, say, distribution tarballs. Nowadays it’s .tar.xz or .tar.zst, which makes me wonder whether that is really the sweet spot for now or whether things will move towards adopting a compression scheme that’s slower but offers a better compression ratio.

The second question follows from the first one: what would be a good candidate, specifically for open-source applications. If you look around, there are not so many of those. You can divide existing formats (don’t confuse them with implementations) into several (sometimes overlapping) categories:

  • proprietary formats with an official open-source decoder at best (like RAR) or an unofficial reverse-engineered one (e.g. RAD mythical sea creatures formats and LZNIB);
  • open-source compression libraries targeting fast compression (LZO, LZ4, FLZ, LZF, etc, etc);
  • old open-source compressors (compress, gzip, bzip2, zip);
  • various programs trying to bank on a well-known name while not being related (bzip3 and anything with “zip” in its name really);
  • state-of-the-art compressors that require insane amounts of CPU and RAM (anything PAQ-based, NNCP);
  • corporate-controlled open-source formats (brotli, Zstandard).

The question is what would be a good candidate for the next de-facto compression standard. The current widespread formats are good since they’re easy to implement and there are many independent implementations in various languages, but how much can we trust the next generation, the one with flexible input pre-processing? (The third question would be whether that’s really the design approach mainstream compression formats will take.)

For instance, I have nothing against LZMA, but considering that its author is Russian, how much can we trust that he won’t be visited by FAPSI representatives and forced to make changes to the LZMA3 design that will make Jia Tan green with envy? As for the formats coming from corporations, are you really going to rely on their goodwill? I think the story with LZW should serve as a warning.

The only reassuring thing is that it is still rather easy to design a new compression scheme and even achieve a decent compression ratio and performance (unlike training a neural network or even designing a video codec to rival H.265), so good candidates are likely to appear sooner or later.
