One would wonder why. The answer is simple: I merely wanted to play with LZW encoding and created a GIF encoder. Then I needed to debug it and thus ended up with a decoder as well. And while I was at it I added animated GIF support for both decoding and encoding.
I’m not going to repeat on how it works, it’s been written in all possible textbooks and tutorials on data compression (and then some), I’m more interested in implementing an efficient encoder for it. I should remind you that LZW compression was patented (at least by Sperry and IBM) and that patent introduced a lot of chaos in multimedia before MP3 or H.264 were a thing.
The algorithm simply tells you to find a match for the input string in the dictionary and add a new string to the dictionary. You can keep the actual strings in the dictionary (simple but not very effective), you can keep the reference to it in the input (simple and saves a lot on dictionary space but not very effective and you can’t feed it input by chunks), you can keep the same dictionary structure as the decoder and try to match input to the code sequence (the most compact way to store the dictionary but not so straightforward to use; more on a practical extension of this approach at the end of this post). Since I do not care about wasting whopping two megabytes of RAM on the dictionary, I implemented a simple trie: each node contains a code ending at that node and 256 pointers to the next next trie nodes (each code/pointer fits into 16-bit integer with zero as “invalid value” pointer). Of course it can be made more compact but I prefer simplicity and clarity.
Now, the question is whether the compression efficiency can be improved somehow. Since the decoder expects encoder to work in a certain manner we can’t change format. The tricks that work good for LZ77 like lazy matching do not work that great for LZ78: you have dictionary state changed at each step and you don’t know which entry will be more useful in the future. Also emitting shorter matches means we also litter the dictionary with duplicates, which does not help compression in the long run either.
Thus the only thing we can use is creative dictionary clearing. E.g. we can emit reset code after every N symbols encoded to keep the code length constant (that’s the trick various libraries used to produce the compliant stream without doing any compression and thus not infringing the patent), or you can keep the dictionary filled with what it was until the very end (and if data properties change your compression ratio hurts), or (as most encoders do) you simply reset the dictionary after it gets full. But can you do better?
I tried two heuristics. First, I tried segmenting data by having a sliding window and comparing the character occurrence in each half, if those differed by a lot then it was a good place to restart. From what I heard, this simple approach allows better compression for LZ77- and BWT-based compression schemes. In this case it helped only when compressing the data that could not be compressed at all (e.g. greyscale continuous-tone images), in other cases the results were worse than with the default strategy. The second approach was simple: keep the rolling average of last, say, sixteen match lengths and if it drops significantly when the dictionary is full then it’s time to clear it. Surprisingly enough it seemed to improve compression for several fractions of a percent in general cases (and sometimes worsened the ratio on hard-to-compress files by about the same amount) so I kept it.
And while talking about GIF and LZW, here are some words about FFmpeg
support of GIF: it’s rather laughable. While encoding to GIF has been supported from the early days (from 2002, to be less vague), actual LZW compression support was introduced only in 2009 (using a work of some Summer of Code student from 2007; the notorious patent on LZW compression had expired by that time). A bit later in the same year the encoder was improved to work with paletted input instead of RGB24 (which was maimed to use fixed palette anyway). In the same time the original animated GIF encoder+muxer (not to be confused with now improved single-image GIF encoder) kept existing with all the original deficiencies (RGB24 input, fixed palette, no compression) so by encoding to .gif
without specifying image2
muxer you’d get significantly worse results. I heard it’d been improved at last back in 2018 but I have not been caring before so why start now?
The interesting thing there is that LZW compression employed for both GIF and TIFF seems to be based on compress
approach with a hash table. Essentially the dictionary is stored in the same form as in the decoder (a reference to the previous code plus suffix character) but indices are hash codes instead of sequential entries and the encoder in a decoder mirror mode: it traverses the table to find the code that corresponds to the current code plus a new character (and if it’s not found, the prefix code is output and new code pair is added to the dictionary). Also it seems to compress data slightly worse than the straightforward implementation in some cases (e.g. in my tests one 320×240 PGM file got expanded to 83159 bytes instead of 83132 bytes and another 112×288 PPM file got compressed to 10371 bytes instead of 10366 bytes; the difference is negligible but it’s there). I suspect this is caused by hash collisions i.e. in some rare cases a shorter code and a longer one have the same hash and the former gets selected even when the latter should be better.
In either case, this might have been not a very useful exercise but it was fun and somewhat educational too. Now only to find a thing to do next…