You can blame Paul for interesting me enough to spend some time on research. He discussed how he implemented deflate support for librempeg and I got curious myself how other fast LZ77-based encoders do the matching.
The traditional approach is to have a hash chain (i.e. associate each dictionary position with a hash, usually calculated from three symbols, pointing to the next position with the same hash). It’s simple and works surprisingly well. Almost all compressors I looked at used either this approach or a hash map.
The only exception I saw was GLZA which described itself as counting symbol frequencies first and then building partial suffix tree just for the most common symbols.
And finally I encountered the paper by Sadakane and Imai from 2000 where the authors proposed two improvements: introducing the secondary hash for the collisions in the hash chain or, alternatively, using suffix sorting (in a suffix array). They report that using two-level hash improves performance significantly, especially for large dictionary sizes but it’s negligible for 64kB or less, while suffix sorting time seems to grow logarithmic from the dictionary size but it beats one-level hash at dictionary sizes of 256kB and two-level hash at dictionary sizes above 1MB.
So, apparently, this is yet another case when keeping it simple is advantageous. Compressors with larger dictionaries will benefit from a different approach, while deflate has maximum dictionary size at 32kB and anything but brute force search will be about equally good.
In either case, I don’t regret spending time on this quick research.
Current deflate encoder (local only non in the repo) had initially only run-length type of checks (for past 1-8 bytes),
and that made it extremely fast but not very good at compression ratio, once I added
also just prev stride/line copy check it reached “state of art” – almost, there is still 5% hidden somewhere where it compress 5% worse then reference encoder and speed it still extremely fast compared to reference encoder (more that 10x faster).
So I’m afraid its not that easy – trivial task, especially for screen-like content (with flat areas) and when there is many ways to combine different literals with distance symbols.
I’m not interested in brute-force search naive like algorithms – there is nothing of value there. This is compressing multimedia content (usually just gray, or rgb triplet) not some weird ascii text or genome. Anyway some kind of packed representation of all information of current chunk like presented by some kind of suffix automata is way to go, but I’m not that good at implementing efficient/fast/useful one and information is very limited on open web.
So looking for very fast (algorithm) and efficient encoder that can at least reach current compression ratio of zlib (it have different compression levels, for simplicity just concentrate on default zlib compression level). (also looking at libdeflate but I think local code is much faster than libdeflate and compress worse than libdeflate)
Also, about using hash dict/map. I’m afraid that will hurt current extremely fast performance of deflate encoder.
So looking at more generic solution/approach as that would work out for all family of similar compressors (even I currently work only on deflate).
That’s the problem of deflate format – it’s hard to estimate the cost of coding symbols.
As for combining, there is a simple rule called lazy matching: do not code the current match until you try the next one, maybe literal plus next match will compress better. Add couple of simple rules (the lower distance the better and the longer the match the better) and you’ll have something.
Also I don’t expect a universal solution for all cases. Maybe you can improve compression ratio without sacrificing much speed by keeping a small LRU list with last matches instead of full hash, maybe you can check last 2-3 match positions if they provide good matches again and such. Maybe it’s a place for skip lists… In either case you’ll have to sacrifice a lot of speed for diminishing compression gains, so pick a couple of sweet spots and see what you can do about matching/parsing for each of them.
There is obvious rule to apply that I overlooked partially and your post completely revealed it to me. Do not add new literal+distance combination if it will cost more than simple literal in overall bitstream. That is why current encoder beats default zlib encoder by 30% percent in compression ratio in some cases. Will research how best to cover this step by not sacrificing speed significantly.
It’s a part of tokeniser so I enable different tokenisers depending on compression level: fast means greedy approach (i.e. longest match it is), better compression level means lazy tokenisation (i.e. postpone decision until the match for the next symbol is known and can be compared), best compression level means optimal tokenisation (which means testing all possible matches to find the best way to code sequence, IIRC it’s five to ten times slower though).
Similarly I use static or dynamic codes depending on compression level, so maybe you should not pursue one universal solution either and pick different methods for different profiles (e.g. RLE only plus static codes for veryfast, RLE plus selected matches plus dynamic codes for fast, more matches plus greedy method plus dynamic codes for medium-level compression and so on). Of course I’ll be happy to hear about any results from you.
I’m aware of different presets/profiles, but I want to beat all of them by doing something novel yet known. So the fastest profile can beat even the slowest “brute-force” approach.
I have some rough idea how to do it with FFT but yet needs to confirm its doable.
And I’m aware that convolution is a big hammer that helps solving many things, it’d be interesting if you manage to apply it here successfully.
Its more about variable length+offset cross-correlation, but even that would be slow.
Using suffix automaton on say 2*width values is much faster but still more than several times slower than current situation with fixed checks.
I made encoder even faster, by not comparing bytes directly all the time, but instead comparing chunks of bytes – for prev stride case non-literal codes.
But I gave up, for now, on improving compression by 5-10% percent, that in some cases zlib beats native code. I just couldn’t figure how to make lazy matching fast and reliable.
So for synthetic flat areas frames the deflate native encoder is much faster (even faster then fastest compression_level 1 which give much worse compression size) and compress consistently better than zlib. For less flat areas and real non-synthetic footage its 5%-15% compression worse than default zlib compression level but still at much faster speed.
That’s still a result to be proud about.