Out of curiosity I decided to check how partitioning the input before creating a codebook affects encoding speed. So I’ve added a mode to the Cinepak encoder that partitions vectors by luma variance and creates a part of the common codebook just for them. The other two modes are median cut (the simplest one, but with mediocre output) and ELBG (which uses median cut to create the initial codebook; additionally, if that initial codebook is not full, it means we already have all possible entries and do not need to run ELBG at all).
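Here is a minimal sketch of the partitioning idea, assuming 6-component vectors (four luma values plus chroma) and a made-up variance threshold; the names and the two-way split are illustrative, not the actual encoder code:

```rust
// Illustrative sketch, not the real NihAV Cinepak encoder code.
// A vector is a 2x2 luma block plus chroma: [Y0, Y1, Y2, Y3, U, V].
type Vec6 = [i16; 6];

fn luma_variance(v: &Vec6) -> i32 {
    let mean = (i32::from(v[0]) + i32::from(v[1]) + i32::from(v[2]) + i32::from(v[3])) / 4;
    v[..4].iter()
        .map(|&x| {
            let d = i32::from(x) - mean;
            d * d
        })
        .sum()
}

// Split the input into "flat" and "detailed" groups; each group is then
// quantised into its own part of the common codebook.
fn partition_by_variance(vecs: &[Vec6], threshold: i32) -> (Vec<Vec6>, Vec<Vec6>) {
    vecs.iter().partition(|v| luma_variance(v) < threshold)
}
```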
Here are rough results from encoding several different files (and using different numbers of strips): median cut took 11-14 seconds, ELBG took 110-160 seconds, and the new mode (I decided to call it fast) took 43-62 seconds. I think even such approximate numbers speak for themselves. There’s also an interesting side effect: because of the partitioning it tends to produce smaller codebooks overall.
And while we’re speaking about quantisation results, here’s the first frame of the waterfall sample encoded in different modes:
As you can see, median cut produces rather poor images, but maybe those artefacts will make people think of the original Cinepak more. Fast mode is much nicer; it still has some artefacts (just look at the left edge of the waterfall), but if you don’t pay too much attention it’s not much worse than full ELBG.
Are there ways to improve it even further? Definitely. For starters, the original encoder exploits the previous codebook to create a new one, while my encoder always generates a new codebook from scratch (in theory I could skip the median cut stage for inter strips, but I suspect ELBG would take much longer in that case). The second way is to speed up ELBG itself. From what I could see, it spends most of its time determining which cluster each point belongs to. With some smarter structure (something like a k-d tree, plus some caching to skip recalculating certain clusters altogether) it should be possible to speed it up several times over. Unfortunately, in this case I value clarity more, so I’ll leave it as is.
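To give an idea of what a cheaper assignment step could look like, here is a rough sketch of one standard trick (not the actual implementation): start from the point’s previously assigned centroid and abort a distance calculation as soon as it exceeds the best candidate found so far.

```rust
// Sketch only: squared distance with early termination once it can no
// longer beat the current best candidate.
fn squared_dist_bounded(a: &[i16], b: &[i16], limit: u64) -> Option<u64> {
    let mut sum = 0u64;
    for (&x, &y) in a.iter().zip(b.iter()) {
        let d = i64::from(x) - i64::from(y);
        sum += (d * d) as u64;
        if sum >= limit {
            return None; // already no better than the best so far
        }
    }
    Some(sum)
}

// Assign a point to the nearest centroid, seeding the search with the
// previous assignment so the bound is tight from the start.
fn assign_cluster(point: &[i16], centroids: &[Vec<i16>], prev: usize) -> usize {
    let mut best_idx = prev;
    let mut best_dist = squared_dist_bounded(point, &centroids[prev], u64::MAX)
        .unwrap_or(u64::MAX);
    for (i, c) in centroids.iter().enumerate() {
        if i == prev {
            continue;
        }
        if let Some(d) = squared_dist_bounded(point, c, best_dist) {
            best_dist = d;
            best_idx = i;
        }
    }
    best_idx
}
```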
P.S. I may also try to see how using thresholds and block variance to decide the coding mode affects speed and quality (in that case we first decide how to code the blocks and then form codebooks for them, instead of forming codebooks first and then deciding which mode suits the current block better; we would also have smaller sets to build the codebooks from). But I may do something different instead. Or nothing at all.
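A hypothetical sketch of such an up-front mode decision (the thresholds and the skip/V1/V4 logic here are made up for illustration, not taken from any encoder):

```rust
// Illustrative only: decide the coding mode of a 4x4 luma block from
// simple statistics before any codebook is built.
enum BlockMode {
    Skip,
    V1,
    V4,
}

fn decide_mode(cur: &[i16; 16], prev: Option<&[i16; 16]>, skip_thr: i32, detail_thr: i32) -> BlockMode {
    // Skip the block if it barely changed from the previous frame.
    if let Some(prev) = prev {
        let diff: i32 = cur.iter().zip(prev.iter())
            .map(|(&a, &b)| {
                let d = i32::from(a) - i32::from(b);
                d * d
            })
            .sum();
        if diff < skip_thr {
            return BlockMode::Skip;
        }
    }
    // Otherwise pick V1 for flat blocks and V4 for detailed ones.
    let mean = cur.iter().map(|&x| i32::from(x)).sum::<i32>() / 16;
    let var: i32 = cur.iter()
        .map(|&x| {
            let d = i32::from(x) - mean;
            d * d
        })
        .sum();
    if var < detail_thr {
        BlockMode::V1
    } else {
        BlockMode::V4
    }
}
```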
Interesting. Speeding up ELBG using a smarter structure would help more encoders than just cinepak (in lavc that is).
The median cut output looks especially terrible.
Another way to speed up encoding could be to quantize DC and AC separately, then form a combined codebook. As in: pretend the input is 4:1:0, compute, say, a 16-entry 3D codebook from that and a separate 16-entry 4D AC luma codebook, then form the Cartesian product of the two. Quantizing is also sped up this way. Finish up by pruning unused entries. This is similar to how codec2 works internally, having multiple codebooks.
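The join step of that proposal could look roughly like this (a sketch under the assumption of separate 16-entry DC and AC codebooks; nothing here comes from an actual encoder):

```rust
// Illustrative sketch: combine a DC codebook (Yavg, U, V) with an AC luma
// codebook (deviations from the block average) into full 6-component entries.
fn cartesian_codebook(dc_cb: &[[i16; 3]], ac_cb: &[[i16; 4]]) -> Vec<[i16; 6]> {
    let mut full = Vec::with_capacity(dc_cb.len() * ac_cb.len());
    for dc in dc_cb {
        let (y_avg, u, v) = (dc[0], dc[1], dc[2]);
        for ac in ac_cb {
            full.push([
                (y_avg + ac[0]).clamp(0, 255),
                (y_avg + ac[1]).clamp(0, 255),
                (y_avg + ac[2]).clamp(0, 255),
                (y_avg + ac[3]).clamp(0, 255),
                u,
                v,
            ]);
        }
    }
    // Entries that end up unused after mapping the input would be pruned.
    full
}
```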
Well, I also have a generic ELBG implementation (currently used only in the Cinepak encoder) and I considered adding an external searcher (because it will depend on the data type). For example, a simple local search (i.e. check just the codebook entries or clusters in the vicinity and, if the distance is small enough, search no further) cuts encoding time from ~160 to ~100 seconds with slight quality degradation. At least this search can be reused for mapping input to the codebook entries as well.
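As a toy illustration of the “good enough” cut-off part of such a search (not the actual searcher, which would also restrict which candidates get examined in the first place):

```rust
// Sketch only: stop scanning the codebook as soon as a candidate is closer
// than a "good enough" threshold instead of always finding the exact nearest.
fn find_entry(point: &[i16; 6], codebook: &[[i16; 6]], good_enough: u32) -> usize {
    let mut best_idx = 0;
    let mut best_dist = u32::MAX;
    for (i, entry) in codebook.iter().enumerate() {
        let dist: u32 = point.iter().zip(entry.iter())
            .map(|(&a, &b)| {
                let d = i32::from(a) - i32::from(b);
                (d * d) as u32
            })
            .sum();
        if dist < best_dist {
            best_dist = dist;
            best_idx = i;
        }
        if best_dist <= good_enough {
            break; // close enough, search no further
        }
    }
    best_idx
}
```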
As for codebook decomposition, that’s an interesting idea but I suspect it works well mostly for homogeneous vectors (i.e. in audio codecs). I might try it though (or not, the future will tell).
We should expect vectors to be somewhat homogeneous. Most material isn’t aligned to the 4×4 grid.
Think SSIM. Split the overall color of each block (subsample to 4:1:0, three values per block) from their structure (luma after subtracting the average -> AC, four values per block). For grayscale it gets extra easy because the “colors” can be binned with a simple histogram since it’s then just a scalar per block.
The optimal “split” between DC and AC codebook size depends on the material. In the extreme, a 256:1 DC:AC split makes sense for very blurry material. With high-contrast grayscale (black text on white background) a 1:256 split makes sense.
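For reference, the per-block decomposition described above could be written as something like this (a sketch assuming the usual 6-component Cinepak vector; the names are illustrative):

```rust
// Illustrative split of a [Y0..Y3, U, V] vector into a DC part
// (average luma plus chroma) and an AC part (luma deviations).
fn split_dc_ac(v: &[i16; 6]) -> ([i16; 3], [i16; 4]) {
    let y_avg = (v[0] + v[1] + v[2] + v[3]) / 4;
    let dc = [y_avg, v[4], v[5]];
    let ac = [v[0] - y_avg, v[1] - y_avg, v[2] - y_avg, v[3] - y_avg];
    (dc, ac)
}
```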
I meant that in audio codecs the vectors are made of samples of the same type, so they are easier to split compared to a luma+chroma vector. What you propose (generating codebooks for (avg(Y), U, V) and for the deviation from the mean) makes sense, but the question of selecting the sizes of those intermediate codebooks still remains. Maybe it’s better just to generate 256 entries in each and leave only the most popular ones during the join.
P.S. I suspect that median cut quality can be improved greatly by using the same dimensions as you proposed (Yavg, U, V, Y0-Yavg, …, Y3-Yavg), but I’m not sure if I’ll bother with it.