I’ve been working on the TM encoder now and then, and finally I have some things to say about it.
First of all, the general state of things: the encoder works and produces valid output for both methods 1 and 3 (the encoding is still not perfect, but hopefully that can be fixed). It still lacks audio encoding; I need to add WAV reading support to the encoder and extend my decoder to test the output.
Second, I also decided to add an auto-selection option which allows the encoder to decide whether to use method 1 or method 3 for a frame. It simply chooses one depending on the percentage of the most common pair and the total number of unique pairs present. It does not seem to have any practical use, but it may be handy for testing decoders that expect only one coding method to be present in the stream.
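To make that concrete, here is a minimal sketch of what such a heuristic could look like; the threshold values, the pair representation and the function name are my own assumptions for illustration, not what the encoder actually does:

```rust
use std::collections::HashMap;

// Hypothetical heuristic: pick method 1 when one pair of values dominates
// and only a few distinct pairs exist, otherwise fall back to method 3.
// The thresholds (50% and 256) are made up for this sketch.
fn pick_method(pair_counts: &HashMap<(u8, u8), u32>) -> u8 {
    let total: u32 = pair_counts.values().sum();
    let most_common = pair_counts.values().copied().max().unwrap_or(0);
    if total > 0 && most_common * 100 / total >= 50 && pair_counts.len() < 256 {
        1
    } else {
        3
    }
}
```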
And now let’s move to the most interesting thing in this whole format (at least to me): codebook generation. TrueMotion (1 and 2X) is a rare example of a codec using Tunstall coding (the only other codec known to use it is CRI P256), which is essentially the inverse of Huffman coding: a fixed-length code corresponds to a variable-length sequence of symbols.
The original codebook construction goes something like this: add all symbols to the codebook, then, while the space allows, replace the most probable entry with new strings that use this old entry as a prefix. E.g. for the alphabet {0, 1, 2} (with 0 being the most probable symbol) and a codebook of size 8, initially you’ll have just the same {0, 1, 2}, then {00, 01, 02, 1, 2} and finally {000, 001, 002, 01, 02, 1, 2} (and you can add another code there to make it full).
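For reference, a rough sketch of that textbook construction might look like this, assuming per-symbol probabilities are known (all names here are mine, not from any actual implementation):

```rust
// Build a Tunstall codebook: start with single symbols, then repeatedly
// replace the most probable entry with its one-symbol-longer extensions.
fn build_tunstall(symbols: &[(u8, f64)], codebook_size: usize) -> Vec<Vec<u8>> {
    let mut codebook: Vec<(Vec<u8>, f64)> =
        symbols.iter().map(|&(sym, prob)| (vec![sym], prob)).collect();

    // Each expansion removes one entry and adds one per alphabet symbol.
    while codebook.len() + symbols.len() - 1 <= codebook_size {
        // Find the currently most probable entry.
        let mut best = 0;
        for i in 1..codebook.len() {
            if codebook[i].1 > codebook[best].1 {
                best = i;
            }
        }
        let (prefix, prob) = codebook.swap_remove(best);
        for &(sym, p) in symbols {
            let mut seq = prefix.clone();
            seq.push(sym);
            codebook.push((seq, prob * p));
        }
    }
    codebook.into_iter().map(|(seq, _)| seq).collect()
}
```

With the alphabet {0, 1, 2} (probabilities in that order) and a codebook size of 8, this stops at the seven entries listed above.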
Of course it’s rather impractical in this form, as not all sequences will be encountered in the data and you still need to code shorter sequences (e.g. how would you code exactly four zeroes with the above codebook?). Thus I decided to do it a bit differently: I only add new sequences without deleting old ones, and I also keep (limited) statistics on the sequences encountered (from two to twelve symbols). First I add all encountered pairs of symbols, then I select the most commonly occurring sequence and add all its known children (i.e. those with an additional pair of symbols at the end), mark it as an ineligible candidate for the following searches, and repeat the process until the codebook is full. If somebody cares about implementation details: I used a trie for holding this information as it’s easy to implement and understand, and during the update process I keep a list of trie nodes for the previously encountered sequences up to the maximum depth, so I can update all those sub-sequence statistics in one pass over the input.
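Here is a condensed sketch of how such a statistics pass could be organised; the arena-backed trie and all names are illustrative assumptions, not the actual encoder code:

```rust
use std::collections::HashMap;

// Arena-backed trie node for sequence statistics.
#[derive(Default)]
struct Node {
    children: HashMap<u8, usize>, // symbol -> index of the child node
    count: u32,                   // how many times this sequence occurred
}

const MAX_DEPTH: usize = 12; // statistics for sequences of 2..=12 symbols

struct SeqTrie {
    nodes: Vec<Node>, // nodes[0] is the root (empty sequence)
}

impl SeqTrie {
    fn new() -> Self {
        SeqTrie { nodes: vec![Node::default()] }
    }

    // Get or create the child of `parent` for symbol `sym`.
    fn child(&mut self, parent: usize, sym: u8) -> usize {
        if let Some(&idx) = self.nodes[parent].children.get(&sym) {
            return idx;
        }
        let idx = self.nodes.len();
        self.nodes.push(Node::default());
        self.nodes[parent].children.insert(sym, idx);
        idx
    }

    // One pass over the input: keep the trie positions of all sequences
    // ending just before the current symbol and advance each of them by
    // that symbol, so every sub-sequence of length 2..=MAX_DEPTH is counted.
    fn gather(&mut self, data: &[u8]) {
        let mut active: Vec<(usize, usize)> = Vec::new(); // (node index, depth)
        for &sym in data {
            active.push((0, 0)); // a new sequence starts at every position
            let mut next = Vec::with_capacity(active.len());
            for (node, depth) in active.drain(..) {
                let idx = self.child(node, sym);
                if depth + 1 >= 2 {
                    self.nodes[idx].count += 1;
                }
                if depth + 1 < MAX_DEPTH {
                    next.push((idx, depth + 1));
                }
            }
            active = next;
        }
    }
}
```

Keeping the list of active trie positions is what makes a single pass over the input sufficient for all sub-sequence lengths; the codebook is then grown greedily from the gathered counts as described above.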
Does it make a difference? Indeed it does. I took the original LOGO.DUK (the only video with a different codebook), decoded it and re-compressed it both with the default codebook all the other videos use and with the one generated specifically for it. Here are the results:
- original .duk size: 2818868 bytes;
- re-compressed file size: 2838062 bytes;
- re-compressed with the file-specific codebook: 2578010 bytes.
That’s using the same method 3 as the original file. With method 1, the file sizes with the standard and custom codebooks are 2622758 and 2490058 bytes respectively.
As you can see, the difference is noticeable. Of course it requires two passes over the input and many megabytes of memory to store the sequence statistics, but the results may be worth it. In theory the compression could be improved even further if you knew how to generate a codebook that allows splitting the frame data into unique chunks, but that sounds a lot like an NP-hard problem to me.
Anyway, I got what I wanted from it so it just requires some bugfixing, audio encoding support, polishing and documenting. After that I can dump its source code for all zero users and forget about Duck codecs until something even more exotic manages to re-surface.