As I mentioned in the introduction post, Bink Audio is rather simple: you have audio frames overlapped by 1/16th of its size with the previous and the following frame, data is transformed either with RDFT (stereo mode) or DCT-II (per-channel mode), quantised and written out.
From what I can tell, there are about three revisions of the codec: version 'b'
(and maybe 'd'
) was RDFT-only and first two coefficients were written as 32-bit floats. Later versions shaved three bits off exponents as the range for those coefficients is rather limited. Also while the initial version grouped output values by sixteen, later versions use grouping by eight values with possibility to code a run for the groups with the same bit width.
The coding is rather simple, just quantise bands (that more or less correspond to the critical bands for human ear), select bitwidth of the coefficients groups (that are fixed-width are not related to the band widths) and code them without any special tricks. The only trick is how to quantise the bands.
Since my previous attempts to write a proper psychoacoustic model for an encoder failed, I decided to keep it simple: the encoder simply tries all possible quantisers and selects the one with the lowest value of A log2 dist+λ bits. This may be slow but it works fast enough for my (un)practical purposes and the quality is not that bad either (as much as I can be trusted on judging it). And of course it allows to control bitrate in rather natural way.
There’s one other caveat though: Bink Audio frames are tied to Bink Video frames (unless it’s newer Bink Audio only container) and thus the codec should know the video framerate in order to match it. I worked around it by introducing yet another nihav-encoder
hack to set audio timebase from the video so I don’t have to provide it by hand.
So that’s it. It was a nice experiment and I hope (but not expect) to think of something equally fun to do next.