Here I’m going to give a brief review of how the Bink formats (container, video and audio codecs) are designed and how that affected the overall encoder design in NihAV.
Bink container is simple: there’s a global header, seek table with frame offsets and the frames themselves, containing video part and optionally audio part(s). The only problem is that you need to know the number of frames beforehand or employ some tricks like storing all data in memory and writing it all only at the end (or dump frame data on disk in a separate file and assemble the result file afterwards). I picked the former and made my muxer demand to know the number of frames beforehand (plus a semi-unclean hack in nihav-encoder to scan the input video stream for the number of frames and pass that information to the Bink Video encoder).
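To see why the frame count has to be known up front, here is a minimal sketch of how the seek table could be filled. It assumes one 32-bit offset per frame plus a final end-of-file entry, and `header_size` and `frame_offsets` are hypothetical names used only for illustration, not the real Bink header layout or NihAV code.

```rust
/// Computes absolute file offsets for every frame.
/// Assumption: the seek table sits between the header and the frame data
/// and holds one 32-bit entry per frame plus one closing entry.
fn frame_offsets(header_size: u32, frame_sizes: &[u32]) -> Vec<u32> {
    // The table size depends on the number of frames, so the first frame
    // offset cannot be known until the frame count is known.
    let table_size = (frame_sizes.len() as u32 + 1) * 4;
    let mut offsets = Vec::with_capacity(frame_sizes.len() + 1);
    let mut pos = header_size + table_size;
    for &size in frame_sizes {
        offsets.push(pos);
        pos += size;
    }
    // Closing entry: the position right after the last frame.
    offsets.push(pos);
    offsets
}
```

With a streaming muxer the frame sizes are not known in advance either, but the table can be reserved with the right size and patched at the end, which only works when the number of entries is fixed from the start.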
Bink Video is rather simple too: the frame is split into planes, each plane is split into 8×8 blocks (in case of version 'f' and later it is possible to code a row of blocks scaled twice but it’s the same intra block types inside), each block can be coded in several different ways:
- fill block with one colour;
- paint block with two colours using a custom mask;
- paint block using RLE and one of sixteen predefined scans;
- raw data block;
- DCT block;
- skip block;
- motion compensated block;
- motion compensated block with lossless coded residue;
- motion compensated block with DCT-coded residue.
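The list above maps naturally onto an enumeration. The following is only a sketch: the variant names are made up for illustration and no bitstream identifiers are given, since the list does not state them.

```rust
/// The nine block coding modes from the list above (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BlockMode {
    Fill,          // one colour
    Pattern,       // two colours with a custom mask
    Run,           // RLE over one of sixteen predefined scans
    Raw,           // raw data
    Intra,         // DCT block
    Skip,          // block is left as it was
    Motion,        // motion compensation only
    MotionResidue, // motion compensation + lossless residue
    MotionDct,     // motion compensation + DCT-coded residue
}

impl BlockMode {
    const ALL: [BlockMode; 9] = [
        BlockMode::Fill, BlockMode::Pattern, BlockMode::Run,
        BlockMode::Raw, BlockMode::Intra, BlockMode::Skip,
        BlockMode::Motion, BlockMode::MotionResidue, BlockMode::MotionDct,
    ];

    /// True for modes that depend on previously decoded picture data.
    fn needs_reference(self) -> bool {
        matches!(self,
            BlockMode::Skip | BlockMode::Motion |
            BlockMode::MotionResidue | BlockMode::MotionDct)
    }

    /// True for modes that need a forward DCT in the encoder.
    fn uses_dct(self) -> bool {
        matches!(self, BlockMode::Intra | BlockMode::MotionDct)
    }
}
```

Splitting the modes this way is convenient in an encoder: on the first frame only the modes with `needs_reference() == false` have to be tried.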
Before version 'f' the codec used the same frame for decoding and motion compensation, later versions use a more conventional scheme with the previous and current frames and swapping them around.
Additionally data is split into several streams for different types of information (block type, motion vectors, DC coefficients for DCT-coded blocks and so on). That data comes in chunks when the old portion is fully read and it’s also interleaved with another kind of bitstream data. It’s not complicated but still deserves a separate post explaining how it works.
So the encoder is rather straightforward: split frame the same way, try different block coding modes and pick the best one (or at least good enough one), reconstruct frame after the block coding is decided, write output, repeat for the next frame.
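The "try different modes and pick the best one" step can be sketched as a comparison of costs. This is not the NihAV implementation: it considers only two modes (fill and raw), uses a generic distortion-plus-weighted-bits cost, and the bit counts (8 bits for a fill colour, 8 bits per raw pixel) ignore block type signalling and are assumptions made for the example.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Choice {
    Fill(u8), // whole block painted with this colour
    Raw,      // all 64 pixels stored as is
}

/// Sum of squared errors when the block is replaced with a single colour.
fn sse_fill(block: &[u8; 64], colour: u8) -> u32 {
    block
        .iter()
        .map(|&p| {
            let d = i32::from(p) - i32::from(colour);
            (d * d) as u32
        })
        .sum()
}

/// Picks the cheaper mode, where cost = distortion + lambda * bits.
/// `lambda` is a hypothetical quality knob: larger values favour smaller output.
fn choose_mode(block: &[u8; 64], lambda: u32) -> Choice {
    let sum: u32 = block.iter().map(|&p| u32::from(p)).sum();
    let mean = ((sum + 32) / 64) as u8; // rounded average colour
    let fill_cost = sse_fill(block, mean) + lambda * 8;
    let raw_cost = lambda * 64 * 8; // raw is lossless, so no distortion term
    if fill_cost <= raw_cost {
        Choice::Fill(mean)
    } else {
        Choice::Raw
    }
}
```

A flat block ends up as a fill block while a checkerboard of black and white pixels ends up raw. A real encoder would add the remaining modes as further candidates in the same comparison.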
In the following posts I’m going to tell what tricks I used to speed up encoding, how to implement forward DCT with next to no mathematical background and what coding Bink Video uses for DCT blocks and lossless residue coding.
And finally, Bink Audio. It is a simple perceptual codec that works on slightly overlapped frames (just one eighth of the frame is overlapped with the next/previous frames, not the full frame like in MDCT-based codecs) and employs either RDFT to transform interleaved stereo data or per-channel DCT-II. Transformed data is coded the same way for both codecs: the first two coefficients are transmitted (almost) without any compression, the rest of the coefficients are quantised per critical band and then for each group of eight or sixteen coefficients the number of bits to code the quantised coefficients is selected. The last step is a no-brainer: just select the number of bits enough to fit the maximum value in the band and that’s all. Selecting a good quantiser for the band is what distinguishes a good perceptual audio encoder from mine (there’s a rumour that RAD people employed one of the LAME developers to make it good since they cared; update: one of the RAD guys claims it was rather John Miles, whom I would trust with anything audio or programming an Ultima game). Since I don’t trust my ears much, I’ll simply try some approaches without a sophisticated psychoacoustic model and see what produces not completely appalling results. I’ll write more about it when I have some concrete results.
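The bit-width selection described above can be sketched in a few lines. The sketch assumes a plain uniform quantiser with rounding and assumes that only magnitudes are coded with the selected width (signs handled separately); both are simplifications for illustration, and `quantise_band` and `bits_for_group` are hypothetical names.

```rust
/// Quantises transform coefficients of one band with a single step size.
fn quantise_band(coeffs: &[f32], quant: f32) -> Vec<i32> {
    coeffs.iter().map(|&c| (c / quant).round() as i32).collect()
}

/// Smallest number of bits that can hold the largest magnitude in the group.
/// Returns 0 when every coefficient in the group is zero.
fn bits_for_group(coeffs: &[i32]) -> u32 {
    let max_mag = coeffs
        .iter()
        .map(|c| c.unsigned_abs())
        .max()
        .unwrap_or(0);
    // Position of the highest set bit, counted from one.
    32 - max_mag.leading_zeros()
}
```

For example, quantising `[10.0, -3.9, 0.4]` with a step of 2.0 gives `[5, -2, 0]`, and the largest magnitude 5 needs three bits. The hard part, choosing `quant` per band, is exactly what this sketch leaves out.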
No LAME developer involvement in Bink Audio that I’m aware of!
AFAIK the psychoacoustic model in Bink Audio was all John Miles (same guy who did the Miles Sound System). It’s basically a variant of the original MP3 psychoacoustic model as far as I can tell, derived from the same papers and reference materials as Fraunhofer’s (but I don’t think it’s gotten nearly as much love as Fraunhofer’s encoder has). Bink Audio is inherently variable bitrate only, there is no attempt made to hit a fixed (or even particularly consistent) data rate, which makes encoding simple since there is no real rate allocation loop.
I would rate the original Bink Audio to be roughly comparable with MPEG1 Audio Layer 2 (not 3). The transform stage is much more regular and finer resolution than MP2’s filter bank, and the entropy coding is extremely simple, but it does have a psychoacoustic model in the encoder. It was designed for fast decoding alongside video playback on late 90s-era machines, not maximum quality. A few years ago we did a variant (Bink Audio 2) that essentially reorders the bits in the bitstream (but keeps the general entropy coder design) to be more SIMD-friendly, allowing even faster decoding. These days the main use (outside Bink videos) is for sound effects. Bink Audio is more expensive to decode than real simple schemes like ADPCM, but a lot smaller, and it’s substantially faster than all other perceptual audio codecs I’m aware of, even ones considered fairly lightweight by today’s standards like MP3. That makes it a nice choice for game audio where it’s not rare to have 100 SFX samples playing at the same time (a lot of complex cues are layered from 5+ simultaneous samples). A fast perceptual codec lets you keep compressed samples in memory and decode them on the fly with only minor CPU impact, which is still a factor especially on the older game consoles like PS4, Xbox 1 or Nintendo Switch.
The format’s single biggest weakness is that the block size is fixed and very long at typically 2048 samples. With no short block option, transients pre-echo a lot unless you dump a lot of coefficients on them.
Thank you very much for the information. Since such facts are not well-known (especially the rationale behind the design decisions), the rumours tend to fill the void.
I’d agree that conceptually Bink Audio is closer to Layer II but it’s a good thing since MP3 is a format that has rather bad internal design (driven more by the political/business needs than technical merits) and which has good encoders despite that, not because it was well-designed.
And considering the initial targets it’s no wonder it became applicable for sound effects (I still remember the intermediate times when there were Bink files with 4×4 video stream to serve as sound effects before the dedicated Bink Audio-only container was made). I guess only variable bitrate ADPCM codecs may rival it in this niche but that’s a different can of worms.