A Short Essay on Bitstream Reading « Kostya's Boring Codec World

A Short Essay on Bitstream Reading

So, it has come to this. How does bitstream reading might work. Here I’ll try to present several ways to read bits and variable-length codes.

Bit-by-bit readers

Initially bitstream reading might be implemented with simple get_bit() call that’s called several times in a row to read multi-bit values.

Let’s start with the least efficient way to read bits (I’ve written it in 1998 or so when I didn’t know about bit operations at all and I’ll stay ashamed of it for the rest of my life):

union byte {
   char byte;
   struct {
     int bit0: 1;
     int bit1: 1;
...
     int bit7: 1;
   } bits;
};

int bit_pos;
union byte byte;

int get_bit()
{
  int bit;
  if (bit_pos == 8) {
    byte.byte = getchar();
    bit_pos = 0;
  }
  switch (bit_pos) {
  case 0: bit = byte.bits.bit0; break;
  ...
  }
  bit_pos++;
  return bit;
}

Now, there’s an easier way to read single bits: you unpack bits into byte array and read single bits from it. On ARM there was (still is?) a special mode for mapping some memory area into another in exactly this way. No wonder the only code that operates bits in a byte array I saw was in Intel Audio Coder and Intel Music Coder (on x86 or course).

In any case this kind of readers should be avoided if you care about performance because you’ll end doing the same job twice: first you break input into bits and then glue some of the bits together. So let’s see how it can be done more efficiently.

Multi-bit readers

When you learn bit operations you’ll see that it’s easy to extract certain group of bits from the integer by applying mask and shifting (or shifting first and masking later). So if you keep bit position you can extract bits like this:

unsigned get_bits(uint32_t *src, int pos, int nbits)
{
  unsigned mask = (1 << nbits) - 1; //  or ~0U >> (32 - nbits)
  int word_pos = pos >> 5;
  int bit_pos  = pos & 31;
  uint32_t buffer = src[word_pos] >> bit_pos;
  if (bit_pos + nbits > 32)
    buffer |= src[word_pos + 1] << (32 - bit_pos);
  return buffer & mask;
}

There is only one non-trivial thing here: if you don’t have enough bits in the input word you need to load the second one too. The same approach works with bytes obviously.

Side note: quite often masks are stored in a precomputed array so finding sequence like { 0x0000, 0x0001, 0x0003, ... 0x7FFF, 0xFFFF } was the quickest way to locate bitstream reader in the binary specification.

So it’s all good and just but can it be done even better? Yes.

Multi-bit readers with cache

Memory accesses are actually expensive, less so on x86 chips, more so on the others. So when you read a lot of values from bitstream in a sequence it’s a good idea to store some unread bits in a cache which can be in register instead of memory and thus accessed much faster.

Bitreaders with cache operate like this:

struct Bitread {
  uintXX_t cache;
  int bits; // in cache
  int pos;
  uint8_t *src;
};

unsigned read_bits(struct Bitread *br, int nbits)
{
  unsigned ret;
  if (nbits < br->bits)
    refill_cache(br);
  ret = extract_value(cache, nbits);
  update_cache(br, nbits);
  return ret;
}

What is the deal with uintXX_t and unknown functions referenced here? Well, bitstream reader can have different contracts about cache size and mode of reading.

Just to remind you: bitstream can be read most significant bit first and least significant bit first. In the first case we read data from top of the cache and update it by shifting unread bits to the top as well:

unsigned extract_value(uintXX_t cache, int nbits)
{
  return cache >> (sizeof(cache) * 8 - nbits);
}

void update_cache(struct Bitread *br, int nbits)
{
  br.cache <<= nbits;
  br.bits   -= nbits;
}

In the second case we’ll need to read data like this:

unsigned extract_value(uintXX_t cache, int nbits)
{
  return cache & ((1 << nbits) - 1);
}

void update_cache(struct Bitread *br, int nbits)
{
  br.cache >>= nbits;
  br.bits   -= nbits;
}

And now for the elephant in the room: most implementations don’t like to read the same memory twice (because it complicates code in this corner case: you need to keep a special flag for repeated reading or something) so some bits in cache might be not available. And if you fill cache by single bytes it might happen that you have 1-7 bits in 32-bit cache already and can fill it with only additional 3 bytes, making full cache having 25-31 bit of data. If you have ever wondered why some bitstream readers limit amount of bits read at once to 17 or 25—here’s your answer.

Of course you can easily get more bits by reading data again and combining the result but it takes more MiNicycles^{^MiNicycle—an amount by which executing code may vary depending on CPU architecture, compiler used and phase of moon. Often the main argument point in benchmarks.} and thus it’s avoided in certain scenarios (the same stands for adding conditions inside the functions).

So one solution is to use cache size larger than the desired amount of bits that can be read at once. Or you can introduce secondary cache which is essentially the same thing only more explicit.

NihAV uses exactly this approach—bitreader that can read up to 32 bits at once with 64-bit cache that refills 32 bits of cache in one operation. And since I don’t care much about performance it does it depending on the mode selected during object creation:

    fn fill32be(&mut self, src: &[u8]) {
        let nw = (((src[0] as u32) < < 24) |
                  ((src[1] as u32) << 16) |
                  ((src[2] as u32) <<  8) |
                  ((src[3] as u32) <<  0)) as u64;
        self.cache |= nw << (32 - self.bits);
    }

    fn refill(&mut self) -> BitReaderResult<()> {
       if self.pos >= self.end { return Err(BitstreamEnd) }
        while self.bits <= 32 {
            if self.pos + 4 <= self.end {
                let buf = &self.src[self.pos..];
                match self.mode {
                    BitReaderMode::BE      => self.fill32be  (buf),
                    BitReaderMode::LE16MSB => self.fill32le16(buf, 32),
                    BitReaderMode::LE      => self.fill32le32(buf),
                    BitReaderMode::LE32MSB => self.fill32le32(buf),
                }
                self.pos  +=  4;
                self.bits += 32;
            } else {
                let mut buf: [u8; 4] = [0; 4];
                let mut newbits: u8 = 0;
                for i in 0..3 {
                    if self.pos < self.end {
                        buf[i] = self.src[self.pos];
                        self.pos = self.pos + 1;
                        newbits += 8;
                    }
                }
                if newbits == 0 { break; }
                match self.mode {
                    BitReaderMode::BE      => self.fill32be  (&buf),
                    BitReaderMode::LE16MSB => self.fill32le16(&buf, newbits),
                    BitReaderMode::LE      => self.fill32le32(&buf),
                    BitReaderMode::LE32MSB => self.fill32le32(&buf),
                }
                self.bits += newbits;
            }
        }
        Ok(())
    }

    fn read_cache(&mut self, nbits: u8) -> u32 {
        let res = match self.mode {
            BitReaderMode::LE => ((1u64 << nbits) - 1) & self.cache,
            _                 => (self.cache as u64) >> (64 - nbits),
        };
        res as u32
    }

Variable-length code reading

There are three ways to read variable-length codes: using lookup tables (LUTs), using tree and the stupid one.

The stupid way is the simplest one too:

sort your codes increasing by length (optional)
start with empty code
read another bit into code
check if the code you got matches one in the set
if it matches, return the match
if it does not, goto step 3

It’s stupid because you iterate over the same set over and over. It might not matter for the small set of short codes but you lose MiNicycles on larger ones. Obviously, it’s been used in RealVideo decoder along with LUTs.

Significantly better way is to decode code using a Huffman tree: you read bit, take the corresponding branch and if it’s not leaf repeat the cycle again. It’s straightforward and works quite well.

Now, the LUT approach can be described by the old joke: “How to catch six lions? Catch ten lions and release four of them back.” With LUTs you peek at the fixed amount bits in the bitstream, use that value as index in LUT, it tells you the value and the actual number of bits for the code which you have to discard.

For example, let’s have this simple codeset (MSB):

  #0 -> 00
  #1 -> 01
  #2 -> 1

As you can see two bits are enough to determine the code, so the LUT will look like that:

  00 -> code #0, 2 bits
  01 -> code #1, 2 bits
  10 -> code #2, 1 bit
  11 -> code #2, 1 bit

The same way for LSB mode codeset:

  #0 -> 00
  #1 -> 10
  #2 ->  1

will turn into

  00 -> code #0, 2 bits
  01 -> code #2, 1 bit
  10 -> code #1, 2 bits
  11 -> code #2, 1 bit

Quite often you have long codes and you want to keep LUT small. What do you do then? You mark the common prefixes for long codes as escape and point it to the secondary LUT, repeat as much as needed. Decoding is the same: peek LUT index, if it’s a code then discard N bits and return the value, else select the secondary LUT table and repeat again.

Here’s an example (MSB for clarity):

  #0 -> 000
  #1 -> 0010|0
  #2 -> 0010|1
  #3 -> 1110
  #4 -> 1111|0
  #5 -> 1111|10
  #6 -> 1111|11

with LUT limited to 4 bits. We’ll get:

  0000 -> code #0, 3 bits
  0010 -> escape for LUT #2
  1110 -> code #3, 4 bits
  1111 -> escape for LUT #3

LUT #2 (1 bit lookup):
  0 -> code #1, 1 bit
  1 -> code #2, 1 bit

LUT #3 (2 bits lookup):
  00 -> code #4, 1 bit
  01 -> code #4, 1 bit
  10 -> code #5, 2 bits
  11 -> code #6, 2 bits

You can implement it in different ways, like forcing all LUT sizes to be the same (more memory used, less hassle with amount of bits to peek at), storing LUT parameters separately or having one large LUT with offsets indicating which subpart to use.

In NihAV I use one LUT comprising 32-bit words in the following format:

(idx < < 8) | (escape_flag << 7) | (bits)

Depending on escape_flag you either have code with length bits and value syms[idx] or an escape value for LUT at offset idx that uses bits of lookup index.

I construct this LUT by filling LUT for all codes below hardcoded LUT size (10 bits currently) and gathering other codes into buckets where I store maximum code size and codes after the common prefix is removed and generate the secondary LUTs for them too (recursively). E.g. for the example above I'd have:

  bucket #0 - prefix 0010, maxlength 1, codes { 0 ; 1 }
  bucket #1 - prefix 1111, maxlength 2, codes { 0; 10; 11 }

And I fill default values with 0x0000007F so if during decoding I get bits = 0x7F I can report illegal code back. Hopefully now it’s clear how bitstream reading in general and particularly in NihAV works.

P.S. I doubt all this information will be interesting to anybody at all but oh well, somebody has asked for it.

This entry was posted on Monday, May 15th, 2017 at 1:51 pm and is filed under Uncategorized. You can follow any responses to this entry through the RSS 2.0 feed. Both comments and pings are currently closed.

10 Responses to “A Short Essay on Bitstream Reading”

koda says:

May 15, 2017 at 2:14 pm

I found this very interesting, thanks for sharing it
Luca Barbato says:

May 15, 2017 at 2:18 pm

Much appreciated 🙂
Paul says:

May 15, 2017 at 3:43 pm

How many MiNi cycles was needed to write this article?
Kostya says:

May 16, 2017 at 12:21 am

Since this is not a source code intended for compilation that metric is not applicable. I’ve wasted about 1.5-2 hours of real time to write the post though (I’m a slow writer but since my life sucks I had nothing better to do).

And if you wonder how much a MiNicycle is, the difference between computing mask as ~0U >> (32 - nbits) and loading it from a precomputed array is 2-3 MiNicycles.
Marcus says:

May 16, 2017 at 10:40 am

Did you ever check out my bitreader Kostya? it’s github.com/bumblebritches57/BitIO

Anyway, BitIO started out as a cached multi-bit reader, and it’s more or less stayed that way, most of the architectural changes have been around it in going from single input and output BitIO to separate input and output BitInput and BitOutput, but where the BitBuffers were still attached to source files, to the current architecture with separate inputs and outputs, and separate buffers in between them.

NGL, I’m really fuckin proud of this project.
Kostya says:

May 16, 2017 at 11:58 am

Now I’ve checked it (first time you mentioned it you forgot to provide a link). It looks interesting even if it’s not complete. Now you only need to provide a project that uses it for something interesting ;).

But I should point out that for performance considerations it’s better to inline the code so it would be even better if done STB-style. Though for me it’s such a trivial code that it’s easier for me to write it myself instead of searching for an appropriate library and adapting it to my needs.
Marcus says:

May 17, 2017 at 12:19 pm

Well, last time I put the URL in the website field, it was clickable from my name but yeah.

I’ve got a few projects that use it, a PNG decoder/encoder, AVC/H.264 encoder/decoder, FLAC encoder/decoder, but they’re all incomplete.

The biggest problem honestly is trying to read the math papers for Huffman trees, LZ77, CABAC, Asymmetric Binary System, and whatnot.

Yeah, I generally let Clang/LLVM handle inlining, for example ReadBits I don’t want inlined, just because it’s a user callable function so the calling overhead is worth is vs the increased binary size, but I just assumed it’d be better to do it that way, I may eventually look into inlining it.
Kostya says:

May 17, 2017 at 12:59 pm

My bad, I’ve failed to look at URL in your name.

And you probably should know that in most cases there’s not much math (if at all) needed to understand the working principles behind coding technologies and to implement them. Huffman coding can be understood if you have even a vague idea what probability is, same for LZ77. ANS/ABS and arithmetic coding is a bit trickier bit still quite easy. It’s usually lossless and speech codecs that can give you headache with their matrix decomposition, parcor coefficients and tilt compensation (and the literature on them assumes you know a lot). Still, in many cases it’s possible to understand what you should do and how without understanding why. So good luck with your projects, you might discover it’s easier than you thought.
Marcus says:

May 17, 2017 at 4:45 pm

It’s not so much the general concept that I struggle with (especially with Huffman, ABS is much more confusingly written tho)

But when I try reading the paper it get’s really confusing, just like the word choice, and they don’t really explain why you should do X; for example in Deflate, Huffman has 2 unused symbols, 257 and 258 for no apparent reason.

and they also expect you to know a lot of math that I don’t (I’m self taught), so I definitely struggle with it, but I am getting better at reading it, but i’m still not where I need to be.
Kostya says:

May 18, 2017 at 12:40 am

Quite often you can read an alternative paper (it’s not like associative coder by Buyanovskiy where you had about one paper and some posts from people trying to understand it). And the only piece of non-trivial math you should really know is log2/power. When I started to have interest in data compression I was still in school and didn’t know much math either (useful math at least), stuff like arithmetic coding was hard to comprehend and MPEG Audio Layer II seemed like a complex standard let alone MP3. I just kept reading various papers and tinker with code until I understood something and you can see the result.

As for Deflate, I look at the RFC and see that it simply combines various values into one table, so 0..255 is for literals (aka normal bytes you output literally), 256 is for end of block sign and values above 256 are for signalling match distance (i.e. for the standard LZ77 “copy back from the stream” operation).

That reminds me that I should finally implement inflation in NihAV myself.