NihAV: Some Progress to Report!

The large chunk is finally finished: NihAV has got support for RealVideo 3 and 4!

I’ve learned a great deal more about codecs since the last time I wrote a RealVideo 3/4 decoder (and specifications for both were leaked—they have mistakes but still clarify some things), so I was able to write a new decoder that also seems to reconstruct frames better.

Some words on the design: I’ve split it into several parts as usual—common RV3/4 code, RV3/4 DSP, an RV3 bitstream parser and DSP, and an RV4 bitstream parser and DSP. That’s the approach I’ve used before and I’ll probably use it in future decoders as well. The only more or less interesting thing is how I did weighted motion compensation: instead of a temporary buffer I allocate a 16×16 frame that I use for storing intermediate results and which is later used for averaging (since the motion compensation routines in RealVideo 3 and 4 differ while the weighted averaging is the same, it makes sense to split the latter into a separate operation).
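
Roughly, the split looks like this (a minimal sketch and not the actual NihAV code: the function name, the 16-pixel scratch stride and the 6-bit weight scale with its rounding are assumptions made for illustration). The RV3- or RV4-specific motion compensation writes its prediction into the 16×16 scratch block, and a single shared routine then blends that block with the prediction already sitting in the destination:

    // scratch block is laid out as 16 rows of 16 pixels
    const TMP_STRIDE: usize = 16;

    fn weight_avg_16x16(dst: &mut [u8], dst_stride: usize, tmp: &[u8; 16 * 16],
                        w_dst: u32, w_tmp: u32) {
        // dst already holds one prediction, tmp holds the other;
        // blend them with weights assumed to sum to 64
        for (drow, trow) in dst.chunks_mut(dst_stride).zip(tmp.chunks(TMP_STRIDE)).take(16) {
            for (d, &t) in drow.iter_mut().take(16).zip(trow.iter()) {
                let v = (*d as u32) * w_dst + (t as u32) * w_tmp;
                *d = ((v + 32) >> 6) as u8;
            }
        }
    }

This way both decoders can keep their own motion compensation routines and still share the averaging step.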

And now for the juicy part: benchmarks and performance. I’ve tested one of the RealVideo 4 trailers (namely swordfish.rmvb): avconv -threads 1 -cpuflags 0 decodes it in 15 seconds while nihav-tool needs almost 25.

Now a breakdown by category (numbers are kilocycles as reported by perf, avconv first, nihav-tool second):

  • Loop filter — 9.3k / 15.2k;
  • Motion compensation — 0.9k / 6.7k. Ouch!;
  • Intra prediction — 0.4k / 0.8k;
  • Transforms — 0.8k / 3.3k. Ouch!;
  • The rest (mostly bitstream decoding) — ~3k / ~7k.

So unoptimised Rust code is consistently twice as slow as semi-optimised C code and I’m more or less fine with that, but some things are especially bad. Take transforms: the transform code by itself is about as fast as its C version, but I have an explicit function add_coeffs() for adding transformed coefficients to the output and it takes 2.7 kilocycles—the second-heaviest function!

Here’s the original straightforward version of that function, which was even slower (closer to 4k cycles):

    pub fn add_coeffs(&self, dst: &mut [u8], mut idx: usize, stride: usize, coeffs: &[i16]) {
        // add a transformed 4x4 block of coefficients to the destination plane;
        // every dst[idx + x] access here is bounds-checked individually
        for y in 0..4 {
            for x in 0..4 {
                dst[idx + x] = clip8((dst[idx + x] as i16) + coeffs[x + y * 4]);
            }
            idx += stride;
        }
    }

And here’s the current one, which is faster but unfortunately still not that fast:

    pub fn add_coeffs(&self, dst: &mut [u8], idx: usize, stride: usize, coeffs: &[i16]) {
        // narrow the output to the exact area that gets touched; together with
        // .take(4) and the assert!() this lets the compiler drop most bounds checks
        let out = &mut dst[idx..][..stride * 3 + 4];
        let mut sidx: usize = 0;
        for el in out.chunks_mut(stride).take(4) {
            assert!(el.len() >= 4);
            el[0] = mclip8((el[0] as i32) + (coeffs[0 + sidx] as i32));
            el[1] = mclip8((el[1] as i32) + (coeffs[1 + sidx] as i32));
            el[2] = mclip8((el[2] as i32) + (coeffs[2 + sidx] as i32));
            el[3] = mclip8((el[3] as i32) + (coeffs[3 + sidx] as i32));
            sidx += 4;
        }
    }

It’s funny how all those seemingly useless things like .take(4) and assert!(), or even using 32-bit maths instead of 16-bit, increase performance.

There’s a similar story with loop filtering: rewriting the vertical edge loop filter to use iterators shaved about ten percent off the run time. But I can’t apply the same approach to horizontal edge filtering (or most of the motion compensation functions) because there I need to access several lines in parallel, so I fear that most of the time in such a function would be spent on zipping 6-7 input iterators together (plus an output one). Maybe somebody else has a desire to test such approaches but I don’t.
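
Purely as an illustration of the difference in access patterns (toy functions with a made-up filter strength, not the actual RV3/4 loop filter): for a vertical edge every row is one contiguous mutable chunk, so iterator-based code comes naturally, while a horizontal edge needs the same column in several rows at once, which is where plain strided indexing stays simpler (the toy version below only touches two rows; the real filter reads more, hence the 6-7 iterators mentioned above):

    fn vedge_demo(dst: &mut [u8], off: usize, stride: usize, x: usize) {
        // pixels on both sides of a vertical edge sit next to each other in memory
        for row in dst[off..].chunks_mut(stride).take(4) {
            let p = row[x - 1] as i32;
            let q = row[x] as i32;
            let d = (q - p) >> 2; // placeholder "strength", not the real filter
            row[x - 1] = (p + d).max(0).min(255) as u8;
            row[x]     = (q - d).max(0).min(255) as u8;
        }
    }

    fn hedge_demo(dst: &mut [u8], off: usize, stride: usize) {
        // a horizontal edge needs the same column in the row above and below,
        // i.e. two positions a whole stride apart (off is assumed to be >= stride)
        for x in 0..4 {
            let p = dst[off + x - stride] as i32;
            let q = dst[off + x] as i32;
            let d = (q - p) >> 2;
            dst[off + x - stride] = (p + d).max(0).min(255) as u8;
            dst[off + x]          = (q - d).max(0).min(255) as u8;
        }
    }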

Overall, I can summarise my experience of writing a RealVideo 3/4 decoder in Rust in these sentences:

  1. Rust is a nice language for structuring code;
  2. Rust is still not as fast as C;
  3. It seems that it’s better to avoid using direct index access and use iterators instead;
  4. It feels like Rust code performance would greatly improve if there was a way to tell the compiler “okay, I guarantee that .chunks() will produce exactly that many chunks and of exactly that length” (and no, .exact_chunks() from nightly won’t work with add_coeffs() above because the last chunk can be smaller; a possible workaround is sketched right after this list). And I’m not into experimenting with custom pixel line-accessing iterators.
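
One possible way around the short last chunk (just a sketch, assuming a compiler where chunks_exact()/chunks_exact_mut() is stable; mclip8() here is a plain clamp standing in for the decoder’s own helper) is to split the last row off so the three full rows can go through the exact-chunk iterator:

    #[inline(always)]
    fn mclip8(v: i32) -> u8 { v.max(0).min(255) as u8 }

    pub fn add_coeffs(dst: &mut [u8], idx: usize, stride: usize, coeffs: &[i16]) {
        let out = &mut dst[idx..][..stride * 3 + 4];
        // the first three rows are exactly `stride` bytes each, the last one is 4 bytes
        let (head, tail) = out.split_at_mut(stride * 3);
        for (el, cf) in head.chunks_exact_mut(stride).zip(coeffs.chunks_exact(4)) {
            el[0] = mclip8((el[0] as i32) + (cf[0] as i32));
            el[1] = mclip8((el[1] as i32) + (cf[1] as i32));
            el[2] = mclip8((el[2] as i32) + (cf[2] as i32));
            el[3] = mclip8((el[3] as i32) + (cf[3] as i32));
        }
        // the last, shorter row is handled separately
        for (d, &c) in tail.iter_mut().zip(coeffs[12..16].iter()) {
            *d = mclip8((*d as i32) + (c as i32));
        }
    }

Whether this actually beats the assert!()-hinted version above would of course need measuring on the same data.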

Anyway, it’s time to move to audio codecs.

7 Responses to “NihAV: Some Progress to Report!”

  1. Luca Barbato says:

    maybe this would help?

        for (el, cf) in out.chunks_mut(stride).zip(coefs.chunks(4)) {
            assert!(el.len() >= 4);
            el[0] = mclip8((el[0] as i32) + (cf[0] as i32));
            el[1] = mclip8((el[1] as i32) + (cf[1] as i32));
            el[2] = mclip8((el[2] as i32) + (cf[2] as i32));
            el[3] = mclip8((el[3] as i32) + (cf[3] as i32));
        }

    exact_chunks() (or chunks_exact) is being refined and extended on

    https://github.com/rust-lang/rust/issues/47115

  2. Kostya says:

    Strangely, no. Your version is about twice as slow unless I re-introduce .take(4) and assert!(cf.len() >= 4). Then it’s just a bit slower.

    Anyway, you know what the function does, when it should be called and you like Rust and micro-optimisations. So go ahead and play.

  3. Paul says:

    So, even bidirectional prediction is working without artifacts?

  4. Luca Barbato says:

    Yep, only by adding take(4) after the chunks_mut() and adding 2 asserts do you get something faster.

  5. Kostya says:

    @Paul Yes, it does. Figuring out how to fix libavcodec is left as an exercise to the reader.

  6. Kostya says:

    @Luca Now compare with unsafe { *pix.get_unchecked_mut(off+x) = ... };

  7. […] Previously I’ve reported about RealVideo 3 and 4 support (as for RealVideo 1/2 and ClearVideo before), so video part was covered quite well but audio part was missing and I went on to rectify the situation. […]