NihAV: Progress Report

I’m still working (barely) on NihAV and I’ve managed to make my code decode both RealVideo 3 and 4. It’s not always correct, especially for B-frames and in some corner cases, but at least it produces a sane picture in most cases.

And this time I’d like to write about the disadvantages of writing motion compensation functions in Rust instead of C.

Motion compensation is performed either by simply copying pixels from one block into another or by performing some interpolation that “shifts” the image by a fraction of a pixel (in steps of 1/4 for RealVideo 4). In our case the filters look like:

dst[x] = clip8(src[x-2] - 5*src[x-1] + 52*src[x] + 20*src[x+1] - 5*src[x+2] + src[x+3] + 32 >> 6); // 1/4 of pixel
dst[x] = clip8(src[x-2] - 5*src[x-1] + 20*src[x] + 20*src[x+1] - 5*src[x+2] + src[x+3] + 16 >> 5); // 1/2 of pixel
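
For reference, here is a minimal Rust sketch of the same two filters. The names clip8, filter_quarter and filter_half are chosen for illustration only and are not taken from the NihAV sources. The step argument is an assumption modelled on the filter! macro calls further down: 1 for horizontal filtering, the stride for vertical filtering.

```rust
/// Clamp an intermediate filter result to the valid pixel range 0..=255.
fn clip8(v: i32) -> u8 {
    v.clamp(0, 255) as u8
}

/// 1/4-pixel filter: taps (1, -5, 52, 20, -5, 1), rounding term 32, shift by 6.
/// `step` is the distance between neighbouring taps in `src`.
fn filter_quarter(src: &[u8], idx: usize, step: usize) -> u8 {
    // Fetch the tap at offset `k` from `idx` as a signed value.
    let p = |k: isize| i32::from(src[(idx as isize + k * step as isize) as usize]);
    clip8((p(-2) - 5 * p(-1) + 52 * p(0) + 20 * p(1) - 5 * p(2) + p(3) + 32) >> 6)
}

/// 1/2-pixel filter: taps (1, -5, 20, 20, -5, 1), rounding term 16, shift by 5.
fn filter_half(src: &[u8], idx: usize, step: usize) -> u8 {
    let p = |k: isize| i32::from(src[(idx as isize + k * step as isize) as usize]);
    clip8((p(-2) - 5 * p(-1) + 20 * p(0) + 20 * p(1) - 5 * p(2) + p(3) + 16) >> 5)
}
```

The taps sum to 64 and 32 respectively, so a flat area passes through unchanged, and the negative taps are the reason the result has to be clipped. The caller must make sure that two samples before and three samples after idx are available.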

And it should be applied in two directions, and we have two block sizes (8×8 and 16×16) too. So there are 2*4*4=32 different functions to implement. Why make them separate functions instead of a single one? Because then you can substitute them with optimised versions that do just one kind of operation but do it fast. And here’s a good place to mention that Rust stable still can’t generate new function/variable names in macros (or interpolate idents in Rust terminology), which adds the minor annoyance of copy-pasting and correcting function names in all macro invocations.
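
To illustrate why separate functions are handy, here is a rough sketch of a table of function pointers in which any single entry can be swapped for an optimised version. McTable, McFunc, copy_8 and copy_16 are hypothetical names, the plain block copy merely stands in for the real interpolation functions, and the table layout (block size first, then dy * 4 + dx) is an assumption rather than something NihAV necessarily does.

```rust
/// Common signature of the luma MC functions:
/// (dst, didx, dstride, src, sidx, sstride).
type McFunc = fn(&mut [u8], usize, usize, &[u8], usize, usize);

/// Plain block copy; it stands in for all 32 variants in this sketch.
fn copy_block(dst: &mut [u8], didx: usize, dstride: usize,
              src: &[u8], sidx: usize, sstride: usize, size: usize) {
    for y in 0..size {
        let (d, s) = (didx + y * dstride, sidx + y * sstride);
        dst[d..d + size].copy_from_slice(&src[s..s + size]);
    }
}

fn copy_16(dst: &mut [u8], didx: usize, dstride: usize,
           src: &[u8], sidx: usize, sstride: usize) {
    copy_block(dst, didx, dstride, src, sidx, sstride, 16);
}

fn copy_8(dst: &mut [u8], didx: usize, dstride: usize,
          src: &[u8], sidx: usize, sstride: usize) {
    copy_block(dst, didx, dstride, src, sidx, sstride, 8);
}

/// Hypothetical dispatch table: 2 block sizes x 16 (4 x 4) fractional positions.
struct McTable {
    funcs: [[McFunc; 16]; 2],
}

impl McTable {
    /// Fill every slot with the generic placeholder.
    fn new() -> Self {
        McTable { funcs: [[copy_16 as McFunc; 16], [copy_8 as McFunc; 16]] }
    }
    /// Replace a single entry, e.g. with an optimised implementation.
    fn set(&mut self, size_idx: usize, dx: usize, dy: usize, f: McFunc) {
        self.funcs[size_idx][dy * 4 + dx] = f;
    }
    /// Look up the function for a block size and fractional position.
    fn get(&self, size_idx: usize, dx: usize, dy: usize) -> McFunc {
        self.funcs[size_idx][dy * 4 + dx]
    }
}
```

With such a table the decoder calls (table.get(size_idx, dx, dy))(dst, didx, dstride, src, sidx, sstride) and does not care which implementation sits behind the entry.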

And of course I don’t like copy-pasting much so I used macros to generate functions like this:

macro_rules! mc_func {
    (mc01; $name: ident, $size: expr, $ver: expr) => (
        fn $name (dst: &mut [u8], mut didx: usize, dstride: usize, src: &[u8], mut sidx: usize, sstride: usize) {
            let step = if $ver { sstride } else { 1 };
            for _ in 0..$size {
                for x in 0..$size {
                    dst[didx + x] = filter!(01; src, sidx + x, step);
                }
                sidx += sstride;
                didx += dstride;
            }
        }
        );
...
    (cm01; $name: ident, $size: expr, $ofilt: ident) => (
        fn $name (dst: &mut [u8], didx: usize, dstride: usize, src: &[u8], mut sidx: usize, sstride: usize) {
            let mut buf: [u8; ($size + 5) * $size] = [0; ($size + 5) * $size];
            let mut bidx = 0;
            let bstride = $size;
            sidx -= sstride * 2;
            for _ in 0..$size+5 {
                for x in 0..$size { buf[bidx + x] = filter!(01; src, sidx + x, 1); }
                bidx += bstride;
                sidx += sstride;
            }
            $ofilt(dst, didx, dstride, &buf, 2*bstride, $size);
        }
        );
...
}

mc_func!(mc01; luma_mc_10_16, 16, false);
mc_func!(mc01; luma_mc_10_8,   8, false);
mc_func!(cm01; luma_mc_11_16, 16, luma_mc_01_16);
...

This can generate four functions for the mc01 case (interpolating an 8×8 or 16×16 block in either the vertical or the horizontal direction) and six functions for cm01 (because you pass the final interpolation function as an argument to the macro). So it works but it’s still bulky.

And Luca Barbato of rust-av fame suggested using traits. Rust traits can have associated constants and default implementations, and the code looks like:

trait HFilt {
    const HMODE: usize;
    fn filter_h(src: &[u8], idx: usize) -> u8 {
        match Self::HMODE {
            1 => filter!(01; src, idx, 1),
            2 => filter!(02; src, idx, 1),
            3 => filter!(03; src, idx, 1),
            _ => src[idx],
        }
    }
}
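trait VFilt {
    const VMODE: usize;
    // same as HFilt::filter_h but stepping by the stride instead of by 1
    fn filter_v(src: &[u8], idx: usize, stride: usize) -> u8 {
        match Self::VMODE {
            1 => filter!(01; src, idx, stride),
            2 => filter!(02; src, idx, stride),
            3 => filter!(03; src, idx, stride),
            _ => src[idx],
        }
    }
}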
trait MC: HFilt+VFilt {
    const SIZE: usize;
    fn mc(dst: &mut [u8], mut didx: usize, dstride: usize, src: &[u8], mut sidx: usize, sstride: usize) {
        if (Self::HMODE != 0) && (Self::VMODE != 0) {
            let mut buf: [u8; (16 + 5) * 16] = [0; (16 + 5) * 16];
            let mut bidx = 0;
            let bstride = Self::SIZE;
            sidx -= sstride * 2;
            for _ in 0..Self::SIZE+5 {
                for x in 0..Self::SIZE { buf[bidx + x] = Self::filter_h(src, sidx + x); }
                bidx += bstride;
                sidx += sstride;
            }
            bidx = bstride * 2;
            for _ in 0..Self::SIZE {
                for x in 0..Self::SIZE { dst[didx + x] = Self::filter_v(&buf, bidx + x, bstride); }
                didx += dstride;
                bidx += bstride;
            }
        } else if Self::HMODE != 0 {
            for _ in 0..Self::SIZE {
                for x in 0..Self::SIZE {
                    dst[didx + x] = Self::filter_h(src, sidx + x);
                }
                didx += dstride;
                sidx += sstride;
            }
        } else if Self::VMODE != 0 {
            ...
        } else {
            // simple block copy
        }
    }
}

macro_rules! mc {
    ($name: ident, $size: expr, $vf: expr, $hf: expr) => {
        struct $name;
        impl HFilt for $name { const HMODE: usize = $hf; }
        impl VFilt for $name { const VMODE: usize = $vf; }
        impl MC for $name { const SIZE: usize = $size; }
    };
}

And then you can instantiate all the functions with a simple mc!(MC13_16, 16, 1, 3); or such. The main annoyance is that you can use $size passed as a macro argument to define array sizes, but let foo: [u8; Self::SIZE] inside a trait is not allowed. But it’s a very minor thing that does not affect the code much.

Now let’s see if the implementations differ in performance and other metrics. I’ve decoded the first couple of hundred frames of some RealVideo 4 file on a CPU locked at 1.2GHz and here are the results.
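
The array size limitation can be sidestepped the same way the trait code above does it: allocate the buffer for the largest block and use only the part that is needed. Here is a small self-contained sketch of that idea. BlockOp, MAX_SIZE, Block8 and Block16 are names made up for the example, and the copy through a temporary buffer only stands in for the real two-pass interpolation.

```rust
/// Largest block size any implementor may use (an assumption of this sketch).
const MAX_SIZE: usize = 16;

trait BlockOp {
    const SIZE: usize;

    /// Copy a SIZE x SIZE block from `src` to `dst` through a temporary buffer.
    fn copy_via_tmp(dst: &mut [u8], dstride: usize, src: &[u8], sstride: usize) {
        // `let mut buf = [0u8; Self::SIZE * Self::SIZE];` is rejected here,
        // so the storage is sized for the largest block and then sliced.
        let mut storage = [0u8; MAX_SIZE * MAX_SIZE];
        let buf = &mut storage[..Self::SIZE * Self::SIZE];
        // First pass: gather the source rows into the temporary buffer.
        for (y, row) in buf.chunks_mut(Self::SIZE).enumerate() {
            row.copy_from_slice(&src[y * sstride..y * sstride + Self::SIZE]);
        }
        // Second pass: write the buffered rows to the destination.
        for (y, row) in buf.chunks(Self::SIZE).enumerate() {
            dst[y * dstride..y * dstride + Self::SIZE].copy_from_slice(row);
        }
    }
}

struct Block8;
impl BlockOp for Block8 { const SIZE: usize = 8; }

struct Block16;
impl BlockOp for Block16 { const SIZE: usize = 16; }
```

The cost is a slightly oversized stack buffer for the smaller block size, which should not matter here.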

Macros: 1647 cycles, top four luma MC functions taking 200, 190, 180 and 150 cycles.
Traits: 1774 cycles, top four luma MC functions taking 250, 230, 210 and 140 cycles.

Code size: 13kB for the macro version and 11kB for the trait version (about 6kB of which is common code).

And compilation times are 4m34s for the macro version and 4m37s for the trait version. So it’s not a zero-compilation-cost abstraction either but the cost is negligible.

Well, code with traits is slower but cleaner and smaller (and should be used only when there’s no optimised version; and I don’t care much about the speed either for now) so I’ll probably keep it.

4 Responses to “NihAV: Progress Report”

  1. Luca Barbato says:

    You could decorate with #[inline(always)] the filter_h/filter_v implementation and you should have the same speed (and the same code).

  2. MoSal says:

    > Rust stable still can’t generate new function/variable names in macros (or interpolate idents in Rust terminology)

    Are you hinting at “concat_idents!()” being useless at the moment? If yes:

    1. Everyone agrees.
    2. This is still not *fixed* in nightly.

    https://github.com/rust-lang/rust/issues/29599

    For the second half of the post, I think “const generics” and maybe specialization will make things a lot nicer.

    “const generics” are coming. But unfortunately, it will be some time before the feature is available on nightly, let alone stable.

    https://github.com/rust-lang/rust/issues/44580

  3. Kostya says:

    I suspect it’s been inlined already so this directive would make no difference there.

  4. Kostya says:

    > “const generics” are coming. But unfortunately, it will be some time before the feature is available on nightly, let alone stable.

    It’s not that important to me, I’m more used to macros than templates.