Rust: Optimising Decoder Experience

Okay, I’ve made some changes, so hopefully the server will withstand the curiosity of more than two people if it goes like last time.

So, after implementing Indeo 4/5 decoders for NihAV, I nano-benchmarked them: my decoder was about twice as slow as libavcodec’s. And since neither has SIMD optimisations, they should be comparable enough.

The tested file was 00186002.avi — an Indeo 4 sample with the scalability feature (i.e. luma is split into four bands and a Haar wavelet is used to compose the output plane) and a duration over ten minutes. The results will be given in Linux perf sample counts, as those should be representative enough.

avconv — 13.4 seconds, 10K samples. About 24% is spent in luma plane recombination (with the Haar wavelet), about 40% in bitstream decoding, and the rest is mostly transforms and motion compensation.

nihav-tool — 31.6 seconds, 20K samples. 30% is spent in luma plane recombination, 48% in bitstream decoding, 11% in motion compensation, and the rest is mostly transforms. Or in samples: recombination — 9900 (against 3300 in libavcodec); bitstream decoding — 15800 against 5600 (a dirty estimate, as it includes some inlined DSP functions); motion compensation — 3500 against 1700; transforms — 1300 against 1500 (they are not equivalent though: my code only transforms the block, and the output costs are hidden in bitstream decoding). Overall, my code is consistently worse. Is there any way to optimise it a bit?

Step 1. Committing explicit murder.

My DSP function for motion compensation for case 0 (no interpolation) uses this code:

            for _ in 0..h {
                for x in 0..w {
                    dst[didx + x] = src[sidx + x];
                }
                sidx += sstride;
                didx += dstride;
            }

Replacing it with

            for _ in 0..h {
                let dest = &mut dst[didx..didx+w];
                dest.copy_from_slice(&src[sidx..sidx+w]);
                sidx += sstride;
                didx += dstride;
            }

gives 2800 samples (new function plus memcpy time) instead of 3500 in the old function.

Conclusion: rustc is too stupid to recognize copies.
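
For comparison, the same copy can also be written fully safely with row iterators (a sketch only, not measured against the slice version above):

    // Hypothetical safe variant: walk both planes row by row and let
    // copy_from_slice() turn each row copy into a memcpy.
    for (drow, srow) in dst[didx..].chunks_mut(dstride)
                                   .zip(src[sidx..].chunks(sstride))
                                   .take(h) {
        drow[..w].copy_from_slice(&srow[..w]);
    }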

Step 2. Going after the biggest function.
The plane recombination function is very simple:

        for _ in 0..(h/2) {
            for x in 0..(w/2) {
                let p0 = src[idx0 + x];
                let p1 = src[idx1 + x]; // idx0 + w/2
                let p2 = src[idx2 + x]; // idx0 + (h/2) * stride
                let p3 = src[idx3 + x]; // idx0 + w/2 + (h/2) * stride
                // oidx1 = oidx0 + stride
                dst[oidx0 + x * 2 + 0] = clip8(((p0 + p1 + p2 + p3 + 2) >> 2) + 128);
                dst[oidx0 + x * 2 + 1] = clip8(((p0 + p1 - p2 - p3 + 2) >> 2) + 128);
                dst[oidx1 + x * 2 + 0] = clip8(((p0 - p1 + p2 - p3 + 2) >> 2) + 128);
                dst[oidx1 + x * 2 + 1] = clip8(((p0 - p1 - p2 + p3 + 2) >> 2) + 128);
            }
            idx0 += sstride;
            idx1 += sstride;
            idx2 += sstride;
            idx3 += sstride;
            oidx0 += dstride * 2;
            oidx1 += dstride * 2;
        }

Using intermediates (p0±p2, p1±p3) cuts it from 9900 samples to 9200.
But I like to live dangerously, so I rewrite it into an unsafe mess with raw pointers like *d0_ptr.offset(1) = clip8(((d0 + d1 + 2) >> 2) + 128); d0_ptr = d0_ptr.offset(2);. Now it’s only 6700 samples. I try wrapping_add() instead of plain addition — 6600, no significant change. I roll back and cast p0-p3 to i32 instead: 4700 samples! Now I try wrapping (i.e. non-checking) arithmetic on top of that: 4600 samples. Maybe a statistical deviation, maybe a real small improvement.
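
The rewritten inner loop looks roughly like this (a reconstruction from the description above, keeping the source reads as slice accesses for brevity; not the exact code):

    unsafe {
        let mut d0_ptr = dst.as_mut_ptr().offset(oidx0 as isize);
        let mut d1_ptr = dst.as_mut_ptr().offset(oidx1 as isize);
        for x in 0..(w / 2) {
            // i32 intermediates: (p0 ± p2) and (p1 ± p3) are shared
            // between the four outputs.
            let p0 = src[idx0 + x] as i32;
            let p1 = src[idx1 + x] as i32;
            let p2 = src[idx2 + x] as i32;
            let p3 = src[idx3 + x] as i32;
            let s02 = p0 + p2;
            let d02 = p0 - p2;
            let s13 = p1 + p3;
            let d13 = p1 - p3;
            *d0_ptr           = clip8(((s02 + s13 + 2) >> 2) + 128);
            *d0_ptr.offset(1) = clip8(((d02 + d13 + 2) >> 2) + 128);
            *d1_ptr           = clip8(((s02 - s13 + 2) >> 2) + 128);
            *d1_ptr.offset(1) = clip8(((d02 - d13 + 2) >> 2) + 128);
            d0_ptr = d0_ptr.offset(2);
            d1_ptr = d1_ptr.offset(2);
        }
    }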

Then I try a slightly smarter clipping function:

#[inline(always)]
fn mclip8(a: i32) -> u8 {
    // Valid for inputs in the 16-bit range: any value outside 0..=255 has
    // non-zero bits above bit 7, and the shifted sign picks 0x00 or 0xFF.
    if (a as u16) > 255 { !(a >> 16) as u8 }
    else { a as u8 }
}

4300 samples. Not libavcodec’s 3300 but also not the original 9900.

Conclusions:

  1. the compiler is not smart enough yet;
  2. maybe there is, or there should be, a way to tell the compiler “hey, this slice is guaranteed to be this size, drop the unneeded checks” (see the sketch after this list);
  3. non-checking arithmetic is a bit too inconvenient to write (by design, I suppose) but it may help to squeeze out some performance;
  4. having optimised routines for basic type downconversion with saturation would make life a bit easier;
  5. the compiler sucks at doing basic calculations with smaller types.
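
To illustrate point 2, the closest workaround is re-slicing to the exact length before the loop, which gives the optimiser a chance to prove every index is in range and drop the per-element checks (a hypothetical, simplified sketch; the elimination is not guaranteed to fire):

    // Bounds are checked once at slicing time; inside the loop x is always
    // within both slices.
    let row_in  = &src[idx0..idx0 + w / 2];
    let row_out = &mut dst[oidx0..oidx0 + w / 2];
    for x in 0..(w / 2) {
        row_out[x] = mclip8(row_in[x] as i32 + 128);
    }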

Step 3. Strategically placed inlines.

I strategically put #[inline(always)] on certain bitstream reading functions; overall bitstream reading time drops to 14600 samples (from 15800).

Let’s see how well it works with constant arguments: I force the motion compensation function to be called with an explicit block size (it’s either 4 or 8). 2300 samples instead of 2800, not bad. In total I’ve managed to shave off about 8 seconds.
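
One way to express that (a sketch with made-up function names, not the actual NihAV code):

    #[inline(always)]
    fn copy_block(dst: &mut [u8], dstride: usize,
                  src: &[u8], sstride: usize, w: usize, h: usize) {
        for (drow, srow) in dst.chunks_mut(dstride)
                               .zip(src.chunks(sstride)).take(h) {
            drow[..w].copy_from_slice(&srow[..w]);
        }
    }

    // Dispatch once on the block size; after inlining, each arm is
    // compiled with constant loop bounds.
    fn copy_block_dispatch(dst: &mut [u8], dstride: usize,
                           src: &[u8], sstride: usize, size: usize) {
        match size {
            4 => copy_block(dst, dstride, src, sstride, 4, 4),
            8 => copy_block(dst, dstride, src, sstride, 8, 8),
            _ => unreachable!(),
        }
    }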

Conclusion: the compiler does inlining okayish, but some hints can improve performance significantly.

Step 4. Improving bitstream reading.

Indeo 4/5 codebooks are represented as a unary prefix plus an optional number of bits to read that may vary for each prefix (e.g. the first predefined codebook for macroblock data starts with codes 0, 10xxxx, 110xxxxx and 1110xxxx). So after caching offsets for each possible prefix and using a quick routine for determining the prefix ((!bitread.peek(16)).trailing_zeros(), since the actual codes come LSB-first) instead of a loop, I got bitstream handling time reduced to 11200 samples (from 14600). It’s nice that Rust has those count-leading/trailing-zeroes functions. This part is done differently in my decoder (it doesn’t build a LUT for the codebook), so it’s hard to compare it directly to libavcodec, but there’s still a lot of room for improvement.
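
A sketch of that decoding step, with a toy LSB-first reader (this is not the real NihAV bitstream API):

    struct BitReader { cache: u32 }

    impl BitReader {
        fn peek(&self, n: u8) -> u32 { self.cache & ((1 << n) - 1) }
        fn skip(&mut self, n: u8) { self.cache >>= n; }
        fn read(&mut self, n: u8) -> u32 {
            let v = self.peek(n);
            self.skip(n);
            v
        }
    }

    // extra_bits[p] is the number of tail bits for unary prefix p and
    // offsets[p] the first value of that prefix's range, both cached when
    // the codebook is set up.
    fn read_code(br: &mut BitReader, extra_bits: &[u8], offsets: &[u32]) -> u32 {
        // The consecutive 1s of the prefix arrive in the low bits, so
        // counting trailing zeroes of the complement finds the prefix
        // length without a loop.
        let prefix = (!br.peek(16)).trailing_zeros() as usize;
        br.skip(prefix as u8 + 1); // the 1-bits plus the terminating 0
        offsets[prefix] + br.read(extra_bits[prefix])
    }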

Step 5. More pointers!

I rewrite some other DSP functions using pointers. Overall decoding time drops to 17.3 seconds (from 31.6 seconds originally, versus 13.4 seconds for avconv), so I call it a day.

Overall conclusion

It is possible to write fast code in Rust, but the compiler is not smart enough to do all possible optimisations, and you have to make sacrifices if you want your code to be safer by default (and I’m fine with that). Also it’s obvious how Rust makes code that sacrifices checks for speed uglier (compare a + b with a.wrapping_add(b), or arr[i] with *ptr.offset(i as isize)). But in most such cases it’s better to have a function in pure assembly language perform the task instead. Maybe I’ll get to that stage eventually, maybe not. I still have NihAV to design and implement, and piglets can shave themselves.

12 Responses to “Rust: Optimising Decoder Experience”

  1. Luca Barbato says:

    I’ll try to write that loop using iterators over the slices, zipping the whole thing together, and see if the compiler takes the hint as it should, just to complete the set of ways to implement it (assuming I won’t melt before).

  2. Uther says:

    >maybe there is/there should be a way to tell compiler “hey, this slice is guaranteed to be this size, drop the unneeded checks”;

    Indeed, you can use “get_unchecked” in an unsafe block:
    https://doc.rust-lang.org/stable/std/primitive.slice.html#method.get_unchecked
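
    For example (illustrative only, it skips the bounds checks entirely):

        unsafe { *dst.get_unchecked_mut(didx + x) = *src.get_unchecked(sidx + x); }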

  3. Kostya says:

    I’ll try it, thanks.

  4. N says:

    > maybe there is/there should be a way to tell compiler “hey, this slice is guaranteed to be this size, drop the unneeded checks”;

    You can pass references to arrays, which can be created from slices using the arrayref crate.

  5. Kostya says:

    > …using arrayref crate.

    That’s the problem: it should be in the language core. Passing a reference to an array seems to work fine though.

  6. N says:

    >That’s the problem, it should be the language core.

    I totally agree. There are some RFCs which discuss it, but not much traction yet:
    https://github.com/rust-lang/rfcs/issues/1772
    https://github.com/rust-lang/rfcs/issues/1833

    Also:
    >non-checking arithmetic is a bit too inconvenient to write (by design, I suppose) but it may help to squeeze some performance

    There is the Wrapping type in std, which makes it a bit more convenient to write such code in your case.
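
    For example:

        use std::num::Wrapping;

        fn main() {
            let a = Wrapping(250u8);
            let b = Wrapping(10u8);
            // Wraps around instead of panicking in debug builds.
            assert_eq!((a + b).0, 4);
        }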

  7. David Pethes says:

    That’s quite a cool byte clipping trick, I don’t think I’ve seen it before (me goes and throws it into all the toy encoders around).

  8. Kostya says:

    I’m pretty sure it’s a simple way to clip and that there might be an even better one.
    The idea is rather obvious: an unwanted value is either negative (starting with a 1111xxxx prefix) or a large positive value (with a 0xxxxx prefix), and either way it has non-zero bits in the high bytes. So you just need to check that and map 1xxxxx to 0 and 0xxxxx to 0xFF, which is trivial bit operations.
    Mind you, this version was made for 16-bit input; clipping the full 32-bit input range would need a different input check and shift constant.
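
    For illustration, a full-range variant might look like this (hypothetical name, untested):

        #[inline(always)]
        fn mclip8_32(a: i32) -> u8 {
            // Any value outside 0..=255 has bits set above bit 7; the sign
            // bit (shifted down) then picks between 0x00 and 0xFF.
            if (a as u32) > 255 { !(a >> 31) as u8 }
            else { a as u8 }
        }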

  9. Kero says:

    Did you look at Profile Guided Optimization? It can be a cool way to give LLVM some bench/profile data indicating how to reorder condition tests.

    This is not a “new Rust” feature, but why not, it can be a way to beat avconv.

    https://github.com/rust-lang/rfcs/issues/1220
    https://github.com/Geal/pgo-rust

  10. Kostya says:

    Not yet, and my goal is not to beat avconv but rather to see how well I can make my code perform without much additional fine-tuning (so far it was mostly inlining obvious functions, using pointers in a few obvious places and replacing very straightforward parts with some smarter code).

  11. […] the post about optimization, Kostya and many commenters (me included) discussed a bit about if there are better ways to […]