While struggling to write a video player that would satisfy my demands, I decided to see whether it was possible to make my H.264 decoder a bit faster. It turned out it could be done with ease, and that also raises the question posed in the title of this post.
What I did cannot truly be called optimisations, rather “optimisations”, yet they gave a noticeable speed-up. The main candidates were the motion compensation functions. First I shaved off a tiny fraction of a second by not zeroing temporary arrays, since their contents are overwritten before the first read anyway.
And then I replaced the idiomatic Rust code for working with a block, like this:
for (dline, (sline0, sline1)) in dst.chunks_mut(dstride)
        .zip(tmp.chunks(TMP_BUF_STRIDE).zip(tmp2.chunks(TMP_BUF_STRIDE)))
        .take(h) {
    for (pix, (&a, &b)) in dline.iter_mut()
            .zip(sline0.iter().zip(sline1.iter())).take(w) {
        *pix = ((u16::from(a) + u16::from(b) + 1) >> 1) as u8;
    }
}
with raw pointers:
unsafe {
    let mut src1 = tmp.as_ptr();
    let mut src2 = tmp2.as_ptr();
    let mut dst = dst.as_mut_ptr();
    for _ in 0..h {
        for x in 0..w {
            let a = *src1.add(x);
            let b = *src2.add(x);
            *dst.add(x) = ((u16::from(a) + u16::from(b) + 1) >> 1) as u8;
        }
        dst = dst.add(dstride);
        src1 = src1.add(TMP_BUF_STRIDE);
        src2 = src2.add(TMP_BUF_STRIDE);
    }
}
What do you know, the total decoding time for the test clip I used shrank from 6.6 seconds to 4.9 seconds. That’s just three quarters of the original time!
And here is the problem. In theory, if the Rust compiler knew that the input satisfies certain constraints, i.e. in this case that there is always enough data to perform the full block operation, it would be able to optimise the code to be as good as the pointer version, or even better. But unfortunately there is no way to tell the compiler that the input slices are large enough to perform the operation the required number of times. Even if I added a mathematically correct check at the beginning, it would not eliminate most of the bounds checks.
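For illustration, this is roughly the kind of up-front check I mean (a sketch reusing the names from the snippets above, not code from the decoder); even with it in place, most of the checks described below remain:

// Sketch only: a mathematically correct precondition for the block
// operation above. The compiler still emits the bounds checks after it.
assert!(dst.len()  >= (h - 1) * dstride + w);
assert!(tmp.len()  >= (h - 1) * TMP_BUF_STRIDE + w);
assert!(tmp2.len() >= (h - 1) * TMP_BUF_STRIDE + w);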
Let’s see what happens with the iterator loop step by step:
- first, all sources are checked to be non-empty;
- then, in the outer loop, the remaining length of each source is checked to see whether the loop should end;
- then it is checked that the outer loop has not run more than the requested number of times (i.e. the block height);
- then the line lengths (which in theory may be shorter than the block width) are compared with the requested width to find out the actual length of the inner loop;
- and finally, inside the inner loop, the averaging is performed.
And here’s what happens with the pointer loop:
- the outer loop runs the requested number of times;
- the inner loop runs the requested number of times;
- the operation inside the inner loop is performed.
Of course those checks are required to make sure you work only with accessible data, but it would be nice if I could either mark loops as “I promise it will run exactly this number of times” (maybe via .take_exact() as Luca suggested, though I still don’t think it will work perfectly for the 2D case) or at least put the code using slices instead of iterators into an unsafe {} block and tell the compiler that I do not want bounds checks performed inside it.
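The closest thing that exists today, as far as I can tell, is to keep the slices but do every access through get_unchecked()/get_unchecked_mut() inside an unsafe block; that skips the bounds checks per access rather than per block, and it is about as clunky and as unsafe as the pointer version (a sketch, not what my decoder does):

unsafe {
    // Same averaging as above, but on slices with unchecked indexing.
    for y in 0..h {
        for x in 0..w {
            let a = *tmp.get_unchecked(y * TMP_BUF_STRIDE + x);
            let b = *tmp2.get_unchecked(y * TMP_BUF_STRIDE + x);
            *dst.get_unchecked_mut(y * dstride + x) =
                ((u16::from(a) + u16::from(b) + 1) >> 1) as u8;
        }
    }
}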
Update: in this particular case the input buffer size should be stride * (height - 1) + width, i.e. it is always enough to perform the operation in the way described above, but if you use .chunks_exact() the last line might not be handled, which is wrong.
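Here is a small self-contained illustration of that corner case (the numbers are arbitrary): with a buffer of stride * (height - 1) + width bytes, .chunks() still yields the final short line, while .chunks_exact() drops it into the remainder.

fn main() {
    let (stride, width, height) = (8usize, 4usize, 3usize);
    // The minimal buffer size mentioned above: stride * (height - 1) + width.
    let buf = vec![0u8; stride * (height - 1) + width]; // 20 bytes
    // chunks() gives all three lines (the last one is only `width` long)...
    assert_eq!(buf.chunks(stride).count(), height);
    // ...while chunks_exact() gives only the two full lines and leaves the
    // last line in .remainder(), which is easy to forget to handle.
    assert_eq!(buf.chunks_exact(stride).count(), height - 1);
    assert_eq!(buf.chunks_exact(stride).remainder().len(), width);
}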
The former is rather hard to implement for the common case, so I don’t think it will happen anywhere outside Fortran compilers; the latter would cause conflicts with a different Deref trait implementation for slices, so it’s not likely to happen either. So doing it with pointers may be clunky, but it’s the only way.