Rust inline assembly experience « Kostya's Boring Codec World

Rust inline assembly experience

Since I need something less exciting than the series about IAEA willingly ignoring the terrorists occupying the largest nuclear power plant in Europe (even during its mission visit there) or the series called “what russia destroyed in my home city today”, I’ve tried inline assembly support in recent stable Rust compiler, here’s a short report.

Since I’m working on a multimedia framework, my primary interest about inline assembly is how well I can add SIMD code for various codecs. My previous attempt was optimising the adaptive filter in Monkey’s Audio decoder and while it worked it looked ugly because of the way Intel named its intrinsics (if you like the names like _mm_madd_epi16 then our tastes are very different) and the verbosity (constant need to cast vectors to different types. So I decided to wait until non-experimental inline assembly support is ready.

This time I’ve decided to look how easy it is to make SIMD optimisations for my own H.264 decoder (and it needs them in order to be usable when I finally switch to my own video player). Good things: I’ve managed to speed up overall decoding about 20%. Bad things: a lot of things can’t be made faster because of the limitations.

For those who are not familiar, H.264 decoder contains a lot of typical operations performed on blocks with sizes 2×2, 4×4, 8×8 or 16×16 (or rectangular blocks made by splitting those in half) with operations being copying, adding data to a block, averaging two blocks and so on.

Writing the code by itself is nice: you can have a function with a single unsafe{ asm!(..); } statement in it and you let the compiler to figure out the details (the rather famous x86inc.asm is mostly written to deal with the discrepancies between ABIs on different platforms and for templating MMX/SSE/AVX code). Even nicer is that you can specify arguments in a clear way (which is much better than passing constraints in three groups for GCC syntax) and used named arguments inside the code. Additionally it uses local labels in gas form which are a bit clearer to use and don’t clutter debug symbols.

Now for the bad things: inline assembly support (as of rustc 1.62.1) is lacking for my needs. Here’s my list of the annoyances with ascending severity:

the problem with sub-registers: I had to fill XMM register from a GPR one so I wrote movd xmm0, {val} with val being a 32-bit value. The compiler generated a warning and the actual instruction in the binary was movq xmm0, rdx (which is copying eight bytes instead of four). And it’s not immediately clear that you should write it as movd xmm0, {val:e} in that case (at least Luca has reported this on my behalf so it may be improved soon);
asm!() currently supports only registers as input/output arguments while in reality it should be able to substitute some things without using registers for them—e.g. when instruction can take a constant number (like shifts, it’s very useful for templated code) or a memory reference (there are not so many registers available on x86 so writing something like paddw xmm1, TABLE[{offset}] would save one XMM register for loading table contents explicitly). I’m aware there’s a work going on that so in the future we should be able to use constant and sym input types but currently it’s unstable;
and the worst issue is the lack of templating support. For instance, I have functions for faster averaging of two blocks—they simply load a certain amount of pixels from each line, average them and write back. For 16×16 case I additionally unroll the loop a bit more. It would be nice to put it into single macro that instantiates all the variants by substituting load/store instruction and enabling certain additional code inside the loop when block width is sixteen. Of course I work around it by copy-pasting and editing the code but this process is prone to introducing errors (especially when you confuse two nearly identical functions—and they tend to become long when written in assembly). And I can’t imagine how to use macro_rules!() to either construct asm!() contents from a pieces or to make it cut out some content out of it. Having several asm!() blocks one after another is not always feasible as nobody can guarantee you that the compiler won’t insert some code between them to juggle the registers used for the arguments.

All in all, I’d say that inline assembly support in Rust is promising but not yet fully usable for my needs.

Update. Luca actually tried to solve templating problem and even wrote a post about it. There’s a limited way to do that via concat!() instead of string substitution and a somewhat convoluted way to fit some blocks inside one assembly template. It’s not perfect but if you’re desperate enough it should work for you.

This entry was posted on Saturday, September 3rd, 2022 at 5:16 am and is filed under NihAV, Rust. You can follow any responses to this entry through the RSS 2.0 feed. Both comments and pings are currently closed.

2 Responses to “Rust inline assembly experience”

Paul says:

September 5, 2022 at 12:35 pm

Better spend time on writing MIT shortpack and wavpack ’90 era lossless audio decoders.
Kostya says:

September 5, 2022 at 1:08 pm

Nah, I’ve spent enough time REing various lossless audio codecs. Most of them are harder to find than to RE and the situation with audio samples for them is even worse. So why bother?

Though you’re right that it’s better to spend time on something else and I’m going to do exactly that.