Archive for the ‘Rust’ Category

Rust inline assembly experience

Saturday, September 3rd, 2022

Since I need something less exciting than the series about IAEA willingly ignoring the terrorists occupying the largest nuclear power plant in Europe (even during its mission visit there) or the series called “what russia destroyed in my home city today”, I’ve tried inline assembly support in recent stable Rust compiler, here’s a short report.

Since I’m working on a multimedia framework, my primary interest about inline assembly is how well I can add SIMD code for various codecs. My previous attempt was optimising the adaptive filter in Monkey’s Audio decoder and while it worked it looked ugly because of the way Intel named its intrinsics (if you like the names like _mm_madd_epi16 then our tastes are very different) and the verbosity (constant need to cast vectors to different types. So I decided to wait until non-experimental inline assembly support is ready.

This time I’ve decided to look how easy it is to make SIMD optimisations for my own H.264 decoder (and it needs them in order to be usable when I finally switch to my own video player). Good things: I’ve managed to speed up overall decoding about 20%. Bad things: a lot of things can’t be made faster because of the limitations.

For those who are not familiar, H.264 decoder contains a lot of typical operations performed on blocks with sizes 2×2, 4×4, 8×8 or 16×16 (or rectangular blocks made by splitting those in half) with operations being copying, adding data to a block, averaging two blocks and so on.

Writing the code by itself is nice: you can have a function with a single unsafe{ asm!(..); } statement in it and you let the compiler to figure out the details (the rather famous x86inc.asm is mostly written to deal with the discrepancies between ABIs on different platforms and for templating MMX/SSE/AVX code). Even nicer is that you can specify arguments in a clear way (which is much better than passing constraints in three groups for GCC syntax) and used named arguments inside the code. Additionally it uses local labels in gas form which are a bit clearer to use and don’t clutter debug symbols.

Now for the bad things: inline assembly support (as of rustc 1.62.1) is lacking for my needs. Here’s my list of the annoyances with ascending severity:

  • the problem with sub-registers: I had to fill XMM register from a GPR one so I wrote movd xmm0, {val} with val being a 32-bit value. The compiler generated a warning and the actual instruction in the binary was movq xmm0, rdx (which is copying eight bytes instead of four). And it’s not immediately clear that you should write it as movd xmm0, {val:e} in that case (at least Luca has reported this on my behalf so it may be improved soon);
  • asm!() currently supports only registers as input/output arguments while in reality it should be able to substitute some things without using registers for them—e.g. when instruction can take a constant number (like shifts, it’s very useful for templated code) or a memory reference (there are not so many registers available on x86 so writing something like paddw xmm1, TABLE[{offset}] would save one XMM register for loading table contents explicitly). I’m aware there’s a work going on that so in the future we should be able to use constant and sym input types but currently it’s unstable;
  • and the worst issue is the lack of templating support. For instance, I have functions for faster averaging of two blocks—they simply load a certain amount of pixels from each line, average them and write back. For 16×16 case I additionally unroll the loop a bit more. It would be nice to put it into single macro that instantiates all the variants by substituting load/store instruction and enabling certain additional code inside the loop when block width is sixteen. Of course I work around it by copy-pasting and editing the code but this process is prone to introducing errors (especially when you confuse two nearly identical functions—and they tend to become long when written in assembly). And I can’t imagine how to use macro_rules!() to either construct asm!() contents from a pieces or to make it cut out some content out of it. Having several asm!() blocks one after another is not always feasible as nobody can guarantee you that the compiler won’t insert some code between them to juggle the registers used for the arguments.

All in all, I’d say that inline assembly support in Rust is promising but not yet fully usable for my needs.

Update. Luca actually tried to solve templating problem and even wrote a post about it. There’s a limited way to do that via concat!() instead of string substitution and a somewhat convoluted way to fit some blocks inside one assembly template. It’s not perfect but if you’re desperate enough it should work for you.

Rust needs proper stand-alone assembler support

Tuesday, July 27th, 2021

Back when I gave my arguments I why don’t consider Rust a mature language, one of those arguments was that is lacks proper assembler support and systems programming language requires it since some of the tasks you need to perform (including optimisation) require as low level access as you can get. Here I would like to argue why asm!{} may be enough for most cases it’s definitely not for mine.

Missing optimisation opportunity in Rust

Wednesday, May 12th, 2021

While I’m struggling to write a video player that would satisfy my demands I decided to see if it’s possible to make my H.264 decoder a bit faster. It turned out it can be done with ease and that also raises the question concerning the title of this post.

What I did cannot be truly called optimisations but rather “optimisations” yet they gave a noticeable speed-up. The main optimisation candidates were motion compensation functions. First I shaved a tiny fraction of second by not zeroing temporary arrays as their contents will be overwritten before the first read.

And then I replaced the idiomatic Rust code for working with block like

    for (dline, (sline0, sline1)) in dst.chunks_mut(dstride).zip(tmp.chunks(TMP_BUF_STRIDE).zip(tmp2.chunks(TMP_BUF_STRIDE))).take(h) {
        for (pix, (&a, &b)) in dline.iter_mut().zip(sline0.iter().zip(sline1.iter())).take(w) {
            *pix = ((u16::from(a) + u16::from(b) + 1) >> 1) as u8;

with raw pointers:

    unsafe {
        let mut src1 = tmp.as_ptr();
        let mut src2 = tmp2.as_ptr();
        let mut dst = dst.as_mut_ptr();
        for _ in 0..h {
            for x in 0..w {
                let a = *src1.add(x);
                let b = *src2.add(x);
                *dst.add(x) = ((u16::from(a) + u16::from(b) + 1) >> 1) as u8;
            dst = dst.add(dstride);
            src1 = src1.add(TMP_BUF_STRIDE);
            src2 = src2.add(TMP_BUF_STRIDE);

What do you know, the total decoding time for the test clip I used shrank from 6.6 seconds to 4.9 seconds. That’s just three quarters of the original time!

And here is the problem. In theory if Rust compiler knew that the input satisfies certain parameters i.e. that there’s always enough data to perform full block operation in this case, it would be able to optimise code as good as the one I wrote using pointers or even better. But unfortunately there is no way to tell the compiler that input slices are large enough to perform the operation required amount of times. Even if I added mathematically correct check in the beginning it would not eliminate most of the checks.

Let’s see what happens with the iterator loop step by step:

  1. first all sources are checked to be non-empty;
  2. then in outer loop remaining length of each source is checked to see if the loop should end;
  3. then it is checked if the outer loop has run not more than requested number of times (i.e. just for the block height);
  4. then it checks line lengths (in theory those may be shorter than block width) and requested width to find out the actual length of the inner loop;
  5. and finally inside the loop it performs the averaging.

And here’s what happens with the pointer loop:

  1. outer loop is run the requested amount of times;
  2. inner loop is run the requested amount of times;
  3. operation inside the inner loop is performed.

Of course those checks are required to make sure you work only with the accessible data but it would be nice if I could either mark loops as “I promise it will run exactly this number of times” (maybe via .take_exact() as Luca suggested but I still don’t think it will work perfectly for 2D case) or at least put code using slices instead of iterators into unsafe {} block and tell compiler that I do not want boundary checks performed inside.

Update: in this particular case the input buffer size should be stride * (height - 1) + width i.e. it is always enough to perform operation in the way described above but if you use .chunks_exact() the last line might be not handled which is wrong.

The former is rather hard to implement for the common case so I don’t think it will happen anywhere outside Fortran compilers, the latter would cause conflicts with different Deref trait implementation for slices so it’s not likely to happen either. So doing it with pointers may be clunky but it’s the only way.

Why Rust is not a mature programming language

Friday, September 18th, 2020

While I have nothing against Rust as such and keep writing my pet project in Rust, there are still some deficiencies I find preventing Rust from being a proper programming language. Here I’d like to present them and explain why I deem them as such even if not all of them have any impact on me.

NihAV: rust-clippy experience

Saturday, May 18th, 2019

As I’ve mentioned in the previous post, I’ve finally tried rust-clippy to see what issues and suggestions it will have on my code. The results are not disappointing if you take the tool name seriously.

NihAV: Lurching Forward!

Sunday, October 28th, 2018

Finally NihAV got full* support for RealAudio. Marketing asterisk there is for AAC support since only AAC LC is supported which means decoding only 22kHz from racp (as I said before, no SBR support) plus some other features were cut: it supports just multichannel audio but no coupling feature, noise codebook or e.g. LTP from AAC Main. Also for similar reasons I did not care to optimise its performance. I don’t care much about it and I don’t want to spend a lot of time on AAC. Here’s some fun statistics for you: AAC decoder is about 1800 lines long (IMDCT, window generation and bitstream parsing are common modules but the rest of AAC stuff is there), about 100 lines are spent on subband tables, 350 lines are spent on codebook tables and 250 lines are wasted on “proper” parsing of general MPEG-4 Audio descriptor with type-specific stuff like HILN or ALS configuration being simply left as unimplemented!() (a useful Rust macro that will report it and abort program if it reaches that point so you know when to start caring).

And now for the fun Rust part.

My codebook generator takes something that implements CodebookDescReader trait which is essentially a common interface that specifies how to get information for each codebook element (codeword, its length and associated symbol). While I had some standard implementations, they were not very useful since the most common case is when you have two tables of different types (8 bits are enough for codeword lengths and actual codewords may take 8-32 bits) and you want to use some generic implementation instead of writing the same code again and again.

In theory it should be easy: you create generic structure that stores pointers to those tables and converts them into needed types. It would work after some fiddling with type constraints but the problem is to make the return type what you want i.e. the function takes entry index number (as usize) and returns some more sensible type (so that when I read code later more appropriate type will be used).

And that is rather hard problem because generic return type can’t be converted from usize easily. I asked Luca of rust-av fame (he famously does not work on it) and he provided me with two and a half solutions that won’t work in NihAV:

  • Use unstable TryInto trait that can try to convert input type into something smaller—rejected since it’s unstable;
  • Use num_traits crate for conversion traits that work with stable—rejected since it’s an external crate;
  • Implement such trait yourself. Hmm…

So I NIHed it by taking a function that would map input index value to output type. This way is both simpler and more flexible. For example, AAC scalefactor codebook has values in the range -60..+60 so instead of writing new structure like I did before I can simply do

fn scale_map(idx: usize) -> i8 { (idx as i8) - 60 }
let mut coderead = TableCodebookDescReader::new(AAC_SCF_CODEBOOK_CODES, AC_SCF_CODEBOOK_BITS, scale_map);

And this works as expected.

There’s a small fun thing related to it. By old C habit I wrote initially &fn as argument type and got cryptic Rust compiler error about mismatched types:

    = note: expected type `&fn(usize) -> i8`
               found type `&fn(usize) -> i8 {codecs::aac::scale_map}`

Again, my fault but rustc output is a bit WTFy.

Anyway, the next is supposedly RealVideo HD or RealVideo 6 decoder and then I can forget about this codec family and move to something different like DuckMotion or Smacker/Bink.

And in the very unlikely case you were wondering why I’m so slow I can tell you why: I am lazy (big reveal, I know), I prefer to spend my time differently (on workdays I work, on Sunday I travel around, that leaves only couple of hours during workdays, some part of Saturday and a bit of Sunday if I’m lucky) and I work better when the stuff is interesting which is not always the case. Especially if you consider that there is no multimedia framework in Rust (yet?) so I have to NIH every small bit and some of those bits I don’t like much. So don’t expect any news from me soon.

NihAV, RealMedia, Rust and Everything Else

Saturday, October 13th, 2018

Looks like it’s been about two months since I last wrote anything about NihAV but that does not mean I did not have anything to write about. On the contrary, I’m glad to report about significant progress in RealAudio support.

Previously I’ve reported about RealVideo 3 and 4 support (as for RealVideo 1/2 and ClearVideo before), so video part was covered quite well but audio part was missing and I went on to rectify the situation.

Now NihAV supports RealAudio 1.0 (speech codec), RealAudio 2.0 (speech codec), RealAudio DNET (a bit about it later), RealAudio 4.0 (speech codec from Sipro), RealAudio Cook (this one deserves a separate post so the next one should be about this codec) and RealAudio Lossless. So there are only three codecs missing now: RealAudio 8 (ATRAC3), RealAudio 9/10 (AAC) and RealVideo 6(HD). Of course I’m going to add support for those as well.

This is actually a good time to implement those. As you might know, there is a Holy Trinity of Licensors: D.vX, D*lby and DT$. They are famous for ‘nice’ licensing terms. While I’ve never had to deal with them, I’ve heard from people who did that they like licensing single product they’re most famous for at outrageous prices (i.e. it’ll cost you a magnitude more per unit using their technology than e.g. H.264 decoder) and it’s a viral license too because if you sell stuff not oriented for consumers then you have to force your customers into the same deal (it’s GPL—Greedy Private License) and you have to report your sales to them for obvious reasons. Funny how two of the companies were bought out already. Now let’s look at them in some details:

  • D.vX This one is remarkable since it licensed the product it had nothing to do with (aka M$MPEG-4 adapted for non-ASF containers and MPEG-4 ASP). At least it seems hardly relevant now unless I dig out some old movies.
  • D*lby This one is mostly known (outside cinema equipment) for codec with several names: ATSC A/52, RealAudio DNET, ETSI TS 102 366, D*lby Digital and even something you can make out of letters A C and 3 (I heard rumours that it does not like its trademarks mentioned so I’d better avoid directly naming it). At least the last patents for that format has expired and support for it can be implemented freely. And it also owns a company that manages licensing of AAC. Fun fact is that patents for MPEG2 NBC are expired so I can implement AAC-LC decoder just fine but that does not stop them for licensing it. How they do it? By refusing to license the separate parts and forcing a whole package of AAC-LC, HE-AACv1, HE-AACv2 and xHE-AAC onto you. I guess if the situation won’t change in twenty years all current stuff will expire but they’ll still license it along with Ultra-Enhanced-Hyper-Expanded-Radically-Extended High-Efficiency AAC (which will have nothing to do with all those previous formats).
  • DT$ A company similar to D*lby and its (former?) prime competition. Also known for single format with many extensions making it essentially a homebrew AAC. At least it seems to be exclusively DVD/Blu-ray format and I’m satisfied with Xine for playing the former and avoiding the latter completely.

And I want to talk a bit more about my RealAudio DNET decoder. Internally it’s called ts102366 for obvious reasons and I have just a primitive implementation for it (i.e. it seems to work and should handle multichannel fine but no extended features). The extension for more than 5.1 channels also seems to be HD-DVD/Blu-ray only so I don’t care, it’s quite rare in RealMedia format and other containers seem to contain it as contiguous stream so I’d need to introduce support for NAElementaryStream in demuxing code and also proper parser to split it into frames. Not worth the effort for me at this moment. Another fun fact is that bitstream comes in 16-bit words that can have any endianness. In my case I just had to detect the proper endianness from first two bytes and simply initialise bitstream reader in BE or LE16 mode depending on it (again, it’s funnier with DT$ format where you have three different bitstream reading modes and you might need two modes simultaneously in some cases; again, good thing I don’t have to care about that stuff). Also it’s still one of two codecs I currently have that support multichannel audio (Cook is the second of course and AAC will be third).

And finally some words about Rust issues I had to deal with.

Rust as a language is more or less fine but compiler sucks. I’ve ran into several issues while writing code.

First, I had a fixed array of Codebooks to initialise in RALF decoder (one of 15 codebooks, another one of 125 codebooks and yet another one of 10×11 codebooks). If I use simply mem::uninitialized() with filling it up it works fine. In debug mode. In release mode it segfaults at the end. Probably I should’ve used ptr::write() instead of assigning and it would work fine but I gave up and used a vector instead of an array even if it’s not as efficient. Obviously it’s all my fault and not Rust issue but still that was weird.

Second, when I tried to create a generic codebook reader that would accept table of codes of any primitive type (u8, u16 or u32) I ran into funnier issue of Rust compiler spewing weird errors like “cannot convert u16 to u32 because it’s not a primitive type”. Obviously it’s my mistake and it’s caught by a tool (that is still not in stable) so the developers don’t care (yes, Luca even bothered to file an issue on that). Still, I’d rather have a clearer error message in that case (e.g. “… because it’s X and not a primitive type”).

And finally, an example that is definitely rustc stupidity and not mine. Again, developers don’t consider this to be an issue but I do (and Luca seemed to agree with me since he opened an issue about it). Essentially, there is a thing called DCE (dead code elimination), so when compilers see that certain block won’t be executed they might print a warning and just check inside code for syntactic validity. Current rustc might ignore condition value and optimise code inside even if it clearly makes no sense (to the point where it crashed because of that on some nightly version, see the issue for details). And while you argue that one should not write such code, I had quite plausible use case for it: a macro that took 2- or 3-element array and did something to its values so if third value was present it had to do something special with it. But of course compilation failed because you tried to do if ARR.len() > 2 { a = ARR[2]; } with two-element array. But when I tried to check whether I got indexing correct by using large constants as indices, cargo check passed just fine—probably because const propagation did not go that deep inside my code (it was in a function called from a long chain in some sub-sub-sub-module and standalone example errors out fine). This feels quite unpolished to me.

Oh, and final final fun thing: the calls like would still fail borrow check probably because they can’t (I guess) formalise function calling convention i.e. “if function is called then first its arguments are evaluated and copied if needed in certain order, then function address is evaluated and called with the arguments”. BTW you still have the situation like this:

struct Foo { foo: u8 }
impl Foo {
    fn bar(&mut self) -> u8 { += 1; }

fn fee(a: u8, b: u8) {
    println!("{} {}", a, b);

fn main() {
    let mut foo = Foo { foo: 42 };

And if you don’t know what’s wrong here I’ll tell you: in C argument evaluation is implementation-defined because back in the day there were very different calling conventions and thus compiler needed to start with evaluating from last argument to first to store them in order instead of widespread pushing arguments in order to stack. So depending on ABI the function would be called either as fee(43, 44) or as fee(44, 43).

Now I see two ways out of it: either detect such situation where the same object is mutably called several times and give an error or, which is better IMO, make formal calling convention so the code won’t be undefined. And fix borrow checker while doing that.

Overall, Rust is a nice experience so far since it allows code to structure much better but sometimes you hit such silly issues that spoil all the fun.

Anyway, next post should be about RealAudio Cook, the Opus of its era.

NihAV: Some Progress to Report!

Friday, August 24th, 2018

Finally the large chunk is finished: NihAV has finally got support for RealVideo 3 and 4!

Since I’ve learned a great deal more about codecs since the last time I wrote RealVideo 3/4 decoder (and specifications for both were leaked—they have mistakes but still clarify some things), I was able to write a new decoder that also seems to reconstruct frames better.

Some words on the design: I’ve split it into several parts as usual—common RV3/4 code, RV3/4 DSP, RV3 bitstream parser, RV3 DSP and RV4 bitstream parser and DSP. That’s the approach I’ve been using before and I’ll probably use it in future decoders as well. The only more or less interesting thing is how I did weighted motion compensation: instead of temporary buffer I allocate 16×16 frame that I use for storing temporary results and which is used later to average results (since motion compensation routines in RealVideo 3 and 4 differ while weighted averaging is the same it makes sense to split it into separate operation).

And now for the juicy part: benchmarks and performance. I’ve tested one of the RealVideo 4 trailers (namely swordfish.rmvb) and avconv -threads 1 -cpuflags 0 decodes it in 15 seconds, nihav-tool needs almost 25.

NihAV: Progress Report

Monday, July 2nd, 2018

I’m still working (barely) on NihAV and I’ve managed to make my code decode both RealVideo 3 and 4. It’s not always correct, especially B-frames and some corner cases, but at least it produces a sane picture in most cases.

And this time I’d like to write about disadvantages of writing motion compensation functions in Rust instead of C.

NihAV: progress report

Sunday, June 10th, 2018

Well, since I had no incentive to work on NihAV and recently the weather is not very encouraging for any kind of intellectual activity there was almost no progress. And yet now I have something to write about: NihAV has finally managed to decode non-trivial (i.e not fully black) RealVideo 3 I-frame properly (i.e. without any visible distortions). Loop filter is still missing but it’s a start. And it’s not a small feat considering one has to implement both coefficients decoding and intra prediction. So essentially it’s just motion vector juggling and motion compensation are all the things that are missing for P- and B-frames support. Maybe it will go faster from here (but most likely not).

And since doing that involved rewriting some C code into Rust here are some notes on how oxidising went:

  • match is a nice replacement for the cases when you have to partly remap values—in my case I had to adjust intra prediction directions in case top or left or bottom reference were missing and that means changing three or four values into other values, match looks more compact than several } else if (itype == FOO) { and does not lose readability either;
  • while in C foo = bar = 42; is a common thing, Rust does not allow this (I can understand why) and I’m surprised I ran into it only now (with intra prediction functions that assign the same calculated value to several output pixels at once);
  • loops in Rust are fine for basic use but when you need to deal with something more complex like for (i = 0; i < block_size; i += 4) or for (i = 99; i > 0; i--) you need either to write a simpler loop and remap indices inside or to remember it's Rust and permute range in less intuitive ways like for i in (0..block_size).filter(|x| x&3 == 0) and for i in (1..99+1).rev(). While this works and even somewhat conveys the meaning it's a bit unwieldy IMO;
  • and it might be a bit too esoteric but looks like I cannot easily write fn clip_u8(val: N) -> u8 that would take any primitive numeric type as input, do comparisons inside and return value either clipped to converted to u8. The best answer on how to do it I found was "you can't, it's against Rust practices". I don't need it much and I care even less, so I'll just mark it as a neutral language feature and forget about it.

And now the small but constantly irritating thing: arrays. While slices are nice and easy to use (including extracting sub-slices), in my area I often need a slice with arbitrary start and end bounds. To clarify my use case: quite often you need a piece of memory that's addressable with both positive and negative indices and those make sense on certain interval.

One of such common arrays is clipping array which essentially takes input index and returns it clipped usually to 0-255 range. So you have part [-255..-1] filled with zeroes, [0..255] filled with values in the same range and [256..511] filled with 255. I repeat, such clipping table is very common and useful thing that's currently not easy to implement in Rust.

Another less common case is the block of pixels we process may require information from its top, left and top-left neighbours—and those are addressed as src[-stride + i], src[-1 + stride*i] and src[-stride - 1]. Or a whole frame of GDI-related codec (no, not from Westwood) or even simple BMP/DIB that stores lines upside-down so after you process line 0 you have to move to line -1.

I currently deal with it by keeping an additional variable pointing to the current position in array that I use as a reference and from which I can subtract other numbers if needed, but it's a bit clunky and error-prone. Since Rust checks indices on slice access I wonder if extending it to work with e.g. negative indices is possible. IIRC FORTRAN and Pascal allowed you to define an array starting with arbitrary index, it might be possible in Rust too.

Oh well, I'll just keep using my approach meanwhile and waiting to see what rust-av does in this regard.