Archive for the ‘Rust’ Category

QfG5: now with some code

Tuesday, January 16th, 2024

So while I’m still trying to figure out various details of the engine, I decided to finally start writing some engine code myself instead of keeping a bundle of tools that each do just one task.

The main annoyance is loading files in Rust. Since the game was designed for case-insensitive filesystems, the directories and files may have names in lowercase, uppercase or mixed case. And since a certain operating system chose a radically different approach to the Unicode encoding of filenames, Rust has OsString in order to handle those external names properly. As a result, in order to find the real name corresponding to a file (e.g. “data/qgm/160.qgm” may really be named “DATA/Qgm/160.QGM”) I have to split the path, convert each component to an internal string in order to perform a case-insensitive comparison, and reconstruct the full path again. It’s not complicated but still annoying.
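
Here is roughly what that lookup looks like (a simplified sketch of my own, not the actual engine code; it assumes the game data sticks to plain ASCII names, so lossy UTF-8 conversion is good enough):

    use std::fs;
    use std::path::{Path, PathBuf};

    // Walk the requested path component by component, listing each directory
    // and picking the entry whose name matches case-insensitively.
    fn find_real_path(root: &Path, wanted: &str) -> Option<PathBuf> {
        let mut cur = root.to_path_buf();
        for comp in wanted.split('/') {
            let entry = fs::read_dir(&cur).ok()?
                .filter_map(|e| e.ok())
                .find(|e| e.file_name().to_string_lossy().eq_ignore_ascii_case(comp))?;
            cur = entry.path();
        }
        Some(cur)
    }

    // find_real_path(Path::new("."), "data/qgm/160.qgm")
    // may resolve to "./DATA/Qgm/160.QGM"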

After dealing with this nuisance, I moved on to decoding the graphics formats. For instance, I could finally dump all sprites and room backgrounds to inspect them better. And in order to do that I could simply create an archive object for loading data, an image buffer object to paint the room background and/or sprites to, and some objects to decode and paint the actual data.

Here is room 260, the bedroom in Gnome Ann’s Inn that serves as the starting point for the multiplayer games (except that it was cut and it takes some effort to see it):

If you’re familiar with the game you can see that it is not a simple mirror image of the usual single-player bedroom (the pictures are different, as is the chest placement). A subtler thing to notice is that in my picture the floorboards are curved while they look straight in the game—they would be curved there too if I managed to implement the cylindrical mapping the game uses to project the background (I hope to get to it eventually). But since this is not that much fun by itself, here’s a GIF of the room with some of the sprites in place (beware of the size, it can hardly fit on a floppy). If some of them look wrong it’s because they were taken from the single-player bedroom and mirrored (funnily enough, there’s a sprite for the second chest taken from there as well).

Speaking of the sprites, I’ve finally figured out why some of the sprites are rotated and some are not. As it turns out, there’s the following hierarchy: each room may have several views (e.g. slightly different views of the arena, the front and back parts of the Hall of Kings, or the overall docks panorama with a specific Pholus shop close-up) and each of the views may have up to a hundred sprites associated with it. So e.g. the multiplayer bedroom has number 260, its only view has number 2600, and the torches in it have sprites number 260000 and 260001 while the chests have sprites number 260095 and 260096 respectively (with no other sprite numbers between them). Room 900 is a map that uses a standard BMP file for its background while the points of interest on it are individual sprites (as well as the map you can see if you look at the inventory map).
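
In code the relationship is trivial (a toy helper of my own, only meaningful for room-associated sprite numbers, not for the GUI ones in the 1xxxxx range):

    // e.g. sprite 260096 belongs to view 2600 of room 260
    fn sprite_owner(sprite_id: u32) -> (u32, u32) {
        let view = sprite_id / 100; // sprite index is the last two digits
        let room = view / 10;       // view index is the last digit of the view number
        (room, view)
    }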

So all of the room-associated sprites (except for the map markers) are stored in rotated form while the various GUI-related sprites (with numbers in the 1xxxxx range) are not.

And if you care about specific GUI sprite numbers, here’s a short list:

  • 100000-100006: main window GUI elements;
  • 100010-100181 (but not 100100): character and monster portraits (with various expressions if they have spoken lines);
  • 100100, 100200-101001, 101003: various GUI elements for dialogue windows;
  • 101002: small icons for all inventory items (as well as spells and paladin abilities);
  • 101101-101607: animated item (spell, ability) views for the inventory window;
  • 120000-130001: custom GUI elements for specific cases (e.g. bulletin board background and notes, power indicator for throwing, main menu background and so on);
  • 140000-140048: GUI elements for the safe-cracking minigame (background, opening segments, dancing figures);
  • 143000-144005: GUI elements for the cut minigame Wizard’s Whirl;
  • 150000-155000: things that can be seen with the telescope in Erasmus’s castle;
  • 156000-157002: pizza diagrams at the Scientific Isle;
  • 160000: save/load game menu background;
  • 161000: map (which looks the same as 900099 but the latter is stored in rotated form);
  • 199909-199913: Wheel of Fortune minigame sprites (background and parrot; the spinning wheels, arm and thrown knives are 3D models).

So the easy part is done, the next thing I’ll probably try is proper cylindrical projection for the room backgrounds and 3D object rendering. That should definitely take some time…

NihAV: updated for Rust 1.69

Thursday, July 27th, 2023

Since I had nothing better to do I decided to optimise my H.264 decoder a bit more, and that required a rather recent version of rustc that supports the sym construct in asm!{} (so I can reference data tables in the inline assembly). Why this specific version though? I picked whatever was both recent enough to support the aforementioned feature (and older versions had multiple micro releases, which hints at some problems with them) and not too recent either (again, I’m no beta tester of the compiler and I don’t need other shiny features).

And while at it I decided to make the code a bit more up to date. cargo-clippy is still annoying with its default warning about all-caps names, and some lints have changed names so their suppressors no longer work. Getting rid of some leftover hints for old versions of the compiler (like explicit drop()s for objects borrowing code and some type hints) was nice though. Inline assembly is still only halfway done, especially considering that using const in it won’t be possible in stable for a long time and sym sucks compared to GCC inline assembly (it provides just a symbol name and you have to magically know how the target platform works in order to load it correctly; on AMD64 it’s rather simple but on aarch64 and 32-bit ARMs it depends on the target OS and PIC mode). Who would’ve thought that assembly may be platform-dependent! It looks like the current solution to that problem is to expose the current configuration to the user so it’s up to you to check all the environment variables and write the appropriate code. And of course even that solution will only become available some time in the future since the developers haven’t thought about it at all.
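
For illustration, here is what the AMD64 case boils down to (a minimal sketch with a made-up table, not my decoder code): sym only gives you the symbol name, so the RIP-relative addressing has to be spelled out by hand.

    use std::arch::asm;

    static TABLE: [u16; 16] = [0; 16]; // stand-in for a real data table

    #[cfg(target_arch = "x86_64")]
    fn table_addr() -> *const u16 {
        let ptr: *const u16;
        unsafe {
            // `sym` expands to the (mangled) symbol name only; loading its
            // address RIP-relatively is our own responsibility here.
            asm!("lea {ptr}, [rip + {tbl}]", ptr = out(reg) ptr, tbl = sym TABLE);
        }
        ptr
    }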

Anyway, now my H.264 decoder features some more assembly optimisations and decodes video even faster than before. Though I fear it still takes too much CPU for comfortable playback of my typical content, so I’ll have to dabble in hardware video acceleration. NihAV is a learning project after all.

What optimisation possibilities I miss in Rust

Friday, June 23rd, 2023

Since a certain friend of mine keeps asking what features I need in Rust and then forgets the answer, I decided to write it all down here. Hopefully it will become outdated sooner rather than later.

And I’d like to start with some explanations and conditions. I develop a certain multimedia project, so I have certain common flows (e.g. processing 16×16 macroblocks in a frame) and I’d like to be able to optimise for them. Also I do not like to use the nightly/unstable version of Rust (as those unstable features may take an extremely long time to hit stable and they change in the process, as happened with asm!{} support, to give one example). And finally I do not accept the answer “there’s a crate X for that”—out of design considerations I prefer to avoid external dependencies (short explanation: they get out of control fast; my encoder and player projects depend only on my own crates for doing everything, but the player additionally pulls in the sdl2 dependency and suddenly it’s 33 crates instead of 19; IIRC with a newer version of the sdl2 crate the total number gets to fifty).

Anyway, here are the features I miss, with some explanations of why they should be relevant probably not just to me.
(more…)

NihAV experiments: multi-threaded decoder

Thursday, June 1st, 2023

In my efforts to have an independent player (that relies on third-party libraries merely for doing input and output while the demuxing and decoding is done purely by NihAV) I had to explore the way of writing a multi-threaded H.264 decoder. And while it’s not working perfectly, it’s a good proof of concept. Here I’ll describe how I hacked my existing decoder to support multi-threading.
(more…)

Rust inline assembly experience

Saturday, September 3rd, 2022

Since I need something less exciting than the series about the IAEA willingly ignoring the terrorists occupying the largest nuclear power plant in Europe (even during its mission visit there) or the series called “what russia destroyed in my home city today”, I’ve tried the inline assembly support in the recent stable Rust compiler; here’s a short report.

Since I’m working on a multimedia framework, my primary interest in inline assembly is how well I can add SIMD code for various codecs. My previous attempt was optimising the adaptive filter in the Monkey’s Audio decoder, and while it worked it looked ugly because of the way Intel named its intrinsics (if you like names like _mm_madd_epi16 then our tastes are very different) and the verbosity (the constant need to cast vectors to different types). So I decided to wait until non-experimental inline assembly support was ready.
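
For reference, the style in question looks roughly like this (a toy multiply-accumulate of my own, not the actual Monkey’s Audio filter):

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    // Every load needs a pointer cast to __m128i and the operation names
    // hardly explain themselves.
    #[cfg(target_arch = "x86_64")]
    unsafe fn madd_sum(filt: &[i16; 8], src: &[i16; 8]) -> i32 {
        let a = _mm_loadu_si128(filt.as_ptr() as *const __m128i);
        let b = _mm_loadu_si128(src.as_ptr() as *const __m128i);
        let m = _mm_madd_epi16(a, b); // four partial i32 sums
        let mut out = [0i32; 4];
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, m);
        out.iter().sum()
    }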

This time I’ve decided to see how easy it is to write SIMD optimisations for my own H.264 decoder (and it needs them in order to be usable when I finally switch to my own video player). Good things: I’ve managed to speed up the overall decoding by about 20%. Bad things: a lot of things can’t be made faster because of the limitations.

For those who are not familiar, an H.264 decoder contains a lot of typical operations performed on blocks of size 2×2, 4×4, 8×8 or 16×16 (or rectangular blocks made by splitting those in half), the operations being copying, adding data to a block, averaging two blocks and so on.

Writing the code itself is nice: you can have a function with a single unsafe{ asm!(..); } statement in it and let the compiler figure out the details (the rather famous x86inc.asm is mostly written to deal with the discrepancies between ABIs on different platforms and for templating MMX/SSE/AVX code). Even nicer is that you can specify arguments in a clear way (which is much better than passing constraints in three groups as in GCC syntax) and use named arguments inside the code. Additionally it uses local labels in gas form, which are a bit clearer to use and don’t clutter the debug symbols.
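
Here is what such a function can look like (a hedged sketch of an 8-pixel-wide block averaging routine, not the decoder’s actual code), with named arguments and a numeric local label:

    use std::arch::asm;

    // Average an 8-pixel-wide block from `src` into `dst` in place.
    // Assumes h > 0 and that the pointers/strides describe valid data.
    #[cfg(target_arch = "x86_64")]
    unsafe fn avg_8xh(dst: *mut u8, dstride: usize,
                      src: *const u8, sstride: usize, h: usize) {
        asm!(
            "2:",
            "movq   xmm0, qword ptr [{src}]",
            "movq   xmm1, qword ptr [{dst}]",
            "pavgb  xmm0, xmm1",
            "movq   qword ptr [{dst}], xmm0",
            "add    {src}, {sstride}",
            "add    {dst}, {dstride}",
            "dec    {h}",
            "jnz    2b",
            src = inout(reg) src => _,
            dst = inout(reg) dst => _,
            sstride = in(reg) sstride,
            dstride = in(reg) dstride,
            h = inout(reg) h => _,
            out("xmm0") _,
            out("xmm1") _,
        );
    }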

Now for the bad things: inline assembly support (as of rustc 1.62.1) is lacking for my needs. Here’s my list of the annoyances in ascending order of severity:

  • the problem with sub-registers: I had to fill an XMM register from a GPR so I wrote movd xmm0, {val} with val being a 32-bit value. The compiler generated a warning and the actual instruction in the binary was movq xmm0, rdx (which copies eight bytes instead of four). And it’s not immediately clear that you should write it as movd xmm0, {val:e} in that case (at least Luca has reported this on my behalf so it may be improved soon);
  • asm!() currently supports only registers as input/output arguments while in reality it should be able to substitute some things without using registers for them—e.g. when an instruction can take a constant number (like shifts; very useful for templated code) or a memory reference (there are not that many registers available on x86, so writing something like paddw xmm1, TABLE[{offset}] would save one XMM register otherwise spent on loading the table contents explicitly). I’m aware there’s work going on in that direction, so in the future we should be able to use const and sym input types, but currently they are unstable;
  • and the worst issue is the lack of templating support. For instance, I have functions for faster averaging of two blocks—they simply load a certain amount of pixels from each line, average them and write the result back. For the 16×16 case I additionally unroll the loop a bit more. It would be nice to put it all into a single macro that instantiates all the variants by substituting the load/store instructions and enabling certain additional code inside the loop when the block width is sixteen. Of course I work around it by copy-pasting and editing the code, but this process is prone to introducing errors (especially when you confuse two nearly identical functions—and they tend to become long when written in assembly). And I can’t imagine how to use macro_rules!() to either construct asm!() contents from pieces or to cut some content out of them. Having several asm!() blocks one after another is not always feasible either, as nobody guarantees that the compiler won’t insert some code between them to juggle the registers used for the arguments.

All in all, I’d say that inline assembly support in Rust is promising but not yet fully usable for my needs.

Update. Luca actually tried to solve the templating problem and even wrote a post about it. There’s a limited way to do it via concat!() instead of string substitution, and a somewhat convoluted way to fit some blocks inside one assembly template. It’s not perfect but if you’re desperate enough it should work for you.
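
To give an idea of the concat!() approach, here is a rough sketch (my own illustration rather than Luca’s code): the block width selects the load/store instructions at macro-expansion time.

    use std::arch::asm;

    macro_rules! avg_fn {
        ($name:ident, $ld:literal, $st:literal) => {
            // Average one line of a block from `src` into `dst` in place.
            #[cfg(target_arch = "x86_64")]
            unsafe fn $name(dst: *mut u8, src: *const u8) {
                asm!(
                    concat!($ld, " xmm0, [{src}]"),
                    concat!($ld, " xmm1, [{dst}]"),
                    "pavgb xmm0, xmm1",
                    concat!($st, " [{dst}], xmm0"),
                    src = in(reg) src,
                    dst = in(reg) dst,
                    out("xmm0") _,
                    out("xmm1") _,
                );
            }
        };
    }

    avg_fn!(avg_line8,  "movq",   "movq");   // 8 pixels per line
    avg_fn!(avg_line16, "movdqu", "movdqu"); // 16 pixels per line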

Rust needs proper stand-alone assembler support

Tuesday, July 27th, 2021

Back when I gave my arguments why I don’t consider Rust a mature language, one of those arguments was that it lacks proper assembler support, and a systems programming language requires it since some of the tasks you need to perform (including optimisation) require as low-level access as you can get. Here I would like to argue why, even if asm!{} may be enough for most cases, it’s definitely not enough for mine.
(more…)

Missing optimisation opportunity in Rust

Wednesday, May 12th, 2021

While I’m struggling to write a video player that would satisfy my demands, I decided to see if it’s possible to make my H.264 decoder a bit faster. It turned out it could be done with ease, and that also raises the issue mentioned in the title of this post.

What I did cannot be truly called optimisations but rather “optimisations”, yet they gave a noticeable speed-up. The main optimisation candidates were the motion compensation functions. First I shaved off a tiny fraction of a second by not zeroing temporary arrays, as their contents will be overwritten before the first read.

And then I replaced the idiomatic Rust code for working with a block, like

    for (dline, (sline0, sline1)) in dst.chunks_mut(dstride).zip(tmp.chunks(TMP_BUF_STRIDE).zip(tmp2.chunks(TMP_BUF_STRIDE))).take(h) {
        for (pix, (&a, &b)) in dline.iter_mut().zip(sline0.iter().zip(sline1.iter())).take(w) {
            *pix = ((u16::from(a) + u16::from(b) + 1) >> 1) as u8;
        }
    }

with raw pointers:

    unsafe {
        let mut src1 = tmp.as_ptr();
        let mut src2 = tmp2.as_ptr();
        let mut dst = dst.as_mut_ptr();
        for _ in 0..h {
            for x in 0..w {
                let a = *src1.add(x);
                let b = *src2.add(x);
                *dst.add(x) = ((u16::from(a) + u16::from(b) + 1) >> 1) as u8;
            }
            dst = dst.add(dstride);
            src1 = src1.add(TMP_BUF_STRIDE);
            src2 = src2.add(TMP_BUF_STRIDE);
        }
    }

What do you know, the total decoding time for the test clip I used shrank from 6.6 seconds to 4.9 seconds. That’s just three quarters of the original time!

And here is the problem. In theory, if the Rust compiler knew that the input satisfies certain parameters (i.e. that there’s always enough data to perform the full block operation in this case), it would be able to optimise the code as well as the one I wrote using pointers, or even better. But unfortunately there is no way to tell the compiler that the input slices are large enough to perform the operation the required number of times. Even if I added a mathematically correct check at the beginning it would not eliminate most of the checks.

Let’s see what happens with the iterator loop step by step:

  1. first, all sources are checked to be non-empty;
  2. then, in the outer loop, the remaining length of each source is checked to see if the loop should end;
  3. then it is checked whether the outer loop has run no more than the requested number of times (i.e. just for the block height);
  4. then it checks the line lengths (in theory those may be shorter than the block width) and the requested width to find out the actual length of the inner loop;
  5. and finally, inside the loop, it performs the averaging.

And here’s what happens with the pointer loop:

  1. the outer loop is run the requested number of times;
  2. the inner loop is run the requested number of times;
  3. the operation inside the inner loop is performed.

Of course those checks are required to make sure you work only with accessible data, but it would be nice if I could either mark loops as “I promise it will run exactly this number of times” (maybe via .take_exact() as Luca suggested, though I still don’t think it will work perfectly for the 2D case) or at least put code using slices instead of iterators into an unsafe {} block and tell the compiler that I do not want bounds checks performed inside it.

Update: in this particular case the input buffer size should be stride * (height - 1) + width, i.e. it is always enough to perform the operation in the way described above, but if you use .chunks_exact() the last line might not be handled, which is wrong.
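
A quick illustration of that buffer-size point (my own toy numbers): with stride 8, width 4 and height 2 the buffer holds 8 * 1 + 4 = 12 bytes, so .chunks() still yields both lines (the second one short) while .chunks_exact() silently drops it.

    let buf = [0u8; 12];                         // stride * (height - 1) + width
    assert_eq!(buf.chunks(8).count(), 2);        // the last chunk is only 4 bytes long
    assert_eq!(buf.chunks_exact(8).count(), 1);  // the short last line is dropped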

The former is rather hard to implement for the common case, so I don’t think it will happen anywhere outside Fortran compilers; the latter would cause conflicts with a different Deref trait implementation for slices, so it’s not likely to happen either. So doing it with pointers may be clunky but it’s the only way.

Why Rust is not a mature programming language

Friday, September 18th, 2020

While I have nothing against Rust as such and keep writing my pet project in Rust, there are still some deficiencies that, in my opinion, prevent Rust from being a proper programming language. Here I’d like to present them and explain why I deem them as such, even if not all of them have any impact on me.
(more…)

NihAV: rust-clippy experience

Saturday, May 18th, 2019

As I’ve mentioned in the previous post, I’ve finally tried rust-clippy to see what issues and suggestions it will have on my code. The results are not disappointing if you take the tool name seriously.
(more…)

NihAV: Lurching Forward!

Sunday, October 28th, 2018

Finally NihAV got full* support for RealAudio. The marketing asterisk there is for AAC support: only AAC LC is supported, which means decoding only 22kHz from racp (as I said before, no SBR support), plus some other features were cut: it supports plain multichannel audio but not the coupling feature, noise codebook or e.g. LTP from AAC Main. Also for similar reasons I did not care to optimise its performance. I don’t care much about it and I don’t want to spend a lot of time on AAC. Here are some fun statistics for you: the AAC decoder is about 1800 lines long (IMDCT, window generation and bitstream parsing are common modules but the rest of the AAC stuff is there), about 100 lines are spent on subband tables, 350 lines on codebook tables and 250 lines are wasted on “proper” parsing of the general MPEG-4 Audio descriptor, with type-specific stuff like HILN or ALS configuration simply left as unimplemented!() (a useful Rust macro that will report it and abort the program if execution reaches that point, so you know when to start caring).

And now for the fun Rust part.

My codebook generator takes something that implements the CodebookDescReader trait, which is essentially a common interface specifying how to get the information for each codebook element (the codeword, its length and the associated symbol). While I had some standard implementations, they were not very useful, since the most common case is when you have two tables of different types (8 bits are enough for the codeword lengths while the actual codewords may take 8-32 bits) and you want to use some generic implementation instead of writing the same code again and again.

In theory it should be easy: you create a generic structure that stores pointers to those tables and converts their entries into the needed types. It would work after some fiddling with type constraints, but the problem is making the return type what you want, i.e. the function takes an entry index (as usize) and returns some more sensible type (so that a more appropriate type is used when I read the code later).

And that is a rather hard problem because a generic return type can’t be converted from usize easily. I asked Luca of rust-av fame (he famously does not work on it) and he provided me with two and a half solutions that won’t work in NihAV:

  • Use unstable TryInto trait that can try to convert input type into something smaller—rejected since it’s unstable;
  • Use num_traits crate for conversion traits that work with stable—rejected since it’s an external crate;
  • Implement such trait yourself. Hmm…

So I NIHed it by taking a function that maps the input index value to the output type. This way is both simpler and more flexible. For example, the AAC scalefactor codebook has values in the range -60..+60, so instead of writing a new structure like I did before I can simply do

    fn scale_map(idx: usize) -> i8 { (idx as i8) - 60 }
    ...
    let mut coderead = TableCodebookDescReader::new(AAC_SCF_CODEBOOK_CODES, AAC_SCF_CODEBOOK_BITS, scale_map);

And this works as expected.
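
For the curious, such a table-backed reader can be as simple as the following sketch (the names and shape are illustrative, not NihAV’s actual definitions; the real one implements the CodebookDescReader trait):

    // Wraps a codeword table, a length table and an index-to-symbol mapping
    // function, exposing them through one generic interface.
    struct TableCodebookDescReader<'a, S> {
        codes: &'a [u32],
        bits:  &'a [u8],
        map:   fn(usize) -> S,
    }

    impl<'a, S: Copy> TableCodebookDescReader<'a, S> {
        fn new(codes: &'a [u32], bits: &'a [u8], map: fn(usize) -> S) -> Self {
            Self { codes, bits, map }
        }
        fn code(&self, idx: usize) -> u32 { self.codes[idx] }
        fn code_len(&self, idx: usize) -> u8 { self.bits[idx] }
        fn sym(&self, idx: usize) -> S { (self.map)(idx) }
    }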

There’s a small fun thing related to it. Out of an old C habit I initially wrote &fn as the argument type and got a cryptic Rust compiler error about mismatched types:

    = note: expected type `&fn(usize) -> i8`
               found type `&fn(usize) -> i8 {codecs::aac::scale_map}`

Again, my fault but rustc output is a bit WTFy.


Anyway, the next thing is supposedly a RealVideo HD (or RealVideo 6) decoder, and then I can forget about this codec family and move on to something different like DuckMotion or Smacker/Bink.

And in the very unlikely case you were wondering why I’m so slow I can tell you why: I am lazy (big reveal, I know), I prefer to spend my time differently (on workdays I work, on Sunday I travel around, that leaves only couple of hours during workdays, some part of Saturday and a bit of Sunday if I’m lucky) and I work better when the stuff is interesting which is not always the case. Especially if you consider that there is no multimedia framework in Rust (yet?) so I have to NIH every small bit and some of those bits I don’t like much. So don’t expect any news from me soon.