HW accel for NihAV player: fully done

As mentioned in the previous post, I’ve managed to make hardware acceleration work with my video player and there was only some polishing left to be done. Now that part is complete as well.

The worst part was forking the cros-libva crate. Again, I could do without that but it was too annoying. For starters, it had rather useless dependencies for error handling, covering cases that are either too unlikely to happen (e.g. destroying some buffer/config failed) or rather unhelpful (i.e. it may return a detailed error when opening a device has failed, but for the rest of the operations it’s just “VA-API error N” with an optional error explanation if libva bothered to provide one). I’ve switched it to enums because e.g. VAError::UnsupportedEntrypoint is easier to handle and understand when you actually care about the returned error codes.
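To illustrate the enum approach, here is a minimal sketch and not the actual code from the fork. Only the name VAError::UnsupportedEntrypoint comes from the text above; the other variants, the check_status helper and the choice of i32 for the raw status are assumptions made for the example. The numeric values are the VA_STATUS_* codes as I recall them from va.h and are worth double-checking against your libva headers.

```rust
use std::fmt;

/// Hypothetical error enum mirroring a handful of libva status codes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VAError {
    OperationFailed,
    AllocationFailed,
    UnsupportedProfile,
    UnsupportedEntrypoint,
    /// Any status code this sketch does not name explicitly.
    Unknown(i32),
}

impl fmt::Display for VAError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            VAError::Unknown(code) => write!(f, "VA-API error {}", code),
            other => write!(f, "{:?}", other),
        }
    }
}

impl std::error::Error for VAError {}

/// Hypothetical helper: turn a raw VAStatus into a Result.
pub fn check_status(status: i32) -> Result<(), VAError> {
    match status {
        0x00 => Ok(()),                              // VA_STATUS_SUCCESS
        0x01 => Err(VAError::OperationFailed),       // VA_STATUS_ERROR_OPERATION_FAILED
        0x02 => Err(VAError::AllocationFailed),      // VA_STATUS_ERROR_ALLOCATION_FAILED
        0x0c => Err(VAError::UnsupportedProfile),    // VA_STATUS_ERROR_UNSUPPORTED_PROFILE
        0x0d => Err(VAError::UnsupportedEntrypoint), // VA_STATUS_ERROR_UNSUPPORTED_ENTRYPOINT
        other => Err(VAError::Unknown(other)),
    }
}
```

With something like this a caller can match on VAError::UnsupportedEntrypoint directly (for instance, to fall back to the software decoder) instead of comparing against a bare number, and no external error-handling crate is needed.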

The other annoying part was all the bindgen-produced enumerations (and flags). For example, surface allocation is done with:

display.create_surfaces(
                bindings::constants::VA_RT_FORMAT_YUV420,
                None, width, height,
                Some(UsageHint::USAGE_HINT_DECODER), 1)

In my slightly cleaned version it now looks like this:

display.create_surfaces(
                RTFormat::YUV420,
                None, width, height,
                Some(UsageHint::Decoder.into()), 1)

In addition to less typing, it gives better argument type checking: in some places you use both VA_RT_FORMAT_ and VA_FOURCC_ values and they are quite easy to mix up (because they describe about the same thing and both are stored as 32-bit integers). VAFourcc and RTFormat are distinct enough even if they get cast back to u32 internally.
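The type-checking benefit can be sketched roughly as follows. The type names RTFormat and VAFourcc are taken from the text above, but their definitions here (an enum and a newtype), the describe_surface function and the constructor are assumptions for illustration only and not the fork's real API. The constants follow VA_RT_FORMAT_* and the usual little-endian fourcc packing.

```rust
/// Hypothetical typed wrapper for VA_RT_FORMAT_* bit values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u32)]
pub enum RTFormat {
    YUV420 = 0x0000_0001,
    YUV422 = 0x0000_0002,
    YUV444 = 0x0000_0004,
}

/// Hypothetical newtype for VA_FOURCC_* values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VAFourcc(u32);

impl VAFourcc {
    /// Pack four ASCII characters the way fourcc codes are usually stored.
    pub const fn new(tag: &[u8; 4]) -> Self {
        VAFourcc(u32::from_le_bytes(*tag))
    }

    pub const NV12: VAFourcc = VAFourcc::new(b"NV12");
}

impl From<RTFormat> for u32 {
    fn from(f: RTFormat) -> u32 {
        f as u32
    }
}

impl From<VAFourcc> for u32 {
    fn from(f: VAFourcc) -> u32 {
        f.0
    }
}

/// Hypothetical call site: both values end up as u32 for the C API,
/// but swapping the two arguments no longer compiles.
pub fn describe_surface(rt: RTFormat, fourcc: VAFourcc) -> (u32, u32) {
    (rt.into(), fourcc.into())
}
```

Calling describe_surface(VAFourcc::NV12, RTFormat::YUV420) is rejected by the compiler, whereas with two plain u32 constants the same mistake would only show up at run time.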

And finally, I don’t like libva init info being printed every time a new display is created (which happens every time a new H.264 file is played in my case), so I added a version of the function that does not print it at all.

But if you wonder why I forked it instead of improving the upstream, beside the obvious considerations (I forked off version 0.0.3, they’re working on 0.0.5 already with many underlying things being different), there’s also CONTRIBUTING.md that outright tells you to sign a Contributor License Agreement (no thanks), which would also require using their account (which was so inconvenient for me that I moved away from it over a year ago). At least the license does not forbid creating your own fork, which I did, mentioning the original authorship and source in two or three places and preserving the original 3-clause BSD license.

But enough about it, there’s another fun thing left to be discussed. After I completed the work I also tried it on my other laptop (also with Intel® “I can’t believe it’s not GPU”, half a decade newer but still with slim chances of getting hardware-accelerated decoding via Vulkan API on Linux in the near future). Surprisingly, the decoding was slower than the software decoder again, but for a different reason this time.

Apparently accessing decoded surfaces is slow and it’s better to leave processing and displaying them to the GPU as well (or offload them into main memory in advance), but that would require too many changes in my player/decoder design. Also Rust could not optimise the chroma deinterleaving code (in NV12 to planar YUV conversion) and loads/stores data byte-by-byte, which is extremely slow on my newer laptop. Thus I quickly wrote some simple SSE assembly to deinterleave data, reading 32 bytes at once, and it works many times faster. So it’s good enough and I’m drawing a line.
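For readers curious what such a deinterleave looks like, below is a rough sketch using SSE2 intrinsics from std::arch instead of hand-written assembly. It is not the code used in the player: all function names are made up for the example, and it assumes the interleaved plane stores U in even bytes and V in odd bytes, as NV12 does. Each loop iteration loads two 16-byte registers (32 bytes of UVUV... data) and writes 16 bytes to each output plane; a scalar loop handles the tail and non-x86_64 targets.

```rust
/// Scalar reference: split UVUV... into separate U and V planes.
pub fn deinterleave_scalar(src: &[u8], u: &mut [u8], v: &mut [u8]) {
    for ((pair, du), dv) in src.chunks_exact(2).zip(u.iter_mut()).zip(v.iter_mut()) {
        *du = pair[0];
        *dv = pair[1];
    }
}

/// SSE2 variant (hypothetical sketch): 32 source bytes per iteration.
#[cfg(target_arch = "x86_64")]
pub fn deinterleave_sse2(src: &[u8], u: &mut [u8], v: &mut [u8]) {
    use std::arch::x86_64::*;

    let pairs = (src.len() / 2).min(u.len()).min(v.len());
    let blocks = pairs / 16;
    // SAFETY: SSE2 is part of the x86_64 baseline. All accesses are unaligned
    // loads/stores that stay in bounds: blocks * 32 <= src.len() and
    // blocks * 16 <= u.len(), v.len().
    unsafe {
        let low_bytes = _mm_set1_epi16(0x00ff);
        for i in 0..blocks {
            let p = src.as_ptr().add(i * 32) as *const __m128i;
            let a = _mm_loadu_si128(p);
            let b = _mm_loadu_si128(p.add(1));
            // Even bytes (U): keep the low byte of every 16-bit lane, then pack.
            let us = _mm_packus_epi16(_mm_and_si128(a, low_bytes), _mm_and_si128(b, low_bytes));
            // Odd bytes (V): shift the high byte down, then pack.
            let vs = _mm_packus_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
            _mm_storeu_si128(u.as_mut_ptr().add(i * 16) as *mut __m128i, us);
            _mm_storeu_si128(v.as_mut_ptr().add(i * 16) as *mut __m128i, vs);
        }
    }
    // Handle the remaining (fewer than 16) pairs with the scalar loop.
    let done = blocks * 16;
    deinterleave_scalar(&src[done * 2..pairs * 2], &mut u[done..pairs], &mut v[done..pairs]);
}

/// Pick the SSE2 path where available, the scalar one elsewhere.
pub fn deinterleave(src: &[u8], u: &mut [u8], v: &mut [u8]) {
    #[cfg(target_arch = "x86_64")]
    {
        deinterleave_sse2(src, u, v);
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        deinterleave_scalar(src, u, v);
    }
}
```

The pack instruction cannot saturate here because every 16-bit lane has already been reduced to the 0..255 range by the mask or the shift. Whether this matches the speed of the hand-written assembly is not something this sketch claims.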

So while this has been a rather useful experience, it was not that fun and I’d rather not return to it. I should probably go and reverse engineer some obscure codec instead, I haven’t done that for long enough.
