Hardware acceleration for NihAV video player

Since I was not fully satisfied with the CPU load from my H.264 decoder (and optimising it further is too tedious), I decided to take a look at VA-API hardware accelerated decoding once again (no Vulkan for me).

It turned out that the documentation is not as lacking as I expected it to be; it’s just that most of it was eaten by bindgen. For example, you can get a VAImage from the decoded surface, but you have to look into the source code for its definition because it’s just an alias for the semi-hidden _VAImage. And even if you look at the original header files from libva, the documentation there is rather scarce anyway.

Anyway, the flow turned out to be moderately simple: you create a global context, allocate some surfaces, create pictures for the currently decoded frame (and keep pictures for all possible reference frames too), attach several buffers to the currently decoded picture (global parameters, quantisation matrices, slice parameters and slice data), invoke some commands, and hopefully you can then create an image from the rendered surface and retrieve the required data.

Of course the process is somewhat confusing (for example, out of six 8×8 quantisation matrices only #0 and #3 should be passed to the decoder; some fields have different names from the H.264 specification, and some you’re supposed to derive from e.g. layer value), but if you look at how others do it, it’s still possible to figure it out.
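To make the quantisation matrix quirk concrete, here is a minimal sketch of the selection step. The struct and function names are hypothetical (they are not the cros-libva bindings); the destination layout merely mirrors libva's VAIQMatrixBufferH264, which holds six 4×8-bit 4×4 lists and only two 8×8 lists. The source layout assumes the decoder keeps all six 8×8 lists in one array, as described above.

```rust
/// Hypothetical container for the scaling lists parsed from the bitstream.
pub struct ParsedScalingLists {
    pub lists_4x4: [[u8; 16]; 6],
    pub lists_8x8: [[u8; 64]; 6],
}

/// Hypothetical stand-in with the same shape as libva's VAIQMatrixBufferH264:
/// all six 4x4 lists, but room for only two 8x8 lists.
pub struct IqMatrixH264 {
    pub scaling_list_4x4: [[u8; 16]; 6],
    pub scaling_list_8x8: [[u8; 64]; 2],
}

/// Builds the buffer contents: every 4x4 list is copied as is,
/// while only 8x8 lists #0 and #3 are passed on.
pub fn make_iq_matrix(src: &ParsedScalingLists) -> IqMatrixH264 {
    IqMatrixH264 {
        scaling_list_4x4: src.lists_4x4,
        scaling_list_8x8: [src.lists_8x8[0], src.lists_8x8[3]],
    }
}
```

The remaining four 8×8 lists are simply dropped here, since the buffer has no slot for them.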

Then there was fun with the output formats: when you create an image from the decoded surface you must provide its parameters (dimensions and format, one of those listed as supported by the decoder), but in reality it either fails or creates the image in whatever format it sees fit. Which makes one wonder why bother with providing correct parameters if it ignores them anyway.

And finally, decoding performance. Decoding can be done in synchronous or asynchronous mode. In the first case you submit data for decoding and wait until it’s decoded; in the second case you submit it for decoding and do other things (e.g. submit more data to decode) while checking whether some of the previous frames have been decoded at last. And to my surprise, synchronous mode is slower than using my software decoder in single-threaded mode. Thus I ended up fusing the decoder with the frame reorderer in the hope that by the time reordering is done the first output frames are complete (and if not, and there are enough frames waiting, I wait for the decoding to end). This way it works reasonably well and with low CPU load.
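The fused decoder and reorderer can be sketched roughly as follows. This is an illustration under assumptions and not the actual NihAV code: the PendingFrame trait, the Reorderer type and both thresholds are hypothetical names. In libva terms, is_ready would presumably correspond to a surface status query and wait to a surface sync call.

```rust
/// Hypothetical handle to a frame that has been submitted for decoding.
pub trait PendingFrame {
    /// Non-blocking check whether decoding has finished.
    fn is_ready(&self) -> bool;
    /// Blocks until decoding has finished.
    fn wait(&mut self);
    /// Position of the frame in display order.
    fn display_order(&self) -> u64;
}

pub struct Reorderer<F: PendingFrame> {
    queue: Vec<F>,        // kept sorted by display order
    reorder_depth: usize, // frames held back so reordering can settle
    max_pending: usize,   // beyond this we block instead of returning nothing
}

impl<F: PendingFrame> Reorderer<F> {
    pub fn new(reorder_depth: usize, max_pending: usize) -> Self {
        Self { queue: Vec::new(), reorder_depth, max_pending }
    }

    /// Queues a freshly submitted frame without waiting for it.
    pub fn push(&mut self, frame: F) {
        let order = frame.display_order();
        let pos = self.queue.partition_point(|f| f.display_order() < order);
        self.queue.insert(pos, frame);
    }

    /// Returns the next frame in display order if it can be output now.
    pub fn pop(&mut self) -> Option<F> {
        let len = self.queue.len();
        if len <= self.reorder_depth {
            return None; // reordering has not settled yet
        }
        let first = self.queue.first_mut()?;
        if !first.is_ready() {
            if len <= self.max_pending {
                return None; // not decoded yet, but no pressure to wait
            }
            first.wait(); // too many frames are queued, so block
        }
        Some(self.queue.remove(0))
    }
}
```

The point of the design is that the blocking wait happens only when the queue grows past max_pending; in the common case the hardware has finished the oldest frame long before the reorderer is ready to release it.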

The other issue was that such decoding is inherently single-threaded and can’t be used in my multithreaded player easily. So I ended up doing a kludge that some Luca 0 proposed: I wrote a shim to pass calls to (and retrieve results from) the decoder that is running in a separate thread (as long as you create and use it in the same thread it’s fine; it’s moving an instance from one thread to another, newly created one that is not allowed).
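A minimal sketch of such a shim, using only standard library channels, might look like this. HwDecoder is a hypothetical stand-in for the real hardware decoder (its decode method just inverts the bytes), and DecoderShim is an illustrative name as well. The decoder is deliberately made non-Send and is created inside the worker thread, so it never crosses a thread boundary, while the shim itself can be moved freely.

```rust
use std::marker::PhantomData;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a decoder that must stay on one thread.
struct HwDecoder {
    _not_send: PhantomData<*const ()>, // raw pointer marker makes the type !Send
}

impl HwDecoder {
    fn new() -> Self {
        Self { _not_send: PhantomData }
    }

    /// Placeholder for real decoding: just inverts the input bytes.
    fn decode(&mut self, data: &[u8]) -> Vec<u8> {
        data.iter().map(|b| !b).collect()
    }
}

/// Forwards calls to a decoder living in its own worker thread.
pub struct DecoderShim {
    tx: Option<Sender<Vec<u8>>>,
    rx: Receiver<Vec<u8>>,
    worker: Option<JoinHandle<()>>,
}

impl DecoderShim {
    pub fn new() -> Self {
        let (req_tx, req_rx) = channel::<Vec<u8>>();
        let (resp_tx, resp_rx) = channel::<Vec<u8>>();
        let worker = thread::spawn(move || {
            // The decoder is created and used only in this thread.
            let mut dec = HwDecoder::new();
            for data in req_rx {
                if resp_tx.send(dec.decode(&data)).is_err() {
                    break; // the shim is gone, nothing left to do
                }
            }
        });
        Self { tx: Some(req_tx), rx: resp_rx, worker: Some(worker) }
    }

    /// Sends data to the worker and waits for its result.
    pub fn decode(&self, data: Vec<u8>) -> Option<Vec<u8>> {
        self.tx.as_ref()?.send(data).ok()?;
        self.rx.recv().ok()
    }
}

impl Drop for DecoderShim {
    fn drop(&mut self) {
        self.tx.take(); // closing the request channel ends the worker loop
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}
```

This version is fully synchronous per call; a real shim would likely carry more request kinds (flush, reset and so on) over the same pair of channels.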

Now there’s just one main annoyance left (with several minor ones): libva prints some information when a new context is created. Apparently there’s a call to set a custom handler for such messages, but it’s not exposed. That is why I’m going to fork cros-libva, improve some stuff I use (like the visibility of certain functionality, having proper enums instead of what bindgen generated, etc.), remove rather useless external dependencies and so on. I have no need for the advanced functionality (especially since it’s not going to be supported by my hardware anyway), but the basic stuff should be less horrible to use than it is now.

There’s still a lot of work left to do but at least I can use the player already.
