I have mentioned a couple of times before that NihAV
has its own functioning H.264 decoder. And after my failed attempts to use hardware-accelerated decoding instead, I spent some time trying to optimise it but eventually gave up. On one hand it’s fast enough for my needs, on the other hand it’s too tedious to optimise further (even if I could spare the time for it, I’d rather not).
To put it into perspective, initially it was about three times slower than the libavcodec
one without SIMD optimisations; now it’s only about two times slower (with SIMD turned on it’s about five times as slow, feel free to laugh at me). But at the same time playing 720p content (and I have next to no files with larger resolution) in multi-threading mode takes 20-25% of a core, so it’s not that bad.
So how are the cycles wasted, and is there potential for serious optimisation?
Bitstream decoding takes about seven percent of total time, motion compensation takes 29% of total time (a quarter of which is spent on 2×2/2×4 chroma blocks), in-loop filtering is 21%, intra prediction takes about three percent of time, and the rest is spent in various glue code that decides which DSP functions should be called and updates various contexts. For instance, over 8% of total time is spent in a single function that fills edge deblocking strengths for a macroblock (the values depend on coded/uncoded block flags and motion vector values, no actual pixels are checked) and five percent of total time is spent calculating the direct motion vector pair for various blocks.
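To give an idea of why that single function eats so much time, here’s a rough sketch of the boundary-strength rules it has to apply to every 4×4 edge. This is my own simplification for illustration (P-slice case only, made-up structure and names), not the actual decoder code:

```rust
// Simplified per-edge deblocking strength calculation in the spirit of
// H.264's boundary-strength rules (P-slice case, single reference list).
// The structure and field names are made up for illustration.
struct BlockInfo {
    is_intra:   bool,
    has_coeffs: bool,       // non-zero residual coefficients in the 4x4 block
    ref_idx:    i8,
    mv:         (i16, i16), // motion vector in quarter-pel units
}

// Returns the filtering strength (0..=4) for the edge between blocks `p` and `q`.
fn edge_strength(p: &BlockInfo, q: &BlockInfo, mb_edge: bool) -> u8 {
    if p.is_intra || q.is_intra {
        // intra blocks always get the strongest filtering
        return if mb_edge { 4 } else { 3 };
    }
    if p.has_coeffs || q.has_coeffs {
        return 2;
    }
    // different references or a motion vector difference of a full pel or more
    if p.ref_idx != q.ref_idx
        || (p.mv.0 - q.mv.0).abs() >= 4
        || (p.mv.1 - q.mv.1).abs() >= 4
    {
        return 1;
    }
    0
}
```

Multiply that by up to 48 luma edges per macroblock, add the lookups needed to fetch the neighbouring block data in the first place, and it’s easy to see where the time goes.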
I’m pretty sure I could shave off ten percent of decoding time by optimising loop filters and getting rid of the overhead generated by the Rust compiler for motion compensation functions, but that won’t change things much. I suspect that the main speed-up can be achieved by changing the data structure design of the decoder: introducing caches for the often-used information about the macroblock and its neighbourhood, eliminating checked accesses wherever possible, using some tricks to speed up common calculations and so on. The problem is that it’s not fun, takes too much time (even to see some results, let alone complete it in full) and I have no need for it. Previously I made optimisations because it was fun or could speed up decoding significantly (as single-threaded decoding was too slow for comfortable playback), but the low-hanging fruits have been picked, the decoder is fast enough for my practical needs and thus I have no incentive to keep working on it whatsoever.
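To illustrate what “eliminating checked accesses” means in practice, here’s a hypothetical example (the function names and layout are made up, this is not how NihAV’s motion compensation is actually written):

```rust
// Illustration only: the indexed version pays for a bounds check on every
// pixel, while the slice/iterator version checks once per row and lets the
// compiler see the loop limits.
fn copy_block_indexed(dst: &mut [u8], dstride: usize, src: &[u8], sstride: usize, w: usize, h: usize) {
    for y in 0..h {
        for x in 0..w {
            dst[y * dstride + x] = src[y * sstride + x];
        }
    }
}

fn copy_block_rows(dst: &mut [u8], dstride: usize, src: &[u8], sstride: usize, w: usize, h: usize) {
    for (drow, srow) in dst.chunks_mut(dstride).zip(src.chunks(sstride)).take(h) {
        drow[..w].copy_from_slice(&srow[..w]);
    }
}
```

The second variant checks the slice lengths once per row instead of once per pixel and usually gets auto-vectorised as a bonus; doing that kind of rewrite consistently through the whole decoder is exactly the tedious part I’d rather skip.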
Nevertheless, I hope this demonstrates that you don’t need to be a genius or spend an insane amount of time to write a semi-decent H.264 decoder, even if it would take a joint effort of many talented people to make it very fast.