NihAV: The Fastest RealVideo 6 Decoder in Rust!

I guess the title shows how stupid marketing can be when there’s just one contestant in the category: in order to win you merely need to exist. Like in this case: NihAV can barely decode the data and the output is not correct, but the images are recognizable (some examples below) and that’s enough to justify the post title. Also, I have just one sample clip to test my decoder on, but at least RV6 is not a format with many features, so only one of them remains untested (type 3 frames).

Anyway, here’s the first frame—and it is reconstructed perfectly:

The rest of the frames are of significantly worse quality but have more detail.

The next I-frame:

The first P-frame after it:

And one of B-frames between those two:

The sample is just a series of short clips of similar sights with some Korean company logo at the end. At least I could recognize most of them without straining my imagination much (especially after I fixed a typo in my 4×4 transform implementation).

Overall I’d say that the RealVideo 6 decoder is simpler than the RealVideo 4 one: it needs less code (the RV4 codebase is currently more than twice as large as the RV6 decoder codebase) and many things are done in a simpler way too.

For example, RV3 and RV4 had context-adaptive coding for intra prediction direction, while RV6 simply has a fixed scheme (one bit selecting whether it’s a shortlist index or an explicit mode, then 1-2 bits for the index or 5 bits for the mode; the final intra mode is derived like in H.EVC but in a slightly simpler way). While RV3 and RV4 used weighted motion compensation for B-frames, RV6 uses just averaging. RV1-4 had arbitrary slices, while RV6 codes each row of 64×64 blocks in a separate chunk of the frame, and decoding of a chunk can start as soon as enough context is available (and the H.EVC scheme is too flexible for my taste, I’d rather stick to one scheme instead of having to support both tiles and slices that may or may not code rows independently).
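
The fixed intra mode signalling can be sketched roughly like this. It is only an illustration and not the actual NihAV code: the `BitReader` helper, the `IntraModeCode` enum, the polarity of the selector bit and the exact 1-2 bit code for the shortlist index (assumed here to be `0`, `10`, `11`) are all my assumptions, and deriving the final mode from the shortlist is left out.

```rust
/// Minimal MSB-first bit reader (hypothetical helper for this sketch).
struct BitReader<'a> {
    data: &'a [u8],
    pos: usize, // current position in bits
}

impl<'a> BitReader<'a> {
    fn new(data: &'a [u8]) -> Self {
        Self { data, pos: 0 }
    }

    /// Reads one bit, or returns None when the input is exhausted.
    fn read_bit(&mut self) -> Option<bool> {
        let byte = *self.data.get(self.pos >> 3)?;
        let bit = (byte >> (7 - (self.pos & 7))) & 1;
        self.pos += 1;
        Some(bit != 0)
    }

    /// Reads `n` bits (n <= 8), most significant bit first.
    fn read_bits(&mut self, n: u8) -> Option<u8> {
        let mut val = 0u8;
        for _ in 0..n {
            val = (val << 1) | (self.read_bit()? as u8);
        }
        Some(val)
    }
}

/// What the bitstream says about the intra mode before the final derivation.
#[derive(Debug, PartialEq)]
enum IntraModeCode {
    /// Index into the shortlist of candidate modes (0..=2).
    Shortlist(u8),
    /// Explicitly coded 5-bit mode value.
    Explicit(u8),
}

/// Reads the fixed-length-ish intra mode code: no contexts, no adaptation.
fn read_intra_mode_code(br: &mut BitReader<'_>) -> Option<IntraModeCode> {
    if br.read_bit()? {
        // Assumed polarity: 1 selects the shortlist.
        // Assumed index code: "0" -> 0, "10" -> 1, "11" -> 2.
        let idx = if !br.read_bit()? {
            0
        } else if !br.read_bit()? {
            1
        } else {
            2
        };
        Some(IntraModeCode::Shortlist(idx))
    } else {
        // Explicit mode: five raw bits.
        Some(IntraModeCode::Explicit(br.read_bits(5)?))
    }
}
```

The point of the sketch is that nothing here depends on previously reconstructed data: the same bits always parse the same way, whatever the neighbouring blocks look like.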

I’d argue this makes a lot of sense: if you want to offer your codec you should make sure it decodes in software with good speed (not looking at you, AV1) because hardware implementations do not follow immediately (if at all), so you’d better deliver something that works here and now. And if you target “high-definition” you have to deal with a lot of input data regardless of how well you compress it. And compression gains come mostly from employing larger blocks instead of smarter coding methods (e.g. if you compare H.264 and H.265 you’ll see they use the same coefficient encoding method, just larger blocks for transform and prediction). Considering all that, employing a simpler but tried scheme that can be used for parallel decoding makes sense. Of course if you rely on magic hardware to do all the decoding for you then you can sacrifice simplicity for compression efficiency.

And there’s another thing from the implementer’s point of view. In RealVideo 6 there’s a separation between the coded bitstream and its reconstructed form, i.e. you don’t need to reconstruct blocks correctly in order to be able to decode upcoming blocks. In H.EVC you have an arithmetic coder and almost everything is coded with an adaptive context that is selected depending on the decoded values of its neighbours. From the reverse engineering point of view that means if you don’t get all the details right you won’t be able to decode anything properly past the point where you made a mistake. Here’s a simple example: in order to decode the split flag in RealVideo 6 you simply read a bit; in ITU H.265 you need to check whether the top and left neighbours are split more than the current block and, depending on that information (none/one/both of them are split more), select one of three contexts that will be used to decode the actual split bit value (and obviously the arithmetic decoder state is updated using the probability from the bit context). It is not that bad at this level but decoding coded block patterns and individual coefficients might get painful. Especially if all you have is a bad binary specification.
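
To make the split flag comparison a bit more concrete, here is a small sketch of both sides. The function names and the `Option<u8>` representation of neighbour depths are made up for illustration, and the arithmetic decoder itself is deliberately left out; only the context selection described above is shown.

```rust
/// RealVideo 6 side of the comparison: the split flag is just the next raw bit.
/// `bits` stands in for a real bit reader.
fn rv6_split_flag(bits: &mut impl Iterator<Item = bool>) -> Option<bool> {
    bits.next()
}

/// H.265-style side: pick one of three contexts by counting how many of the
/// left/top neighbours are split deeper than the current block.
/// `None` means the neighbour is not available.
fn split_flag_context(cur_depth: u8, left_depth: Option<u8>, top_depth: Option<u8>) -> usize {
    let deeper = |d: Option<u8>| d.map_or(false, |d| d > cur_depth) as usize;
    deeper(left_depth) + deeper(top_depth)
}
```

If a neighbour depth was decoded wrongly, `split_flag_context` returns a different context, the arithmetic decoder uses a different probability, and everything after that point goes out of sync. The RV6 variant has no such dependency.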

Having said all that, I still have quite a bit to implement in order to have proper decoding:

  • Type 3 frames (not encountered in the only sample I have; but from the specification they seem to be very optional);
  • Recheck intra prediction mode—I have a feeling I forgot to fill some context properly;
  • On that note, I also need to double-check the code for determining whether some neighbour of the block is present; this may be the other reason why intra prediction goes wrong in some cases;
  • The references for motion compensation are a bit of a mystery too: inter frames can have two forward references instead of just one, and the part of the specification responsible for selecting the reference frames is especially hairy and hard to comprehend (it’s full of STL code and after I saw calls like CRefQueue<std::shared_ptr<CFrame>>::GetRefList(...) and std::vector<std::shared_ptr<CFrame>,std::allocator<std::shared_ptr<CFrame>>>::_M_emplace_back_aux<std::shared_ptr<CFrame> const&> I decided not to dig deeper);
  • And the most delightful part—deblocking. Here it’s quite simple and straightforward but so was deblocking in previous RealVideo codecs and it took a lot of time to debug the implementation.

I hope it won’t take that much time though: this decoder is for completeness’ sake only, but I don’t want to leave it in a half-working state. And I should start documenting it too.

There’s still a lot to do but some unset milestone has been reached and it’s just debugging work ahead. And since the only sample I have (containing a bit more than 1750 frames at 352×288 resolution) decodes in about two seconds, I don’t feel a need to make it even faster.

I wonder what to do next—ducklings of varying beauty (from TM1 to VP7) or totally RAD family of codecs (Smacker and Bink)—but I’ll decide on the order later. Back to doing nothing!

3 Responses to “NihAV: The Fastest RealVideo 6 Decoder in Rust!”

  1. Paul says:

    I want code or specifications of everything.

  2. Kostya says:

    So do I, but sometimes even binary specifications are not available any more. On an unrelated note, has anybody had any luck with Ver!nt RFB decoder?

  3. Paul says:

    Its bitstream represents some complex structure. I cannot get hold of the binaries. And I’m not clairvoyant enough to decipher it.