Archive for the ‘NihAV’ Category

NihAV: Boring Details
Saturday, December 14th, 2019

As I mentioned in the previous post, I’m polishing NihAV and improving some bits here and there. In this post I’d like to describe what has been done and what should be done (but not necessarily will be done).
(more…)

NihAV: the Last Quack
Thursday, October 31st, 2019

Finally NihAV got full-featured VP7 decoding support (well, except one very exotic case for a very exotic mode), so now I can move on to other things like actually making various decoders bit-exact, fixing other bugs in them, adding missing pieces of code for the player and even documenting stuff. I hope to give a presentation of my work at VDD 2020 or FOSDEM 2021 (whichever accepts it) and I want to have something decent to present by then.
Anyway, here’s a review of VP7.
(more…)
MidiVid codec family
Thursday, September 26th, 2019

VP7 is such a nice codec that I decided to distract myself a little with something else. That something else turned out to be the MidiVid codec family, which proved to be quite peculiar and somehow reminiscent of Duck codecs.
The family consists of three codecs:
- MidiVid — the original codec based on LZSS and vector quantisation;
- MidiVid Lossless — exactly what it says on the tin, based on LZSS and a bunch of other technologies;
- MidiVid 3 — a codec based on a simplified integer DCT and a single codebook for all values.
I’ve actually added a MidiVid decoder to NihAV because it’s simple (two hundred lines including boilerplate and tests) and way more fun than working on the VP7 decoder. Now I’ll describe the codecs, and hopefully you’ll understand why the family reminds me of Duck codecs despite not being similar in design.
MidiVid
This is a simple hold-and-modify video codec that was used in some games back in the PS2/Xbox era. The frame data can be stored either unpacked or packed with LZSS, and it contains the following kinds of data: a change mask for 8×8 blocks (in the case of inter frames—if a bit is zero the block is left as is, otherwise new data is decoded for it), 4×4 block codebook data (up to 512 entries), high bits for 9-bit indices (if there are 257–512 distinct blocks) and 8-bit indices into the codebook.
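Since the high bits are stored separately from the 8-bit indices, reassembling a full codebook index is a small but illustrative step. Here is a minimal sketch of how it might look, assuming the extra bits are packed eight to a byte, MSB first (the packing order is my guess; only the split into separate high-bit and low-byte arrays comes from the description above):

```rust
// Assemble a codebook index from the separately stored high bits (if any)
// and the 8-bit index. `high_bits` is None when the codebook has 256
// entries or fewer and the low byte is the whole index.
fn block_index(high_bits: Option<&[u8]>, low_byte: u8, block_no: usize) -> u16 {
    let hi = match high_bits {
        // one extra bit per block, packed eight to a byte, MSB first
        Some(bits) => u16::from((bits[block_no >> 3] >> (7 - (block_no & 7))) & 1),
        None => 0,
    };
    (hi << 8) | u16::from(low_byte)
}
```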
The interesting part is that the LZSS scheme looked very familiar, and indeed it looks almost exactly like `lzss.c` from the LZARI author (remember that? I still do). The only differences are that it does not use a pre-filled window and that the flags are grouped into a 16-bit word instead of a single byte.
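For illustration, here is a sketch of how that LZSS variant could be decoded, assuming the classic `lzss.c` match layout (12-bit offset, 4-bit length, minimum match of 3) combined with the two differences noted above: no pre-filled window and 16-bit flag groups. The exact field layout and bit order are assumptions:

```rust
// Unpack LZSS data: 16-bit groups of flags, one bit per token; a set bit
// means a literal byte, a clear bit means a back-reference into the
// already decoded output (no pre-filled window).
fn lzss_unpack(src: &[u8], dst: &mut Vec<u8>) {
    let mut pos = 0;
    while pos + 2 <= src.len() {
        // one 16-bit group of flags covering the next sixteen tokens
        let flags = u16::from_le_bytes([src[pos], src[pos + 1]]);
        pos += 2;
        for bit in 0..16 {
            if (flags >> bit) & 1 != 0 {
                // literal byte
                if pos >= src.len() { return; }
                dst.push(src[pos]);
                pos += 1;
            } else {
                // back-reference: 12-bit offset, 4-bit length (+3 minimum match)
                if pos + 2 > src.len() { return; }
                let word = u16::from_le_bytes([src[pos], src[pos + 1]]);
                pos += 2;
                let offset = usize::from(word & 0x0FFF) + 1;
                let len    = usize::from(word >> 12) + 3;
                for _ in 0..len {
                    let b = dst[dst.len() - offset]; // panics on corrupt input
                    dst.push(b);
                }
            }
        }
    }
}
```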
MidiVid Lossless
This one is a special beast as it combines two completely different compression methods: the same LZSS as before and something taken from a BWT-based compressor (to the point that the frame header contains `FTWB` or `ZTWB` IDs).
I’m positively convinced it was copied from some BWT-based compressor, not just because of those IDs but also because it seems to employ the same methods as some old BWT-based compressor except for the Burrows–Wheeler transform itself (that would be too much for the old codecs): various data preprocessing methods (signalled by flags in the frame header), move-to-front coding (in its classical 1-2 coding form that does not update the first two positions that much) plus coding coefficients in two groups: first just zero/one/large using an order-3 adaptive model and then the values larger than one using a single order-1 adaptive model. What made it suspicious? The preprocessing methods.
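To make the “1-2 coding” remark concrete, here is a tiny sketch of that move-to-front variant as I understand it: a symbol found deeper in the rank list is promoted only to position 1, and only a symbol already at position 1 (or 0) ends up at the front. The details are assumed:

```rust
// Move-to-front with the classic 1-2 coding tweak: ranks beyond the first
// two move to position 1 first instead of jumping straight to the front.
fn mtf12_decode(ranks: &mut Vec<u8>, idx: usize) -> u8 {
    let sym = ranks.remove(idx);
    let new_pos = if idx <= 1 { 0 } else { 1 };
    ranks.insert(new_pos, sym);
    sym
}
```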
MVLZ has several kinds of preprocessing methods: something looking like distance coding, static n-gram replacement, table prediction (i.e. when data is treated as a series of n-bit numbers and each number is replaced with the difference from the previous one) and x86 call preprocessing (i.e. the trick where you change a function call address from relative into absolute for a better compression ratio and then undo it during decompression; also known as E8-preprocessing because the x86 call opcode is `E8 <32-bit offset>` and it’s easy to just replace those instead of adding a full disassembler to the archiver). I had my suspicions about the n-gram replacement (that one is quite pointless for video codecs, and it only replaces some values with binary sequences that look more related to machine code than to video), but the last item was a dead give-away. I’m pretty sure that somebody who knows the open-source BWT compressors of the late 1990s would recognize it even from this description, but sadly I’ve not been following that scene closely, being more attracted to multimedia.
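E8-preprocessing itself is simple enough to show in a few lines. A sketch, where the five-byte scan window and little-endian offsets are the standard x86 `call` encoding, while the in-place rewrite is my assumption about how MVLZ applies it:

```rust
// Rewrite relative call targets as absolute addresses before compression
// (encode == true) and restore them after decompression (encode == false).
fn e8_transform(buf: &mut [u8], encode: bool) {
    let mut i = 0;
    while i + 5 <= buf.len() {
        if buf[i] == 0xE8 {
            let rel = i32::from_le_bytes([buf[i+1], buf[i+2], buf[i+3], buf[i+4]]);
            let here = i as i32 + 5; // address right after the call instruction
            let fixed = if encode { rel.wrapping_add(here) } else { rel.wrapping_sub(here) };
            buf[i+1..i+5].copy_from_slice(&fixed.to_le_bytes());
            i += 5;
        } else {
            i += 1;
        }
    }
}
```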
MidiVid 3
This codec is based on a static codebook for packing all values: block types, motion vectors and actual coefficients. Each block in a macroblock can be coded with one of four modes: empty (fill with `0x80` in case of intra), DC only, DCT with just a few coefficients, and full DCT. As usual, the various kinds of data are grouped together and coded as a single array.
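For reference, the four modes could be summarised as an enum like this (the names are mine, not taken from the codec):

```rust
// The four block coding modes as described above.
enum BlockMode {
    Empty,      // intra: fill the block with 0x80
    DcOnly,     // a single DC coefficient
    PartialDct, // just a few transform coefficients
    FullDct,    // a fully coded transform block
}
```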
Motion compensation is full-pixel and, unlike its predecessor, it operates in YUV420 format.
This was an interesting detour, but now I have to return to failing to start writing the VP7 decoder.
P.S. I’ll try to document them in more detail in the wiki soon.
P.P.S. This should’ve been a post about railways instead but I guess it will have to wait.
AVC support in NihAV: semi-done
Saturday, September 14th, 2019

I’ve wasted enough time on the AVC decoder for the On2 family, so while it’s not working properly for those special cases I’m moving to VP7 regardless.
For those who don’t know (or forgot; or never had a reason to care), On2 AAC is an AAC-LC rip-off with some creative reconstruction modes added to the usual long/short windows. I failed to understand how it works before and I still fail to understand how it works. But at least some details are a bit clearer now that I’ve analysed the whole codec from scratch with less guesswork.
The codec has three IDs that it recognizes: `0x500`, `0x501` and `0x1234`. The first two differ only in that one handles single packets and the other handles several packets glued together, each prefixed with its size. The last ID is simply recognized but does not get any special handling.
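A sketch of how the glued-packet variant might be split; the size-prefix width and endianness are assumptions, only the “size, then payload” idea comes from the behaviour described above:

```rust
// Split a buffer of concatenated packets, each prefixed with its size
// (assumed here to be a 16-bit little-endian value).
fn split_packets(mut data: &[u8]) -> Vec<&[u8]> {
    let mut packets = Vec::new();
    while data.len() >= 2 {
        let size = usize::from(u16::from_le_bytes([data[0], data[1]]));
        if data.len() < 2 + size { break; } // truncated input: stop
        packets.push(&data[2..2 + size]);
        data = &data[2 + size..];
    }
    packets
}
```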
The tricky part is some special modes that do heavy processing of the data. For most modes you invoke IMDCT and that’s all; here you do some QMF-like filtering (probably for transient extraction), then you perform RDFT (previously I thought it was plain FFT but after a long investigation it turned out to be RDFT after all) on quarters of the block, merge those quarters using filters that look like convolution filters for four sub-bands, perform RDFT again on the whole block and add some transients. And after that you may still need to reverse the data before using a permuted window for the overlap-add operation. In other words it’s not fun, and I lack the education to recognize all those algorithms, why they’re used and where it goes wrong.
So hopefully I’ll return to it some day to fix it for good, but now VP7 awaits (so I can at least formally declare the Duck codec family done and move to implementing the missing bits in the framework itself).
NihAV: still ducking
Saturday, July 27th, 2019

While it’s summer and I’d rather travel around (or suffer from the heat when I can’t), there has been some progress on NihAV. Now I can decode VP5 and VP6 files. Reconstruction still sucks because perfect reconstruction takes a lot of effort, and I’m too lazy to do that when a simple demonstration that the decoder works would suffice.
Anyway, now I can decode both VP5 and VP6 files, including interlaced ones. Interlacing in VP5/6 is done in a very simple way, as in many other codecs: there’s a bit for each macroblock telling whether the macroblock should be output in interlaced form or not.
Of course, this being the VPx family, they had to do it with some creativity. First you decode the base interlaced-bit probability, which is stored as an 8-bit value while all other bit probabilities are stored in 7 bits. Then you derive the actual probability for the interlaced bit and decode it before any other macroblock information (including the macroblock type—it’s that important). The probability is derived by companding the base probability depending on whether the last macroblock was interlaced (then the probability is halved) or not (then it’s remapped to fit the 128–255 range)—except for the first macroblock in a row, which uses the base probability without modifications. And for VP6 you also have to use a different initial scan order (the band assignment for each coefficient, now shuffled). This is so trivial that one would wonder why it has not been done in the libavcodec decoder yet.
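The companding is compact enough to sketch; the halving and the 128–255 remap follow the description above, while the exact rounding is my assumption:

```rust
// Derive the probability for the interlaced bit from the 8-bit base value.
fn ilace_prob(base: u8, first_in_row: bool, last_was_ilaced: bool) -> u8 {
    if first_in_row {
        base              // first macroblock in a row: base probability as is
    } else if last_was_ilaced {
        base >> 1         // previous macroblock was interlaced: halve it
    } else {
        128 + (base >> 1) // otherwise remap into the 128-255 range
    }
}
```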
There are three possible things to do next: polish the current implementation, move to AVC (On2 AVC, that is) or move to AVC (Duck VP7, which is an AVC rip-off). But probably I’ll simply keep doing nothing instead.
NihAV: rust-clippy experience
Saturday, May 18th, 2019

As I’ve mentioned in the previous post, I’ve finally tried rust-clippy to see what issues and suggestions it would have for my code. The results are not disappointing if you take the tool’s name seriously.
(more…)
NihAV: after clean-up
Friday, May 17th, 2019

Since the clean-up work on NihAV is done and I’m making progress with the Truemotion VP3 decoder, it’s a good time to talk about what I’ve actually done—there’s even more material waiting in the queue.
The intent was to make all frame-related stuff thread-safe and improve efficiency a bit. In order to do the former I had to replace most of the references from `Rc<RefCell<T>>` to `Arc<T>`, and while doing that I introduced aliases like `type NAFrameRef = Arc<NAFrame>` and `.into_ref()` methods to convert an object into its ref-counted version. This helped when I tried switching from one implementation of reference counting to another, and it will make it easy to switch again if I ever need to (hopefully not). Now about the improved efficiency and how it’s related to the ref-counting.
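A minimal sketch of that alias-plus-conversion pattern, with `NAFrame` reduced to a stand-in for the real nihav-core type:

```rust
use std::sync::Arc;

struct NAFrame { /* timestamps, buffer reference, stream information, ... */ }

type NAFrameRef = Arc<NAFrame>;

impl NAFrame {
    // consume the owned frame and hand back a ref-counted version
    fn into_ref(self) -> NAFrameRef { Arc::new(self) }
}
```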
There’s a straightforward way of dealing with frames: you allocate a picture, fill it, dispose of it, allocate a new one, and so on. And there’s a more efficient way: you allocate several pictures at once, select an unused one, fill it, and return it to the pool when it’s not needed any more. That is where reference counting comes into play and where the default Rust structures don’t help. The frame pool owns a reference and the decoder gets a second copy, but `Arc` is not meant for shared mutable access: when you try to get a mutable reference to the shared object (via `Arc::make_mut`), it will simply clone it, so you end up working with a copy (which defies the purpose). So I had to NIH my own `NABufferRef<T>` which keeps reference counts and still allows shared access even for writing (currently it does that in all cases, but if I need to add some guards later the API won’t have to change). The implementation is very simple: the structure contains a raw pointer to a structure that holds the actual object and an `AtomicUsize` counter. The whole implementation is ~2.2kB relying just on the `std` crate.
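A condensed sketch of the idea as described: a raw pointer to a heap cell holding the object plus an atomic counter, with shared access even for writing and no guards (the real NihAV type has more to it):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// The heap cell: the actual object plus its reference counter.
struct NABufferData<T> {
    data: T,
    refs: AtomicUsize,
}

pub struct NABufferRef<T> {
    ptr: *mut NABufferData<T>,
}

impl<T> NABufferRef<T> {
    pub fn new(val: T) -> Self {
        let bx = Box::new(NABufferData { data: val, refs: AtomicUsize::new(1) });
        Self { ptr: Box::into_raw(bx) }
    }
    // shared access even for writing -- the callers are trusted to behave
    pub fn as_mut(&self) -> &mut T {
        unsafe { &mut (*self.ptr).data }
    }
}

impl<T> Clone for NABufferRef<T> {
    fn clone(&self) -> Self {
        unsafe { (*self.ptr).refs.fetch_add(1, Ordering::SeqCst); }
        Self { ptr: self.ptr }
    }
}

impl<T> Drop for NABufferRef<T> {
    fn drop(&mut self) {
        unsafe {
            if (*self.ptr).refs.fetch_sub(1, Ordering::SeqCst) == 1 {
                drop(Box::from_raw(self.ptr)); // last reference: free the cell
            }
        }
    }
}
```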
And finally I’ve made a picture pool. The difference between a picture and a frame is all the additional metadata a picture should not care about (like timestamps, stream information and such). Because of the design decisions I have three different picture formats (implemented for 8-, 16- and 32-bit element sizes; Rust does not like aliasing after all), which means I need to provide the decoder with all three picture pools because we can’t say in advance which one a codec will use (if any—the option to allocate new non-pooled pictures is still there). Also I want to keep those pools external in case the code around the decoder wants to keep more pictures in them (e.g. 2–3 pictures required by the decoder and 25 pictures pre-buffered for display). This resulted in a structure called `NADecoderSupport` that contains the picture pools and may get something else added later. Of course people might argue that it’s much better to have `AVCodecContext` with a myriad of fields you can set directly or via utility functions, but I’d rather not have one single all-purpose structure. Though it might be a good place to put various decoder options (so that the decoder can ignore them at its leisure).
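A guess at the overall shape of `NADecoderSupport` as described above: one picture pool per element size, owned outside the decoder. All names except `NADecoderSupport` are mine, and the pool is reduced to a trivial free list for illustration:

```rust
// A trivial free-list pool: decoders take a spare buffer if one is
// available and hand it back when the picture is no longer referenced.
struct PicturePool<T> {
    free: Vec<Vec<T>>,
}

impl<T> PicturePool<T> {
    fn new() -> Self { Self { free: Vec::new() } }
    fn get(&mut self) -> Option<Vec<T>> { self.free.pop() }
    fn put(&mut self, buf: Vec<T>) { self.free.push(buf); }
}

struct NADecoderSupport {
    pool_u8:  PicturePool<u8>,  // 8-bit pictures
    pool_u16: PicturePool<u16>, // 16-bit pictures
    pool_u32: PicturePool<u32>, // 32-bit (e.g. packed RGB) pictures
}
```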
Since I said I did it to increase efficiency, I should probably give some numbers too: the RealVideo 3/4/6 decoders now use a buffer pool (of three frames, obviously) and reallocate it on format change. Decoding time got reduced by 4–5% from using the pool. Currently I don’t care much about speed, but I may convert more decoders if the need arises.
In conclusion I want to say that even though I did not enjoy doing that work much, it was needed and gave me some experience plus some improvements in code and design. So it was not a wasted effort.
P.S. I also installed rust-clippy since it’s in stable now and tried to fix the errors and warnings it reported. But that is a story for another post.
NihAV: now with TM2X support!
Thursday, April 11th, 2019

I’m proud to say that NihAV got TrueMotion 2X support. For now only intra frames are supported, but 75% of the samples I have (i.e. three samples) contain just intra frames. At least I could check that it works as it’s supposed to.
First, here’s a codec description now that I’ve managed to write a working decoder for it. TrueMotion 2X is another of those codecs that are closer to TrueMotion 1 in design. It still uses the same variable-length codebook instead of Huffman coding (actually only version 5 of this codec uses bit reading for anything). It also uses the “apply a variable amount of deltas per block” approach, but instead of the old fixed scheme it now defines twenty-something coding approaches and tells the decoder which ones to use in the current frame. That is done because the block size can now vary too (though it’s always 8 in all the files I’ve seen). And blocks are grouped into tiles (usually equivalent to one row of blocks but, again, that may vary). The frame data obfuscation that XORs chunks inside the frame with a 32-bit key derived in a special way is not even worth mentioning.
Second, the reference is quite peculiar too. It decodes frame data by filling an array of pointers to the functions that decode each line segment in the proper mode, moving to the next line and repeating. And those functions are in handwritten assembly—they use the stack pointer register for the decoder context pointer (which has the original `ESP` saved somewhere inside), which also means they do not use stack space for anything, and instead of returning they simply jump to the next routine until the final one restores the stack and returns properly. Thankfully Ghidra allows assigning the context argument to `ESP`, and while the decompiled output still looks useless, the assembly gets proper references in the form `mov EDX, dword ptr [ctx->luma_pred + ESP]`.
And finally, I could not check what the binary specification really does because MPlayer could not run it. At first I tried running a working combination of WMP+Win98 under OllyDbg in QEMU, but it was painfully slow and even more painful to inspect the memory state. In the end I managed to run the TM2X decoder in MPlayer, which then served as a good reference. The trick is that you should not try to run `tm2X.dll` (it’s really hopeless) but rather take `tm2Xdec.ax` (or the deceptively named `tm20dec.ax` from the same distribution, which can handle TM2X unlike its earlier versions), patch one byte in a check in the DLL init, and it works surprisingly well after that.
So what’s next? Probably I’ll just add the missing features for the second TM2X sample (the other two samples are TM2A), maybe add the Bink2 deblocking feature—since I’d rather have that decoder complete—and move on to improving the overall NihAV design. Frame management needs a proper rework—I want to switch to a thread-safe version—before I add more decoders. Plus I’ll need to add some missing bits for a player. There’s a lot of work still to do, but I’m pleased that I’ve managed to do something at all.
NihAV: even more Bink2 support!
Wednesday, March 13th, 2019

After managing to decode the first frame of the `KB2g` variant I had three options: try to decode the other frames, try to decode other variants, or do nothing. While the third option is the most appealing and the first one is the most logical, I stuck with the second. So now I can decode the first frame of the `KB2f` variant of Bink2 as well. Unfortunately the only (partial) `KB2a` sample I know of is not supported; it’s probably a beta version that was tried in one game, like Bink version b. Apart from a small surprise in one place, bitstream decoding was rather simple. Inter-frame support should not be that hard, but it might get messy because of the DC and MV prediction.
And while talking about REing Bink I should mention that I tried Ghidra while doing the `KB2f` work. It is a nice tool that sucks in some places (no good highlighting for variables, decompiling SIMD code produces very questionable output, and the system is Java-based and requires a recent JDK—that’s the worst issue really), but it works and produces decent results (including the decompiler). Also, since it has 16-bit decompiler support, maybe I’ll manage to figure out how those clips in Monty Python & the Quest for the Holy Grail are stored.
I should start documenting it too.
Insignificant update: okay, now it decodes inter-frame data correctly too, and the only thing left is to make it reconstruct the frames correctly. Also I’ve updated the codec information on the Multimedia Wiki. Actually it works quite okay now, so I’m not going to pursue it further—I have no real interest in Bink2 decoding after all.
NihAV: some Bink2 support
Sunday, March 10th, 2019

It took a long time, but finally I can decode the first frame of Bink2 video (just the `KB2g` flavour for now, but it’s a start).
At least the initial observations were correct: Bink2 codes data in 32×32 macroblocks, with two codebooks for AC zero runs, one codebook for motion vector components, and simple codes with unary prefixes for the rest.
If you wonder why it took so long—that’s because I’m lazy and spend an hour or less a day on it. Also, while the codec is simple in design, it’s a bit complicated in implementation. Where the previous version relied on the format sub-version to decide which features to use, Bink2 uses frame flags for that. For instance, flag `0x1000` signals that two bit arrays are coded which tell when to read an additional flag during CBP decoding, and that flag in turn selects which of the two codebooks is used during AC decoding later. And flag `0x2000` essentially switches to different bitstream decoding (like motion vector decoding or block type decoding). Or take the fact that it employs DC and MV prediction that usually has four cases (top-left macroblock, top block, left block, some block inside) plus WMV1-like handling of DC prediction in inter frames (i.e. it calculates DC for inter blocks and uses it for prediction). And of course DC prediction for inter blocks works a bit differently. Plus it tries to track internal state by packing all flags into a 32-bit word and updating it for each block (two bits signal the top row, one that the macroblock is the leftmost one, some bits are copied from the frame flags, etc.). So there are a lot of nuances to take care of.
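To illustrate the state-tracking idea, here is a sketch of how such a 32-bit state word could be built per block; which bit means what is entirely my guess, only the kinds of flags come from the description above:

```rust
// Hypothetical bit assignments for the per-block state word.
const TOP_ROW_0: u32 = 1 << 0; // two bits signalling the top row
const TOP_ROW_1: u32 = 1 << 1;
const LEFTMOST:  u32 = 1 << 2; // macroblock is the leftmost in its row

fn block_state(frame_flags: u32, mb_x: usize, mb_y: usize) -> u32 {
    let mut state = frame_flags & 0xFFFF_F000; // some bits copied from frame flags
    if mb_y == 0 { state |= TOP_ROW_0 | TOP_ROW_1; }
    if mb_x == 0 { state |= LEFTMOST; }
    state
}
```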
And that’s not counting the fact that the current Bink2 player can’t decode versions prior to `KB2g` at all. Since I have some `KB2f` samples along with an old Bink player that can handle them, I guess I’ll support them eventually.