Actimagine VX: another imperfect decoder

August 13th, 2020

So I’ve released my decoder for Actimagine VX and it’s far from perfect.

First problem is audio. While the codec itself it not that tricky (it turned out to be some LPC codec that takes 5-10 16-bit words per frame to code pulses and filter for 128-sample frame), but its data is stored right after video frame data so in order to decode audio first you need to decode video frame and feed the remains of input buffer to the audio decoder. Since I can’t do that in a sane way I could not test the decoder either and it’s there just for the informative purposes only.

The second problem is obviously video. I’ve managed to decode bitstream fine but reconstructed images are not bit-exact and in case of plane prediction this leads to ugly artefacts (essentially the target value wraps around and you have gradients from white to black or vice versa instead of almost flat dark or white regions). I’ve introduced a clipping which seems to help but this is not right and maybe I’ll fix it one day. Maybe even before Bink2.

And finally there are some problems with the demuxer. In theory VX files may have multiple tracks but my demuxer might not handle them at all and if it does then it’ll simply ignore anything but the first video stream.

So VX support is far from perfect but it serves its goal of proving that the format works as expected. And if it’s useful to anybody then it’s even better.

Some words about Bink2

August 9th, 2020

As you may know (but definitely not care), NihAV has some limited support for Bink2 video. The problem in fixing it is that known samples are usually 720p video or mode which makes it hard to debug decoding past few initial frames (okay, older versions have smaller known videos so they’re likely to be fixed sooner). And of course the encoder is available only to the RAD customers to which I don’t belong. So in result I’ve decided to look at Actimagine VX codec once again.

I’ve looked at it four years ago but I could just study it but not write a decoder because of the binary. Essentially this codec happens on BigN DS consoles so you have to deal with raw ARM7 or ARM9 binary that (as it turns out) sets up its own segments (and the problems arise when you see absolute addresses to the areas not present there). So you load binary at addresses e.g. 0x2000000-0x20e1030 but in reality it contains also segments 0x1ffe800-0x1fff000 and 0x27e0000-0x27e4000. Thankfully Ghidra can not just load raw ARM binary but also add aliases to data as new segments. This allowed me to work on the decoder again and now I have more or less complete understanding of it and semi-working decoder for it as well, here’s an example:

Sample decoded frame.

Essentially it’s a simplified variant of H.264 with the following features: frames are split into 16×16 macroblocks that can be further recursively divided horizontally or vertically down to 2×2 blocks. Block can be coded in 24 different modes that boil down to full-pel motion compensation from one of three previous frames (without a motion vector, with motion vector, or with motion vector and an offset value that should be added to each pixel), intra prediction on whole block or intra prediction in 4×4 blocks. Also whether you have residue coded is also part of the mode (e.g. mode 11 is intra prediction without residue and mode 22 is intra prediction with residue). Residue is coded in 8×8 blocks comprising six 4×4 coefficient blocks, each block is coded in a way reminding of H.264: there are numbers for total number of non-zero coefficients, number of last non-zero coefficients being plus-minus one and number of zeroes dispersed between non-zero coefficients. Those being coded with variable-length codes that I could not access earlier was the blocker but not any more.

And there’s one curious feature of this codec that made it worth REing: instead of using plane prediction like H.264, this codec fills block in a recursive way. It interpolates bottom-right corner as an average of top-right and borrom-left neighbour pixels (e.g. [15,-1] and [-1,15] for 16×16 block; it also adds a delta to it in certain decoding modes), then it calculates halfway-bottom right and halfway-right bottom pixels (e.g. [15,7] and [7,15] for 16×16 block), then a centre pixel, and then repeats the process for each quarter (or half for some rectangular blocks). This is less computationally intensive than ordinary plane prediction and it seems to give nice results too.

I mentioned before that my decoder is far from perfect (and you can see it for yourself on that picture) but I know how to debug and improve it. I’m not trying to say that piracy is okay, but being able to find some .nds image with a game that has VX videos and using it with DeSmuME with GDB stub would help to debug the decoder but piracy is bad and so it’s not a proper way to do things.

As for audio counterpart, I should mention this: curiously enough there’s an opensource decoder for later MobiClip formats that seems to contain working Sx decoder for an audio used in VX files (it’s a pity the person who did it could not finish VX as well—why should I do the work myself instead of letting other people do my work for me?!). Unfortunately it’s mostly translated assembly so while it should work it’s mostly sub_XXX() doing various accesses to various positions of large byte array of decoder state. I’ll probably add it as well for completeness sake and document the formats properly after I fix the decoder (which should happen during this year too).

A Quality Video Hosting

July 31st, 2020

A brief context: I watch videos from BaidUTube (name slightly altered just because) and my preferable way to do that is to grab video files with youtube-dl in 720p quality so I can watch them later at my leisure, in the way I like (i.e. without a browser), and re-watch later even if it’s taken down. It works fine but in recent weeks I’ve noticed that some of the downloaded videos are unplayable. Of course this can be fixed by downloading it again in slightly different form (separate video stream and separate audio streams muxed locally, youtube-dl can do that) but today I was annoyed enough to look at the problem.

In case it’s not obvious I’m talking about mp4 filed encoded and muxed at BaidUTube without any modifications by youtube-dl which merely downloaded it. So, what’s the problem?

Essentially MP4 file contains header with metadata telling at which offset and which size are frames for each codec and the actual data is stored in mdat atom. Not here. First you have lots of 12-byte sequenced 90 00 00 00 00 0X XX XX 00 02 XX XX, then moof atom (used in fragmented MP4) and then another mdat. And another. I’ve tried to avoid streaming stuff but even to me it looks like somebody put all fragments prepared for HLS streaming into single MP4 file making an unplayable mess.

Overall this happens only on few random videos and probably most of the browsers would not pick it (since VP9 or VP10 in WebMKV is the suggested format) so I don’t expect it to be fixed. My theory is that they decided to roll a new version of encoding software with a broken muxer library or muxing mode. And if you ask “What were they thinking? You should run at least some tests to see if it encodes properly.”, one wise guy has an answer to you: they weren’t thinking about that, they were thinking when how long until the lunch break and then when it’s time to go home. This is the state of enterprise software and I have no reasons to believe the situation will ever improve.

And there’s a fact maybe related to it. Random files starting from 2019 maybe also show the marker “x264 – core 155 r2901 7d0ff22” in the encoded frames while most of the files have no markers at all. While I don’t think they violate the license it still looks strange that a company known for not admitting that it uses open-source projects (“for their own protection” as it was explained once) lets such marker slip through.

Well, that was an even more useless rant than usual.

A Quick Look on LCEVC

July 29th, 2020

As you might’ve heard, MPEG is essentially no more. And the last noticeable thing related to video coding it did the last was MPEG-5 (and synthesising actors and issuing commands to them with MPEG-G and MPEG-4 standards unholy unity). In result we have an abuse of letter ‘e’—in HEVC, EVC and LCEVC it means three different things. I’ll talk about VVC probably when AV2 specification is available, EVC is slightly enhanced AVC and LCEVC is interesting. And since I was able to locate DIS for it why not give a review of it?

LCEVC is based on Perseus and as such it’s still an interesting concept. For starters, it is not an independent codec but an enhancement layer to add scalability to other video codecs, somewhat like video SBR but hopefully it will remain more independent.

A good deal of specification is copied from H.264 probably because nobody in the industry can take a codec without NALs, SEIs and HRD seriously (I know their importance but here it still feels excessive). Regardless, here is what I understood from the description while suffering from thermal throttling.

The underlying idea is quite simple and hasn’t changed since Perseus: you take a base frame, upscale it, add the high-frequency differences and display the result. The differences are first grouped into 4×4 or 8×8 blocks, transformed with Walsh-Hadamard matrix or modified Walsh-Hadamard matrix (with some coefficients being zeroed out), quantised and coded. Coding is done in two phases: first there is a compaction state where coefficients are transformed into byte stream with flags for zero runs and large values (or RLE just for zeroes and ones) and then it can be packed further with Huffman codes. I guess that there are essentially two modes: a faster one where coefficient data is stored as bytes (with or without RLE) and slightly better compressed mode with those values are further packed with Huffman codes generated per tile.

Overall this looks like a neat scheme and I hope it will have at least some success. No, not to prove Chiariglione’s new approach for introducing new codecs an industry can use without complex patent licensing, but rather because it might be the only recent major video codec built on principles different from H.26x line and its success may introduce more radically different codecs and my codec world will get less boring.

NihAV: released!

July 27th, 2020

NihAV was a fine joke that had been running for far too long. But today, on no particulate date at all, I release it for public to ignore or to briefly look and forget immediately. Some decoders (Bink2, ClearVideo and Vivo 2) are still far from perfect, some features have simple or sketchy implementations, but despite all of that here it is.

The official website is here, source code is here.

Many thanks to people from former Libav project for hosting.

Some words about NihAV tools

July 11th, 2020

Since the work on NihAV is nearing the point when I can release it to public without that much shame (main features I wanted to implement are there and I’ve even documentation for all public interfaces plus some overview, you can’t ask for more than that) I want to give $title.

nihav-tool

This is the oldest tool oriented mostly to test decoders functionality. By default it will try to decode every stream in a file and output it either into a wave file or a sequence of images (PPM for RGB video, PGMYUV for YUV). Beside that it can also not decode a stream (and if you choose to decode neither then it tests demuxer or dumps raw frames).

Here is the list of switches it understands:

  • -noout makes it decode data but not produce any output (good for testing decoding process if you don’t currently care about decoder output);
  • -an/-vn makes it ignore audio or video streams correspondingly;
  • -nm=count/pktpts/frmpts make nihav-tool write frame numbers as a sequence or using PTS from input packet or decoded frame correspondingly;
  • -skip=key/inter tells video codec (if it is willing to listen) to skip less significant frames and decode only keyframes or intra- and interframes but no B-frames;
  • -seek time tells the tool to seek to the given position before decoding;
  • -apfx/-vpfx prefix specify the prefix for output filename(s) which comes useful when decoding files in a batch;
  • -ignerr tells nihav-tool to keep decoding ignoring errors the decoders report;
  • -dumpfrm tells nihav-tool to dump raw frames. This is both useful for obtaining raw audio frames (I could not make avconv do that) and because of the way it is implemented (it dumps packet contents first and then tries to decode it) if you use it along with the decoder and it errors out you’ll have raw frame on which it errored out.

Additionally you can specify end time after giving input name if you don’t need to decode the whole file.

As you can see this is not the most feature-rich tool but it works good enough for the declared goal (hence I use it mostly a debug build of it).

nihav-player

This is another quick and dirty tool that appeared when I decided that looking at long sequences of images is not the best way to ensure that decoding goes right. So I wrote something that can pass in a bad light for a player since it can show moving pictures and play sound that sometimes even goes in sync instead of deadlocking audio playback thread.

Currently it’s written using patched SDL1 crate (removing dependencies on num and rand and adding YUV overlay support and audio interface that you can actually use from Rust; patches will be available in the same repository) because my primary development system is too old and I don’t want to mess with various libraries or finding which version of sdl2 crate would compile using my current version of Rust (1.31 or 1.33.

In either case it’s a temporary solution used mostly for visual debugging and I want to write a proper media player based on SDL2 that would play audio-only files just as fine (so I can move to dogfooding). After all, can you really call yourself a multimedia developer if haven’t written a single player?

nihav-encoder

And finally the tool that appeared out of need to debug encoders instead of decoders. Hopefully it will become more useful than that one day but at least its interface should give you the idea what it does and what it will do in the future.

I still consider one of the main problems with ffmpeg (the tool) and later avconv the positional order of arguments. Except when the order does not matter. If you’ve never been annoyed by the fact you should put some arguments before -i infile in order for them to take effect on input and the rest of arguments should be put before output file name—well, in this case you’re luckier than me. So I’ve decided to have it in a more free-form format.

nihav-encoder command line looks a list of options in no particular order and some of them take complex arguments and then you provide a comma-separated list in form --options-list option1,option2=value,option3=.... Here is the list of recognised options:

  • --list-{decoders,encoders,demuxers,muxers} obviously lists the corresponding category and quits after listing all requested lists and options (see the next item);
  • --query-{decoder,encoder,demuxer,muxer}-options name prints the list of options supported by the corresponding codec or (de)muxer. Of course you can request options for several different things to be listed by adding this option several times;
  • --input inputfile and --output outputfile;
  • --input-format format and --output-format format force (de)muxer to use the provided format when autodetection fails;
  • --demuxer-options options takes a comma-separated list of options for demuxer (BTW you can also force input format with e.g. --demuxer-options format=avi);
  • --muxer-options options takes a comma-separated list of options for muxer (BTW you can also force output format with e.g. --muxer-options format=avi);
  • --no-audio and --no-video tell nihav-encoder to ignore all audio or video streams correspondingly;
  • --start time and --end time tell nihav-encoder to start decoding at given time and end at given time. The times are absolute so --start 1:10:00 --end 1:11:00 will process just a second of data;
  • --istreamX options and --ostreamX options set options for input and output streams with given numbers (starting with zero of course). More about them below.

nihav-encoder has two modes of operation: query mode, in which you specify which e.g. demuxers or codec options you want listed, and the program quits after listing them; and transcode mode, in which you specify input and output file and what you want to do with them. Maybe I’ll add a probe mode but I’ve never cared much about it before.

So what happens when you specify input and output? nihav-encoder will try to see which streams can be output (e.g. when transcoding from AVI to WAV there’s no point to even attempt to do anything with video stream), then it will try to copy input streams to the output unless anything else is specified. Of course you can specify that you want to discard some input stream with e.g. --istream0 drop. And for output streams you can also specify encoder and its parameters. For example my command line for testing Cinepak encoding looks like this:

./nihav-encoder –input laser05.avi –output cinepak.avi –no-audio –ostream0 encoder=cinepak,quant_mode=mediancut,nstrips=4

It takes input file laser05.avi, discards audio stream, encodes remaining video stream with Cinepak encoder that has options quant_mode and nstrips set explicitly, and writes the result to cinepak.avi.

As you can see, this tool has enough features to serve as a daily base transcoder but no complex features like taking input from several files, arbitrary mapping input streams from them to output streams and maybe applying some effects while at it. In my opinion that’s the task for some more complex application that builds a complex processing graph probably using a domain-specific language to specify inputs and outputs and what to do with them (and it should be a proper command file instead of command line that is impossible to type correctly even from the eighth try). Since I never had interest in GStreamer I’m definitely not going even to play with that. But a simple transcoder should serve my needs just fine.

Another reason for NihAV

July 4th, 2020

So instead of doing something productive like adding missing functionality bits and writing documentation I wasted my time on adding some QuickTime decoders. And while wasting time on adding SVQ1, SVQ3, QDMC and QDM2 decoders it became apparent why NihAV is a good thing to exist.

Implementing two of them was not a very big deal but implementing SVQ3 and QDM2 decoders took more than a week each because there are only two specifications available for them and both are equally hard to comprehend: the first one is the official binary specification, the second one is source code in libavcodec which is derived from the former.

The problem arises when somebody wants to understand how it works and/or reimplement the code and both SVQ3 and QDM2 decoder demonstrate two different aspects of that problem.

SVQ3 decoder is based on some draft of H.264 (or ex-MPEG/AVC if you’re from Piedmont) with certain extensions mostly related to motion compensation. Documentation for it was scarce and because of optimisations and integration with common H.264 decoder bits it’s hard to understand some of the things. One of those is intra prediction with two modes having SVQ3-specific hacks hidden in libavcodec/h264pred.c (those are 16×16 plane prediction mode giving transposed result and 4×4 diagonal down prediction being simplified and not relating on pixels not immediately top/left from the block) and another one is block coefficients decoding function. It took me quite a while to realize that it actually decodes three different kinds of blocks: single 4×4 block with zigzag scan, 4×4 block divided into two parts with interlaced scan, and 2×2 block. I’ve documented most of that in The Wiki (before that nobody has touched that page for almost ten years; sometimes I feel like I’m the only person contributing there).

QDM2 is horrible in different way. It is slightly improved translation of the original binary specification with hardly any idea how it works (there are still names like local_int_8 in the code). Don’t get me wrong, back in 2003-2005 when reverse engineering was done the only tools you had were debugger, disassembler (you’re lucky if it’s not the one provided by debugger) and no decompilers at all (IIRC rec appeared much later and was of limited usefulness, especially on multi-megabyte QT monolith—and that’s assuming you’re not doing that on Mac with even less tools available). I did some of such work back then as well so I understand how hard it is and how you’re happy that it works somehow and you can ship it and forget about it.

Another thing is that now it’s clear that QDMC and QDM2 are predecessors of DT$ LBR (aka Express) and use the same principles (QDMC simply coded noise and tones, QDM2 is almost like LBR but without some features like LPC or multichannel audio and with different chunk structure), but back in the day there was no documentation on LBR (or LBR itself for that matter).

But the main problem is that nobody has tried to understand the code since. It became a so-called category killer i.e. its existence prevents others from doing something similar. At least until some idiot tried to do another implementation in NihAV.

And here we have the reason for NihAV to exist: it advances the understanding of codecs for me (and I document results in The Wiki) resulting in different implementations that are (hopefully) easier to understand and sometimes even fix long-standing bugs. I hope this shall convince you that sometimes it’s good to have reimplementation of the decoder even if an existing implementation is good enough (as far as I remember the only time a decoder was rewritten in FFmpeg was when a reverse-engineered Indeo 3 decoder that crashed on damaged content almost every time was replaced with a reverse-engineered Indeo 3 decoder where a guy had the idea how it works).

But back to QDM2: while my decoder is not finished yet and I probably won’t bother with inter-frames in it (I’ve never seen any samples with those), it still decodes sweeps much better. That’s mostly because of the various bugs I’ve uncovered (also while discovering that Ghidra effectively does not allow to edit about a megabyte large decoder context). Since I have no incentive to produce a patch and people who created the decoder are long gone from the project, here are some spotted bugs: wrong coarse quantiser band selection (resulting in noise generated in wrong frequency range), reading bits past the chunk end (because is some cases checks are missing), ignoring group 4 tones because of the wrong conditions, some initial variables are set in the wrong way too. Nevertheless it mostly works and it was very useful for mapping the functions in the binary specification (fun fact: QDM2 decoder is located in QuickTime.qts while QDMC is located in QuickTimeInternetExtras.qtx).

NihAV: Conceptually Done!

June 7th, 2020

I’m happy to announce that NihAV has finally taken more or less complete form. Sure there are some concepts I wanted to play with (like raw streams handling) but I had no need for them so far so it can wait until much much later. But all major features required to build a transcoder are there as well as working transcoder itself.

As I wrote in the previous post I wanted to play with vector quantisation so first I implemented image palettisation but since that was not enough I implemented two encoders using vector quantisation: 15-bit MS Video 1 and Cinepak. I have no doubts that Tomas Härdin has written a much better encoder but why should that stop me from NIHing? Of course such encoder is not very useful by itself (and it was useless to begin with) so I needed a muxer to represent encoder output in some form. And then simply fiddling with parameters and recompiling became boring so I finally introduced generic options and in order to use those options without recompiling the binary every time I had to write a transcoder as well. But that means that now I can use NihAV to recode media into something else even if it’s just two crappy video encoders, MS ADPCM and PCM encoder with the large variety of supported output containers (AVI and WAV!). I called it conceptually done because all the essential concepts are there, not because there’s nothing left to do.

Now about video encoders. I’ll describe the NihAV design and how it works on a separate page, for now I just mention that while decoders are working on “frame in-picture/audio out” principle, encoders accept single picture or audio buffer for encoding and then may output a series of encoded packets. Why such asymmetry in design? Because decoders are expected to produce single output for single input (and frame reordering is handled externally) while most encoders are expected to have at least a single audio frame or couple of pictures of lookahead to make decisions about coding of a current input. For modern video codecs it may be a decision what frame type to assign or where to start a new scene, for audio codecs like AAC you may need to change current frame type if the following frame type has transients and previous frame type didn’t have them.

Anyway, back to the technical details about the encoders. MS Video 1 operates on 4×4 blocks that can be coded as skipped, filled with single colour, filled with two colours in a pattern, or split into 2×2 sub-blocks each filled with its own two colours in a pattern. Sounds perfect for median cut. Cinepak is much more complex. It splits frame into several strips and each strip is also split into 4×4 blocks that may be coded as skipped, single 2×2 YUV codeword (2×2 Y block and single U and V values) scaled twice or four YUV codewords from different codebook. Essentially for a good encoding you need to determine how to partition frame into strips optimally, split blocks into single and four-vector ones and find optimal codebooks for them separately. Since I wanted to write a working encoder mostly to check whether vector quantisation is working, I simply have fixed amount of strips and add every block as a candidate for both coding schemes without a following refining steps.

Here are some numbers if you really care about those. Input is laser05.avi (320×240 Indeo2 file with 196 video frames from the standard samples place). Encoding with MS Video 1 encoder takes about 4 seconds . Encoding Cinepak with median cut takes six seconds. Encoding Cinepak with ELBG and randomly-generated codebooks takes 36 seconds and result looks bad (but recognizable). Encoding Cinepak with ELBG that takes codebooks produced with median cut as the initial ones takes 68 seconds but the quality is higher than merely median cut and the output file is slightly smaller too.


Now with all of this done I should probably fix the knowingly bad decoders (RV6 and Bink2), add whatever missing decoders and features I see fit and start documenting it all. I have strong doubts about VDD this year but maybe I’ll be able to present my stuff at FOSDEM 2021.

NihAV: Now with Palette Support

May 31st, 2020

While NihAV had support for paletted formats before, now it has more use cases covered. Previously I could only decode paletted format and convert picture into some other format. Now it can handle palette in standard containers like AVI and MOV and even palette change in AVI (it’s done via NASideData which is essentially the same thing I NIHed more than nine years ago). In addition to that it can convert image into paletted format as well and below I’d like to give a brief review of methods employed.
Read the rest of this entry »

Monty Python & Quest for Holy Grail documented

May 19th, 2020

I’ve finally had time and documented my findings so if somebody is interested in it he can have at least something to start with.