Author Archive

Why Rust is not a mature programming language

Friday, September 18th, 2020

While I have nothing against Rust as such and keep writing my pet project in Rust, there are still some deficiencies I find preventing Rust from being a proper programming language. Here I’d like to present them and explain why I deem them as such even if not all of them have any impact on me.

A Modest Proposal for AV2

Wednesday, September 16th, 2020

Occasionally I look at the experiments in the AV1 repository that should be the base for AV2 (unless Baidu rolls out VP11 from its private repository to replace it entirely). A year ago they added an intra mode predictor based on a neural network, and in August they added a neural-network-based loop filter experiment as well. So, to make AV2 both simpler to implement in hardware and more efficient at compression, I propose to switch all possible coding tools to use misapplied statistics. This way it can also attract more people from the corresponding field to compensate for the lack of video compression experts. Considering the number of pixels (let alone the ways to encode them) in a modern video it is BigData™ indeed.

Anyway, here is what I propose specifically:

  • expand intra mode prediction neural networks to predict block subdivision mode and coding mode for each part (including transform selection);
  • replace plane intra prediction with a trained neural network to reconstruct block from neighbours;
  • switch motion vector prediction to use neural network for prediction from neighbouring blocks in current and reference frames (the schemes in modern video codecs become too convoluted anyway);
  • come to think about it, neural network can simply output some weights for mixing several references in one block;
  • maybe even make a leap and ditch all the transforms for reconstructing block from coefficients directly by the model as well.

As a result we’ll have a rather simple codec with most blocks being neural networks doing specific tasks, an arithmetic coder to provide input values, some logic to connect those blocks together, and some leftover DSP routines, though I’m not sure we’ll need them at this stage. This will also greatly simplify the encoder, as it will be more a matter of producing fitting model weights than of trying some limited set of encoding combinations. And it may also be the first true next-generation video codec after H.261, paving the road to radically different video codecs.

From a hardware implementation point of view this will be a win too: you just need some ROM and RAM for models plus a generic tensor accelerator (which are becoming common these days), with no need to design those custom DSP blocks.

P.S. Of course it may initially be slow and work in a range of thousands of FPS (frames per season) but I’m not going to use AV1, let alone AV2, so why should I care?

Revisiting lossless codecs…

Sunday, September 6th, 2020

I’ve decided to add a couple of lossless audio formats in preparation for a long-term goal of having a NihAV-based player (the debug tool nihav-player that I currently have can’t really count as one, especially considering how it does not play pure audio files and tends to deadlock in the SDL audio thread).

So I’ve added the nihav-llaudio crate with the four most common formats for the music I have, namely FLAC, Monkey’s Audio, TTA and WavPack. And I guess it’s time to revisit my opinion about various lossless audio formats now that I’ve (re)implemented support for some of them (I tried to summarise my views about them almost ten years ago). Let’s see what has changed since then:

  • I had a closer look at MPEG-4 ALS and it turned out to be rather interesting (and probably the only lossless audio codec with P-frames) but it also has somewhat insane options (like maximum prediction order of 1023 for LPC; or coding the whole file with just one I-frame and the rest being P-frames so no seeking is possible) and RLSLMS being broken (the reference decoder can’t decode the official reference samples) and it got no popularity at all;
  • TTA turned out to be very simple with a baffling rationale:

    The sample count in a TTA1 frame is a multiple to 576 (sound buffer granule). Based on this, the “frame time” is defined as a constant 1.04489795918367346939. Thus, the sample count in a regular TTA1 frame determined as: regular TTA1 frame length = frame time * sample rate.

    I’m no mathematician so this does not form a coherent logical chain for me; I’d use something like “frame length in samples is sample rate rounded up to a multiple of 576” instead of “sample rate multiplied by 256/245”. The main irritating point is that the last frame contains fewer samples and you need to signal that it’s the last frame (or merely check if you have enough bits left to decode a full frame after you have decoded enough samples for the last frame). Oh, and TTA2 seems to be still in development.

  • And speaking about codecs in development, I don’t see new lossless audio codecs appearing after 2010. Either I got too old and don’t spot them or the interest has finally faded out. This might be because most people don’t buy music any more but rather rent it in some online store or use some streaming service. And those who still do probably use one of the old established codecs like FLAC.
  • And since I’ve mentioned FLAC, my opinion on it has not changed and only got a bit more refined now that I have a decoder for it as well. Previously I thought FLAC is a simple format with a bad bitstream format that makes seeking hard. Now I know that FLAC is a simple format (fixed predictor or LPC up to order 32 and fixed Rice codes; the only thing that improves compression is splitting residues into partitions where the optimal k for coding them can be selected) with a horribly designed bitstream format.

    Normally lossless audio formats either store offsets for each frame or have an easily recognizable header, but FLAC is different. It’s obvious that the author was inspired by MPEG audio header design but those actually had frame sizes coded. Here in order to find where the frame ends you need either to decode it or calculate the CRC for the data you read (and in the likely case of false positives also check that the data is followed by a valid header). One could argue that there’s often a seek table in a FLAC file but e.g. in luckynight.flac those entries are for positions at multiples of ten seconds, making seeking to a more precise position a task of skipping frames (which is fun, see above).

  • WavPack is still the best designed format in my opinion though it would be nicer to have some initial header with various metadata instead of having it stored in the first block. Other than that still no objections.
  • And it turns out there’s lossless AAC compression that employs wavelet transform before LPC (it’s Chinese AAC though so who cares).
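
The TTA frame length rule quoted in the list above is easy to check numerically. Here is a minimal Python sketch of it; the function names are mine and truncating the product to an integer is an assumption, since the quoted description does not say how the result is rounded:

```python
# The "frame time" constant 1.04489795918367346939 is 256/245,
# so integer arithmetic is enough.

def tta_frame_length(sample_rate):
    """Samples in a regular frame: sample rate * 256 / 245, truncated (assumed)."""
    return 256 * sample_rate // 245

def tta_frame_layout(sample_rate, total_samples):
    """Return (number of frames, samples in the last frame) for a stream."""
    frame_len = tta_frame_length(sample_rate)
    full, rest = divmod(total_samples, frame_len)
    if rest == 0:
        # The stream ends exactly on a frame boundary (or is empty).
        return full, (frame_len if full else 0)
    # The last frame is shorter than the regular ones.
    return full + 1, rest

if __name__ == "__main__":
    for rate in (44100, 48000):
        length = tta_frame_length(rate)
        print(rate, length, "multiple of 576:", length % 576 == 0)
```

Under that assumption 44100 Hz gives 46080 samples per frame, which is 80 × 576, while 48000 Hz gives 50155, which is not a multiple of 576 at all.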

I remember reading somewhere (on Hydrogenaudio most likely) a brief story about the development of several popular lossless audio codecs (even told by the author of one but I might be wrong). Essentially it’s not a NIH syndrome but very close: somebody develops a format, another guy finds a minor flaw the original developer refuses to address (my memory is hazy but I think there were such things mentioned as no plugin for some player or not supporting some tags) and develops another format. The number of formats that came into existence because somebody wanted to create a format and could not keep it to himself is pretty large too.

But those days seem to be over and maybe I’ll reverse engineer some of those old codecs for documenting reasons as there’s very little risk that somebody would pick them up and make them widespread now. Alternatively I can rant on newer formats sucking as well. Though why wait, let’s do it now:

  • AAC sucks because of the countless extensions and attempts to bundle various coding approaches under the same name (fun fact – “xHE-AAC” is actually pronounced as “MPEG-D you-suck”);
  • AV1 sucks because of the organisational structure and their decisions during (and after) the design stage;
  • AV2 is not here yet but it sucks for the same reason;
  • Bluetooth audio codecs suck in various ways (except SBC, it’s okay for the purpose), especially because of marketing them as high-definition and robust while in reality they rarely are;
  • Chinese codecs suck for being rip-offs of better-known codecs. It’s especially gross that one of them got standardised as IEEE 1857.2 AAC;
  • H.264 sucks because of countless extensions;
  • H.265 inherited some from H.264 and added the licensing situation on top of that;
  • MPEG-5 EVC sucks because it’s a Frankenstein monster constructed from bits from H.263-H.265;
  • Opus sucks for being designed for streaming case and used elsewhere;
  • Vector-based codecs suck because current tools are still not good enough to autovectorise complex shapes and recognize gradients.

Now back to doing nothing.

A Quick Review of Actimagine Video Codecs

Sunday, August 23rd, 2020

Now that (as I believe) I’ve fixed remaining reconstruction bugs in VX decoder, why not do a quick comparison of various video codecs developed by Actimagine and see how they differ (if at all).

There seem to be the following codecs:

  • Actimagine (VX)
  • Mobiclip (Mods)
  • Mobiclip (Moflex for 3DS; there’s also a version of it for PC known as Mobiclip HD)

And while they all are based on H.264 with finer block partitioning, there are some differences as well.

Proper structure. The original VX codec used a quantiser derived from FPS and all frames were encoded in the same way, while the later codecs have I-frames and quantisers are transmitted for each frame (as a delta for non-keyframes).

References and motion compensation. VX had three previous frames as reference ones, later codecs increased that number to five. VX had fullpel motion compensation, later codecs use halfpel MC.

Data coding. VX relied on Elias gamma codes for all codes except coefficient coding, later codecs use codebooks for most coded values. Also while VX coded residue in 4×4 blocks in the H.264 way (starting from the end and with the tail of ones coded explicitly), newer codecs use separable transforms and the usual (zero run, coefficient level) coding. Additionally only nine coding modes out of twenty-four have survived after VX (intra prediction, MC with motion vectors coded and splits).
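
For reference, the textbook Elias gamma code writes a positive integer n as N zero bits followed by the (N+1)-bit binary form of n, where N = floor(log2(n)). Below is a minimal Python sketch of that textbook form operating on strings of '0' and '1'; the exact variant, bit order and value mapping VX uses are not described here, so everything in it (including the function names) is an illustration rather than the VX bitstream:

```python
def elias_gamma_encode(n):
    """Encode a positive integer as a textbook Elias gamma bit string."""
    if n < 1:
        raise ValueError("Elias gamma codes are defined for n >= 1")
    binary = bin(n)[2:]                 # binary form without the '0b' prefix
    return "0" * (len(binary) - 1) + binary

def elias_gamma_decode(bits, pos=0):
    """Decode one value starting at bits[pos]; return (value, new position)."""
    zeros = 0
    while bits[pos] == "0":             # count the unary length prefix
        zeros += 1
        pos += 1
    end = pos + zeros + 1               # the value occupies zeros + 1 bits
    if end > len(bits):
        raise ValueError("truncated code")
    return int(bits[pos:end], 2), end

if __name__ == "__main__":
    stream = "".join(elias_gamma_encode(v) for v in (1, 5, 17))
    pos = 0
    while pos < len(stream):
        value, pos = elias_gamma_decode(stream, pos)
        print(value)
```

Small values get short codes (1 takes one bit, 5 takes five), which is why such codes need no stored codebook at all.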

Overall, while all those codecs are related, there are large differences between VX and later Mobiclip variants and the only differences between Mobiclip variants are the colourspace (Mods uses the YCoCg model, HD uses the proper YUV model), the quantiser being clipped to the 12-52 range, and the block mode codebooks being different.

As I mentioned before, somebody has reverse-engineered decoders for Mobiclip (and a quick check on codebooks used tells me that Mobiclip HD and 3DS versions are the same) so if somebody needs them it should not be that hard to write a decoder.

A look at some old game

Wednesday, August 19th, 2020

Sometimes I like to play old strategy games from my youth: Civilization II, Settlers II, WarCraft II and Reunion. You have probably never heard of the last one since it’s not from some famous studio but from some Hungarians, and it was published by a rather obscure publisher too.

The idea is about the same as in Settlers II but IN SPACE! In some near future an experimental spaceship somehow gets into an unknown star system, most technologies are lost and now you have to colonise planets, fight with aliens and find your way back home. This game combines some planet-building with space exploration and ground battles (there are also battles in space but they’re fought without your involvement). And since it has a story you have events like getting a chance to get some technology or break the alliance between your enemies. So it’s an interesting mix overall and it explains why I still return to it from time to time. Sadly the game was programmed in the traditional Hungarian manner (remember, Hungarians are responsible for such popular software as Windows 95 or MPlayer) and its intro (a separate program) sometimes crashes and sometimes even makes DOSBox segfault. The main game is also prone to corruptions and crashes (yet I still play it sometimes).

Anyway, today I’ve stumbled upon a page of one guy who reverse-engineered image format used in this game just by fiddling with it. It turned out to be compressed with RLE similar to the one used in PCX (0x00-0xBF – normal pixel, 0xC0-0xFF – run of next byte value 0-63 times). Since the game had some animations as well I decided to look at them.
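
The still-image scheme described above fits in a few lines. A minimal Python sketch, assuming the count is simply the low six bits of the code byte (the function name is mine):

```python
def decode_rle(data):
    """Decode PCX-like RLE: 0x00-0xBF is a literal pixel,
    0xC0-0xFF means 'repeat the next byte (code & 0x3F) times'."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        code = data[pos]
        pos += 1
        if code < 0xC0:
            out.append(code)            # plain pixel value
        else:
            if pos >= len(data):
                raise ValueError("run code without a pixel value")
            run = code & 0x3F           # 0..63 repetitions
            out.extend(bytes([data[pos]]) * run)
            pos += 1
    return bytes(out)

if __name__ == "__main__":
    packed = bytes([0x01, 0xC3, 0xFF, 0x02])
    print(decode_rle(packed).hex())
```

One consequence of the scheme is that a single pixel with a value of 0xC0 or above cannot be stored as a literal and has to be written as a run of one (0xC1 followed by the value).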

So intro uses mostly still images split into 640×100 strips (so they can fit into one segment if you remember those) that are scrolled and faded in and out. And there’s a special animation format for some in-game animations similar to the picture format (as expected). Animation file is a series of frames (without palette) that are coded with similar RLE but there are some quirks not encountered in still images. First of all, frames are coded as differences and codes in range 0x80-0xBF are used to signal how many pixels to skip. Second, it turns out that codes 0x80 and 0xC0 are actually escape codes and are followed by 16-bit value of actual skip or run length (and in case of 0xC0 code a pixel value after that). Again, since the format is so simple it could be found just by looking inside the animation files and messing with a decoder.
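
The animation frame coding described above can be sketched in the same way. This is a minimal Python sketch under several assumptions that the description leaves open: codes 0x00-0x7F are literal pixels, short skip and run counts are the low six bits of the code, the 16-bit escape values are little-endian, and the frame is a flat byte array without line padding. The function name is mine as well:

```python
import struct

def decode_delta_frame(data, prev_frame):
    """Apply one RLE-coded difference frame on top of the previous frame."""
    frame = bytearray(prev_frame)
    src = 0                             # read position in the packed data
    dst = 0                             # write position in the frame
    while src < len(data):
        code = data[src]
        src += 1
        if code < 0x80:                 # literal pixel (assumed range)
            count, value = 1, code
        else:
            if code in (0x80, 0xC0):    # escape: 16-bit count follows
                (count,) = struct.unpack_from("<H", data, src)
                src += 2
            else:
                count = code & 0x3F
            if code < 0xC0:             # skip: keep pixels from prev_frame
                dst += count
                continue
            value = data[src]           # run: pixel value follows the count
            src += 1
        if dst + count > len(frame):
            raise ValueError("frame data overruns the picture")
        frame[dst:dst + count] = bytes([value]) * count
        dst += count
    return bytes(frame)

if __name__ == "__main__":
    packed = bytes([0x05, 0x83, 0xC2, 0x90, 0x80, 0x02, 0x00, 0x07])
    print(decode_delta_frame(packed, bytes(10)).hex())
```

Since the frames carry no palette, the result is still a plane of palette indices that has to be mapped through whatever palette the game has active at that point.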

As for the other games mentioned in the beginning, Civ2 has GIF files mostly hidden inside resource .dlls plus Indeo 4 video (with transparency even!) and Settlers II and WarCraft II have videos in Smacker format.

Having said that, my pointless diversion to looking at game formats is over, back to doing nothing!

NihAV relicensed code registry

Monday, August 17th, 2020

Since I’ve got a second request for decoder relicensing I’ve decided to keep an open list of the projects that requested relicensing. This way it may satisfy somebody’s curiosity about which parts of NihAV piqued some interest and also serve as proof for a project that I granted them a new license for the code.

The page is right here.

Actimagine VX: another imperfect decoder

Thursday, August 13th, 2020

So I’ve released my decoder for Actimagine VX and it’s far from perfect.

The first problem is audio. While the codec itself is not that tricky (it turned out to be some LPC codec that takes 5-10 16-bit words per frame to code pulses and a filter for a 128-sample frame), its data is stored right after the video frame data, so in order to decode audio you first need to decode the video frame and feed the remains of the input buffer to the audio decoder. Since I can’t do that in a sane way I could not test the decoder either and it’s there for informative purposes only.

The second problem is obviously video. I’ve managed to decode bitstream fine but reconstructed images are not bit-exact and in case of plane prediction this leads to ugly artefacts (essentially the target value wraps around and you have gradients from white to black or vice versa instead of almost flat dark or white regions). I’ve introduced a clipping which seems to help but this is not right and maybe I’ll fix it one day. Maybe even before Bink2.

And finally there are some problems with the demuxer. In theory VX files may have multiple tracks but my demuxer might not handle them at all and if it does then it’ll simply ignore anything but the first video stream.

So VX support is far from perfect but it serves its goal of proving that the format works as expected. And if it’s useful to anybody then it’s even better.

Some words about Bink2

Sunday, August 9th, 2020

As you may know (but definitely not care), NihAV has some limited support for Bink2 video. The problem in fixing it is that known samples are usually 720p video or more, which makes it hard to debug decoding past the few initial frames (okay, older versions have smaller known videos so they’re likely to be fixed sooner). And of course the encoder is available only to the RAD customers, to which I don’t belong. So as a result I’ve decided to look at the Actimagine VX codec once again.

I looked at it four years ago but I could only study it and not write a decoder because of the binary. Essentially this codec is used on BigN DS consoles so you have to deal with a raw ARM7 or ARM9 binary that (as it turns out) sets up its own segments (and the problems arise when you see absolute addresses to areas not present there). So you load the binary at addresses e.g. 0x2000000-0x20e1030 but in reality it also contains segments 0x1ffe800-0x1fff000 and 0x27e0000-0x27e4000. Thankfully Ghidra can not just load a raw ARM binary but also add aliases to data as new segments. This allowed me to work on the decoder again and now I have a more or less complete understanding of it and a semi-working decoder for it as well, here’s an example:

Sample decoded frame.

Essentially it’s a simplified variant of H.264 with the following features: frames are split into 16×16 macroblocks that can be further recursively divided horizontally or vertically down to 2×2 blocks. A block can be coded in 24 different modes that boil down to full-pel motion compensation from one of three previous frames (without a motion vector, with a motion vector, or with a motion vector and an offset value that should be added to each pixel), intra prediction on the whole block or intra prediction in 4×4 blocks. Whether you have residue coded is also part of the mode (e.g. mode 11 is intra prediction without residue and mode 22 is intra prediction with residue). Residue is coded in 8×8 blocks comprising six 4×4 coefficient blocks, and each block is coded in a way reminiscent of H.264: there are numbers for the total number of non-zero coefficients, the number of last non-zero coefficients being plus-minus one and the number of zeroes dispersed between non-zero coefficients. Those being coded with variable-length codes that I could not access earlier was the blocker but not any more.

And there’s one curious feature of this codec that made it worth REing: instead of using plane prediction like H.264, this codec fills a block in a recursive way. It interpolates the bottom-right corner as an average of the top-right and bottom-left neighbour pixels (e.g. [15,-1] and [-1,15] for a 16×16 block; it also adds a delta to it in certain decoding modes), then it calculates the halfway-bottom right and halfway-right bottom pixels (e.g. [15,7] and [7,15] for a 16×16 block), then a centre pixel, and then repeats the process for each quarter (or half for some rectangular blocks). This is less computationally intensive than ordinary plane prediction and it seems to give nice results too.

I mentioned before that my decoder is far from perfect (and you can see it for yourself on that picture) but I know how to debug and improve it. I’m not trying to say that piracy is okay, but being able to find some .nds image with a game that has VX videos and using it with DeSmuME with GDB stub would help to debug the decoder but piracy is bad and so it’s not a proper way to do things.

As for the audio counterpart, I should mention this: curiously enough there’s an open-source decoder for later MobiClip formats that seems to contain a working Sx decoder for the audio used in VX files (it’s a pity the person who did it could not finish VX as well: why should I do the work myself instead of letting other people do my work for me?!). Unfortunately it’s mostly translated assembly so while it should work it’s mostly sub_XXX() doing various accesses to various positions of a large byte array of decoder state. I’ll probably add it as well for completeness’ sake and document the formats properly after I fix the decoder (which should happen during this year too).

A Quality Video Hosting

Friday, July 31st, 2020

A brief context: I watch videos from BaidUTube (name slightly altered just because) and my preferred way to do that is to grab video files with youtube-dl in 720p quality so I can watch them later at my leisure, in the way I like (i.e. without a browser), and re-watch them later even if they’re taken down. It works fine but in recent weeks I’ve noticed that some of the downloaded videos are unplayable. Of course this can be fixed by downloading the video again in a slightly different form (separate video and audio streams muxed locally, youtube-dl can do that) but today I was annoyed enough to look at the problem.

In case it’s not obvious I’m talking about MP4 files encoded and muxed at BaidUTube without any modifications by youtube-dl, which merely downloaded them. So, what’s the problem?

Essentially an MP4 file contains a header with metadata telling at which offset and of which size the frames for each codec are, and the actual data is stored in the mdat atom. Not here. First you have lots of 12-byte sequences 90 00 00 00 00 0X XX XX 00 02 XX XX, then a moof atom (used in fragmented MP4) and then another mdat. And another. I’ve tried to avoid streaming stuff but even to me it looks like somebody put all fragments prepared for HLS streaming into a single MP4 file, making an unplayable mess.
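
A quick way to see such a layout is to walk the top-level atoms and print their types. A minimal Python sketch follows; the atom header layout (32-bit big-endian size plus four-character type, size 1 meaning a 64-bit size follows, size 0 meaning "until the end of file") is the standard one, while the function names and the "has a moof atom" heuristic are mine. It only walks well-formed atoms and simply stops at the first thing that does not parse:

```python
import struct

def list_top_level_atoms(data):
    """Return a list of (type, offset, size) for top-level MP4 atoms."""
    atoms = []
    pos = 0
    while pos + 8 <= len(data):
        size, kind = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:                   # 64-bit size stored after the type
            if pos + 16 > len(data):
                break
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:                 # atom extends to the end of the file
            size = len(data) - pos
        if size < header:               # not a sane atom, give up
            break
        atoms.append((kind.decode("latin-1"), pos, size))
        pos += size
    return atoms

def looks_fragmented(atoms):
    """Heuristic: any top-level moof atom means fragmented MP4."""
    return any(kind == "moof" for kind, _, _ in atoms)

if __name__ == "__main__":
    sample = (b"\x00\x00\x00\x10ftypisom\x00\x00\x00\x00"
              b"\x00\x00\x00\x08moof"
              b"\x00\x00\x00\x0cmdatabcd")
    atoms = list_top_level_atoms(sample)
    print(atoms, looks_fragmented(atoms))
```

For real files it is better to read just the headers and seek over the payloads instead of loading the whole file into memory, but the walking logic stays the same.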

Overall this happens only on a few random videos and probably most browsers would not pick this format (since VP9 or VP10 in WebMKV is the suggested one) so I don’t expect it to be fixed. My theory is that they decided to roll out a new version of the encoding software with a broken muxer library or muxing mode. And if you ask “What were they thinking? You should run at least some tests to see if it encodes properly.”, one wise guy has an answer for you: they weren’t thinking about that, they were thinking about how long it is until the lunch break and then about when it’s time to go home. This is the state of enterprise software and I have no reason to believe the situation will ever improve.

And there’s a fact that may be related to it. Random files starting from 2019 may also show the marker “x264 – core 155 r2901 7d0ff22” in the encoded frames while most of the files have no markers at all. While I don’t think they violate the license, it still looks strange that a company known for not admitting that it uses open-source projects (“for their own protection” as it was explained once) lets such a marker slip through.

Well, that was an even more useless rant than usual.

A Quick Look on LCEVC

Wednesday, July 29th, 2020

As you might’ve heard, MPEG is essentially no more. And the last noticeable thing related to video coding it did was MPEG-5 (and synthesising actors and issuing commands to them with the unholy unity of MPEG-G and MPEG-4 standards). As a result we have an abuse of the letter ‘e’: in HEVC, EVC and LCEVC it means three different things. I’ll talk about VVC probably when the AV2 specification is available, EVC is slightly enhanced AVC and LCEVC is interesting. And since I was able to locate the DIS for it, why not give a review of it?

LCEVC is based on Perseus and as such it’s still an interesting concept. For starters, it is not an independent codec but an enhancement layer to add scalability to other video codecs, somewhat like video SBR but hopefully it will remain more independent.

A good deal of specification is copied from H.264 probably because nobody in the industry can take a codec without NALs, SEIs and HRD seriously (I know their importance but here it still feels excessive). Regardless, here is what I understood from the description while suffering from thermal throttling.

The underlying idea is quite simple and hasn’t changed since Perseus: you take a base frame, upscale it, add the high-frequency differences and display the result. The differences are first grouped into 4×4 or 8×8 blocks, transformed with a Walsh-Hadamard matrix or a modified Walsh-Hadamard matrix (with some coefficients being zeroed out), quantised and coded. Coding is done in two phases: first there is a compaction stage where coefficients are transformed into a byte stream with flags for zero runs and large values (or RLE just for zeroes and ones) and then it can be packed further with Huffman codes. I guess that there are essentially two modes: a faster one where coefficient data is stored as bytes (with or without RLE) and a slightly better compressed mode where those values are further packed with Huffman codes generated per tile.
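
To illustrate the transform step, here is a minimal Python sketch of a plain 4×4 Walsh-Hadamard transform and its inverse on a block of differences. It uses the textbook Hadamard matrix in natural order, not the exact matrices, scaling or coefficient ordering from the LCEVC draft, and it leaves out quantisation and coding entirely; all names are mine:

```python
def wht4(v):
    """4-point Walsh-Hadamard transform done with butterflies."""
    a, b, c, d = v
    s0, s1 = a + b, c + d
    d0, d1 = a - b, c - d
    return [s0 + s1, d0 + d1, s0 - s1, d0 - d1]

def wht4x4(block):
    """Separable 2D transform: rows first, then columns."""
    rows = [wht4(row) for row in block]
    cols = [wht4(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]    # transpose back

def inverse_wht4x4(coeffs):
    """The matrix is its own inverse up to a factor of 4 per dimension."""
    return [[x // 16 for x in row] for row in wht4x4(coeffs)]

def add_residual(upscaled, residual, max_value=255):
    """Add decoded differences to the upscaled base block and clip."""
    return [[min(max(p + r, 0), max_value) for p, r in zip(prow, rrow)]
            for prow, rrow in zip(upscaled, residual)]

if __name__ == "__main__":
    diff = [[1, -2, 0, 3], [0, 0, 4, -1], [2, 2, -3, 0], [0, 1, 0, 0]]
    coeffs = wht4x4(diff)
    print(coeffs)
    print(inverse_wht4x4(coeffs) == diff)
```

Such a transform needs only additions and subtractions, which makes it cheap to run next to a real decoder.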

Overall this looks like a neat scheme and I hope it will have at least some success. No, not to prove Chiariglione’s new approach for introducing new codecs an industry can use without complex patent licensing, but rather because it might be the only recent major video codec built on principles different from H.26x line and its success may introduce more radically different codecs and my codec world will get less boring.