Author Archive

MOV — Matroska of its time

Saturday, October 24th, 2020

Disclaimer: all container formats suck, either by being too simple and tied to certain (types of) codecs, too ineffective (by wasting too many bytes of frame metadata and headers compared to other formats), or too flexible and complicated to implement in full. And there’s Ogg.

Let’s start our story with old times. Back in the day Electronic Arts made probably the only two good things it’s ever made. I’m talking of course about Deluxe Paint and IFF container format (and that happened 35 years ago; it’s a pity the company still exists). The chunked approach to storage was reused by certain other companies. Remember RIFF that was used for storing audio (RMI or WAV), video (AVI) or pictures (WebP) among other things? Remember RealMedia? Remember QuickTime MOV? That’s the thing we’re going to talk about.

As you can guess all these containers are just a series of chunks, some chunks being a container for another list of chunks. MOV and MP4 are the exception because they have atoms and boxes correspondingly. Anyway, this structure allows you to put virtually anything and while some formats used that in moderation (WAV essentially in top-level RIFF chunk with header and data chunks, there may be user metadata present and that’s about it) some others abused that possibility as much as possible (no points for guessing).

Let’s quickly review MOV structure: essentially you have moov atom with the description of the container data (which includes atoms for each individual track description of course) and mdat chunk with actual data if you’re lucky. Sounds simple? The devil hides in the deeply-nested atoms.

First, MOV has a lot of features for specific groups of things like:

  • streaming—tracks can be joined into groups to signal that only one of them should be played depending on language/quality/decoding capabilities (kinda like what RMVB was famous for);
  • mastering—all those atoms for matte, kropping! clipping and notorious edit lists;
  • DVD-like playback—the less said about track referencing feature the better.

But what I said that you have mdat chunk with actual data “if you’re lucky”? Because of the wonderful feature called data reference which tells you where the actual data is stored: it may be the same file, the same file but different resource fork (for classic MacOS), different file; you can even get some URL to the resource with the data. And of course different tracks may be stored in different data locations. Flexibility!

Then you get to actual frames (or “samples” as they’re called). For certain reasons frames of the same track can be clustered together in a block (or “chunk” as the specification calls it). And to make things better frames can have different duration stored in a different table in run-length form i.e. “first N frames have duration X, then next M frames have duration Y, then next K frames…”. So in order to extract proper frame with a correct timestamp you just need to find out in which block it resides and with which offset, sum durations for the previous frames, apply information from track metadata (don’t forget the edit list!) and you’re done. Except when you have audio in PCM format or compressed with some standard fixed compression scheme (A-law/μ-law, IMA ADPCM, MACE 3:1/6:1). Then you need to calculate duration from size. And that’s one of the things I really hate in containers: they should be codec-agnostic, in other words you should be able to seek to a certain time point (with a given precision of course), extract frame data and feed it to a decoder that does not know which container it came from either. And of course Matroska is the worst violator as it employs codec-specific ways to shave off some of the frame payload (so if you don’t know how to reconstruct the frame data the decoder won’t recognize it either).

Anyway, having such nice format with so many features not needed by anybody 99% of the time it was a perfect fit for MPEG-4 container format. So MOV was adopted, some of its terminology was changed (as mentioned before atoms->boxes) and lots of new features were added as well both in metadata and file structure (like those segments for streaming).

And if by this point the title is still not clear to you, QuickTime MOV format appeared long before Matroska but conceptually it had many of the features present in the latter as well. To put it crooked, if you replace atoms with EBML entities and add ways to reduce payload (e.g. by generating missing frame headers) you can rename MOV to MKV and nobody will notice a difference.

P.S. It’s just I’m writing a second iteration of NihAV-based video player which should not just play single file continuously and maybe deadlock in process of doing that because of some SDL audio issue but a proper player that can play multiple files, pause and seek. And while testing seeking in MOV files I’ve discovered some of those wonderful issues that inspired me to write this post.

Hacking to solve adventure game puzzles…

Tuesday, October 20th, 2020

I love adventure games (or simply quests as they’re known where I came from) but sometimes the best way to pass some moment there is to cheat.

One of such instances is The Legend of Kyrandia – Book One which is a nice game I play again sometime but there’s the infamous maze there that it not fun. And in the old times I could work around it by hex-editing a savegame to give me the ever-glowing fireberry, stones and some other quest items so I did not have to explore the full maze.

And I thought this was a thing of the past until I tried to play Galador also known as The Prince and the Coward as it’s been supported by ScummVM (or should it be called CabalVM now?) since couple of years ago and I haven’t played it yet. Mostly it’s fine but there are three moments which I could not pass at all: in two instances you must quickly pick up an object and in the third one you need to throw a stone at a certain place three times while the cursor jitters (and repeat that three times).

My reflexes never were that great to begin with but playing the game with a touchpad instead of mouse made it impossible: when a menu with actions appears after the game reacts on your click it’s already too late to select an action and click it. So one solution would be to connect mouse and try until you pass. But I got lazy during those years (back in the day I could do all those timed sequences in Space Quest II and IV while SQ3 required an utility to slow down computer for the escape from pirates sequence). So I did something different: hacked the source code to show what action happens when I actually select that “pick up” object action and added a handler so when I press 'p' key it’d do the action without bothering with menus (or crash if you don’t move cursor to the proper object). And similarly for the stone-throwing scene I removed jitter and mapped 'p' to pick up a new stone.

It’s not something I’m proud of but it should be a good demonstration how you can work around certain game limitations if you have an access to its engine source code—even if it’s not something trivial like maximum ammo/thousands of resources/infinite health.

NihAV: audio player done

Wednesday, October 7th, 2020

As I wrote in my previous post, I had functioning audio player nearing completion. And now I’ve finally added all features I wanted to add and can call it done.

While previously I mostly ranted on the bloat introduced by the components authors, here I’d like to describe the design and the reasoning behind it.
(more…)

NihAV: towards an audio player

Sunday, October 4th, 2020

So after weeks of doing nothing and looking at lossless audio codecs (in no particular order) I’m going back to developing NihAV and more particularly an audio player.
(more…)

A brief look at Sonarc

Tuesday, September 29th, 2020

Recently The Mike asked if I can look at this format. In case you didn’t know, The Multimedia Mike is one of the under-appreciated founders of opensource multimedia, involved both in reverse engineering codecs and maintaining infrastructure for about two decades (for example this particular blog has been here for fifteen years thanks to him and his maintenance efforts). So of course I had to look at it even if out of sheer respect.

Sonarc is probably the first known lossless audio codec as the copyright mentions year 1992 as the first date (Shorten and VocPack appeared in 1993). Spoiler: it turned out to be closer to Shorten in design.

This was harder to RE because it was larger (decompressor was three to four times larger than VocPack) and the original was written in Borland Pascal with all the peculiarities it brings. By those peculiarities I mean mostly Pascal strings. Well, the code for manipulating them is annoying to parse but not too bad, the main problem is that they are put in the same segment with code right before the function that uses it and that confuses Ghidra which for some reason selects the segment with standard library routines for them instead (and uninitialised variables are not assigned to any segment at all). The write() implementation is also no fun.

Side note: back in the day Turbo Pascal was probably the best programming language for DOS and back in school at least two my schoolmates were doing crazy things with it (and Delphi later) which I couldn’t (and I was writing in C as I still do today). Yet somehow the popularity of the language vanished and I haven’t heard anything about them becoming famous programmers (neither did I but they had better chances). And the only modern project written in Pascal that I’m aware of is Hedgewars.

Anyway, let’s talk about the format itself. Sonarc can compress raw PCM, .voc and .wav into either its own format or into .wav and it supports both 8- and 16-bit audio.

From what I saw it uses the same approach: optionally applying the LPC filter and coding the residues. Residues can be coded with two different approaches: old one for 8-bit audio and new one for 8- and 16-bit audio. Old 8-bit audio coding uses one of eight different static Huffman codebooks or can code residues as raw bytes (and I can’t remember that many other codecs doing the same except for MLP and DT$-HD Lossless probably because why compress audio in that case). New 8-/16-bit coding still uses fixed codebooks but in a different fashion: now they simply code the number of bits for the residue. It does not look like the data is split into segments but I may be wrong (I/O is still not the easiest thing to get around there).

Overall it’s not a bad codec for its time and e.g. FLAC has not come that far away from it in concepts (except that it uses Rice codes and has independent frames plus partitioning inside individual frame for better compression). I hope though there are no older lossless audio codecs out there to be discovered (CCITT G.711 infinite-law with its fixed 1:1 compression does not count).

A look at VP1 and VP2

Saturday, September 26th, 2020

One of the issues with On2 VPx family is that they started it from VP3 while having four different TrueMotion codecs before that (it’s like the company was called Valve and not Duck at that time). But I wanted to look at some lossless audio codecs and there’s VocPack or VP for short which has versions 1 and 2. Bingo!

This is a very old lossless audio codec that appeared in 1993 along with Shorten and, as it turns out, originated the second approach to lossless audio compression. While Shorten was a simple format oriented on fast decoding and thus used fixed prediction (either LPC filter or even fixed prediction scheme) and Rice codes for residues (the same scheme used in FLAC and TAK), VocPack employed adaptive filter and arithmetic coding (the approach carried by LA, Monkey’s Audio, OptimFROG and such). And it was made for DOS and 8-bit audio! Well, version 2 added support for 16-bit but it seems to compress only high 8 bits of the sample anyway while transmitting low bits verbatim.

And it turned out to be my first real experience of using Ghidra with DOS executables. The main troubles were identifying library functions and dealing with pointers. Since it was compiled with Borland C++ 3.0 (who doesn’t remember it?) it was rather easy to decompile but library functions were not recognized (DOS executables don’t get much love these days…) but searching the disassembly for int 21h with Ralf Brown’s interrupts list at hand it was easy to identify calls for file operations (open/read/write/seek) and from those infer the stdio library functions using them and finally the code using all those getc()s. And of course segmented model makes decompiling fun, especially when decompiler can’t understand segment/offset variables being used separately. In result sometimes you recognize offset but you have to look at the data segment yourself to see what it refers to; even worse, for some local variables Ghidra seemed to assume wrong segment which resulted in variables in disassembly and decompiled output pointing to non-existent locations. Despite all of that it was rather easy to understand what unpacker for VP1 does. VP2 has only packer and no unpacker publicly available (feel free to trace the author and buy a copy from him that supports unpacking) plus it depends on those wrongly understood global variables more which prevented me from understanding how encoding a residue works there. In theory you should be able to set data segment manually but I don’t see a point on spending more than a couple of hours on REing the format.

It was a nice distraction though.

Lossless audio codecs were more advanced than I thought

Wednesday, September 23rd, 2020

As I’d mentioned in a previous post on lossless audio codecs, I wanted to look at some of them that are still not reverse engineered for documentation sake. And I did exactly that so now entries on LA, OptimFROG and RK Audio are not stubs any more but rather contain some information on how the codecs work.

And if you look at LA structure you see a lot of filters of various sizes and structure. Plus an adaptive weight used to select certain parameters. If you look at other lossless audio codecs with high compression and slow decoding like OptimFROG or Monkey's Audio you’ll see the same picture: several filters of different kinds and sizes layered over each other plus adaptive weights also used in residuals coding. Of course that reminded me of AV2 and more specifically about neural networks. And what do you know, Monkey's Audio actually calls its longer filters neural networks (hence the name NNFilter.h in the official SDK and you can spot it in the version history as well leaving no doubts that it’s exactly the neural networks it is named after).

Which leads me to the only possible conclusion: lossless audio codecs had been using neural networks for compression before it became mainstream and it gave them the best compression ratios in the class.

And if we apply all this knowledge to video coding then maybe in AV4 we’ll finally see some kind of convolution filters processing whole tiles and then the smaller blocks removing spatial redundance maybe with some compaction layers like many neural network designs have (or transforms for largest possible block size in H.265/AV1/AVS2) and expansion layers (well, what do you think motion interpolation actually does?) and using RNNs to code residues left from all the prediction.

Why Rust is not a mature programming language

Friday, September 18th, 2020

While I have nothing against Rust as such and keep writing my pet project in Rust, there are still some deficiencies I find preventing Rust from being a proper programming language. Here I’d like to present them and explain why I deem them as such even if not all of them have any impact on me.
(more…)

A Modest Proposal for AV2

Wednesday, September 16th, 2020

Occasionally I look at the experiments in AV1 repository that should be the base for AV2 (unless Baidu rolls out VP11 from its private repository to replace it entirely). A year ago they added intra modes predictor based on neural network and in August they added a neural network based loop filter experiment as well. So, to make AV2 both simpler to implement in hardware and improve its compression efficiency I propose to switch all possible coding tools to use misapplied statistics. This way it can also attract more people from the corresponding field to compensate the lack of video compression experts. Considering the amount of pixels (let alone the ways to encode them) in a modern video it is BigData™ indeed.

Anyway, here is what I propose specifically:

  • expand intra mode prediction neural networks to predict block subdivision mode and coding mode for each part (including transform selection);
  • replace plane intra prediction with a trained neural network to reconstruct block from neighbours;
  • switch motion vector prediction to use neural network for prediction from neighbouring blocks in current and reference frames (the schemes in modern video codecs become too convoluted anyway);
  • come to think about it, neural network can simply output some weights for mixing several references in one block;
  • maybe even make a leap and ditch all the transforms for reconstructing block from coefficients directly by the model as well.

In result we’ll have a rather simple codec with most blocks being neural networks doing specific tasks, an arithmetic coder to provide input values, some logic to connect those blocks together, and some leftover DSP routines but I’m not sure we’ll need them at this stage. This will also greatly simplify the encoder as well as it will be more of a producing fitting model weights instead of trying some limited encoding combinations. And it may also be the first true next generation video codec after H.261 paving the road to radically different video codecs.

From hardware implementation point of view this will be a win too, you just need some ROM and RAM for models plus a generic tensor accelerator (which become common these days) and no need to design those custom DSP blocks.

P.S. Of course it may initially be slow and work in a range of thousands FPS (frames per season) but I’m not going to use AV1 let alone AV2 so why should I care?

Revisiting lossless codecs…

Sunday, September 6th, 2020

I’ve decided to add a couple of lossless audio formats in a preparation for a long-term goal of having a NihAV-based player (the debug tool nihav-player that I currently have can’t really count for one especially considering how it does not play pure audio files and tends to deadlock in SDL audio thread).

So I’ve added nihav-llaudio crate with four most common formats for music I have, namely FLAC, Monkey’s Audio, TTA and WavPack. And I guess it’s time to revisit my opinion about various lossless audio formats now that I’ve (re)implemented support for some of them (I tried to summarise my views about them almost ten years ago). Let’s see what has changed since then:

  • I had a closer look at MPEG-4 ALS and it turned out to be rather interesting (and probably the only lossless audio codec with P-frames) but it also has somewhat insane options (like maximum prediction order of 1023 for LPC; or coding the whole file with just one I-frame and the rest being P-frames so no seeking is possible) and RLSLMS being broken (the reference decoder can’t decode the official reference samples) and it got no popularity at all;
  • TTA turned out to be very simple with a baffling rationale

    The sample count in a TTA1 frame is a multiple to 576 (sound buffer granule). Based on this, the “frame time” is defined as a constant 1.04489795918367346939. Thus, the sample count in a regular TTA1 frame determined as: regular TTA1 frame length = frame time * sample rate.

    I’m no mathematician so this does not form a coherent logical chain for me, I’d use something like “frame length in samples is sample rate rounded up to multiple of 576” instead of “sample rate multiplied by 256/245”. The main irritating point is that last frame contains less samples and you need to signal that it’s last frame (or merely check if you have enough bits left to decode a full frame after you decoded enough samples for the last frame). Oh, and TTA2 seems to be still in development.

  • And speaking about codecs in development, I don’t see new lossless audio codecs appearing after 2010. Either I got too old and don’t spot them or the interest has finally faded out. This might be because most people don’t buy music any more but rather rent it in some online store or use some streaming service. And those who still do probably use one of the old established codecs like FLAC.
  • And since I’ve mentioned it, my opinion on it has not changed and only got a bit more refined not that I have a decoder for it as well. Previously I thought FLAC is a simple format with a bad bitstream format that makes seeking hard. Now I know that FLAC is a simple format (fixed predictor or LPC up to order 32 and fixed Rice codes; the only thing that improves compression is splitting residues into partitions where optimal k for coding them can be selected) with horribly designed bitstream format.

    Normally lossless audio formats either store offsets for each frame or have an easily recognizable header, but FLAC is different. It’s obvious that the author was inspired by MPEG audio header design but those actually had frame sizes coded. Here in order to find where the frame ends you need either to decode it or calculate CRC for the data you read (and in the likely case of false positives also check that the data is followed by a valid header). One could argues that there’s often a seek table in FLAC file but for e.g. in luckynight.flac those entries are for multiples of ten seconds positions, making seeking to a more precise position a task of skipping frames (which is fun—see above).

  • WavPack is still the best designed format in my opinion though it would be nicer to have some initial header with various metadata instead of having it stored in the first block. Other than that still no objections.
  • And it turns out there’s lossless AAC compression that employs wavelet transform before LPC (it’s Chinese AAC though so who cares).

I remember reading somewhere (on Hydrogenaudio most likely) a brief story about development of several popular lossless audio codecs (even told by the author of one but I might be wrong). Essentially it’s not a NIH syndrome but very close: somebody develops a format, another guy finds a minor flaw the original developer refuses to address (my memory is hazy but I think there were such things mentioned as no plugin for some player or not supporting some tags) and develops another format. The amount of formats that came to existence because somebody wanted to create a format and could not keep it to himself is pretty large too.

But those days seem to be over and maybe I’ll reverse engineer some of those old codecs for documenting reasons as there’s very little risk that somebody would pick them up and make widespread now. Alternatively I can rant on newer formats sucking as well. Though why wait, let’s do it now:

  • AAC sucks because of the countless extensions and attempts to bundle various coding approaches under the same name (fun fact – “xHE-AAC” is actually pronounced as “MPEG-D you-suck”);
  • AV1 sucks because of the organisational structure and their decisions during (and after) the design stage;
  • AV2 is not here yet but it sucks for the same reason;
  • BlueTooth audio codecs suck in various ways (except SBC, it’s okay for the purpose), especially because of marketing them as high-definition and robust while in reality they rarely are;
  • Chinese codecs suck for being rip-offs of better-known codecs. It’s especially gross that one of them got standardised as IEEE 1857.2 AAC;
  • H.264 sucks because of countless extensions;
  • H.265 inherited some from H.264 and added the licensing situation on top of that;
  • MPEG-5 EVC sucks because it’s a Frankenstein monster constructed from bits from H.263-H.265;
  • Opus sucks for being designed for streaming case and used elsewhere;
  • Vector-based codecs suck because current tools are still not good enough to autovectorise complex shapes and recognize gradients.

Now back to doing nothing.