Archive for the ‘Useless Rants’ Category

MOV — Matroska of its time

Saturday, October 24th, 2020

Disclaimer: all container formats suck, either by being too simple and tied to certain (types of) codecs, too ineffective (by wasting too many bytes of frame metadata and headers compared to other formats), or too flexible and complicated to implement in full. And there’s Ogg.

Let’s start our story with old times. Back in the day Electronic Arts made probably the only two good things it’s ever made. I’m talking of course about Deluxe Paint and the IFF container format (and that happened 35 years ago; it’s a pity the company still exists). The chunked approach to storage was reused by certain other companies. Remember RIFF that was used for storing audio (RMI or WAV), video (AVI) or pictures (WebP) among other things? Remember RealMedia? Remember QuickTime MOV? That’s the thing we’re going to talk about.

As you can guess all these containers are just a series of chunks, some chunks being a container for another list of chunks. MOV and MP4 are the exception because they have atoms and boxes correspondingly. Anyway, this structure allows you to put virtually anything there, and while some formats used that in moderation (WAV is essentially a top-level RIFF chunk with header and data chunks; there may be user metadata present and that’s about it), some others abused that possibility as much as possible (no points for guessing).

Let’s quickly review MOV structure: essentially you have moov atom with the description of the container data (which includes atoms for each individual track description of course) and mdat chunk with actual data if you’re lucky. Sounds simple? The devil hides in the deeply-nested atoms.
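To make the nesting concrete, here is a minimal Python sketch of walking atoms in a buffer. It relies only on the basic layout (a 32-bit big-endian size that includes the header, followed by a four-character type, with size 1 meaning a 64-bit size follows and size 0 meaning “until the end”). The names walk_atoms, make_atom and the CONTAINER_ATOMS subset are made up for this illustration, and the demo bytes are synthetic, not a playable file.

```python
import struct

# Assumption: a small, incomplete subset of atoms that hold only child atoms.
CONTAINER_ATOMS = {b"moov", b"trak", b"mdia", b"minf", b"stbl", b"dinf", b"edts"}

def walk_atoms(buf, start=0, end=None, depth=0):
    """Yield (depth, type, payload_offset, payload_size) for atoms in buf[start:end]."""
    end = len(buf) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, kind = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if size == 1:
            # a 64-bit size is stored right after the type
            if pos + 16 > end:
                break
            size = struct.unpack_from(">Q", buf, pos + 8)[0]
            header = 16
        elif size == 0:
            # the atom extends to the end of the enclosing space
            size = end - pos
        if size < header or pos + size > end:
            break  # damaged or truncated atom: stop instead of guessing
        yield depth, kind, pos + header, size - header
        if kind in CONTAINER_ATOMS:
            yield from walk_atoms(buf, pos + header, pos + size, depth + 1)
        pos += size

def make_atom(kind, payload=b""):
    """Build an atom with a plain 32-bit size (hypothetical helper for the demo)."""
    return struct.pack(">I4s", 8 + len(payload), kind) + payload

# Synthetic example: moov containing one trak with a dummy tkhd, then mdat.
demo = make_atom(b"moov", make_atom(b"trak", make_atom(b"tkhd", bytes(4))))
demo += make_atom(b"mdat", b"frame data")
layout = [(depth, kind) for depth, kind, _, _ in walk_atoms(demo)]
for depth, kind in layout:
    print("  " * depth + kind.decode("ascii"))
```

A real demuxer would also have to know which atoms carry a version/flags field before their children and would read from a file instead of holding everything in memory; the sketch skips both.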

First, MOV has a lot of features for specific groups of things like:

  • streaming—tracks can be joined into groups to signal that only one of them should be played depending on language/quality/decoding capabilities (kinda like what RMVB was famous for);
  • mastering—all those atoms for matte, kropping! clipping and notorious edit lists;
  • DVD-like playback—the less said about track referencing feature the better.

But why did I say that you have mdat chunk with actual data “if you’re lucky”? Because of the wonderful feature called data reference which tells you where the actual data is stored: it may be the same file, the same file but a different resource fork (for classic MacOS), or a different file; you can even get some URL to the resource with the data. And of course different tracks may be stored in different data locations. Flexibility!

Then you get to actual frames (or “samples” as they’re called). For certain reasons frames of the same track can be clustered together in a block (or “chunk” as the specification calls it). And to make things better frames can have different duration stored in a different table in run-length form i.e. “first N frames have duration X, then next M frames have duration Y, then next K frames…”. So in order to extract proper frame with a correct timestamp you just need to find out in which block it resides and with which offset, sum durations for the previous frames, apply information from track metadata (don’t forget the edit list!) and you’re done. Except when you have audio in PCM format or compressed with some standard fixed compression scheme (A-law/μ-law, IMA ADPCM, MACE 3:1/6:1). Then you need to calculate duration from size. And that’s one of the things I really hate in containers: they should be codec-agnostic, in other words you should be able to seek to a certain time point (with a given precision of course), extract frame data and feed it to a decoder that does not know which container it came from either. And of course Matroska is the worst violator as it employs codec-specific ways to shave off some of the frame payload (so if you don’t know how to reconstruct the frame data the decoder won’t recognize it either).
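The lookup described above can be sketched in Python roughly like this. It assumes the tables are already parsed into plain lists; the function names are invented for the illustration, the sample-to-chunk entries are reduced to two fields (the real ones also carry a sample description index), and both the edit list and the PCM special case are deliberately ignored.

```python
def sample_time(durations, index):
    """Start time of sample `index` (0-based) in track timescale units.

    `durations` is the run-length table: a list of (sample_count, duration).
    """
    time = 0
    for count, duration in durations:
        if index < count:
            return time + index * duration
        time += count * duration
        index -= count
    raise IndexError("sample index beyond the duration table")

def sample_offset(sample_to_chunk, chunk_offsets, sample_sizes, index):
    """File offset of sample `index` (0-based).

    `sample_to_chunk` is a list of (first_chunk, samples_per_chunk) where
    first_chunk is 1-based; each entry applies until the next entry starts.
    """
    first_sample = 0
    for i, (first_chunk, per_chunk) in enumerate(sample_to_chunk):
        if i + 1 < len(sample_to_chunk):
            next_first = sample_to_chunk[i + 1][0]
        else:
            next_first = len(chunk_offsets) + 1
        run_samples = (next_first - first_chunk) * per_chunk
        if index < first_sample + run_samples:
            chunk_in_run, pos_in_chunk = divmod(index - first_sample, per_chunk)
            chunk = first_chunk - 1 + chunk_in_run
            chunk_start = index - pos_in_chunk
            # skip the samples stored before this one in the same chunk
            return chunk_offsets[chunk] + sum(sample_sizes[chunk_start:index])
        first_sample += run_samples
    raise IndexError("sample index beyond the chunk table")

# Made-up tables: three samples of duration 100, then two of duration 50;
# chunks 1-2 hold two samples each, chunks 3-4 hold one sample each.
durations = [(3, 100), (2, 50)]
sample_to_chunk = [(1, 2), (3, 1)]
chunk_offsets = [1000, 2000, 3000, 4000]
sample_sizes = [10, 20, 30, 40, 50, 60]
print(sample_time(durations, 4))
print(sample_offset(sample_to_chunk, chunk_offsets, sample_sizes, 3))
```

Seeking to a time point is the same walk in reverse, and that is exactly where the edit list and the “calculate duration from size” audio tracks start to hurt.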

Anyway, having such nice format with so many features not needed by anybody 99% of the time it was a perfect fit for MPEG-4 container format. So MOV was adopted, some of its terminology was changed (as mentioned before atoms->boxes) and lots of new features were added as well both in metadata and file structure (like those segments for streaming).

And if by this point the title is still not clear to you, QuickTime MOV format appeared long before Matroska but conceptually it had many of the features present in the latter as well. To put it crooked, if you replace atoms with EBML entities and add ways to reduce payload (e.g. by generating missing frame headers) you can rename MOV to MKV and nobody will notice a difference.

P.S. It’s just that I’m writing a second iteration of a NihAV-based video player which should not just play a single file continuously and maybe deadlock in the process of doing that because of some SDL audio issue but be a proper player that can play multiple files, pause and seek. And while testing seeking in MOV files I’ve discovered some of those wonderful issues that inspired me to write this post.

Hacking to solve adventure game puzzles…

Tuesday, October 20th, 2020

I love adventure games (or simply quests as they’re known where I came from) but sometimes the best way to pass some moment there is to cheat.

One of such instances is The Legend of Kyrandia – Book One which is a nice game I play again sometimes but there’s the infamous maze there that is not fun. And in the old times I could work around it by hex-editing a savegame to give me the ever-glowing fireberry, stones and some other quest items so I did not have to explore the full maze.

And I thought this was a thing of the past until I tried to play Galador also known as The Prince and the Coward as it’s been supported by ScummVM (or should it be called CabalVM now?) since a couple of years ago and I haven’t played it yet. Mostly it’s fine but there are three moments which I could not pass at all: in two instances you must quickly pick up an object and in the third one you need to throw a stone at a certain place three times while the cursor jitters (and repeat that three times).

My reflexes never were that great to begin with but playing the game with a touchpad instead of a mouse made it impossible: when a menu with actions appears after the game reacts to your click it’s already too late to select an action and click it. So one solution would be to connect a mouse and try until you pass. But I got lazy during those years (back in the day I could do all those timed sequences in Space Quest II and IV while SQ3 required a utility to slow down the computer for the escape from pirates sequence). So I did something different: hacked the source code to show what action happens when I actually select that “pick up” object action and added a handler so when I press the 'p' key it’d do the action without bothering with menus (or crash if you don’t move the cursor to the proper object). And similarly for the stone-throwing scene I removed the jitter and mapped 'p' to pick up a new stone.

It’s not something I’m proud of but it should be a good demonstration of how you can work around certain game limitations if you have access to its engine source code, even if it’s not something trivial like maximum ammo/thousands of resources/infinite health.

NihAV: towards an audio player

Sunday, October 4th, 2020

So after weeks of doing nothing and looking at lossless audio codecs (in no particular order) I’m going back to developing NihAV and more particularly an audio player.

Lossless audio codecs were more advanced than I thought

Wednesday, September 23rd, 2020

As I’d mentioned in a previous post on lossless audio codecs, I wanted to look at some of them that are still not reverse engineered for documentation’s sake. And I did exactly that so now entries on LA, OptimFROG and RK Audio are not stubs any more but rather contain some information on how the codecs work.

And if you look at LA structure you see a lot of filters of various sizes and structure. Plus an adaptive weight used to select certain parameters. If you look at other lossless audio codecs with high compression and slow decoding like OptimFROG or Monkey's Audio you’ll see the same picture: several filters of different kinds and sizes layered over each other plus adaptive weights also used in residuals coding. Of course that reminded me of AV2 and more specifically about neural networks. And what do you know, Monkey's Audio actually calls its longer filters neural networks (hence the name NNFilter.h in the official SDK and you can spot it in the version history as well leaving no doubts that it’s exactly the neural networks it is named after).

Which leads me to the only possible conclusion: lossless audio codecs had been using neural networks for compression before it became mainstream and it gave them the best compression ratios in the class.
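For those who have not seen such a thing, here is a toy Python sketch of the general idea: a cascade of integer prediction filters whose weights are nudged after every sample using the signs of the error and of the history (a sign-sign LMS update). This is not the actual filter of LA, OptimFROG or Monkey's Audio; the class name, filter orders, shift and update rule are all assumptions chosen for brevity. It only shows why the decoder can follow the encoder exactly: both adapt the weights from data they both have.

```python
def sign(x):
    return (x > 0) - (x < 0)

class AdaptiveStage:
    """One integer prediction filter with sign-sign LMS weight adaptation."""

    def __init__(self, order, shift=9):
        self.weights = [0] * order
        self.history = [0] * order
        self.shift = shift

    def predict(self):
        acc = sum(w * h for w, h in zip(self.weights, self.history))
        return acc >> self.shift

    def update(self, value, residual):
        # move each weight one step in the direction that reduces the error
        step = sign(residual)
        for i, h in enumerate(self.history):
            self.weights[i] += step * sign(h)
        self.history = self.history[1:] + [value]

def encode(samples, orders=(8, 2)):
    """Pass each sample through the stages; the last residual is what gets coded."""
    stages = [AdaptiveStage(order) for order in orders]
    out = []
    for value in samples:
        for stage in stages:
            residual = value - stage.predict()
            stage.update(value, residual)
            value = residual
        out.append(value)
    return out

def decode(residuals, orders=(8, 2)):
    """Undo the stages in reverse order, adapting the weights the same way."""
    stages = [AdaptiveStage(order) for order in orders]
    out = []
    for value in residuals:
        for stage in reversed(stages):
            restored = value + stage.predict()
            stage.update(restored, value)
            value = restored
        out.append(value)
    return out
```

Real codecs use much longer filters, different step sizes per stage and a proper entropy coder for the residuals, but the layering looks like this.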

And if we apply all this knowledge to video coding then maybe in AV4 we’ll finally see some kind of convolution filters processing whole tiles and then the smaller blocks removing spatial redundancy maybe with some compaction layers like many neural network designs have (or transforms for largest possible block size in H.265/AV1/AVS2) and expansion layers (well, what do you think motion interpolation actually does?) and using RNNs to code residues left from all the prediction.

Why Rust is not a mature programming language

Friday, September 18th, 2020

While I have nothing against Rust as such and keep writing my pet project in Rust, there are still some deficiencies I find preventing Rust from being a proper programming language. Here I’d like to present them and explain why I deem them as such even if not all of them have any impact on me.

A Modest Proposal for AV2

Wednesday, September 16th, 2020

Occasionally I look at the experiments in AV1 repository that should be the base for AV2 (unless Baidu rolls out VP11 from its private repository to replace it entirely). A year ago they added intra modes predictor based on neural network and in August they added a neural network based loop filter experiment as well. So, to make AV2 both simpler to implement in hardware and improve its compression efficiency I propose to switch all possible coding tools to use misapplied statistics. This way it can also attract more people from the corresponding field to compensate the lack of video compression experts. Considering the amount of pixels (let alone the ways to encode them) in a modern video it is BigData™ indeed.

Anyway, here is what I propose specifically:

  • expand intra mode prediction neural networks to predict block subdivision mode and coding mode for each part (including transform selection);
  • replace plane intra prediction with a trained neural network to reconstruct block from neighbours;
  • switch motion vector prediction to use neural network for prediction from neighbouring blocks in current and reference frames (the schemes in modern video codecs become too convoluted anyway);
  • come to think about it, neural network can simply output some weights for mixing several references in one block;
  • maybe even make a leap and ditch all the transforms for reconstructing block from coefficients directly by the model as well.

As a result we’ll have a rather simple codec with most blocks being neural networks doing specific tasks, an arithmetic coder to provide input values, some logic to connect those blocks together, and some leftover DSP routines but I’m not sure we’ll need them at this stage. This will also greatly simplify the encoder as it will be more a matter of producing fitting model weights instead of trying some limited encoding combinations. And it may also be the first true next generation video codec after H.261 paving the road to radically different video codecs.

From the hardware implementation point of view this will be a win too: you just need some ROM and RAM for models plus a generic tensor accelerator (which are becoming common these days) and there’s no need to design those custom DSP blocks.

P.S. Of course it may initially be slow and work in a range of thousands FPS (frames per season) but I’m not going to use AV1 let alone AV2 so why should I care?

A Quality Video Hosting

Friday, July 31st, 2020

A brief context: I watch videos from BaidUTube (name slightly altered just because) and my preferred way to do that is to grab video files with youtube-dl in 720p quality so I can watch them later at my leisure, in the way I like (i.e. without a browser), and re-watch them later even if they’re taken down. It works fine but in recent weeks I’ve noticed that some of the downloaded videos are unplayable. Of course this can be fixed by downloading them again in slightly different form (separate video stream and separate audio streams muxed locally, youtube-dl can do that) but today I was annoyed enough to look at the problem.

In case it’s not obvious I’m talking about MP4 files encoded and muxed at BaidUTube without any modifications by youtube-dl which merely downloaded them. So, what’s the problem?

Essentially an MP4 file contains a header with metadata telling at which offset and of which size the frames for each codec are, and the actual data is stored in the mdat atom. Not here. First you have lots of 12-byte sequences 90 00 00 00 00 0X XX XX 00 02 XX XX, then a moof atom (used in fragmented MP4) and then another mdat. And another. I’ve tried to avoid streaming stuff but even to me it looks like somebody put all fragments prepared for HLS streaming into a single MP4 file making an unplayable mess.

Overall this happens only on a few random videos and probably most of the browsers would not pick it (since VP9 or VP10 in WebMKV is the suggested format) so I don’t expect it to be fixed. My theory is that they decided to roll out a new version of encoding software with a broken muxer library or muxing mode. And if you ask “What were they thinking? You should run at least some tests to see if it encodes properly.”, one wise guy has an answer for you: they weren’t thinking about that, they were thinking about how long it is until the lunch break and then when it’s time to go home. This is the state of enterprise software and I have no reasons to believe the situation will ever improve.

And there’s a fact that may be related to it. Random files starting from 2019 or so also show the marker “x264 – core 155 r2901 7d0ff22” in the encoded frames while most of the files have no markers at all. While I don’t think they violate the license it still looks strange that a company known for not admitting that it uses open-source projects (“for their own protection” as it was explained once) lets such a marker slip through.

Well, that was an even more useless rant than usual.

#chemicalexperiments — Bread

Saturday, May 9th, 2020

It seems that as a programmer and especially during these days you have an obligation to bake bread (the same way if you belonged to MPlayer community you had to watch anime). So here’s me doing it:

It’s made after a traditional recipe from Norrland: barley flour, wheat flour, milk, yeast, cinnamon, a bit of salt and molasses. IMO it goes fine with some gravad lax or proper cheese.

P.S. And if you think I should have made a sour-dough bread—I can always order some from Sweden instead.

Reviewing AV1 Features

Saturday, March 21st, 2020

Since we have this wonderful situation in Europe and I need to stay at home why not do something useless and comment on the features of AV1, especially since there’s a nice paper from (some of?) the original authors here. In this post I’ll try to review it and give my comments on various details presented there.

First of all I’d like to note that the paper has 21 authors for a review that can be done by a single person. I guess this was done to give academic credit to the people involved and I have no problems with that (also I should note that even if two of fourteen pages are short authors’ biographies they were probably the most interesting part of the paper to me).

Om marsipangrisorna (About the Marzipan Pigs)

Sunday, February 9th, 2020

Since I have nothing better to do (obviously) I want to talk about the marzipan pig situation in Sweden.