NihAV — Notes on Audio

December 3rd, 2015

There’s one thing I like in Libav audio design and that’s planar audio. There’s one thing I don’t like about it and that’s the handling of multichannel audio. As a project with roots intermingled with MPlayer, it follows the M$ definition of multichannel audio, and since many multichannel codecs have their own channel order, lavc decoders have to define channel reordering tables. And it gets hairier when you have a dynamic channel configuration (i.e. not one of the static channel configurations but rather a channel mask defining which channels are present or not). It gets especially fun with codecs that have extensions to the base 5.1 format that redefine existing channels and add new ones (like D*lby and DT$ codecs).

My proposal is that multichannel codecs should not bother with what the library considers The Only True Layout but rather export a channel map along with the channel data and let the conversion library figure it out (shuffling pointers is not hard, you know).
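To illustrate the “shuffling pointers” part, here’s a minimal sketch (my own illustration, not any actual NihAV or Libav API — the channel labels and names are made up) of reordering planar audio by swapping plane pointers according to a channel map exported by the decoder:

#include <stdint.h>

/* hypothetical channel labels, just for the example */
enum NAChannel { NA_CH_FL, NA_CH_FR, NA_CH_C, NA_CH_LFE, NA_CH_SL, NA_CH_SR };

/* src_map[i] says which channel the decoder put into source plane i,
 * dst_map[j] says which channel the caller wants in output plane j.
 * No sample data is copied — only the plane pointers are shuffled. */
static void reorder_planes(uint8_t *dst[], uint8_t *const src[],
                           const enum NAChannel dst_map[],
                           const enum NAChannel src_map[], int nch)
{
    for (int j = 0; j < nch; j++)
        for (int i = 0; i < nch; i++)
            if (src_map[i] == dst_map[j]) {
                dst[j] = src[i];
                break;
            }
}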

P.S. It’s hard to blame lu_zero for the current Libav situation but I still shall do it.

NihAV — NAScale

November 21st, 2015

First, some history. If you don’t like reading about it just skip to the ruler below.

So, NAScale was born after yet another attempt to design a good colourspace conversion and scaling library. A long time ago FFmpeg didn’t have any proper solution for that and used the rather rudimentary imgconvert; later it was replaced with libswscale, lifted from MPlayer. Unfortunately that one was designed for rather specific player tasks (mostly converting YUV to RGB for displaying with the X11 DGA driver) rather than as a generic utility library, and some of its original design still shows to this day. Actually, libswscale should have a warm place in every true FFmpeg developer’s heart next to MPEGEncContext. Still, while being far from ideal it has SIMD optimisations and it works, so it’s still being used.

And yet some people unsatisfied with it decided to write a replacement from scratch. Originally AVScale (a Libav™ project) was supposed to be designed during a coding sprint in Summer 2014. And of course nothing substantial came out of it.

Then I proposed my vision of how it should work and even wrote some proof-of-concept (i.e. throwaway) code to demonstrate it back in Autumn 2014. I made an update to it in March 2015 to show how to work with high-bitdepth formats, but nobody has touched it since then (and hardly before that either). Thus I’m reusing that failed effort as NAScale for NihAV.


And now about the NAScale design.

The main guiding rule was: “see libswscale? Don’t do it like this.”

First, I really hate long enums and dislike API/ABI breaks. So NAScale should have a stable interface and no enumeration of known pixel formats. What should it have instead? A pixel format description good enough to let NAScale convert even formats it has no idea about (like BARG5156 to YUV412).

So what should such a description have? Colourspace information (RGB/YUV/XYZ/whatever, gamma, transfer function etc.), the size of the whole packed pixel where applicable (i.e. not for planar formats) and individual component information. That component information describes how to find and/or extract the component (i.e. on which plane it is located, what shift and mask are needed to extract it from a packed bitfield, how many bytes to skip to find the first and the next component etc.) plus subsampling information. The names chosen for those descriptors were Formaton and Chromaton (for rather obvious reasons).
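To make that a bit more concrete, here’s a rough sketch of what such descriptors might look like (the names Formaton and Chromaton come from the design above, but the exact fields are my guesses, not the real NAScale structures):

typedef struct Chromaton {
    int plane;          /* on which plane the component is located        */
    int shift, mask;    /* for extracting it from a packed bitfield       */
    int offset;         /* bytes to skip to find the first component      */
    int step;           /* bytes to skip to find the next component       */
    int depth;          /* bits per component                             */
    int hsub, vsub;     /* horizontal and vertical subsampling            */
} Chromaton;

typedef struct Formaton {
    int       model;       /* RGB, YUV, XYZ, whatever                     */
    int       pixel_size;  /* size of the whole packed pixel, 0 if planar */
    int       num_comp;
    Chromaton comp[5];     /* per-component descriptions                  */
    /* plus gamma, transfer function and other colourspace information    */
} Formaton;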

Second, the way NAScale processes data. As I remember it, libswscale converted input into YUV with fixed precision before scaling and then back into the destination format, unless it was a common-case format conversion without scaling (and then some bypass hacks were employed, like plane repacking functions and such).

NAScale prefers to build the filter chain in stages. Each stage has either one function processing all components or a function processing only one component that is applied to each component — that allows you to execute e.g. scaling in parallel. It also allows building a proper conversion+scaling process without horrible hacks. Each stage might have its own temporary buffers that will be used for output (and fed to the next stage).

You need to convert XYZ to YUV? First you unpack XYZ into planar RGB (stage 1), then scale it (stage 2) and then convert it to YUV (stage 3). NAScale constructs the chain by searching for kernels that can do the work (e.g. convert input into some intermediate format or pack planes into the output format), provides each kernel with a Formaton and dimensions, and that kernel sets the stage processing functions. For example, the first stage of RGB to YUV is unpacking RGB data, thus NAScale searches for the kernel called rgbunp, which sets the stage processing function and allocates RGB plane buffers; then the kernel called rgb2yuv will convert and pack RGB data from those planes into YUV.
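A rough sketch of the stage/kernel split as I understand it (reusing the Formaton sketch above; again, the fields and signatures here are mine, not the actual NAScale code):

#include <stdint.h>

typedef struct ScaleStage {
    /* either one function processing all components at once ...           */
    void (*process_all)(struct ScaleStage *s, uint8_t *dst[], uint8_t *src[]);
    /* ... or one function applied to each component separately            */
    void (*process_comp)(struct ScaleStage *s, uint8_t *dst,
                         const uint8_t *src, int comp);
    uint8_t           *tmp[5];  /* per-stage buffers fed to the next stage */
    struct ScaleStage *next;
} ScaleStage;

typedef struct ScaleKernel {
    const char *name;           /* e.g. "rgbunp", "scale", "rgb2yuv"       */
    /* sets the stage processing functions and allocates its buffers       */
    int (*init)(ScaleStage *s, const Formaton *in, const Formaton *out,
                int width, int height);
} ScaleKernel;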

And last, implementation. I’ve written some sample code that can take RGB input (high bitdepth is supported too), scale it if needed and pack it back into RGB or convert it into YUV depending on what was requested. So the test program converted a raw r210 frame into r10k, or input PPM into PPM or PGMYUV with scaling. I think it’s enough to demonstrate how the concept works. Sadly nobody has picked up this work (and you know whom I blame for that; oh, and koda — he wanted to be mentioned too).

Sprint Report

November 20th, 2015

Last week I attended the fifth Libav coding sprint in Pelhřimov. Here’s a report from the host of the current sprint (she was amazing, many thanks). It was fun indeed (some of the fun provided by Lufthansa cancelling my flight because of a strike).

It’s the third sprint I’ve attended and there’s a pattern to them. Even if they are called coding sprints there’s not so much coding going on there; it’s more about discussing various stuff and food than actual coding (though Anton usually spends a lot of time fixing some decoder). As I’m no longer a Libav developer, I simply hang around and try to provide enough trolling and proper drinks, sometimes even sharing a bit of knowledge if anybody wants to listen.

Another recurring theme is AVScale, a saner replacement for libswscale. It gets discussed during the sprints (since Summer 2014) but nothing substantial gets done. The only things we have so far are my proof-of-concept implementation (I’ll present it later in my post about NAScale, it’s the same thing) and something hacked together by 1–2 Italians under the influence of alcohol in a couple of hours (with great comments like //luzero doesn't remember this) that has Libav integration but no functionality (my code is the complete opposite of that — standalone and doing something useful). Well, just wait for new posts about this, they’ll appear eventually.

In general I attend sprints just to see people and new places and have some fun. Among the places where the sprints took place — Pelhřimov, Stockholm and Turin — I’d say Stockholm was the best so far as it is the easiest to reach and has the best food for my taste (plus, thanks to SouthPole AB, it hosted more people than any other sprint). I should attend a sprint again some time.

P.S. I still blame lu_zero for not writing anything about it yet.

Blogposts missing

November 17th, 2015

I like a good read with analysis of random stuff. The problem is that the planned series of posts on parallels between social organisations and software development has not been written yet.

The first post would be dedicated to the idea supplanted by bureaucracy. This happens when somebody has a good idea like “let’s create X foundation to promote good thing Y” and with time it turns into a monopoly that dictates the rules to everybody else in that area. Special attention would be paid to the ways such organisations maintain a “democratic” façade while making all decisions in private — like turning many almost useless contributors into voters and convincing them to vote the way the organisers want. An addendum about how such non-profit organisations get their incredible amounts of money would be good too. And if you’re still in need of an example — think of any sports organisation like the IOC or FIFA.

The second post would be about cultists — not people adhering to some religion but rather people blindly following something and not even considering that other people might have other views and needs. For example, iUsers or those who program in Oberon dialects (which seems to include Go despite it not being a direct descendant). The main problem with them is that those cultists usually force their fetish as the only proper solution, and the main answer to “but I want to do that” is “it’s not needed and I’ve never done it myself”.

The third post would be about a common tactic of switching from peculiar details when advertising stuff to blanket statements when defending against criticism (or vice versa). Among other things it’s very common for Oberon and systemd advocates, in a fashion like “Feature X? We have it right in Y. Y sucks? Well, it’s just a single dialect/module, the whole system is wonderful, stop attacking it.” It would be painful, but drawing parallels between this and terrorists/peaceful Islam would be proper too.

And of course I blame lu_zero for not writing this.

Freudian Slip?

November 7th, 2015

Even if I’m no longer a Libav or FFmpeg developer I still look at both projects’ development mailing lists (the FFmpeg one mostly in the faint hope that Peter Ross submits something awesome again).

So one day I see this message. The “former” leader calls a large share of commits they get “enemy merges” (and it cannot be humorous, it’s not mean enough to be Austrian humor). Well, nice attitude you have there. And you know what? This might be a semi-official position there.

I was present at the FFmpeg–Libav discussion at VDD (since I was not noticed by Jean-Baptiste I remained while other outsiders were kicked out — here’s the recording of the public part). There I even managed to ask a single question — what has really changed since Michael’s resignation. FFmpeg people failed to answer that. Besides not making merges anymore, Michael still announces and makes releases and makes whatever changes he likes without reviews; he’s still the de facto leader in my opinion. I’m yet to see FFmpeg having defined rules stating something different (even Libav has something). Another fun fact from that meeting was some FFmpeg people openly stating they hate Libav merely because it exists.

Again, I don’t have to care about the FFmpeg community, but working on Libav in such conditions is no fun either (and it’s no fun for many other reasons, many of which sadly have something to do with FFmpeg).

So I’d rather follow the advice of the great philosopher Eric Theodore Cartman — “screw you guys, I’m going home”. Developing NihAV at a slow pace (i.e. when I feel like doing it) in a neutral one-developer atmosphere is much better.

NihAV — Processing Graph Notes

October 10th, 2015

I’m giving only a short overview for now, more to come later.

Basically, you have NAGraph that connects different workers (processing units with queues).
A lock-free NAQueue should accept only objects of a certain type (side note: introduce libnaarch/naatomics.h).
Also those objects should have a common base (NAGraphObject) that extends NAClass by adding side data and a subtype signal (NAPacket, NARawData or NAFrame for now).
Multiple object types with a common base allow having the same processing interface (after all, both encoders and decoders simply take some input data and output something else).
An elementary stream from a demuxer can either be fed to a parser filter that will produce proper packets or be accepted directly by a decoder or muxer.
Later I should make those parser filters auto-inserted too.
A uniform interface should allow easier integration of third-party components, even in binary form (if there’s somebody willing to use a not-yet-existing library like this).
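A rough sketch of how those pieces could fit together (the names follow the notes above, the fields and signatures are my guesses rather than actual NihAV code):

/* types that exist elsewhere in the NihAV design, only referenced here */
typedef struct NAClass    NAClass;
typedef struct NAQueue    NAQueue;
typedef struct NASideData NASideData;

typedef enum { NA_SUBTYPE_PACKET, NA_SUBTYPE_RAW_DATA, NA_SUBTYPE_FRAME } NASubType;

/* common base for everything travelling through the graph */
typedef struct NAGraphObject {
    NAClass    *base;      /* the NAClass part (would be embedded in reality) */
    NASubType   subtype;   /* signals NAPacket, NARawData or NAFrame          */
    NASideData *side_data;
} NAGraphObject;

/* a worker: a processing unit with input/output queues */
typedef struct NAWorker {
    NAQueue *in, *out;
    int (*process)(struct NAWorker *self);
} NAWorker;

typedef struct NAGraph {
    NAWorker **workers;
    int        num_workers;
} NAGraph;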


Zonal partitioning of the graph (inputs, processing, outputs) should maybe include generic filters (e.g. de/noise) too, and maybe a flag should be set on NAFrame to say whether it can be modified or should be kept intact.
Errors should be caught right at the graph processing stage (do that with callbacks and have fun).
Roughly, this should be the architecture for NihAV till the end of its days since it should be future-proof.

Of course, all things described here should be implemented too eventually (sigh).

Rants on Data Compression

October 9th, 2015

… When I was a young piglet I liked to read the rather famous paper by Bell, Cleary and Witten discussing general data compression and PPM. The best phrase there was that progress in data compression is mostly defined by the larger amounts of RAM available. I still believe those words to be true and below I present my thoughts on the current state of data compression. Probably it’s trivial, well-known, obvious or wrong to anybody knowing a bit about data compression, but well, it’s my blog and my dumpster for discarded thoughts.

General data compression

Let’s start from the very end — entropy coding. There are two approaches: coding into an integer amount of bits or coding as close to Shannon’s entropy limit as possible. For both we have had optimal coding methods for about half a century (Huffman coding — 1952, arithmetic coding — mid-1970s). You cannot improve compression ratio here, so the following schemes are mostly tradeoffs sacrificing a bit of compression for speed gains (especially in the form of (pseudo-)arithmetic coders operating only on binary symbols). The only outstanding thing is the so-called Asymmetric Numeral Systems, but I suspect they are isomorphic to traditional entropy coders.

Now let’s look at what feeds data to entropy coders. There are two main approaches (often combined): context modeling (PPM, probably the real foundation of the current highest-compression methods, was proposed in the mid-1980s) and LZ77 (guess the year yourselves). Are there improvements in this area? Yes! The principle is simple — the better you can predict the input the better you can code it. So if you combine different methods to better handle your data you can get some gains.

And yet the main compression gain here lies in proper preprocessing. From table or executable code preprocessing (table data usually differs only a bit between entries, and for executables you can get some gains if you replace relative jump/call addresses with absolute values) to the Burrows–Wheeler transform plus move-to-front plus RLE if needed, etc.

Audio compression

You have four main targets here: general lossy compression, speech compression, lossless fast compression and lossless crazy compression.

General lossy compression follows the scheme established in the 1990s or earlier: transform to the frequency domain, group frequencies and code the frequency bands. Most of the methods are quite old and progress is defined mostly by how much RAM and CPU users are willing to sacrifice on it. For example, CELT (the main part of Opus; the other part, SILK, is an ordinary speech codec) is not that different in design from G.722.1 from the late 1990s.

Speech coding follows the canons of the 1980s too — performing LPC, coding filter coefficients and other information enhancing signal reconstruction (pulse position, pitch tilt etc.).

Lossless fast compression (aka compression for normal usage) follows suit too — you have LPC or some adaptive filters used for prediction plus residue coding (usually with Golomb/Rice codes from the 1960s–1970s; BTW the original Golomb paper is AWESOME, they don’t write papers like that nowadays).

Lossless crazy compression (aka spend hours compressing and about as much decompressing) employs the same approach, but with longer filters, usually even several filters of different sizes applied one after another, plus better residue coding schemes.

Image compression

Here you have a greater variety of coding methods but most of them are very old (just look at when the Haar wavelet was proposed). It is especially funny that JPEG is still holding strong despite being more than twenty years old. I still remember the so-called fourth-generation image compression (separating an image into region borders and textures to fill them, and coding those); it hasn’t taken off yet despite being introduced in the late 1980s or so.

The only interesting development happens in lossless image compression, but neither 2-D LZ77 (WebP) nor context modeling (FLIF) is a particularly new idea.

Video compression

Modern codecs are all very similar and they are usually ripoffs of H.26x (there are two exceptions — Thor, which is not a ripoff only because it was designed while openly acknowledging that some parts are taken from H.265, and Daala, which is more original and is discussed below).

So nowadays you have a very limited subset of the ideas that were present in video codecs from the 1990s — it’s boring macroblocks (now with quadtree partitioning instead of fixed size), motion compensation (now you have more reference frames to choose from though) and a binary entropy coder (except for Thor, which went the way of RealVideo 3/4 with context-adaptive VLCs). Even the trend of adding special coding tools for special content doesn’t look that original (if you remember the countless screen codecs and MPEG-4 Audio, *barf*).

The only exception for now is Daala, which uses more original ideas, but I fear it will end up the same boring codec because it is not crazy enough to make a breakthrough. I believe it should do more crazy preprocessing at least and maybe better modeling, e.g. taking more than the nearest neighbours into account (maybe even use something PPM-like for element coding and not just probability mixing). Look at JBIG for inspiration maybe 😉

Conclusions

Don’t expect miracles in data compression to happen anytime soon, but a couple of percent improvement in specialised fields at least once a decade is possible and even expected.

FAQ

October 3rd, 2015

Since I’ve been asked the same questions over and over again I’ve decided to make a short (for now) FAQ page.

  • How many years does it take to get a citizenship in Germany? 7-8 years.
  • How long have you been living in Germany? Since Spring 2010, do the math yourself.
  • So you’ll get your German citizenship in a couple of years, right? Maybe. It’s the same kind of maybe as in ‘Berlin-Brandenburg airport will be open in a couple of years.’ And it does not depend on me much.
  • Can you help me with ProRes issue … I can but I have no desire nor obligation. All the Trocadero I got for writing an encoder is long gone and I don’t participate in projects that offer any ProRes support, inquire there.
  • Can you look at this codec … I can but no promises — I rarely have a desire to do anything these days.
  • Is NihAV real? More or less, it still lacks a lot of design and code but there are some bits implemented already. Design is described in this blog when it appears, code is developed as who-cares-source.
  • Why do you blame lu_zero? Oh, there are so many reasons for that and new ones keep appearing almost every day. Mostly it’s for the things he was supposed to do but still hasn’t done (and is unlikely to do in the foreseeable future): AVScale design and implementation, writing blog posts on certain topics (often I end up writing them instead, which is yet another reason to blame him), not doing much about ASF or RealMedia demuxers and related delayed work, for personal stuff (like preventing me from trying Torino trams and underground), for missing technical stuff in a wiki. Oh, and for being at least two different persons. There’s more that I can’t remember right now.
  • When will you visit Pelhřimov? Dunno, maybe when I have more than three free days.

NihAV: Data Delivery Channels

September 27th, 2015

Disclaimer: this should’ve been written by a certain Italian (and discussed during VDD too) but since he is even lazier than me (and stopped after writing only this), I ended up writing it myself.

One of the main concerns in framework design is how data will be passed from sources to destinations. There are two ways that are mistakenly called synchronous and asynchronous (the main difference being the lack of a CLK signal): the producing function passes data further either on return, when it’s done with whatever it wanted to do, or by passing control to some external procedure immediately when data is available.

In the worst case of synchronous data passing you have a fixed pipeline: one unit of input goes in, one unit of output is expected to be produced (and maybe one more on final flush), repeat for the next stage. In the worst case of asynchronous data passing you have one thread per stage waiting for another thread to signal that it has some data ready to process. Or in single-threaded mode you simply have nested callbacks, again one per stage, calling the next callback in a chain when it has something to deliver.

Anyway, how should this all apply to NihAV?

In the simplest case we have one chain:

[Demuxer] --> (optional stream splitter) --> Packets --> [Decoder] --> Frames --> [output]

In a real-world scenario all the complexity starts at the stage marked [output]. There you usually feed frames to some filter chain, then to an encoder, and combine several streams to push all that stuff to some muxer(s).

As for data passing, you can have it in several modes: main processing loop, synchronous processing graph, asynchronous processing graph or an intricate maze of callbacks all alike.

With the main processing loop you have something like this:

while (has_input()) {
   pkt = get_input();
   send_output(pkt);
   /* drain whatever output is ready at this point */
   while (has_output()) {
     write_output();
   }
}
/* final flush */
while (has_output()) {
   write_output();
}

It’s ideal if you like micromanagement and want to save on memory but with many stages it might get rather hairy.

The synchronous processing graph (if you think it’s not called so look at the project name again) adds queues and operates on stages:

while (can_process(graph)) {
   for (i = 0; i < num_stages; i++) {
     graph->stages[i]->process(graph->inqueues[i], graph->outqueues[i]);
   }
}

In this situation you have not a single [input] -> [output] connection but rather [input] -> (queue) -> [output]; every stage is connected to other stages by queues and can consume/produce several elements (or none) in any of those queues.

The asynchronous processing graph relies on the output stages pulling data from all previous stages or input stages pushing data to the following stages.

int filter_process(Filter *f, void *data)
{
  /* process the data and immediately push it to the next stage */
  do_something(f->context, data);
  return f->next->filter(f->next, data);
}

And the callback variant is:

int demux(void (*consumer)(...)){
   while (!eof) {
     read_something();
     if (has_packet)
       consumer(ctx->consumer_id, packet);
   }
   return 0;
}

Callbacks are nice but they move a lot of the burden onto the callback writer and they are hard to get right, especially if you do something nontrivial.

So far I’m inclined to take the second approach. E.g. a Bink demuxer has packets for several streams stored together in one block — so the demuxer will create them all and put them into its output queue. A hardware-accelerated codec may send several frames for decoding at once — so it will read as many as possible or needed from the input queue and add all decoded frames into the output queue when they’re done. Also all the shitty tasks like proper synchronisation can be made into filters too.

Of course that will create several new problems especially with resource management but they should be solvable.

For the NihAV implementation that means adding NAQueue and several new datatypes (packet, frame, whatever) plus a mechanism in NAClass to recognize them, because you would not want the wrong thing going into the wrong pipeline (e.g. a decoder expects NAPacket and gets NAFrame instead).
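Here’s a minimal sketch of that type check (assumed names and layout, not the real NAQueue — a real one would also be lock-free as mentioned earlier):

typedef enum { NA_DATA_PACKET, NA_DATA_RAW, NA_DATA_FRAME } NADataType;

typedef struct NAQueueEntry {
    NADataType  type;    /* NAPacket, NARawData or NAFrame */
    void       *obj;
} NAQueueEntry;

typedef struct NAQueue {
    NADataType    accepted;  /* the only type this queue takes */
    NAQueueEntry *entries;
    int           len, size;
} NAQueue;

/* refuses e.g. an NAFrame pushed towards a decoder that expects NAPacket */
static int na_queue_push(NAQueue *q, NADataType type, void *obj)
{
    if (type != q->accepted)
        return -1;           /* wrong object for this pipeline */
    if (q->len >= q->size)
        return -2;           /* queue full, try again later    */
    q->entries[q->len].type = type;
    q->entries[q->len].obj  = obj;
    q->len++;
    return 0;
}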

NihAV — I/O

September 3rd, 2015

And now let’s talk about probably the hairiest part of multimedia framework — input-output layer.

Current libavformat design looks too messy to me and thus my design will be different. I know that some Lucas prefer the old way but I’d rather split protocol handling into several layers.

First, there’s the base I/O handler used solely for I/O operations. NAIOHandler contains functions for reading data, reading data asynchronously, writing data, seeking, and an ioctl() for some I/O-specific operations. There will be very few I/O handlers — just file, network (TCP/UDP) and null.

On top of that there’s the protocol handler. A protocol handler employs an I/O handler to perform I/O operations and provides only reading, writing, seeking and flushing functions. There may be buffered and unbuffered wrappers over the I/O handler or something less trivial like an HTTP handler.

On top of that there may be a layer or several of other protocol handlers if they need to sit on top of other protocols (e.g. Some-Streaming-Protocol-over-HTTP).

And on the very top there are I/O functions using protocol handlers for reading and writing data (bytes, X-bit integers, buffers). Those are used by (de)muxers.
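To sum the layering up, a rough sketch (the function sets come from the description above, the exact signatures are my guesses):

#include <stddef.h>
#include <stdint.h>

typedef struct NAIOHandler {
    int     (*read)      (void *ctx, uint8_t *buf, size_t size);
    int     (*read_async)(void *ctx, uint8_t *buf, size_t size);
    int     (*write)     (void *ctx, const uint8_t *buf, size_t size);
    int64_t (*seek)      (void *ctx, int64_t pos, int whence);
    int     (*io_ctl)    (void *ctx, int cmd, void *arg); /* I/O-specific operations */
} NAIOHandler;

typedef struct NAProtocolHandler {
    const NAIOHandler *io;   /* file, TCP/UDP or null handler underneath */
    int     (*read) (void *ctx, uint8_t *buf, size_t size);
    int     (*write)(void *ctx, const uint8_t *buf, size_t size);
    int64_t (*seek) (void *ctx, int64_t pos, int whence);
    int     (*flush)(void *ctx);
} NAProtocolHandler;

/* on top of this, functions reading bytes, X-bit integers and buffers
 * take a protocol handler and are what (de)muxers actually use */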

That’s the plan, insert usual rant about lack of time, interest and such here.