NihAV: implementation start

May 14th, 2015

Before people reading this blog (all 0 of them) start asking about it — yes, I’ve started implementing NihAV, it will take a lot of time so don’t expect it to be finished or even usable this decade at least (too little free time, even less interest and too much work needed to be done to have it at least somewhat usable for anything).

Here’s the intended structure:

  • libnaarch — arch-specific stuff here, like little/big endian handling, specific speedup tricks etc. Do not confuse with libnaosstuff — I’m not interested in non-POSIX systems.
  • libnacodec — codecs will belong here.
  • libnacompr — decompression and compression routines belong here.
  • libnacrypto — cryptographic stuff (hashes, cyphers, ROT-13) belongs here.
  • libnadata — data structures implementations.
  • libnaformat — muxers and demuxers.
  • libnamath — mathematics-related stuff (fixed-point calculations, fractional math etc).
  • libnaregistry — codecs registry. Codec information is stored here — both overall codec information (name to description mapping) and codec name search by tag. That means that e.g. FOURCC to codec name mapping from AVI or MOV are a part of this library instead of remaining demuxer-specific.
  • libnautil — utility stuff not belonging to any other library.

Remark to myself: maybe it’s worth splitting out libnadsp so it won’t clutter libnacodec.

Probably I’ll start with simple stuff — implement dictionary support (for options), an AVI demuxer and some simple decoder.

LZ77-based compressors — a story similar to lossless codecs

May 12th, 2015

What do LZ77 compressors and lossless codecs have in common? They both perform lossless compression and there are too many of them because everyone tries to invent their own. And like lossless audio codecs — quite often in their own container too.

In case you don’t know (shame on you!) LZ77 scheme parses input into pieces like <literal> <copy> <literal> ... Literal means “copy these input bytes verbatim”, copy is “we had that substring some time ago, copy N bytes from the history at offset M”.
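
To make the literal/copy idea concrete, here is a minimal sketch of the expansion step. The LZToken layout and the lz77_expand name are invented for this illustration and do not come from any particular format; a real format would read the tokens from a bitstream instead of a ready-made array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical token layout, invented for this sketch. */
typedef struct {
    int            is_copy;  /* 0 = literal run, 1 = copy from history */
    const uint8_t *lit;      /* literal bytes (literal tokens only) */
    size_t         len;      /* number of bytes to emit */
    size_t         offset;   /* distance back into the output (copy tokens only) */
} LZToken;

/* Expands tokens into dst; returns the number of bytes written or -1 on error. */
static long lz77_expand(const LZToken *tok, size_t ntok,
                        uint8_t *dst, size_t dst_size)
{
    size_t pos = 0;

    assert(tok && dst);
    for (size_t i = 0; i < ntok; i++) {
        if (tok[i].len > dst_size - pos)
            return -1;                      /* output would overflow */
        if (tok[i].is_copy) {
            if (tok[i].offset == 0 || tok[i].offset > pos)
                return -1;                  /* points before the start of history */
            /* Byte by byte on purpose: len may exceed offset, in which case
             * the copy reads bytes it has just written (run replication). */
            for (size_t j = 0; j < tok[i].len; j++, pos++)
                dst[pos] = dst[pos - tok[i].offset];
        } else {
            memcpy(dst + pos, tok[i].lit, tok[i].len);
            pos += tok[i].len;
        }
    }
    return (long)pos;
}
```

A literal "abc" followed by a copy of 5 bytes from offset 3 expands to "abcabcab": the copy is longer than its offset, so it re-reads bytes it has just produced.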

The idea by itself is rather simple and thus it’s easy to implement some LZ77 parsing with the following coding, slap your name on it and present as some new algorithm. There are three branches of implementation goals there — fast (but somewhat decent) compression, high (but not so fast) compression and experimental research that may lead to implementations in the first two branches.

Fast compression schemes usually pack everything into bytes so no time is wasted on bit reading. Usually format is like this — if top three bits of the next byte are something, then read literal copy length, otherwise determine offset size, read it and copy string from the dictionary. Quite often there are small tweaks to make compression faster (like using hashes) or slightly better (using escape values to code long values and coding small offsets/lengths into opcode etc.). There are so many implementations like that and they still keep appearing. LZO, LZF, FastLZ, snappy, chameleon… And lots of old games used such compression for their resources (including video) too.
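
As an illustration of such a byte-oriented layout, below is a sketch of a decoder for a made-up format (it is not LZO, LZF or any other real one). The assumptions: if the top three bits of the opcode are zero, the low five bits plus one give a literal run length; otherwise the top three bits plus two give the copy length, and the low five bits together with the next byte form a 13-bit offset minus one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decoder for an invented byte-packed LZ77 format (see the assumptions
 * in the text above). Returns bytes written or -1 on malformed input. */
static long bytelz_decode(const uint8_t *src, size_t src_size,
                          uint8_t *dst, size_t dst_size)
{
    size_t ip = 0, op = 0;

    assert(src && dst);
    while (ip < src_size) {
        unsigned opcode = src[ip++];
        unsigned top    = opcode >> 5;    /* top three bits select the mode */
        unsigned low    = opcode & 0x1F;

        if (top == 0) {                   /* literal run of low + 1 bytes */
            size_t len = (size_t)low + 1;
            if (len > src_size - ip || len > dst_size - op)
                return -1;
            memcpy(dst + op, src + ip, len);
            ip += len;
            op += len;
        } else {                          /* copy of top + 2 bytes */
            size_t len = (size_t)top + 2;
            size_t offset;
            if (ip >= src_size)
                return -1;                /* offset byte is missing */
            offset = (((size_t)low << 8) | src[ip++]) + 1;
            if (offset > op || len > dst_size - op)
                return -1;
            while (len--) {               /* bytewise, copies may overlap */
                dst[op] = dst[op - offset];
                op++;
            }
        }
    }
    return (long)op;
}
```

No bit reader is involved: every decision is made on whole bytes, which is where the speed of this family comes from.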

High compression schemes compress the data produced by LZ77 parsing much better and spend more cycles on finding the best parsing of the input. It all started essentially with LZHUF when someone decided to employ Huffman codes instead of writing values in a fixed amount of bits. If you’ve never heard about LHA/LZH you need your Amiga box confiscated. This approach reached its peak with Deflate — by modern standards it’s not the best format to compress (i.e. not fast enough, does not compress high enough etc etc.) but it’s the standard available everywhere and in any form. Deflate uses custom per-block Huffman codes with their definition stored in compressed form as well so there’s hardly anything to improve there radically. And thus (patent expiration helped greatly too) another form of LZ77-based compression started to bloom — LZA (using modelling and arithmetic coding on LZ77 parsing results). Current favourite LZMA (and main RAR compression scheme) uses this approach too albeit in very sophisticated form — preprocessors to increase compression ratio on some kinds of known data, Markov models, you name it.

And here’s my rant — leave Deflate alone! It’s like JPEG of data compression — old and seemingly not very effective but it’s ubiquitous, well-supported and still has some improvement potential (like demonstrated by e.g. 7-zip and zopfli). I’d hate to have as many compression schemes to support as video codecs. Deflate and LZMA are enough for now and I doubt there will be something significantly more effective appearing soon. Work on something lossy — like H.265 encoder optimisations — instead.

NihAV: Logo proposal

May 11th, 2015

Originally it should’ve been Bender Bending Rodriguez on the Moon (implying that he’ll build his own NihAV with …) but since I lack drawing skills (at school I’ve managed to draw something more or less realistic only once), here’s the alternative logo drawn by a professional coder in Inkscape in a couple of minutes.

nihav

Somehow I believe that building a castle on a swamp is a good metaphor for this enterprise as well (and much easier to draw too).

NihAV — A New Approach to Multimedia Pt. 8

May 11th, 2015

Demuxers

First of all, as I’ve mentioned in the beginning, codecs should have string IDs. Of course if codec tag is present it can be passed too. Or stream handler in AVI, it’s just optional. This way AVI demuxer can report codec {NIH_TYPE_VIDEO, "go2meeting", "G2M6"} or {NIH_TYPE_VIDEO, "unknown", "COL0"} and an external program can guess what codec that is and handle it specially.
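
A rough sketch of what such a triple could look like and how an external program might use the tag as a fallback. Only NIH_TYPE_VIDEO and the two example triples come from the text above; NIH_TYPE_AUDIO, the NihCodecDesc struct, app_pick_handler and the "app-private-col0" handler name are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { NIH_TYPE_VIDEO, NIH_TYPE_AUDIO } NihStreamType;

/* Hypothetical shape of what a demuxer reports for a stream. */
typedef struct {
    NihStreamType type;
    const char   *name;    /* string ID, "unknown" if the demuxer cannot tell */
    char          tag[5];  /* optional container tag, "" if absent */
} NihCodecDesc;

/* Application-side policy: trust the name if there is one, otherwise
 * try to recognise the raw tag. Returns NULL if nothing matches. */
static const char *app_pick_handler(const NihCodecDesc *c)
{
    assert(c && c->name);
    if (strcmp(c->name, "unknown") != 0)
        return c->name;
    if (strcmp(c->tag, "COL0") == 0)
        return "app-private-col0";   /* made-up handler name */
    return NULL;
}
```

The demuxer never has to know about "COL0" here: it passes the tag through and the caller decides what to do with it.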

Second, demuxers should return two types of data — packets and streams. E.g. MPEG-TS (the best container ever by popular vote) does not care about frame boundaries, so its demuxer should not try to be smart; it should return a stream that can be fed to a parser, and that parser will produce proper packets.

Third, parsers. There are two kinds of them — ones that split stream into frames and ones that parse frame data to set some properties to the packet. They should be two separate entities and invoked differently, one after another if needed.

Something similar for muxers — everybody knows that one can mux absolutely any codec into AVI. Or put H.264 into MPEG-PS (hi Luca!). Muxers should just allow callers to do that when possible instead of failing because the codec is unrecognised.

P.S. If I’m ever to implement this all it will take a lot of time and Trocadero.

NihAV — A New Approach to Multimedia Pt. 7

May 9th, 2015

Modularity — codec level

FFmpeg, obviously, was made to transcode MPEG video (initial commit had support for JPEG, MPEG-1/2 video, some H.263-based formats like M$MPEG-4, MPEG-4 and RV10, MPEG audio layers I-III and AC3). It was expanded to handle other formats but the misdirection in initial design has grown into MpegEncContext, which is the ugliest part of libavcodec to date.

It is easy to start with an abstraction that all codecs consist of I/P/B-frames split into 16×16 macroblocks that have 8×8 DCT blocks. You just need to have some codec-specific decoding (or coding) for picture header or block codes, that’s all. And since they all are very similar why not unite them into a single decoding function? I encourage everybody to look at mpv_decode_mb_internal in libavcodec/mpegvideo.c to see how this can go wrong.

Let’s just look at the simple codecs that should fit the model: I can still name two off the top of my head that don’t fit that well. H.263+ (or was it H.263++?) — it has packed PB-frames that have blocks for both P- and B-frame. IIRC it sends an empty frame just after that so reordering can take place. VC-1 has BI-frames that should be coded as I-frames but treated as B-frames; also it has block subdivision into 8×4, 4×8 or 4×4 subblocks. And there’s On2 VP3. This gets even better with the new generation of codecs — more reference frames and more complex relations between them — B-pyramid in H.264 and H.265 frame management. And there’s On2 VPx. Indeo 4/5 had complex frame management too — droppable references, B-frames, null frames etc.

So, let’s look at video codec decoding stages to see why it’s unwise to try to use the single context to bind them all.

  1. Sequence header — whatever defines codec parameters like frame dimensions, various features used in the bitstream etc. May be as simple as frame dimensions provided by the container; it may be codec extradata from the container as well; it may be as complex as H.265 having multiple PPSes referencing multiple SPSes referencing multiple VPSes.
  2. Picture header — whatever defines frame parameters. Usually it’s frame type, sometimes frame dimensions, sometimes quantiser, whatever vendor decides to put into it.
  3. Slice header — if codec has slices; if codec has separate plane coding or scalable coding it can be considered slices too. Or fields (though they can have slices too). Usually it has information related to slice coding parameters — quantiser, bitstream features enabled etc.
  4. Macroblock header — macroblock type, coded block pattern and other information is stored here.
  5. Spatial prediction information — not present for old codecs but is an essential part of intra blocks compression in the newer codecs.
  6. Motion vectors — usually a part of macroblock header but separated here to denote they can be coded in different ways too (e.g. newer codecs have to include reference frame index, for older codecs it’s obvious from the frame type).
  7. Block coefficients.
  8. Trailer information — whatever vendor decides to put at the end of the frame like CRC (or codec version for Indeo 4 I-frames).

And yet there are many features that complicate implementing this scheme in the same framework — frame management (altref frames in VPx, two frames fused together as in Indeo 4 or H.263), sprites, scalable coding features, interlacing, varying block sizes (especially in H.265 and ripoffs). Do you still think it’s a good idea to fit it all into the same mpegvideo?

That is why I believe the best approach in this case is to have small reusable blocks that can be combined to make a decoder. For starters, a decoder should have more freedom over where it decodes to — that should be handy in decoding those fused frames, also quite often one decoder is used inside another to decode a part of the frame, especially JPEG and WMV9/VC-1. Second, a decoder should be able to pick whatever components it needs — e.g. RealVideo 3/4 used H.264 spatial prediction and chroma motion compensation but the standard I/P/B frame management and its own bitstream decoding. WMV2 was mostly M$MPEG-4 with new motion compensation and special I-frame decoder. AVS (Chinese one) has 8×8 integer DCT coding but also spatial coding from H.264 and its frame management is almost standard I/P/B but P frame references two previous pictures and they’ve added S-frame that is B-frame with only forward references.

Hence I proposed a long time ago to split out at least frame management in order to reduce decoder dependencies on mpv (but again, no-one cared). It sank into the swamp. Then block management functions (the utility functions that update and provide pointers to the current block on output frame planes). That sank into the swamp. I’d propose something else in that direction but it would burn down, fall over, then sink into the swamp, because no-one cares about my proposals.

Still, here’s how I see it.

#include "block_stuff.h"
#include "frame_mgmt.h"
#include "h264/intra_pred.h"

Since this is not intended for the user it can have multiple smaller headers with only related stuff. Also large codec data should’ve been moved into separate subdirectories ages ago. It’s more than a thousand files in libavcodec already.

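decode_frame()
{
    frame_type = get_bits(gb, 2);
    cur_frm = ipb_frame_get_cur(ctx->ipb, frame_type); /* frame management picks the output frame */
    init_block_pos(ctx->blk, cur_frm);
    for (blocks) {                                     /* pseudocode: every macroblock of the frame */
        update_block_pos(ctx->blk);
        decode_mb(ctx, gb, ctx->blk, mb);              /* codec-specific bitstream decoding */
        if (mb->type == INTRA)
            h264_pred_spatial(ctx->blk, mb);           /* spatial prediction shared with H.264 */
        else
            idct_put_mb420(ctx->blk, mb);
    }
    ipb_frame_update_refs(ctx->ipb, frame_type);
}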

We have a lot of smaller blocks here encapsulating needed information — frame management, macroblock position and decoded macroblock information. Many chunks of code are the same between codecs, you often don’t need a full context for a small function that can be reused everywhere. Like spatial prediction — you just need to know if you can have neighbouring pixels, what prediction method to apply and what coefficients to add afterwards — be it RealVideo 3, H.264, or VP5. Similarly after motion vectors are reconstructed you do the same thing in most codecs — copy a rectangular area to the current frame using motion compensation functions. So write it once and reuse everywhere — and you need just a couple of small structures with essential information (where to copy to and what functions to use), not MpegEncContext.
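
To illustrate how little context such a shared function really needs, here is a sketch of a motion compensation copy. Everything in it (the MCPlanes struct, clamp_int, mc_copy_block) is invented for the example, and it is deliberately simplified: full-pixel vectors only, with coordinates clamped at the frame edges, whereas real codecs plug in their own subpixel interpolation functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal context: where to copy to and where from. */
typedef struct {
    uint8_t       *dst;     /* current frame plane */
    const uint8_t *ref;     /* reference frame plane */
    ptrdiff_t      stride;  /* bytes per row, same for both planes here */
    int            width, height;
} MCPlanes;

static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : v > hi ? hi : v;
}

/* Copies a w x h block at (x, y) from the reference plane displaced by
 * the full-pel vector (mvx, mvy). Reads outside the reference plane are
 * clamped to the nearest edge pixel. */
static void mc_copy_block(const MCPlanes *p, int x, int y, int w, int h,
                          int mvx, int mvy)
{
    assert(p && x >= 0 && y >= 0 && x + w <= p->width && y + h <= p->height);
    for (int j = 0; j < h; j++) {
        int sy = clamp_int(y + j + mvy, 0, p->height - 1);
        for (int i = 0; i < w; i++) {
            int sx = clamp_int(x + i + mvx, 0, p->width - 1);
            p->dst[(y + j) * p->stride + x + i] = p->ref[sy * p->stride + sx];
        }
    }
}
```

Nothing here knows about frame types, quantisers or bitstreams, so the same function could serve any decoder that can fill in the five fields.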

Sigh, I really doubt I’ll ever see it implemented.

NihAV — A New Approach to Multimedia Pt. 6

May 9th, 2015

Modularity — library level

Luca has saved me some work on describing how it should work (here it is, pity nobody reads it anyway).

Quick summary:

  • do not dump everything into the same library (or two — do people remember libavcore?),
  • make each library provide similar functionality (e.g. decoders, decompressors, hash or crypt functions) through the same interface,
  • provide implementations in future-compatible way (i.e. I might ask for LZ4 decompressor even while compression library currently supports only LZO and deflate and nothing bad happens — and you don’t have to check for libavutil/lz4.h presence and such).
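
A small sketch of the last two points, under loud assumptions: NADecompressor, na_decompressor_find and the "stored" pass-through entry are all invented here (a real library would register LZO and deflate instead, which are too long to show). What matters is that the caller asks by name and gets NULL for something the library does not have, instead of needing a header check at build time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One interface for every decompressor (hypothetical signature). */
typedef long (*NADecompressFunc)(const uint8_t *src, size_t src_size,
                                 uint8_t *dst, size_t dst_size);

typedef struct {
    const char       *name;
    NADecompressFunc  decompress;
} NADecompressor;

/* Placeholder implementation: "stored" data is copied verbatim. */
static long stored_decompress(const uint8_t *src, size_t src_size,
                              uint8_t *dst, size_t dst_size)
{
    if (src_size > dst_size)
        return -1;
    memcpy(dst, src, src_size);
    return (long)src_size;
}

static const NADecompressor na_decompressors[] = {
    { "stored", stored_decompress },
};

/* Lookup by name; an unknown name is not an error, just NULL. */
static const NADecompressor *na_decompressor_find(const char *name)
{
    size_t count = sizeof(na_decompressors) / sizeof(na_decompressors[0]);

    assert(name);
    for (size_t i = 0; i < count; i++)
        if (strcmp(na_decompressors[i].name, name) == 0)
            return &na_decompressors[i];
    return NULL;
}
```

With this shape, na_decompressor_find("lz4") simply returns NULL today and starts returning an entry once someone adds it to the table; the calling code does not change.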

Some Travel Notes

May 4th, 2015

So I’ve finally visited the disunited state of Austria-Hungary and can share some feelings for those who like to read my travel notes (all zero people).

First, I’d like to talk about rail magazines that are present in InterCity or express trains in different countries. The ones I know are issued monthly and have national peculiarities (for starters, they are written in the national language). The one from Deutsche Bahn (German railways) covers a lot of different topics — culture, travel, some short story or an excerpt from one, DB plans, kids corner etc. ÖBB (Austrian railways) one is mostly dedicated to advertising Austria for tourists (and maybe a bit or two about neighbouring resorts to visit). TGV magazine (obviously French) is something in-between (not fully advertisements but not much serious stuff either) plus advertisements for night clubs. Yet it’s the only one of three that features a scheme for IC and TGV routes. And the best one is of course Kupe from SJ (Swedish railways). It has articles on various topics and it also includes things close to my heart: a full map of Swedish railways (I need to travel more there!), SJ fleet description (I like to ride all those kinds of trains plus Inlandsbanan’s Y1, SL X60 and X10 and I definitely need to go to Lennakatten again!) and the most important thing — a page where a locomotive driver (it was Peter and now Jenny) answers railway-related questions (e.g. what’s the difference between trains like X2 and X40, what’s the longest route they have to travel, why train goes slowly sometimes etc.). Anyway, back to actual travel.

For Hungarian part I’ve visited Budapest. If you ignore the river, buildings in the centre and people it looks and feels like Kharkiv. The same neglected buildings (often in the same architectural style), the same neglected streets. The transport is verily the same — Tatra trams, Ikarus buses, even underground rolling stock is the same and even painted the same! Heck, even most people I talked with there were from Kharkiv. And their suburban rail lines (like H5, H6 or H8/H9) are shaky as Ukrainian roads.

Also as I’m, to put it politically correctly, a fat cripple I really appreciated how lines are connected there — you often have to cross a road or use an underground pass without any elevators. Tram routes are so well designed that they simply end somewhere in the middle of the street with no loop to turn around. And the airport reminds me of Kharkiv too — it’s connected only by a bus (on a Ukrainian-grade road), they check your documents thoroughly. The only difference is that in Kharkiv airport I never had to take off my shoes at the security check. At least after visiting it I don’t have a desire to go back to Ukraine (not that I had it before…).

Austrian part is represented by Innsbruck. It’s a stereotypical town in Austrian Alps. Transport system is rather strange — trams have numbers like 1, 3, 6, STB and buses have numbers like D, H, LK, O or TS. For skiers there are Alps with funiculars all around the town, for idiots who believe that fake should cost more than real there are tours to Swarovski, for me there was a museum of local rail lines (that means both local trams and railways in different parts of Tirol including Italy). Museum ticket also gives the right to a ride on a museum tram around the town. While the museum by itself is small (only two rooms with mostly photos and plans) it also has a depot full of museum trams from probably 1920s to 1970s (that feeling when you see DÜWAG GT6 only in a museum while they are still common here). Two tram lines (6 and STB) go into the mountains, at least STB being one-track there with passing loops at some stations (and trams take left track there like on proper railways). One of those stations surprised me by having an emergency broom tied to the pole there.

It’s also worth noting that there are two rivers flowing through Innsbruck — Inn, obviously, and Sill. I don’t care what it means for them, I know what it means for me — salt water herring in Swedish and that’s what I was thinking about.

Overall, Innsbruck looked nice and a bit like Bavaria, I honestly expected it to be worse (mostly because of Austrians I know). And understanding German is much easier than understanding Hungarian unless you’ve been born one. It’s worth visiting again sometime.

NihAV — A New Approach to Multimedia Pt. 5

April 25th, 2015

Structures and functions

The problem with structures in libav* is that they quite often contain a lot of useless information and easily break ABI when someone needs to add yet another crucial field like grandmother’s birthday. My idea to solve some of those problems was adding side data — something that is passed along the main data (e.g. packet) and decoders don’t have to care about it. It would be even better to make it more generic, so you don’t have to care about enums for that either. For instance, most of the codecs don’t have to care about broadcast grade metadata (but some containers and codecs like ATSC A/52 provide a lot of it) or stupid DVD shit (pan&scan anyone?). So if demuxer or decoder wants to provide it — fine, just don’t clutter existing structures with it, add it to metadata and if consumer (encoder/muxer/application) cares it can check whether such non-standard information is present and use it. That’s the general approach I want to have, quite similar to FCC certification rule: producers (any code that outputs data) can have any kind of additional data but consumers (code that takes that data for input) do not have to care about it and can ignore it freely. It’s easy to add options marked as essential (like PNG chunks — they are self-marked so that you can distinguish chunks that can be ignored from those that should be handled in any case) to ensure that this option won’t be ignored and input handler can error out on not understanding it.
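
Here is one way the "ignore freely unless marked essential" rule could look in code. This is only a sketch: NASideData, side_data_check and the key names used with it are hypothetical, and string keys are used precisely so that no enum has to be extended.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical generic side data entry attached to a packet or frame. */
typedef struct {
    const char *key;        /* free-form string, no enum to extend */
    const char *value;
    int         essential;  /* non-zero: consumer must understand it */
} NASideData;

/* Consumer-side check. known[] lists the keys this consumer understands.
 * Returns 0 if processing may continue, -1 if an essential entry is not
 * understood. *ignored receives the number of entries skipped. */
static int side_data_check(const NASideData *sd, size_t nsd,
                           const char *const *known, size_t nknown,
                           size_t *ignored)
{
    assert(ignored && (sd || nsd == 0) && (known || nknown == 0));
    *ignored = 0;
    for (size_t i = 0; i < nsd; i++) {
        int understood = 0;
        for (size_t k = 0; k < nknown; k++)
            if (strcmp(sd[i].key, known[k]) == 0)
                understood = 1;
        if (understood)
            continue;
        if (sd[i].essential)
            return -1;      /* error out instead of silently ignoring */
        (*ignored)++;       /* optional data: skip it and move on */
    }
    return 0;
}
```

A producer can attach whatever it likes; the only contract is the essential flag, much like the ancillary bit in PNG chunk names.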

As for proper function calls — Luca has described it quite well here (pity no-one reads his blog).

NihAV — A New Approach to Multimedia Pt. 4

April 24th, 2015

On colourspaces and such

I think current situation with pixel formats is brain-damaged as well. You have a list of pixel formats longer than two arms and yet it’s insufficient for many use cases (e.g. Canopus HQX needs 12-bit YUVA422 but there’s no such format supported and thus 16-bit had to be used instead; the same goes for ProRes with 8- or 16-bit alpha channel and 10-bit YUV). In this case it’s much better to have pixel format descriptor with all essential properties covered and all exotic stuff (e.g. Bayer to RGB conversion coefficients) in options. Why introduce a dozen IDs for packed raw formats when you can describe them in a uniform way (i.e. read it as big/little-endian, use these shifts and masks to extract components etc.)? Even if you need to convert YUV with different subsampling for chroma planes (can happen in JPEG) into some special packed 10-bit RGB format you can simply pass those pixel format descriptors to the library and it will handle it despite encountering such formats for the first time.
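
A sketch of the "shifts and masks" idea for packed formats. The NACompDesc and NAPackedFormat structures and na_unpack_pixel are assumptions made up for this example (this is not the test code mentioned in the P.S. below); a full descriptor would also have to cover planar layouts and subsampling.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical descriptor of one component inside a packed pixel. */
typedef struct {
    uint8_t shift;  /* position of the lowest bit of the component */
    uint8_t depth;  /* number of bits, 1..16 */
} NACompDesc;

/* Hypothetical descriptor of a packed pixel format. */
typedef struct {
    int        bytes_per_pixel;  /* 1..4 */
    int        big_endian;       /* how to assemble the bytes into a word */
    int        ncomp;            /* number of components, up to 4 */
    NACompDesc comp[4];
} NAPackedFormat;

/* Extracts all components of one pixel using only the descriptor. */
static void na_unpack_pixel(const NAPackedFormat *f, const uint8_t *src,
                            unsigned out[4])
{
    uint32_t v = 0;

    assert(f && src && out);
    assert(f->bytes_per_pixel >= 1 && f->bytes_per_pixel <= 4);
    assert(f->ncomp >= 1 && f->ncomp <= 4);
    for (int i = 0; i < f->bytes_per_pixel; i++) {
        if (f->big_endian)
            v = (v << 8) | src[i];
        else
            v |= (uint32_t)src[i] << (8 * i);
    }
    for (int c = 0; c < f->ncomp; c++) {
        assert(f->comp[c].depth >= 1 && f->comp[c].depth <= 16);
        out[c] = (v >> f->comp[c].shift) & ((1u << f->comp[c].depth) - 1);
    }
}
```

RGB565 in either byte order is then just data, e.g. { 2, 0, 3, { {11, 5}, {5, 6}, {0, 5} } } for the little-endian variant, and a format the library has never seen before needs a new descriptor rather than a new ID.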

P.S. I actually wrote some test code to demonstrate that idea but no-one got interested in it.

NihAV — A New Approach to Multimedia Pt. 3

April 24th, 2015

More on codecs handling

First of all, people are often AVI-centric and decide that you can always use 4-character code to identify a codec. Well, technically it’s true because there are significantly fewer than 4 billion codecs in existence (I hope). The problem is uneven mapping — MPEG containers use integers for codec IDs, AVI uses 4-character code for video and 2-byte integer for audio, MOV uses 4-character code for both audio and video, Matroska uses long strings like V_MPEG4/MS/V3 etc etc. So in any case you have a problem of mapping codecs found by demuxers to internal decoders. In libavcodec it’s handled by having an insane enumeration of codec IDs and I’ve mentioned in part 2 that I’m not a fan of such approach.

So what do I suggest instead? A global registry of codec names in string form. And splitting out media information database explicitly. After all, why not provide some codec information even if we cannot support it? Less effort when you add a new decoder and you can query some information about codec even if it’s not supported. Demuxer maps internal ID to codec name (if it can), codec database can be queried about that codec at any time to see what information is known about it and a decoder can be requested for that codec as well.

Here’s an example:

  1. Bink demuxer encounters KB2g;
  2. It reports binkvideo2 decoder;
  3. (optional) From database one can retrieve its name — “Bink Video 2”;
  4. A decoder for binkvideo2 is requested for it but that request fails because no-one has bothered to write such decoder;
  5. Or a decoder implemented by a special plugin that calls TotallyRADVideo.dll is called.

Just replace enum with string and you get better flexibility and only VideoLAN won’t like it.
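
The example flow above could be sketched like this. All names here (NACodecInfo, NATagMapping, na_codec_from_tag, na_codec_description, na_decoder_available, the "bink" container label and the "rawvideo" decoder entry) are hypothetical and only show that the three lookups are independent of each other.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

typedef struct { const char *name, *description; } NACodecInfo;
typedef struct { const char *container, *tag, *name; } NATagMapping;

/* Information database: may describe codecs nobody can decode. */
static const NACodecInfo codec_db[] = {
    { "binkvideo2", "Bink Video 2" },
};
/* Container tag to codec name mapping, kept outside the demuxers. */
static const NATagMapping tag_map[] = {
    { "bink", "KB2g", "binkvideo2" },
};
/* Names of the decoders actually compiled in (made-up content). */
static const char *const decoders[] = { "rawvideo" };

static const char *na_codec_from_tag(const char *container, const char *tag)
{
    assert(container && tag);
    for (size_t i = 0; i < ARRAY_LEN(tag_map); i++)
        if (!strcmp(tag_map[i].container, container) &&
            !strcmp(tag_map[i].tag, tag))
            return tag_map[i].name;
    return NULL;
}

static const char *na_codec_description(const char *name)
{
    assert(name);
    for (size_t i = 0; i < ARRAY_LEN(codec_db); i++)
        if (!strcmp(codec_db[i].name, name))
            return codec_db[i].description;
    return NULL;
}

static int na_decoder_available(const char *name)
{
    assert(name);
    for (size_t i = 0; i < ARRAY_LEN(decoders); i++)
        if (!strcmp(decoders[i], name))
            return 1;
    return 0;
}
```

Adding a decoder (or a plugin that provides one) only touches the last table; the tag mapping and the description are already there and can be queried regardless.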