Archive for the ‘Useless Rants’ Category

NihAV — A New Approach to Multimedia Pt. 3

Friday, April 24th, 2015

More on codecs handling

First of all, people are often AVI-centric and decide that you can always use 4-character code to identify a codec. Well, technically it’s true because there’s significantly less than 4 billion codecs in existence (I hope). The problem is uneven mapping — MPEG containers use integers for codec IDs, AVI uses 4-character code for video and 2-byte integer for audio, MOV uses 4-character code for both audio and video, Matroska uses long strings like V_MPEG4/MS/V3 etc etc. So in any case you have a problem of mapping codecs found by demuxers to internal decoders. In libavcodec it’s handled by having an insane enumeration of codec IDs and I’ve mentioned in part 2 that I’m not a fan of such approach.

So what I suggest instead? A global registry of codec names in string form. And splitting out media information database explicitly. After all, why not provide some codec information even if we cannot support it? Less effort when you add a new decoder and you can query some information about codec even if it’s not supported. Demuxer maps internal ID to codec name (if it can), codec database can be queried about that codec at any time to see what information is known about it and a decoder can be requested for that codec as well.

Here’s an example:

  1. Bink demuxer encounters KB2g;
  2. It reports binkvideo2 decoder;
  3. (optional) From database one can retrieve its name — “Bink Video 2”;
  4. A decoder for binkvideo2 is requested for it but that request is failed because noone has bothered to write such decoder;
  5. Or a decoder implemented by a special plugin that calls TotallyRADVideo.dll is called.

Just replace enum with string and you get better flexibility and only VideoLAN won’t like it.

NihAV — A New Approach to Multimedia Pt. 2

Thursday, April 23rd, 2015

Common design principles

I’d been participating in FFmpeg and then Libav development for about ten years and I’ve touched many parts of its codebase except for libavfilter and libavresample, so I know what I dislike in its design.

Enumerations. Maybe people like them but I think it’s much better to have list of string identifiers instead. You still specify codec or format or protocol by name in command line, why should code have that bulky and incompatible enumeration? It would be more convenient for library user to use string identifier — you try to find format handler for a given name and if you don’t have it or its support is disabled then no luck (of course VideoLAN prefers enums but that’s their problem).

Large pointless structures. AVCodecContext and AVFrame are good examples of that (especially the old versions). They lug around many members that are applicable only to very limited subset of video codecs and nothing else. A much better approach IMO would be to have substructures with minimal information needed for all audio/video/subtitle data (both in frame and context) and the rest is put into dictionary (maybe as subobjects, like motion information or rate control information structures).

API variations. Current approach is to shoehorn everything into specific structure. My opinion is that public functions should take as flexible (or simple) input as possible and do the same with output. For example, why have avcodec_decode_video2(), avcodec_decode_audio4() and avcodec_decode_subtitle2() if single function is enough? You feed input bytes and you obtain output bytes — no matter what you actually do (encode, decode, filter or pass through). Anything optional should be passed as optional — in a dictionary for example.

Various stuff. Parsing, probing, timestamp handling. All these things need to be reinvented because it’s hard to imagine them being much worse than they are or were a couple years ago.

I’d also like to have some small building blocks for codecs. In libavcodec many video decoders were forced to be built around MpegEncContext and noone likes that structure (except one guy who even named a video player after it but then again he doesn’t want to disclose his real name…). I prefer to have more independent decoders reusing the same methods somehow (e.g. this codec needs this frame management, this motion compensation). How to implement it, boost::codec::video::block_decoder templating and macros or function pointers for codec-specific functions (like block decoding) is yet to be conceived.

To be continued eventually…

NihAV — A New Approach to Multimedia Pt. 1

Thursday, April 23rd, 2015

Foreword or Why?!?!

There are two curses in program design (among many others) — legacy and monolithic design.

Legacy means two things: first, there is such thing ask backward compatibility that you (sometimes have to maintain) or the users will complain about broken APIs and ABIs; second, there’s code legacy, i.e. decisions taken in the past that are kept for some reason (e.g. noone understands how it works). Like AVI demuxer in libavformat containing special cases for handling specific files that noone has ever seen.

Monolithic design is yet another problem that creeps into many projects with time. I don’t know why but quite often code gathers itself into intangible chunks and with time those chunks grow and get uglier. Anyone worked with FFmpeg might take a pleasure looking at mpegvideo in libavcodec, libswscale and libpostproc (especially if you look at the versions from about 2010).

So there are two ways how to deal with it — evolution (slowly change interfaces in hope to be better one day, deprecate stuff etc.) and revolution (simply forget it and write a new stuff from scratch).

In this and following posts I’ll describe a new framework (or whatever buzzword applies here) NihAV (Not-Invented-Here Audio-Video). Maybe I’ll even implement it for my own needs and the name should hint how much I care about existing design decisions.

Decompilation Horror

Saturday, April 18th, 2015

In the old days I found PackBits (also DxTory) decoding routine monstrous. That Japanese codec had a single decoding function 349549 bytes long (0x1003DFC00x1009352D) and that was bad style in my opinion.

Well, what do you know? Recently I’ve looked at AMV3 codec. Its encode function is 445048 bytes long (0x10160C200x101CD698). And decode function? 1439210 bytes (0x100011500x1016073A)! I’ve seen many decoders smaller than that function alone.

There’s one thing common to those two codecs that might explain this — both DxTory/PackBits and AMV3 are Japanese codecs. It might be their programming practice (no, it’s not that bad) but remember that other codecs have crappy code too for other reasons. And some of them actually look better in compiled form than in source form (hello there, Ad*be and Micro$oft!). Yet I find it somewhat easier to deal with code that doesn’t frighten IDA (it refuses to show those functions in graph form because of too many nodes and maybe I’ll run decompiler on decode function in autumn – because it will keep my apartment warm till spring).

Vector Quantisation Codecs are Still not (semi-kinda) Dead!

Thursday, April 16th, 2015

While golden days of vector quantisation codecs seem to be over (Cinepak, Smacker and such) there’s still one quite widespread use of vector quantisation in video — texture compression. And, surprisingly, there’s a couple of codecs that employ texture compression methods (good for GPU acceleration, less stuff to invent etc.) like Vidvox Hap or Resolume DXV (which looks suspiciously similar in many aspects but with some features like LZ4, LZF or YCoCg10 compression added). I have not looked that closely at either of them but looks like they still operate on small blocks, it’s just e.g. compressing each plane 8×8 block with BC4 and combining them later.

This does not seem that much interesting to me but I’m sure Vittorio will dig deeper. Good luck to him!

P.S. I forgot — which version of Firefox comes with ORBX.js support?

Some Notes on Lossless Video Codecs

Saturday, March 21st, 2015

While reading a nice PhD thesis from 2014 (in Russian) about new lossless video compression method I laughed hard at lossless video codec descriptions (here’s the link for unbelievers – http://gorkoff.ru/wp-content/uploads/articles/dissertation.pdf, translation below is mine):

To date various lossless videostream compression methods and algorithms have been developed. They are used in the following widespread codecs:

* CorePNG employs deflate algorithm for independent compression of every frame. Theoretically the codec supports delta frames but this option is not used.

* FFV1 employs prediction coding with following entropy coding or prediction error.

* Huffyuv, like FFV1 algorithm, employs predictive coding but prediction error is effectively coded with Huffman algorithm.

* MSU Lossless Video Codec has been developed for many years at Moscow State University labs.

And yet some real world tasks demand more effective compression and thus a challenge of developing new more effective lossless video compression methods still remains actual.

Readers are welcome to find inaccurate, erroneous and outright bullshit statements in this quote themselves. I’d rather talk about lossless video codecs as I know them.
(more…)

Some notes on VP4

Sunday, March 1st, 2015

Well, this information should’ve been posted by someone else but those people seem to be lazier than me. In return I’m not going to use XViD or FLIC for encoding my content.

So, REing VP4 is rather easy – you just download original VP3.2 decoder source (still available at Xiph SVN servers) and compare it to the structure in vp4vfw.dll. There are differences in structures and a bit in code layout but mostly it’s the same code with new additions.

So, VP4 is based on VP3 (surprise!) and introduces a new bitstream version (which is 3 for some reason). Here’s an incomplete list of differences I’ve spotted:

  • Base frame header has some additional fields (I didn’t care enough to decipher their meaning though);
  • Superblock coding uses a bit different scheme with new universal codes resembling exp-Golomb but with VP4 quirk;
  • Frame data decoding differs for frame types;
  • Motion vector component extraction uses Huffman tables and sign from the previous block.

And yet it uses the same coding principles and even token coding seems to be left untouched. It was suspected for a long time that even-numbered On2 codecs were simply an improvements over previous version while odd-numbered On2 codecs were more innovative but not much was known about VP4 to prove it:

  1. Duck TrueMotion 1 — a new codec;
  2. Duck TrueMotion 2 — mostly like TrueMotion 1 but with Huffman encoding;
  3. Duck/On2 TrueMotion VP3 — DCT + static Huffman coding;
  4. On2 TrueMotion VP4 — VP3 with some bitstream coding changes;
  5. On2 TrueCast VP5 — DCT + arithmetic coder;
  6. On2 VP6 — VP5 with some bitstream changes;
  7. On2 VP7 — H.264 ripoff with their own arithmetic coder;
  8. On2 VP8 — VP7 with some small changes;
  9. Baidu VP9 — H.265 ripoff with their own arithmetic coder;
  10. rumoured Baidu VP10 — since there’s no H.266 in the works for now…

It’s all kinda Intel CPUs but without confusing codenames (and Xiph hasn’t produced too many codecs to confuse whether Daalawell came before Theorabridge or after).

P.S. Many thanks to big G for releasing no information on that codec or any other codecs from On2. Oh, and is VP9 “specification” still under NDA?

P.P.S. I should really work on a game codec named after chemical warfare instead.

A Call for Modern Audio Codec

Wednesday, February 11th, 2015

We need a proper audio codec to accompany state of the art video codecs, so here’s an outline of codec features that should be present:

  • audio codec should make more of its context, it should have a system of forward and backward reference frames like B-pyramid in H.264 or H.265;
  • it should employ tonal compensation with that — track the frequency changes from the references (e.g. it may be the same note continued or changing pitch);
  • time domain prediction via FIR or IIR filters;
  • flexible subdivision into subframes like binary tree;
  • raw (or non-transformed at least) coding mode for transients or noise;
  • integer only bitexact transform that passes for MDCT under bad light;
  • high-bitdepth sound support (up to 64 bits per sample).

The project name is transGhost (hopefully no Monty will be hurt by this).

And if you point out this is stupid — well, audio codecs should have the same rights as video codecs including PTS/DTS differences and employing similar coding methods.

Why one should not be overexcited about new formats

Saturday, January 10th, 2015

Today I’ll talk about Opus and BPG and argue why they are not the silver bullets everyone was expecting.

Opus

I cannot say this is a bad codec, it has modern design (hybrid speech+music coder) and impressive performance. What’s wrong about it? Usage.

The codec is ideal for streaming, broadcasting and such. It does not have special multichannel audio, you can combine mono and stereo Opus streams in whatever way you like and you don’t have to care about passing special configuration for it in a special way.

What’s bad about that? When you try to apply it to stored media all those advantages turn into drawbacks. There was no standard way to store it (IIRC Opus-in-TS and Opus-in-MP4 specifications were developed by people that had little in common with Opus developers although some of the latter were present too). There is still one big problem with an ugly hack as “solution” — the lack of keyframes in Opus and the “solution” in form of preroll (i.e. “decode certain number of audio frames before the needed one and discard them”). And not all containers support that feature.

That reminds me of MoosePack SV1-SV7. That was a project intended to improve MPEG Audio Layer II compression and make it a new codec (yes, there’s Layer III but that was one of the reasons MoosePack, Vorbis and other audio codecs were born). It had enjoyed some limited popularity (I’ve implemented MPC decoding support for a reason) but it had two major drawbacks:

  • very brief file format — IIRC it’s just a header and audio blocks prefixed by 20-bit size and no padding to byte either (if you’ve ever worked with raw FLAC streams you should have no problems imagining how good MPC format was);
  • no intra frames — again, IIRC their solution was to simply decode and discard 12 frames before the given one in hope the sound will converge.

MusePack SV8 tried to address all those issues by making new chunked format that could be easily embedded into other containers, its audio blocks could be decoded independently because first frame in it was a keyframe. But it was too late and I don’t know who uses this format at all.

Opus is more advanced and performs better by offloading those problems to container but I still don’t think Opus is an ideal codec for all cases. If you play it continuously it’s fine, when you try to seek problems start to occur.

BPG

This is quite recent example of the idea “let’s stick intraframe coding from some video codec into image format”.

Of course such approach saves time especially if you piggyback state of the art codec but it’s not the optimal solution. Why? Because still image coding and video sequence coding have different goals and working conditions.

In video coding you have a large amount of data that you have to (de)compress efficiently but mostly under specific constraints like framerate. While coding an individual frame is important it’s much more convenient to spend efforts on evening load for decoding all frames. After all, hardly anyone would like to have first frame to be decoded in 0.8s and other 24 frames in 0.1s. That reminds me of ClearVideo which had the inverse problem – intraframes were coded very simply (just IDCT+static Huffman) and interframes employed something fractal and took much more time.

Another difference is content. For video you usually have common frame sizes (like 1920×1080 or 1280×768) and actually modern video codecs are targeted for handling bigger and bigger resolutions. Images on the other hand come in various sizes, even ridiculous ones like 173×69, and they contain stuff you usually don’t expect to be in video form — pixel art, synthetic images, line art etc. (Yes, some people care about monochrome FMV but it’s a very rare case).

Another problem is efficient coding of palettised and monochrome images, lossy or losslessly. For lossless compression it’s much better to operate on whole lines while video coding standards nowadays are block-based and specialised compression schemes beat generic ones. For instance, the same test page compresses to 80kB PNG, 56kB Group4 TIFF or 35kB JBIG image. JPEG-LS beats PNG too and both are very simple compression standards compared to even H.261.

There’s also alpha plane coding, not so many video codecs support it because of its limited use in video. You have it mostly in intermediate codecs or game ones (hello Indeo 4!). So if selected video codec doesn’t support alpha natively you have to glue it somehow (that’s what BPG does).

Thus, we come to the following points:

  • images are individually coded while video codec has to care about whole sequence;
  • images come in different sizes, video sizes are usually few standard ones;
  • images have different content that’s not always well compressed by video coder and specialised compression scheme is always better and maybe faster;
  • images might need some additional features not required by video.

This should also explain why I have some respect for WebPLL but none for WebP.

I’ve omitted obvious problems with adoption, small-power hardware and such because hardly anything beats (M)JPEG there. So next time you choose format for images choose wisely.

A Codec Family Proposal

Monday, September 29th, 2014

There are enough general use standardised codecs, there’s even VPx family for those who want more. But there are not enough niche codecs with free/open specifications.

One of such niche codecs would be an intermediate codec. It’s suitable for capturing and quick editing of video material. Main requirements are modest compression rate and fast processing (scalable is a plus too). Maybe SMPTE VC-5 will be the answer, maybe Ogg Chloe, maybe something completely different. Let’s discuss it some other time.

Another niche codec that desperately needs an open standard is screen video codec. Such codec may be also used for recording webcasts, presentations and such. And here I’d like to discuss a whole family of such codecs based on the same coding principles.

It makes sense to make codec fast by employing multithreading where possible. That’s why frame should be divided into tiles that should be not so large and not so small, maybe 192×128 pixels or so.

Each tile should be coded independently, preferably its distinct features coded separately too. It makes sense to separate tile data into smooth features (like gradients and real life pictures) and sharp transitions (like text and UI elements). Let’s call the former a natural layer and the latter a synthetic layer. We’ll need a mask to tell which layer to use for the current pixel too. And using these main blocks and employing different coding methods we can make a whole family of codecs.

Here’s the list of example codecs (with a random FOURCC assigned):

  • J-B0 — employ JPEG for natural layer and GIFPNG for mask and synthetic layer coding;
  • J-B1 — employ Snow for natural layer coding and FFV1 for synthetic layer coding;
  • J-B2 — employ JPEG-2000 for natural layer coding, JBIG for mask coding and something like PPM modeller for synthetic layer;
  • J-BG — employ WebP for natural layer and WebP LL for synthetic layer.

As one can see, it’s rather easy to build such codec since all coding blocks are there and only natural/synthetic layer separation might need a bit of research. I see no reasons why, say, VLC can’t use it for recording and streaming desktop for e.g. virtual meeting.