Archive for the ‘Various Video Codecs’ Category

H.264 specification sucks

Saturday, November 14th, 2020

So it has come to a stage where I have nothing better to do, so I tried to write an H.264 decoder for NihAV (so that I can test the future nihav-player on real content besides just sample files and cutscenes from various games). And while I’ve managed to decode at least something (more about that in the end), the specification for H.264 sucks. Don’t get me wrong, the format by itself is not that bad in design, but the way it’s documented is far from good (though it’s still serviceable—it’s not an audio codec after all).

And first, to those who want to cry “but it’s GNU/Linux, err, MPEG/AVC”: ITU H.264 was standardised in May 2003 while MPEG-4 Part 10 came in December 2003. Second, I can download the ITU specification freely (various editions of it, too) while the MPEG standard still costs money I’m not going to pay.

I guess the main problems of H.264 come from two things: its dual coding nature (i.e. slice data can be coded using either variable-length codes or a binary arithmetic coder) and the extensions (not as bad as H.263 but approaching it; here’s a simple fact to demonstrate it—the 2003 edition had 282 pages, the 2019 edition has 836 pages). Plus the fact that it codified the wrong name for Elias gamma′ codes, which I ranted on before.

Let’s start with the extensions part since most of them can be ignored and I don’t have much to say about them except for one thing—profiles. By itself the idea is good: you have a certain set of constraints and features associated with an ID so you know in advance whether you should be able to handle the stream or not. And the initial 2003 edition had three profiles (baseline/main/extended) with IDs associated with them (66, 77 and 88 correspondingly). By 2019 there were a dozen various profiles and even more profile IDs, and they are not actually mapped one to one (e.g. constrained baseline profile is baseline profile with an additional constraint_set1_flag set to one). In result you have lots of random profile IDs (can you guess what profile_idc 44 means? and 86? or 128?) and they did not bother to make a table listing all known profile IDs, so you need to search the whole specification in order to find out what they mean. I’d not care much but they affect bitstream parsing, especially the sequence parameter set where they decided to insert some additional fields in the middle for certain high profiles (a rough sketch of that parsing branch follows below).
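
Here is a minimal sketch of that branch, not NihAV’s actual code: a toy MSB-first bit reader plus the start of SPS parsing with the extra fields the high profiles insert in the middle (scaling lists and everything after them are omitted). The list of profile IDs is the one from recent editions; the interface itself is made up for illustration.

    // A toy MSB-first bit reader, just enough for the sketch below.
    struct BitReader<'a> { src: &'a [u8], pos: usize }

    impl<'a> BitReader<'a> {
        fn read_bit(&mut self) -> u32 {
            let bit = (self.src[self.pos >> 3] >> (7 - (self.pos & 7))) & 1;
            self.pos += 1;
            u32::from(bit)
        }
        fn read(&mut self, nbits: u32) -> u32 {
            (0..nbits).fold(0, |acc, _| (acc << 1) | self.read_bit())
        }
        fn read_ue(&mut self) -> u32 { // Exp-Golomb aka ue(v), aka Elias gamma'
            let mut zeroes = 0;
            while self.read_bit() == 0 { zeroes += 1; }
            (1 << zeroes) - 1 + self.read(zeroes)
        }
    }

    // the profile IDs that trigger the extra SPS fields (as listed in recent editions)
    const HIGH_PROFILES: &[u32] = &[100, 110, 122, 244, 44, 83, 86, 118, 128, 138, 139, 134, 135];

    fn parse_sps_start(br: &mut BitReader) {
        let profile_idc = br.read(8);
        let _constraint_flags = br.read(8); // constraint_set0..5_flag plus reserved bits
        let _level_idc = br.read(8);
        let _sps_id = br.read_ue();
        if HIGH_PROFILES.contains(&profile_idc) {
            // the fields inserted in the middle for the high profiles
            let chroma_format_idc = br.read_ue();
            if chroma_format_idc == 3 {
                let _separate_colour_plane_flag = br.read(1);
            }
            let _bit_depth_luma = br.read_ue() + 8;
            let _bit_depth_chroma = br.read_ue() + 8;
            let _transform_bypass_flag = br.read(1);
            if br.read(1) == 1 {
                // seq_scaling_matrix_present_flag is set, scaling lists would be parsed here
            }
        }
        // ...log2_max_frame_num_minus4 and the rest continue as in the 2003 edition
    }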

Now the more exciting part: coding. While I understand the rationale (you have a simpler and faster or a slower but more effective (de)coding mode while using the same ways to transform data), it created some problems for describing it. Because of that decision you have to look at three different places in order to understand what and how to decode: the syntax tables in 7.3 which show in which order and under which conditions elements are coded, the semantics in 7.4 telling you what each element actually means and what limitations or values it has, and 9.2 or 9.3 for explanations on how a certain element should actually be decoded from the bitstream. And confusingly enough the coded block pattern is put into 9.1.2 while it would be more logical to join it with 9.2, as 9.1 is for parsing generic codes used not just in slice data but in various headers as well and 9.2 deals with parsing custom codes for non-CABAC slice data.

And it gets even worse for CABAC parsing. For those who don’t know what it is, that abbreviation means context-adaptive binary arithmetic coding. In other words it represents various values as sequences of bits and codes each bit using its own context. And if you ask yourself how the values are represented and which contexts are used for each bit then you’ve pointed right at the problem. In the standard you have it all spread over three or four places: one table to tell you which range of contexts to use for a certain element, some description or a separate table for the possible bit strings, another table or two to tell you which contexts should be used for each bit in various cases (e.g. for ctxIdxOffset=36 you have these context offsets for the following bits: 0, 1, (2 or 3), 3, 3, 3), and finally an entry that tells you how to select a context for the first bit if it depends on already decoded data (usually by checking whether the top and left (macro)blocks have the same thing coded or not). Of course it’s especially fun when different bit contexts are reused for different bit positions or the same bit positions can have different contexts depending on the previously decoded bit string (this happens mostly for macroblock types in P/SP/B-slices but it’s still confusing). My guess is that they tried to optimise the total number of contexts and thus merged the least used ones. In result you have about 20 pages of context data initialisation in the 2019 edition (in the initial editions of both H.264 and H.EVC it’s just eight pages)—compare that to the almost hundred pages of default CDFs in the AV1 specification. And the CABAC part in H.265 is somehow much easier to comprehend (probably because they made the format less dependent on special bit strings and put some of the simpler conditions straight into the binarisation table).
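
To give one small example of the kind of rule that ends up spread over those clauses, the neighbour-dependent context selection for the first bin usually boils down to something like this (using mb_skip_flag; the function name and arguments here are mine, the rule itself is the usual condTermFlagA + condTermFlagB pattern):

    // A sketch of the recurring neighbour rule using mb_skip_flag as the example: each
    // available neighbour that is not skipped adds one to the context increment, so the
    // single bin of mb_skip_flag picks one of three contexts.
    fn mb_skip_ctx_inc(left_avail: bool, left_skipped: bool,
                       top_avail: bool, top_skipped: bool) -> usize {
        let cond_a = (left_avail && !left_skipped) as usize;
        let cond_b = (top_avail && !top_skipped) as usize;
        cond_a + cond_b // added to the base context offset for mb_skip_flag
    }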

To me it seems that the people describing CABAC coding (not the coder itself but rather how it’s used to code data) did not understand it well themselves (or at least could not convey the meaning clearly). And despite the principle of documenting the format from the decoder’s point of view (i.e. what bits it should read and how to act on them in order to decode the bitstream), a lot of CABAC coding is documented from the encoder’s point of view (i.e. what bits you should write for a syntax element instead of what reading certain bits would produce). An egregious example of that is the so-called UEGk binarisation. In addition to the things mentioned above it also has a rather meaningless parameter name, uCoff (which would normally be called something like escape value). How would I describe decoding it: read a truncated unary sequence up to escape_len; if the read value is equal to escape_len then read an additional escape value as an exp-Golomb code shifted by k with a trailing k-bit value, otherwise the escape value is set to zero. Add the escape value to the initial one and if the value is non-zero and should be signed, read the sign. Section 9.2.3.2 spends a whole page on it with a third of it being C code for writing the value.
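
And here is roughly what that one-paragraph description amounts to in code. This is a sketch only: decode_bin and decode_bypass stand in for whatever CABAC decoder calls are actually used (per-bin context selection for the prefix is omitted) and the names are made up.

    // A sketch of UEGk decoding following the description above. `decode_bin` reads one
    // context-coded bin, `decode_bypass` reads one bypass-coded bin.
    fn decode_uegk(decode_bin: &mut impl FnMut() -> u32,
                   decode_bypass: &mut impl FnMut() -> u32,
                   ucoff: u32, k0: u32, signed: bool) -> i32 {
        // truncated unary prefix, at most `ucoff` ones
        let mut value = 0u32;
        while value < ucoff && decode_bin() == 1 {
            value += 1;
        }
        // escape part (bypass-coded exp-Golomb of order k), present only if the prefix hit the cap
        if value == ucoff {
            let mut k = k0;
            while decode_bypass() == 1 { // every leading '1' adds 2^k and bumps k
                value += 1 << k;
                k += 1;
            }
            for i in (0..k).rev() {      // then k remainder bits, MSB first
                value += decode_bypass() << i;
            }
        }
        // optional sign, coded only for non-zero values
        if signed && value != 0 && decode_bypass() == 1 {
            -(value as i32)
        } else {
            value as i32
        }
    }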

I hope I made it clear why the H.264 specification sucks in my opinion. Again, the format itself is logical, but comprehending certain parts of the specification describing it takes significantly more time than it should and I wanted to point out why. It was still possible to write a decoder using mostly the specification and referring to other decoders’ source code only when it was completely unclear or worked against expectations (and JM is still not the best codebase to look at either, HM got much better in that aspect).

P.S. For those zero people who care about the NihAV decoder: I’ve managed to decode two random videos downloaded from BaidUTube (funny how one of them turned out to be a simple CAVLC-coded video with no B-frames at all) without B-frames and without apparent artefacts in the first hundred frames. There’s still a lot of work to make it decode data correctly (currently it lacks even the loop filter and probably still has bugs), plus beside the dreaded B-frames with their co-located MVs there are still some features like 8×8 DCTs or high-bitdepth support I’d like to have (but definitely no interlaced support or scalable/multiview shit). It should be good enough to play the content I care about and that’s all; I do not want to waste an enormous amount of time making it a perfect piece of software that supports all possible H.264/AVC features and is the fastest one too.

A Modest Proposal for AV2

Wednesday, September 16th, 2020

Occasionally I look at the experiments in the AV1 repository that should be the base for AV2 (unless Baidu rolls out VP11 from its private repository to replace it entirely). A year ago they added an intra mode predictor based on a neural network and in August they added a neural-network-based loop filter experiment as well. So, to make AV2 both simpler to implement in hardware and better in compression efficiency I propose to switch all possible coding tools to misapplied statistics. This way it can also attract more people from the corresponding field to compensate for the lack of video compression experts. Considering the amount of pixels (let alone the ways to encode them) in a modern video it is BigData™ indeed.

Anyway, here is what I propose specifically:

  • expand intra mode prediction neural networks to predict block subdivision mode and coding mode for each part (including transform selection);
  • replace plane intra prediction with a trained neural network to reconstruct block from neighbours;
  • switch motion vector prediction to use neural network for prediction from neighbouring blocks in current and reference frames (the schemes in modern video codecs become too convoluted anyway);
  • come to think about it, neural network can simply output some weights for mixing several references in one block;
  • maybe even make a leap and ditch all the transforms for reconstructing block from coefficients directly by the model as well.

In result we’ll have a rather simple codec with most blocks being neural networks doing specific tasks, an arithmetic coder to provide input values, some logic to connect those blocks together, and some leftover DSP routines, though I’m not sure we’ll need them at this stage. This will also greatly simplify the encoder, as it will be more about producing fitting model weights than about trying some limited set of encoding combinations. And it may also be the first truly next-generation video codec after H.261, paving the road to radically different video codecs.

From the hardware implementation point of view this will be a win too: you just need some ROM and RAM for the models plus a generic tensor accelerator (which have become common these days), and there’s no need to design those custom DSP blocks.

P.S. Of course it may initially be slow and work in the range of thousands of FPS (frames per season), but I’m not going to use AV1, let alone AV2, so why should I care?

A Quick Look on LCEVC

Wednesday, July 29th, 2020

As you might’ve heard, MPEG is essentially no more. And the last noticeable thing related to video coding it did was MPEG-5 (that, and synthesising actors and issuing commands to them via an unholy union of the MPEG-G and MPEG-4 standards). In result we have an abuse of the letter ‘e’—in HEVC, EVC and LCEVC it means three different things. I’ll talk about VVC probably when the AV2 specification is available, EVC is a slightly enhanced AVC, and LCEVC is interesting. And since I was able to locate a DIS for it, why not give a review of it?

LCEVC is based on Perseus and as such it’s still an interesting concept. For starters, it is not an independent codec but an enhancement layer to add scalability to other video codecs, somewhat like video SBR but hopefully it will remain more independent.

A good deal of the specification is copied from H.264, probably because nobody in the industry can take a codec without NALs, SEIs and HRD seriously (I know their importance but here it still feels excessive). Regardless, here is what I understood from the description while suffering from thermal throttling.

The underlying idea is quite simple and hasn’t changed since Perseus: you take a base frame, upscale it, add the high-frequency differences and display the result. The differences are first grouped into 4×4 or 8×8 blocks, transformed with a Walsh-Hadamard matrix or a modified Walsh-Hadamard matrix (with some coefficients being zeroed out), quantised and coded. Coding is done in two phases: first there is a compaction stage where coefficients are turned into a byte stream with flags for zero runs and large values (or RLE just for zeroes and ones), and then it can be packed further with Huffman codes. I guess that there are essentially two modes: a faster one where coefficient data is stored as bytes (with or without RLE) and a slightly better compressed mode where those values are further packed with Huffman codes generated per tile.
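
To illustrate just the core idea (none of the actual LCEVC syntax, and the real matrices and normalisation certainly differ from this guess): inverse-transform a 4×4 residual block with a separable Walsh-Hadamard transform and add it on top of the already upscaled base plane.

    // 4-point Walsh-Hadamard butterfly (natural ordering)
    fn hadamard4(a: [i32; 4]) -> [i32; 4] {
        let (t0, t1) = (a[0] + a[1], a[0] - a[1]);
        let (t2, t3) = (a[2] + a[3], a[2] - a[3]);
        [t0 + t2, t1 + t3, t0 - t2, t1 - t3]
    }

    // "inverse" transform of one 4x4 block of dequantised coefficients (rows, then columns)
    fn inverse_wht4x4(coeffs: &mut [[i32; 4]; 4]) {
        for row in coeffs.iter_mut() {
            *row = hadamard4(*row);
        }
        for x in 0..4 {
            let col = hadamard4([coeffs[0][x], coeffs[1][x], coeffs[2][x], coeffs[3][x]]);
            for y in 0..4 {
                coeffs[y][x] = col[y] >> 4; // crude normalisation, assumed
            }
        }
    }

    // add the residual block on top of the already upscaled base plane
    fn add_residual(plane: &mut [u8], stride: usize, x: usize, y: usize, res: &[[i32; 4]; 4]) {
        for (dy, row) in res.iter().enumerate() {
            for (dx, &diff) in row.iter().enumerate() {
                let p = &mut plane[(y + dy) * stride + x + dx];
                *p = (i32::from(*p) + diff).clamp(0, 255) as u8;
            }
        }
    }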

Overall this looks like a neat scheme and I hope it will have at least some success. No, not to prove right Chiariglione’s new approach for introducing new codecs the industry can use without complex patent licensing, but rather because it might be the only recent major video codec built on principles different from the H.26x line, and its success may encourage more radically different codecs so my codec world will get less boring.

Reviewing AV1 Features

Saturday, March 21st, 2020

Since we have this wonderful situation in Europe and I need to stay at home, why not do something useless and comment on the features of AV1, especially since there’s a nice paper from (some of?) the original authors available here. In this post I’ll try to review it and give my comments on various details presented there.

First of all I’d like to note that the paper has 21 authors for a review that could be done by a single person. I guess this was done to give academic credit to the people involved and I have no problems with that (also I should note that even if two of the fourteen pages are short authors’ biographies, they were probably the most interesting part of the paper to me).

MidiVid codec family

Thursday, September 26th, 2019

VP7 is such a nice codec that I decided to distract myself a little with something else. And that something else turned out to be the MidiVid codec family, which is quite peculiar and somehow reminiscent of Duck codecs.

The family consists of three codecs:

  1. MidiVid — the original codec based on LZSS and vector quantisation;
  2. MidiVid Lossless — exactly what it says on the tin, based on LZSS and a bunch of other technologies;
  3. MidiVid 3 — a codec based on a simplified integer DCT and a single codebook for all values.

I’ve actually added a MidiVid decoder to NihAV because it’s simple (two hundred lines including boilerplate and tests) and way more fun than working on the VP7 decoder. Now I’ll describe these codecs and hopefully you’ll understand why they remind me of Duck codecs despite not being similar in design.

MidiVid

This is a simple hold-and-modify video codec that was used in some games back in the PS2/Xbox era. The frame data can be stored either unpacked or packed with LZSS and it contains the following kinds of data: a change mask for 8×8 blocks (in case of an interframe—if a bit is zero then leave the block as is, otherwise decode new data for it), 4×4 block codebook data (up to 512 entries), high bits for 9-bit indices (if we have 257-512 distinct blocks) and 8-bit indices into the codebook.

The interesting part is that the LZSS scheme looked very familiar, and indeed it looks almost exactly like lzss.c from the LZARI author (remember that? I still do); the only differences are that it does not use a pre-filled window and the flags are grouped into a 16-bit word instead of a single byte.
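
For reference, here is a sketch of that scheme as I understand it from the description: the classic lzss.c decompression loop with a 4096-byte window, except the window starts empty and the flag bits come in 16-bit groups. The little-endian flag word and the 12+4-bit offset/length packing are assumptions carried over from the original lzss.c rather than verified details.

    // A sketch of the described LZSS variant; flag word endianness and the 12+4-bit
    // back-reference layout are assumed to match the original lzss.c.
    fn lzss_unpack(src: &[u8], dst: &mut Vec<u8>) {
        const N: usize = 4096;       // window size, as in the classic lzss.c
        const THRESHOLD: usize = 2;  // matches shorter than this are not coded
        let mut window = [0u8; N];   // not pre-filled with spaces, unlike the original
        let mut wpos = 0usize;       // window write position (assumed to start at 0)
        let mut pos = 0usize;

        while pos + 1 < src.len() {
            // flags for the next 16 items come as one 16-bit word instead of a byte
            let flags = u16::from(src[pos]) | (u16::from(src[pos + 1]) << 8);
            pos += 2;
            for bit in 0..16 {
                if pos >= src.len() { return; }
                if ((flags >> bit) & 1) != 0 {
                    // literal byte
                    let b = src[pos];
                    pos += 1;
                    dst.push(b);
                    window[wpos] = b;
                    wpos = (wpos + 1) % N;
                } else {
                    // back-reference: 12-bit window position plus 4-bit length
                    if pos + 1 >= src.len() { return; }
                    let b0 = usize::from(src[pos]);
                    let b1 = usize::from(src[pos + 1]);
                    pos += 2;
                    let mut mpos = b0 | ((b1 & 0xF0) << 4);
                    let len = (b1 & 0x0F) + THRESHOLD + 1;
                    for _ in 0..len {
                        let b = window[mpos];
                        mpos = (mpos + 1) % N;
                        dst.push(b);
                        window[wpos] = b;
                        wpos = (wpos + 1) % N;
                    }
                }
            }
        }
    }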

MidiVid Lossless

This one is a special beast as it combines two completely different compression methods: the same LZSS as before and something taken from a BWT-based compressor (to the point that the frame header contains FTWB or ZTWB IDs).

I’m positively convinced it was copied from some BWT-based compressor not just because of those IDs but also because it seems to employ the same methods as some old BWT-based compressor except for the Burrows–Wheeler transform itself (that would be too much for the old codecs): various data preprocessing methods (signalled by flags in the frame header), move-to-front coding (in its classical 1-2 coding form that does not update the first two positions that much) plus coding coefficients in two groups: first just zero/one/large using an order-3 adaptive model and then the values larger than one using a single order-1 adaptive model. What made it suspicious? The preprocessing methods.

MVLZ has different kinds of preprocessing methods: something looking like distance coding, static n-gram replacement, table prediction (i.e. when data is treated as a series of n-bit numbers and the actual numbers are replaced with the difference from the previous ones) and x86 call preprocessing (i.e. that trick where you change a function call address from relative to absolute for a better compression ratio and then undo it during decompression; known also as E8 preprocessing because the x86 call opcode is E8 <32-bit offset> and it’s easy to just replace those instead of adding a full disassembler to the archiver). I had my suspicions already with the n-gram replacement (that one is quite stupid for video codecs and it only replaces some values with binary values that look more related to machine code than to video) but the last item was a dead give-away. I’m pretty sure that somebody who knows the open-source BWT compressors of the late 1990s will recognise it even from this description, but sadly I was not following that scene closely, being more attracted to multimedia.
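
In case the trick is unfamiliar, here is a sketch of E8 preprocessing in its generic form (whether MVLZ processes the whole buffer exactly like this is an assumption on my part):

    // E8 preprocessing in its generic form: x86 CALL (opcode 0xE8) is followed by a 32-bit
    // relative offset; turning it into an absolute address makes repeated calls to the same
    // function compress better, and the decompressor applies the exact inverse.
    fn e8_transform(buf: &mut [u8], encode: bool) {
        let mut i = 0;
        while i + 5 <= buf.len() {
            if buf[i] == 0xE8 {
                let rel = i32::from_le_bytes([buf[i + 1], buf[i + 2], buf[i + 3], buf[i + 4]]);
                let base = i as i32;
                let new = if encode { rel.wrapping_add(base) } else { rel.wrapping_sub(base) };
                buf[i + 1..i + 5].copy_from_slice(&new.to_le_bytes());
                i += 5;
            } else {
                i += 1;
            }
        }
    }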

MidiVid 3

This codec is based on a static codebook for packing all values: block types, motion vectors and the actual coefficients. Each block in a macroblock can be coded with one of four modes: empty (fill with 0x80 in case of intra), DC only, DCT with just a few coefficients, and full DCT. As usual various kinds of data are grouped and coded as a single array.

Motion compensation is full-pixel and unlike its predecessor it operates in YUV420 format.


This was an interesting detour but I have to return to failing to start writing the VP7 decoder.

P.S. I’ll try to document them with more details in the wiki soon.
P.P.S. This should’ve been a post about railways instead but I guess it will have to wait.

Why I am sceptical about AV1

Friday, December 7th, 2018

I’ve wanted to write this post for a while, but I guess AV1.1 is the cherry on top of the whole mess called AV1 that finally made me do it.

Since the time I first heard about AV1 I have tried to follow its development as much as is possible for a person not subscribed to their main mailing list. And unfortunately, while we all expected a great new codec with cool ideas, we got VP10 instead (yes, I still believe that AV1 is short for “A Vp 1(0) codec”). Below I try to elaborate on my view and present the facts I know that formed my opinion.

A promising beginning

It all started with ITU H.EVC and its licensing—or rather its inability to be licensed. In case you forgot the situation, here’s a summary: there are at least three licensing entities that claim to have patents on HEVC that you need to license in order to use HEVC legally. Plus the licensing terms are much greedier than what we had for H.264, to the point where one licensing pool wanted to charge fees for streaming IIRC.

So it was natural that the major companies operating video services on the Internet wanted to stay out of this and use some common license-free codec, resorting to creating one if the need arose.

That was a noble goal that only HEVC patent holders might object to, so the Alliance for Open Media (or AOM for short) was formed. I am not sure about the details but IIRC only organisations could join and they had to pay an entrance fee (or be sponsored—IIRC VideoLAN got sponsored by Mozilla) and the development process was coordinated via a members-only mailing list (since I’m not a member I cannot say what actually happened there or how and have to rely on second- or third-hand information). And that is the first thing that I did not like—the process not being open enough. I understand that they might not have wanted some ideas leaked out to the competitors, but even people who were present on that list claim some decisions were questionable at best.

Call for features

In the beginning there were three outlooks on how it would go:

  • AV1 will be a great new codec that will select only the best ideas from all participants (a la MPEG but without their political decisions) and thus it will put H.266 to shame—that’s what optimists thought;
  • AV1 will be a great new codec that will select only the best ideas and since all of those ideas come from Xiph it will be simply Daala under new name—that’s what cultists thought;
  • Baidu wants to push its VP10 on others but since VP8 and VP9 had very limited success it will create an illusion of participation so other companies will feel they’ve contributed something and spread it out (also maybe it will be used mostly for bargaining better licensing terms for some H.26x codecs)—that’s what I thought.

And it looks like all those opinions were wrong. AV1 is not that great, especially considering its complexity (we’ll talk about it later); its features were not always selected based on merit (so most of the Daala stuff was thrown out in the end—but more about that later); and it looks like the main goal was to interest hardware manufacturers in its acceptance (again, more on that later).

Anyway, let’s look at what the main feature proposals were (again, I could not follow it all so maybe there were more):

  • Baidu libvpx with current development snapshot of VP10;
  • Baidu alternative approach to VP10 using Asymmetric Numeric Systems coding;
  • Cisco’s simplified version of ITU H.EVC aka Thor codec (that looks more like RealVideo 6 in result) with some in-house developed filters that improve compression;
  • Mozilla-supported Daala ideas from Xiph.

But it started with a scandal, since Baidu tried to patent an ANS-based video codec (essentially just the idea of a video codec that uses ANS coding) after accepting the ANS inventor’s help and ignoring his existence and wishes afterwards.

And of course they had to use libvpx as the base because. Just because.

Winnowing

So after the initial gathering of ideas it was time to put them all to test to see which ones to select and which ones to reject.

Of course since organisations are not that happy with trying something radically new, AV1 was built on the existing ideas with three main areas where new ideas were compared:

  1. prediction (intra or inter);
  2. coefficient coding;
  3. transform.

I don’t think there were attempts to change the overall codec structure. To clarify: ITU H.263 used an 8×8 DCT and its intra prediction consisted of copying the top row or left column of coefficients from the reference block, ITU H.264 used a 4×4 integer transform and tried to fill a block from its neighbours’ already reconstructed pixel data, ITU H.265 used variable-size integer transforms (from 4×4 to 32×32), different block scans and quadtree coding of the blocks. On the other hand, moving from VP9 to AV1 did not involve such significant changes.

So, for prediction there was one radically new thing: combining the Thor and Daala filters into a single constrained directional enhancement filter (or CDEF for short). It works great and gives an image quality boost at a small cost. Another interesting tool is predicting chroma from luma (or CfL for short), which was a rejected idea for ITU H.EVC but was later tried both in Thor and Daala and found good enough (the history is mentioned in the paper describing it). This makes me think that if Cisco joined efforts with the Xiph foundation they’d be able to produce a good and free video codec without any other company. Getting it accepted by others though…

Now coefficient coding. There were four approaches initially:

  • VP5 bool coding (i.e. binary coding of bits with fixed probabilities that get updated once per frame; it appeared in On2 VP5 and survived all the way until VP10);
  • ANS-based coding;
  • Daala classic range coder;
  • Thor variable-length codes (probably not even officially proposed since they are significantly less effective than any other proposed scheme).

ANS-based coding was rejected probably because of the scandal and because it requires data to be coded in reverse direction (the official reasoning is that while it was faster on a normal CPU it was slower in some hardware implementations—a common reason for rejecting a feature in AV1).

The Daala approach won, probably because it’s easier to manipulate a multi-symbol model than to try to code everything as a context-dependent binarisation of the value (and you’d need to store and/or code a lot of context probabilities that way). In any case it was a clear winner.

Now, transforms. Again, I cannot tell how it went exactly, but all the stories I heard were that the Daala transforms were better but then Baidu had to intervene citing hardware implementation reasons (something along the lines that it’s hard to implement new transforms and why do that when there are working VP9 transforms with a tried hardware design), so the VP9 transforms were chosen in the end.

The final stage

In April 2018 AOM announced the long-awaited bitstream freeze, which came as a surprise to the developers.

The final final stage

In June it was actually frozen and AV1.0 was released along with the spec. Fun fact: the repository for av1-spec on baidusource.com that once hosted it (there are even snapshots of it from June in the Web Archive) is now completely empty.

And of course because of some hardware implementation difficulties (sounds familiar yet?) now we have AV1.1 which is not fully compatible with AV1.0.

General impressions

This all started with good intent, but in the process of developing AV1.x so many flags were raised that I feel suspicious about it:

  • ANS patent;
  • Political games like A**le joining AOM as “founding member” when the codec was almost ready;
  • Marketing games like announcing a frozen bitstream before a large exhibition while in reality it reached 1.0 status later and without much fanfare;
  • Not very open development process: while individual participants could publish their achievements and it was not all particularly secret, it was more “IBM open” in the sense it’s open if you’re registered at their portal and signed some papers but not accessible to any passer-by;
  • Not very open decision process: hardware implementation was very often quoted as the excuse, even in issues like this;
  • Not very good result (and that’s putting it mildly);
  • Oh, and not a very good ecosystem at all: there are test bitstreams, but even individual members of AOM have to buy them.

And by “not very good result” I mean that the codec is monstrous in size (the tables alone take more than a megabyte in source form and there’s even more code than tables) and its implementation is slow as the pitch drop experiment.

Usually people trying to defend it give the same two arguments: “but it’s just a reference model, look at JM or HM” and “codecs are not inherently complex, you can write a fast encoder”. Both of those are bullshit.

First, comparing libaom to the reference software of H.264 or H.265. While formally it’s also the reference software, there’s one huge difference. JM/HM were plain C/C++ implementations with no optimisation tricks (besides the transform speed-up by decomposition in HM) while libaom has all kinds of optimisations including SIMD for ARM, POWER and x86. And the dav1d decoder with a rather full set of AVX optimisations is just 2-3 times faster (more when it can use threading). For H.264, optimised decoders were tens of times faster than JM. I expect a similar range for HM too, and being merely two-three times faster would be a very bad result against an unoptimised reference (which libaom is not).

Second, claiming that codecs are not inherently complex and thus you can write a fast encoder even if the codec is more complex than its predecessor. It is partly true in the sense that you’re not forced to use all possible features and thus can avoid some of the combinatorial explosion by not trying some coding tools. But there are certain expectations built into any codec design (i.e. that you use certain coding tools in a certain sequence, omitting them only in certain corner cases) and there are certain expectations on compression level/quality/speed.

For example, let’s get to the basics and make an H.EVC encoder encode raw video. Since you’re not doing intra prediction, motion compensation or transforms it’s probably the fastest encoder you can get. But in order to do that you still have to code the coding quadtrees and transmit flags saying the block contains PCM data. In result your encoder will beat any other on speed, but it will still lose to memcpy(), which does not have to invoke an arithmetic coder for the mandatory flags of every coded block (flags which also take space, along with the padding to a byte boundary, so it loses in compression efficiency too). That’s not counting the fact that such an encoder is useless for any practical purpose.

Now let’s consider some audio codecs—some of them use parametric bit allocation in both the encoder and the decoder (video codecs might start to use the same technique one day, Daala has tried something like that already), so such a codec needs to run it regardless of how you try to compute a better allocation—you have to code it as a difference from the implicitly calculated one. And of course such a codec is more complex than one that transmits the bit allocation explicitly for each sample or sample group. But it gains in compression efficiency and that’s the whole point of having a more complex codec in the first place.

Hence I cannot expect AV1 decoders to magically be ten times faster than libaom, and similarly, while I expect that AV1 encoders will become much faster, they’ll still either measure encoding speed in frames per minute (instead of frames per month) or be on par with x265 in terms of compression efficiency/speed (and x265 is also not the best possible H.265 encoder in my opinion).


The late Sir Terence Pratchett (this world is a truly sadder place without his presence) used the phrase “ladies of negotiable hospitality” to describe a certain profession in Discworld. And to me it looks like AV1 is a codec of negotiated design. In other words, first they tried to design the usual general-purpose codec but then (probably after seeing how well it performed) they decided to bet on hardware manufacturers (who else would make AV1 encoders and, more importantly, decoders perform fast enough, especially for mobile devices?). And that resulted in catering to every possible objection any hardware manufacturer of the alliance had (to the point of AV1.1).

This is the only way I can reasonably explain what I observe with AV1. If somebody has a different interpretation, especially based on facts I don’t know or missed, I’d like to hear it and know the truth. Meanwhile, I hope I made my position clear.

ClearVideo: Somewhat Working!

Saturday, February 3rd, 2018

So I’ve finally written a decoder for ClearVideo in NihAV and it works semi-decently.

Here’s the twentieth frame of basketball.avi from the usual sample repository. Only the first frame was an intra frame; the rest are coded with just the transforms (aka “copy a block from elsewhere and change its brightness level if needed”).

As you can see there are still serious glitches in the decoding, especially on the bottom and right edges, but it’s a moving scene and most of it is still good. The standard “talking head” sample from the same place decodes perfectly, and the RealMedia sample is decoded acceptably too.

Many samples are decoded quite fine and it’s amazing how such a simple method (it does not code the residue, unlike other video codecs with interframes!) still achieves good results at a reasonable (for that time) bitrate.

Hopefully there are not too many bugs left to fix in my implementation, so I can finally move on to RealVideo 3 and 4. And then probably to audio codecs before RealVideo 6 (aka RealMedia HD), because that one needs more REing work for the details (and maybe wider acceptance). So much stuff to procrastinate on!

Update: I did MV clipping wrong, now it works just fine except for some rare glitches in one RealMedia file.

ClearVideo: Some Progress!

Sunday, January 21st, 2018

I don’t know whether it’s Sweden in general or just proper Swedish Trocadero, but I’ve managed to clarify some things in the ClearVideo codec.

One of the main problems is that the binary specifications are full of cruft: thunks for (almost) every function in newer versions (it’s annoying) and generic containers with all kinds of stuff included (so you have lists whose elements carry the actual payload as different kinds of classes—it was annoying enough that I’ve managed to figure it all out just this week). Plus the codec has several different ways of coding information depending on various flags in the extradata. Anyway, complaining about obscure and annoying binary specifications is fun but it does not give any gain, so let’s move to the actual new and clarified old information.

The codec has two modes: intra frames coded a la JPEG and inter frames that are coded with fractal transforms (and nothing else). A fractal frame is split into tiles of predefined size (that information is stored in the extradata) and those tiles may be split into smaller blocks recursively. The information for one block is the plane number, flags (most likely showing whether it should be split further), a bias value (that should be added to the transformed block) and a motion vector (a byte per component). The information is coded with static codebooks and the set used depends on the coding version and the context (it’s one set for version 1, another for version 2 and a completely different single codebook for version 6). The codebooks are stored in the resources of the decoder wrapper, the same as the DCT coefficient tables.
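
Going by that description, applying one such “transform” should amount to roughly the following (a sketch with a made-up signature; border handling and the exact order of operations are assumptions):

    // A sketch of applying one block "transform": copy a block from the previous frame
    // displaced by the motion vector and add the bias to adjust its brightness.
    // Border handling is ignored and the exact semantics are an assumption.
    fn apply_block(prev: &[u8], cur: &mut [u8], stride: usize,
                   x: usize, y: usize, size: usize,
                   mv_x: i8, mv_y: i8, bias: i16) {
        let sx = (x as isize + isize::from(mv_x)) as usize;
        let sy = (y as isize + isize::from(mv_y)) as usize;
        for dy in 0..size {
            for dx in 0..size {
                let src = i16::from(prev[(sy + dy) * stride + sx + dx]);
                cur[(y + dy) * stride + x + dx] = (src + bias).clamp(0, 255) as u8;
            }
        }
    }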

Now, the extradata. After the copywrong string it actually has the information used in the decoding: picture size (again), flags, version, tile sizes and such. The fun thing is that this information is stored as 32-bit little-endian words for AVI but as big-endian words for RealMedia and probably MOV.

And the tables. There are two tables: CVLHUFF (a single codebook definition) and HUFF (many codebooks). Both have a similar format: first you have a byte array of code lengths, then a 16-bit array of the actual codewords (or you can reconstruct them from the code lengths the usual way—the shortest code is all zeroes and after that they increase) and finally a 16-bit array of symbols (just bytes in case of the 0x53 chunks in HUFF). The multiple codebook definition has an 8-byte header and then codebook chunks in the form [id byte][32-bit length in symbols][actual data]; there are only 4 possible ID bytes (0xFF—empty table, 0x53—single byte per symbol, the rest is as described above). Those IDs correspond to the tables used to code the 16-bit bias value, the motion values (as a pair of bytes with a possible escape value) and the 8-bit flags value.
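
The codeword reconstruction mentioned above (“the shortest code is all zeroes and after that they increase”) is the usual canonical code assignment, which as a sketch looks like this:

    // Reconstructing codewords from the lengths array: take symbols in order of increasing
    // code length, give the first one the all-zero code and increment from there, shifting
    // the code left whenever the length grows.
    fn lengths_to_codes(lens: &[u8]) -> Vec<u16> {
        let mut order: Vec<usize> = (0..lens.len()).filter(|&i| lens[i] > 0).collect();
        order.sort_by_key(|&i| lens[i]); // stable sort keeps the original order within a length

        let mut codes = vec![0u16; lens.len()];
        let (mut code, mut prev_len) = (0u16, 0u8);
        for &i in &order {
            code <<= lens[i] - prev_len; // grow the code when the length increases
            codes[i] = code;
            code += 1;
            prev_len = lens[i];
        }
        codes
    }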

So, the overall structure is more or less clear, the underlying details can be verified with some debugging, and I hope to make a ClearVideo decoder for NihAV this year. RMHD is still waiting 😉

Some Notes on VivoActive Video

Tuesday, November 21st, 2017

When you refactor code (even your own) any other activity looks better. So I decided to look at VivoActive Video instead of refactoring the H.263-based decoders in NihAV.

In case you don’t know, Vivo was a company that created its own formats (container and video, no idea about audio) which seem to be so old that their beard rivals the beard of their users. Also there’s some MPlayer-related joke about it but I never got it.

Anyway, it’s two H.263-based video codecs, one being a vanilla H.263+ codec with all the exciting stuff like PB-frames (but no B-frames) and the other one an upgrade over it that’s still H.263+ but with a different coding scheme.

Actually, how the codec handles coding is the only interesting thing there. First, the codebooks. They are stored in a semi-readable way: the first entry may be an optional FLC marker, the last entry is always an End marker, and the rest of the entries are human-readable codes (e.g. 00 1101 11 — the codebook parser actually parses those ones and zeroes and skips the white space) with some binary data attached (the number of trailing bits, the symbol start value, something else too). The way the bitstream is handled reminds me of VPx somewhat: you have a set of 49 codebooks, you start decoding tokens from a certain codebook and then if needed you switch to a secondary codebook. In result you get a stream of tokens that may need to be parsed further (skip syncword prevention codes that decode to 0xB3, validate the decoded block—mind you, escape values are handled as normal codes there too—assign codes to the proper fields etc etc). In result, while it’s easy to figure out which part is H.263 picture/GOB/MB header decoding because of the familiar structure and get_bits() calls, Vivo v2 decoding looks like “decode a stream of tokens, save the first ones to certain fields in the context, interpret the rest depending on them”. For example, macroblock decoding starts with tokens for MB type, CBP and quantiser, those may be followed by 1 or 4 motion vector deltas and then you have the block coefficients (and don’t forget to skip stuffing codes when you get them).
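
Parsing one of those human-readable code strings is trivial, something along these lines (a sketch; the handling of the trailing binary fields is glossed over):

    // Parsing one human-readable code entry like "00 1101 11": '0'/'1' are code bits,
    // spaces are skipped, anything else is assumed to start the trailing binary fields.
    fn parse_code(entry: &str) -> (u32, u8) {
        let (mut code, mut len) = (0u32, 0u8);
        for ch in entry.chars() {
            match ch {
                '0' => { code <<= 1;             len += 1; }
                '1' => { code = (code << 1) | 1; len += 1; }
                c if c.is_whitespace() => {}     // spaces between bit groups are ignored
                _ => break,
            }
        }
        (code, len)
    }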

Overall, not a very interesting codec with some crazy internal design (another fun fact: it has another set of codebooks in a slightly different format, but they seem to be completely unused). I’m not sure it’s worth implementing but it was interesting to look at.

Why Modern Video Codecs Suck and Will Keep on Sucking

Friday, May 12th, 2017

If you look at modern video codecs you’ll spot one problem: they get designed for large resolutions and follow a one-size-does-not-fit-exactly-anybody approach. By that I mean that codecs are following the model introduced by ITU H.261—split the image into blocks, predict each block from the previous frame if possible, apply DCT, quantise and code the resulting coefficients (using zigzag scan order and special treatment for runs of zeroes). The same was later applied to still pictures in the JPEG format, which is still going strong.

Of course modern codecs are much more complex than that; the current ITU H.EVC standard enhanced every stage:

  • the image is no longer split into plain 8×8 blocks, you have quadtrees describing coded blocks from 64×64 down to 4×4 pixels;
  • block prediction got more complicated: now you have intra (or spatial) prediction that tries to fill a block with a gradient derived from already decoded neighbour blocks, and inter prediction (the old prediction from the previous frame);
  • and obviously inter prediction is not that simple either: now it’s decoupled from the transformed block and can have completely different sizes (like 16×4 or 24×32); instead of a single previous frame you can use two reference frames selected from two separate lists of references; and even motion vectors are often predicted using motion vectors from the reference frames (does anybody like implementing those co-located MV prediction modes BTW?);
  • DCT is replaced with some bitexact integer approximations (and the dequantisation and/or transform stages may be skipped completely);
  • there are more scan types used and all values are coded using some context-adaptive coder.

Plus some hacks for low-resolution mode (e.g. a special 4×4 transform for luma), lossless mode (or as they call it, “PCM coding”) and now also a special coding mode for screen content (i.e. images with fewer distinct colours and where fine details matter).

The enhancements to the main coding process are just that, enhancements: they don’t change the principles of coding but rather adapt them to modern conditions (meaning that there’s demand for higher compression and that more CPU power and RAM can be thrown at the processing—mostly RAM though).

And what do the hacks do? They try to deal with the fact that this model works fine for smoothly changing continuous-tone images and does not work that well on other types of video source. There are several ways to deal with the problem, but keep in mind that the problem of distinguishing video types and selecting the proper coding is AI-complete:

  1. JPEG+PNG approach. You select the best coder for the source manually and transmit it like that. Obviously it works well in limited scenarios, but people quite often don’t bother and compress everything with a single format even if that hurts quality or compression ratio. Plus you need to handle two different formats, make sure that the receiving end also supports them etc etc.
  2. MPEG-4 approach. You have a single format with various “coding tools” embedded; they can be full alternative coding features (like WebP having VP8 compression and lossless compression with nothing in common between them, or MPEG-4 Audio being codable as conventional AAC, TwinVQ, a speech codec or even a description of synthesised audio) or various enhancements applied to the main coding method (like you have AAC-LC, AAC-Main that enables several features, or HE-AACv2 which takes AAC-LC audio and applies SBR and Parametric Stereo to double its channels and frequency range). Actually there are more than forty various MPEG-4 Audio object types (various coding modes) already—do you think there’s any software that supports everything? And it looks like modern video codecs are heading this way too: they introduce various coding tools (like for screen content) and it would be fun to support all possible features in a decoder. Please also consider how much effort should be spent on effectively applying all those tools (and that’s obviously beyond the scope of the standards).
  3. ZPAQ approach. The terminal AI-complete solution. You are not merely generating a bitstream, first you need to transmit the bytecode for a program that will decode it. It’s the ultimate solution—if you can describe the perfect model for the stream then you can compress it the best. Finding an optimal model for a given bitstream is left as an exercise for the reader (in TAoCP it would be marked with M60 I guess).

The second thing I find sucky is the combinatorial explosion of encoding parameters. Back in the day you had to worry about selecting the best quantisation matrix (or merely a quantiser) and a motion vector if you decided to code a block as inter. Now you have countless ways to split a large tile into smaller blocks, many ways to select the prediction mode (inter/intra, prediction angle for intra, partitioning, reference frames and motion vectors), whether to skip the transform stage or not, and if not, whether it’s worth subdividing the block further or not… The result is as good as string theory—you can get a good one if you can guess zillions of parameters right.

It would be nice to have an encoder actually splitting video into a scene and actors and transmitting just the changes to the objects (actors, scene) instead of blocks. But then you have the problem of coding those descriptions efficiently and the even greater problem of automatically classifying the video into such objects (obviously software can do that, that’s why MPEG-4 Synthetic Video is such a great success). Actually the idea had some use: there was the AVS-S standard for coding video specifically from surveillance cameras (why would China need such a standard anyway?). In this standard there was a special kind of frame for the whole scene and the main share of the video was supposed to be just objects moving around that scene. Even if the standard is obsolete, its legacy was included into AVS2 as three or four new special frame types.

Personally I believe that current video formats are being optimised towards a local minimum; there are probably other coding methods that give a larger gain on certain kinds of data, preferably with less tweaking. For example, that was probably the best thing about Daala, its PVQ coding; the rest was not crazy enough. I have a gut feeling that vector quantisation might be a good base for an alternative approach to building video codecs. And I think it’s better to have different formats oriented towards e.g. low-latency broadcasting and video distribution. If you remember, back in the day people actually spent time deciding which segment was coded better with DivX ;-) 3 Fast-Motion or DivX ;-) 3 Low-Motion, so those who care would be able to select the proper format. And the rest can keep watching content in VP11/AV2 format. Probably only the last sentence will come to life.

That’s why I don’t expect a bright future for video codecs and that’s why my blog is titled like this.