Various Video Codecs « Kostya's Boring Codec World

Archive for the ‘Various Video Codecs’ Category

ClearVideo: Somewhat Working!

Saturday, February 3rd, 2018

So I’ve finally written a decoder for ClearVideo in NihAV and it works semi-decently.

Here’s a twentieth frame of basketball.avi from the usual sample repository. Only the first frame was intra-frame, the rest are coded with just the transforms (aka “copy block from elsewhere and change its brightness level if needed too”).

As you can see there are still serious glitches in decoding, especially on bottom and right edges but it’s moving scene and most of it is still good. And the standard “talking head” sample from the same place decodes perfectly. And RealMedia sample is decoded acceptably too.

Many samples are decoded quite fine and it’s amazing how such simple method (it does not code residue unlike other video codecs with interframes!) still achieves good results at reasonable (for that time) bitrate.

Hopefully there are not so many bugs in my implementation to fix so I can finally move to RealVideo 3 and 4. And then probably to audio codecs before RealVideo 6 (aka RealMedia HD) because it needs REing work for details (and maybe wider acceptance). So much stuff to procrastinate!

Update: I did MV clipping wrong, now it works just fine except for some rare glitches in one RealMedia file.

Posted in Various Video Codecs | 2 Comments »

ClearVideo: Some Progress!

Sunday, January 21st, 2018

I don’t know whether it’s Sweden in general or just proper Swedish Trocadero but I’ve managed to clarify some things in ClearVideo codec.

One of the main problems is that binary specifications are full of cruft: thunks for (almost) every function in newer versions (it’s annoying) and generic containers with all stuff included (so you have lists with elements that have actual payload which are different kinds of classes—it was so annoying that I’ve managed to figure it all out just this week). Anyway, complaining about obscure and annoying binary specifications is fun but it does not give any gain, so let’s move to the actual new and clarified old information. Plus it has several different ways of coding information depending on various flags in extradata.

The codec has two modes: intra frames coded a la JPEG and inter frames that are coded with fractal transforms (and nothing else). Fractal frame is split into tiles of predefined size (that information is stored in extradata) and those tiles may be split into smaller blocks recursively. The information for one block is plane number, flags (most likely to show whether it should be split further), bias value (that should be added to the transformed block) and motion vector (a byte per component). The information is coded with static codebooks and it depends on the coding version and context (it’s one set for version 1, another for version 2 and completely different single codebook for version 6). Codebooks are stored in the resources of decoder wrapper, the same as with DCT coefficients tables.

Now, the extradata. After the copywrong string it actually has the information used in the decoding: picture size (again), flags, version, tile sizes and such. Fun thing is that this information is stored in 32-bit little-endian words for AVI but it uses big-endian words for RealMedia and probably MOV.

And the tables. There are two tables: CVLHUFF (single codebook definition) and HUFF (many codebooks). Both have similar format: first you have byte array for code lengths, then you have 16-bit array of actual codewords (or you can reconstruct them from code lengths the usual way—the shortest code is all zeroes and after that they increase) and finally you have 16-bit array of symbols (just bytes for case of 0x53 chunks in HUFF). The multiple codebook definition has 8-byte header and then codebook chunks in form [id byte][32-bit length in symbols][actual data], there are only 4 possible ID bytes (0xFF—empty table, 0x53—single byte for symbol, the rest is as described above). Those IDs correspond to the tables used to code 16-bit bias value, motion values (as a pair of bytes with possible escape value) and 8-bit flags value.

So, overall structure is more or less clear, underlying details can be verified with some debugging, and I hope to make ClearVideo decoder for NihAV this year. RMHD is still waiting 😉

Posted in Various Video Codecs | 2 Comments »

Some Notes on VivoActive Video

Tuesday, November 21st, 2017

When you refactor code (even if your own one) any other activity looks better. So I decided to look at VivoActive Video instead of refactoring H.263-based decoders in NihAV.

In case you don’t know, Vivo was a company that created own formats (container and video, no idea about audio) that seems to that old that its beard rivals the beard of its users. Also it’s some MPlayer-related joke but I never got it.

Anyway, it’s two H.263-based video codecs, one being vanilla H.263+ decoder will all exciting stuff like PB-frames (but no B-frames) and another one is an upgrade over it that’s still H.263+ but with different coding scheme.

Actually, how the codec handles coding is the only interesting thing there. First, codebooks. They are stored in semi-readable way: first entry may be an optional FLC marker, last entry is always End marker, the rest of entries are human-readable codes (e.g. 00 1101 11 — the codebook parser actually parses those ones and zeroes and skips white spaces) with some binary data (the number of trailing bits, symbol start value, something else too). The way how bitstream is handled reminds me of VPx somewhat: you have a set of 49 codebooks, you start decoding tokens from certain codebook and then if needed you switch to secondary codebook. In result you get a stream of tokens that may need to be parsed further (skip syncword prevention codes that decode to 0xB3, validate the decoded block—mind you, escape values are handled as normal codes there too, assign codes to proper fields etc etc). In result while it’s easy to figure out which part is H.263 picture/GOB/MB header decoding because of the familiar structure and get_bits() calls, Vivo v2 decoding looks like “decode a stream of tokens, save first ones to certain fields in context, interpret the rest of them depending on them”. For example, macroblock decoding starts with tokens for MB type, CBP and quantiser, those may be followed up by 1 or 4 motion vector deltas and then you have block coefficients (and don’t forget to skip stuffing codes when you get them).

Overall, not a very interesting codec with some crazy internal design (another fun fact: it has another set of codebooks in slightly different format but they seem to be completely unused). I’m not sure if it’s worth implementing but it was interesting to look at.

Posted in Various Video Codecs | Comments Closed

Why Modern Video Codecs Suck and Will Keep on Sucking

Friday, May 12th, 2017

If you look at the modern video codecs you’ll spot one problem: they get designed for large resolutions and follow one-size-does-not-fit-exactly-anybody approach. By that I mean that codecs are following the model introduced by ITU H.261—split image into blocks, predict block from the previous frame if possible, apply DCT, quantise and code resulting coefficients (using zigzag scan order and special treatment for runs of zeroes). The same was later applied to pictures in JPEG format that is still staying strong.

Of course modern codecs are much more complex that that, current ITU H.EVC standard enhanced every stage:

image is no longer split into 8×8 blocks, you have quadtrees coding blocks from 64×64 down to 4×4 pixels;
block prediction got more complicated, now you have intra (or spatial prediction) that tries to fill block with gradient derived from already decoded neighbour blocks) and inter prediction (the old prediction from the previous frame);
and obviously inter prediction is not that simple either: now it’s decoupled from transformed block and can have completely different sizes (like 16×4 or 24×32), instead of single previous frame you can use two reference frames selected from two separate lists of references and even motion vectors are often predicted using motion vectors from the reference frames (does anybody like implementing those colocated MV prediction modes BTW?);
DCT is replaced with some bitexact integer approximations (and the dequantisation and/or transform stages may be skipped completely);
there are more scan types used and all values are coded using some context-adaptive coder.

Plus some hacks for low-resolution mode (e.g. special 4×4 transform for luma), lossless (or as they call it, “PCM coding”) and now also special coding mode for screen content (i.e. images with fewer distinct colours and where fine details matter).

The enhancements on streamline coding process are enhancements, they don’t change principles of coding but rather adapt them to modern conditions (meaning that there’s demand in higher compression and there’s more CPU power and RAM can be thrown at the processing—mostly RAM though).

And what the hacks do? They try to deal with the fact that this model works fine for smooth changing continuous tone images and it does not work that good on other types of video source. There are several ways to deal with the problem but keep in mind that the problem of distinguishing video types and selecting proper coding is AI-complete:

JPEG+PNG approach. You select best coder for the source manually and transmit it like that. Obviously it works well in limited scenarios but even people quite often don’t bother and compress everything with the single format even if that hurts quality or compression ratio. Plus you need to handle two different formats, make sure that the receiving end also supports them etc etc.
MPEG-4 approach. You have single format that has various “coding tools” embedded, they can be both full alternative coding features (like WebP has VP8 compression and lossless compression and nothing common between them or MPEG-4 Audio can be coded as conventional AAC, TwinVQ, speech codec or even as a description for synthesised audio) or various enhancement applied to the main coding method (like you have AAC-LC, AAC-Main that enables several features or HE-AACv2 which takes AAC-LC audio and applies SBR and Parametric Stereo to double its channels and frequency range). Actually there are more than forty various MPEG-4 Audio object types (various coding modes) already, do you think there’s any software that supports everything? And looks like modern video codecs head this way too: they introduce various coding tools (like for screen content) and it would be fun to support all possible features in the decoder. Please consider how much effort should be spent on effectively applying all those tools too (and that’s obviously beside the scope of standards).
ZPAQ approach. The terminal AI-complete solution. You are not merely generating bitstream but first you need to transmit bytecode for a program that will decode this bytestream. It’s the ultimate solution—if you can describe the perfect model for the stream then you can compress it the best. Finding an optimal model for given bitstream is left as an exercise for the reader (in TAoCP it would be marked with M60 I guess).

The second thing I find sucky is combinatorial explosion of encoding parameters. Back in the day you had to worry about selecting the best quantisation matrix (or merely a quantiser) and motion vector if you decided to code it as inter-block. Now you have countless ways to split large tile into smaller blocks, many ways to select prediction mode (inter/intra, prediction angle for intra, partitioning, reference frames and motion vectors) and whether to skip transform stage or not and if not whether it’s worth to subdivide block further or not… The result is as good as string theory—you can get a good one if you can guess zillions of parameters right.

It would be nice to have encoder actually splitting video into scene and actors and transmitting just the changes to the objects (actors, scene) instead of blocks. But then you have a problem of coding those descriptions efficiently and even greater problem of automatically classifying the video into such objects (obviously software can do that, that’s why MPEG-4 Synthetic Video is such a great success). Actually it had some use: there was AVS-S standard for coding video specifically from surveillance cameras (why would China need such standard anyway?). In this standard there was special kind of frame for the whole scene and the main share of video was supposed to be just objects moving around the scene. Even if the standard is obsolete its legacy was included into ~~HEVS~~AVS2 as three or four new special frame types.

Personally I believe that current video formats are being optimised to local minimum, there are probably other coding methods that give larger gain on certain kinds of data, preferably with less tweaking. For example, that was probably the best thing about Daala, its PVQ coding; the rest was nor crazy enough. I have a gut feeling that vector quantisation might be a good base for an alternative approach to building video codecs. And I think it’s better to have different formats oriented for e.g. low-latency broadcasting and video distributing. If you remember, back in the days people actually spent time to decide which segment was coded better with DivX ;-) 3 Fast-Motion or DivX ;-) 3 Low-Motion, so those who care will be able to select proper format. And the rest can keep watching content in VP11/AV2 format. Probably only the last sentence will come to life.

That’s why I don’t expect bright future in video codecs and that’s why my blog is titled like this.

Posted in Useless Rants, Various Video Codecs | 4 Comments »

A Quick Look on Perseus

Tuesday, June 21st, 2016

So, unlike those breakthrough codecs everybody talks about (I mean RMHD and ORBX.js), V-Nova Perseus was delivered (but what do you expect from a codec announced on the first of April?) and is available in some Android app. So I’ve looked at it.

The implementation seems bafflingly simple: there’s a base layer, it gets upscaled 2x and an enhancement is applied to the upscaled image. And those enhancements are essentially quantised differences after 2×2 Haar transform plus runs, all coded with context-dependent Huffman codes. If that reminds you of RealVideo—don’t worry, they code those codebook descriptions too so it’s different.

I don’t know if it really works as good as ~~promised~~marketed but it’s an interesting approach and it introduces some variety in the world of codecs that look alike—mostly because they all use the same principles as the standard video codec with some small enhancements or building blocks replaced with functional analogues; yes, I completely forgot about Daala, please remind me about it when they settle with final design—it might be the codec of choice for GNU HURD NG by then too.

Posted in Useless Rants, Various Video Codecs | 6 Comments »

Kostya's Boring Codec World

Archive for the ‘Various Video Codecs’ Category

ClearVideo: Somewhat Working!

ClearVideo: Some Progress!

Some Notes on VivoActive Video

Why Modern Video Codecs Suck and Will Keep on Sucking

A Quick Look on Perseus

Pages

Archives

Categories