Archive for May, 2017

NihAV — a Small Update

Wednesday, May 31st, 2017

For testing how well NihAV handles palettised formats I’ve decided to add support for Gremlin Digital Video format (8-bit only). So now I can decode various cutscenes from Normality, one of very few 3D first person adventure games for DOS. I’ve tested my implementation and it works fine.

The funny thing is that this demuxer and decoder for GDV (actually there’s also GDV DPCM but the samples I have seem to use raw PCM) are missing from CEmpeg. Wiki description also has some parts missing.

The first frame I was decoding started with a code for copying 8 bytes from offset -56. The first frame. At the very first pixel. So I’ve consulted the VAG’s code and the original binary specification (even by dumping executed instructions in DosBox and analysing them—it helped me in debugging later) to see where it went wrong. And it turns out the decoder is really supposed to do that because it has specially initialised buffer before the actual frame data (kinda like the original LZHUF did, also there’s no need to check if we copy before the buffer start since it’s not possible) plus some other small issues. I’ll try to correct the Wiki article on GDV in the following days.
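To illustrate the idea (this is a minimal sketch, not the actual GDV decoder): the output buffer gets a preinitialised area in front of the frame, so a back-reference issued at the very first pixel simply reads from that area. The prefix size, the fill pattern and all the names below are made up for the example, and the sketch assumes the format never codes a distance larger than the prefix.

```rust
/// Hypothetical size of the area kept in front of the frame data.
const PREFIX: usize = 64;

/// Output buffer with a preinitialised prefix before the actual frame.
pub struct LzWindow {
    buf: Vec<u8>,
    pos: usize, // write position, starts right after the prefix
}

impl LzWindow {
    pub fn new(frame_size: usize) -> Self {
        let mut buf = vec![0u8; PREFIX + frame_size];
        // Made-up fill pattern; a real decoder would use its own.
        for (i, b) in buf[..PREFIX].iter_mut().enumerate() {
            *b = i as u8;
        }
        LzWindow { buf, pos: PREFIX }
    }

    /// Copies `len` bytes from `dist` bytes before the write position.
    /// The copy goes byte by byte so overlapping copies repeat data.
    pub fn copy_back(&mut self, dist: usize, len: usize) {
        // Assumed format limit: distances never exceed the prefix size,
        // so `pos - dist` cannot point before the buffer start.
        assert!(dist >= 1 && dist <= PREFIX);
        assert!(self.pos + len <= self.buf.len());
        for _ in 0..len {
            self.buf[self.pos] = self.buf[self.pos - dist];
            self.pos += 1;
        }
    }

    /// The decoded frame without the prefix.
    pub fn frame(&self) -> &[u8] {
        &self.buf[PREFIX..]
    }
}
```

With this layout "copy 8 bytes from offset -56" as the first operation is perfectly valid: it just fetches bytes 8..16 of the prefix.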

And I don’t really plan to add any other old game codecs besides VMD and Smacker (I have a soft spot for them after all). The next decoders should be either for audio or for more modern formats, like H.26x or Indeo 4/5, since I still have some ideas to test out.

Update to this update: my decoder code is here.

NihAV — Buffers and Wrappers

Saturday, May 27th, 2017

It might be hard to believe but the number of decoders in NihAV has tripled! So now there are three codecs supported in NihAV: Intel Indeo 2, Intel Indeo 3 and PCM.

Before I talk about the design I’d like to say some things about the Indeo 3 implementation. Essentially it’s an improvement over Indeo 2, which had simple delta compression: now deltas are coming from one of 21 codebooks and can be applied to both pairs and quads of pixels, there is motion compensation and planes are split into cells that use blocks for coding data in them (4×4, 4×8 or 8×8 blocks). libavcodec had two versions of the decoder: the first version was submitted anonymously and looks like it’s a direct translation of disassembly for XAnim vid_iv32.so; the second version is still based on some binary specifications but also with some information coming from the Intel patent. The problem is that those two implementations are both rather horrible to translate directly into Rust because of all the optimisations like working with a quad of pixels as a 32-bit integer, plus lots of macros and overall control flow like a maze of twisty little passages. As a result I’ve ended up with three main structures: Indeo3Decoder for main things, Buffers for managing the internal frame buffers and doing pixel operations like block copy, and CellDecParams for storing current cell decoding parameters like block dimensions, indices to the codebooks used and pointers to the functions that actually apply deltas or copy the lines for the current block (for example there are two different ways to do that for a 4×8 block).

Anyway, back to overall NihAV design changes. Now there’s a dedicated structure NATimeInfo for keeping DTS, PTS, frame duration and timebase information; this structure is used in both NAFrame and NAPacket for storing timestamp information. And NAFrame now is essentially the wrapper for NATimeInfo, NABufferType plus some metadata.
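A rough sketch of how such a shared timestamp structure might look. The field names, the types and the helper method are my guesses for illustration and not the real NATimeInfo definition; the packet and frame structures are reduced to the bare minimum.

```rust
/// Hypothetical stand-in for NATimeInfo; field names and types are guesses.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct TimeInfoSketch {
    pub pts: Option<u64>,
    pub dts: Option<u64>,
    pub duration: Option<u64>,
    pub tb_num: u32, // timebase numerator
    pub tb_den: u32, // timebase denominator, assumed to be non-zero
}

impl TimeInfoSketch {
    /// Converts PTS into milliseconds using the timebase.
    pub fn pts_ms(&self) -> Option<u64> {
        let pts = self.pts?;
        Some(pts * 1000 * u64::from(self.tb_num) / u64::from(self.tb_den))
    }
}

/// Both the packet and the frame carry the same structure.
pub struct PacketSketch {
    pub ts: TimeInfoSketch,
    pub data: Vec<u8>,
}

pub struct FrameSketch {
    pub ts: TimeInfoSketch,
    pub keyframe: bool,
}
```

The point of the shared structure is that a decoder can copy the timestamp from the packet to the frame as one value instead of juggling four separate fields.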

So what is NABufferType? It is the type-specific frame buffer that stores actual data:

pub enum NABufferType {
    Video      (NAVideoBuffer<u8>),
    Video16    (NAVideoBuffer<u16>),
    VideoPacked(NAVideoBuffer<u8>),
    AudioU8    (NAAudioBuffer<u8>),
    AudioI16   (NAAudioBuffer<i16>),
    AudioI32   (NAAudioBuffer<i32>),
    AudioF32   (NAAudioBuffer<f32>),
    AudioPacked(NAAudioBuffer<u8>),
    Data       (NABufferRefT<u8>),
    None,
}

As you can see it declares several types of audio and video buffers. That’s because you don’t want to mess with bytes in many cases: if you decode 10-bit video you’d better output pixels directly into 16-bit elements, same with audio; for the other cases there’s AudioPacked/VideoPacked. To reiterate: the idea is that you allocate buffer of specific type and output native elements into it (floats for AudioF32, 16-bit for packed RGB565/RGB555 formats etc. etc.) and the conversion interface or the sink will take care of converting data into designated format.
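Here is a simplified sketch of what that means for the consumer. The enum below is a cut-down stand-in backed by plain vectors (not the real NABufferType) and the conversion function is hypothetical: the sink matches on the buffer type and converts native elements into whatever it wants.

```rust
/// Cut-down stand-in for NABufferType backed by plain vectors.
pub enum BufferSketch {
    Video(Vec<u8>),
    Video16(Vec<u16>),
    AudioI16(Vec<i16>),
    AudioF32(Vec<f32>),
    None,
}

/// Hypothetical sink-side conversion: any supported audio buffer
/// becomes a vector of floats in the -1.0..1.0 range.
pub fn to_f32_samples(buf: &BufferSketch) -> Option<Vec<f32>> {
    match buf {
        BufferSketch::AudioI16(samples) => {
            Some(samples.iter().map(|&s| f32::from(s) / 32768.0).collect())
        }
        BufferSketch::AudioF32(samples) => Some(samples.clone()),
        // Video buffers and the empty buffer are not audio.
        _ => None,
    }
}
```

The decoder never deals with byte order or packing here: it writes i16 or f32 values and the match on the consumer side decides what to do with them.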

And here’s what the audio buffer looks like (the video buffer is about the same but doesn’t have a channel map):

pub struct NAAudioBuffer<T> {
    info:   NAAudioInfo,
    data:   NABufferRefT<T>,
    offs:   Vec<usize>,
    chmap:  NAChannelMap,
}

impl<T: Clone> NAAudioBuffer<T> {
    pub fn get_offset(&self, idx: usize) -> usize { ... }
    pub fn get_info(&self) -> NAAudioInfo { self.info }
    pub fn get_chmap(&self) -> NAChannelMap { self.chmap.clone() }
    pub fn get_data(&self) -> Ref<Vec<T>> { self.data.borrow() }
    pub fn get_data_mut(&mut self) -> RefMut<Vec<T>> { self.data.borrow_mut() }
    pub fn copy_buffer(&mut self) -> Self { ... }
}

For planar audio (or video) get_offset() allows caller to obtain the offset in the buffer to the requested component (because it’s all stored in the single buffer).
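A small sketch of that layout, assuming the channels are simply stored one after another in a single vector. The structure and the channel_mut() helper are invented for the example; only the get_offset() name comes from the interface above.

```rust
/// Planar buffer sketch: all channels live in one vector.
pub struct PlanarAudio<T> {
    data: Vec<T>,
    offs: Vec<usize>, // start of each channel inside `data`
    nsamples: usize,
}

impl<T: Clone + Default> PlanarAudio<T> {
    pub fn new(channels: usize, nsamples: usize) -> Self {
        // Assumed layout: channel N starts at N * nsamples.
        let offs = (0..channels).map(|ch| ch * nsamples).collect();
        PlanarAudio {
            data: vec![T::default(); channels * nsamples],
            offs,
            nsamples,
        }
    }

    /// Offset of the requested channel in the common buffer
    /// (panics on a bad index in this sketch).
    pub fn get_offset(&self, idx: usize) -> usize {
        self.offs[idx]
    }

    /// Hypothetical convenience helper built on top of get_offset().
    pub fn channel_mut(&mut self, idx: usize) -> &mut [T] {
        let start = self.get_offset(idx);
        &mut self.data[start..start + self.nsamples]
    }

    pub fn data(&self) -> &[T] {
        &self.data
    }
}
```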

There are two functions for allocating buffers:

pub fn alloc_video_buffer(vinfo: NAVideoInfo, align: u8) -> Result<NABufferType, AllocatorError>;
pub fn alloc_audio_buffer(ainfo: NAAudioInfo, nsamples: usize, chmap: NAChannelMap) -> Result<NABufferType, AllocatorError>;

The video buffer allocator allocates a buffer in the requested format with the provided block alignment (it’s for the codecs that actually code data in e.g. 16×16 macroblocks but still want to report the frame as having e.g. width=1366 or height=1080, and if you think that it’s better to constantly confuse avctx->width with avctx->coded_width then you’ve forgotten this project’s name). The audio buffer allocator needs to know the length of the frame in samples instead.
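The alignment arithmetic itself is tiny. In this sketch I assume the alignment is passed as a power-of-two exponent (so 4 means 16-pixel blocks); whether the real align parameter is the block size or its log2 is not stated here, and the function name is made up.

```rust
/// Rounds a dimension up to a multiple of `1 << align_log2`.
/// Treating the alignment as a log2 value is an assumption of this sketch.
pub fn aligned_dim(dim: usize, align_log2: u8) -> usize {
    let mask = (1usize << align_log2) - 1;
    (dim + mask) & !mask
}
```

So for 16×16 macroblocks a 1366×1080 frame would get a 1376×1088 buffer while the reported dimensions stay 1366×1080.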

As for subtitles, they will not be implemented in NihAV beside demuxing the stream with subtitle data. I believe subtitles are the dependent kind of stream and because of that they should be rendered by the consumer (video player program or whatever). Otherwise you need to take, say, RGB-encoded subtitles, convert them into proper YUV flavour and draw in the specific region of the frame which might be not the original size if you use e.g. DVD rip encoded into different size with DVD subtitles preserved as is. And for textual subtitles you have even more rendering problems since you need to render them with proper font (stored as the attachment in the container), apply using the proper effect, adjust positions if needed and such. Plus the user may want to adjust them during playback in some way so IMO it belongs to the rendering pipeline and not NihAV (it’s okay though, you’re not going to use NihAV anyway).

Oh, and PCM “decoder” just rewraps buffer provided by NAPacket as NABufferType::AudioPacked, it’s good enough to dump as is and the future resampler will take care of format conversion.

No idea what comes next: maybe it’s Indeo audio decoders, maybe it’s Indeo 4/5 video decoder or maybe it’s deflate unpacker. Or something completely different. Or nothing at all. Only the time will tell.

NihAV — Glue and Hacks

Saturday, May 20th, 2017

I don’t like to write the code that does nothing, it’s the excitement of my code doing at least something that keeps me writing code. So instead of designing a lot of new interfaces and such that can describe all theoretically feasible stuff plus utility code to handle the things passed through aforementioned interfaces, I’ve just added some barely working stuff, wrote a somewhat working demuxer and made a decoder.

And here it is:
(more…)

#chemicalexperiments

Friday, May 19th, 2017

Well, here’s yet another post nobody asked for.

As a bog standard programmer I love organ music, hacking various stuff, and cooking. Also it’s easier to satisfy my tastes and limitations that way too.

I’m not a skilled cook at all but I can make myself a semi-decent soup or bake something (casserole, quiche or pie). And here’s my short report on trying macaroni and cheese in three variations.

The first version was made after some recipe—cook pasta (I chose fusilli because it’s the only kind I had at hoof), make cheese sauce (essentially start with sauce thickener made from fried flour, add milk and melt a lot of cheese in it), combine together and bake in oven. Simple, filling and tasty. The only problem I found is that it thickens into a solid mass when cooled but it’s still enjoyable then.

The second version I tried was Kraft dinner. Just cook the pasta from the box and mix it with milk, butter and powder (from the packet inside the box) in still warm cooking pot. This version I found incompatible with me—not gross or allergy inducing, just after tasting one spoonful I could not bring myself to take another. Oh well, not a big loss.

And finally, käsespätzle. For this variation you take spätzle (the usual long thin variation sold in every supermarket here), mix it with cooking cream that has been boiled and with some cheese melted in it, put the result into a baking dish, sprinkle with more grated cheese and bake (I’ve also added chopped dried tomatoes because I had to put them somewhere). The result is tasty and more tender than the first variation. So I approve of it too.

P.S. I don’t take pictures of what I cook, you want #opticalexperiments then and from a different person too.

A Short Essay on Bitstream Reading

Monday, May 15th, 2017

So, it has come to this. How might bitstream reading work? Here I’ll try to present several ways to read bits and variable-length codes.
(more…)
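Probably the simplest way to do it, as a sketch under my own assumptions (MSB-first bit order, reading one bit at a time, made-up names): a reader that keeps a position in bits, plus a unary code as the most primitive variable-length code. Real readers usually keep a cache of several bytes instead of fetching single bits.

```rust
/// Minimal MSB-first bit reader over a byte slice.
pub struct BitReader<'a> {
    src: &'a [u8],
    pos: usize, // current position in bits
}

impl<'a> BitReader<'a> {
    pub fn new(src: &'a [u8]) -> Self {
        BitReader { src, pos: 0 }
    }

    /// Reads a single bit or returns None at the end of data.
    pub fn read_bit(&mut self) -> Option<bool> {
        let byte = *self.src.get(self.pos >> 3)?;
        let bit = (byte >> (7 - (self.pos & 7))) & 1;
        self.pos += 1;
        Some(bit != 0)
    }

    /// Reads up to 32 bits as an unsigned value.
    pub fn read(&mut self, nbits: u8) -> Option<u32> {
        // Check first so a failed read does not consume anything.
        if nbits > 32 || self.pos + usize::from(nbits) > self.src.len() * 8 {
            return None;
        }
        let mut val = 0u32;
        for _ in 0..nbits {
            val = (val << 1) | u32::from(self.read_bit()?);
        }
        Some(val)
    }

    /// Unary code: the number of one bits before the terminating zero.
    pub fn read_unary(&mut self) -> Option<u32> {
        let mut count = 0;
        while self.read_bit()? {
            count += 1;
        }
        Some(count)
    }
}
```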

Why Modern Video Codecs Suck and Will Keep on Sucking

Friday, May 12th, 2017

If you look at the modern video codecs you’ll spot one problem: they get designed for large resolutions and follow one-size-does-not-fit-exactly-anybody approach. By that I mean that codecs are following the model introduced by ITU H.261—split image into blocks, predict block from the previous frame if possible, apply DCT, quantise and code resulting coefficients (using zigzag scan order and special treatment for runs of zeroes). The same was later applied to pictures in JPEG format that is still staying strong.
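The last stage of that model is easy to show in code. Below is a sketch that builds the classic 8×8 zigzag order and turns an already quantised block into (run of zeroes, value) pairs; prediction, DCT, quantisation and the actual entropy coding are left out, and the function names are mine.

```rust
/// Builds the classic 8x8 zigzag scan: entry N is the raster index
/// (row * 8 + column) of the Nth coefficient in scan order.
pub fn zigzag_order() -> [usize; 64] {
    let mut order = [0usize; 64];
    let mut n = 0;
    // Walk the 15 anti-diagonals, alternating the direction.
    for s in 0..15usize {
        let lo = if s > 7 { s - 7 } else { 0 };
        let hi = if s < 7 { s } else { 7 };
        for i in lo..=hi {
            // Odd diagonals go top to bottom, even ones bottom to top.
            let row = if s % 2 == 1 { i } else { lo + hi - i };
            let col = s - row;
            order[n] = row * 8 + col;
            n += 1;
        }
    }
    order
}

/// Converts a quantised block into (zero run, value) pairs in zigzag order.
/// Trailing zeroes are dropped, an end-of-block marker is implied.
pub fn run_level(block: &[i16; 64]) -> Vec<(u8, i16)> {
    let mut pairs = Vec::new();
    let mut run = 0u8;
    for &idx in zigzag_order().iter() {
        let coef = block[idx];
        if coef == 0 {
            run += 1;
        } else {
            pairs.push((run, coef));
            run = 0;
        }
    }
    pairs
}
```

Since the scan visits low frequencies first, a typical block ends up as a few pairs followed by a long implied run of zeroes, which is exactly what makes the scheme compress well.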

Of course modern codecs are much more complex than that; the current ITU H.EVC standard has enhanced every stage:

  • image is no longer split into 8×8 blocks, you have quadtrees coding blocks from 64×64 down to 4×4 pixels;
  • block prediction got more complicated, now you have intra (or spatial) prediction (that tries to fill the block with a gradient derived from already decoded neighbour blocks) and inter prediction (the old prediction from the previous frame);
  • and obviously inter prediction is not that simple either: now it’s decoupled from transformed block and can have completely different sizes (like 16×4 or 24×32), instead of single previous frame you can use two reference frames selected from two separate lists of references and even motion vectors are often predicted using motion vectors from the reference frames (does anybody like implementing those colocated MV prediction modes BTW?);
  • DCT is replaced with some bitexact integer approximations (and the dequantisation and/or transform stages may be skipped completely);
  • there are more scan types used and all values are coded using some context-adaptive coder.

Plus some hacks for low-resolution mode (e.g. special 4×4 transform for luma), lossless (or as they call it, “PCM coding”) and now also special coding mode for screen content (i.e. images with fewer distinct colours and where fine details matter).

The enhancements to the main coding process are just enhancements: they don’t change the principles of coding but rather adapt them to modern conditions (meaning that there’s demand for higher compression and there’s more CPU power and RAM that can be thrown at the processing, mostly RAM though).

And what do the hacks do? They try to deal with the fact that this model works fine for smoothly changing continuous-tone images and does not work that well on other types of video source. There are several ways to deal with the problem but keep in mind that the problem of distinguishing video types and selecting the proper coding is AI-complete:

  1. JPEG+PNG approach. You select best coder for the source manually and transmit it like that. Obviously it works well in limited scenarios but even people quite often don’t bother and compress everything with the single format even if that hurts quality or compression ratio. Plus you need to handle two different formats, make sure that the receiving end also supports them etc etc.
  2. MPEG-4 approach. You have a single format that has various “coding tools” embedded; they can be either full alternative coding features (like WebP has VP8 compression and lossless compression and nothing common between them or MPEG-4 Audio can be coded as conventional AAC, TwinVQ, speech codec or even as a description for synthesised audio) or various enhancements applied to the main coding method (like you have AAC-LC, AAC-Main that enables several features or HE-AACv2 which takes AAC-LC audio and applies SBR and Parametric Stereo to double its channels and frequency range). Actually there are more than forty various MPEG-4 Audio object types (various coding modes) already, do you think there’s any software that supports everything? And it looks like modern video codecs are heading this way too: they introduce various coding tools (like for screen content) and it would be fun to support all possible features in the decoder. Please consider how much effort should be spent on effectively applying all those tools too (and that’s obviously beside the scope of standards).
  3. ZPAQ approach. The terminal AI-complete solution. You are not merely generating bitstream but first you need to transmit bytecode for a program that will decode this bytestream. It’s the ultimate solution—if you can describe the perfect model for the stream then you can compress it the best. Finding an optimal model for given bitstream is left as an exercise for the reader (in TAoCP it would be marked with M60 I guess).

The second thing I find sucky is combinatorial explosion of encoding parameters. Back in the day you had to worry about selecting the best quantisation matrix (or merely a quantiser) and motion vector if you decided to code it as inter-block. Now you have countless ways to split large tile into smaller blocks, many ways to select prediction mode (inter/intra, prediction angle for intra, partitioning, reference frames and motion vectors) and whether to skip transform stage or not and if not whether it’s worth to subdivide block further or not… The result is as good as string theory—you can get a good one if you can guess zillions of parameters right.
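To put a number on just one of those choices, here is a small worked example that counts the ways to split a square block with a quadtree when every block may either stay whole or be split into four quarters, down to 4×4. It ignores everything else (non-square partitions, prediction modes, transform decisions), so the real search space is much larger; the function name is mine.

```rust
/// Counts possible quadtree partitionings of a `size`x`size` block
/// where blocks can be split recursively down to 4x4.
pub fn quadtree_splits(size: u32) -> u128 {
    if size <= 4 {
        // A 4x4 block cannot be split any further.
        return 1;
    }
    // Either keep the block whole or split it into four
    // independently partitioned quarters.
    let sub = quadtree_splits(size / 2);
    1 + sub * sub * sub * sub
}
```

That gives 2 variants for 8×8, 17 for 16×16, 83522 for 32×32 and roughly 4.9 × 10^19 for a single 64×64 block, which no longer fits into a 64-bit integer.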

It would be nice to have encoder actually splitting video into scene and actors and transmitting just the changes to the objects (actors, scene) instead of blocks. But then you have a problem of coding those descriptions efficiently and even greater problem of automatically classifying the video into such objects (obviously software can do that, that’s why MPEG-4 Synthetic Video is such a great success). Actually it had some use: there was AVS-S standard for coding video specifically from surveillance cameras (why would China need such standard anyway?). In this standard there was special kind of frame for the whole scene and the main share of video was supposed to be just objects moving around the scene. Even if the standard is obsolete its legacy was included into HEVSAVS2 as three or four new special frame types.

Personally I believe that current video formats are being optimised to a local minimum; there are probably other coding methods that give larger gain on certain kinds of data, preferably with less tweaking. For example, that was probably the best thing about Daala, its PVQ coding; the rest was not crazy enough. I have a gut feeling that vector quantisation might be a good base for an alternative approach to building video codecs. And I think it’s better to have different formats oriented for e.g. low-latency broadcasting and video distributing. If you remember, back in the days people actually spent time to decide which segment was coded better with DivX ;-) 3 Fast-Motion or DivX ;-) 3 Low-Motion, so those who care will be able to select the proper format. And the rest can keep watching content in VP11/AV2 format. Probably only the last sentence will come to life.
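For those who have not met vector quantisation: in its plainest form the encoder replaces a small group of samples with the index of the closest entry in a codebook. The sketch below shows only that search for four-sample vectors; it is ordinary codebook VQ and not Daala’s PVQ, and the names and the vector size are arbitrary.

```rust
/// Finds the index of the codebook entry closest to `v`
/// (by squared Euclidean distance); None for an empty codebook.
pub fn nearest_codeword(codebook: &[[i16; 4]], v: &[i16; 4]) -> Option<usize> {
    codebook
        .iter()
        .enumerate()
        .min_by_key(|(_, entry)| {
            // Accumulate in i64 so the squares cannot overflow.
            entry
                .iter()
                .zip(v.iter())
                .map(|(&a, &b)| {
                    let diff = i64::from(a) - i64::from(b);
                    diff * diff
                })
                .sum::<i64>()
        })
        .map(|(idx, _)| idx)
}
```

The decoder side is a plain table lookup, which is why old game codecs liked the method; the hard part is building a good codebook.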

That’s why I don’t expect bright future in video codecs and that’s why my blog is titled like this.

NihAV – io module

Thursday, May 11th, 2017

I’ve more or less completed nihav::io module, so let’s look at it (or not, it’s your choice).

There are four components there: bytestream reading, bitstream reading, generic integer code reading and codebook support for bitstream reader. Obviously there is no writing functionality there but I don’t need it now and it can be added later when (or if) needed.
(more…)

NihAV Development Progress

Saturday, May 6th, 2017

After long considerations and much hesitation NihAV finally accepts its first developer (that would be me). And for a change it will be written in Rust. First, it’s an interesting language worth learning; second, it seems to offer needed functionality without much hassle; third, it offers new features that are tempting to try. By new features I mostly mean enums, traits and functions bound to structures.

Previously I expressed the intent to do a completely new design of multimedia (mostly decoding) framework with decoders being assembled from smaller blocks. For example, if I’d implement VIVO H.263 decoder (just as a troll) it would contain these bits:

  • generic 8×8 block decoder interface that does common stuff for such decoders (maintaining block indices, filling frame information e.g. block type, motion vectors etc etc);
  • trait for 8×8 block decoder implementation that does actual bitstream decoding (functions for decoding GOP/picture/slice headers, MV prediction, block data decoding and such);
  • IPB frame shuffler as implementation of generic frame shuffler (i.e. that piece of code that selects which frames to use as references and which frame to output after decoding the current one);
  • maybe even custom codebook accessor (the piece of code that tells codebook generator what is the code and symbol at position N) so it doesn’t need to be converted into some fixed form.
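As an illustration of the frame shuffler bit, here is a sketch of how a generic trait and its IPB implementation might look. The trait, its methods and the structure are hypothetical and not actual NihAV code, and the sketch covers only the output order (reference selection is left out): B-frames go out immediately while a reference frame is held until the next reference arrives.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum FrameType {
    I,
    P,
    B,
}

/// Hypothetical generic interface: takes frames in decoding order,
/// returns frames in display order.
pub trait FrameShuffler<F> {
    fn add_frame(&mut self, ftype: FrameType, frame: F) -> Option<F>;
    fn flush(&mut self) -> Option<F>;
}

/// IPB reordering: holds the last reference frame back.
pub struct IPBShuffler<F> {
    held: Option<F>,
}

impl<F> IPBShuffler<F> {
    pub fn new() -> Self {
        IPBShuffler { held: None }
    }
}

impl<F> FrameShuffler<F> for IPBShuffler<F> {
    fn add_frame(&mut self, ftype: FrameType, frame: F) -> Option<F> {
        match ftype {
            // B-frames are displayed before the held reference.
            FrameType::B => Some(frame),
            // A new reference pushes the previously held one out.
            FrameType::I | FrameType::P => self.held.replace(frame),
        }
    }

    fn flush(&mut self) -> Option<F> {
        self.held.take()
    }
}
```

A decoder for a codec without B-frames would plug in a trivial implementation of the same trait that returns every frame at once.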

There’s not much code written yet but there are some bits implemented: rudimentary universal bitstream reader, bytestream reader (the same for memory and file I/O) and semi-working framework for demuxing. That’s a start at least.

Side rant: it seems that visible languages (i.e. not completely obscure ones) that use the := form of assignment have rather unpleasant evangelists (full of themselves in the best case, actively pushing their language to replace everything else in the worst case). That includes Go, Oberon and Wolfram Language. I don’t mean that other languages are free of that problem but in these cases it looks like the majority of posts or articles about such a language are written from this position.

Blog Restarted

Saturday, May 6th, 2017

As you (imaginary person who actually read this blog) know, I stopped blogging last year. Mostly I did it because I ran out of material to write about. Well, now I have some new thoughts that I’d rather dump on the blog and forget, so the blog is restarted.

While the previous blog was mostly centred on codec reverse engineering and design and rants about random topics, this version should be more about NihAV development and codecs design. And obviously useless rants on random topics—except on opensource multimedia projects (I’ve said enough about CEmpeg, Libav and such; corporate ones I might still have something about).

Let’s see how well it goes this time…