It’s pretty obvious but I got requests to write this nevertheless.
All known containers suck, some of them suck gloriously, some of them plainly suck. And there’s
Ogg Matroska Ogg.
There are several features that distinguish container usefulness:
- flexibility (supporting various codecs and number of streams);
- easy to parse;
- well-defined specification (there must be a format with such thing);
- metadata support;
- low overhead (bytes needed to define frame size and other properties);
- advanced features for insane people.
Now let’s review containers grouped by design.
Raw or raw with header. Those are the simplest and codec-specific. Besides being designed (usually) for only one stream and one codec, they often decide to save bits on frames and in result you have hard time implementing seeking (say hello to FLAC or Moosepack SV7). Some have seek table at least (old Monkey’s Audio has two — for byte and bit position).
Your favourite FLV belongs to this category — it has one audio and one video stream with no headers (and that’s why it has its own flavour of VP6 with frame dimensions stored at every frame) though one can abuse it to add a data stream. And of course some Chinese used it to store HEVC too in the stupidest way possible (for starters they have introduced half a dozen of different video codec IDs for it).
Chunk-based. The most popular category that refuses to go away. The best representative is RIFF (M$ ripoff of EA IFF format, there are many specific RIFF variants known — AVI, RMF, WAV, WebP) and runner-up is MOV/MP4. AVI is verily the pinnacle — flexible, extensible, every frame is its own chunk. What can go wrong with it? The usual thing: abuse. Too many idiots implemented their own AVI writers with whatever bugs they could introduce and it got even worse when codecs started to employ B-frames. Intel worked around by adding combined I+B-frame and dummy frame afterwards so decoder would handle it internally (you can see it both in Indeo 4 and their I.263). DiVX on the other tentacle… And variable framerate is not for AVI either (unless you simply use zero frames to define skips).
As for MOV/MP4 there seems to be a problem with parsing custom atoms (there are too many atom types around). And of course you have nice abuse like ASF packets stored inside MOV packets if you use Flip4Mac.
And if you replace chunks with an unholy mix of tags and UIDs you get MXF. That format doesn’t have a specification but rather a swarm of them so you don’t know which ones you’ll need to demux some file.
There’s NUT — probably the only format out there with two specifications and three or four implementations, each not agreeing with all other.
MPEG-TS inspired. MPEG-TS is one of overengineered container formats that nothing in this world would be able to demux a TS file with all possible features. And forget about seeking (unless you have an external index or build index yourself).
Of course such design inspired a lot of other formats that have some features of it but often those features are used without understanding why they are there. But result is good for streaming!!!1one
There’s ASF with crazy GUIDs for everything and fixed packet size (which means there’s no direct correspondence between ASF packet and stream packet anymore).
And there’s Ogg. Read this if you still haven’t.
Matroska. That’s a cancer — when you design container that should be able to contain everything and support any feature possible and it gets out of control you get Matroska. It’s based on binary XML, it can have any feature. And it stores every codec in its unique way — see what they call codec specs. So they save bytes here and there and demuxer should put them back, which is not nice, especially if you believe that demuxers and decoders do not need to know about each other.
If you wonder why I haven’t mentioned RealMedia, it’s because this format is an unholy mix of all categories:
- Old RealAudio is rather simple raw + header;
- RealMedia in general is chunk-based format (with a hack for B-frames even).
- Video frame can be split into several packets or several frames can be merged into single packet, a lot like MPEG-TS inspired formats.
- And they had mangled audio streams long before Matroska was here. Actually only some audio codecs data is stored as is, the rest is XORed or has permuted subpackets.