Disclaimer: all container formats suck, either by being too simple and tied to certain (types of) codecs, too inefficient (wasting too many bytes on frame metadata and headers compared to other formats), or too flexible and complicated to implement in full. And there’s Ogg.
Let’s start our story with old times. Back in the day Electronic Arts made probably the only two good things it’s ever made. I’m talking of course about Deluxe Paint and the IFF container format (and that happened 35 years ago; it’s a pity the company still exists). The chunked approach to storage was reused by certain other companies. Remember RIFF, which was used for storing audio (RMI or WAV), video (AVI) or pictures (WebP) among other things? Remember RealMedia? Remember QuickTime MOV? That’s the thing we’re going to talk about.
As you can guess, all these containers are just a series of chunks, with some chunks serving as containers for another list of chunks. MOV and MP4 are the exception because they have atoms and boxes respectively. Anyway, this structure allows you to put virtually anything inside, and while some formats used that in moderation (WAV is essentially a top-level RIFF chunk with a header chunk and a data chunk; there may be some user metadata present and that’s about it), others abused that possibility as much as possible (no points for guessing which).
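To show how little there is to the shared idea, here is a minimal sketch (in Rust, not taken from any actual demuxer) of reading one RIFF-style chunk header as WAV and friends use it: a FourCC followed by a 32-bit little-endian payload size, with container chunks simply wrapping more chunks in their payload.

```rust
use std::io::{Read, Result};

// A sketch of the common chunk idea using plain RIFF (as in WAV): a FourCC
// followed by a 32-bit little-endian payload size; container chunks like
// "RIFF" or "LIST" simply hold another sequence of chunks in their payload.
fn read_riff_chunk_header<R: Read>(src: &mut R) -> Result<([u8; 4], u32)> {
    let mut hdr = [0u8; 8];
    src.read_exact(&mut hdr)?;
    let fourcc = [hdr[0], hdr[1], hdr[2], hdr[3]];
    let size = u32::from_le_bytes([hdr[4], hdr[5], hdr[6], hdr[7]]);
    // the payload follows; an odd-sized payload is padded to an even length
    Ok((fourcc, size))
}
```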
Let’s quickly review the MOV structure: essentially you have a moov atom with the description of the container data (which includes atoms for each individual track description of course) and an mdat chunk with the actual data if you’re lucky. Sounds simple? The devil hides in the deeply-nested atoms.
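For illustration, here is a rough sketch (hypothetical code, not from NihAV) of walking the top-level atoms to find those two. Each atom starts with a 32-bit big-endian size that includes the 8-byte header itself, followed by a FourCC; a size of 1 means a 64-bit size follows, and a size of 0 means the atom runs to the end of the file.

```rust
use std::io::{Read, Result, Seek, SeekFrom};

// Walk the top-level atoms and report where moov and mdat are. Every atom
// starts with a 32-bit big-endian size (which includes this 8-byte header)
// followed by a FourCC; size == 1 means a 64-bit size follows, size == 0
// means the atom extends to the end of the file.
fn find_moov_and_mdat<R: Read + Seek>(src: &mut R) -> Result<()> {
    loop {
        let mut hdr = [0u8; 8];
        if src.read_exact(&mut hdr).is_err() {
            break; // treat any short read as end of file in this sketch
        }
        let mut size = u64::from(u32::from_be_bytes([hdr[0], hdr[1], hdr[2], hdr[3]]));
        let fourcc = [hdr[4], hdr[5], hdr[6], hdr[7]];
        let mut hdr_len = 8u64;
        if size == 1 {
            let mut ext = [0u8; 8];
            src.read_exact(&mut ext)?;
            size = u64::from_be_bytes(ext);
            hdr_len = 16;
        }
        match &fourcc {
            b"moov" => println!("track descriptions live here ({} bytes)", size),
            b"mdat" => println!("frame data lives here ({} bytes)", size),
            _ => {}
        }
        if size == 0 {
            break; // last atom, runs until EOF
        }
        // skip the payload of this atom and move on to the next one
        src.seek(SeekFrom::Current(size.saturating_sub(hdr_len) as i64))?;
    }
    Ok(())
}
```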
First, MOV has a lot of features for specific groups of things like:
- streaming—tracks can be joined into groups to signal that only one of them should be played depending on language/quality/decoding capabilities (kinda like what RMVB was famous for);
- mastering—all those atoms for matte, kropping! clipping and the notorious edit lists (more on those in the sketch right after this list);
- DVD-like playback—the less said about the track referencing feature the better.
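Since edit lists will come up again below, here is a purely illustrative sketch of what a single version 0 elst entry carries; every entry remaps a stretch of the movie timeline onto the media timeline, which is exactly what makes timestamp handling annoying.

```rust
// Illustrative layout of a single version 0 edit list (elst) entry; the
// struct itself is made up for demonstration, the field meanings are not.
struct EditListEntry {
    segment_duration: u32, // how long this edit lasts, in movie timescale units
    media_time: i32,       // where in the track media it starts; -1 marks an empty edit
    media_rate: u32,       // playback rate as a 16.16 fixed-point value (0x0001_0000 = 1.0)
}
```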
But why did I say that you have an mdat chunk with the actual data “if you’re lucky”? Because of the wonderful feature called data reference which tells you where the actual data is stored: it may be the same file, the same file but a different resource fork (for classic MacOS), or a different file; you can even get some URL to the resource with the data. And of course different tracks may be stored in different data locations. Flexibility!
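Here is a hedged sketch of the kinds of locations a data reference entry can send you to; the enum and names are made up for illustration, the situations are not.

```rust
// Illustrative enum of where a data reference (dref) entry may send you for
// the actual sample data.
enum DataLocation {
    SameFile,          // the self-contained case: the data sits in this file's mdat
    OtherFile(String), // an alias/path to another file (or its resource fork on classic MacOS)
    Remote(String),    // a URL pointing at some resource holding the data
}

fn open_data(loc: &DataLocation) {
    match loc {
        DataLocation::SameFile => { /* just seek within the file you already have */ }
        DataLocation::OtherFile(path) => println!("open {} and hope it still exists", path),
        DataLocation::Remote(url) => println!("fetch {} and hope the server is up", url),
    }
}
```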
Then you get to the actual frames (or “samples” as they’re called). For certain reasons frames of the same track can be clustered together in a block (or “chunk” as the specification calls it). And to make things better, frames can have different durations stored in a separate table in run-length form, i.e. “first N frames have duration X, then the next M frames have duration Y, then the next K frames…”. So in order to extract a proper frame with a correct timestamp you just need to find out in which block it resides and at which offset, sum the durations of the previous frames, apply the information from the track metadata (don’t forget the edit list!) and you’re done. Except when you have audio in PCM format or compressed with some standard fixed compression scheme (A-law/μ-law, IMA ADPCM, MACE 3:1/6:1). Then you need to calculate the duration from the size. And that’s one of the things I really hate in containers: they should be codec-agnostic, in other words you should be able to seek to a certain time point (with a given precision of course), extract frame data and feed it to a decoder that does not know which container it came from either. And of course Matroska is the worst violator as it employs codec-specific ways to shave off some of the frame payload (so if you don’t know how to reconstruct the frame data the decoder won’t recognize it either).
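As an illustration of the duration table part, here is a small sketch (again, not actual NihAV code) of turning a sample number into a timestamp from such a run-length table; the edit list and the final timescale conversion are deliberately left out.

```rust
// Convert a sample number into a timestamp (in track timescale ticks) using
// a run-length duration table: each entry is (number of samples, duration of
// each of those samples).
fn sample_to_pts(durations: &[(u32, u32)], sample: u32) -> u64 {
    let mut pts = 0u64;
    let mut remaining = sample;
    for &(count, duration) in durations {
        if remaining < count {
            return pts + u64::from(remaining) * u64::from(duration);
        }
        pts += u64::from(count) * u64::from(duration);
        remaining -= count;
    }
    pts // past the last entry; a real demuxer would report an error here
}

// e.g. "first 100 frames last 400 ticks each, then 3 frames last 1000 ticks each":
// sample_to_pts(&[(100, 400), (3, 1000)], 101) == 100 * 400 + 1 * 1000 == 41000
```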
Anyway, being such a nice format with so many features that nobody needs 99% of the time, it was a perfect fit for the MPEG-4 container format. So MOV was adopted, some of its terminology was changed (as mentioned before, atoms->boxes) and lots of new features were added as well, both in the metadata and in the file structure (like those segments for streaming).
And if by this point the title is still not clear to you: the QuickTime MOV format appeared long before Matroska, but conceptually it had many of the features present in the latter as well. To put it crooked, if you replace atoms with EBML entities and add ways to reduce the payload (e.g. by generating missing frame headers) you can rename MOV to MKV and nobody will notice the difference.
P.S. It’s just that I’m writing a second iteration of the NihAV-based video player, which should be not something that merely plays a single file continuously (and maybe deadlocks in the process because of some SDL audio issue) but a proper player that can play multiple files, pause and seek. And while testing seeking in MOV files I’ve discovered some of those wonderful issues that inspired me to write this post.