MOV — Matroska of its time

Disclaimer: all container formats suck, either by being too simple and tied to certain (types of) codecs, too inefficient (by wasting too many bytes on frame metadata and headers compared to other formats), or too flexible and complicated to implement in full. And there’s Ogg.

Let’s start our story in the old times. Back in the day Electronic Arts made probably the only two good things it’s ever made. I’m talking of course about Deluxe Paint and the IFF container format (and that happened 35 years ago; it’s a pity the company still exists). The chunked approach to storage was reused by certain other companies. Remember RIFF that was used for storing audio (RMI or WAV), video (AVI) or pictures (WebP) among other things? Remember RealMedia? Remember QuickTime MOV? That’s the thing we’re going to talk about.

As you can guess all these containers are just a series of chunks, some chunks being a container for another list of chunks. MOV and MP4 are the exception because they have atoms and boxes respectively. Anyway, this structure allows you to put virtually anything inside, and while some formats used that in moderation (WAV is essentially a top-level RIFF chunk with header and data chunks inside; there may be user metadata present and that’s about it) some others abused that possibility as much as possible (no points for guessing).

Let’s quickly review the MOV structure: essentially you have a moov atom with the description of the container data (which includes atoms for each individual track description of course) and an mdat chunk with the actual data if you’re lucky. Sounds simple? The devil hides in the deeply-nested atoms.
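
To make the nesting concrete, here is a minimal sketch of walking the atom tree. It assumes the usual atom header layout (32-bit big-endian size, four-character type, with size 1 meaning a 64-bit size follows and size 0 meaning "until the end"). The function name walk_atoms and the short CONTAINERS list are made up for this illustration and are nowhere near complete; a real demuxer has to know many more container atoms and has to parse the leaf ones too.

```python
import struct
from io import BytesIO

# Atoms assumed to contain only child atoms (a small, non-exhaustive set).
CONTAINERS = {b"moov", b"trak", b"mdia", b"minf", b"stbl", b"dinf", b"edts"}

def walk_atoms(stream, end, depth=0, out=None):
    """Collect (depth, type, payload_offset, payload_size) for atoms up to `end`."""
    if out is None:
        out = []
    while stream.tell() + 8 <= end:
        start = stream.tell()
        size, kind = struct.unpack(">I4s", stream.read(8))
        header = 8
        if size == 1:        # a 64-bit size follows the type
            (size,) = struct.unpack(">Q", stream.read(8))
            header = 16
        elif size == 0:      # atom extends to the end of the enclosing space
            size = end - start
        if size < header or start + size > end:
            break            # malformed atom: stop instead of looping forever
        out.append((depth, kind, start + header, size - header))
        if kind in CONTAINERS:
            # Children live in the payload; the stream is already positioned there.
            walk_atoms(stream, start + size, depth + 1, out)
        stream.seek(start + size)
    return out
```

Calling walk_atoms(f, file_size) on an open file gives a flat list where the depth field shows how far down the moov hierarchy each atom sits, and the payload offset of mdat tells you where the frame data starts (if it is in this file at all).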

First, MOV has a lot of features for specific groups of things like:

  • streaming: tracks can be joined into groups to signal that only one of them should be played depending on language/quality/decoding capabilities (kinda like what RMVB was famous for);
  • mastering: all those atoms for matte, clipping and the notorious edit lists;
  • DVD-like playback: the less said about the track referencing feature the better.

But why did I say that you have mdat chunk with actual data “if you’re lucky”? Because of the wonderful feature called data reference which tells you where the actual data is stored: it may be the same file, the same file but a different resource fork (for classic MacOS), or a different file; you can even get some URL to the resource with the data. And of course different tracks may be stored in different data locations. Flexibility!

Then you get to actual frames (or “samples” as they’re called). For certain reasons frames of the same track can be clustered together in a block (or “chunk” as the specification calls it). And to make things better frames can have different durations, stored in a separate table in run-length form i.e. “first N frames have duration X, then next M frames have duration Y, then next K frames…”. So in order to extract the proper frame with a correct timestamp you just need to find out in which block it resides and at which offset, sum the durations of the previous frames, apply information from track metadata (don’t forget the edit list!) and you’re done. Except when you have audio in PCM format or compressed with some standard fixed compression scheme (A-law/μ-law, IMA ADPCM, MACE 3:1/6:1). Then you need to calculate duration from size. And that’s one of the things I really hate in containers: they should be codec-agnostic, in other words you should be able to seek to a certain time point (with a given precision of course), extract frame data and feed it to a decoder that does not know which container it came from either. And of course Matroska is the worst violator as it employs codec-specific ways to shave off some of the frame payload (so if you don’t know how to reconstruct the frame data the decoder won’t recognize it either).
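
The lookup described above can be sketched roughly like this. It assumes the sample tables have already been parsed into plain Python lists: stts as (count, duration) runs, stsc as (first_chunk, samples_per_chunk) pairs with 1-based chunk numbers, stco as chunk offsets and stsz as per-sample sizes. The function names are made up for the example, and it deliberately ignores edit lists, composition offsets and 64-bit chunk offsets, so treat it as an illustration and not as a real demuxer. The last helper shows the "duration from size" case for plain uncompressed PCM only.

```python
def sample_time(stts, index):
    """Decode time of sample `index` (0-based), in track timescale units."""
    time = 0
    for count, delta in stts:
        if index < count:
            return time + index * delta
        time += count * delta   # skip the whole run
        index -= count
    raise IndexError("sample index beyond the time-to-sample table")

def sample_offset(stsc, stco, stsz, index):
    """File offset of sample `index` (0-based)."""
    first_sample = 0  # index of the first sample in the current run of chunks
    for i, (first_chunk, per_chunk) in enumerate(stsc):
        # A run lasts until the next entry's first chunk (or the last chunk).
        next_first = stsc[i + 1][0] if i + 1 < len(stsc) else len(stco) + 1
        run_samples = (next_first - first_chunk) * per_chunk
        if index < first_sample + run_samples:
            rel = index - first_sample
            chunk = first_chunk - 1 + rel // per_chunk   # 0-based chunk number
            first_in_chunk = index - rel % per_chunk
            # Samples are stored back to back inside a chunk.
            return stco[chunk] + sum(stsz[first_in_chunk:index])
        first_sample += run_samples
    raise IndexError("sample index beyond the sample-to-chunk table")

def pcm_frames(nbytes, channels, bits_per_sample):
    """Number of PCM sample frames in `nbytes` of uncompressed audio."""
    return nbytes // (channels * (bits_per_sample // 8))
```

Even in this stripped-down form you need four tables just to answer “where is frame N and when should it be shown”, and the result still has to be run through the edit list before it becomes a presentation time.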

Anyway, being such a nice format with so many features not needed by anybody 99% of the time, it was a perfect fit for the MPEG-4 container format. So MOV was adopted, some of its terminology was changed (as mentioned before atoms->boxes) and lots of new features were added as well, both in metadata and in file structure (like those segments for streaming).

And if by this point the title is still not clear to you, the QuickTime MOV format appeared long before Matroska but conceptually it had many of the features present in the latter as well. To put it crudely, if you replace atoms with EBML entities and add ways to reduce payload (e.g. by generating missing frame headers) you can rename MOV to MKV and nobody will notice a difference.

P.S. It’s just that I’m writing a second iteration of a NihAV-based video player which should not just play a single file continuously (and maybe deadlock in the process of doing that because of some SDL audio issue) but be a proper player that can play multiple files, pause and seek. And while testing seeking in MOV files I’ve discovered some of those wonderful issues that inspired me to write this post.

6 Responses to “MOV — Matroska of its time”

  1. lu_zero says:

    I recall the deduplication hacks existing mainly to look better than the alternative, (see what NUT has for mp3 and similar), probably they could be deprecated nowadays…

  2. Kostya says:

    Matroska has “header stripping” compression method that stores common frame data prefix in metadata and prepends it to each demuxed frame. While I heard many people complained about it at least it’s codec agnostic instead of other hacks like removing chunk header for ProRes frames.

  3. Ugh, your post reminded me of how a general multimedia player needs to have 3 paths for handling audio in a MOV demuxer (in my experience): 1. Uncompressed PCM; 2. Compressed constant bitrate; 3. Compressed variable bitrate.

    I always felt one of the biggest mistakes of the MOV format was making the chunk offsets reference absolute offsets within the file– if they had been offsets relative to the start of the mdat chunk, the moov chunk would be a lot more mobile within the file, which would make it easier to move around the moov chunk for HTTP streaming. But then I never would have gotten to write my ‘qt-faststart’ program which I suspect is the most widely used piece of software I will ever write.

    I still can’t believe they named the chunks ‘atoms’ in the first place. Atoms containing other atoms…

  4. Kostya says:

    The main thing qt-faststart does besides correcting offsets is re-ordering atoms (and the header is easier to store in memory so it gets written last), so maybe the tool would have been written regardless. Still it gets funnier with a compressed header in the beginning and some padding after it that kinda defeats the purpose of compression.

    And you know why they gave chunks such an unfitting name—they thought differently.

  5. Andrew-R says:

    there is longish article on older m68k Macs and birth of quicktime:

    you can play with it in BasiliskII emulator

    of course then quicktime evolved – by version 3 it had built-in video effects, then streaming was bolted on… and there was quicktime-vr as distraction..

    I found another link chronicling early quicktime days

  6. Kostya says:

    This history is worth preserving and maybe somebody will write a book about all those multimedia frameworks, how they evolved (or were marketed) and how they died.

    I’m more interested in the technical side of things though, so I look at how it was done and try to find parallels to what others did (and quite often why it sucks). The first link you posted contains an explanation of why QT was designed the way it was. Thanks, that was an interesting read.