Sadly there are too many MP3s in my music collection to ignore the format, so I’ve finally implemented MP3 decoding support for NihAV. That involved introducing several new concepts which I’d like to review in this post.
Previously NihAV operated on a simple approach: there’s a demuxer that produces full packets, those packets are fed to the corresponding decoder and the decoded audio/video data is used somehow. With MP3 you have a raw stream of audio packets (sometimes with additional metadata). While I could pretend to have a demuxer (that would simply read data and form packets) I decided to do it differently.
I’ve decided that raw streams should be a separate entity and introduced NAPacketiser for forming packets from raw stream data and RawDemuxCore/RawDemuxer for the formats that may contain several raw streams (think MPEG PS/TS or even better don’t think about it at all) or one raw stream with either some header or some data mixed along with the raw streams.
So RawDemuxer functions as a normal Demuxer but instead of get_frame() it has get_data() that returns NARawData with a piece of raw data and a pointer to the stream it belongs to. Additionally RawDemuxerCreator has check_format() for probing the format without creating an actual demuxer instance first. This may be useful when the format is hard to detect with a static description in nihav_registry::detect.
Now what to do with it? That’s where NAPacketiser comes into play. This trait has the following methods:
- add_data() for queuing raw data that will be formed into packets;
- get_packet() that tries to produce a full packet from the internal buffer;
- reset() for clearing the buffer and the internal state;
- skip_junk() that tries to skip data until the next valid header;
- and parse_stream() for the case when we have a single raw stream without any headers (again, like MP3) so we can create a proper stream description and use it for producing packets that can be used in a conventional decoding process.
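To make the packetiser idea concrete, here is a minimal sketch in Rust. It is not the actual NAPacketiser: the Packetiser trait, the ToyMp3Packetiser type and all the method signatures are assumptions made for illustration, parse_stream() is left out, and only MPEG-1 Layer III headers are recognised.

```rust
// Bitrate (kbps) and sample rate tables for MPEG-1 Layer III; 0 marks free/invalid entries.
const BITRATES_KBPS: [u32; 16] = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 0];
const SAMPLE_RATES: [u32; 4] = [44100, 48000, 32000, 0];

/// Returns the frame length in bytes if `hdr` starts with a valid
/// MPEG-1 Layer III frame header, `None` otherwise.
fn frame_len(hdr: &[u8]) -> Option<usize> {
    // 0xFF plus 0xFA/0xFB: sync bits, MPEG-1, Layer III, with or without CRC.
    if hdr.len() < 4 || hdr[0] != 0xFF || (hdr[1] & 0xFE) != 0xFA {
        return None;
    }
    let bitrate = BITRATES_KBPS[(hdr[2] >> 4) as usize] * 1000;
    let srate = SAMPLE_RATES[((hdr[2] >> 2) & 3) as usize];
    if bitrate == 0 || srate == 0 {
        return None;
    }
    let padding = ((hdr[2] >> 1) & 1) as usize;
    Some((144 * bitrate / srate) as usize + padding)
}

/// Hypothetical, simplified packetiser interface (not NihAV's real trait).
trait Packetiser {
    fn add_data(&mut self, src: &[u8]);
    fn skip_junk(&mut self) -> usize;
    fn get_packet(&mut self) -> Option<Vec<u8>>;
    fn reset(&mut self);
}

#[derive(Default)]
struct ToyMp3Packetiser {
    buf: Vec<u8>,
}

impl Packetiser for ToyMp3Packetiser {
    // Queue raw data for later packet forming.
    fn add_data(&mut self, src: &[u8]) {
        self.buf.extend_from_slice(src);
    }
    // Drop bytes until a valid header is at the start of the buffer.
    // If none is found, keep the last three bytes as a possible partial header.
    fn skip_junk(&mut self) -> usize {
        let keep_tail = self.buf.len().saturating_sub(3);
        let pos = (0..keep_tail)
            .find(|&i| frame_len(&self.buf[i..]).is_some())
            .unwrap_or(keep_tail);
        self.buf.drain(..pos);
        pos
    }
    // Return one full frame, or `None` when the buffer does not start
    // with a valid header or the frame is not complete yet.
    fn get_packet(&mut self) -> Option<Vec<u8>> {
        let len = frame_len(&self.buf)?;
        if self.buf.len() < len {
            return None;
        }
        Some(self.buf.drain(..len).collect())
    }
    // Forget all buffered data.
    fn reset(&mut self) {
        self.buf.clear();
    }
}
```

A real implementation would probably report "need more data" and "invalid data" as different errors instead of folding both into None, and would check that another valid header follows the frame before trusting the first one.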
Essentially I have three codepaths: normal demuxer, raw stream demuxer with packetisers for each stream, and single raw stream. And my tools implement them with a special DemuxerObject that unites them, providing the same demuxer interface for all the cases. Why haven’t I done that in nihav_core so there’s just a single demuxer interface to call? Because I don’t like automagical things.
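The three codepaths can be sketched as a single enum. This is only an illustration of the idea and not the DemuxerObject from my tools: the Packet and RawData structs, the three traits and FixedPacketiser (a toy that cuts the input into fixed-size packets) are all hypothetical stand-ins with made-up signatures.

```rust
use std::io::Read;

struct Packet { stream: usize, data: Vec<u8> }
struct RawData { stream: usize, data: Vec<u8> }

// Hypothetical, heavily simplified interfaces.
trait Demuxer { fn get_frame(&mut self) -> Option<Packet>; }
trait RawDemuxer { fn get_data(&mut self) -> Option<RawData>; }
trait Packetiser {
    fn add_data(&mut self, src: &[u8]);
    fn get_packet(&mut self, stream: usize) -> Option<Packet>;
}

/// One object, three codepaths, same `get_frame()` for the caller.
enum DemuxerObject {
    Normal(Box<dyn Demuxer>),
    Raw(Box<dyn RawDemuxer>, Vec<Box<dyn Packetiser>>),
    RawStream(Box<dyn Read>, Box<dyn Packetiser>),
}

impl DemuxerObject {
    fn get_frame(&mut self) -> Option<Packet> {
        match self {
            // Normal demuxer: packets come out ready-made.
            DemuxerObject::Normal(dmx) => dmx.get_frame(),
            // Raw demuxer: route each raw chunk to the packetiser of its stream.
            DemuxerObject::Raw(dmx, packetisers) => loop {
                for (idx, pkt) in packetisers.iter_mut().enumerate() {
                    if let Some(packet) = pkt.get_packet(idx) {
                        return Some(packet);
                    }
                }
                let raw = dmx.get_data()?;
                packetisers.get_mut(raw.stream)?.add_data(&raw.data);
            },
            // Single raw stream: feed file data straight into one packetiser.
            DemuxerObject::RawStream(reader, pkt) => loop {
                if let Some(packet) = pkt.get_packet(0) {
                    return Some(packet);
                }
                let mut chunk = [0u8; 4096];
                let n = reader.read(&mut chunk).ok()?;
                if n == 0 {
                    return None; // end of input, leftover bytes are dropped
                }
                pkt.add_data(&chunk[..n]);
            },
        }
    }
}

/// Toy packetiser: every `size` bytes form one packet.
struct FixedPacketiser { buf: Vec<u8>, size: usize }

impl Packetiser for FixedPacketiser {
    fn add_data(&mut self, src: &[u8]) {
        self.buf.extend_from_slice(src);
    }
    fn get_packet(&mut self, stream: usize) -> Option<Packet> {
        if self.buf.len() < self.size {
            return None;
        }
        Some(Packet { stream, data: self.buf.drain(..self.size).collect() })
    }
}
```

The point of keeping this in the tools is that the caller picks the variant explicitly; nothing inside the library silently decides that a stream needs a packetiser.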
Of course I’m familiar with libavformat and its approach where a demuxer simply sets a flag on a demuxed stream and it may invoke a parser on it. Similarly it has a special demuxer for a raw stream that is essentially a thin layer over the corresponding AVParser (and effectively almost anything gets detected as MP3). I prefer to have clarity instead: if I have raw streams I prefer them to be recognised as such, chunks of raw data should not be confused for full packets and the way raw streams are handled should not be hidden inside the library with only some obscure options that affect the process.
Another thing worth mentioning is various metadata. Since I don’t care about it, I’ve written a small function to detect ID3 and APETAG and report it. If some metadata is found, I create BoundedFileReader to operate on the part of the file without metadata as if it were the whole file, and only after that I invoke format detection and probing. Again, metadata parsing may be an essential part of some other library while I prefer to detect (and skip) it inside the tools. Not everybody is happy when they invoke a player on an audio file and suddenly it displays an album cover (or even worse, refuses to play the file because that cover is a 2Kx2K JPG file) because metadata was parsed and images from it were put into new video streams.
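As an illustration of the "detect and skip" step, here is a rough sketch that finds the part of a buffer left after stripping ID3 tags. It is not the function from my tools: strip_id3 is a hypothetical name, it works on an in-memory slice instead of a file, and APE tags are not handled at all. It relies on the ID3v2 header being 10 bytes with a 28-bit syncsafe size (plus a 10-byte footer when flag 0x10 is set) and on ID3v1 being a 128-byte block starting with "TAG" at the end of the file.

```rust
use std::ops::Range;

/// Returns the byte range of `data` that remains after skipping
/// an ID3v2 tag at the start and an ID3v1 tag at the end (if present).
fn strip_id3(data: &[u8]) -> Range<usize> {
    let mut start = 0;
    let mut end = data.len();

    // ID3v2: "ID3", version (2 bytes), flags (1 byte), syncsafe size (4 bytes).
    if data.len() >= 10 && data.starts_with(b"ID3") {
        let size_bytes = &data[6..10];
        // Each size byte carries only 7 bits; the top bit must be clear.
        if size_bytes.iter().all(|&b| b < 0x80) {
            let body = size_bytes
                .iter()
                .fold(0usize, |acc, &b| (acc << 7) | usize::from(b));
            let footer = if data[5] & 0x10 != 0 { 10 } else { 0 };
            // Clamp so that a truncated file cannot give start > end.
            start = (10 + body + footer).min(end);
        }
    }

    // ID3v1: fixed 128-byte block at the very end starting with "TAG".
    if end - start >= 128 && data[end - 128..].starts_with(b"TAG") {
        end -= 128;
    }

    start..end
}
```

The returned range is what a bounded reader would then expose as if it were the whole file, so format detection never sees the tags.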
And the last thing I should mention is a can of worms related to raw streams that I don’t know how to handle: calculating duration and seeking. In theory MP3 bitrate should not change so you can divide total file size by the bitrate to get the length, but how to make it work with variable-size frames? And the same with seeking: you can calculate a rough position but considering that the MP3 frames can differ in length by one byte you risk seeking into the middle of a frame. As I said, I don’t know how to deal with it cleanly so I’ll leave it for later, maybe much much later.
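For reference, the arithmetic behind this looks roughly as follows. This is just a sketch under the constant bitrate assumption with made-up function names, using the usual MPEG-1 Layer III frame size formula (144 * bitrate / sample rate, plus one byte when the padding bit is set); it does nothing for VBR files.

```rust
/// Frame length in bytes for MPEG-1 Layer III.
/// `bitrate` is in bits per second, `sample_rate` in Hz.
fn frame_len(bitrate: u64, sample_rate: u64, padding: bool) -> u64 {
    144 * bitrate / sample_rate + u64::from(padding)
}

/// Estimated duration in milliseconds, assuming the whole payload
/// is audio coded at one constant bitrate.
fn cbr_duration_ms(payload_bytes: u64, bitrate: u64) -> u64 {
    payload_bytes * 8 * 1000 / bitrate
}

/// Rough byte offset for a given time under the same assumption.
/// The result will usually point inside a frame, so the caller
/// still has to search for the next valid header from there.
fn cbr_seek_offset(time_ms: u64, bitrate: u64) -> u64 {
    time_ms * bitrate / 8000
}
```

At 128 kbps and 44.1 kHz a frame needs about 417.96 bytes on average, so the encoder mixes 417-byte and 418-byte frames, which is exactly why a calculated offset cannot be trusted to land on a frame boundary.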
The first MPEG Audio frame may be a Xing frame and contain data that helps compute the duration. But not all MP3s have it.
Yeah, I’m trying to forget about it.