Modularity — codec level
FFmpeg, obviously, was made to transcode MPEG video (initial commit had support for JPEG, MPEG-1/2 video, some H.263-based formats like M$MPEG-4, MPEG-4 and RV10, MPEG audio layers I-III and AC3). It was expanded to handle other formats, but the misdirection in the initial design has grown into MpegEncContext, which remains the ugliest part of libavcodec to date.
It is easy to start with an abstraction that all codecs consist of I/P/B-frames split into 16×16 macroblocks that have 8×8 DCT blocks. You just need to have some codec-specific decoding (or coding) for picture header or block codes, that’s all. And since they all are very similar, why not unite them into a single decoding function? I encourage everybody to look at mpv_decode_mb_internal in libavcodec/mpegvideo.c to see how this can go wrong.
Even among the codecs that should fit this simple model, I can still name two off the top of my head that don’t fit that well. H.263+ (or was it H.263++?) — it has packed PB-frames that have blocks for both P- and B-frame. IIRC it sends an empty frame just after that so reordering can take place. VC-1 has BI-frames that should be coded as I-frames but treated as B-frames; also it has block subdivision into 8×4, 4×8 or 4×4 subblocks. And there’s On2 VP3. This gets even better with the new generation of codecs — more reference frames and more complex relations between them — B-pyramid in H.264 and H.265 frame management. And there’s On2 VPx. Indeo 4/5 had complex frame management too — droppable references, B-frames, null frames etc.
So, let’s look at video codec decoding stages to see why it’s unwise to try to use a single context to bind them all.
- Sequence header — whatever defines codec parameters like frame dimensions, various features used in the bitstream etc. May be as simple as frame dimensions provided by the container; it may be codec extradata from the container as well; it may be as complex as H.265 having multiple PPSes referencing multiple SPSes referencing multiple VPSes.
- Picture header — whatever defines frame parameters. Usually it’s frame type, sometimes frame dimensions, sometimes quantiser, whatever vendor decides to put into it.
- Slice header — if codec has slices; if codec has separate plane coding or scalable coding it can be considered slices too. Or fields (though they can have slices too). Usually it has information related to slice coding parameters — quantiser, bitstream features enabled etc.
- Macroblock header — macroblock type, coded block pattern and other information are stored here.
- Spatial prediction information — not present for old codecs but is an essential part of intra blocks compression in the newer codecs.
- Motion vectors — usually a part of macroblock header but separated here to denote they can be coded in different ways too (e.g. newer codecs have to include reference frame index, for older codecs it’s obvious from the frame type).
- Block coefficients.
- Trailer information — whatever vendor decides to put at the end of the frame like CRC (or codec version for Indeo 4 I-frames).
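To make the staging a bit more concrete, here is a minimal sketch in C where each stage gets its own small structure and a later stage only sees an earlier one through a const pointer. The bit layouts are invented for this example and do not correspond to any real codec; `BitReader`, `PicHeader`, `SliceHeader` and both parse functions are hypothetical names, not anything from libavcodec.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-stage state: each struct holds only what its stage produces. */
typedef struct PicHeader   { int frame_type; int quant; } PicHeader;
typedef struct SliceHeader { int first_mb;   int quant; } SliceHeader;

/* Minimal MSB-first bit reader, for illustration only. */
typedef struct BitReader {
    const uint8_t *buf;
    size_t size_bits;   /* total number of valid bits in buf */
    size_t pos;         /* current position in bits */
} BitReader;

/* Read n bits; bits past the end of the buffer are returned as zeroes. */
static unsigned read_bits(BitReader *br, int n)
{
    unsigned v = 0;
    while (n-- > 0) {
        unsigned bit = 0;
        if (br->pos < br->size_bits)
            bit = (br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1;
        br->pos++;
        v = (v << 1) | bit;
    }
    return v;
}

/* Toy picture header (made up): 2 bits frame type, 5 bits quantiser. */
static void parse_pic_header(BitReader *br, PicHeader *pic)
{
    pic->frame_type = (int)read_bits(br, 2);
    pic->quant      = (int)read_bits(br, 5);
}

/* Toy slice header (made up): 8 bits first macroblock index, 1 bit
 * "override quantiser" flag, then 5 bits of quantiser if the flag is set.
 * The slice stage may read the picture header but cannot modify it. */
static void parse_slice_header(BitReader *br, const PicHeader *pic,
                               SliceHeader *sl)
{
    sl->first_mb = (int)read_bits(br, 8);
    sl->quant    = read_bits(br, 1) ? (int)read_bits(br, 5) : pic->quant;
}
```

The only point here is that `parse_slice_header` can inherit the picture-level quantiser without the two stages sharing one giant context.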
And yet there are many features that complicate implementing this scheme in the same framework — frame management (altref frames in VPx, two frames fused together as in Indeo 4 or H.263), sprites, scalable coding features, interlacing, varying block sizes (especially in H.265 and ripoffs). Do you still think it’s a good idea to fit it all into the same mpegvideo?
That is why I believe the best approach in this case is to have small reusable blocks that can be combined to make a decoder. For starters, a decoder should have more freedom in where it decodes to, which should be handy in decoding those fused frames; also, quite often one decoder is used inside another to decode a part of the frame, especially JPEG and WMV9/VC-1. Second, a decoder should be able to pick whatever components it needs: e.g. RealVideo 3/4 used H.264 spatial prediction and chroma motion compensation but the standard I/P/B frame management and its own bitstream decoding. WMV2 was mostly M$MPEG-4 with new motion compensation and a special I-frame decoder. AVS (the Chinese one) has 8×8 integer DCT coding but also spatial coding from H.264, and its frame management is almost standard I/P/B, but a P-frame references two previous pictures and they’ve added an S-frame, which is a B-frame with only forward references.
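A rough sketch of what "pick whatever components you need" might look like in C: shared building blocks take a destination pointer and a stride instead of a codec context, so they can write into any buffer, and a codec merely lists the pieces it uses. Everything named here (`DecoderParts`, `pred_dc`, `add_residue`, `toy_codec`, `reconstruct_intra_block`) is hypothetical, and the DC predictor is a generic H.264-style average that assumes both the top and left neighbours are available.

```c
#include <stddef.h>
#include <stdint.h>

/* Shared components know nothing about the codec, only where to write. */
typedef void (*IntraPredFn)(uint8_t *dst, ptrdiff_t stride, int size);
typedef void (*AddResidueFn)(uint8_t *dst, ptrdiff_t stride,
                             const int16_t *residue, int size);

/* A codec is described by the parts it picks (hypothetical structure). */
typedef struct DecoderParts {
    IntraPredFn  intra_pred;
    AddResidueFn add_residue;
} DecoderParts;

/* DC prediction: fill the block with the rounded mean of the row above
 * and the column to the left (both assumed to exist). */
static void pred_dc(uint8_t *dst, ptrdiff_t stride, int size)
{
    int sum = 0;
    for (int i = 0; i < size; i++)
        sum += dst[i - stride] + dst[i * stride - 1];
    int dc = (sum + size) / (2 * size);
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            dst[y * stride + x] = (uint8_t)dc;
}

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Add the decoded residue on top of the prediction, clipping to 8 bits. */
static void add_residue(uint8_t *dst, ptrdiff_t stride,
                        const int16_t *residue, int size)
{
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            dst[y * stride + x] =
                clip_u8(dst[y * stride + x] + residue[y * size + x]);
}

/* One imaginary codec that borrows both shared pieces. */
static const DecoderParts toy_codec = { pred_dc, add_residue };

/* Codec-independent glue: predict, then add coefficients. */
static void reconstruct_intra_block(const DecoderParts *parts, uint8_t *dst,
                                    ptrdiff_t stride, const int16_t *residue,
                                    int size)
{
    parts->intra_pred(dst, stride, size);
    parts->add_residue(dst, stride, residue, size);
}
```

Since `dst` can point anywhere, the same functions would work for a block inside a regular frame, inside half of a fused frame, or inside a scratch buffer owned by another decoder.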
Hence I proposed a long time ago to split out at least frame management in order to reduce decoder dependencies on mpv (it sank into the swamp; again, no-one cared). Then I proposed the same for block management functions (the utility functions that update and provide pointers to the current block on output frame planes). That sank into the swamp too. I’d propose something else in that direction, but it would burn down, fall over, then sink into the swamp, since no-one cares about my proposals.
Still, here’s how I see it.
#include "block_stuff.h"
#include "frame_mgmt.h"
#include "h264/intra_pred.h"
Since this is not intended for the user, it can have multiple smaller headers with only related stuff. Also, large codec data should have been moved into separate subdirectories ages ago. There are already more than a thousand files in libavcodec.
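/* Sketch: every helper and type here except get_bits() is hypothetical. */
static int decode_frame(DecContext *ctx, GetBitContext *gb)
{
    int frame_type = get_bits(gb, 2);
    Frame *cur_frm = ipb_frame_get_cur(ctx->ipb, frame_type);
    Macroblock *mb = &ctx->mb;

    init_block_pos(ctx->blk, cur_frm);
    for (int i = 0; i < ctx->nb_mbs; i++) {
        update_block_pos(ctx->blk);          /* advance to the current macroblock */
        decode_mb(ctx, gb, ctx->blk, mb);    /* codec-specific bitstream parsing */
        if (mb->type == INTRA)
            h264_pred_spatial(ctx->blk, mb); /* shared spatial prediction */
        else
            idct_put_mb420(ctx->blk, mb);    /* shared IDCT and output for a 4:2:0 macroblock */
    }
    ipb_frame_update_refs(ctx->ipb, frame_type);
    return 0;
}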
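For illustration, the two frame management calls used above could be backed by something as small as the following. This is only my guess at a minimal shape, assuming a classic I/P/B scheme with two references and a pool of three frames; the function names come from the sketch above, but their bodies, the `IPBFrames` structure, the `ipb_frame_init` helper and the frame type constants are all assumptions.

```c
#include <stddef.h>

enum { FRAME_I, FRAME_P, FRAME_B };

typedef struct Frame { int id; } Frame;   /* stand-in for real picture planes */

typedef struct IPBFrames {
    Frame  pool[3];     /* two references plus the frame being decoded */
    Frame *prev_ref;    /* older reference (backward in display order) */
    Frame *next_ref;    /* most recently decoded reference */
    Frame *cur;         /* frame currently being decoded */
} IPBFrames;

static void ipb_frame_init(IPBFrames *ipb)
{
    for (int i = 0; i < 3; i++)
        ipb->pool[i].id = i;
    ipb->prev_ref = ipb->next_ref = ipb->cur = NULL;
}

/* Pick an output frame that is not in use as a reference.
 * Returns NULL if the references required by this frame type are missing. */
static Frame *ipb_frame_get_cur(IPBFrames *ipb, int frame_type)
{
    ipb->cur = NULL;
    if (frame_type == FRAME_P && !ipb->next_ref)
        return NULL;
    if (frame_type == FRAME_B && (!ipb->prev_ref || !ipb->next_ref))
        return NULL;
    for (int i = 0; i < 3; i++) {
        Frame *f = &ipb->pool[i];
        if (f != ipb->prev_ref && f != ipb->next_ref) {
            ipb->cur = f;
            break;
        }
    }
    return ipb->cur;
}

/* After decoding: I- and P-frames become the newest reference,
 * B-frames leave the reference list untouched. */
static void ipb_frame_update_refs(IPBFrames *ipb, int frame_type)
{
    if (frame_type == FRAME_B || !ipb->cur)
        return;
    ipb->prev_ref = ipb->next_ref;
    ipb->next_ref = ipb->cur;
}
```

A codec with a slightly different scheme (say, P-frames referencing two previous pictures) would swap in its own manager with the same two entry points and leave the rest of the decoder alone.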
We have a lot of smaller blocks here encapsulating needed information — frame management, macroblock position and decoded macroblock information. Many chunks of code are the same between codecs, and you often don’t need a full context for a small function that can be reused everywhere. Like spatial prediction — you just need to know if you can have neighbouring pixels, what prediction method to apply and what coefficients to add afterwards — be it RealVideo 3, H.264, or VP5. Similarly, after motion vectors are reconstructed you do the same thing in most codecs — copy a rectangular area to the current frame using motion compensation functions. So write it once and reuse everywhere — and you need just a couple of small structures with essential information (where to copy to and what functions to use), not MpegEncContext.
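As a sketch of the "couple of small structures" idea for motion compensation, the following C fragment keeps only where to copy to, where to copy from and which function does the copying. The names (`Plane`, `McJob`, `mc_copy_fullpel`, `mc_block`) are made up for this example, only full-pel vectors are handled, and blocks reaching outside the picture are simply rejected where a real decoder would use edge emulation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*McFn)(uint8_t *dst, ptrdiff_t dst_stride,
                     const uint8_t *src, ptrdiff_t src_stride, int w, int h);

typedef struct Plane {
    uint8_t  *data;
    ptrdiff_t stride;
    int       width, height;
} Plane;

/* All the state motion compensation needs: destination, reference
 * and the function to use (e.g. a codec-specific interpolation). */
typedef struct McJob {
    Plane       *dst;
    const Plane *ref;
    McFn         copy;
} McJob;

/* Simplest possible MC function: plain copy for full-pel vectors. */
static void mc_copy_fullpel(uint8_t *dst, ptrdiff_t dst_stride,
                            const uint8_t *src, ptrdiff_t src_stride,
                            int w, int h)
{
    for (int y = 0; y < h; y++)
        memcpy(dst + y * dst_stride, src + y * src_stride, (size_t)w);
}

/* Copy a w x h block to (x, y) in the destination from (x + mvx, y + mvy)
 * in the reference. Returns 0 on success, -1 if either block would fall
 * outside its plane. */
static int mc_block(const McJob *job, int x, int y, int w, int h,
                    int mvx, int mvy)
{
    int sx = x + mvx, sy = y + mvy;

    if (x < 0 || y < 0 ||
        x + w > job->dst->width || y + h > job->dst->height)
        return -1;
    if (sx < 0 || sy < 0 ||
        sx + w > job->ref->width || sy + h > job->ref->height)
        return -1;

    job->copy(job->dst->data + y  * job->dst->stride + x,  job->dst->stride,
              job->ref->data + sy * job->ref->stride + sx, job->ref->stride,
              w, h);
    return 0;
}
```

Swapping `mc_copy_fullpel` for a half-pel or quarter-pel interpolator changes one pointer in `McJob`, and nothing in it depends on which codec produced the motion vector.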
Sigh, I really doubt I’ll see it implemented ever.