Why I am sceptical about AV1

I have wanted to write this post for a while, but I guess AV1.1 is the cherry on top of the whole mess called AV1 that finally made me do it.

Since I first heard about AV1 I have tried to follow its development as much as is possible for a person not subscribed to the main mailing list. And unfortunately, while we all expected a great new codec with cool ideas, we got VP10 instead (yes, I still believe that AV1 is short for “A Vp 1(0) codec”). Below I elaborate on my view and present the facts that formed my opinion.

A promising beginning

It all started with ITU H.EVC and its licensing—or rather its inability to be licensed. In case you forgot the situation, here’s a summary: there are at least three licensing entities that claim to have patents on HEVC that you need to license in order to use HEVC legally. Plus the licensing terms are much greedier than what we had for H.264, to the point where one licensing pool wanted to charge fees per stream, IIRC.

So it was natural that major companies serving video on the Internet wanted to stay out of this and use some common license-free codec, resorting to creating one if the need arose.

That was a noble goal that only HEVC patent holders could object to, so the Alliance for Open Media (or AOM for short) was formed. I am not sure about the details, but IIRC only organisations could join, they had to pay an entrance fee (or be sponsored—IIRC VideoLAN was sponsored by Mozilla), and the development process was coordinated via a members-only mailing list (since I’m not a member I cannot say what actually happened there or how, and have to rely on second- or third-hand information). And that is the first thing that I did not like—the process not being open enough. I understand that they might not have wanted some ideas leaked to competitors, but even people who were present on that list claim some decisions were questionable at best.

Call for features

In the beginning there were three outlooks on how it would go:

  • AV1 will be a great new codec that will select only the best ideas from all participants (a la MPEG but without their political decisions) and thus it will put H.266 to shame—that’s what optimists thought;
  • AV1 will be a great new codec that will select only the best ideas and since all of those ideas come from Xiph it will be simply Daala under new name—that’s what cultists thought;
  • Baidu wants to push its VP10 onto others, but since VP8 and VP9 had very limited success it will create an illusion of participation so that other companies feel they’ve contributed something and help spread it (also, maybe it will mostly be used for bargaining better licensing terms for some H.26x codecs)—that’s what I thought.

And it looks like all those opinions were wrong. AV1 is not that great, especially considering its complexity (we’ll talk about it later); its features were not always selected on merit (so most of the Daala stuff was thrown out in the end—but more about that later); and it looks like the main goal was to get hardware manufacturers interested in its acceptance (again, more on that later).

Anyway, let’s look at what the main feature proposals were (again, I could not follow the process closely, so maybe there were more):

  • Baidu’s libvpx with the current development snapshot of VP10;
  • Baidu’s alternative approach to VP10 using Asymmetric Numeral Systems coding;
  • Cisco’s simplified version of ITU H.EVC aka the Thor codec (which looks more like RealVideo 6 as a result), with some in-house developed filters that improve compression;
  • Mozilla-supported Daala ideas from Xiph.

But it started with a scandal, since Baidu tried to patent an ANS-based video codec (essentially just the idea of a video codec that uses ANS coding) after accepting the ANS inventor’s help, and ignored his existence and wishes afterwards.

And of course they had to use libvpx as the base because. Just because.

Winnowing

So after the initial gathering of ideas it was time to put them all to test to see which ones to select and which ones to reject.

Of course since organisations are not that happy with trying something radically new, AV1 was built on the existing ideas with three main areas where new ideas were compared:

  1. prediction (intra or inter);
  2. coefficient coding;
  3. transform.

I don’t think there were attempts to change the overall codec structure. To clarify: ITU H.263 used an 8×8 DCT, and intra prediction consisted of copying the top row or left column of coefficients from the reference block; ITU H.264 used a 4×4 integer transform and tried to fill a block from its neighbours’ already reconstructed pixel data; ITU H.265 used variable-size integer transforms (from 4×4 to 32×32), different block scans and quadtree coding of the blocks. Moving from VP9 to AV1, on the other hand, did not involve such significant changes.

So, for prediction there was one radically new thing: combining the Thor and Daala filters into a single constrained directional enhancement filter (or CDEF for short). It works great, giving an image quality boost at small cost. Another interesting tool is predicting chroma from luma (or CfL for short), which was a rejected idea for ITU H.EVC but was later tried in both Thor and Daala and found good enough (the history is mentioned in the paper describing it). This makes me think that if Cisco had joined efforts with the Xiph foundation they’d have been able to produce a good and free video codec without any other company. Getting it accepted by others, though…
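The principle behind CfL is simple enough to sketch in a few lines. Below is a toy illustration of my own (the real AV1 tool signals the scaling factor in the bitstream rather than deriving it, and applies it to subsampled luma AC values): predict each chroma sample as the chroma average plus a scaled luma AC contribution.

```python
# Toy chroma-from-luma (CfL) sketch: chroma ~ DC + alpha * (luma - mean(luma)).
# Here alpha is fitted by least squares purely for illustration; actual AV1
# signals alpha explicitly so the decoder does not have to derive it.
def cfl_predict(luma, chroma):
    n = len(luma)
    luma_mean = sum(luma) / n
    chroma_dc = sum(chroma) / n
    ac = [l - luma_mean for l in luma]            # luma AC part
    denom = sum(a * a for a in ac)
    alpha = sum(a * (c - chroma_dc) for a, c in zip(ac, chroma)) / denom if denom else 0.0
    return [chroma_dc + alpha * a for a in ac]

# Perfectly correlated planes are predicted exactly:
print(cfl_predict([10, 20, 30, 40], [5, 10, 15, 20]))  # → [5.0, 10.0, 15.0, 20.0]
```

When luma and chroma correlate well (as they often do, especially on synthetic and animated content), the residual after such prediction is close to zero, which is where the compression gain comes from.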

Now coefficient coding. There were four approaches initially:

  • VP5 bool coding (i.e. binary coding of bits with fixed probabilities that get updated once per frame; it appeared in On2 VP5 and survived all the way to VP10);
  • ANS-based coding;
  • the Daala classic range coder;
  • Thor variable-length codes (probably not even officially proposed, since they are significantly less effective than any other proposed scheme).

ANS-based coding was rejected, probably because of the scandal and because it requires data to be coded in reverse direction (the official reasoning is that while it was faster on a normal CPU it was slower in some hardware implementations—a common reason for rejecting a feature in AV1).
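For those unfamiliar with why the coding direction matters, here is a toy rANS round-trip of my own (a tiny fixed frequency table, no renormalisation), not the actual rejected proposal. Note that the encoder consumes symbols in reverse so the decoder can emit them in normal order; this means the encoder must buffer a whole chunk of data before producing any output.

```python
# Toy rANS: three symbols with frequencies summing to a power of two (8).
freqs  = {'a': 5, 'b': 2, 'c': 1}
starts = {'a': 0, 'b': 5, 'c': 7}   # cumulative frequency starts
TOTAL = 8

def encode(symbols):
    x = 1
    for s in reversed(symbols):      # reverse order: the awkward part
        x = (x // freqs[s]) * TOTAL + starts[s] + x % freqs[s]
    return x

def decode(x, n):
    out = []
    for _ in range(n):
        slot = x % TOTAL
        s = next(sym for sym in freqs if starts[sym] <= slot < starts[sym] + freqs[sym])
        x = freqs[s] * (x // TOTAL) + slot - starts[s]
        out.append(s)
    return out

msg = list("abacab")
assert decode(encode(msg), len(msg)) == msg   # decodes in forward order
```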

The Daala approach won, probably because it’s easier to manipulate a multi-symbol model than to try to code everything as a context-dependent binarisation of the value (and you’d need to store and/or code a lot of context probabilities that way). In any case it was a clear winner.
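As a rough sketch of what manipulating a multi-symbol model means in practice: adaptation is just a per-entry nudge of a cumulative distribution toward the symbol that was coded. The snippet below is a simplified illustration of my own, not the exact AV1/Daala update rule:

```python
# Adaptive multi-symbol model sketch: cdf[i] holds scaled P(symbol <= i)
# for i = 0 .. n-2 (the top boundary, 1 << 15, is implicit).  After coding
# `symbol`, every boundary moves a small step toward that symbol's CDF.
def update_cdf(cdf, symbol, rate=5):
    for i in range(len(cdf)):
        target = (1 << 15) if i >= symbol else 0
        cdf[i] += (target - cdf[i]) >> rate
    return cdf

# A uniform 4-symbol model shifts toward symbol 0 after seeing it:
print(update_cdf([8192, 16384, 24576], 0))  # → [8960, 16896, 24832]
```

Contrast this with a binary scheme, where the same value would be split into several context-coded bits, each with its own probability to store and update.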

Now, transforms. Again, I cannot tell how it went exactly, but all the stories I heard were that the Daala transforms were better, but then Baidu had to intervene citing hardware implementation reasons (something along the lines of it being hard to implement new transforms, and why do that when we have working VP9 transforms with a tried hardware design), so the VP9 transforms were chosen in the end.

The final stage

In April 2018 AOM announced the long-awaited bitstream freeze, which came as a surprise to the developers.

The final final stage

In June it was actually frozen and AV1.0 was released along with the spec. Fun fact: the repository for av1-spec on baidusource.com that once hosted it (there are even snapshots of it from June in the Web Archive) is now completely empty.

And of course, because of some hardware implementation difficulties (sounds familiar yet?), we now have AV1.1, which is not fully compatible with AV1.0.

General impressions

This all started with good intent, but in the process of developing AV1.x it raised so many flags that I feel suspicious about it:

  • ANS patent;
  • Political games like A**le joining AOM as “founding member” when the codec was almost ready;
  • Marketing games, like announcing a frozen bitstream before a large exhibition while in reality it reached 1.0 status later and with little fanfare;
  • Not very open development process: while individual participants could publish their achievements and it was not all particularly secret, it was more “IBM open” in the sense it’s open if you’re registered at their portal and signed some papers but not accessible to any passer-by;
  • Not very open decision process: hardware implementation was very often quoted as the excuse, even in issues like this;
  • Not very good result (and that’s putting it mildly);
  • Oh, and not very good ecosystem at all. There are test bitstreams but even individual members of AOM have to buy them.

And by “not very good result” I mean that the codec is monstrous in size (the tables alone take more than a megabyte in source form, and there’s even more code than tables) and its implementation is as slow as the pitch drop experiment.

Usually people trying to defend it say the same two arguments: “but it’s just a reference model, look at JM or HM” and “codecs are not inherently complex, you can write a fast encoder”. Both of those are bullshit.

First, comparing libaom to the reference software of H.264 or H.265: while it is also formally reference software, there’s one huge difference. JM/HM were plain C/C++ implementations with no optimisation tricks (besides a transform speed-up by decomposition in HM), while libaom has all kinds of optimisations, including SIMD for ARM, POWER and x86. And the dav1d decoder, with a rather full set of AVX optimisations, is just 2–3 times faster (more when it can use threading). For H.264, optimised decoders were tens of times faster than JM. I expect a similar range for HM too, but two to three times faster is a very bad result against an unoptimised reference (which libaom is not).

Second, the claim that codecs are not inherently complex and thus you can write a fast encoder even if the codec is more complex than its predecessor. It is partly true in the sense that you’re not forced to use all possible features, and thus can avoid some of the combinatorial explosion by not trying some coding tools. But there are certain expectations built into any codec design (i.e. that you use certain coding tools in a certain sequence, omitting them only in certain corner cases), and there are certain expectations on compression level/quality/speed.

For example, let’s get down to basics and make an H.EVC encoder encode raw video. Since you’re not doing intra prediction, motion compensation or transforms, it’s probably the fastest encoder you can get. But in order to do that you still have to code the coding quadtrees and transmit flags saying the blocks contain PCM data. As a result your encoder will beat any other on speed, but it will still lose to memcpy(), which does not have to invoke an arithmetic coder for mandatory flags on every coded block (those flags also take space, along with padding to a byte boundary, so it loses in compression efficiency too). That’s not counting the fact that such an encoder is useless for any practical purpose.

Now let’s consider audio codecs—some of them use parametric bit allocation in both encoder and decoder (video codecs might start to use the same technique one day; Daala has already tried something like that), so such a codec needs to run the allocation regardless of how you try to compute a better one—you have to code it as a difference to the implicitly calculated one. And of course such a codec is more complex than one that transmits the bit allocation explicitly for each sample or sample group. But it gains in compression efficiency, and that’s the whole point of having a more complex codec in the first place.
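To make the idea concrete, here is a generic toy illustration of my own (not modelled on any specific codec): both encoder and decoder derive the same base allocation from already-transmitted band energies, so the encoder only needs to code small corrections to it.

```python
import math

# Both sides run this: allocate bits proportionally to log2(band energy).
def implicit_allocation(band_energies, total_bits):
    weights = [math.log2(max(e, 1)) for e in band_energies]
    wsum = sum(weights)
    return [round(total_bits * w / wsum) for w in weights]

energies = [1024, 256, 16, 4]          # already known to the decoder
base = implicit_allocation(energies, 64)
print(base)                            # → [27, 21, 11, 5]

# The encoder decides it actually wants a slightly different split and
# transmits only the small, cheap-to-code deltas:
wanted = [28, 22, 10, 4]
deltas = [w - b for w, b in zip(wanted, base)]
print(deltas)                          # → [1, 1, -1, -1]
```

The decoder recomputes the same base allocation and applies the deltas, so the full allocation never has to be transmitted explicitly.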

Hence I cannot expect AV1 decoders to magically be ten times faster than libaom, and similarly, while I expect that AV1 encoders will become much faster, they’ll still either measure encoding speed in frames per minute or be on par with x265 in terms of compression efficiency/speed (and x265 is also not the best possible H.265 encoder, in my opinion).


The late Sir Terence Pratchett (this world is a truly sadder place without his presence) used the phrase “ladies of negotiable hospitality” to describe a certain profession on Discworld. And to me it looks like AV1 is a codec of negotiated design. In other words, first they tried to design the usual general-purpose codec, but then (probably after seeing how well it performed) they decided to bet on hardware manufacturers (who else would make AV1 encoders and, more importantly, decoders perform fast enough, especially for mobile devices?). And that resulted in catering to every possible objection any hardware manufacturer in the alliance had (to the point of AV1.1).

This is the only way I can reasonably explain what I observe with AV1. If somebody has a different interpretation, especially based on facts I don’t know or missed, I’d like to hear it and know the truth. Meanwhile, I hope I made my position clear.

14 Responses to “Why I am sceptical about AV1”

  1. MPEGEncContext says:

    Most of this post, and AV1 in general as a whole, can be summarized as “Baidu bullies others and practices nepotism and doesn’t give a crap whether anyone outside BaiduTube actually adopts it” and maybe a little “A**le are plumheads who like IPR.”

    As for the Super Secret AOM list, don’t worry about it. Us members didn’t get TOO much insight from it. Lots of stuff still just happened behind closed doors, or on private BaiduDocs. Or it just turned up from a Baidu mail in git.

  2. Kostya says:

    Well, it’s just the usual story with any People’s Democratic Republic being anything but democratic. Those Open Standards quite often are made behind closed doors. After that you start to appreciate MPEG a bit more.

    And while I’m not particularly missing access to the mailing list, it would be nice to be able to provide links to it and find some facts there. For example, scalability looks like an afterthought, but maybe it was discussed right from the start. Or switching from WebMKV to MP4 as the main target container (outside IVF, of course)—was it some later (maybe even fruit-company-influenced) decision or not? What about superres and film grain? Trying to parse the git repository post factum and match it to various pieces of news is painful.

  3. HardwsreInput says:

    I can answer your question about hardware manufacturer input and whether it was a last minute change due to performance issues.

    The Mozilla/Xiph people have given presentations about progress all the way through the process, and one of the consistent messages was that everything was being reviewed by hardware experts at every stage, and this often had implications for their design decisions.

    You can find these on YouTube and in Xiph’s writeups on Mozilla.org and Xiph.org; nearly every one has a reference to this and to the AV1 alliance having hardware partners (as well as browser and delivery ones).

  4. Kostya says:

    Yes, there were presentations and write-ups (mostly but not exclusively by Xiph people). And it is also true that AOM is formed from various kinds of companies—streaming, browsers, hardware. But the decision process was still unclear and gave me the impression that things were first decided on a general level (i.e. if the feature gives a compression gain and is not very complex it should be included) and then any opinions from hardware manufacturers (probably different from the hardware experts reviewing those features earlier) were treated as vetoes or final says (i.e. “we might have trouble implementing this—okay, we amend it or throw it away”). It is also hard to explain AV1.1 otherwise.

  5. utack says:

    The worst thing is probably that the encoder is not even trying to get a good result.
    An at least somewhat reasonable AQ would give a lot of benefit for human perception, but they mostly care about good PSNR for some press or papers to publish.
    The result is that it can be worse than x264 in many scenarios, and that companies will have no great interest in improving it, because for any reasonable use case they must get a commercial encoder (TwoOrioles or others).

  6. Kostya says:

    Well, that part I have no problems with: if you care about the format enough you write your own encoder, be it hardware assisted one or pure software. And here it’s probably just its software roots showing up since libvpx was not that terrific encoder either.

  7. Peter says:

    Excellent point regarding reference implementation performance.

  8. Kostya says:

    It gets funnier if you look at aomedia.org/about/av1-roadmap/, as it mentions having an unoptimised software implementation first and optimising it in phase 2 when the codec is ready, but in reality they started working on SIMD optimisations right from the start.

  9. teeg says:

    i guess i’m kinda conflicted about which video codec is best now

  10. Kostya says:

    There’s hardly a theoretically best codec; there’s the codec (and its implementation) that works best for your scenario: for archiving you’d prefer a lossless codec with the best possible compression, for editing intermediate footage you’d prefer something with scalability, for streaming you’d prefer something with good compression and small lag, etc. As somebody said, the advantage of standards is that there are a lot of them to choose from 😉

  11. Aswin says:

    The article mentions Baidu – I don’t see this company on the list of founding members at all. Are we using this as a search-and-replace term for the G company?

  12. Kostya says:

    Yes, I always shorten American Baidu to just Baidu. Usually it’s clear from the context and somebody has to keep the joke running…

  13. derf says:

    > For example, scalability looks like an afterthought but maybe it has been discussed right from the start.

    Scalability roughly followed the VP9 design (both designs were largely contributed by Vidyo). You could maybe do more sophisticated things, but the simple design was good enough for Vidyo’s use-case and does not run the patent risks that more sophisticated things do.

    > Or switching from WebMKV to MP4 as main target container (outside IVF of course)—was it some later (maybe even fruit company effected) decision or not?

    This was pretty much a universal decision, made relatively early on, following a longstanding trend in the industry. There certainly wasn’t any controversy over it.

    > What about superres and film grain?

    Not sure what about them.

    Superres is just trying to solve the problem of dynamically downscaling the frame (to allow operating over a wider bitrate range) while taking advantage of one of AV1’s new tools (loop restoration), which can clean up some rescaling artifacts when allowed to run at original resolution. It got changed to horizontal-only scaling because hardware vendors did not want to add additional line buffers (which are very costly in terms of chip area), and it took a few tries to convince people that there was enough benefit to make it worth adding, but that’s all pretty normal.

    Film grain was primarily proposed by Netflix (as you can see by the commit logs), as it is particularly suited to their type of content and target quality. There was a little bit of song and dance around how to define conformance w.r.t. film grain, since it is purely a post-filter outside of the prediction loop, but it cannot be skipped without serious visual implications. That was made more complex by the fact that some hardware decoders would want to apply it during their display pipeline processing to avoid additional buffers and memory bandwidth, and that processing also includes things like scaling and RGB conversion. That might make it less practical to do something bit-exact with libaom, and bit-exactness is not really required outside of the prediction loop as long as the correct visual character is retained. Eventually Netflix and the hardware companies worked out a compromise that I think everyone was okay with, and that is what is in the spec.

    With respect to 1.1 (now officially classified as 1.0.0 with errata), I think you are blowing that a little out of proportion. HEVC also had corrections issued after the original spec was frozen. Like software, there’s always going to be a few mistakes in a several-hundred-page technical specification. Argon Design thought that AV1 had fewer bugs in it than HEVC did when it was frozen, and they should know, having made conformance streams for both. None of the errata are going to impact hardware ship dates, for example. Indeed it was hardware companies that noticed the mistakes and proposed the fixes.

  14. Kostya says:

    > Superres is just trying to solve the problem of dynamically downscaling the frame (to allow operating over a wider bitrate range)

    But why does that sound like scalability to me?

    > Indeed it was hardware companies that noticed the mistakes and proposed the fixes.

    It felt more like hardware companies complaining that some part of AV1 allowed more flexibility than they wanted to have in their hardware decoders, so the specification had to be amended to disallow that. Also, nice move with the re-versioning.

    Many thanks for that additional information, the history of AV1 development is clearer now.
