So, unlike those breakthrough codecs everybody talks about (I mean RMHD and ORBX.js), V-Nova Perseus has actually been delivered (but what do you expect from a codec announced on the first of April?) and is available in some Android app. So I've had a look at it.
The implementation seems bafflingly simple: there's a base layer, it gets upscaled 2x, and an enhancement is applied to the upscaled image. Those enhancements are essentially quantised differences after a 2×2 Haar transform, plus runs, all coded with context-dependent Huffman codes. If that reminds you of RealVideo, don't worry: they code those codebook descriptions too, so it's different.
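To make that concrete, here's a rough sketch of what the reconstruction seems to boil down to for one plane. The function name and plane layout are my inventions, and nearest-neighbour scaling stands in for the real interpolation filters; only the base-upscale-enhance structure and the 2×2 delta maths reflect what the decoder does:

#include <stdint.h>

static uint8_t clip(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* hypothetical per-plane reconstruction: upscale the base layer 2x,
 * then add the inverse-transformed residues to each 2x2 block */
void enhance_plane(const uint8_t *base, int bw, int bh,
                   const int16_t *res, uint8_t *dst)
{
    int w = bw * 2;
    for (int by = 0; by < bh; by++) {
        for (int bx = 0; bx < bw; bx++) {
            /* four dequantised residue values for this block */
            int a = res[0], b = res[1], c = res[2], d = res[3];
            res += 4;
            /* nearest-neighbour stand-in for the real upscaling filters */
            int p = base[by * bw + bx];
            uint8_t *out = dst + (by * 2) * w + bx * 2;
            out[0]     = clip(p + (a + b + c + d));
            out[1]     = clip(p + (a - b + c - d));
            out[w]     = clip(p + (a + b - c - d));
            out[w + 1] = clip(p + (a - b - c - d));
        }
    }
}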
I don't know if it really works as well as marketed, but it's an interesting approach, and it introduces some variety into the world of codecs that all look alike, mostly because they use the same principles as the standard video codec with some small enhancements or building blocks replaced with functional analogues. (Yes, I completely forgot about Daala; please remind me about it when they settle on the final design. It might be the codec of choice for GNU HURD NG by then too.)
This project has been discussed already on video forums like VideoHelp; first mistaken for a "codec" due to the lack of facts in the marketing mumbo-jumbo, it was later identified as a kind of visual "noise modeller", similar to SBR in audio technologies like mp3PRO or HE-AAC.
Yes, I remember all the original hype ("based on breakthrough technologies from the guy who worked on the MPEG-1 video standard and has done nothing else in the following years"), and the fact that it's a scalability extension like SBR (I heard more technical details last September), but now I've got hold of the decoding library for it (from some Android app) and looked at the code.
The terminology question (i.e. whether to call it a codec, a technology or something else) remains open. At least there's something working here that we can see (and that some people even fail to evaluate properly), unlike certain vaporware.
Do you have any more information about how the Haar transform is being used? Does the upscaled image go through a forward 2×2 Haar transform, have the Huffman-stored Perseus differences applied to the four bands, and then go through an inverse 2×2 Haar? Is this done for every frame, or are the differences temporally compressed as well?
Thanks as always for your awesome RE work!
It’s used for differences—i.e. it reads four values first and then applies
(a+b+c+d), (a-b+c-d), (a+b-c-d), (a-b-c-d)
to the 2×2 block of the already upscaled image (which is done using simple interpolation filters like 96, 32 or -10, 111, 30, -3). The transform is done on the residues simply to save some bits on transmitting them. The frames seem to be treated independently (well, if you start to involve temporal compression here you'll have to follow the structure of the underlying codec and deal with frames being coded in a different order, some frames being optional to decode, etc. etc.). And the work was not awesome, it's a quick look at a rather simple thing.
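To put numbers on it: if the decoded residues for a block are, say, a=4, b=-2, c=1, d=0, then the four deltas added to the 2×2 block come out as (a+b+c+d)=3, (a-b+c-d)=7, (a+b-c-d)=1 and (a-b-c-d)=5.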
When you're referring to the simple interpolation filters, are you just referring to how the image is upscaled? Or are you referring to them somehow being used to apply the differences to the source image? Or are the differences applied just using simple addition? (dst_pixel0 = upscaled_src_pixel0 + (a+b+c+d))?
When reading the four values, are they just Huffman-decoded and dequantised? Anything dynamic/special about how that's done?
One final question 🙂 As it's so simple (possibly by design, for low processing cost?), do you have any thoughts on a superior design? (A larger/different DWT? Arithmetic coding? Etc.)
Interpolation filters are what's used to upscale the input frame 2x or so.
I.e.
upscaled[0] = (src[0] * 96 + src[1] * 32) / 128;
upscaled[1] = (src[0] * 32 + src[1] * 96) / 128;
...
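For the four-tap filter I'd expect the two phases to look something like this (the tap alignment and the mirrored second phase are my guesses; only the -10, 111, 30, -3 coefficients come from the code):

upscaled[2*i]   = (src[i-1] * -10 + src[i] * 111 + src[i+1] * 30 + src[i+2] * -3) / 128;
upscaled[2*i+1] = (src[i-1] * -3 + src[i] * 30 + src[i+1] * 111 + src[i+2] * -10) / 128;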
(I have not looked at how the filters are selected and applied because it's a rather trivial operation.) And then the difference is simply added, i.e.
dst[x][y] = clip(upscaled[x][y] + (a+b+c+d));
dst[x][y+1] = clip(upscaled[x][y+1] + (a-b+c-d));
...
And now for the delta decoding. They use static Huffman codes given in the frame header, plus some tricks for better compression (like special treatment for large values and, IIRC, coding runs; again, it's all rather insignificant detail for a quick look).
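If you want the flavour of it, a decoder for such a scheme could look roughly like this. The symbol values, the helper functions and the run semantics are all invented for illustration; only "static Huffman table from the header plus escapes for large values plus runs" matches what's actually in the bitstream:

#include <stdint.h>

typedef struct BitReader BitReader;   /* opaque, assumed to exist */
typedef struct HuffTable HuffTable;
int get_huff_code(BitReader *br, const HuffTable *ht); /* assumed helper */
unsigned get_bits(BitReader *br, int n);               /* assumed helper */

#define ESCAPE_SYM 255  /* hypothetical "raw value follows" symbol */
#define RUN_SYM    254  /* hypothetical "run of zero deltas" symbol */

/* illustrative delta decoding: static Huffman codes plus an escape
 * for large values and a zero-run code */
void decode_deltas(BitReader *br, const HuffTable *ht,
                   int16_t *out, int count)
{
    int i = 0;
    while (i < count) {
        int sym = get_huff_code(br, ht);
        if (sym == ESCAPE_SYM) {
            out[i++] = (int16_t)get_bits(br, 16); /* large value sent raw */
        } else if (sym == RUN_SYM) {
            int run = get_huff_code(br, ht);      /* run length, also coded */
            while (run-- > 0 && i < count)
                out[i++] = 0;
        } else {
            out[i++] = (int16_t)(sym - 128);      /* assumed signed mapping */
        }
    }
}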
As for an improved design, I don't know; it's like with Daala: you need to test various ideas to see what sticks. For example, coding region shapes and the noise variance in them. Or having a dictionary of noise textures and applying them to the various tiles. Or using enumerative coding (i.e. code the residual average plus a number identifying which variant of the distribution it is, maybe in a pyramid hierarchy too). Anyway, I'm more of a critic who looks at what other people did and badmouths their work, because I lack the patience to make a decent implementation of a format encoder, let alone develop a format from scratch.
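For the curious, the simplest binary form of that enumerative idea is the textbook one: transmit the number of ones in a block, then the lexicographic index of the actual pattern among all patterns with that many ones. A sketch of the ranking step (nothing Perseus-specific here):

#include <stdint.h>

/* binomial coefficient, small n only */
static uint64_t binom(int n, int k)
{
    if (k < 0 || k > n) return 0;
    uint64_t r = 1;
    for (int i = 1; i <= k; i++)
        r = r * (n - k + i) / i;
    return r;
}

/* lexicographic index of a bit pattern among all n-bit patterns
 * with the same number of ones; transmit popcount plus this index */
static uint64_t enum_rank(const uint8_t *bits, int n)
{
    int ones = 0;
    for (int i = 0; i < n; i++) ones += bits[i];
    uint64_t rank = 0;
    for (int i = 0; i < n && ones > 0; i++) {
        if (bits[i]) {
            rank += binom(n - 1 - i, ones); /* patterns with 0 here come first */
            ones--;
        }
    }
    return rank;
}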