Transform is one of the essential parts of typical video codec, lots of them can be described as e.g. “DCT-based video codec using X coding [and additional features like …]”. That is why I’m starting with it.
In the extremely unlikely case you don’t know what DCT is and what it is for, here’s an oversimplified explanation. DCT stands for Discrete Cosine Transform, which means representing a discrete signal (i.e. signal made of values measured at certain fixed distance in time or space) with a combination of parts of cosine waves of different frequencies (depending on which parts you take you get several DCT types but that’s not important now). And you do it because it represents a block as frequencies and human eye perceives low frequencies better, so if you filter out high frequencies (and use coarser representation for low frequencies) you’ll get a block that compresses much better but still looks close enough to the original (especially if you’re not trained to recognize such compression artefacts).
Anyway, original DCT employs floating-point coefficients and requires too many operations if you do it in straightforward way (by matrix multiplication). So at first people thought up the ways to calculate it faster and many such ways were proposed (essentially you can factorise matrix into operations like a±b or a cos(X) + b sin(X); a sin(X) – b cos(X) which are much faster to compute). Then people thought about making it even faster and using only integer math to calculate the approximation of the transform. And finally you have a lossless (reversible) transforms which does not lose precision due to rounding and imprecise coefficients (i.e. if you apply forward transform first and then inverse transform on its result, you’ll get the original signal without any distortions; that was not always the case with DCT). After that you got H.264 and H.EVC with lossless transforms that only remotely resemble DCT and are not that efficient, but they’re easier to calculate and scale to different dimensions—but that’s a story for another time.
For more details on how to do it properly I refer you to the excelled two-part post from Fabian Giesen that can be found here (part 1) and here (part 2). (An uninteresting side story: somebody told me about those posts after I’ve reverse-engineered the transform already. Nevertheless it’s an interesting and educating reading.)
The problem with original floating-point DCT was that it was calculated slightly differently in different implementations, so even with the same codec but different decoders you got slightly different results (this was especially funny during DivX and XviD era). With integer transform specified for the decoder you can at least guarantee that all decoders will decode the same input in the same way. And with lossless transform you can even expect that the same source would be encoded in the same way with the same settings (not counting for the losses introduced on the other stages of encoders like different quantisation or psychovisual modelling).
Anyway, what about the specific version of DCT used in VP6? It is lossy integer DCT with the standard way to decode and it’s been employed since VP3 so you can even look at the original code (which I did eventually).
Since there’s normative IDCT, I simply implemented it by reversing the steps. Then I tested it by transforming a matrix with all but one elements zero (for all 64 positions of non-zero coefficient), transforming it back and comparing the results. The results were not matching (you had a couple of random plus and minus ones and the expected non-zero element value was not exactly the same). I tried the original forward transform from VP3 sources (and looked at VP6 encoder binary specification as well), it did the same but with an additional rounding during multiplication (and its output was essentially the same). Then I looked at Theora, it was doing more tricks to make it better—increased precision (adding two bits to the input and then removing them from the output values) and supposedly more accurate transform stage by replacing some of the constants and introducing non-trivial rounding values. Yet it’s not lossless either.
This is not an issue for the encoding process since an encoder is supposed to maintain not the original reference picture but rather the same reconstructed picture as decoder would (and as you remember, decoder reconstructs it using the normative IDCT). Plus the differences introduced by the transform should be negligible compared to the quantisation.
Okay, this part is done, next time I’m going to talk about coefficients coding.