I was honoured to implement VC-1 decoder in course of Google Summer of Code.
I had almost no time to do anything this week but at least here is straightforward optimization of 4×4 transform.
Speed gain varies from 4 times (PII-266) to 7 times (PPC G4-1.42 GHz). My name is not Niedermayer and I’m not sure how to further optimize it / rewrite with MMX/AltiVec but it’s a good start. 8×8 transform will follow soon (I hope).
The test code is given below.
Function trans_slow is taken from VC1 reference decoder.
#include <stdio.h> #include <stdlib.h> #include <time.h> #define HWD16 short #define UHWD16 unsigned short #define WORD32 int /* 4x4 Inverse Transform matrix defined by the standard */ static const HWD16 T4[16] = { 17, 17, 17, 17 , 22, 10, -10, -22 , 17, -17, -17, 17 , 10, -22, 22, -10 }; static void trans_slow(HWD16 *pR, HWD16 *pD) { HWD16 pE[8*8]; const HWD16 *pT, *pC; UHWD16 i, j, k; WORD32 acc; /* Row transform: * Equation: E = (M * T + 4) >> 3 */ for (j = 0; j < 4; j++) { for (i = 0; i < 4; i++) { for (acc = 0, k = 0; k < 4; k++) { acc += pD[j*8+k]*T4[k*4+i]; } pE[j*8+i] = (HWD16) ((acc + 4) >> 3); } } /* Column transform: * Equation: R = (T' * E + C * 1 + 64) >> 7 */ for (i = 0; i < 4; i++) { for (j = 0; j < 4; j++) { for (acc = 0, k = 0; k < 4; k++) { acc += pE[i+k*8]*T4[k*4+j]; } pR[i+j*8] = (HWD16) ((acc + 64) >> 7); } } } static void trans_fast(HWD16 *pR, HWD16 *pD) { HWD16 pE[8*8]; const HWD16 *pT, *pC; UHWD16 i, j, k; WORD32 acc; register int t1,t2,t3,t4,t5,t6; HWD16 *src = pD; HWD16 *dst = pE; for(i = 0; i < 4; i++){ t1 = 17 * (src[0] + src[2]); t2 = 17 * (src[0] - src[2]); t3 = 22 * src[1]; t4 = 22 * src[3]; t5 = 10 * src[1]; t6 = 10 * src[3]; dst[0] = (t1 + t3 + t6 + 4) >> 3; dst[1] = (t2 - t4 + t5 + 4) >> 3; dst[2] = (t2 + t4 - t5 + 4) >> 3; dst[3] = (t1 - t3 - t6 + 4) >> 3; src += 8; dst += 8; } src = pE; dst = pR; for(i = 0; i < 4; i++){ t1 = 17 * (src[ 0] + src[16]); t2 = 17 * (src[ 0] - src[16]); t3 = 22 * src[ 8]; t4 = 22 * src[24]; t5 = 10 * src[ 8]; t6 = 10 * src[24]; dst[ 0] = (t1 + t3 + t6 + 64) >> 7; dst[ 8] = (t2 - t4 + t5 + 64) >> 7; dst[16] = (t2 + t4 - t5 + 64) >> 7; dst[24] = (t1 - t3 - t6 + 64) >> 7; src++; dst++; } } main(){ HWD16 in[64], outf[64], outs[64]; int i; clock_t time1, time2; srand(time(NULL)); for(i = 0; i < 64; i++) in[i] = (rand() / 24657) & 0xFFFF; time1 = clock(); for(i = 0; i < 1000000; i++) trans_slow(outs, in); time1 = clock() - time1; time2 = clock(); for(i = 0; i < 1000000; i++) trans_fast(outf, in); time2 = clock() - time2; printf("Transforming took %i and %i clocks\n", time1, time2); for(i=0;i<32;i+=8) printf("%6i %6i %6i %6i %6i %6i %6i %6i\n",outs[i+0],outs[i+1],outs[i+2],outs[i+3],outf[i+0],outf[i+1],outf[i+2],outf[i+3]); puts(""); }
First your code is somewhat buggy …
1. copy & paste causes ‘”‘ symbols to be somehow messed up here, maybe firefox localization “feature”, anyway having a downloadable file would have been nice
2. clock_t type vs %i
3. now the actual bug, your transform works with a 16bit temporary buffer and the input is random & 0xFFFF now 0x8000*(17+17+22+10) >> 3 doesnt fit in a 16bit variable, may i suggest that you use a int foobar[] style temp buffer, its faster too here
the column transform in the VC1 spec&reference SW looks funny, did you confirm that your implementation of the column transform is binary identical to the reference, i think it is but i might be wrong …
heres a optimized version of the row transform loop, the same can be done with the colun transform
for(i = 0; i < 4; i++){
t2= 17 * src[0] + 4;
t3= 17 * src[2];
t1= t2 + t3;
t2-= t3;
t4 = 10*(src[3] – src[1]);
t3 = t4 + 32*src[1];
t4 += 12*src[3];
tmp[0] = (t1 + t3) >> 3;
tmp[1] = (t2 – t4) >> 3;
tmp[2] = (t2 + t4) >> 3;
tmp[3] = (t1 – t3) >> 3;
src += 8;
tmp += 8;
}
and the input values will very often be 0 so checking for that and skiping the row transform or checking for x,0,0,0 and handling this as a special case (usefull for row and column transforms) or some similar method might significantly improve the speed
also note that if you want to write a vc1 decoder maybe its better if you optimize it after its working instead of too early as that might make debugging harder and introduce bugs …
Thanks for taking interest in this! Here are my answers:
1. I provided modified C file for VC1 reference decoder which I used to test correctness in real world. All seems to be correct (some bugs WERE in implementation but I stamped out them).
2. I’m very glad you thought about further optimization (but if not you then who?). I tried to create sane version of transform (matrix multiplication is slow^Wtheoretical way to do transforms) and all further optimizations will be done after I finish at least I-frame decoder.
Any chance you’ll be addressing the bane of my existence at the moment?
By which, of course, I’m referring to the “Image too large for all layers” problem that keeps popping up in the libvc1 reference decoder.
I thought it was just me to begin with, but I’ve found other references.
Whilst I have sound programming knowledge, my knowledge of video decompression/decoding is severely lacking, so I have no means to fix it =(
A simple question – are you sure you properly initialized decoder? This message should be triggered either for very large image sizes or incorrect [header] data.
I have a doubt in the implemantation of VC1 inverse transform
will the intermediate values range limited to 16 bit or 32 bit?
In spec, it is mentioned that, the result of row transform is limted to signed 13 bit , but the inverse quantized values are in the range of signed 12 bit ?
is it possible to get the result in 16 bit so that is is limited to 13 bit after right shifted by 3 ? or any clip operation is required in between?
What is the best book on this topic?