How to perform fast motion search

To answer the obvious question with the obvious answer, brute force searching for a decent motion vector takes insanely large time. For example, VP6 motion search area be up to 63×63 pixels and checking all possible positions there requires a lot of tries. And if you remember that VP6 has quarterpel motion compensation precision, you should multiply that number by 16 possible sub-pixel positions. Obviously in order to reduce the number of tries various tricks are employed.

While by itself fast motion search methods I describe here are not that complex, it was rather hard to locate books where such details of developing video encoders are presented. At last I’ve found two or three books with the chapters dedicated to motion compensation plus the papers referenced there. The results of this mini-research are given below.

The principle of fast motion search is extremely simple: you check some selected points on the plane around the initial position (and including it), select the best candidate from them and either repeat the process starting from the new point or (if the current position is the best one) switch to a smaller pattern.

First, let’s look at historical fast motion search methods. Those methods often have limited search range by design and also they expected input data to be smooth and thus were prone to getting stuck in local minimum (that reminds me: will AV2 or AV3 switch to machine learning methods? Maybe simulated annealing and gradient descent can be used for motion estimation after all). The main representatives of such methods are: 2D-logarithmic search (where you start by checking four points lying half the search range away from the selected one, and if the current one is the best one, halve the range and repeat the operations until the range is just one pixel), three-step search (similar but it’s nine points being checked – first at (-4,-4), (-4,0), (-4,4), (0,-4), (0,0), (0,4), (4,-4), (4,0), (4,4) then the grid is shrunk twice and on the third step eight direct neighbours of the current point are inspected). There’s also a multi-resolution search which first searches for a match of a downscaled version of the block in the downscaled version of the frame and then, when the good match is found, increases resolution and repeats search until it reaches the original size.

Now let’s look at outdated motion search methods. These methods have appeared in the beginning of 2000s and were widely used in H.263 and MPEG-4 ASP encoders, so it should be a good fit for VP6 as well (so I picked them for my encoder as well). Unlike the previous group, these methods use constant grid for normal search and smaller grid for refinement search. The main representatives of these methods are diamond and hexagonal search. Of course there were many various search patterns proposed during the years but these two remain the most popular ones.

On the pictures below I tried to show how the search happens:

  • you start searching from the centre of the grid;
  • first you check pixels marked with black dots;
  • if one of them gives better match (e.g. the ones pointed to with a red or a green arrow), you select it as the next centre of search step and check new positions (marked with red or green dots correspondingly) plus keeping the results from already checked original candidates (the ones still marked with black dots);
  • if the current centre is the best position, you perform refinement step and check neighbouring pixels marked with blue dots.

Diamond shape search. Black dots—original position, red and green dots—possible new search positions, blue dots—positions for refinement check

Hexagonal shape search, the meaning is the same as with diamond shape search.

Another thing worth considering is how sub-pixel search is done. You can perform search like that from the start (i.e. for diamond or hexagonal search it won’t be “check candidate two pixels left from the current point” but rather “check candidate half-pixel left from the current point” for quarterpel resolution). You can perform full-pixel search first and then repeat the search from that point using sub-pixel precision. Or you can do full-pixel search first and perform sub-pixel MV refinement in a different way.

And finally, state of the art motion search methods. That would be various zonal searches. Encoders for formats starting with H.264 use these methods.

They develop further on the previous group of search methods in two directions: first, it starts with a reasonable assumption that the motion vector is likely to be similar to the ones in the neighbouring blocks and the one in co-located block (the block on the same position in the previous frame) so it selects those motion vectors as initial candidates (hence “zonal” in the name), and then it also adds better early termination conditions like when the search does not improve on the initial guess of taking the motion vector from the co-located block. The search process is often integrated with the rate-distortion optimisation part of the encoder, so it estimates encoding cost on the go to see if the motion estimation result is worth encoding. The best known method of this family is EPZS (Enhanced Predictive Zonal Search).

Now I’d like to describe how it all applies to my VP6 encoder.

As I mentioned above, I have diamond or hexagonal shape search as they’re simple and give good results. Additionally I introduced an early termination condition, so if sum of absolute macroblock differences is 256 or less I terminate the search. And since I perform search mostly on whole macroblocks (and on individual 8×8 blocks sometimes for four-MV mode), I can stop calculating differences after the sum for first couple of blocks is higher than the best (lowest) macroblock difference.

Another thing that I needed dealing with is sub-pixel search. I implemented it by searching the best position with pixel precision and then repeating the same process with sub-pixel precision. This works good enough and on average (for the small test video I used) it takes 5-8 matches against possible candidates instead of initial 20-30 matching attempts (for full search that would mean searching just for two pixels around the current point with no sub-pixel precision).

Overall, this turned out to be rather fun experience.

4 Responses to “How to perform fast motion search”

  1. Thank you for this beautiful post, I was eagerly waiting for this part, and it didn’t disappoint. I’m usually unable to understand all the codec talk on your posts, but the presentation on this one is so nice (images!) even an idiot like myself can get it. I’m planning on trying to apply similar techniques in real-time graphics, and when I saw the beginning of this series, I thought I’d wait for you to do the research on motion search for me ?

    I have only one request, you cite specific chapters of books and papers where you learned this. Mind to share the titles?

  2. Kostya says:

    Here are two books containing at least some information:

    Yun Q. Shi, Huifang Sun Image and video compression for multimedia engineering: fundamentals, algorithms, and standards third edition (CRC Press 2019) — chapter 12.3 gives an overview of historical motion search methods.

    Sergios Theodoridis, Rama Chellappa, David R. Bull, Min Wu Image and Video Compression and Multimedia (Academic Press 2014) — I’d say the best book on the topic. Chapter 5.02 (this is volume 5 in Academic Press Library in Signal Processing and they have sequential numbering) has all the information you need on theoretical principles and obsolete and historical search methods. I’d say this book and the paper on EPZS (given below) is all I needed.

    And here are the papers:

    Shan Zhu and Kai-Kuang Ma A New Diamond Search Algorithm for Fast Block-Matching Motion Estimation (IEEE Transactions on Image Processing Volume: 9, Issue: 2, Feb. 2000)
    Ce Zhu, Xiao Lin, and Lap-Pui Chau Hexagon-Based Search Pattern for Fast Block Motion Estimation (IEEE Transactions on Circuits and Systems for Video Technology Volume: 12, Issue: 5, May 2002)
    Alexis M. Tourapis Enhanced Predictive Zonal Search for
    Single and Multiple Frame Motion Estimation
    (Proceedings of SPIE vol. 4671, 2002)

    I suspect though that you might need a different kind of methods for computer graphics, maybe e.g. pel-recursive methods mentioned in the books will suit your needs better.

  3. Thank you sir, you are a true gent! I could find everything and I’ve just skimmed over, really nice stuff. Also thanks for the advice, I’ll keep it in mind. I haven’t started my research yet, but I suspect you’ve just given me a hell of a head start. Really appreciated.

  4. […] found value is good enough (e.g. block distortion is below some threshold). And I’ve reviewed the ways to save time on motion search already. Now what to do with intra 4×4 prediction mode selection? I’ve done a quick […]

Leave a Reply