From: Chris Russ Subject: Re: Fast convolution algorithm. Date: 30 Aug 1999 00:00:00 GMT Message-ID: <37CA8D35.B29E1FA3@aol.com> Content-Transfer-Encoding: 7bit References: <37C64DEE.24A46550@torsana.com> <7qckk4$7oh$1@newsfeed.unitel.co.kr> X-Accept-Language: en X-Notice: should be reported to postmaster@ibm.net Content-Type: text/plain; charset=us-ascii; x-mac-type="54455854"; x-mac-creator="4D4F5353" X-Complaints-To: postmaster@ibm.net X-Trace: 30 Aug 1999 13:58:48 GMT, 32.100.213.153 Organization: Reindeer Games, Inc. MIME-Version: 1.0 Reply-To: jcr6@aol.com Newsgroups: sci.image.processing,comp.graphics.algorithms I got a few emails over the weekend: > So how *do* you code a fast spatial convolution? I guess I deserve it for taunting people. Here we go. NOTE: While my examples are all using FP math, you will find that INT multiplies are a *lot* slower that INT adds, and that by reducing the multiplies in your code, we shift the throughput bottleneck to the memory system. There are video game tricks that are suitable for speeding up those accesses (like tickling an address before you need it so that it will be cached when its turn comes), but that is subject for another discussion *and* becomes highly processor dependent. I also neglect the overhead of converting from INT to FP and back. My goal is to show algorithms, not quality code. Nor is there a warranty on the code of any kind. ;-) USING SYMMETRY TO REDUCE COMPUTATION Let's talk about symmetry. This discussion assumes that the convolution kernel is radially symmetric. Thus, it applies to gaussians, laplacians, d.o.g's, cross-correlation templates (if they're symmetric), etc. If not, then you may be able to use some portion of this symmetry to get speed. H & V Bilateral Symmetry (works with kernels in the form): A B C B A D E F E D G H I H G (remember "I" is the center pixel) D E F E D A B C B A This is the 5x5 case of a bilaterally symmetric kernel. There are 25 entries in the kernel, but only 9 unique ones. How can we make use of this? Normally, I would convolve this with a 2-D array Kernel[5][5] in a doubly nested FOR-loop. There will be 25 accesses to the various lines in the image (5 on each line, 5 lines) and 25 accesses to the Kernel array. It turns out that I can reduce a sizeable chunk of that to 25 accesses to the image (I still have to sample all of the points in the image) but only 9 accesses to the kernel array (or conceivably to registers if I explicity write out the code). How? CASE 1 : ORIGINAL CODE - ARBITRARY CONVOLUTION The original code would look something like this: (BTW, I'm doing everything in floats - I assume that the image is scaled 0..1 and that the original kernel is already normalized... I'm also leaving the end conditions as an exercise to the reader) register float *LinePtr[5]; register float *Kernel[5]; register float *TempLine; register float ImageOffset; //for many kernels (Guassian, Sharpen, etc) //this is 0.0; For a Laplacian should be 0.5 //... skip over the preloading / image mirroring code for the top of the image //... assume that LinePtr[0], LinePtr[1], LinePtr[2], LinePtr[3], and LinePtr[4] // point at the current 5 lines of the image. I do this by allocating storage // and rotating pointers to the storage at the end of the y-loop for (y = 0; y < height; y++) { //LinePtr[5] = GETLINE(y + 2); //inner loop - convolution occurs here for (x = 0; x < width; x++) { register double sum = 0.0; //for *big* kernels, need a double to accumulate result. for (dy = 0; dy < 5; dy++) { float *kLine = Kernel[dy]; float *iLine = LinePtr[dy]; for (dx = 0; dx < 5; dx++) sum += (double)(kLine[dx - 2] * iLine[dx - 2]; // <--- convolution inner product }//dy-loop RESULTLINE[x] = sum + ImageOffset; }//x-loop //rotate pointers TempLine = LinePtr[0]; LinePtr[0] = LinePtr[1]; LinePtr[1] = LinePtr[2]; LinePtr[2] = LinePtr[3]; LinePtr[3] = LinePtr[4]; LinePtr[4] = TempLine; //PUTRESULTLINE[y] - write result to the image }//y-loop CASE 2 :BILATERAL SYMMETRY (only need a 3x3 for the kernel) {//y - loop {//x - loop ... register double sum = 0.0; //doubles take a *lot* longer on PCs. //for small kernels, use floats instead... register float *kLine; for (dy = 0; dy < 2; dy++) //top & next to top lines { register float *iLineTop = LinePtr[dy]; register float *iLineBot = LinePtr[4 - dy]; kLine = Kernel[dy]; for (dx = 0; dx < 2; dx++) //small enough kernel, you could write it out explicitly { register int idx = 4 - dx; //do 4 entries at a time. Note, if your image is integer, this is even faster since // integer multiply is a *lot* slower that integer add sum += Kernel[dx] * (iLineTop[dx - 2] + iLineTop[idx - 2] + iLineBot[dx - 2] + iLineBot[idx - 2]); }//dx-loop }//dy - loop //middle horiz line (dy = 2, dx = 0..5) TempLine = LinePtr[2]; kLine = Kernel[2]; for (dx = 0; dx < 2; dx++) { register int idx = 4 - dx; sum += kLine[dx] * (TempLine[dx - 2] + TempLine[idx - 2]); } //middle value sum += kLine[2] * iLinePtr[2][x]; //middle vert line for(dy = 0; dy < 2; dy++) { register float *iLineTop = LinePtr[dy]; register float *iLineBot = LinePtr[4 - dy]; sum += Kernel[dy][2] * (iLineTop[0] + iLineBot[0]); } RESULTLINE[x] = sum + ImageOffset; ... }//x-loop }//y-loop CASE 3 : RADIAL SYMMETRY Here's where it gets interesting and can be faster still, depending upon the shortcuts you're willing to take: (The goal is to reduce the computational complexity, so we're going to sum up values that should be scaled by the same values.) Let's look at our 5x5 kernel again: A B C B A D E F E D G H I H G D E F E D A B C B A (remember, we only needed 9 multiplies for this one) What if we can make the assumption of radial symmetry and the kernel looks like this: A B C B A B D E D B C E F E C B D E D B A B C B A Now there only need to be 6 multiplies if we add the right values together first. So, if I have an array "ImageSum[6]" that initializes to {0, 0, 0, 0, 0, 0}, I can add the image values into the array and THEN multiply those sums by a reduced form of the kernel that we compute ahead of time "ReducedKernel[6]". What I need then is either an array that tells me which element in ImageSum I need to cast the image pixels into, OR to hard code it. (For a small kernel, the array will be much slower. Also, that array can make use of the same Bilateral symmetry that the kernels did in CASE 2.) //This is where the Image values need to be cast in ImageSum: int KernelDestination[3][3] = {{0, 1, 2}, {2, 4, 5}, {3, 5, 6}}; //For that matter, it is also where the original Kernel values //need to be cast in ReducedKernel int ReducedKernel[6] = ??? //input from user? ... {//y - loop {//x - loop ... float ImageSum[6] = {0, 0, 0, 0, 0, 0}; //clear accumulators float partial; int target; for (dy = 0; dy < 2; dy++) //top & next to top lines { register float *iLineTop = LinePtr[dy]; register float *iLineBot = LinePtr[4 - dy]; for (dx = 0; dx < 2; dx++) //small enough kernel, you could write it out explicitly { register int idx = 4 - dx; //do 4 entries at a time. Note, if your image is integer, this is even faster since // integer multiply is a *lot* slower that integer add target = KernelDestination[dy][dx]; partial = (iLineTop[dx - 2] + iLineTop[idx - 2] + iLineBot[dx - 2] + iLineBot[idx - 2]); ImageSum[target] += partial; }//dx-loop }//dy - loop //middle horiz line (dy = 2, dx = 0..5) TempLine = LinePtr[2]; for (dx = 0; dx < 2; dx++) { register int idx = 4 - dx; target = KernelDestination[2][dx]; partial = (TempLine[dx - 2] + TempLine[idx - 2]); ImageSum[target] += partial; } //middle value target = KernelDestination[2][2]; //will probably always be the 0 element partial = iLinePtr[2][x]; ImageSum[target] = partial; //middle vert line for(dy = 0; dy < 2; dy++) { register float *iLineTop = LinePtr[dy]; register float *iLineBot = LinePtr[4 - dy]; target = KernelDestination[dy][2]; partial = (iLineTop[0] + iLineBot[0]); ImageSum[target] += partial; } //Apply ReducedKernel to ImageSum { double sum = 0.0; for (i = 0; i < 6; i++) sum += ReducedKernel[i] * ImageSum[i]; } RESULTLINE[x] = sum + ImageOffset; ... }//x - loop }//y - loop So, with a 5x5 kernel, we can reduce the number of multiplications to 9 using bilateral symmetry and to 6 using radial symmetry. This ratio will scale as the kernel gets larger. Remember, when using integer math, the integer multiply can take 3-5 times longer than the integer add. Also, if you can set up an async fetch for line y+3 while you're processing (y-2..y+2), you won't be bound by the memory sub-system. CONCLUSION Major optimizations using registers, factoring out expensive numerical operations, and removing IF and CASE statements from inner loops will have a dramatic impact in how fast your code is. Unfortunately, the more "optimal" your code becomes, the harder it is to read. Depending upon the memory bottlenecks and your compiler, it is not unusual to get a 4x or more speed improvement by optimizing your algorithm and code. Often another factor of 2-4x can be achieved in assembly language if you are *really* in need. -Chris Russ Reindeer Games, Inc. http://members.aol.com/ImagProcTK