From: Chris Russ
Subject: Re: Fast convolution algorithm.
Date: 30 Aug 1999 00:00:00 GMT
Message-ID: <37CA8D35.B29E1FA3@aol.com>
Content-Transfer-Encoding: 7bit
References: <37C64DEE.24A46550@torsana.com> <7qckk4$7oh$1@newsfeed.unitel.co.kr>
X-Accept-Language: en
X-Notice: should be reported to postmaster@ibm.net
Content-Type: text/plain; charset=us-ascii; x-mac-type="54455854"; x-mac-creator="4D4F5353"
X-Complaints-To: postmaster@ibm.net
X-Trace: 30 Aug 1999 13:58:48 GMT, 32.100.213.153
Organization: Reindeer Games, Inc.
MIME-Version: 1.0
Reply-To: jcr6@aol.com
Newsgroups: sci.image.processing,comp.graphics.algorithms
I got a few emails over the weekend:
> So how *do* you code a fast spatial convolution?
I guess I deserve it for taunting people. Here we go.
NOTE:
While my examples all use FP math, you will find that INT multiplies
are a *lot* slower than INT adds, and that by reducing the multiplies in
your code, you shift the throughput bottleneck to the memory system.
There are video game tricks that are suitable for speeding up those
accesses (like tickling an address before you need it so that it will
be cached when its turn comes), but that is a subject for another discussion
*and* becomes highly processor dependent. I also neglect the
overhead of converting from INT to FP and back. My goal is to show
algorithms, not quality code.
Nor is there a warranty on the code of any kind. ;-)
USING SYMMETRY TO REDUCE COMPUTATION
Let's talk about symmetry. This discussion assumes that the convolution
kernel is radially symmetric, so it applies to Gaussians, Laplacians,
D.o.G.'s, cross-correlation templates (if they're symmetric), etc. If your
kernel is not fully symmetric, you may still be able to use some portion of
this symmetry to get speed.
H & V Bilateral Symmetry (works with kernels in the form):
A B C B A
D E F E D
G H I H G (remember "I" is the center element)
D E F E D
A B C B A
This is the 5x5 case of a bilaterally symmetric kernel. There are 25
entries in the kernel, but only 9 unique ones. How can
we make use of this?
Normally, I would convolve this with a 2-D array Kernel[5][5] in a doubly
nested FOR-loop. There will be 25 accesses to the various lines in the
image (5 on each line, 5 lines) and 25 accesses to the Kernel array. It
turns out that I can reduce a sizeable chunk of that to 25 accesses to
the image (I still have to sample all of the points in the image) but only
9 accesses to the kernel array (or conceivably to registers if I explicitly
write out the code).
How?
CASE 1 : ORIGINAL CODE - ARBITRARY CONVOLUTION
The original code would look something like this: (BTW, I'm doing everything
in floats - I assume that the image is scaled 0..1 and that the original
kernel is already normalized... I'm also leaving the end conditions as
an exercise to the reader)
register float *LinePtr[5];
register float *Kernel[5];
register float *TempLine;
register float ImageOffset; //for many kernels (Gaussian, Sharpen, etc.)
                            //this is 0.0; for a Laplacian it should be 0.5
//... skip over the preloading / image mirroring code for the top of the image
//... assume that LinePtr[0] through LinePtr[4] point at the current 5 lines
//    of the image. I do this by allocating storage and rotating pointers
//    to the storage at the end of the y-loop
for (y = 0; y < height; y++)
{
    //LinePtr[4] = GETLINE(y + 2);
    //inner loop - convolution occurs here
    for (x = 0; x < width; x++)
    {
        register double sum = 0.0; //for *big* kernels, need a double to
                                   //accumulate the result
        for (dy = 0; dy < 5; dy++)
        {
            float *kLine = Kernel[dy];
            float *iLine = LinePtr[dy] + x;
            for (dx = 0; dx < 5; dx++)
                sum += (double)(kLine[dx] * iLine[dx - 2]); // <--- convolution inner product
        }//dy-loop
        RESULTLINE[x] = sum + ImageOffset;
    }//x-loop
    //rotate pointers
    TempLine = LinePtr[0];
    LinePtr[0] = LinePtr[1];
    LinePtr[1] = LinePtr[2];
    LinePtr[2] = LinePtr[3];
    LinePtr[3] = LinePtr[4];
    LinePtr[4] = TempLine;
    //PUTRESULTLINE(y) - write result line to the image
}//y-loop
CASE 2 : BILATERAL SYMMETRY (only need a 3x3 for the kernel)
{//y - loop
    {//x - loop
        ...
        register double sum = 0.0; //doubles take a *lot* longer on PCs;
                                   //for small kernels, use floats instead
        register float *kLine;
        for (dy = 0; dy < 2; dy++) //top & next-to-top lines
        {
            register float *iLineTop = LinePtr[dy] + x;
            register float *iLineBot = LinePtr[4 - dy] + x;
            kLine = Kernel[dy];
            for (dx = 0; dx < 2; dx++) //small enough kernel, you could
                                       //write it out explicitly
            {
                register int idx = 4 - dx;
                //do 4 entries at a time. Note, if your image is integer,
                //this is even faster, since integer multiply is a *lot*
                //slower than integer add
                sum += kLine[dx] * (iLineTop[dx - 2] + iLineTop[idx - 2]
                                  + iLineBot[dx - 2] + iLineBot[idx - 2]);
            }//dx-loop
        }//dy-loop
        //middle horizontal line (dy = 2, dx = 0..1 plus center)
        TempLine = LinePtr[2] + x;
        kLine = Kernel[2];
        for (dx = 0; dx < 2; dx++)
        {
            register int idx = 4 - dx;
            sum += kLine[dx] * (TempLine[dx - 2] + TempLine[idx - 2]);
        }
        //middle value
        sum += kLine[2] * LinePtr[2][x];
        //middle vertical line
        for (dy = 0; dy < 2; dy++)
        {
            register float *iLineTop = LinePtr[dy] + x;
            register float *iLineBot = LinePtr[4 - dy] + x;
            sum += Kernel[dy][2] * (iLineTop[0] + iLineBot[0]);
        }
        RESULTLINE[x] = sum + ImageOffset;
        ...
    }//x-loop
}//y-loop
CASE 3 : RADIAL SYMMETRY
Here's where it gets interesting and can be faster still, depending upon the
shortcuts you're willing to take:
(The goal is to reduce the computational complexity, so we're going to sum up
image values that get scaled by the same kernel coefficient before multiplying.)
Let's look at our 5x5 kernel again:
A B C B A
D E F E D
G H I H G
D E F E D
A B C B A (remember, we only needed 9 multiplies for this one)
What if we can make the assumption of radial symmetry and the kernel
looks like this:
A B C B A
B D E D B
C E F E C
B D E D B
A B C B A
Now there only need to be 6 multiplies if we add the right values together
first.
So, if I have an array "ImageSum[6]" that initializes to {0, 0, 0, 0, 0, 0},
I can add the image values into the array and THEN multiply those sums
by a reduced form of the kernel that we compute ahead of time
"ReducedKernel[6]".
What I need then is either an array that tells me which element of ImageSum
each image pixel should be accumulated into, OR to hard-code it. (For a
small kernel, the lookup array will be much slower. Also, that array can
make use of the same bilateral symmetry that the kernels did in CASE 2.)
//This is where the image values need to be accumulated in ImageSum
//(A..F numbered 0..5, top-left quadrant of the kernel):
int KernelDestination[3][3] = {{0, 1, 2}, {1, 3, 4}, {2, 4, 5}};
//For that matter, it is also where the original Kernel values
//need to be gathered into ReducedKernel
float ReducedKernel[6] = ??? //computed ahead of time, or input from user?
...
{//y - loop
    {//x - loop
        ...
        float ImageSum[6] = {0, 0, 0, 0, 0, 0}; //clear accumulators
        float partial;
        int target;
        double sum;
        for (dy = 0; dy < 2; dy++) //top & next-to-top lines
        {
            register float *iLineTop = LinePtr[dy] + x;
            register float *iLineBot = LinePtr[4 - dy] + x;
            for (dx = 0; dx < 2; dx++) //small enough kernel, you could
                                       //write it out explicitly
            {
                register int idx = 4 - dx;
                //do 4 entries at a time. Note, if your image is integer,
                //this is even faster, since integer multiply is a *lot*
                //slower than integer add
                target = KernelDestination[dy][dx];
                partial = (iLineTop[dx - 2] + iLineTop[idx - 2]
                         + iLineBot[dx - 2] + iLineBot[idx - 2]);
                ImageSum[target] += partial;
            }//dx-loop
        }//dy-loop
        //middle horizontal line (dy = 2, dx = 0..1 plus center)
        TempLine = LinePtr[2] + x;
        for (dx = 0; dx < 2; dx++)
        {
            register int idx = 4 - dx;
            target = KernelDestination[2][dx];
            partial = (TempLine[dx - 2] + TempLine[idx - 2]);
            ImageSum[target] += partial;
        }
        //middle value (the F element - it appears only once)
        target = KernelDestination[2][2];
        partial = LinePtr[2][x];
        ImageSum[target] += partial;
        //middle vertical line
        for (dy = 0; dy < 2; dy++)
        {
            register float *iLineTop = LinePtr[dy] + x;
            register float *iLineBot = LinePtr[4 - dy] + x;
            target = KernelDestination[dy][2];
            partial = (iLineTop[0] + iLineBot[0]);
            ImageSum[target] += partial;
        }
        //Apply ReducedKernel to ImageSum
        sum = 0.0;
        for (i = 0; i < 6; i++)
            sum += ReducedKernel[i] * ImageSum[i];
        RESULTLINE[x] = sum + ImageOffset;
        ...
    }//x - loop
}//y - loop
So, with a 5x5 kernel, we can reduce the number of multiplications from 25
to 9 using bilateral symmetry, and to 6 using radial symmetry. The savings
grow as the kernel gets larger. Remember, when using integer math, the
integer multiply can take 3-5 times longer than the integer add.
Also, if you can set up an async fetch for line y+3 while you're processing
lines y-2..y+2, you won't be bound by the memory sub-system.
CONCLUSION
Major optimizations, such as using registers, factoring out expensive
numerical operations, and removing IF and CASE statements from inner loops,
will have a dramatic impact on how fast your code is. Unfortunately, the
more "optimal" your code becomes, the harder it is to read.
Depending upon the memory bottlenecks and your compiler, it is not unusual to
get a 4x or more speed improvement by optimizing your algorithm and code.
Often another factor of 2-4x can be achieved in assembly language if you are
*really* in need.
-Chris Russ
Reindeer Games, Inc.
http://members.aol.com/ImagProcTK