# Faster Linear Convolution Using Careful Zero Padding

When computing a linear convolution between an image I and a kernel K, the standard approach is to zero-pad the inputs (also called zero filling) and apply the FFT together with the convolution theorem. The FFT is a “divide and conquer” algorithm: the input is split into parts that are then transformed with known fast routines for specific sizes. The fastest transforms are those whose length is a power of two, because the data can be split evenly at every step. There are also algorithms that handle powers of 3, 5 and 7, and combinations of these. Sizes containing larger prime factors, however, fall back on generic algorithms, which are slow.
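As a quick sketch of the convolution-theorem approach (in Python with NumPy for illustration; `fft_linear_conv` is my own name, not from this article), the inputs are zero-padded to the full linear-convolution length before transforming:

```python
import numpy as np

def fft_linear_conv(x, h):
    """Linear convolution via the convolution theorem:
    zero-pad both inputs to length len(x) + len(h) - 1,
    multiply the spectra, and invert."""
    n = len(x) + len(h) - 1
    X = np.fft.rfft(x, n)   # rfft zero-pads x to length n automatically
    H = np.fft.rfft(h, n)
    return np.fft.irfft(X * H, n)

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, -1.0])
# Sanity check against direct convolution:
print(np.allclose(fft_linear_conv(x, h), np.convolve(x, h)))  # prints True
```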

To fix this problem, I pad the data with extra zeros, beyond the padding already required by linear convolution. How do I know how much to pad? If we know nothing about the input size but still want a fast transform, we have to look at its prime factors. We could waste a lot of memory and simply pad to the next power of 2 (or 3, 5 or 7) for instance, or we could think about it and do something smarter.
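The wasteful baseline, padding to the next power of two, can be sketched like this (a Python illustration; `next_pow2` is a hypothetical helper name, not a library function):

```python
def next_pow2(n):
    """Smallest power of two greater than or equal to n."""
    m = 1
    while m < n:
        m *= 2
    return m

print(next_pow2(1002))  # -> 1024
```

For 1002 this happens to be cheap (1024), but in general it can nearly double the memory, e.g. 1025 would be padded all the way to 2048.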

Suppose the total size is 1002. Its prime factors are 2, 3 and 167, and the problem is the factor 167. So we could try padding to 1100 instead. The extra padding does not affect the result of the linear convolution; it simply becomes part of the ordinary padding.
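The factorization is easy to check with a few lines of trial division (a Python sketch; `prime_factors` is my own name):

```python
def prime_factors(n):
    """Prime factorization by trial division, smallest factors first."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # remaining n is prime
    return factors

print(prime_factors(1002))  # -> [2, 3, 167]
```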

So which padding size should we choose? 1100? The largest prime factor of 1100 is 11, which is much better than 167. But we can do even better.

I created a function I call fft_efficient_padding. Its first argument is the size we start from, n=1002 in our example; its second argument, k, is the largest prime factor allowed in the recommended size.

The function goes through each prime factor of its argument. Each factor larger than k is increased by 1, and the result is itself tested recursively until a suitable size is found. In the (naive) worst case we call the function recursively $$\log_2(n)$$ times, so it hardly affects performance.

If we use our function on the example above with k=3 we get 1152, whose largest prime factor is by definition at most k, so this is a very efficient transform size. It is also smaller than our naive approach. If we are more interested in saving memory, we can choose k=7 and get 1008, with 7 as the largest prime factor.

```matlab
function m = fft_efficient_padding(n, k)
%FFT_EFFICIENT_PADDING  Padded size >= n whose prime factors are all <= k.
tmp = factor(n);
res = [];

for i = 1:length(tmp)
    if tmp(i) > k
        % Factor too large: bump it by one and fix it recursively.
        res = [res, fft_efficient_padding(tmp(i)+1, k)];
    else % factor is small enough, keep it
        res = [res, tmp(i)];
    end
end

m = prod(res);
end
```
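For readers without MATLAB/Octave, the same recursion can be sketched in Python (the names are mine; this is a port of the idea above, not the original code):

```python
def prime_factors(n):
    """Prime factorization by trial division, smallest factors first."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def fft_efficient_padding(n, k):
    """Return a size >= n whose prime factors are all <= k,
    by bumping each too-large prime factor by one and recursing."""
    result = 1
    for p in prime_factors(n):
        if p > k:
            # Factor too large: increase it by one and fix it recursively.
            result *= fft_efficient_padding(p + 1, k)
        else:
            result *= p
    return result

print(fft_efficient_padding(1002, 3))  # -> 1152
print(fft_efficient_padding(1002, 7))  # -> 1008
```

This reproduces the numbers in the text: with k=3 the factor 167 becomes 168 = 2^3 * 3 * 7, the 7 in turn becomes 8, giving 2 * 3 * 192 = 1152; with k=7 the 168 is already acceptable, giving 2 * 3 * 168 = 1008.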