Efficient Real-Time Convolution in iamReverb
Ever wondered how iamReverb achieves zero-latency convolution with minimal CPU usage? In this article, we provide a high-level overview of iamReverb’s convolution engine.
Convolution – When Rooms Turn Into Sound
iamReverb is a convolution reverb. It uses impulse responses (IRs) captured in real physical spaces and convolves them with the dry input signal. But what does it mean to convolve two signals?
Formally, the discrete convolution $y$ of two signals $x$ and $h$, with lengths $N$ and $M$, respectively, is defined as
$$ y[n] = (x * h)[n] = \sum_{m=0}^{M-1} x[n-m]h[m] $$
and has a length of $N+M-1$.
This definition may not be intuitive, so let’s take a look at a concrete example and interpret it in the context of reverberation. We will assume that $x$ represents the input signal and $h$ represents the impulse response. Consider the following example where we reconstruct how a single output sample is computed.
We can see that the output sample $y[2]=27$ depends on past samples of the input signal and on all samples of the IR. The IR can be interpreted as a sequence of gain coefficients that determines how much each past input sample contributes to the current output sample. In the context of reverberation, the IR is essentially a fingerprint of a room or space, capturing its acoustic characteristics.
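The definition above translates directly into code. Below is a minimal direct-form sketch (our own illustrative helper, not iamReverb's actual code); the values for `x` and `h` are hypothetical, chosen so that $y[2] = 27$ as in the worked example.

```python
def direct_convolve(x, h):
    """Direct-form convolution: y[n] = sum_m x[n - m] * h[m]."""
    N, M = len(x), len(h)
    y = [0.0] * (N + M - 1)
    for n in range(len(y)):
        for m in range(M):
            if 0 <= n - m < N:            # ignore samples outside the input
                y[n] += x[n - m] * h[m]
    return y

x = [2.0, 3.0, 4.0]   # dry input samples (illustrative values)
h = [3.0, 3.0, 3.0]   # IR taps, i.e. gain coefficients (illustrative values)
print(direct_convolve(x, h))  # → [6.0, 15.0, 27.0, 21.0, 12.0]
```

The nested loops make the cost of this approach visible: every output sample touches up to $M$ input samples.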
Unfortunately, the computational complexity of convolution grows with both the input length and the IR length. If you’re familiar with Big-O notation, the operation’s time complexity is $\mathcal{O}(N \cdot M)$. This direct-form convolution is therefore impractical for processing longer signals. Luckily, there’s a more efficient way to convolve two signals: fast convolution.
Fast Convolution – Making Convolution Practical
Fast convolution refers to a family of algorithms that reduce the time complexity of the convolution operation.
FFT-based convolution
The most widely used approach is FFT-based convolution. It leverages the Fast Fourier Transform (FFT) and the convolution theorem, which states that time-domain convolution is equivalent to element-wise multiplication in the frequency domain.
FFT-based convolution performs the following steps:
- Zero-pad the input signal and the IR to a common length of at least $N + M - 1$ samples, the length of the convolution result
- Transform both signals to the frequency domain using FFT
- Perform a pointwise multiplication of both signals in the frequency domain
- Transform the result back to the time domain using inverse FFT
Applying these operations to two signals gives us the same result as with direct-form convolution.
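The four steps above map directly onto NumPy's FFT routines. The following is a sketch of the idea (ours, not iamReverb's engine):

```python
import numpy as np

def fft_convolve(x, h):
    """Linear convolution via the convolution theorem."""
    n_fft = len(x) + len(h) - 1          # length of the linear convolution
    X = np.fft.rfft(x, n_fft)            # zero-pad and transform the input
    H = np.fft.rfft(h, n_fft)            # zero-pad and transform the IR
    Y = X * H                            # pointwise multiplication
    return np.fft.irfft(Y, n_fft)        # back to the time domain

x = np.random.randn(1024)
h = np.random.randn(4096)
assert np.allclose(fft_convolve(x, h), np.convolve(x, h))
```

Real implementations typically round `n_fft` up to a power of two for FFT efficiency; it is left at the exact output length here for clarity.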
For longer signals, FFT-based convolution is much more efficient than direct-form convolution. The advantage comes from the fact that all four steps are cheap compared to direct-form convolution: both signals can be transformed between the time and frequency domains efficiently using the FFT and inverse FFT, and the pointwise multiplication in the frequency domain is comparatively cheap. Assuming that both signals are zero-padded to length $N$, the overall time complexity is dominated by the (inverse) FFT and is therefore $\mathcal{O}(N \log N)$. A clear derivation of this result can be found here.
Equipped with FFT-based convolution, we have an efficient method to compute the convolution of long signals. However, there is one unaddressed challenge: so far we have assumed that the input signal and the IR are given in full length before convolving. In real-time audio applications, the input signal is typically processed in blocks of samples and is unbounded in length. This raises the question: how can FFT-based convolution be applied to a segmented, potentially infinite input signal?
Overlap-Add Method
One way to process segmented input signals is the overlap-add method. It is based on the steps from the FFT-based convolution described above. However, the overlap-add method applies these steps to each input block independently and then sums the overlapping portions of the resulting blocks – hence the name. These blocks overlap as a result of the convolution output length, effectively giving each block a reverb tail which is then summed with the next block.
The illustration below shows the segmented input signal, the individual overlapping blocks, and the aggregated output signal. Notice that by summing the overlaps with their subsequent blocks we obtain the output signal.
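The overlap-add method can be sketched in a few lines (a simplified illustration with hypothetical parameters, not iamReverb's implementation):

```python
import numpy as np

def overlap_add_convolve(x, h, block_size=256):
    """Convolve a segmented input with h using the overlap-add method."""
    n_fft = block_size + len(h) - 1      # output length of one block's convolution
    H = np.fft.rfft(h, n_fft)            # IR spectrum, computed once up front
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        # FFT-convolve the block with the full IR; the result is longer than
        # the block itself -- the extra part is the block's "reverb tail"
        yb = np.fft.irfft(np.fft.rfft(block, n_fft) * H, n_fft)
        end = min(start + n_fft, len(y))
        y[start:end] += yb[:end - start]  # sum the tail into later blocks
    return y

x = np.random.randn(10_000)              # stand-in for a long input stream
h = np.random.randn(2_000)               # stand-in for an impulse response
assert np.allclose(overlap_add_convolve(x, h), np.convolve(x, h))
```

Note that the IR spectrum `H` only needs to be computed once and can be reused for every input block.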
We now know how to handle segmented input signals. However, a key challenge remains when working with long impulse responses, as commonly used in reverberation and room simulation. Recall that in the first step of FFT-based convolution, both signals are zero-padded to the same length. If the IR is much longer than the input blocks, most of each padded input block consists of zeros. As a result, a large portion of the FFT convolution performs redundant computations, leading to poor overall efficiency.
The figure below depicts such an example. Notice that the output blocks are disproportionately long compared to the input block size.
This inefficiency could be reduced by increasing the input block size and waiting until a larger buffer of input samples has been accumulated before performing the convolution. However, this approach introduces additional input–output delay: no output samples can be produced until the input buffer is filled and the FFT convolution with that buffer has been executed. Such buffering-induced latency is unacceptable for a zero-latency convolution reverb. Fortunately, there exists a method that preserves zero latency while avoiding the aforementioned redundancies for long impulse responses: partitioned convolution.
By a "zero-latency" algorithm, we mean one that introduces no additional algorithmic latency beyond the inherent audio-buffer latency.
Partitioned Convolution – Efficient Real-Time Convolution With Long IRs
The core idea of partitioned convolution is to split the impulse response into smaller blocks (partitions) and to perform block-wise FFT-based convolution of each IR partition with the corresponding input signal segments. Dividing one large FFT convolution into smaller ones lets us match the block sizes of the input signal and the impulse response partitions, thereby achieving higher overall efficiency with FFT-based convolution.
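A uniformly partitioned variant can be sketched as follows. This is our own simplified illustration: by linearity, convolving with the full IR equals the sum of convolutions with its partitions, each delayed by the partition's offset. A real engine would convolve each partition block-by-block in the frequency domain with cached spectra; here `np.convolve` stands in for those small convolutions.

```python
import numpy as np

def partitioned_convolve(x, h, part_size=256):
    """Split h into partitions and sum their delayed convolutions with x."""
    y = np.zeros(len(x) + len(h) - 1)
    for offset in range(0, len(h), part_size):
        part = h[offset:offset + part_size]
        # each partition is a short IR; in practice this would be a small
        # FFT-based convolution matched to the input block size
        yp = np.convolve(x, part)
        y[offset:offset + len(yp)] += yp   # delay by the partition's offset
    return y

x = np.random.randn(4_096)
h = np.random.randn(48_000)                # e.g. a 1-second IR at 48 kHz
assert np.allclose(partitioned_convolve(x, h, part_size=1024), np.convolve(x, h))
```

Because each partition is short, every small convolution can use an FFT size close to the input block size, avoiding the redundant zero-padding described earlier.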
For real-time applications with minimal latency and high efficiency requirements, non-uniform partitioned convolution is commonly used. This method splits the IR into short segments at the beginning and progressively longer segments toward the end. This provides low latency for the early part of the impulse response while allowing the later segments to be processed more efficiently. There are different ways to partition the IR, each with different computational properties such as cost and load distribution. A discussion of these partitioning strategies is beyond the scope of this article – interested readers can find more details here.
Further Optimizations
iamReverb’s convolution engine builds on non-uniform partitioned convolution using the overlap-add method. These methods already achieve great performance with zero-latency. We have pushed these algorithms even further by extensively using the SIMD capabilities of modern CPUs. SIMD (Single Instruction, Multiple Data) enables the processing of multiple data elements within a single CPU instruction. This parallelism significantly accelerates individual steps of FFT-based convolution, including the forward and inverse FFT operations as well as the frequency-domain multiplication.
Another key factor that makes iamReverb feel lightweight in your DAW is our carefully designed multithreading architecture. It reduces the workload on the DAW's audio thread by offloading a large part of the processing to background threads. Only a small partition of the impulse response is convolved on the audio thread, while convolution with the IR tail runs entirely in the background, leaving more CPU power for other plugins.
Putting It All Together
iamReverb achieves zero-latency convolution with exceptional CPU efficiency. At its core, iamReverb employs non-uniform partitioned convolution with the overlap-add method, enabling efficient processing of long impulse responses without introducing delay. Advanced SIMD optimizations and smart multithreading guarantee that iamReverb works seamlessly, even in complex projects.