Why convolutions are everywhere

Modern deep learning architectures often include convolutions. Convolutions can be implemented very efficiently, but the reason they are so widespread is much more profound. We are going to show that convolutions arise naturally from simple assumptions on data.

Signals

There are many kinds of data: images, sounds, text, and time series are well-known examples. All of them are discrete. In some cases, this property is intrinsic: for example, written text is naturally a sequence of characters and words. In most other cases, data are the result of a measurement, and analog signals are digitized during the acquisition process.

The devices that perform the measurements are called sensors. Sensors produce discrete signals by sampling or aggregating information from the analog world around us. For example, a sound recorder samples audio waves to register a discretized version of them, while a camera collects photons at every pixel, aggregating the light coming from all directions over a short interval of time.

From a mathematical point of view, a signal is any function defined on a discrete set. This definition is very general, and it includes all the examples above.

DEFINITION A discrete signal is a function whose domain is $\mathbb{Z}$, $f: \mathbb{Z} \longrightarrow \mathbb{R}$.
We denote the $n$-th value of the function as $f[n]$. This definition is essentially that of a real-valued sequence, and we will exploit this parallel to derive some theorems and properties of signals. In particular, we will use the $\ell^{p}(\mathbb{R})$ spaces.

A very particular signal is the discrete Dirac delta.

DEFINITION The discrete Dirac delta is the signal $\delta[n] = \begin{cases} 1 & \text{if } n = 0 \\ 0 & \text{if } n \neq 0 \end{cases}$.
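
To make these definitions concrete, here is a minimal sketch in Python: a signal is just a function from the integers to the reals, and the Dirac delta takes one line (the example signal `f` is purely illustrative).

```python
def delta(n: int) -> float:
    """Discrete Dirac delta: 1 at n = 0, 0 elsewhere."""
    return 1.0 if n == 0 else 0.0

def f(n: int) -> float:
    """An example signal with finite support {0, 1, 2}."""
    return {0: 2.0, 1: -1.0, 2: 0.5}.get(n, 0.0)

print([delta(n) for n in range(-2, 3)])  # [0.0, 0.0, 1.0, 0.0, 0.0]
print([f(n) for n in range(-1, 4)])      # [0.0, 2.0, -1.0, 0.5, 0.0]
```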

Discrete convolution

The convolution operation is a well-known building block of deep neural networks. We are going to prove some properties of the discrete convolution. We focus our attention on the one-dimensional convolution, but all theorems generalize to multiple dimensions.

DEFINITION Let $f$, $g$ be two real signals, the discrete convolution of $f$ and $g$ is $$ (f * g)[n] = \sum_{k=-\infty}^{+\infty}f[k]g[n-k] \quad \forall\: n \in \mathbb{Z}. $$

The meaning of this formula is not immediately obvious. It is easier to understand through $\overline{g}[n] = g[-n]$, the flipped version of $g$. When $g$ has finite support in $\{0, ..., m-1\}$, we can visualize the convolution as the scalar product of $\overline{g}$ with every window of $m$ elements of the signal $f$. The support of a function $f: D \longrightarrow \mathbb{R}$ is the subset of the domain $D$ containing those elements which are not mapped to zero:
$\text{supp}(f) = \{x \in D \mid f(x) \neq 0\}$.
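
The sliding-window picture translates directly into code. Below is a naive sketch of the discrete convolution for two finite-support signals stored as NumPy arrays (both supports are assumed to start at index 0, and the `conv` helper is just illustrative); the result matches NumPy's built-in `np.convolve`.

```python
import numpy as np

def conv(f: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Naive discrete convolution; f and g are zero outside the given arrays."""
    n_out = len(f) + len(g) - 1
    out = np.zeros(n_out)
    for n in range(n_out):
        for k in range(len(f)):
            if 0 <= n - k < len(g):
                out[n] += f[k] * g[n - k]
    return out

f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, -1.0])          # finite support {0, 1}
print(conv(f, g))                  # [ 1.  1.  1.  1. -4.]
print(np.convolve(f, g))           # same values
```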

In general, the summation in the convolution definition does not always converge. Recalling Hölder's inequality, we can find a sufficient condition that guarantees the convolution is well-defined for every element.

THEOREM Let $f \in \ell^{p}(\mathbb{R})$ and $g \in \ell^{q}(\mathbb{R})$ be signals with $p, q \in [1, \infty]$ such that $$ \frac{1}{p} + \frac{1}{q} = 1 \quad \text{or} \quad \{p, q\} = \{1, \infty\}, $$ then $f * g$ is bounded. For $p \in [1, \infty)$, $\ell^{p}(\mathbb{R})$ is the space of $p$-summable sequences, i.e. $f \in \ell^{p}(\mathbb{R})$ if and only if $||f||_{p} = \left(\sum_{k=-\infty}^{+\infty} |f[k]|^p\right)^{\frac{1}{p}} < +\infty$.
$\ell^{\infty}(\mathbb{R})$ is the space of bounded sequences.

Proof

$f * g$ is a bounded signal if $\exists \:M \in \mathbb{R}$ such that $$ |(f*g)[n]| \leq M \quad \forall\:n \in \mathbb{Z}. $$ Using Hölder's inequality, which states that $||fg||_{1} \leq ||f||_{p}||g||_{q}$ when $p$ and $q$ satisfy the condition of the theorem (the special case $p=q=2$ gives the Cauchy–Schwarz inequality), and a simple change of variable, we prove the theorem: $$ \begin{aligned} |(f*g)[n]| & \leq \sum_{k=-\infty}^{+\infty}|f[k]g[n-k]| \\ & \leq \left[\sum_{k=-\infty}^{+\infty}|f[k]|^{p}\right]^{\frac{1}{p}} \left[\sum_{k=-\infty}^{+\infty}|g[n-k]|^{q}\right]^{\frac{1}{q}} & \textit{\footnotesize Hölder's ineq.} \\ & = \left[\sum_{k=-\infty}^{+\infty}|f[k]|^{p}\right]^{\frac{1}{p}} \left[\sum_{k'=-\infty}^{+\infty}|g[k']|^{q}\right]^{\frac{1}{q}} & {\footnotesize k' = n - k } \\ & = ||f||_{p}||g||_{q}. & \square \end{aligned} $$

Two remarkable properties of the convolution are commutativity and linearity. Linearity, together with the shift-invariance discussed below, is what explains why convolutions are a fundamental component of many processing systems.

THEOREM Let $f, g$ be two real signals such that their convolution is well-defined, then $$ (f * g)[n] = (g * f)[n] \quad \forall\: n \in \mathbb{Z}. $$ That means that the convolution is a commutative operation.

Proof

For any fixed $n \in \mathbb{Z}$, let $j = n-k$, then $k = n-j$. $$ \begin{aligned} (f * g)[n] & = \sum_{k=-\infty}^{+\infty}f[k]g[n-k] \\ & = \sum_{j=-\infty}^{+\infty}g[j]f[n-j] = (g * f)[n]. & \hspace*{6em} \square \end{aligned} $$

THEOREM Let $f, g, h$ be real signals and $a, b \in \mathbb{R}$. If the convolutions below are all well-defined, then we have $$ ((af + bg) * h)[n] = a(f * h)[n] + b(g * h)[n] \quad \forall\: n \in \mathbb{Z}. $$ Thus the convolution is a linear operation.
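
Both properties are easy to check numerically on random finite-support signals; the snippet below is a quick sanity check using NumPy (the padding of `g` only serves to give `f` and `g` the same support so that they can be summed element-wise).

```python
import numpy as np

rng = np.random.default_rng(0)
f, g, h = rng.normal(size=5), rng.normal(size=3), rng.normal(size=3)
a, b = 2.0, -0.5

# Commutativity: f * g == g * f
assert np.allclose(np.convolve(f, g), np.convolve(g, f))

# Linearity: (a f + b g) * h == a (f * h) + b (g * h)
g_padded = np.pad(g, (0, len(f) - len(g)))   # extend g with zeros to match f
lhs = np.convolve(a * f + b * g_padded, h)
rhs = a * np.convolve(f, h) + b * np.convolve(g_padded, h)
assert np.allclose(lhs, rhs)
print("commutativity and linearity verified")
```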

Operators

To extract information from a stream of data, we have to process it. Every operation applied to a signal is associated with an operator. For example, we can use operators to reduce noise, to recognize features, to detect peaks or discontinuities.

DEFINITION A discrete operator $L$ is linear if, for all signals $f, g$ and $\forall\: a, b \in \mathbb{R}$, $$L(a \cdot f + b \cdot g)[n] = a \cdot L(f)[n] + b \cdot L(g)[n] \quad \forall\: n \in \mathbb{Z}.$$

DEFINITION A discrete operator $L$ is shift-invariant if, for every signal $f$ and $\forall\: p \in \mathbb{Z}$, $$L(f[n-p]) = L(f)[n-p],$$ where $f[n-p]$ denotes the signal $f$ shifted by $p$.

When shift-invariance holds, a shift of the signal causes a corresponding displacement of the result of the operator. For example, element-wise operations are trivially shift-invariant. The consequence is that the result of a shift-invariant operator does not depend on the absolute position of elements in a signal: only the relative position of values matters.
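
A small experiment makes the distinction concrete. In the sketch below (purely illustrative; `np.roll` performs a circular shift, which behaves like a translation as long as we stay away from the borders), an element-wise operator commutes with the shift, while an operator that uses the absolute index does not.

```python
import numpy as np

f = np.array([0.0, 1.0, 4.0, 9.0, 16.0])
p = 2                               # shift amount
shift = lambda x: np.roll(x, p)     # circular shift, standing in for translation

# Element-wise operator: squaring each value. It is shift-invariant.
square = lambda x: x ** 2
print(np.allclose(square(shift(f)), shift(square(f))))       # True

# An operator that depends on the absolute position n: multiply by the index.
# It is linear but NOT shift-invariant.
by_index = lambda x: np.arange(len(x)) * x
print(np.allclose(by_index(shift(f)), shift(by_index(f))))   # False
```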

This new perspective sheds some light on why this property is of paramount importance. Most processing operations should not depend on the absolute position of the values in the data because, in most situations, this position is arbitrary. For example, when we analyze an image, we do not want to rely on the absolute location of pixels because those positions change as soon as the image is cropped or resized. The same applies to sound waves or time series.

Operators and convolution

There is a tight relationship between linear shift-invariant operators and convolution. To reveal this connection, we need to recall the discrete Dirac delta. Translating it by $p$, we obtain $$ \delta[n - p] = \begin{cases} 1 & n = p \\ 0 & n \neq p \end{cases}. $$ Using this expression, we can represent every signal as a linear combination of shifted Dirac deltas $$ f[n] = \sum_{p = -\infty}^{+\infty} f[p] \cdot \delta[n - p]. $$ Now we have all the elements necessary to prove the main theorem of the article.
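
As a quick numerical aside before the theorem, the decomposition into shifted deltas can be verified on a finite-support signal (a minimal sketch where everything outside the window is zero; the helper `shifted_delta` is only for illustration).

```python
import numpy as np

f = np.array([3.0, -1.0, 2.0, 5.0])      # f[p] for p = 0, ..., 3

def shifted_delta(p: int, length: int) -> np.ndarray:
    """delta[n - p] restricted to the window n = 0, ..., length - 1."""
    d = np.zeros(length)
    d[p] = 1.0
    return d

# f[n] = sum_p f[p] * delta[n - p]
reconstruction = sum(f[p] * shifted_delta(p, len(f)) for p in range(len(f)))
print(np.allclose(reconstruction, f))    # True
```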

THEOREM A discrete operator $L$ is linear and shift-invariant if and only if there exists a discrete signal $h[n]$ such that for every signal $f[n]$ $$ L(f)[n] = (f * h)[n]. $$

Proof

If $L$ is linear and shift-invariant, we have to prove the existence of $h[n]$. Setting $h[n] = L(\delta)[n]$, we have $$ \begin{aligned} L(f)[n] & = L\left(\sum_{p = -\infty}^{+\infty} f[p] \cdot \delta[n - p]\right) & \\ & = \sum_{p = -\infty}^{+\infty} f[p] \cdot L(\delta[n - p]) & \textit{\footnotesize linearity} \\ & = \sum_{p = -\infty}^{+\infty} f[p] \cdot L(\delta)[n - p] & \textit{\footnotesize shift-invariance}\\ & = \sum_{p = -\infty}^{+\infty} f[p] \cdot h[n - p] = (f * h)[n]. & \end{aligned} $$ To prove the converse implication, we have to show that for every signal $h[n]$, $L(f) = f * h$ is a linear shift-invariant operator. The linearity comes from the linearity of the convolution. Now, we show that the resulting operation is also shift-invariant. $$ \begin{aligned} L(f[n-p]) & = \sum_{k = -\infty}^{+\infty} f[k - p] \cdot h[n - k] & \\ & = \sum_{k' = -\infty}^{+\infty} f[k'] \cdot h[(n - p) - k'] & \qquad {\footnotesize k' = k - p} \\ & = (f * h)[n - p] = L(f)[n - p]. & \square \end{aligned} $$
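
The constructive direction of the proof translates directly into code: apply the operator to the delta to obtain the impulse response $h$, then convolve. In the sketch below we use the backward difference as an example of a linear shift-invariant operator (arrays are treated as zero outside their support, and the names `L` and `h` mirror the ones in the proof).

```python
import numpy as np

def L(f: np.ndarray) -> np.ndarray:
    """Backward difference L(f)[n] = f[n] - f[n-1]: linear and shift-invariant."""
    return np.diff(f, prepend=0.0)

delta = np.array([1.0, 0.0, 0.0])
h = L(delta)                                     # impulse response: [ 1., -1.,  0.]

f = np.array([2.0, 5.0, 3.0, 7.0, 1.0])
direct = L(f)                                    # apply the operator directly
via_conv = np.convolve(f, h)[:len(f)]            # convolve with h (same support as f)
print(np.allclose(direct, via_conv))             # True
```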

This theorem shows why convolutions appear in so many applications: they are the only way to express linear shift-invariant operations. We have already examined why shift-invariance is frequently an essential property. While theoretical considerations imply the importance of shift-invariance, linear operators are prevalent also for practical reasons. Most of the fundamental operators, like the ones associated with integration and differentiation, are linear. Furthermore, linear operators are simple to express and very efficient to compute on machines.

Sometimes even linearity is desired because of the characteristics of the data. Sound waves are such a type of data because they satisfy the superposition principle: the result of the superposition of two waves is equivalent to the sum of the individual waves. In such a case, it is natural to use linear operators because linearity ensures that the superposition principle remains valid for the processed waves too.

Generalizations

In mathematics, a convolution is an operation defined between functions of a real variable. This continuous form is a generalization of the discrete convolution presented above.

DEFINITION Let $f$, $g$ be two real functions, the convolution of $f$ and $g$ is $$ (f * g)(x) = \int_{-\infty}^{+\infty}f(t)g(x-t)\, dt \quad \forall\: x \in \mathbb{R}. $$

All the theorems we have proved hold for the continuous case too, replacing the summation symbol with the integral sign and signals with functions or tempered distributions. Tempered distributions are a generalization of functions. The Dirac delta is characterized by the property $\int_{-\infty}^{+\infty} f(x) \delta(x) dx = f(0)$. No function can satisfy this property, but a tempered distribution can.
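
Numerically, the continuous convolution can be approximated by a Riemann sum on a fine grid. As a sanity check, the sketch below (grid and test point chosen arbitrarily) relies on the well-known fact that the convolution of two Gaussian densities is a Gaussian whose variance is the sum of the variances.

```python
import numpy as np

def gaussian(x, sigma):
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

t = np.linspace(-20, 20, 40001)      # integration grid
dt = t[1] - t[0]

def conv_at(x, f, g):
    """Riemann-sum approximation of (f * g)(x)."""
    return np.sum(f(t) * g(x - t)) * dt

x = 0.5
approx = conv_at(x, lambda u: gaussian(u, 1.0), lambda u: gaussian(u, 1.0))
exact = gaussian(x, np.sqrt(2.0))    # convolution of two unit Gaussians
print(approx, exact)                 # both ≈ 0.265
```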

Data and signals can have many dimensions: a sound wave is one-dimensional; a grayscale image has two dimensions; a color image has several channels that compose a third dimension. We can define the discrete convolution for signals of arbitrary dimension and prove all the results following the same steps. This generalization extends the main theorem to all kinds of data, in particular to images, where convolutions have gained their fame. We report here the definition of convolution for the general case.

DEFINITION Let $f$, $g: \mathbb{Z}^{d} \longrightarrow \mathbb{R}$ be two $d$-dimensional signals, the convolution of $f$ and $g$ is $$ (f * g)[n] = \sum_{k \in \mathbb{Z}^{d}}f[k]g[n-k] \quad \forall\: n \in \mathbb{Z}^{d}. $$

Convolutional Neural Networks

Convolution has become a fundamental operation in the context of deep learning, in particular in the field of computer vision. A Convolutional Neural Network (CNN) is essentially a sequence of 3D convolutions and pooling operations. Pooling operations are non-linear, and this allows the network to learn non-linear relationships between input and output. The most common ones are max-pooling and average-pooling.
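
As an illustration, here is a minimal CNN sketched with PyTorch building blocks; the channel counts, the 32×32 input size, and the 10 output classes are arbitrary choices made for the example, not a reference architecture.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3 input channels (RGB), 16 filters
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                  # stride defaults to the kernel size (2)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # assumes 32x32 inputs, 10 classes
)

x = torch.randn(1, 3, 32, 32)                     # one fake RGB image
print(model(x).shape)                             # torch.Size([1, 10])
```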

Images are a canonical example of a signal that calls for shift-invariant operators, so it is interesting to ask whether CNNs are shift-invariant.

Both convolutions and pooling operations perform the same computation on all equally-sized small portions of the input signal. Such calculations are independent of the absolute location of the values and are thus shift-invariant, under one condition: the stride of the sliding window, i.e. the difference between two consecutive positions of the window, has to be 1. A convolution with stride greater than 1 is equivalent to a classic convolution followed by a sub-sampling. The sub-sampling keeps only a specific subset of the processed signal, and this selection depends on the absolute location of values. For example, if the stride is 2, only the values at even (or odd) positions are kept.
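
The effect of the stride is easy to observe numerically. In the sketch below (an illustrative one-dimensional example), a stride-2 convolution is emulated by a full convolution followed by taking every second sample: shifting the input by one position shifts the stride-1 output accordingly, but not the stride-2 one.

```python
import numpy as np

f = np.array([0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
h = np.array([1.0, 1.0])                         # simple kernel with support {0, 1}

conv = lambda x: np.convolve(x, h)[:len(x)]      # stride-1 convolution
conv_stride2 = lambda x: conv(x)[::2]            # convolution + sub-sampling = stride 2

f_shifted = np.roll(f, 1)                        # input shifted by one position

print(conv(f))                  # [0. 1. 1. 0. 2. 2. 0. 0.]
print(conv(f_shifted))          # [0. 0. 1. 1. 0. 2. 2. 0.]  <- shifted copy
print(conv_stride2(f))          # [0. 1. 2. 0.]
print(conv_stride2(f_shifted))  # [0. 1. 0. 2.]  <- not a shifted copy
```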

Currently, most successful CNNs rely on pooling operations with stride 2. Therefore they are not shift-invariant operators! Richard Zhang noticed this flaw and proposed alternative pooling operations as a replacement in the paper "Making Convolutional Networks Shift-Invariant Again".

Conclusions

The convolution operation in machine learning does not come out of the blue. It emerges naturally from simple principles and assumptions on data: linearity and shift-invariance. Machine learning often seems to progress only through empirical experiments and chance, but many foundational components have a solid theory behind them. We have clarified the reason behind the extensive use of convolutions in many applications.