Artificial neurons as ridge functions
Artificial neurons are a particular case of ridge functions, so linear combinations of ridge functions generalize multilayer perceptrons with one hidden layer. The approximation theory of ridge functions therefore provides an upper bound on the approximation capabilities of such neural networks.
Approximation theory
Approximation theory is the branch of mathematics that studies how to approximate a large class of functions by combining simpler ones. It looks for sets of functions described by a finite or countable number of parameters that can faithfully reproduce function classes with an uncountable number of dimensions. Continuous mappings are the most studied example of such a large family.
This topic has great practical interest for solving optimization problems over functions without making strong hypotheses on the optimal solution. Given a functional $J$, an optimization problem takes the form $$f^\ast = \operatorname*{arg\,min}_{f \in \mathcal{F}} J(f),$$ where $\mathcal{F}$ is the search space of functions determined by the assumptions on $f^\ast$. For example, if we know that the solution is a continuous univariate function, then $\mathcal{F} = C(\mathbb{R})$.
In practice, we cannot solve the optimization problem on a function space with an infinite number of dimensions, because machines can work only with a finite number of parameters. So we need to restrict $\mathcal{F}$ to a more manageable subset $\tilde{\mathcal{F}}$. This restriction adds more assumptions on $f^\ast$. Computational requirements dictate these assumptions, not the problem itself. For instance, in many cases, the search space is restricted to linear functions because of their simplicity and interpretability.
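As a toy illustration of such a restriction, here is a minimal sketch in Python (assuming NumPy; the dataset and the squared-error functional are arbitrary choices, not prescribed above). Restricting the search space to linear functions $f(x) = mx + q$ turns an infinite-dimensional problem into a two-parameter least-squares fit:

```python
import numpy as np

# Toy dataset: noisy samples of an unknown function (sin, for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 50)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

# Restrict the search space to linear functions f(x) = m*x + q and minimize
# the squared-error functional J(f) = sum_i (f(x_i) - y_i)^2 over (m, q).
m, q = np.polyfit(x, y, deg=1)
print(f"best linear fit: f(x) = {m:.3f}*x + {q:.3f}")
```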
Representing an uncountable number of dimensions with a countable amount of parameters seems impossible. The density property makes it possible. We use this same property in every calculation on a computer: since $\mathbb{Q}$ is dense in $\mathbb{R}$, we can represent real numbers on machines with a precision that is only limited by the available memory.
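As a small demonstration of this fact (a sketch; the target $\pi$ is an arbitrary choice), Python's `fractions` module finds rationals arbitrarily close to any given real number:

```python
from fractions import Fraction
import math

# Density of Q in R: rational approximations of pi improve without bound
# as we allow larger denominators.
target = Fraction(math.pi)
for max_den in (10, 100, 10_000):
    r = target.limit_denominator(max_den)
    print(f"{r}  error = {float(abs(r - target)):.2e}")
```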
Similarly, the approximation theory of functions looks for a countable set $\tilde{\mathcal{F}}$ dense in $\mathcal{F}$. Density assures that the actual search space is not reduced, because functions in $\tilde{\mathcal{F}}$ can reproduce those in $\mathcal{F}$ with arbitrary precision. Let us recall the definition of a dense set.
DEFINITION Let $X$ be a topological space and let $D \subseteq X$. $D$ is dense in $X$ if for every $x \in X$, any neighborhood of $x$ contains at least one point from $D$. Equivalently, $D$ is dense in $X$ if the closure of $D$ is $X$: $\overline{D} = X$. The definitions of neighborhood and closure come from general topology. A neighborhood of a point $x$ is a set that includes an open subset containing $x$. The closure $\overline{D}$ of a set $D$ is the smallest closed set containing $D$; it is the union of $D$ and its limit points.
The best-known result of approximation theory is the Weierstrass approximation theorem.
THEOREM Let $f$ be a continuous real-valued function defined on the interval $[a, b]$. For every $\varepsilon > 0$, there exists a polynomial $p$ such that for every $x \in [a, b]$ we have $|f(x) - p(x)| < \varepsilon$, or equivalently, $\lVert f - p \rVert_\infty < \varepsilon$. The notation $\lVert \cdot \rVert_\infty$ indicates the uniform or supremum norm. The norm assigns to a bounded function $g$ on the interval $[a, b]$ the value $\lVert g \rVert_\infty = \sup_{x \in [a, b]} |g(x)|$.
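The classical constructive proof of this theorem uses Bernstein polynomials, which converge uniformly to $f$ on $[0, 1]$ (a general interval reduces to this case by an affine change of variables). The sketch below (Python with NumPy; the target $|x - 1/2|$ is an arbitrary continuous example) shows the sup-norm error shrinking as the degree grows:

```python
import numpy as np
from math import comb

def bernstein(f, n, x):
    """Evaluate the degree-n Bernstein polynomial of f on [0, 1] at points x."""
    k = np.arange(n + 1)
    binom = np.array([comb(n, i) for i in k], dtype=float)
    basis = binom * np.power.outer(x, k) * np.power.outer(1.0 - x, n - k)
    return basis @ f(k / n)

f = lambda t: np.abs(t - 0.5)  # continuous on [0, 1] but not differentiable
x = np.linspace(0.0, 1.0, 1001)
for n in (5, 50, 500):
    err = np.max(np.abs(bernstein(f, n, x) - f(x)))
    print(f"degree n = {n:3d}, sup-norm error = {err:.4f}")
```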
Multilayer perceptron
Neural networks encode a subset of continuous functions. To examine the approximation capabilities of this class, we introduce the multilayer perceptron and one of its generalizations.
The multilayer perceptron (MLP) is a fundamental neural network model consisting of a sequence of layers. Every layer is composed of neurons, the elementary processing units of the network. Every artificial neuron in the intermediate hidden layers processes the outputs $z \in \mathbb{R}^m$ of the previous layer with a non-linear function $h$ such that $$h(z) = \sigma\left(\sum_{i=1}^{m} w_i z_i + b\right),$$ where $w \in \mathbb{R}^m$ and $b \in \mathbb{R}$ are parameters that change for every neuron. The non-linearity of $h$ comes from the activation function $\sigma$, a non-linear univariate function that is the same for every neuron in the network. In the scientific literature about neural networks, the values $w_i$ are called weights. They model the strength of the synapse linking the $i$-th neuron of the previous layer with the current neuron. The parameter $b$, called bias, recalls the threshold potential involved in the firing of biological neurons.
In the final layer, called the output layer, neurons operate differently: they perform a linear combination of their inputs. Therefore, the output of a multilayer perceptron with a single hidden layer of $r$ units is $$F(x) = \sum_{k=1}^{r} c_k\, \sigma(w_k \cdot x + b_k), \qquad w_k \in \mathbb{R}^n,\ b_k, c_k \in \mathbb{R}.$$ It is possible to stack more hidden layers of various sizes to obtain a deeper network. We will focus on this shallow model because it can already approximate a large class of functions. The considered MLP model outputs a single real number. The analysis can be generalized to outputs in $\mathbb{R}^m$ by repeating it on every component. A model whose output is multi-dimensional can be used for both regression and classification tasks. Indeed, classification problems are framed as a regression of class probabilities.
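The formula above translates into a few lines of code. Here is a minimal sketch (Python with NumPy; $\sigma = \tanh$ is just one possible activation, and all parameter values are random placeholders):

```python
import numpy as np

def mlp_one_hidden(x, W, b, c):
    """Single-hidden-layer MLP: F(x) = sum_k c_k * sigma(w_k . x + b_k).

    W has shape (r, n), one direction w_k per row; b and c have shape (r,).
    """
    sigma = np.tanh               # one possible choice of activation
    hidden = sigma(W @ x + b)     # outputs of the r hidden neurons
    return c @ hidden             # linear combination in the output layer

# Example: a random network with n = 3 inputs and r = 4 hidden units.
rng = np.random.default_rng(0)
W, b, c = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=4)
print(mlp_one_hidden(np.array([0.5, -1.0, 2.0]), W, b, c))
```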
Ridge functions
The function $x \mapsto \sigma(w \cdot x + b)$ computed by a hidden neuron has a particular structure: it is the composition of the univariate function $\sigma$ with an inner product, one of the simplest multivariate functions. Functions of such form are called ridge functions.
DEFINITION A ridge function is a function $f: \mathbb{R}^n \to \mathbb{R}$ of the form $$f(x) = g(a \cdot x),$$ where $g: \mathbb{R} \to \mathbb{R}$ is a univariate function and $a \in \mathbb{R}^n \setminus \{0\}$ is a fixed vector.
The vector $a$ is called the direction because a ridge function is constant on the parallel hyperplanes orthogonal to $a$. These hyperplanes are defined by the equation $a \cdot x = c$, with $c \in \mathbb{R}$.
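A quick numerical check of this geometric property (a sketch; the direction $a$ and the profile $g = \sin$ are arbitrary choices): moving along any vector orthogonal to $a$ does not change the value of the ridge function.

```python
import numpy as np

# A ridge function f(x) = g(a . x) with direction a and univariate profile g.
a = np.array([1.0, 2.0])
g = np.sin
f = lambda x: g(x @ a)

# Points on the hyperplane (a line in R^2) defined by a . x = 1:
# x = x0 + t*v, where v is orthogonal to a.
x0 = a / (a @ a)              # one particular solution of a . x = 1
v = np.array([-2.0, 1.0])     # orthogonal to a: v . a = 0
for t in (-1.0, 0.0, 3.0):
    print(f(x0 + t * v))      # prints the same value, sin(1), for every t
```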
We denote the set of ridge functions with direction $a$ by $$\mathcal{R}_a = \{\, g(a \cdot x) \mid g: \mathbb{R} \to \mathbb{R} \,\}.$$ Since the function $g$ can be arbitrarily rescaled, it follows that if $a = \lambda a'$ for some $\lambda \neq 0$, then $\mathcal{R}_a = \mathcal{R}_{a'}$.
For a set $A \subseteq \mathbb{R}^n \setminus \{0\}$, we define $$\mathcal{R}(A) = \operatorname{span}\{\, f \in \mathcal{R}_a \mid a \in A \,\}.$$ The span of a set is the set of finite linear combinations of elements in it, so every $f \in \mathcal{R}(A)$ has the form $$f(x) = \sum_{k=1}^{r} g_k(a_k \cdot x), \qquad a_k \in A.$$ We can notice that $\mathcal{R}(\mathbb{R}^n)$ includes the set of MLPs with one hidden layer, but MLPs have the additional constraint $g_k(t) = c_k\, \sigma(t + b_k)$ for every $k$: every univariate profile is a scaled translate of the same activation $\sigma$.
If we require the continuity of the activation function in an MLP, the functions defined by an MLP belong to the smaller set of linear combinations of continuous ridge functions $$\mathcal{R}^c(A) = \operatorname{span}\{\, g(a \cdot x) \mid a \in A,\ g \in C(\mathbb{R}) \,\},$$ with $A = \mathbb{R}^n$.
We are interested in the approximation theory of the functional vector spaces $\mathcal{R}^c(A)$ to characterize the class of functions defined by MLPs. If an MLP is capable of approximating a function, the same holds for $\mathcal{R}^c(\mathbb{R}^n)$, a larger family of functions. Conversely, if $\mathcal{R}^c(\mathbb{R}^n)$ cannot approximate a function, then neither can MLPs. Therefore $\mathcal{R}^c(\mathbb{R}^n)$ gives us an upper bound on the approximation capabilities of MLPs.
Uniform convergence on compact sets
The breadth of the class of continuous functions in $\mathcal{R}^c(A)$ depends on $A$. For instance, we have already seen that $\mathcal{R}^c(\{a\})$ represents a limited family of functions, all constant on the same parallel hyperplanes. Thus, not every $\mathcal{R}^c(A)$ is a dense subspace of $C(\mathbb{R}^n)$. We will present the conditions on $A$ that ensure that this property holds.
Since density is a property of a topological space, we have to define a topology on $C(\mathbb{R}^n)$ to clarify the meaning of this concept. In the approximation theory of continuous functions, the topology of uniform convergence on compact subsets is the most common. The general topological definition of a compact set involves open covers and subcovers. The Heine-Borel theorem gives a much simpler characterization of compactness in Euclidean space: in $\mathbb{R}^n$, a set is compact if and only if it is closed and bounded.
DEFINITION A sequence of functions $f_j: \mathbb{R}^n \to \mathbb{R}$ is said to converge uniformly on compact sets as $j \to \infty$ to some function $f$ if, for every compact set $K \subset \mathbb{R}^n$ and every $\varepsilon > 0$, there exists $J \in \mathbb{N}$ such that $$\sup_{x \in K} |f_j(x) - f(x)| < \varepsilon \qquad \text{for every } j \geq J.$$ If we can prove density in this topology, then the subset is also dense in many other topologies. For example, density would hold in $L^p(K)$, where $K$ is any compact subset of $\mathbb{R}^n$ and $1 \leq p < \infty$. Indeed $\lVert f \rVert_{L^p(K)} \leq |K|^{1/p}\, \lVert f \rVert_{\infty, K}$, where $|K|$ is the measure of $K$, so uniform convergence on $K$ implies convergence in $L^p(K)$.
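The gap between the two modes of convergence can be seen on the classical example $f_j(x) = x / j$: the sup over any fixed compact set $[-R, R]$ tends to zero as $j \to \infty$, while the sup over the whole real line is infinite for every $j$. A small grid-based sketch (purely illustrative):

```python
import numpy as np

# f_j(x) = x / j converges to 0 uniformly on every compact set K = [-R, R],
# but not uniformly on R: the sup over the whole line is infinite for every j.
for j in (1, 10, 100):
    for R in (1.0, 100.0):
        x = np.linspace(-R, R, 10_001)
        sup = np.max(np.abs(x / j))          # equals R / j
        print(f"j = {j:3d}, K = [-{R:g}, {R:g}]: sup |f_j| = {sup:.4f}")
```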
A very powerful tool to prove density results in this topology is the Stone-Weierstrass theorem.
THEOREM Suppose $X$ is a compact Hausdorff space and $C(X)$ is the space of continuous real-valued functions on $X$. A subalgebra $\mathcal{A}$ of $C(X)$ containing a non-zero constant function is dense in $C(X)$ if and only if it separates points. A set $\mathcal{A}$ of real-valued functions separates points if for any pair $x \neq y$, there exists a function $f \in \mathcal{A}$ such that $f(x) \neq f(y)$.
This theorem implies the Weierstrass approximation theorem: indeed, polynomials on $[a, b]$ form a subalgebra of $C([a, b])$ which contains constant functions and separates points. Let us recall the definitions of algebra and subalgebra.
DEFINITION Let $A$ be a vector space over $\mathbb{R}$ equipped with a binary operation (a product) from $A \times A$ to $A$. Then $A$ is an algebra if the following identities hold for all elements $x$, $y$, and $z$ of $A$, and all scalars $\alpha$ and $\beta$ of $\mathbb{R}$: $$(x + y) \cdot z = x \cdot z + y \cdot z \quad \text{(right distributivity)},$$ $$z \cdot (x + y) = z \cdot x + z \cdot y \quad \text{(left distributivity)},$$ $$(\alpha x) \cdot (\beta y) = (\alpha \beta)(x \cdot y) \quad \text{(compatibility with scalars)}.$$
DEFINITION A subalgebra is a subset of an algebra that is closed under addition, multiplication by a scalar, and product. A set is closed under an operation if the result of the operation applied to elements of the set is still included in the set itself.
Density of continuous ridge functions
We now have all the ingredients to state a density theorem about $\mathcal{R}^c(\mathbb{R}^n)$ in $C(\mathbb{R}^n)$. Actually, the theorem is even stronger, because the proof shows that the smaller set $E = \operatorname{span}\{\, e^{a \cdot x} \mid a \in \mathbb{R}^n \,\}$ is dense in $C(\mathbb{R}^n)$.
THEOREM $\mathcal{R}^c(\mathbb{R}^n)$ is dense in $C(\mathbb{R}^n)$ in the topology of uniform convergence on compact subsets.
Proof
It is sufficient to prove that linear combinations of the functions $e^{a \cdot x}$, where $a \in \mathbb{R}^n$, are dense in $C(\mathbb{R}^n)$ in the topology of uniform convergence on compact subsets: each $e^{a \cdot x}$ is a continuous ridge function, so $E \subseteq \mathcal{R}^c(\mathbb{R}^n)$ and density of the subset implies density of the whole set. This fact is a consequence of the Stone-Weierstrass theorem.
For every compact set $K \subset \mathbb{R}^n$, $C(K)$ is an algebra: it is a vector space, and the product between functions satisfies all the properties listed in the definition of algebra. The set $E$ is a linear subspace of $C(K)$, therefore it is closed under addition and multiplication by a scalar. Thanks to the properties of the exponential function, the product of two elements is still included in $E$: $$e^{a \cdot x}\, e^{b \cdot x} = e^{(a + b) \cdot x}.$$ Thus, $E$ is a subalgebra of $C(K)$.
Putting $a = 0$, we obtain the non-zero constant function $e^{0 \cdot x} = 1$. Moreover, $E$ separates points. Let $x \neq y$; then there exists a coordinate $i$ such that $x_i \neq y_i$. The function $e^{e_i \cdot x} = e^{x_i}$, where $e_i$ is the $i$-th vector of the canonical basis, separates $x$ from $y$ because the exponential is injective.
All the hypotheses of the Stone-Weierstrass theorem are satisfied. Thus, for every compact $K \subset \mathbb{R}^n$, $E$ is dense in $C(K)$ with respect to the uniform norm. That is equivalent to saying that $E$ is dense in $C(\mathbb{R}^n)$ in the topology of uniform convergence on compact subsets.
The density of $\mathcal{R}^c(\mathbb{R}^n)$ in $C(\mathbb{R}^n)$ tells us that for every continuous function there exists a sequence of functions in $\mathcal{R}^c(\mathbb{R}^n)$ converging to it in the chosen topology. Thus, any continuous function can be approximated by functions in $\mathcal{R}^c(\mathbb{R}^n)$ with an arbitrarily small approximation error on every compact subset of $\mathbb{R}^n$. Uniform convergence on compact sets is not equivalent to uniform convergence on the whole space, but in practice the difference is irrelevant, as we always approximate functions whose domain can be included in a compact set. Most of the time, the approximated function is known only on a finite set of points.
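As an illustration of this last point (a sketch, not the constructive content of the proof: the directions, the target $\sin(3x)$, and the sampling grid are all arbitrary choices), we can fit a small linear combination of exponentials $e^{ax}$ to a continuous function sampled on a compact interval by least squares:

```python
import numpy as np

# Approximate a continuous function on the compact set K = [-1, 1] with a
# linear combination of exponentials e^{a x} (ridge functions for n = 1).
f = lambda x: np.sin(3.0 * x)
x = np.linspace(-1.0, 1.0, 400)           # sample points in K

directions = np.linspace(-4.0, 4.0, 9)    # a handful of directions a
A = np.exp(np.outer(x, directions))       # design matrix with columns e^{a x}
coef, *_ = np.linalg.lstsq(A, f(x), rcond=None)

err = np.max(np.abs(A @ coef - f(x)))
print(f"sup error on K with {directions.size} exponentials: {err:.2e}")
```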
We have already observed that every $\mathcal{R}^c(A)$ only depends on the set of normalized directions in $A$. As a consequence, the theorem implies that $\mathcal{R}^c(A)$ is dense in $C(\mathbb{R}^n)$ even for $A = \{\, a \in \mathbb{R}^n \mid \lVert a \rVert = r \,\}$ with an arbitrary constant $r > 0$. In general, we can bound the norm of ridge function directions without any loss in expressive power.
The theorem implies nothing definitive about MLPs. A negative result would have been valid for any subset of $\mathcal{R}^c(\mathbb{R}^n)$, while such a positive result only gives us an idea of what kind of approximation MLPs could attain and under which conditions.
A more general theorem by Vostrecov and Kreines states a necessary and sufficient condition for the density of $\mathcal{R}^c(A)$ in $C(\mathbb{R}^n)$.
THEOREM $\mathcal{R}^c(A)$ is dense in $C(\mathbb{R}^n)$ in the topology of uniform convergence on compact subsets if and only if there is no non-zero homogeneous polynomial $p$ such that $p(a) = 0$ for every $a \in A$.
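To see the necessity of the condition at work, consider $n = 2$ and $A = \{(1, 0)\}$: the non-zero homogeneous polynomial $p(a_1, a_2) = a_2$ vanishes on $A$, so $\mathcal{R}^c(A)$ cannot be dense. The sketch below (the target $\sin(\pi x_2)$ and the grid are illustrative choices) checks numerically that the best least-squares approximant depending on $x_1$ alone keeps a large error, no matter how many ridge functions with direction $(1, 0)$ we combine:

```python
import numpy as np

# With A = {(1, 0)} in R^2, every function in R^c(A) depends on x1 only.
# On a product grid, the best least-squares approximant of this form is the
# average of the target over x2 for each fixed x1.
f = lambda x1, x2: np.sin(np.pi * x2)     # target depends on x2 only
grid = np.linspace(-1.0, 1.0, 101)
x1, x2 = np.meshgrid(grid, grid, indexing="ij")
values = f(x1, x2)

best_g = values.mean(axis=1, keepdims=True)   # optimal g(x1), constant in x2
print(f"sup error of the best g(x1): {np.max(np.abs(values - best_g)):.3f}")
```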
This final theorem provides a complete characterization of the density of linear combinations of ridge functions. We have not discussed whether all these properties are preserved by MLPs, but this is a solid starting point to deepen the subject.
Conclusions
Single-hidden-layer MLPs are a particular kind of linear combination of ridge functions. We have proved that the class of functions spanned by continuous ridge functions approximates every continuous function with arbitrary precision. This fact is relevant in applications because it allows designing parametric models without imposing restrictive constraints on the optimal solution.
The density of $\mathcal{R}^c(\mathbb{R}^n)$ in $C(\mathbb{R}^n)$ does not imply that the same property holds for MLPs, but we can hope that the functions represented by MLPs have the same approximation capability. Indeed, like linear combinations of ridge functions, MLPs are universal approximators. That will be the topic of a follow-up article.