The field of artificial intelligence has seen a revolution led primarily by the application of neural networks (NNs) to problems in applied mathematics, statistics, and engineering. Over the last seven decades there have been major contributions leading to the empirical success of neural networks, yet the theoretical work has been completed piecemeal and has been largely overshadowed by the capabilities of this class of algorithms in practice. Neural networks are commonly referred to as "black boxes" in the sense that they serve as replacements for input-output mappings, i.e. functions, and therefore a full understanding of them necessitates a mathematical understanding of their ability to compute accurate function approximations. This thesis aims to collect and distill some of the major results in the approximation-theoretic literature on neural networks, identifying common themes, proof techniques, and major intuitions gained about neural networks as purely mathematical objects.

As a first step, the classical results on Universal Approximation for shallow networks are presented. This begins with qualitative results demonstrating that certain shallow neural network architectures are dense in the space of continuous functions and culminates with Barron's seminal work, which gives us both our initial look at quantitative approximation results as well as the viewpoint that infinite-width shallow neural networks are integral transforms. Indeed, Barron's quantitative result is an upper bound on the approximation error incurred by sampling such an integral representation. The section closes with a discussion of Banach spaces of functions amenable to Barron's approximation result and how these spaces relate to dimension-independent approximation rates.

The next section introduces deep neural networks (DNNs). The primary question being investigated is the role depth plays in the context of approximation. As a first step in this direction, we present a collection of results known as depth separations, i.e. constructions of functions which can be efficiently approximated by deep networks but which require exponentially wide shallow(er) networks for efficient approximation. Two key intuitions developed in this section are that 1) the Fourier transform of shallow networks is supported on a collection of 1-D subspaces, providing insight into the types of functions that shallow networks cannot efficiently approximate, and 2) deep networks can more efficiently approximate compositional functions. In addition, we see that a common method for deriving approximation upper bounds is to first approximate using a classical model class (orthogonal polynomials, Taylor polynomials, Fourier bases, etc.) and then show that these basis elements can themselves be approximated by neural networks. We discuss how this leads to exponential Lipschitz constants for the approximating networks. The section also introduces lower bounds on the approximation error, leading to a discussion of the continuity of parameter selection and nonlinear $n$-widths.

The third and final section unifies the prior two sections by identifying open questions related to the limitations of the proof techniques and results proved thus far. In particular, generalizing depth separations to wider classes of functions beyond radial and piecewise oscillatory functions is discussed by analyzing the Fourier transform of compositional functions and deep networks. Achieving measure-independent lower bounds and the hurdles of approximating functions with large Lipschitz constants are also discussed.
It is known that some nonlinear methods for approximation outperform any linear method. For example, functions from a suitable Sobolev space can be approximated by piecewise constant functions whose parameters are, in this case, free (non-uniformly spaced) breakpoints, at a rate that no linear method can match: any fixed $n$-dimensional linear space is subject to a strictly slower lower bound. Nonlinear widths give a framework to quantify the sense (via lower bounds) in which the approximation rates for nonlinear methods are optimal.
Given a Banach space $X$, we consider the manifold of parameterizations of elements in $X$, namely $\mathcal{M}_N := \{M(\theta) \space | \space \theta \in \mathbb{R}^N\}$ for a parameterization map $M: \mathbb{R}^N \to X$. The approximation error is defined as $$E(f, \mathcal{M}_N)_X := \inf_{\theta \in \mathbb{R}^N} ||f - M(\theta)||_X$$ and we say that $M(\theta^*)$ is a near-best approximation of $f$ if there exists a constant $C \geq 1$ such that $||f - M(\theta^*)||_X \leq C \space E(f, \mathcal{M}_N)_X$. Likewise, for a set $K \subset X$, we measure the worst-case approximation error $$E(K, \mathcal{M}_N)_X := \sup_{f \in K} E(f, \mathcal{M}_N)_X$$
We want to choose $\mathcal{M}_N$ so as to minimize this error for any $f \in K$; unfortunately, taking the infimum over all such manifolds is trivially zero, as one can always, given a dense subset of $K$, construct a space-filling curve (in $X$).
To circumvent this, one places conditions on how the approximation is chosen, i.e. on how $\theta$ is chosen for a particular $f$. This can be defined by a map $a: K \to \mathbb{R}^N$. If $K$ is compact and $a$ is continuous w.r.t. some norm on $\mathbb{R}^N$, then we call $a$ a continuous selection. Given this definition, we define the approximation error for a nonlinear method, which now depends on the choice of selection, $$E(f, a, M)_X := ||f - M(a(f))||_X$$ and the continuous nonlinear width, which is the best possible approximation amongst all manifolds and continuous selections: $$\delta_N(K)_X := \inf_{M, \space a} \sup_{f \in K} ||f - M(a(f))||_X$$
The width $\delta_N(K)_X$ is a non-increasing function of $N$, and by definition $\delta_N(K)_X \leq \sup_{f \in K} ||f - M(a(f))||_X$ for any fixed admissible pair $(M, a)$.
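To make the flavor of this gap concrete, here is a minimal Python sketch (my own illustration, not the precise Sobolev-space statement above): approximating $\sqrt{x}$ on $[0,1]$ by $n$-piece piecewise constants, free (adaptive) breakpoints achieve an $O(n^{-1})$ sup-norm error, while uniformly spaced breakpoints, which mimic a fixed linear method, stall at $O(n^{-1/2})$.

```python
import numpy as np

def sup_error_piecewise_const(f, breaks):
    """Sup-norm error of the best piecewise-constant approximation of an
    increasing function f with the given breakpoints: on each cell the best
    constant is the midpoint of f's range there, so the overall error is half
    of the largest increment of f across a cell."""
    vals = f(np.asarray(breaks))
    return 0.5 * np.max(np.diff(vals))

f = np.sqrt  # increasing, with unbounded derivative at 0
for n in [4, 16, 64, 256]:
    uniform = np.linspace(0.0, 1.0, n + 1)     # fixed, uniformly spaced knots
    free = np.linspace(0.0, 1.0, n + 1) ** 2   # free knots chosen to equalize the increments of sqrt
    print(n, sup_error_piecewise_const(f, uniform), sup_error_piecewise_const(f, free))
# uniform knots: error ~ 0.5 * n**(-1/2);  free knots: error = 0.5 * n**(-1)
```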
# 1. Shallow Networks
## 1.1 Universal Approximation
A shallow network, also called a 2-layer neural network or a 1-hidden-layer neural network, is a linear combination of compositions of affine maps with a univariate **activation function** $\sigma$, which we denote by $f_{NN}: \mathbb{R}^n \to \mathbb{R}$:
$$f_{NN}(x; \space \theta) := w_2^T\sigma(W_1^Tx + b_1) = \sum_{k=1}^{h} w_{2,k}\,\sigma({w_{1,k}}^{T}x+b_{1,k})$$
This yields a family of functions parameterized by $\theta = \{W_1 \in \mathbb{R}^{n \times h}, w_2 \in \mathbb{R}^{h}, b_1 \in \mathbb{R}^{h}\}$, where the activation function is understood to be applied coordinate-wise to the $h$-dimensional vector of the hidden layer. We refer to this class as $\mathcal{F}_{NN}^{h,1} := \{f_{NN}(\cdot \space; \space \theta) \space | \space \forall \space \theta \}$. Note that $\{x \mapsto w^Tx + b \space | \space w \in \mathbb{R}^n, b \in \mathbb{R}\}$ is the space of 1-D affine maps on $\mathbb{R}^n$, so each hidden unit is a ridge function. Typically, the domain is restricted to a bounded subset $U \subset \mathbb{R}^n$, and we consider the functions $f_{NN}\big|_U$.
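The following is a minimal NumPy sketch of the family $\mathcal{F}_{NN}^{h,1}$ just defined (the parameter names mirror the notation above; the choice of $\tanh$ as activation is only for illustration):

```python
import numpy as np

def shallow_net(x, W1, b1, w2, activation=np.tanh):
    """Shallow (1-hidden-layer) network f_NN(x; theta) = w2^T sigma(W1^T x + b1).

    x  : (n,) input vector
    W1 : (n, h) inner weights,  b1 : (h,) biases,  w2 : (h,) outer weights
    The activation is applied coordinate-wise to the h-dimensional hidden vector.
    """
    hidden = activation(W1.T @ x + b1)   # (h,) hidden layer
    return w2 @ hidden                   # scalar output

# example: a width-3 network on R^2
rng = np.random.default_rng(0)
n, h = 2, 3
W1, b1, w2 = rng.normal(size=(n, h)), rng.normal(size=h), rng.normal(size=h)
print(shallow_net(np.array([0.5, -1.0]), W1, b1, w2))
```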
The initial question we aim to resolve is identifying the widest class of functions that can be effectively expressed by this family. The ability to express (or approximate) functions of a given class is equivalent to the notion of density in that class. Furthermore, “effectively” refers to identifying the rate at which the largest class of functions can be approximated by $\mathcal{F}_{NN}^{h,1}$ with respect to the number of parameters $N = h(n+2)$. Our eventual concern will be to understand this question in the context of networks with an arbitrary number of hidden layers and to understand how this rate depends on the depth of the network. Throughout, I will refer to functions with exact representation by a neural network as being represented by a network and functions that can be approximated by a neural network as being expressed by a network.
I begin by presenting some elementary examples of approximation of univariate functions by shallow neural networks of both finite and infinite width, due to Telgarsky:
Simple functions can be represented by a shallow neural network with step function activations. Note that no polynomial can approximate such functions uniformly to arbitrary accuracy (? quantify the approximation rate of a flat region by a polynomial of order ! ?). In particular, neural networks with step function activations are simple functions built from indicators of half-lines: $$f_{NN}(x) = \sum_{k=1}^{h} w_{2,k} \space \chi_{\{w_{1,k}x + b_{1,k} \geq 0\}}(x)$$ (? maybe add picture here ?)
L-Lipschitz functions can be represented by an infinite-width shallow neural network with step activations. By the Fundamental Theorem of Calculus (Lipschitz functions are absolutely continuous), we can represent any such function on $[0,1]$ by an infinite-width neural network: $$f(x) = f(0) + \int_0^x f'(b) \space db = f(0) + \int_0^1 f'(b) \space \chi_{\{x - b \geq 0\}}(x) \space db$$ where $f'(b) \space db$ defines a (signed) measure over step functions. The above infinite-width network gives an average-case estimate (why and what is the estimate?) of $f$.
Twice continuously differentiable functions can be represented by an infinite-width shallow neural network with ReLU activations. Again by the Fundamental Theorem of Calculus and integrating by parts, we can represent any such function on $[0,1]$ by an infinite-width neural network: $$f(x) = f(0) + f'(0) \space x + \int_0^1 f''(b) \space \sigma_{\text{ReLU}}(x - b) \space db$$ where $f''(b) \space db$ defines a (signed) measure over ReLUs.
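As a sanity check of the ReLU representation above, here is a small NumPy sketch (my own illustration) that discretizes the integral by a Riemann sum, i.e. builds a wide finite network whose hidden units sit at the partition points and whose outer weights are $f''(b_k)/h$:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# target C^2 function on [0,1] and its derivatives
f   = lambda x: np.sin(3 * x)
df0 = 3.0                              # f'(0)
d2f = lambda b: -9 * np.sin(3 * b)     # f''(b)

# discretize  f(x) = f(0) + f'(0) x + \int_0^1 f''(b) relu(x - b) db
h  = 2000
bs = (np.arange(h) + 0.5) / h          # midpoints of a uniform partition of [0,1]
xs = np.linspace(0.0, 1.0, 101)
net = f(0.0) + df0 * xs + (d2f(bs)[None, :] * relu(xs[:, None] - bs[None, :])).mean(axis=1)
print(np.max(np.abs(net - f(xs))))     # uniform error; shrinks as the width h grows
```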
We will see that a common theme in neural network approximation is defining a measure that quantifies the regularity of functions from some class.
Discussion:
The final layer of a network of two or more layers computes a linear combination of all nodes in the preceding hidden layer. Could one view this neural network as a superposition of multiple networks, each approximating an arbitrary activation function? A multilayer network then could be seen as approximating combinations of activations to form the final linear combination of this collection of functions, possibly making Kolmogorov’s superposition theorem immediately relevant.
I now review, in chronological order, some of the key papers on universal approximation:
### 1.1.1 Hornik (1989)
The main result of this paper is that shallow networks $\mathcal{F}_{NN}^{h,1}$ (of arbitrary width $h$) are dense in the following sets:
- Borel measurable functions
- continuous functions
- Lebesgue $p$-integrable functions, $1 \leq p < \infty$
given that the activation function is either continuous or “squashing”, i.e. monotonic and sigmoidal in the following sense:
$$\sigma: \mathbb{R} \to [0,1] \space \text{is non-decreasing with} \space \lim_{z \to -\infty}\sigma(z) = 0 \space \text{and} \space \lim_{z \to +\infty}\sigma(z) = 1$$
The latter functions, being monotonic, have at most a countable number of discontinuities.
The paper highlights two techniques for showing density of a function class:
1. Algebras of functions, e.g. Stone-Weierstrass
2. Uniform convergence on compact sets implies convergence in measure for regular, finite measures
The first is used to show shallow networks with either (non-trivial, i.e. non-constant) continuous or squashing activations are dense in $C(K)$ for any compact subset $K \subset \mathbb{R}^n$. This is done by establishing a general form of $\mathcal{F}_{NN}^{h,1}$ (given by linear combinations of products of multiple shallow networks) as a separating and nowhere-vanishing algebra (closed under addition, multiplication, and scalar multiplication) of continuous functions. The separating property is guaranteed by the fact that the activation function is non-constant, the nowhere-vanishing property follows from the existence of the constant affine transformation $x \mapsto b$ with $\sigma(b) \neq 0$, and continuity of $\sigma$ ensures continuity of any network in the class. This result is extended to (possibly discontinuous) squashing functions by further approximation of the squashing function by a continuous-activation shallow network.
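As a concrete illustration of the closure-under-multiplication step (using a cosine activation for transparency, rather than the squashing activations the paper actually treats), the product of two ridge cosines is again a linear combination of ridge cosines by the product-to-sum identity:
$$\cos(w_1^T x + b_1)\cos(w_2^T x + b_2) = \tfrac12\cos\big((w_1 + w_2)^T x + (b_1 + b_2)\big) + \tfrac12\cos\big((w_1 - w_2)^T x + (b_1 - b_2)\big)$$
so this particular ridge class is already an algebra; for general activations the products must be carried along explicitly, as in the construction above.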
The second technique is used to show density of shallow networks with either continuous or squashing activations in the space of measurable functions, where the pseudo-metric $\rho_\mu$ is defined by:
$$\rho_\mu(f, g) := \inf\{\varepsilon > 0 \space | \space \mu(\{x \space | \space |f(x) - g(x)| > \varepsilon\}) < \varepsilon\}$$
This pseudo-metric induces the topology of convergence in measure and is zero if and only if $f = g$ $\mu$-almost everywhere. Under this metric it can be shown that $\mathcal{F}_{NN}^{h,1}$ is dense in the space of measurable functions. The proof of density follows from $\mathbb{R}^n$ being a locally compact metric space, on which all finite Borel measures are regular, so that one can always find a compact set arbitrarily close to full measure (w.l.o.g. assume the total measure is 1). Uniform convergence on compact sets makes the distance small on this particular compact set, and outside this set the measure is arbitrarily small, dominating any large differences in distance.
Discussion:
The definition of shallow networks in terms of products of activations with differing affine transformations is used to establish shallow networks as a sub-algebra of $C(K)$. This could also be done by assuming some other activation function (e.g. the exponential) and then approximating these networks by sigmoidal networks.
The authors claim any function with finite support has an exact representation in terms of a shallow network. A natural question is whether this holds for compact support.
How do the authors go from density for functions defined on any compact subset to functions defined on all of $\mathbb{R}^n$?
### 1.1.2 Cybenko (1989)
This paper gives a proof of the same result as Hornik but using functional analysis, namely the Hahn-Banach and Riesz representation theorems. The key intuition is to understand shallow networks as a particular subspace of continuous functions (why is it a linear subspace?).
The main result of this paper is that $\mathcal{F}_{NN}^{h,1}$ is dense in $C(K)$ for compact $K \subset \mathbb{R}^n$, given that the activation function is continuous and discriminatory, in the sense that for every $\mu \in M(K)$ (the set of finite, signed regular Borel measures on $K$):
$$\int_K \sigma(w^T x + b) \space d\mu(x) = 0 \space\space \forall \space w \in \mathbb{R}^n, b \in \mathbb{R} \implies \mu = 0$$
The contrapositive makes the name discriminatory more informative: if $\mu$ is a non-zero measure, then there are parameters $(w, b)$ for which the induced linear functional $\mu(\sigma_{w,b}) := \int_K \sigma(w^Tx + b) \space d\mu(x)$ is non-zero, i.e. either $\mu(\sigma_{w,b}) > 0$ or $\mu(\sigma_{w,b}) < 0$, where $\sigma_{w,b}$ is determined by the parameters $(w,b)$. Then we can view the measure as inducing a linear functional that separates all such $\sigma_{w,b}$ into two classes.
The paper highlights two methods for showing density of a function class:
1. Hahn-Banach and Riesz representation theorems
2. Translation-invariant subspaces, e.g. the Wiener Tauberian theorem
The first method is used to show shallow networks with continuous and discriminatory activations are dense in $C(K)$. This is done by showing that if the subspace $\mathcal{F}_{NN}^{h,1}$ is not dense, then it is annihilated by (the linear functional induced by) a non-zero finite measure. The proof is by contradiction: if $\overline{\mathcal{F}_{NN}^{h,1}} \neq C(K)$, then there is a non-trivial, continuous linear functional in the annihilator which, by continuity, also annihilates the closure, and which has a non-trivial extension $L$ to all of $C(K)$ by Hahn-Banach. By the Riesz representation theorem, $L$ has the form $L(f) = \int_K f \space d\mu$ for some non-zero $\mu \in M(K)$, which in particular holds for $f = \sigma(w^Tx + b)$. So $\int_K \sigma(w^Tx + b) \space d\mu = 0$ for all $(w, b)$, which is a contradiction as $\sigma$ is discriminatory and $\mu$ is not trivial; thus $\overline{\mathcal{F}_{NN}^{h,1}} = C(K)$.
This holds for sigmoidal activations, given that any sigmoidal function is discriminatory. The proof that a sigmoidal function is discriminatory is as follows: define a pointwise convergent and bounded sequence of sigmoids $\sigma(\lambda(w^Tx + b))$, $\lambda \to \infty$, whose limit is constant on the hyperplane $\{w^Tx + b = 0\}$, 1 above it, and 0 below it. By assumption, each member of the sequence has integral (against $\mu$) equal to 0, and in particular so does the limit (by the bounded convergence theorem), so that the measure of the half-space lying above the hyperplane is 0 (the contribution of the hyperplane itself is handled by an additional shift parameter inside the sigmoid, and the limit function is zero below the hyperplane). This implies that the integral of the indicator of any half-space is 0, and by density of simple functions in $L^\infty(\mu)$, the functional annihilates all bounded measurable functions. In particular, it annihilates $\sin(w^Tx)$ and $\cos(w^Tx)$, showing that the Fourier transform of the measure is zero, and hence $\mu = 0$. Therefore, $\sigma$ is discriminatory.
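The limiting behaviour used in this argument is easy to see numerically; a minimal sketch (my own illustration, with the logistic sigmoid):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# sigma(lambda * (w^T x + b)) tends to 1 above the hyperplane {w^T x + b = 0},
# to 0 below it, and stays at sigma(0) = 1/2 on the hyperplane itself.
w, b = np.array([1.0, -2.0]), 0.5
xs = np.array([[1.0, 0.0],     # w^T x + b = +1.5  (above)
               [0.0, 1.0],     # w^T x + b = -1.5  (below)
               [1.0, 0.75]])   # w^T x + b =  0.0  (on the hyperplane)
for lam in [1.0, 10.0, 100.0, 400.0]:
    print(lam, sigmoid(lam * (xs @ w + b)))
```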
When the activation is discontinuous but bounded and measurable, then $\mathcal{F}_{NN}^{h,1}$ is dense in $L^1(K)$. The proof is similar but uses the fact that the dual of $L^1$ is $L^\infty$ (so it uses functionals defined by bounded functions rather than those defined by measures).
The paper also points out the connection between approximation of continuous functions by neural networks and the approximation of decision regions: arbitrary compact, disjoint subsets of $\mathbb{R}^n$ can be (approximately) partitioned. This extends a previous result that any finite set of points can be partitioned into arbitrary sub-regions by a shallow network, i.e. neural networks have infinite VC-dimension (??). The result stems from Littlewood’s intuitive view of Lusin’s theorem: “every measurable function is nearly continuous”. Given that a decision function (which outputs the index of the measurable set the input lies in) is measurable, by Lusin’s theorem there is a continuous restriction of the decision function to a compact set of almost full Lebesgue measure. This continuous restriction can now be approximated by a shallow network, by the above result. It is important to note that the continuity requirement means that this partitioning is only approximate, in the sense that there will always be some set of positive measure that is incorrectly classified, albeit of arbitrarily small measure. A specific application of this is that points sufficiently far from the boundary and inside the decision region can be correctly classified!
Cybenko importantly notes the key goal of our investigations:
“… Namely [we wish to solve] how many terms in the summation (or equivalently how many hidden nodes) are required to yield an approximation of a given quality? What properties of the function being approximated play a role in determining the number of terms?”
Discussion:
There is a small error in the proof (pointed out by Telgarsky): it takes the Fourier transform of a function whose Fourier transform possibly doesn’t exist classically; technically, it is defined only in the sense of distributions!
### 1.1.3 Barron (1993)
This paper aims to answer the questions Cybenko left us with. Namely, it shows that functions in a subspace of $L^2(U, \mu)$ for some bounded $U \subset \mathbb{R}^n$, which we call the Barron space, can be approximated by a shallow sigmoidal network with error $O(1/\sqrt{h})$ in the number of hidden nodes $h$, together with a uniform lower bound of order $h^{-1/n}$ in the case where the basis functions are fixed (no adaptive affine maps).
We define the Barron space $\mathbb{B}(U)$ as the set of functions $f$ with:
$$\int \sup_{x \in U}|\xi \cdot x| \space \big|\hat{f}(\xi)\big| \space d\xi < \infty$$
Such functions have bounded first partial derivatives (a necessary but not sufficient condition for membership): this is clear, as the Barron space is a strict subspace of the Sobolev space $W^{1,\infty}(U)$. As we will see, this space is a Banach space (??) with norm given by the Barron norm
$$\begin{aligned} ||f||_{\mathbb{B}(U)} :&= \int \sup_{x \in U} | \xi \cdot x| \left| \hat{f}(\xi) \right| \space d\xi \\ &= \int || \xi ||_{1} \left| \hat{f}(\xi) \right| \space d\xi \end{aligned}$$

Note that the equality stems from the fact that $\sup_{x \in U} | \xi \cdot x|$ is a dual norm when $U$ is the ball given by some norm and bounds $e^{i \xi \cdot x}$ in the Fourier representation of $f |_U$. The Barron norm gives a measure of the degree of regularity of $f$ that penalizes high frequencies more so than would the $L^1$ norm of the Fourier transform of $f$ itself. Indeed, a finite Barron norm implies sparsity in the Fourier domain in the sense that the Fourier transform vanishes at infinity. Importantly, Barron considers the approximability of functions from a ball of fixed radius $R$ in Barron space.

A major contribution of this paper is to demonstrate a gap between approximation using adaptive basis functions vs. a fixed basis. In particular, the main result of the paper is a uniform upper bound on the approximation error in $L^2(B_r, \mu)$, with $\mu$ a finite measure: $\forall \space f \in \mathbb{B}(B_r)$ $\exists \space f_{NN} \in \mathcal{F}_{NN}^{h,1}$ s.t.

$$\big|\big|f-f_{NN}\big|\big|_{L^2(B_r, \space\mu)}\leq\frac{2r}{\sqrt{h}}\big|\big|f\big|\big|_{\mathbb{B}(B_r)}$$

Here the Barron norm involves an $n$-dimensional integral and can be exponentially large in the input dimension (e.g. for radial functions), even if all derivatives are bounded, requiring an exponential number of nodes for good approximation guarantees. This is a manifestation of the curse of dimensionality in error bounds for shallow networks. From our original definition of $f_{NN}$ it is clear that such a network has $h(n+2)$ parameters, so that the lower bound in high dimensions behaves like $h^{-1/n} \to 1$, and the only way to combat this constant lower bound is by increasing the width. This is another demonstration of the curse of dimensionality for shallow networks.

The two key results in the paper are the following:
1. An upper bound on the approximation error both when the parameters are unbounded and when they are bounded
2. A lower bound on the approximation error for fixed basis functions (linear combinations/no affine map) via Kolmogorov $n$-widths

#### Upper bounds for approximation in Barron space

##### Unbounded weights

The proof is based on the intuition that if a function can be approximated in an *adaptive* basis, then a random sample of the basis elements from a well-chosen distribution will yield low error, by the law of large numbers. More explicitly, convex hulls in a Hilbert space induce probability distributions over approximating sequences of functions. Here, the "adaptive basis" is given by bounded sigmoids composed with linear functions:

$$G_\sigma = \{ w_1 \sigma(w_0^T x + b) \space | \space |w_1| \leq C\}$$

One first approximates elements $f \in \overline{\text{conv}}(G_\sigma)$ by an element in the convex hull of $G_\sigma \subset L^2(\mathbb{R})$. Namely, pick a function of the form $\tilde{g}_\varepsilon = \sum_{i=1}^m \tilde{\theta}^i \tilde{g}_i$ for $\tilde{g}_i \in G_\sigma$ and $\sum_{i=1}^m \tilde{\theta}^i = 1$ with $m$ large enough so that $|| f - \tilde{g}_\varepsilon ||_{L^2(\mathbb{R})}^2 \leq \frac{\varepsilon}{h}$. Note that this defines a probability measure $\mathbb{P}$ over the set $\{\tilde{g}_1, \cdots, \tilde{g}_m\}$ such that $\mathbb{P}(g = \tilde{g}_i) = \tilde{\theta}^i$.
Take $h$ independent draws from this distribution, $g_1, \cdots, g_h \sim \text{Multinoulli}(\tilde{\theta}^1, \cdots, \tilde{\theta}^m)$, and form their sample average, $g = \frac1h \sum_{i=1}^h g_i$, so that $\mathbb{E}[g] = \mathbb{E}[\frac1h \sum_{i=1}^h g_i] = \frac1h \sum_{i=1}^h \mathbb{E}[g_i] = \frac1h \sum_{i=1}^h \tilde{g}_\varepsilon = \tilde{g}_\varepsilon$. By independence of the draws and the fact that $\mathbb{E}[g] = \tilde{g}_\varepsilon$ (so the cross terms vanish in expectation), it follows that for any $\varepsilon$:

$$\begin{aligned} \mathbb{E}\left[||f - g||_{_{L^2(B_r, \mu)}}^2 \right] &\leq \mathbb{E}\left[||f - \tilde{g}_\varepsilon||_{_{L^2(B_r, \mu)}}^2 + ||\tilde{g}_\varepsilon - g||_{_{L^2(B_r, \mu)}}^2 \right] \\ &\leq \frac{\varepsilon}{h} + \mathbb{E}\left[||g - \tilde{g}_\varepsilon||_{_{L^2(B_r, \mu)}}^2 \right] \\ &= \frac{\varepsilon}{h} + \mathbb{E}\left[\left\langle g - \tilde{g}_\varepsilon, \space g - \tilde{g}_\varepsilon \right\rangle_{_{L^2(B_r, \mu)}} \right] \\ &= \frac{\varepsilon}{h} + \mathbb{E}\left[\left\langle \frac1h \sum_{i=1}^h g_i - \tilde{g}_\varepsilon, \space \frac1h \sum_{i=1}^h g_i - \tilde{g}_\varepsilon \right\rangle_{_{L^2(B_r, \mu)}} \right] \\ &= \frac{\varepsilon}{h} + \frac1h \mathbb{E}\left[\left\langle g_i - \tilde{g}_\varepsilon, \space g_i - \tilde{g}_\varepsilon \right\rangle_{_{L^2(B_r, \mu)}} \right] \\ &= \frac{\varepsilon}{h} + \frac1h \left( \mathbb{E}\left[||g_i||_{_{L^2(B_r, \mu)}}^2 \right] - ||\tilde{g}_\varepsilon||_{_{L^2(B_r, \mu)}}^2 \right) \\ &\leq \frac{\varepsilon}{h} + \frac1h \left(b^2 - ||\tilde{g}_\varepsilon||_{_{L^2(B_r, \mu)}}^2 \right) \\ &\overset{\varepsilon \to 0}{\to} \frac1h \left(b^2 - ||f||_{_{L^2(B_r, \mu)}}^2 \right) \end{aligned}$$

(Here $b$ bounds the $L^2(B_r, \mu)$ norm of the elements of $G_\sigma$.) Given that the expectation is bounded by this quantity, there exists a particular draw, and a corresponding $g$, achieving the given bound.

The upper bound in the main result follows by showing that the Barron space is contained in the space of infinite combinations of translates of sinusoids, and that an infinite convex combination of sinusoids can be approximated by infinite convex combinations of step functions and therefore sigmoids, i.e. that $\mathbb{B}(U) \subset \overline{\text{conv}}(G_{\cos}) \subset \overline{\text{conv}}(G_\sigma)$:

1. $\mathbb{B}(U) \subset \overline{\text{conv}}(G_{\cos})$: The intuition (see Telgarsky) is that one can view the inverse Fourier transform as an infinite-width neural network with complex exponential activations: $$\begin{aligned} f(x) - f(0) &= \text{Re}\left(\int ( e^{i \xi \cdot x} - 1) \hat{f}(\xi) \space d\xi \right) \\ &= \int \text{Re}\left(( e^{i \xi \cdot x} - 1) \hat{f}(\xi) \right) \space d\xi \\ &= \int \text{Re}\left(( e^{i \xi \cdot x} - 1) e^{i \theta(\xi)} \big|\hat{f}(\xi)\big| \right) \space d\xi \\ &= \int \big(\cos(\xi \cdot x + \theta(\xi)) - \cos(\theta(\xi))\big) \big|\hat{f}(\xi)\big| \space d\xi \\ &= \int \frac{||f||_{\mathbb{B}(U)}}{\sup_{x \in U} | \xi \cdot x|} \big(\cos(\xi \cdot x + \theta(\xi)) - \cos(\theta(\xi))\big) \space d\mathbb{P}(\xi) \\ \end{aligned}$$ where the last line is an infinite convex combination in the Hilbert space $L^2(\mathbb{R}^n, \mathbb{P})$ and $d\mathbb{P} = \frac{\sup_{x \in U} | \xi \cdot x|}{||f||_{\mathbb{B}(U)}} \big|\hat{f}(\xi)\big| \space d \xi$.
2. $\overline{\text{conv}}(G_{\cos}) \subset \overline{\text{conv}}(G_{\text{step}})$: The elements of $G_{\cos}$, shown in the integral above, are univariate sinusoids which can be approximated uniformly on the bounded interval $\left[-\frac{1}{\sup_{x \in U} | \xi \cdot x|}, \frac{1}{\sup_{x \in U} | \xi \cdot x|}\right]$ by piecewise constants (i.e. step functions). We are therefore approximating a univariate function, which can be done using an FTC argument (as in the introduction, see Telgarsky): $$\begin{aligned} \cos(\xi \cdot x + \theta(\xi)) - \cos(\theta(\xi)) &= -\int_0^{ \xi \cdot x} \sin(b + \theta(\xi)) \space db \\ &= \int_0^{\sup_{x \in U} | \xi \cdot x|} \sin(b + \theta(\xi)) \chi_{_{\{\xi \cdot x + b \geq 0\}}}(x) \space db - \int_{-\sup_{x \in U} | \xi \cdot x|}^0 \sin(b + \theta(\xi)) \chi_{_{\{\xi \cdot x + b \leq 0\}}}(x) \space db \\ &= \int_0^{\sup_{x \in U} | \xi \cdot x|} \big( \sin(b + \theta(\xi)) - \sin(-b + \theta(\xi)) \big) \chi_{_{\{\xi \cdot x + b \geq 0\}}}(x) \space db \end{aligned}$$ This demonstrates that $\cos$ can be written as an infinite convex combination of step functions (given the correct normalization). Barron's original proof demonstrates the above by an analogous argument approximating by piecewise constants uniformly over a partition of $\left[-\frac{1}{\sup_{x \in U} | \xi \cdot x|}, \frac{1}{\sup_{x \in U} | \xi \cdot x|}\right]$, namely by approximating the two integrals in the second line by Riemann sums.
3. $\overline{\text{conv}}(G_{\text{step}}) \subset \overline{\text{conv}}(G_{\sigma})$: The argument involving Riemann sums also holds when the partition is restricted to the set of points where $\frac{\xi \cdot x}{\sup_{x \in U} | \xi \cdot x|}$ is continuous, which is dense in $[-1,1]$ (??). Noting that $\sigma(|z| (\xi \cdot x + b)) \to \chi_{\{\xi \cdot x + b \geq 0\}}(x)$ pointwise (and uniformly away from the discontinuity) as $|z| \to \infty$, it holds by Dominated Convergence that $$\int \big| \sigma(|z| (\xi \cdot x + b)) \big|^2 \space dx \to \int \chi_{\{\xi \cdot x + b \geq 0\}}(x) \space dx$$

In order to make the connection to infinite-width neural networks clear, I state the above Barron representation of Fourier inversion as follows (see Telgarsky):

$$f(x) = f(0) + \int \int_0^{\sup_{x \in U} | w^T x|} \Big( \sin(b + \theta(w)) - \sin(-b + \theta(w)) \Big) \Big| \hat{f}(w) \Big| \chi_{_{\{w^T x + b \geq 0\}}}(x) \space db \space dw $$

##### Bounded weights

Unbounded weights are impractical due to memory constraints as well as numerical instability in gradient-based learning. Given that the above result for unbounded weights was derived by approximating step functions by (arbitrarily scaled) sigmoids, it is unsurprising that the error bound for bounded weights now involves a distance between a step function and the sigmoid activation function. Indeed, the norm of the weights determines the scale of the sigmoidal activation:

$$\sigma(||w_{0,k}||_U \cdot(\hat{w}_k^Tx + \frac{b_k}{||w_{0,k}||_U}))$$

for the norm $||w_{0,k}||_U = \sup_{x \in U}|w_{0,k}^T x|$ and where $\hat{w}_k = \frac{w_{0,k}}{||w_{0,k}||_U}$ is a unit vector. As we will see, the distance between the unit step and the sigmoid depends on the norm $||w_{0,k}||_U$ in such a way that this norm must be large enough in order to preserve the $O(\frac1h)$ (squared) approximation error, and thus depends on the regularity of the chosen sigmoid. The bound follows from the triangle inequality, after first approximating by sinusoids, as done in the above section. Consider an infinite-width 1-layer sigmoidal neural network with bounded weights, i.e.
$f_M \in \overline{\text{conv}}(G_\sigma)$ such that $||w_{0,k}||_U \leq M$. By similar arguments as above one can approximate a function $g \in \overline{\text{conv}}(G_{\cos})$ to accuracy $\varepsilon \in (0, \frac12]$, first by step functions $s \in G_{\text{step}}$ on a partition of $\left[-\frac{1}{\sup_{x \in U} | \xi \cdot x|}, \frac{1}{\sup_{x \in U} | \xi \cdot x|}\right]$ with width $\varepsilon \leq \frac1k \leq \frac{\varepsilon}{1-\varepsilon}$:

$$\begin{aligned} ||g - f_{NN}||_{L^2(U, \mu)} &\leq ||f_{NN} - f_M||_{L^2(U, \mu)} + ||f_M - g||_{L^2(U, \mu)} \\ &\leq \frac{2 ||g||_\mathbb{B}}{\sqrt{h}} + ||f_M - g||_{L^\infty(U, \mu)} \\ &\leq \frac{2 ||g||_\mathbb{B}}{\sqrt{h}} + ||g - s||_{L^\infty(U, \mu)} + ||s - f_M||_{L^\infty(U, \mu)} \\ &\leq \frac{2 ||g||_\mathbb{B}}{\sqrt{h}} + \sup_{\substack{z = \hat{w}_k^Tx + b_k \\ x \in U}} |g(z) - s(z)| + \sup_{\substack{z = \hat{w}_k^Tx + b_k \\ x \in U}} |s(z) - f_M(z)| \\ &\leq \frac{2 ||g||_\mathbb{B}}{\sqrt{h}} + \frac{2 ||g||_\mathbb{B}}{k} + 2 ||g||_\mathbb{B} \sup_{|z| \geq \frac1k} \Big|\sigma(||w_{0,k}||_U z) - \chi_{\{z > 0\}}(z)\Big| \\ &\leq \frac{2 ||g||_\mathbb{B}}{\sqrt{h}} + 2 ||g||_\mathbb{B} \left(\frac{\varepsilon}{1-\varepsilon} + \sup_{|z| \geq \varepsilon} \Big|\sigma(||w_{0,k}||_U z) - \chi_{\{z > 0\}}(z)\Big| \right) \end{aligned}$$

**Note some errors in the proof above: one cannot bound the $L^2$ norm by the supremum norm without a constant, unless the underlying measure is a probability measure (which I believe it is). Also, how does one justify taking an infimum in the last step?**

As the LHS does not depend on the choice of $\varepsilon$, one can take the infimum over $\varepsilon \in (0, \frac12]$, which yields the desired bound:

$$\begin{aligned} ||g - f_{NN}||_{L^2(U, \mu)} &\leq 2 ||g||_\mathbb{B} \left(\frac{1}{\sqrt{h}} + \inf_{\varepsilon \in (0, \frac12]} \left(2\varepsilon + \sup_{|z| \geq \varepsilon} \Big|\sigma(||w_{0,k}||_U z) - \chi_{\{z > 0\}}(z)\Big| \right) \right) \end{aligned}$$

#### Lower bounds for approximation with a fixed basis

The second proof is based on the intuition that given some function in Barron space $f \in \mathbb{B}$, there are exponentially many orthonormal functions with the same degree of oscillation as $f$, and in order to efficiently approximate $f$ with a fixed basis, one must use all such orthonormal functions. In particular, Barron lower bounds the **Kolmogorov $n$-width** of Barron space by considering the $n$-width of sinusoids with oscillations no more than those of $f$ (??) and then lower bounding the number of all such functions using a combinatorial method.

First, note that for a $2h$-dimensional linear space $V$ with ONB $\{e_1, \cdots, e_{2h}\}$ and any $h$-dimensional subspace $G_h = \text{span}(g_1, \cdots, g_h) \subset V$, there must exist some basis element of $V$ with:

$$d(e_j, G_h) := \inf_{g \in G_h} d(e_j, g) \geq \frac12$$

A simple computation shows that the average value of the squared norm of the projections of the $e_j$'s onto $G_h$ is $\frac12$, so that there must exist some $e_j$ with $1 - d(e_j, G_h)^2 = ||P e_j||^2 \leq \frac12$. Now consider a $2h$-dimensional subspace of $L^2(U, \mu)$, $V := B_{2h} = \text{span}(\cos(2\pi k_1 \cdot x), \cdots, \cos(2\pi k_{2h} \cdot x))$, ordered by $||k||_1$ where the $k_j$ are in [[Multi-Index Notation]].
We can bound the **Kolmogorov $n$-width** of Barron space $\mathbb{B}$ from below by restricting to the subspaces $B_{2h} \cap \mathbb{B}$ and projecting an arbitrary basis of $L^2(U, \mu)$ onto $B_{2h}$:

$$\begin{aligned} \inf_{e_1, \cdots, e_h \in L^2(U, \mu)} \sup_{f \in \mathbb{B}} \space d(f, \text{span}(e_1, \cdots, e_h)) &\geq \inf_{e_1, \cdots, e_h \in L^2(U, \mu)} \sup_{f \in B_{2h} \cap\mathbb{B}} \space d(f, \text{span}(Pe_1, \cdots, Pe_h)) \\ &\geq \inf_{\substack{G_h \subset L^2(U, \mu) \\ G_h \space \text{subspace}}} \sup_{f \in B_{2h} \cap\mathbb{B}} \space d(f, G_h) \\ &\geq \inf_{\substack{G_h \subset L^2(U, \mu) \\ G_h \space \text{subspace}}} \sup_{f \in \{\frac{||f||_\mathbb{B}}{2\pi|k_1|}\cos(2\pi k_1 \cdot x), \cdots, \frac{||f||_\mathbb{B}}{2\pi|k_{2h}|}\cos(2\pi k_{2h} \cdot x)\}} \space d(f, G_h) \\ &\geq \min_{j \in \{ 1, \cdots, 2h\}} \left(\frac{||f||_\mathbb{B}}{2\pi|k_j|} \right) \left( \inf_{\substack{G_h \subset L^2(U, \mu) \\ G_h \space \text{subspace}}} \sup_{f \in \{\cos(2\pi k_1 \cdot x), \cdots, \cos(2\pi k_{2h} \cdot x)\}} \space d(f, G_h) \right) \\ &\geq \frac12 \min_{j \in \{ 1, \cdots, 2h\}} \left(\frac{||f||_\mathbb{B}}{2\pi|k_j|} \right) \\ &\geq \frac{||f||_\mathbb{B}}{8\pi e^{\pi -1} n} \left(\frac1h\right)^{1/n} \end{aligned}$$

**Discussion**:

- What is the nature of these "local" approximation bounds? Why can we not get more general bounds for unbounded domains? The bounds become meaningless as the domain becomes larger; is this a sort of "approximation" uncertainty principle, i.e. function regularity must be more controlled when you have a larger domain of interest?
- Likewise, why the assumption of a finite measure for such bounds? The connection between the bound and training appears when one considers the empirical measure. There are three quantities of interest in the generalization domain: the total error $f - \hat{f}_n$, the approximation error $f - f_n$, and the estimation error $f_n - \hat{f}_n$.
- What are the conditions on $f$? The Fourier transform must exist, so either $L^1$ or $L^2$. It is also important to note the assumption on the underlying measure space (Lebesgue?). It is also clearly assuming that $f$ is differentiable (why?). This is due to the fact that requiring the Fourier transform of the gradient to be integrable does not place an integrability constraint on the Fourier transform near $\xi =0$. This allows one to guarantee the Fourier transform is also well-defined at that point. One consequence of the fact that $\hat{f}$ can blow up near $\xi =0$ is that not all smooth functions are contained in the above class! Note that linear functions do not have a Fourier transform that defines a measure, but do when restricted to a bounded domain.
- The measure $\mathbb{P}$ is function dependent. It gives a weight (distribution) over the various frequency components of the function itself.
- In the way that the Fourier transform is rewritten as an infinite-width neural network, it can be seen that the weights play the role of frequencies and the biases play the role of thresholding for the FTC (as in the introduction of this paper)!

## 1.2 Banach Spaces of Functions Represented by Infinite Width Shallow Networks

Barron's result yields two important insights:
1. Infinite-width shallow networks correspond to integral representations of functions
2. One can sample such infinite-width networks to obtain finite-width network approximations of functions (see the sketch below)
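To connect these two insights, here is a small Monte Carlo sketch (my own illustration, using the univariate FTC representation from Section 1.1 rather than Barron's Fourier representation): sampling the hidden units of the integral representation i.i.d. and averaging gives a finite-width network whose $L^2$ error decays like $O(1/\sqrt{h})$, mirroring the rate in Barron's bound.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

# Infinite-width representation (cf. Section 1.1):
#   f(x) = f(0) + f'(0) x + E_{b ~ Unif[0,1]}[ f''(b) relu(x - b) ]
# Sampling b_1, ..., b_h i.i.d. and averaging is a width-h network.
f, df0, d2f = (lambda x: np.sin(3 * x)), 3.0, (lambda b: -9 * np.sin(3 * b))
xs = np.linspace(0.0, 1.0, 200)

for h in [10, 100, 1000, 10000]:
    errs = []
    for _ in range(20):                                    # average over independent draws of the network
        b = rng.uniform(0.0, 1.0, size=h)                  # sampled hidden units
        net = f(0.0) + df0 * xs + (d2f(b)[None, :] * relu(xs[:, None] - b[None, :])).mean(axis=1)
        errs.append(np.sqrt(np.mean((net - f(xs)) ** 2)))  # empirical L^2 error
    print(h, np.mean(errs))                                # decays roughly like h**(-1/2)
```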
# 2. Deep Networks

A deep, $L$-hidden-layer neural network is a linear combination of compositions of affine maps with a univariate **activation function** $\sigma$, which we denote by $f_{NN}: \mathbb{R}^n \to \mathbb{R}$:

$$\begin{aligned} f_{NN}(x; \space \theta) &:= {w_{L+1}}^T \sigma({W_{L}}^T \sigma({W_{L-1}}^T \sigma(\cdots {W_2}^T \sigma({W_1}^Tx + b_1) + b_2 \cdots) + b_{L-1}) + b_{L}) \\ &= w_{L+1}^T(\sigma \circ A_L \circ \sigma \circ A_{L-1} \circ \sigma \circ \cdots \circ A_2 \circ \sigma \circ A_1)(x) \end{aligned}$$

This yields a family of functions parameterized by $\theta = \{W_1 \in \mathbb{R}^{n \times h_1}, W_2 \in \mathbb{R}^{h_1 \times h_2}, \cdots, W_{L-1} \in \mathbb{R}^{h_{L-2} \times h_{L-1}}, W_{L} \in \mathbb{R}^{h_{L-1} \times h_{L}}, w_{L+1} \in \mathbb{R}^{h_L}, b_1 \in \mathbb{R}^{h_1}, b_2 \in \mathbb{R}^{h_2}, \cdots, b_{L-1} \in \mathbb{R}^{h_{L-1}}, b_L \in \mathbb{R}^{h_L}\}$ where, again, the activation function is understood to be applied coordinate-wise to the $h_k$-dimensional vectors of the hidden layers. We refer to this family of classes as $\mathcal{F}_{NN}^{h,L} := \{f_{NN}(\cdot \space; \space \theta) \space | \space \forall \space \theta \}$, where $h = (h_1, \cdots, h_L)$ is now a [[Multi-Index Notation|multi-index]]. Sometimes we will write $\mathcal{F}_{NN}(h,L)$ in order to reduce notational clutter and when the domain is obvious. Here the number of parameters of the network is $N = h_1(n+1) + \sum_{k=2}^{L} h_k (h_{k-1} + 1) + h_L$.

## 2.1 Depth Separations

We will be mainly focused on results which display exponential separation: showing that there are deep networks for which any shallow(er) network would require exponential width. In what sense the networks are "deep" and "shallower" will be made precise in the context of the results presented in this section.

### 2.1.1 Telgarsky (2015) & Telgarsky (2016)

Summary: Exhibits an $O(2^k)$-Lipschitz sawtooth function that a $k$-layer, constant-width univariate ReLU network can represent exactly (and hence approximate efficiently), but that no polynomial-width univariate ReLU network of depth $o(\frac{k}{\log{k}})$ can approximate. This implies exponential separation. In particular, the paper establishes a lower bound on the width of shallow networks approximating highly oscillatory, non-smooth functions by showing that each "oscillation" requires many nodes to be approximated. (A minimal numerical sketch of the sawtooth construction is given below.)

### 2.1.2 Yarotsky (2017)

### 2.1.3 Eldan & Shamir (2015)

This paper demonstrates a *tight* exponential separation (in the sense that it is for a 1-layer difference) by constructing a radial function $f: \mathbb{R}^n \to \mathbb{R}$ expressible by a 2-hidden-layer network with a number of parameters polynomial in the input dimension ($N \lesssim O(\text{poly}(n))$), but which has at least constant approximation error for any shallow network of sub-exponential width ($h \lesssim O(e^n)$), where the approximation error is with respect to a constructed probability measure.

The intuition behind their proof is a sort of converse of Barron's result in the following sense: the Fourier transform of a shallow network is concentrated around the origin, unless the number of hidden units is exponential in the input dimension ($h \gtrsim \Omega(e^n)$). Indeed, Barron spaces demonstrate the sufficiency of sparsity in the Fourier domain for approximability by shallow networks. The consequence is that any function with significant high-frequency content would require a shallow network with an exponential (in $n$) number of parameters.
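Before turning to the Eldan & Shamir construction, here is a minimal numerical sketch (my own illustration) of the sawtooth construction summarized in §2.1.1: composing a width-2 ReLU "tent" block with itself $k$ times uses only $O(k)$ parameters yet produces on the order of $2^k$ linear pieces, while a 1-hidden-layer ReLU network with $h$ units is piecewise linear with at most $h$ breakpoints and so needs width exponential in $k$ to match the oscillations.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def tent(x):
    """A width-2 ReLU block computing the 'tent' map on [0,1]."""
    return 2 * relu(x) - 4 * relu(x - 0.5)

def sawtooth(x, k):
    """k-fold composition: a depth-k, width-2 ReLU network."""
    for _ in range(k):
        x = tent(x)
    return x

xs = np.linspace(0.0, 1.0, 4097)
for k in [2, 4, 8]:
    ys = sawtooth(xs, k)
    # count slope sign changes as a proxy for the number of kinks
    kinks = int(np.sum(np.diff(np.sign(np.diff(ys))) != 0))
    print(k, kinks)   # grows like 2**k, with only O(k) parameters in the network
```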
The following proof sketch outlines the intuition of Eldan & Shamir's paper:

1. **3-layer networks can approximate radial functions**
    - By the above construction we have a shallow ReLU network that approximates $x_k \mapsto x_k^2$ coordinate-wise on an arbitrary bounded domain
    - Alter the initial layer to compute $r(x) := ||x||^2 = \sum_{k=1}^n x_k^2$ by summing the previous layer's neurons (i.e. the squared components)
    - The final layer approximates an arbitrary univariate function taking the norm as input, $g(x) = \tilde{g}(||x||)$
2. **Construct a radial function $f$ with constant lower bound on $L^2$ approximation error for shallow networks $g_{NN} \in \mathcal{F}_{NN}^{h,1}$**
    - Take $d\mathbb{P} := \mathcal{F}^{-1}(\chi_{_{B(0,1)}})\space dx$, and by Plancherel's theorem, $||f-g_{NN}||_{L^2(\mathbb{P})} = ||\hat{f} \ast \chi_{_{B(0,1)}} - \widehat{g_{NN}} \ast \chi_{_{B(0,1)}}||_{L^2(\lambda)}$, where $\lambda$ is the Lebesgue measure
    - $g_{NN}$ is constant in directions orthogonal to the rows of the weight matrix $W_1$ and has Fourier transform in those corresponding directions in the frequency domain equal to the Dirac delta function
    - By linearity, $\widehat{g_{NN}}$ has support contained in the union of the 1-D subspaces spanned by the rows of $W_1$
    - The support of $\widehat{g_{NN}} \ast \chi_{_{B(0,1)}}$ is contained in the union of infinite cylinders with principal axes parallel to the rows of $W_1$
    - As $r := ||\xi||^2 \to \infty$ the fraction of space contained in such tubes goes to zero exponentially in the dimension $n$
    - Construct $f$ so that $\hat{f}$ contains significant mass (in the $L^2$ sense) outside a ball of radius $R$ and such that $\hat{f} \ast \chi_{_{B(0,1)}}$ also has mass far from the origin (on high frequencies). (**The difficult part of the proof!!** Convolving with the indicator of the unit ball increases the $L^1$ norm but not necessarily the $L^2$ norm; use Young's inequality or consider the example of the indicator of $[-1,1]$ and the Heaviside function.)
    - Take a random superposition of indicators of epsilon-shells of width $O(\frac{1}{n^d})$, such that $d\mathbb{P}$ is approximately constant on each shell, i.e. take $f(x) = \sum_{k=1}^N \varepsilon_k \chi_{\{r_k - \frac1{n^d} \leq ||x|| \leq r_k + \frac1{n^d}\}}$ where $f_k \space d\mathbb{P} \approx c_k f_k \space dx$
    - Then $\hat{f} \ast \chi_{_{B(0,1)}} = \sum_{k=1}^N \varepsilon_k c_k \mathcal{F}(\chi_{\{r_k - \frac1{n^d} \leq ||x|| \leq r_k + \frac1{n^d}\}})$, where the $\mathcal{F}(\chi_{\{r_k - \frac1{n^d} \leq ||x|| \leq r_k + \frac1{n^d}\}})$ are Bessel functions with sufficient mass away from the origin
    - Choose $\{\varepsilon_k\}_{k=1}^N$ so that $f$ also has sufficient mass away from the origin

> [!theorem]+ Eldan & Shamir
>
> Given a Lebesgue measurable activation function $\sigma: \mathbb{R} \to \mathbb{R}$ that satisfies
> 1. (Polynomial growth): $\space|\sigma(x)| \leq C_1 (1+ |x|^\alpha) \space\space \forall \space x \in \mathbb{R}$
> 2. (Universal univariate approximation): $\exists \space C_2 \geq 1 \space \forall \varepsilon > 0$ and $\space f: \mathbb{R} \to \mathbb{R}$ L-Lipschitz with $\text{sppt}(f) = [-M,M]$ $\exists \space s_{NN} \in \mathcal{F}_{NN}^{w,1}(\mathbb{R})$ with $w \leq \frac{C_2 M L}\varepsilon$ and $||f - s_{NN}||_{L^\infty(\mathbb{R})} \leq \varepsilon$
>
> for some constants $\alpha, C_1, C_2$. Then $\exists \space \delta, n_0 > 0$ such that $\forall \varepsilon > 0, n > n_0 \space \exists \space \mathbb{P}$ a probability measure on $\mathbb{R}^n$ and $f: \mathbb{R}^n \to \mathbb{R}$ with
> 1. $f$ bounded in $[-2,2]$
> 2. $\text{sppt}(f) \subset B(0, n_0 \sqrt{n})$
> 3. $\exists \space f_{NN} \in \mathcal{F}_{NN}(3 n_0 n^{19/4}, 2)$ with $||f - f_{NN}||_{L^\infty(\mathbb{R}^n)} \leq \varepsilon$
> 4. $\forall \space g_{NN} \in \bigcup_{h \leq Ce^{Cn}} \mathcal{F}_{NN}^{h,1}$, $||f - g_{NN}||_{L^2(\mathbb{P})} \geq \delta$

Discussion:
1. Is depth important for efficiently approximating high-dimensional functions with symmetries? Can these symmetries always be defined in terms of compositions?
2. Are depth separations *fully* captured by compositions of functions with sufficient high-frequency content? Can we frame these results in a more general framework of how composition commutes (?) with the Fourier transform?

### 2.1.4 Venturi, Jelassi, Ozuch & Bruna (2021)

This paper extends the depth-separation result of Eldan & Shamir to a class of piecewise complex exponential functions with oscillations of order $O(\text{poly}(n))$ that can be approximated at $O(\text{poly}(n))$ rates by 2-hidden-layer networks, but which have a constant lower bound on the approximation error, with respect to heavy-tailed probability measures, for $O(\text{poly}(n))$-width shallow networks. Importantly, they show that polynomial oscillation and heavy-tailed measures are necessary conditions. The authors also identify the approximation domain as a source of the gap between the upper and lower bounds, and prove tight (how tight?) bounds for approximation of functions on the unit sphere $\mathbb{S}^{n-1}$: they prove sufficiency of a sparse spherical harmonic decomposition for $O(\text{poly}(n))$ approximation rates, as well as sufficiency of sufficiently large high-order coefficients in the spherical harmonic decomposition to guarantee that efficient shallow approximation cannot hold.

We first define the class of adversarial functions and error measures. For both we assume there is some universal constant $C \in (0, 1]$. We take $\mathcal{F}_{k,C}$ as the functions $f: \mathbb{R}^n \to \mathbb{C}$ with the following properties:
1. (Piecewise complex exponential): $$f(x) := e^{2\pi i \omega (v_0^T x + \sigma_{\text{ReLU}}(v_1^Tx))}$$
2. (Oscillations growing polynomially): $\exists \space k > 0$ $$\sup_{S \subseteq \{1,\cdots, n\}} ||v_0 + v_1 \chi_{_S}(v_1)||_{l_\infty} = \Theta(n^k)$$ where the indicator is applied coordinate-wise.
3. (Has sufficient mass away from the origin in the Fourier domain): $\exists \space \gamma > 0$ $$\frac{\big|\{k \in \{1, \cdots, n \} \space | \space \omega |v_{1,k}| \geq \gamma n^2 \}\big|}{n} \geq C$$

We next define the class of measures that will be used to define the approximation error lower bounds; in particular, we define the class of measures $$\mathcal{M}_C := \{\mu \space \text{a probability measure with density} \space \varphi^2 \space\}$$ where $\varphi$ has the following properties:
1. (Defines a product measure): $$\varphi(x) := \prod_{j=1}^n \psi(x_j)$$
2. (Has a Fourier transform): $$\space \psi \in L^2(\mathbb{R}) \space \cap \space L^1(\mathbb{R})$$
3. (Has unit norm): $$||\psi||_{L^2} = 1$$
4. (Has compactly supported Fourier transform): $$\text{sppt}(\hat{\psi}) \subset [-K, K]$$
5. (Has sufficient mass away from the origin): $$\begin{aligned} \sqrt{\frac1{2K}} \leq ||\psi||_{L^1} &< \sqrt{\frac{2}{K}} \\ ||\psi||_{L^1} &< \sqrt{\frac{1}{2^{(1-2C)/2} K}} \end{aligned}$$
6. (Has heavy-tails): $$|\psi(x)| = O(\frac1{|x|})$$

Their main result is as follows:

> [!theorem]+ Venturi, Jelassi, Ozuch & Bruna
>
> Given an activation function $\sigma: \mathbb{R} \to \mathbb{R}$ that satisfies
> 1. ($L_{\sigma}$-Lipschitz and bounded at $0$): $\sigma(0) \leq L_{\sigma}$
> 2. (Universal univariate approximation with bounded weights (??)): $\exists \space C \geq 1 \space \forall \space \varepsilon > 0$ and $\space f: \mathbb{R} \to \mathbb{R}$ L-Lipschitz with $\text{sppt}(f) = [-M,M]$ $\exists \space s_{NN} \in \mathcal{F}_{NN}^{w,1}(\mathbb{R})$ with $w + \max_{l,k} ||w_{l,k}||_{l_\infty} \leq \frac{C M L}\varepsilon$ and $||f - s_{NN}||_{L^\infty(\mathbb{R})} \leq \varepsilon$
>
> for some constants $L_\sigma, C$. Then $\forall \space f \in \mathcal{F}_{k,C}, \mu \in \mathcal{M}_C$ $\exists \space \alpha \in (0,1)$ such that
> 1. $\forall \varepsilon > 0 \space \exists \space f_{NN} \in \mathcal{F}_{NN}(n^{2(1+k)}\varepsilon^{-3/2}, 2)$ with $||f - f_{NN}||_{L^2(\mu)}^2 \leq \varepsilon$
> 2. $\inf_{f_{NN} \in \mathcal{F}_{NN}^{h,1}}||f - f_{NN}||_{L^2(\mu)}^2 \geq 1-h\alpha^n \cdot O(n^{k+1})$

### 2.1.5 Poggio (2017)

# 3. Approximation of Lipschitz Functions

## 3.1 Random Features

A random features model is a shallow network with random hidden parameters, i.e. a linear combination of $h$ functions of the form $x \mapsto \sigma(w_{1,k}(\omega)^T x + b_{1,k}(\omega))$ drawn independently from a distribution $\mathbb{P}_\theta$, i.e. $\{(w_{1,k}, b_{1,k}) \in \mathbb{S}^{n-1} \times \mathbb{R}\}_{k=1}^h$ is a collection of i.i.d. random variables with probability law $\mathbb{P}_{\theta}$. The random features model is therefore given by

$$\begin{aligned} f_{NN}(\omega, x; \space \theta) &:=\sum_{k=1}^{h} w_{2,k}\sigma({w_{1,k}}(\omega)^{T}x+b_{1,k}(\omega)) \end{aligned}$$

which yields the family of *random* functions parameterized by $\theta = \{W_1 \in \mathbb{R}^{n \times h}, w_2, b_1 \in \mathbb{R}^h\}$, $\mathcal{F}_{NN}^{h,1}(\mathbb{R}^n, \mathbb{P}_\theta) = \{f_{NN}(\cdot \space; \space \theta): \mathbb{R}^n \to \mathbb{R} \space | \space \forall \space \theta \}$. We refer to this family with $\sigma_{\text{ReLU}}$ as a random ReLU network.

### 3.1.1 Hsu, Sanford, Servedio & Gkaragkounis (2017)

This paper provides polynomially-tight upper and lower bounds for $L^2(\lambda)$-approximation of $\Theta(1)$-Lipschitz functions on the unit hypercube by random ReLU networks. In particular, they provide a lower bound on the width of the network required for polynomial approximation rates. Their result is stated below:

> [!theorem]+ Hsu, Sanford, Servedio & Gkaragkounis
>
> Given any $\varepsilon, \delta, L > 0$ with $\varepsilon \leq \frac{L}{2}$ and $h_0 = e^{\min \left\{ \frac{L^2}{\varepsilon^2} \log{\left(\frac{n \varepsilon^2}{L^2} + 2 \right)}, \space n \log{\left(\frac{L^2}{n \varepsilon^2} + 2 \right)} \right\}}$
>
> 1. $\forall \space f: [-1,1]^n \to \mathbb{R}$ $L$-Lipschitz $\exists \space f_{NN} \in \mathcal{F}_{NN}^{h,1}([-1,1]^n, \mathbb{P}_\theta)$ with $h \gtrsim O(h_0)$ and $$\mathbb{P}_\theta\left(\left\{||f - f_{NN}||_{L^2(\mu)}^2 \leq \varepsilon \right\}\right) \geq 0.9$$
> 2. $\exists \space f: [-1,1]^n \to \mathbb{R}$ $L$-Lipschitz $\forall \space f_{NN} \in \mathcal{F}_{NN}^{h,1}([-1,1]^n, \mathbb{P}_\theta)$ with $h \gtrsim \Omega(h_0)$ and $$\mathbb{P}_\theta\left(\left\{||f - f_{NN}||_{L^2(\mu)}^2 \leq \varepsilon \right\}\right) \geq \frac12$$
> 3.

The proof of the upper bound (1) follows by first approximating an $L$-Lipschitz function $f$ by a low-degree trigonometric polynomial $P$ via the Fourier transform of $f$; the bounded Lipschitz constant implies that the Fourier coefficients decay rapidly. Next, $P$ can be represented exactly by an infinite mixture of ReLUs by noting that each sinusoidal component of $P$ is a ridge function. Finally, the result holds by a concentration argument, showing that the empirical average of random ReLUs gives a close approximation to $P$ with high probability.

Specifically, $P$ is constructed by taking the following orthonormal basis of $L^2([-1,1]^n)$:

$$e_k(x) := \begin{cases} 1 & k = (0,\cdots,0) \\ \sqrt{2} \sin{(\pi k^Tx)} & k \in \mathcal{I}_{\text{odd}} \\ \sqrt{2} \cos{(\pi k^Tx)} & k \in \mathcal{I}_{\text{even}} \end{cases}$$

with the index sets defined over the [[Multi-Index Notation|multi-indices]] $k \in \mathbb{Z}^n$

$$\begin{aligned} \mathcal{I}_\text{odd} := \{k \in \mathbb{Z}^n \space | \space k_i > 0, \space i = \min\{j \in \{1, \cdots, n\} \space | \space k_j \neq 0\} \} \\ \mathcal{I}_\text{even} := \{k \in \mathbb{Z}^n \space | \space k_i < 0, \space i = \min\{j \in \{1, \cdots, n\} \space | \space k_j \neq 0\} \} \end{aligned}$$

Given an $L$-Lipschitz function $f$, one first approximates it by $\tilde{f} \in C_c^\infty([-1,1]^n)$ with Lipschitz constant at most $L$, which implies $||\nabla f(x)|| \leq L \space \forall \space x \in [-1,1]^n$, and so the coefficients of $f = \sum_{k \in \mathbb{Z}^n} \langle f, e_k\rangle e_k$ are bounded by $L$:

$$\begin{aligned} L^2 &\geq \int_{[-1,1]^n} || \nabla f(x) ||^2 \space dx \\ &= \sum_{i=1}^n \int_{[-1,1]^n} \left( \sum_{k \in \mathbb{Z}^n} \langle f, e_k\rangle \partial_{x_i} e_k \right)^2 \space dx \\ &= \sum_{i=1}^n \sum_{k \in \mathbb{Z}^n} \big| \langle f, e_k\rangle \big|^2 \big|\big| \partial_{x_i} e_k \big|\big|_{L^2([-1,1]^n)}^2 \\ &= \sum_{i=1}^n \sum_{k \in \mathbb{Z}^n} k_i^2 \pi^2 \big| \langle f, e_k\rangle \big|^2 \\ &= \pi^2 \sum_{k \in \mathbb{Z}^n} \big| \langle f, e_k\rangle \big|^2 \big|\big| k \big|\big|^2 \\ \end{aligned}$$

which follows by orthogonality of the $\partial_{x_i} e_k$ and the fact that $\partial_{x_i} f \in L^2([-1, 1]^n)$. From this we see that $\big|\langle f, e_k\rangle \big| \leq \frac{L}{\pi} \leq \frac{L}{2}$ for $k \neq 0$. We then take our approximation to be the trigonometric polynomial with coefficients indexed by lattice points lying in some ball of radius $r$, $P(x) := \sum_{k \in B_r(\mathbb{Z}^n)} \langle f, e_k\rangle e_k$. Therefore, the error for this *linear* approximation is

$$\begin{aligned} ||f-P||_{L^2([-1,1]^n)}^2 &= \sum_{k \in \mathbb{Z}^n \setminus B_r(\mathbb{Z}^n)} \big|\langle f, e_k\rangle \big|^2 \\ &\leq \sum_{k \in \mathbb{Z}^n \setminus B_r(\mathbb{Z}^n)} \big|\langle f, e_k\rangle \big|^2 \frac{||k||^2}{r^2} \\ &\leq \frac{L^2}{\pi^2 r^2} \\ &\leq \left( \frac{L}{2 r} \right)^2 \end{aligned}$$

We then take $r = \frac{L}{2 \varepsilon}$ to prove the upper bound. We immediately see that the dimension-dependent bound originates from the curse of dimensionality of this linear approximation, determined by the number of lattice points lying in a ball in $\mathbb{Z}^n$. The lower bound (2) is shown similarly to Barron's (using $n$-widths?): in essence, given enough highly oscillatory, orthonormal functions in $L^2([-1,1]^n)$, no $h$-dimensional subspace can approximate them all well.
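The random-features mechanism itself is easy to exercise numerically; a minimal, hypothetical sketch (the sampling distribution, with directions uniform on the sphere and biases uniform on $[-1,1]$, is an assumption for illustration and not necessarily the law $\mathbb{P}_\theta$ used in the paper): only the outer weights are fit, by least squares, while the hidden parameters stay at their random draw.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)

# target: a 1-Lipschitz function on [-1,1]^n
n, m = 5, 2000
f = lambda X: np.linalg.norm(X, axis=1)
X = rng.uniform(-1.0, 1.0, size=(m, n))
y = f(X)

for h in [10, 100, 1000]:
    W1 = rng.normal(size=(n, h))
    W1 /= np.linalg.norm(W1, axis=0, keepdims=True)   # random directions on the sphere S^{n-1}
    b1 = rng.uniform(-1.0, 1.0, size=h)               # random biases
    Phi = relu(X @ W1 + b1)                           # (m, h) random ReLU features
    w2, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # fit only the outer weights w2
    err = np.sqrt(np.mean((Phi @ w2 - y) ** 2))       # empirical L^2 error on the sample
    print(h, err)
```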
### 3.1.2 Bubeck & Sellke (2021)

This paper demonstrates a universal tradeoff between the number of parameters and the ability to approximately fit data by $O(1)$-Lipschitz functions. In particular, they prove that overparameterization by a factor of $n$ is *necessary* for approximate interpolation by a function with sufficient regularity, under the assumption of a Lipschitz parameterization of the model class with parameters bounded by $O(\text{poly}(n, m))$, where $m$ is the number of datapoints. The result hinges on concentration of measure for Lipschitz functions in high dimensions, in particular that Lipschitz functions of suitably distributed (isoperimetric) random vectors are subgaussian around their means. The formal result is as follows:

> [!theorem]+ Bubeck & Sellke
> Given $\{(X_k, Y_k) \in \mathbb{R}^n \times [-1,1]\}_{k=1}^m$ *i.i.d.* samples with distribution $\mathbb{P}$ such that
> 1. ($\mathbb{P}_x$ is $c$-isoperimetric for Lipschitz functions): $\forall \space f: \mathbb{R}^n \to \mathbb{R}$ $L$-Lipschitz, $\varepsilon \geq 0$ $$\mathbb{P}_x\left(\left\{|f(X) - \mathbb{E}[f(X)]| \geq \varepsilon\right\}\right) \leq 2 e^{-\frac{n}{2c}(\frac{\varepsilon}{L})^2}$$
> 2. (Lipschitz parameterization with bounded parameters): $$\mathcal{F} = \left\{f_\theta: \mathbb{R}^n \to \mathbb{R} \space \Big| \space |f_{\theta_1}(x) - f_{\theta_2}(x)| \leq J|| \theta_1 - \theta_2|| \space \text{and} \space ||\theta|| \leq O(\text{poly}(n,m)) \space \forall \space \theta, \theta_1, \theta_2 \in \Theta \subseteq \mathbb{R}^p\right\}$$
> 3. (Noisy data): $$\sigma^2 := \mathbb{E}_x[\text{Var}(Y \space | \space X)] > 0$$
>
> then
> $$\mathbb{P}\left(\left\{ \exists \space f \in \mathcal{F} \space L\text{-Lipschitz with} \space \frac1m \sum_{k=1}^m (f(X_k) - Y_k)^2 \leq \sigma^2 - \varepsilon \right\} \right) \leq 4 e^{-\frac{m \varepsilon^2}{8^3}} + 2e^{\log{|\mathcal{F}|} - \frac{mn \varepsilon^2}{9^4 c L^2}}$$
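The isoperimetry assumption is also easy to visualize numerically; a small sketch (my own illustration): for points uniform on the sphere of radius $\sqrt{n}$, a $1$-Lipschitz function has a range that grows like $\sqrt{n}$ but fluctuations around its mean that stay $O(1)$, independent of the dimension, which is exactly the concentration that forces interpolating noisy data to cost a large Lipschitz constant.

```python
import numpy as np

rng = np.random.default_rng(3)

def sphere_sample(m, n):
    """m points uniform on the sphere of radius sqrt(n) in R^n."""
    G = rng.normal(size=(m, n))
    return np.sqrt(n) * G / np.linalg.norm(G, axis=1, keepdims=True)

# a 1-Lipschitz test function; its range over the sphere is [-sqrt(n), sqrt(n)] ...
f = lambda X: X[:, 0]

for n in [10, 100, 1000]:
    vals = f(sphere_sample(10000, n))
    # ... but its fluctuations around the mean stay O(1), independent of n
    print(n, float(np.std(vals)))
```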