The field of artificial intelligence has seen a revolution led primarily by the application of neural networks (NNs) to problems in applied mathematics, statistics, and engineering. Over the last seven decades there have been major contributions leading to the empirical success of neural networks, yet the theoretical work has been completed piecemeal and has been largely overshadowed by the capabilities of this class of algorithms in practice. Neural networks are commonly referred to as "black boxes" in the sense that they serve as replacements for input-output mappings, i.e. functions, and therefore a full understanding of them necessitates a mathematical understanding of their ability to compute accurate function approximations. This thesis aims to collect and distill some of the major results in the approximation-theoretic literature on neural networks, identifying common themes, proof techniques, and major intuitions gained about neural networks as purely mathematical objects.

As a first step, the classical results on Universal Approximation for shallow networks are presented. This begins with qualitative results demonstrating that certain shallow neural network architectures are dense in the space of continuous functions and culminates with Barron's seminal work, which gives us both our initial look at quantitative approximation results as well as the viewpoint that infinite-width shallow neural networks are integral transforms. Indeed, Barron's quantitative result is an upper bound on the approximation error incurred by sampling such an integral representation. The section closes with a discussion of Banach spaces of functions amenable to Barron's approximation result and how these spaces relate to dimension-independent approximation rates.

The next section introduces deep neural networks (DNNs). The primary question being investigated is the role depth plays in the context of approximation. As a first step in this direction, we present a collection of results known as depth separations, i.e. constructions of functions which can be efficiently approximated by deep networks but which require exponentially wide shallow(er) networks for efficient approximation. Two key intuitions developed in this section are that 1) the Fourier transform of shallow networks is supported on a collection of 1-D subspaces, providing insight into the types of functions that shallow networks cannot efficiently approximate, and 2) deep networks can more efficiently approximate compositional functions. In addition, we see that a common method for deriving approximation upper bounds is to first approximate using a classical model class (orthogonal polynomials, Taylor polynomials, Fourier bases, etc.) and then show that these basis elements can themselves be approximated by neural networks. We discuss how this leads to exponential Lipschitz constants for the approximating networks. The section also introduces lower bounds on the approximation error, leading to a discussion of the continuity of parameter selection and nonlinear $n$-widths.

The third and final section unifies the prior two sections by identifying open questions related to the limitations of the proof techniques and results proved thus far. In particular, generalizing depth separations to wider classes of functions beyond radial and piecewise oscillatory functions is discussed by analyzing the Fourier transform of compositional functions and deep networks. Achieving measure-independent lower bounds and the hurdles of approximating functions with large Lipschitz constants are also discussed.
It is known that some nonlinear methods for approximation outperform any linear method. For example, functions from a suitable Sobolev space can be approximated by piecewise constant functions whose parameters are, in this case, free (non-uniformly spaced) breakpoints, at a rate that no linear method can match: any fixed $n$-dimensional linear space is subject to a strictly slower lower bound. Nonlinear widths give a framework to quantify the sense (via lower bounds) in which the approximation rates for nonlinear methods are optimal.
Given a Banach space $X$, we consider the manifold of parameterizations of elements in $X$, namely $\mathcal{M}_N := \{M(\theta) \space | \space \theta \in \mathbb{R}^N\}$ for a parameterization map $M: \mathbb{R}^N \to X$. The approximation error is defined as $$E(f, \mathcal{M}_N)_X := \inf_{\theta \in \mathbb{R}^N} ||f - M(\theta)||_X$$ and we say that $M(\theta^*)$ is a near-best approximation of $f$ if there exists a constant $C \geq 1$ such that $||f - M(\theta^*)||_X \leq C \space E(f, \mathcal{M}_N)_X$. Likewise, for a set $K \subset X$, we measure the worst-case approximation error $$E(K, \mathcal{M}_N)_X := \sup_{f \in K} E(f, \mathcal{M}_N)_X$$
We want to choose $\mathcal{M}_N$ so as to minimize this error for any $f \in K$; unfortunately, taking the infimum over all such manifolds is trivially zero, as one can always, given a dense subset of $K$, construct a space-filling curve (in $X$).
To circumvent this, one places conditions on how the approximation is chosen, i.e. on how $\theta$ is chosen for a particular $f$. This can be defined by a map $a: K \to \mathbb{R}^N$. If $K$ is compact and $a$ is continuous w.r.t. some norm on $\mathbb{R}^N$, then we call $a$ a continuous selection. Given this definition, we define the approximation error for a nonlinear method, which now depends on the choice of selection, $$E(f, a, M)_X := ||f - M(a(f))||_X$$ and the continuous nonlinear width, which is the best possible approximation amongst all manifolds and continuous selections: $$\delta_N(K)_X := \inf_{M, \space a} \sup_{f \in K} ||f - M(a(f))||_X$$
The width $\delta_N(K)_X$ is a non-increasing function of $N$, and by definition $\delta_N(K)_X \leq \sup_{f \in K} ||f - M(a(f))||_X$ for any fixed admissible pair $(M, a)$.
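To make the flavor of this gap concrete, here is a minimal Python sketch (my own illustration, not the precise Sobolev-space statement above): approximating $\sqrt{x}$ on $[0,1]$ by $n$-piece piecewise constants, free (adaptive) breakpoints achieve an $O(n^{-1})$ sup-norm error, while uniformly spaced breakpoints, which mimic a fixed linear method, stall at $O(n^{-1/2})$.

```python
import numpy as np

def sup_error_piecewise_const(f, breaks):
    """Sup-norm error of the best piecewise-constant approximation of an
    increasing function f with the given breakpoints: on each cell the best
    constant is the midpoint of f's range there, so the overall error is half
    of the largest increment of f across a cell."""
    vals = f(np.asarray(breaks))
    return 0.5 * np.max(np.diff(vals))

f = np.sqrt  # increasing, with unbounded derivative at 0
for n in [4, 16, 64, 256]:
    uniform = np.linspace(0.0, 1.0, n + 1)     # fixed, uniformly spaced knots
    free = np.linspace(0.0, 1.0, n + 1) ** 2   # free knots chosen to equalize the increments of sqrt
    print(n, sup_error_piecewise_const(f, uniform), sup_error_piecewise_const(f, free))
# uniform knots: error ~ 0.5 * n**(-1/2);  free knots: error = 0.5 * n**(-1)
```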
# 1. Shallow Networks
## 1.1 Universal Approximation
A shallow network, also called a 2-layer neural network or a 1-hidden-layer neural network, is a linear combination of compositions of affine maps with a univariate **activation function** $\sigma$, which we denote by $f_{NN}: \mathbb{R}^n \to \mathbb{R}$:
$$f_{NN}(x; \space \theta) := w_2^T\sigma(W_1^Tx + b_1) = \sum_{k=1}^{h} w_{2,k}\,\sigma({w_{1,k}}^{T}x+b_{1,k})$$
This yields a family of functions parameterized by $\theta = \{W_1 \in \mathbb{R}^{n \times h}, w_2 \in \mathbb{R}^{h}, b_1 \in \mathbb{R}^{h}\}$, where the activation function is understood to be applied coordinate-wise to the $h$-dimensional vector of the hidden layer. We refer to this class as $\mathcal{F}_{NN}^{h,1} := \{f_{NN}(\cdot \space; \space \theta) \space | \space \forall \space \theta \}$. Note that $\{x \mapsto w^Tx + b \space | \space w \in \mathbb{R}^n, b \in \mathbb{R}\}$ is the space of 1-D affine maps on $\mathbb{R}^n$, so each hidden unit is a ridge function. Typically, the domain is restricted to a bounded subset $U \subset \mathbb{R}^n$, and we consider the functions $f_{NN}\big|_U$.
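The following is a minimal NumPy sketch of the family $\mathcal{F}_{NN}^{h,1}$ just defined (the parameter names mirror the notation above; the choice of $\tanh$ as activation is only for illustration):

```python
import numpy as np

def shallow_net(x, W1, b1, w2, activation=np.tanh):
    """Shallow (1-hidden-layer) network f_NN(x; theta) = w2^T sigma(W1^T x + b1).

    x  : (n,) input vector
    W1 : (n, h) inner weights,  b1 : (h,) biases,  w2 : (h,) outer weights
    The activation is applied coordinate-wise to the h-dimensional hidden vector.
    """
    hidden = activation(W1.T @ x + b1)   # (h,) hidden layer
    return w2 @ hidden                   # scalar output

# example: a width-3 network on R^2
rng = np.random.default_rng(0)
n, h = 2, 3
W1, b1, w2 = rng.normal(size=(n, h)), rng.normal(size=h), rng.normal(size=h)
print(shallow_net(np.array([0.5, -1.0]), W1, b1, w2))
```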
The initial question we aim to resolve is identifying the widest class of functions that can be effectively expressed by this family. The ability to express (or approximate) functions of a given class is equivalent to the notion of density in that class. Furthermore, “effectively” refers to identifying the rate at which the largest class of functions can be approximated by $\mathcal{F}_{NN}^{h,1}$ with respect to the number of parameters $N = h(n+2)$. Our eventual concern will be to understand this question in the context of networks with an arbitrary number of hidden layers and to understand how this rate depends on the depth of the network. Throughout, I will refer to functions with exact representation by a neural network as being represented by a network and functions that can be approximated by a neural network as being expressed by a network.
I begin by presenting some elementary examples of approximation of univariate functions by shallow neural networks of both finite and infinite width, due to Telgarsky:
Simple functions can be represented by a shallow neural network with step function activations. Note that no polynomial can approximate such functions uniformly to arbitrary accuracy (? quantify the approximation rate of a flat region by a polynomial of order ! ?). In particular, neural networks with step function activations are simple functions built from indicators of half-lines: $$f_{NN}(x) = \sum_{k=1}^{h} w_{2,k} \space \chi_{\{w_{1,k}x + b_{1,k} \geq 0\}}(x)$$ (? maybe add picture here ?)
L-Lipschitz functions can be represented by an infinite-width shallow neural network with step activations. By the Fundamental Theorem of Calculus (Lipschitz functions are absolutely continuous), we can represent any such function on $[0,1]$ by an infinite-width neural network: $$f(x) = f(0) + \int_0^x f'(b) \space db = f(0) + \int_0^1 f'(b) \space \chi_{\{x - b \geq 0\}}(x) \space db$$ where $f'(b) \space db$ defines a (signed) measure over step functions. The above infinite-width network gives an average-case estimate (why and what is the estimate?) of $f$.
Twice continuously differentiable functions can be represented by an infinite-width shallow neural network with ReLU activations. Again by the Fundamental Theorem of Calculus and integrating by parts, we can represent any such function on $[0,1]$ by an infinite-width neural network: $$f(x) = f(0) + f'(0) \space x + \int_0^1 f''(b) \space \sigma_{\text{ReLU}}(x - b) \space db$$ where $f''(b) \space db$ defines a (signed) measure over ReLUs.
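As a sanity check of the ReLU representation above, here is a small NumPy sketch (my own illustration) that discretizes the integral by a Riemann sum, i.e. builds a wide finite network whose hidden units sit at the partition points and whose outer weights are $f''(b_k)/h$:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# target C^2 function on [0,1] and its derivatives
f   = lambda x: np.sin(3 * x)
df0 = 3.0                              # f'(0)
d2f = lambda b: -9 * np.sin(3 * b)     # f''(b)

# discretize  f(x) = f(0) + f'(0) x + \int_0^1 f''(b) relu(x - b) db
h  = 2000
bs = (np.arange(h) + 0.5) / h          # midpoints of a uniform partition of [0,1]
xs = np.linspace(0.0, 1.0, 101)
net = f(0.0) + df0 * xs + (d2f(bs)[None, :] * relu(xs[:, None] - bs[None, :])).mean(axis=1)
print(np.max(np.abs(net - f(xs))))     # uniform error; shrinks as the width h grows
```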
We will see that a common theme in neural network approximation is defining a measure that quantifies the regularity of functions from some class.
Discussion:
The final layer of a network of two or more layers computes a linear combination of all nodes in the preceding hidden layer. Could one view this neural network as a superposition of multiple networks, each approximating an arbitrary activation function? A multilayer network then could be seen as approximating combinations of activations to form the final linear combination of this collection of functions, possibly making Kolmogorov’s superposition theorem immediately relevant.
I now review, in chronological order, some of the key papers on universal approximation:
### 1.1.1 Hornik (1989)
The main result of this paper is that shallow networks $\mathcal{F}_{NN}^{h,1}$ (of arbitrary width $h$) are dense in the following sets:
- Borel measurable functions
- continuous functions
- Lebesgue $p$-integrable functions, $1 \leq p < \infty$
given that the activation function is either continuous or “squashing”, i.e. monotonic and sigmoidal in the following sense:
$$\sigma: \mathbb{R} \to [0,1] \space \text{is non-decreasing with} \space \lim_{z \to -\infty}\sigma(z) = 0 \space \text{and} \space \lim_{z \to +\infty}\sigma(z) = 1$$
The latter functions, being monotonic, have at most a countable number of discontinuities.
The paper highlights two techniques for showing density of a function class:
1. Algebras of functions, e.g. Stone-Weierstrass
2. Uniform convergence on compact sets implies convergence in measure for regular, finite measures
The first is used to show shallow networks with either (non-trivial, i.e. non-constant) continuous or squashing activations are dense in $C(K)$ for any compact subset $K \subset \mathbb{R}^n$. This is done by establishing a general form of $\mathcal{F}_{NN}^{h,1}$ (given by linear combinations of products of multiple shallow networks) as a separating and nowhere-vanishing algebra (closed under addition, multiplication, and scalar multiplication) of continuous functions. The separating property is guaranteed by the fact that the activation function is non-constant, the nowhere-vanishing property follows from the existence of the constant affine transformation $x \mapsto b$ with $\sigma(b) \neq 0$, and continuity of $\sigma$ ensures continuity of any network in the class. This result is extended to (possibly discontinuous) squashing functions by further approximation of the squashing function by a continuous-activation shallow network.
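As a concrete illustration of the closure-under-multiplication step (using a cosine activation for transparency, rather than the squashing activations the paper actually treats), the product of two ridge cosines is again a linear combination of ridge cosines by the product-to-sum identity:
$$\cos(w_1^T x + b_1)\cos(w_2^T x + b_2) = \tfrac12\cos\big((w_1 + w_2)^T x + (b_1 + b_2)\big) + \tfrac12\cos\big((w_1 - w_2)^T x + (b_1 - b_2)\big)$$
so this particular ridge class is already an algebra; for general activations the products must be carried along explicitly, as in the construction above.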
The second technique is used to show density of shallow networks with either continuous or squashing activations in the space of measurable functions, where the pseudo-metric $\rho_\mu$ is defined by:
$$\rho_\mu(f, g) := \inf\{\varepsilon > 0 \space | \space \mu(\{x \space | \space |f(x) - g(x)| > \varepsilon\}) < \varepsilon\}$$
This pseudo-metric induces the topology of convergence in measure and is zero if and only if $f = g$ $\mu$-almost everywhere. Under this metric it can be shown that $\mathcal{F}_{NN}^{h,1}$ is dense in the space of measurable functions. The proof of density follows from $\mathbb{R}^n$ being a locally compact metric space, on which all finite Borel measures are regular, so that one can always find a compact set arbitrarily close to full measure (w.l.o.g. assume the total measure is 1). Uniform convergence on compact sets makes the distance small on this particular compact set, and outside this set the measure is arbitrarily small, dominating any large differences in distance.
Discussion:
The definition of shallow networks in terms of products of activations with differing affine transformations is used to establish shallow networks as a sub-algebra of $C(K)$. This could also be done by assuming some other activation function (e.g. the exponential) and then approximating these networks by sigmoidal networks.
The authors claim any function with finite support has an exact representation in terms of a shallow network. A natural question is whether this holds for compact support.
How do the authors go from density for functions defined on any compact subset to functions defined on all of $\mathbb{R}^n$?
### 1.1.2 Cybenko (1989)
This paper gives a proof of the same result as Hornik but using functional analysis, namely the Hahn-Banach and Riesz representation theorems. The key intuition is to understand shallow networks as a particular subspace of continuous functions (why is it a linear subspace?).
The main result of this paper is that $\mathcal{F}_{NN}^{h,1}$ is dense in $C(K)$ for compact $K \subset \mathbb{R}^n$, given that the activation function is continuous and discriminatory, in the sense that for every $\mu \in M(K)$ (the set of finite, signed regular Borel measures on $K$):
$$\int_K \sigma(w^T x + b) \space d\mu(x) = 0 \space\space \forall \space w \in \mathbb{R}^n, b \in \mathbb{R} \implies \mu = 0$$
The contrapositive makes the name discriminatory more informative: if $\mu$ is a non-zero measure, then there are parameters $(w, b)$ for which the induced linear functional $\mu(\sigma_{w,b}) := \int_K \sigma(w^Tx + b) \space d\mu(x)$ is non-zero, i.e. either $\mu(\sigma_{w,b}) > 0$ or $\mu(\sigma_{w,b}) < 0$, where $\sigma_{w,b}$ is determined by the parameters $(w,b)$. Then we can view the measure as inducing a linear functional that separates all such $\sigma_{w,b}$ into two classes.
The paper highlights two methods for showing density of a function class:
1. Hahn-Banach and Riesz representation theorems
2. Translation-invariant subspaces, e.g. the Wiener Tauberian theorem
The first method is used to show shallow networks with continuous and discriminatory activations are dense in $C(K)$. This is done by showing that if the subspace $\mathcal{F}_{NN}^{h,1}$ is not dense, then it is annihilated by (the linear functional induced by) a non-zero finite measure. The proof is by contradiction: if $\overline{\mathcal{F}_{NN}^{h,1}} \neq C(K)$, then there is a non-trivial, continuous linear functional in the annihilator which, by continuity, also annihilates the closure, and which has a non-trivial extension $L$ to all of $C(K)$ by Hahn-Banach. By the Riesz representation theorem, $L$ has the form $L(f) = \int_K f \space d\mu$ for some non-zero $\mu \in M(K)$, which in particular holds for $f = \sigma(w^Tx + b)$. So $\int_K \sigma(w^Tx + b) \space d\mu = 0$ for all $(w, b)$, which is a contradiction as $\sigma$ is discriminatory and $\mu$ is not trivial; thus $\overline{\mathcal{F}_{NN}^{h,1}} = C(K)$.
This holds for sigmoidal activations, given that any sigmoidal function is discriminatory. The proof that a sigmoidal function is discriminatory is as follows: define a pointwise convergent and bounded sequence of sigmoids $\sigma(\lambda(w^Tx + b))$, $\lambda \to \infty$, whose limit is constant on the hyperplane $\{w^Tx + b = 0\}$, 1 above it, and 0 below it. By assumption, each member of the sequence has integral (against $\mu$) equal to 0, and in particular so does the limit (by the bounded convergence theorem), so that the measure of the half-space lying above the hyperplane is 0 (the contribution of the hyperplane itself is handled by an additional shift parameter inside the sigmoid, and the limit function is zero below the hyperplane). This implies that the integral of the indicator of any half-space is 0, and by density of simple functions in $L^\infty(\mu)$, the functional annihilates all bounded measurable functions. In particular, it annihilates $\sin(w^Tx)$ and $\cos(w^Tx)$, showing that the Fourier transform of the measure is zero, and hence $\mu = 0$. Therefore, $\sigma$ is discriminatory.
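The limiting behaviour used in this argument is easy to see numerically; a minimal sketch (my own illustration, with the logistic sigmoid):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# sigma(lambda * (w^T x + b)) tends to 1 above the hyperplane {w^T x + b = 0},
# to 0 below it, and stays at sigma(0) = 1/2 on the hyperplane itself.
w, b = np.array([1.0, -2.0]), 0.5
xs = np.array([[1.0, 0.0],     # w^T x + b = +1.5  (above)
               [0.0, 1.0],     # w^T x + b = -1.5  (below)
               [1.0, 0.75]])   # w^T x + b =  0.0  (on the hyperplane)
for lam in [1.0, 10.0, 100.0, 400.0]:
    print(lam, sigmoid(lam * (xs @ w + b)))
```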
When the activation is discontinuous but bounded and measurable, then $\mathcal{F}_{NN}^{h,1}$ is dense in $L^1(K)$. The proof is similar but uses the fact that the dual of $L^1$ is $L^\infty$ (so it uses functionals defined by bounded functions rather than those defined by measures).
The paper also points out the connection between approximation of continuous functions by neural networks and the approximation of decision regions: arbitrary compact, disjoint subsets of $\mathbb{R}^n$ can be (approximately) partitioned. This extends a previous result that any finite set of points can be partitioned into arbitrary sub-regions by a shallow network, i.e. neural networks have infinite VC-dimension (??). The result stems from Littlewood’s intuitive view of Lusin’s theorem: “every measurable function is nearly continuous”. Given that a decision function (which outputs the index of the measurable set the input lies in) is measurable, by Lusin’s theorem there is a continuous restriction of the decision function to a compact set of almost full Lebesgue measure. This continuous restriction can now be approximated by a shallow network, by the above result. It is important to note that the continuity requirement means that this partitioning is only approximate, in the sense that there will always be some set of positive measure that is incorrectly classified, albeit of arbitrarily small measure. A specific application of this is that points sufficiently far from the boundary and inside the decision region can be correctly classified!
Cybenko importantly notes the key goal of our investigations:
“… Namely [we wish to solve] how many terms in the summation (or equivalently how many hidden nodes) are required to yield an approximation of a given quality? What properties of the function being approximated play a role in determining the number of terms?”
Discussion:
There is a small error in the proof (pointed out by Telgarsky): it takes the Fourier transform of a function whose Fourier transform possibly doesn’t exist classically; technically, it is defined only in the sense of distributions!
### 1.1.3 Barron (1993)
This paper aims to answer the questions Cybenko left us with. Namely, it shows that functions in a subspace of $L^2(U, \mu)$ for some bounded $U \subset \mathbb{R}^n$, which we call the Barron space, can be approximated by a shallow sigmoidal network with error $O(1/\sqrt{h})$ in the number of hidden nodes $h$, together with a uniform lower bound of order $h^{-1/n}$ in the case where the basis functions are fixed (no adaptive affine maps).
We define the Barron space $\mathbb{B}(U)$ as the set of functions $f$ with:
$$\int \sup_{x \in U}|\xi \cdot x| \space \big|\hat{f}(\xi)\big| \space d\xi < \infty$$
Such functions have bounded first partial derivatives (a necessary but not sufficient condition for membership): this is clear, as the Barron space is a strict subspace of the Sobolev space $W^{1,\infty}(U)$. As we will see, this space is a Banach space (??) with norm given by the Barron norm
$$\begin{aligned} ||f||_{\mathbb{B}(U)} :&= \int \sup_{x \in U} | \xi \cdot x| \left| \hat{f}(\xi) \right| \space d\xi \\ &= \int || \xi ||_{1} \left| \hat{f}(\xi) \right| \space d\xi \end{aligned}$$

Note that the equality stems from the fact that $\sup_{x \in U} | \xi \cdot x|$ is a dual norm when $U$ is the ball given by some norm and bounds $e^{i \xi \cdot x}$ in the Fourier representation of $f |_U$. The Barron norm gives a measure of the degree of regularity of $f$ that penalizes high frequencies more so than would the $L^1$ norm of the Fourier transform of $f$ itself. Indeed, a finite Barron norm implies sparsity in the Fourier domain in the sense that the Fourier transform vanishes at infinity. Importantly, Barron considers the approximability of functions from a ball of fixed radius $R$ in Barron space.

A major contribution of this paper is to demonstrate a gap between approximation using adaptive basis functions vs. a fixed basis. In particular, the main result of the paper is a uniform upper bound on the approximation error in $L^2(B_r, \mu)$, with $\mu$ a finite measure: $\forall \space f \in \mathbb{B}(B_r)$ $\exists \space f_{NN} \in \mathcal{F}_{NN}^{h,1}$ s.t.

$$\big|\big|f-f_{NN}\big|\big|_{L^2(B_r, \space\mu)}\leq\frac{2r}{\sqrt{h}}\big|\big|f\big|\big|_{\mathbb{B}(B_r)}$$

Here the Barron norm involves an $n$-dimensional integral and can be exponentially large in the input dimension (e.g. for radial functions), even if all derivatives are bounded, requiring an exponential number of nodes for good approximation guarantees. This is a manifestation of the curse of dimensionality in error bounds for shallow networks. From our original definition of $f_{NN}$ it is clear that such a network has $h(n+2)$ parameters, so that the lower bound in high dimensions behaves like $h^{-1/n} \to 1$, and the only way to combat this constant lower bound is by increasing the width. This is another demonstration of the curse of dimensionality for shallow networks.

The two key results in the paper are the following:
1. An upper bound on the approximation error both when the parameters are unbounded and when they are bounded
2. A lower bound on the approximation error for fixed basis functions (linear combinations/no affine map) via Kolmogorov $n$-widths

#### Upper bounds for approximation in Barron space

##### Unbounded weights

The proof is based on the intuition that if a function can be approximated in an *adaptive* basis, then a random sample of the basis elements from a well-chosen distribution will yield low error, by the law of large numbers. More explicitly, convex hulls in a Hilbert space induce probability distributions over approximating sequences of functions. Here, the "adaptive basis" is given by bounded sigmoids composed with linear functions:

$$G_\sigma = \{ w_1 \sigma(w_0^T x + b) \space | \space |w_1| \leq C\}$$

One first approximates elements $f \in \overline{\text{conv}}(G_\sigma)$ by an element in the convex hull of $G_\sigma \subset L^2(\mathbb{R})$. Namely, pick a function of the form $\tilde{g}_\varepsilon = \sum_{i=1}^m \tilde{\theta}^i \tilde{g}_i$ for $\tilde{g}_i \in G_\sigma$ and $\sum_{i=1}^m \tilde{\theta}^i = 1$ with $m$ large enough so that $|| f - \tilde{g}_\varepsilon ||_{L^2(\mathbb{R})}^2 \leq \frac{\varepsilon}{h}$. Note that this defines a probability measure $\mathbb{P}$ over the set $\{\tilde{g}_1, \cdots, \tilde{g}_m\}$ such that $\mathbb{P}(g = \tilde{g}_i) = \tilde{\theta}^i$.
Take $h$ independent draws from this distribution, $g_1, \cdots, g_h \sim \text{Multinoulli}(\tilde{\theta}^1, \cdots, \tilde{\theta}^m)$, and form their sample average, $g = \frac1h \sum_{i=1}^h g_i$, so that $\mathbb{E}[g] = \mathbb{E}[\frac1h \sum_{i=1}^h g_i] = \frac1h \sum_{i=1}^h \mathbb{E}[g_i] = \frac1h \sum_{i=1}^h \tilde{g}_\varepsilon = \tilde{g}_\varepsilon$. By independence of the draws and the fact that $\mathbb{E}[g] = \tilde{g}_\varepsilon$ (so the cross terms vanish in expectation), it follows that for any $\varepsilon$:

$$\begin{aligned} \mathbb{E}\left[||f - g||_{_{L^2(B_r, \mu)}}^2 \right] &\leq \mathbb{E}\left[||f - \tilde{g}_\varepsilon||_{_{L^2(B_r, \mu)}}^2 + ||\tilde{g}_\varepsilon - g||_{_{L^2(B_r, \mu)}}^2 \right] \\ &\leq \frac{\varepsilon}{h} + \mathbb{E}\left[||g - \tilde{g}_\varepsilon||_{_{L^2(B_r, \mu)}}^2 \right] \\ &= \frac{\varepsilon}{h} + \mathbb{E}\left[\left\langle g - \tilde{g}_\varepsilon, \space g - \tilde{g}_\varepsilon \right\rangle_{_{L^2(B_r, \mu)}} \right] \\ &= \frac{\varepsilon}{h} + \mathbb{E}\left[\left\langle \frac1h \sum_{i=1}^h g_i - \tilde{g}_\varepsilon, \space \frac1h \sum_{i=1}^h g_i - \tilde{g}_\varepsilon \right\rangle_{_{L^2(B_r, \mu)}} \right] \\ &= \frac{\varepsilon}{h} + \frac1h \mathbb{E}\left[\left\langle g_i - \tilde{g}_\varepsilon, \space g_i - \tilde{g}_\varepsilon \right\rangle_{_{L^2(B_r, \mu)}} \right] \\ &= \frac{\varepsilon}{h} + \frac1h \left( \mathbb{E}\left[||g_i||_{_{L^2(B_r, \mu)}}^2 \right] - ||\tilde{g}_\varepsilon||_{_{L^2(B_r, \mu)}}^2 \right) \\ &\leq \frac{\varepsilon}{h} + \frac1h \left(b^2 - ||\tilde{g}_\varepsilon||_{_{L^2(B_r, \mu)}}^2 \right) \\ &\overset{\varepsilon \to 0}{\to} \frac1h \left(b^2 - ||f||_{_{L^2(B_r, \mu)}}^2 \right) \end{aligned}$$

(Here $b$ bounds the $L^2(B_r, \mu)$ norm of the elements of $G_\sigma$.) Given that the expectation is bounded by this quantity, there exists a particular draw, and a corresponding $g$, achieving the given bound.

The upper bound in the main result follows by showing that the Barron space is contained in the space of infinite combinations of translates of sinusoids, and that an infinite convex combination of sinusoids can be approximated by infinite convex combinations of step functions and therefore sigmoids, i.e. that $\mathbb{B}(U) \subset \overline{\text{conv}}(G_{\cos}) \subset \overline{\text{conv}}(G_\sigma)$:

1. $\mathbb{B}(U) \subset \overline{\text{conv}}(G_{\cos})$: The intuition (see Telgarsky) is that one can view the inverse Fourier transform as an infinite-width neural network with complex exponential activations: $$\begin{aligned} f(x) - f(0) &= \text{Re}\left(\int ( e^{i \xi \cdot x} - 1) \hat{f}(\xi) \space d\xi \right) \\ &= \int \text{Re}\left(( e^{i \xi \cdot x} - 1) \hat{f}(\xi) \right) \space d\xi \\ &= \int \text{Re}\left(( e^{i \xi \cdot x} - 1) e^{i \theta(\xi)} \big|\hat{f}(\xi)\big| \right) \space d\xi \\ &= \int \big(\cos(\xi \cdot x + \theta(\xi)) - \cos(\theta(\xi))\big) \big|\hat{f}(\xi)\big| \space d\xi \\ &= \int \frac{||f||_{\mathbb{B}(U)}}{\sup_{x \in U} | \xi \cdot x|} \big(\cos(\xi \cdot x + \theta(\xi)) - \cos(\theta(\xi))\big) \space d\mathbb{P}(\xi) \\ \end{aligned}$$ where the last line is an infinite convex combination in the Hilbert space $L^2(\mathbb{R}^n, \mathbb{P})$ and $d\mathbb{P} = \frac{\sup_{x \in U} | \xi \cdot x|}{||f||_{\mathbb{B}(U)}} \big|\hat{f}(\xi)\big| \space d \xi$.
2. $\overline{\text{conv}}(G_{\cos}) \subset \overline{\text{conv}}(G_{\text{step}})$: The elements of $G_{\cos}$, shown in the integral above, are univariate sinusoids which can be approximated uniformly on the bounded interval $\left[-\frac{1}{\sup_{x \in U} | \xi \cdot x|}, \frac{1}{\sup_{x \in U} | \xi \cdot x|}\right]$ by piecewise constants (i.e. step functions). We are therefore approximating a univariate function, which can be done using an FTC argument (as in the introduction, see Telgarsky): $$\begin{aligned} \cos(\xi \cdot x + \theta(\xi)) - \cos(\theta(\xi)) &= -\int_0^{ \xi \cdot x} \sin(b + \theta(\xi)) \space db \\ &= \int_0^{\sup_{x \in U} | \xi \cdot x|} \sin(b + \theta(\xi)) \chi_{_{\{\xi \cdot x + b \geq 0\}}}(x) \space db - \int_{-\sup_{x \in U} | \xi \cdot x|}^0 \sin(b + \theta(\xi)) \chi_{_{\{\xi \cdot x + b \leq 0\}}}(x) \space db \\ &= \int_0^{\sup_{x \in U} | \xi \cdot x|} \big( \sin(b + \theta(\xi)) - \sin(-b + \theta(\xi)) \big) \chi_{_{\{\xi \cdot x + b \geq 0\}}}(x) \space db \end{aligned}$$ This demonstrates that $\cos$ can be written as an infinite convex combination of step functions (given the correct normalization). Barron's original proof demonstrates the above by an analogous argument approximating by piecewise constants uniformly over a partition of $\left[-\frac{1}{\sup_{x \in U} | \xi \cdot x|}, \frac{1}{\sup_{x \in U} | \xi \cdot x|}\right]$, namely by approximating the two integrals in the second line by Riemann sums.
3. $\overline{\text{conv}}(G_{\text{step}}) \subset \overline{\text{conv}}(G_{\sigma})$: The argument involving Riemann sums also holds when the partition is restricted to the set of points where $\frac{\xi \cdot x}{\sup_{x \in U} | \xi \cdot x|}$ is continuous, which is dense in $[-1,1]$ (??). Noting that $\sigma(|z| (\xi \cdot x + b)) \to \chi_{\{\xi \cdot x + b \geq 0\}}(x)$ pointwise (and uniformly away from the discontinuity) as $|z| \to \infty$, it holds by Dominated Convergence that $$\int \big| \sigma(|z| (\xi \cdot x + b)) \big|^2 \space dx \to \int \chi_{\{\xi \cdot x + b \geq 0\}}(x) \space dx$$

In order to make the connection to infinite-width neural networks clear, I state the above Barron representation of Fourier inversion as follows (see Telgarsky):

$$f(x) = f(0) + \int \int_0^{\sup_{x \in U} | w^T x|} \Big( \sin(b + \theta(w)) - \sin(-b + \theta(w)) \Big) \Big| \hat{f}(w) \Big| \chi_{_{\{w^T x + b \geq 0\}}}(x) \space db \space dw $$

##### Bounded weights

Unbounded weights are impractical due to memory constraints as well as numerical instability in gradient-based learning. Given that the above result for unbounded weights was derived by approximating step functions by (arbitrarily scaled) sigmoids, it is unsurprising that the error bound for bounded weights now involves a distance between a step function and the sigmoid activation function. Indeed, the norm of the weights determines the scale of the sigmoidal activation:

$$\sigma(||w_{0,k}||_U \cdot(\hat{w}_k^Tx + \frac{b_k}{||w_{0,k}||_U}))$$

for the norm $||w_{0,k}||_U = \sup_{x \in U}|w_{0,k}^T x|$ and where $\hat{w}_k = \frac{w_{0,k}}{||w_{0,k}||_U}$ is a unit vector. As we will see, the distance between the unit step and the sigmoid depends on the norm $||w_{0,k}||_U$ in such a way that this norm must be large enough in order to preserve the $O(\frac1h)$ (squared) approximation error, and thus depends on the regularity of the chosen sigmoid. The bound follows from the triangle inequality, after first approximating by sinusoids, as done in the above section. Consider an infinite-width 1-layer sigmoidal neural network with bounded weights, i.e.
$f_M \in \overline{\text{conv}}(G_\sigma)$ such that $||w_{0,k}||_U \leq M$. By similar arguments as above one can approximate a function $g \in \overline{\text{conv}}(G_{\cos})$ to accuracy $\varepsilon \in (0, \frac12]$, first by step functions $s \in G_{\text{step}}$ on a partition of $\left[-\frac{1}{\sup_{x \in U} | \xi \cdot x|}, \frac{1}{\sup_{x \in U} | \xi \cdot x|}\right]$ with width $\varepsilon \leq \frac1k \leq \frac{\varepsilon}{1-\varepsilon}$:

$$\begin{aligned} ||g - f_{NN}||_{L^2(U, \mu)} &\leq ||f_{NN} - f_M||_{L^2(U, \mu)} + ||f_M - g||_{L^2(U, \mu)} \\ &\leq \frac{2 ||g||_\mathbb{B}}{\sqrt{h}} + ||f_M - g||_{L^\infty(U, \mu)} \\ &\leq \frac{2 ||g||_\mathbb{B}}{\sqrt{h}} + ||g - s||_{L^\infty(U, \mu)} + ||s - f_M||_{L^\infty(U, \mu)} \\ &\leq \frac{2 ||g||_\mathbb{B}}{\sqrt{h}} + \sup_{\substack{z = \hat{w}_k^Tx + b_k \\ x \in U}} |g(z) - s(z)| + \sup_{\substack{z = \hat{w}_k^Tx + b_k \\ x \in U}} |s(z) - f_M(z)| \\ &\leq \frac{2 ||g||_\mathbb{B}}{\sqrt{h}} + \frac{2 ||g||_\mathbb{B}}{k} + 2 ||g||_\mathbb{B} \sup_{|z| \geq \frac1k} \Big|\sigma(||w_{0,k}||_U z) - \chi_{\{z > 0\}}(z)\Big| \\ &\leq \frac{2 ||g||_\mathbb{B}}{\sqrt{h}} + 2 ||g||_\mathbb{B} \left(\frac{\varepsilon}{1-\varepsilon} + \sup_{|z| \geq \varepsilon} \Big|\sigma(||w_{0,k}||_U z) - \chi_{\{z > 0\}}(z)\Big| \right) \end{aligned}$$

**Note some errors in the proof above: one cannot bound the $L^2$ norm by the supremum norm without a constant, unless the underlying measure is a probability measure (which I believe it is). Also, how does one justify taking an infimum in the last step?**

As the LHS does not depend on the choice of $\varepsilon$, one can take the infimum over $\varepsilon \in (0, \frac12]$, which yields the desired bound:

$$\begin{aligned} ||g - f_{NN}||_{L^2(U, \mu)} &\leq 2 ||g||_\mathbb{B} \left(\frac{1}{\sqrt{h}} + \inf_{\varepsilon \in (0, \frac12]} \left(2\varepsilon + \sup_{|z| \geq \varepsilon} \Big|\sigma(||w_{0,k}||_U z) - \chi_{\{z > 0\}}(z)\Big| \right) \right) \end{aligned}$$

#### Lower bounds for approximation with a fixed basis

The second proof is based on the intuition that given some function in Barron space $f \in \mathbb{B}$, there are exponentially many orthonormal functions with the same degree of oscillation as $f$, and in order to efficiently approximate $f$ with a fixed basis, one must use all such orthonormal functions. In particular, Barron lower bounds the **Kolmogorov $n$-width** of Barron space by considering the $n$-width of sinusoids with oscillations no more than those of $f$ (??) and then lower bounding the number of all such functions using a combinatorial method.

First, note that for a $2h$-dimensional linear space $V$ with ONB $\{e_1, \cdots, e_{2h}\}$ and any $h$-dimensional subspace $G_h = \text{span}(g_1, \cdots, g_h) \subset V$, there must exist some basis element of $V$ with:

$$d(e_j, G_h) := \inf_{g \in G_h} d(e_j, g) \geq \frac12$$

A simple computation shows that the average value of the squared norm of the projections of the $e_j$'s onto $G_h$ is $\frac12$, so that there must exist some $e_j$ with $1 - d(e_j, G_h)^2 = ||P e_j||^2 \leq \frac12$. Now consider a $2h$-dimensional subspace of $L^2(U, \mu)$, $V := B_{2h} = \text{span}(\cos(2\pi k_1 \cdot x), \cdots, \cos(2\pi k_{2h} \cdot x))$, ordered by $||k||_1$ where the $k_j$ are in [[Multi-Index Notation]].
We can bound the **Kolmogorov $n$-width** of Barron space $\mathbb{B}$ from below by restricting to the subspaces $B_{2h} \cap \mathbb{B}$ and projecting an arbitrary basis of $L^2(U, \mu)$ onto $B_{2h}$:

$$\begin{aligned} \inf_{e_1, \cdots, e_h \in L^2(U, \mu)} \sup_{f \in \mathbb{B}} \space d(f, \text{span}(e_1, \cdots, e_h)) &\geq \inf_{e_1, \cdots, e_h \in L^2(U, \mu)} \sup_{f \in B_{2h} \cap\mathbb{B}} \space d(f, \text{span}(Pe_1, \cdots, Pe_h)) \\ &\geq \inf_{\substack{G_h \subset L^2(U, \mu) \\ G_h \space \text{subspace}}} \sup_{f \in B_{2h} \cap\mathbb{B}} \space d(f, G_h) \\ &\geq \inf_{\substack{G_h \subset L^2(U, \mu) \\ G_h \space \text{subspace}}} \sup_{f \in \{\frac{||f||_\mathbb{B}}{2\pi|k_1|}\cos(2\pi k_1 \cdot x), \cdots, \frac{||f||_\mathbb{B}}{2\pi|k_{2h}|}\cos(2\pi k_{2h} \cdot x)\}} \space d(f, G_h) \\ &\geq \min_{j \in \{ 1, \cdots, 2h\}} \left(\frac{||f||_\mathbb{B}}{2\pi|k_j|} \right) \left( \inf_{\substack{G_h \subset L^2(U, \mu) \\ G_h \space \text{subspace}}} \sup_{f \in \{\cos(2\pi k_1 \cdot x), \cdots, \cos(2\pi k_{2h} \cdot x)\}} \space d(f, G_h) \right) \\ &\geq \frac12 \min_{j \in \{ 1, \cdots, 2h\}} \left(\frac{||f||_\mathbb{B}}{2\pi|k_j|} \right) \\ &\geq \frac{||f||_\mathbb{B}}{8\pi e^{\pi -1} n} \left(\frac1h\right)^{1/n} \end{aligned}$$

**Discussion**:

- What is the nature of these "local" approximation bounds? Why can we not get more general bounds for unbounded domains? The bounds become meaningless as the domain becomes larger; is this a sort of "approximation" uncertainty principle, i.e. function regularity must be more controlled when you have a larger domain of interest?
- Likewise, why the assumption of a finite measure for such bounds? The connection between the bound and training appears when one considers the empirical measure. There are three quantities of interest in the generalization domain: the total error $f - \hat{f}_n$, the approximation error $f - f_n$, and the estimation error $f_n - \hat{f}_n$.
- What are the conditions on $f$? The Fourier transform must exist, so either $L^1$ or $L^2$. It is also important to note the assumption on the underlying measure space (Lebesgue?). It is also clearly assuming that $f$ is differentiable (why?). This is due to the fact that requiring the Fourier transform of the gradient to be integrable does not place an integrability constraint on the Fourier transform near $\xi =0$. This allows one to guarantee the Fourier transform is also well-defined at that point. One consequence of the fact that $\hat{f}$ can blow up near $\xi =0$ is that not all smooth functions are contained in the above class! Note that linear functions do not have a Fourier transform that defines a measure, but do when restricted to a bounded domain.
- The measure $\mathbb{P}$ is function dependent. It gives a weight (distribution) over the various frequency components of the function itself.
- In the way that the Fourier transform is rewritten as an infinite-width neural network, it can be seen that the weights play the role of frequencies and the biases play the role of thresholding for the FTC (as in the introduction of this paper)!

## 1.2 Banach Spaces of Functions Represented by Infinite Width Shallow Networks

Barron's result yields two important insights:
1. Infinite-width shallow networks correspond to integral representations of functions
2. One can sample such infinite-width networks to obtain finite-width network approximations of functions (see the sketch below)
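To connect these two insights, here is a small Monte Carlo sketch (my own illustration, using the univariate FTC representation from Section 1.1 rather than Barron's Fourier representation): sampling the hidden units of the integral representation i.i.d. and averaging gives a finite-width network whose $L^2$ error decays like $O(1/\sqrt{h})$, mirroring the rate in Barron's bound.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

# Infinite-width representation (cf. Section 1.1):
#   f(x) = f(0) + f'(0) x + E_{b ~ Unif[0,1]}[ f''(b) relu(x - b) ]
# Sampling b_1, ..., b_h i.i.d. and averaging is a width-h network.
f, df0, d2f = (lambda x: np.sin(3 * x)), 3.0, (lambda b: -9 * np.sin(3 * b))
xs = np.linspace(0.0, 1.0, 200)

for h in [10, 100, 1000, 10000]:
    errs = []
    for _ in range(20):                                    # average over independent draws of the network
        b = rng.uniform(0.0, 1.0, size=h)                  # sampled hidden units
        net = f(0.0) + df0 * xs + (d2f(b)[None, :] * relu(xs[:, None] - b[None, :])).mean(axis=1)
        errs.append(np.sqrt(np.mean((net - f(xs)) ** 2)))  # empirical L^2 error
    print(h, np.mean(errs))                                # decays roughly like h**(-1/2)
```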
# 2. Deep Networks

A deep, $L$-hidden-layer neural network is a linear combination of compositions of affine maps with a univariate **activation function** $\sigma$, which we denote by $f_{NN}: \mathbb{R}^n \to \mathbb{R}$:

$$\begin{aligned} f_{NN}(x; \space \theta) &:= {w_{L+1}}^T \sigma({W_{L}}^T \sigma({W_{L-1}}^T \sigma(\cdots {W_2}^T \sigma({W_1}^Tx + b_1) + b_2 \cdots) + b_{L-1}) + b_{L}) \\ &= w_{L+1}^T(\sigma \circ A_L \circ \sigma \circ A_{L-1} \circ \sigma \circ \cdots \circ A_2 \circ \sigma \circ A_1)(x) \end{aligned}$$

This yields a family of functions parameterized by $\theta = \{W_1 \in \mathbb{R}^{n \times h_1}, W_2 \in \mathbb{R}^{h_1 \times h_2}, \cdots, W_{L-1} \in \mathbb{R}^{h_{L-2} \times h_{L-1}}, W_{L} \in \mathbb{R}^{h_{L-1} \times h_{L}}, w_{L+1} \in \mathbb{R}^{h_L}, b_1 \in \mathbb{R}^{h_1}, b_2 \in \mathbb{R}^{h_2}, \cdots, b_{L-1} \in \mathbb{R}^{h_{L-1}}, b_L \in \mathbb{R}^{h_L}\}$ where, again, the activation function is understood to be applied coordinate-wise to the $h_k$-dimensional vectors of the hidden layers. We refer to this family of classes as $\mathcal{F}_{NN}^{h,L} := \{f_{NN}(\cdot \space; \space \theta) \space | \space \forall \space \theta \}$, where $h = (h_1, \cdots, h_L)$ is now a [[Multi-Index Notation|multi-index]]. Sometimes we will write $\mathcal{F}_{NN}(h,L)$ in order to reduce notational clutter and when the domain is obvious. Here the number of parameters of the network is $N = h_1(n+1) + \sum_{k=2}^{L} h_k (h_{k-1} + 1) + h_L$.

## 2.1 Depth Separations

We will be mainly focused on results which display exponential separation: showing that there are deep networks for which any shallow(er) network would require exponential width. In what sense the networks are "deep" and "shallower" will be made precise in the context of the results presented in this section.

### 2.1.1 Telgarsky (2015) & Telgarsky (2016)

Summary: Exhibits an $O(2^k)$-Lipschitz sawtooth function that a $k$-layer, constant-width univariate ReLU network can represent exactly (and hence approximate efficiently), but that no polynomial-width univariate ReLU network of depth $o(\frac{k}{\log{k}})$ can approximate. This implies exponential separation. In particular, the paper establishes a lower bound on the width of shallow networks approximating highly oscillatory, non-smooth functions by showing that each "oscillation" requires many nodes to be approximated. (A minimal numerical sketch of the sawtooth construction is given below.)

### 2.1.2 Yarotsky (2017)

### 2.1.3 Eldan & Shamir (2015)

This paper demonstrates a *tight* exponential separation (in the sense that it is for a 1-layer difference) by constructing a radial function $f: \mathbb{R}^n \to \mathbb{R}$ expressible by a 2-hidden-layer network with a number of parameters polynomial in the input dimension ($N \lesssim O(\text{poly}(n))$), but which has at least constant approximation error for any shallow network of sub-exponential width ($h \lesssim O(e^n)$), where the approximation error is with respect to a constructed probability measure.

The intuition behind their proof is a sort of converse of Barron's result in the following sense: the Fourier transform of a shallow network is concentrated around the origin, unless the number of hidden units is exponential in the input dimension ($h \gtrsim \Omega(e^n)$). Indeed, Barron spaces demonstrate the sufficiency of sparsity in the Fourier domain for approximability by shallow networks. The consequence is that any function with significant high-frequency content would require a shallow network with an exponential (in $n$) number of parameters.
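Before turning to the Eldan & Shamir construction, here is a minimal numerical sketch (my own illustration) of the sawtooth construction summarized in §2.1.1: composing a width-2 ReLU "tent" block with itself $k$ times uses only $O(k)$ parameters yet produces on the order of $2^k$ linear pieces, while a 1-hidden-layer ReLU network with $h$ units is piecewise linear with at most $h$ breakpoints and so needs width exponential in $k$ to match the oscillations.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def tent(x):
    """A width-2 ReLU block computing the 'tent' map on [0,1]."""
    return 2 * relu(x) - 4 * relu(x - 0.5)

def sawtooth(x, k):
    """k-fold composition: a depth-k, width-2 ReLU network."""
    for _ in range(k):
        x = tent(x)
    return x

xs = np.linspace(0.0, 1.0, 4097)
for k in [2, 4, 8]:
    ys = sawtooth(xs, k)
    # count slope sign changes as a proxy for the number of kinks
    kinks = int(np.sum(np.diff(np.sign(np.diff(ys))) != 0))
    print(k, kinks)   # grows like 2**k, with only O(k) parameters in the network
```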
The following proof sketch outlines the intuition of Eldan & Shamir's paper:

1. **3-layer networks can approximate radial functions**
    - By the above construction we have a shallow ReLU network that approximates $x_k \mapsto x_k^2$ coordinate-wise on an arbitrary bounded domain
    - Alter the initial layer to compute $r(x) := ||x||^2 = \sum_{k=1}^n x_k^2$ by summing the previous layer's neurons (i.e. the squared components)
    - The final layer approximates an arbitrary univariate function taking the norm as input, $g(x) = \tilde{g}(||x||)$
2. **Construct a radial function $f$ with constant lower bound on $L^2$ approximation error for shallow networks $g_{NN} \in \mathcal{F}_{NN}^{h,1}$**
    - Take $d\mathbb{P} := \mathcal{F}^{-1}(\chi_{_{B(0,1)}})\space dx$, and by Plancherel's theorem, $||f-g_{NN}||_{L^2(\mathbb{P})} = ||\hat{f} \ast \chi_{_{B(0,1)}} - \widehat{g_{NN}} \ast \chi_{_{B(0,1)}}||_{L^2(\lambda)}$, where $\lambda$ is the Lebesgue measure
    - $g_{NN}$ is constant in directions orthogonal to the rows of the weight matrix $W_1$ and has Fourier transform in those corresponding directions in the frequency domain equal to the Dirac delta function
    - By linearity, $\widehat{g_{NN}}$ has support contained in the union of the 1-D subspaces spanned by the rows of $W_1$
    - The support of $\widehat{g_{NN}} \ast \chi_{_{B(0,1)}}$ is contained in the union of infinite cylinders with principal axes parallel to the rows of $W_1$
    - As $r := ||\xi||^2 \to \infty$ the fraction of space contained in such tubes goes to zero exponentially in the dimension $n$
    - Construct $f$ so that $\hat{f}$ contains significant mass (in the $L^2$ sense) outside a ball of radius $R$ and such that $\hat{f} \ast \chi_{_{B(0,1)}}$ also has mass far from the origin (on high frequencies). (**The difficult part of the proof!!** Convolving with the indicator of the unit ball increases the $L^1$ norm but not necessarily the $L^2$ norm; use Young's inequality or consider the example of the indicator of $[-1,1]$ and the Heaviside function.)
    - Take a random superposition of indicators of epsilon-shells of width $O(\frac{1}{n^d})$, such that $d\mathbb{P}$ is approximately constant on each shell, i.e. take $f(x) = \sum_{k=1}^N \varepsilon_k \chi_{\{r_k - \frac1{n^d} \leq ||x|| \leq r_k + \frac1{n^d}\}}$ where $f_k \space d\mathbb{P} \approx c_k f_k \space dx$
    - Then $\hat{f} \ast \chi_{_{B(0,1)}} = \sum_{k=1}^N \varepsilon_k c_k \mathcal{F}(\chi_{\{r_k - \frac1{n^d} \leq ||x|| \leq r_k + \frac1{n^d}\}})$, where the $\mathcal{F}(\chi_{\{r_k - \frac1{n^d} \leq ||x|| \leq r_k + \frac1{n^d}\}})$ are Bessel functions with sufficient mass away from the origin
    - Choose $\{\varepsilon_k\}_{k=1}^N$ so that $f$ also has sufficient mass away from the origin

> [!theorem]+ Eldan & Shamir
>
> Given a Lebesgue measurable activation function $\sigma: \mathbb{R} \to \mathbb{R}$ that satisfies
> 1. (Polynomial growth): $\space|\sigma(x)| \leq C_1 (1+ |x|^\alpha) \space\space \forall \space x \in \mathbb{R}$
> 2. (Universal univariate approximation): $\exists \space C_2 \geq 1 \space \forall \varepsilon > 0$ and $\space f: \mathbb{R} \to \mathbb{R}$ L-Lipschitz with $\text{sppt}(f) = [-M,M]$ $\exists \space s_{NN} \in \mathcal{F}_{NN}^{w,1}(\mathbb{R})$ with $w \leq \frac{C_2 M L}\varepsilon$ and $||f - s_{NN}||_{L^\infty(\mathbb{R})} \leq \varepsilon$
>
> for some constants $\alpha, C_1, C_2$. Then $\exists \space \delta, n_0 > 0$ such that $\forall \varepsilon > 0, n > n_0 \space \exists \space \mathbb{P}$ a probability measure on $\mathbb{R}^n$ and $f: \mathbb{R}^n \to \mathbb{R}$ with
> 1. $f$ bounded in $[-2,2]$
> 2. $\text{sppt}(f) \subset B(0, n_0 \sqrt{n})$
> 3. $\exists \space f_{NN} \in \mathcal{F}_{NN}(3 n_0 n^{19/4}, 2)$ with $||f - f_{NN}||_{L^\infty(\mathbb{R}^n)} \leq \varepsilon$
> 4. $\forall \space g_{NN} \in \bigcup_{h \leq Ce^{Cn}} \mathcal{F}_{NN}^{h,1}$, $||f - g_{NN}||_{L^2(\mathbb{P})} \geq \delta$

Discussion:
1. Is depth important for efficiently approximating high-dimensional functions with symmetries? Can these symmetries always be defined in terms of compositions?
2. Are depth separations *fully* captured by compositions of functions with sufficient high-frequency content? Can we frame these results in a more general framework of how composition commutes (?) with the Fourier transform?

### 2.1.4 Venturi, Jelassi, Ozuch & Bruna (2021)

This paper extends the depth-separation result of Eldan & Shamir to a class of piecewise complex exponential functions with oscillations of order $O(\text{poly}(n))$ that can be approximated at $O(\text{poly}(n))$ rates by 2-hidden-layer networks, but which have a constant lower bound on the approximation error, with respect to heavy-tailed probability measures, for $O(\text{poly}(n))$-width shallow networks. Importantly, they show that polynomial oscillation and heavy-tailed measures are necessary conditions. The authors also identify the approximation domain as a source of the gap between the upper and lower bounds, and prove tight (how tight?) bounds for approximation of functions on the unit sphere $\mathbb{S}^{n-1}$: they prove sufficiency of a sparse spherical harmonic decomposition for $O(\text{poly}(n))$ approximation rates, as well as sufficiency of sufficiently large high-order coefficients in the spherical harmonic decomposition to guarantee that efficient shallow approximation cannot hold.

We first define the class of adversarial functions and error measures. For both we assume there is some universal constant $C \in (0, 1]$. We take $\mathcal{F}_{k,C}$ as the functions $f: \mathbb{R}^n \to \mathbb{C}$ with the following properties:
1. (Piecewise complex exponential): $$f(x) := e^{2\pi i \omega (v_0^T x + \sigma_{\text{ReLU}}(v_1^Tx))}$$
2. (Oscillations growing polynomially): $\exists \space k > 0$ $$\sup_{S \subseteq \{1,\cdots, n\}} ||v_0 + v_1 \chi_{_S}(v_1)||_{l_\infty} = \Theta(n^k)$$ where the indicator is applied coordinate-wise.
3. (Has sufficient mass away from the origin in the Fourier domain): $\exists \space \gamma > 0$ $$\frac{\big|\{k \in \{1, \cdots, n \} \space | \space \omega |v_{1,k}| \geq \gamma n^2 \}\big|}{n} \geq C$$

We next define the class of measures that will be used to define the approximation error lower bounds; in particular, we define the class of measures $$\mathcal{M}_C := \{\mu \space \text{a probability measure with density} \space \varphi^2 \space\}$$ where $\varphi$ has the following properties:
1. (Defines a product measure): $$\varphi(x) := \prod_{j=1}^n \psi(x_j)$$
2. (Has a Fourier transform): $$\space \psi \in L^2(\mathbb{R}) \space \cap \space L^1(\mathbb{R})$$
3. (Has unit norm): $$||\psi||_{L^2} = 1$$
4. (Has compactly supported Fourier transform): $$\text{sppt}(\hat{\psi}) \subset [-K, K]$$
5. (Has sufficient mass away from the origin): $$\begin{aligned} \sqrt{\frac1{2K}} \leq ||\psi||_{L^1} &< \sqrt{\frac{2}{K}} \\ ||\psi||_{L^1} &< \sqrt{\frac{1}{2^{(1-2C)/2} K}} \end{aligned}$$
6. (Has heavy-tails): $$|\psi(x)| = O(\frac1{|x|})$$

Their main result is as follows:

> [!theorem]+ Venturi, Jelassi, Ozuch & Bruna
>
> Given an activation function $\sigma: \mathbb{R} \to \mathbb{R}$ that satisfies
> 1. ($L_{\sigma}$-Lipschitz and bounded at $0$): $\sigma(0) \leq L_{\sigma}$
> 2. (Universal univariate approximation with bounded weights (??)): $\exists \space C \geq 1 \space \forall \space \varepsilon > 0$ and $\space f: \mathbb{R} \to \mathbb{R}$ L-Lipschitz with $\text{sppt}(f) = [-M,M]$ $\exists \space s_{NN} \in \mathcal{F}_{NN}^{w,1}(\mathbb{R})$ with $w + \max_{l,k} ||w_{l,k}||_{l_\infty} \leq \frac{C M L}\varepsilon$ and $||f - s_{NN}||_{L^\infty(\mathbb{R})} \leq \varepsilon$
>
> for some constants $L_\sigma, C$. Then $\forall \space f \in \mathcal{F}_{k,C}, \mu \in \mathcal{M}_C$ $\exists \space \alpha \in (0,1)$ such that
> 1. $\forall \varepsilon > 0 \space \exists \space f_{NN} \in \mathcal{F}_{NN}(n^{2(1+k)}\varepsilon^{-3/2}, 2)$ with $||f - f_{NN}||_{L^2(\mu)}^2 \leq \varepsilon$
> 2. $\inf_{f_{NN} \in \mathcal{F}_{NN}^{h,1}}||f - f_{NN}||_{L^2(\mu)}^2 \geq 1-h\alpha^n \cdot O(n^{k+1})$

### 2.1.5 Poggio (2017)

# 3. Approximation of Lipschitz Functions

## 3.1 Random Features

A random features model is a shallow network with random hidden parameters, i.e. a linear combination of $h$ functions of the form $x \mapsto \sigma(w_{1,k}(\omega)^T x + b_{1,k}(\omega))$ drawn independently from a distribution $\mathbb{P}_\theta$, i.e. $\{(w_{1,k}, b_{1,k}) \in \mathbb{S}^{n-1} \times \mathbb{R}\}_{k=1}^h$ is a collection of i.i.d. random variables with probability law $\mathbb{P}_{\theta}$. The random features model is therefore given by

$$\begin{aligned} f_{NN}(\omega, x; \space \theta) &:=\sum_{k=1}^{h} w_{2,k}\sigma({w_{1,k}}(\omega)^{T}x+b_{1,k}(\omega)) \end{aligned}$$

which yields the family of *random* functions parameterized by $\theta = \{W_1 \in \mathbb{R}^{n \times h}, w_2, b_1 \in \mathbb{R}^h\}$, $\mathcal{F}_{NN}^{h,1}(\mathbb{R}^n, \mathbb{P}_\theta) = \{f_{NN}(\cdot \space; \space \theta): \mathbb{R}^n \to \mathbb{R} \space | \space \forall \space \theta \}$. We refer to this family with $\sigma_{\text{ReLU}}$ as a random ReLU network.

### 3.1.1 Hsu, Sanford, Servedio & Gkaragkounis (2017)

This paper provides polynomially-tight upper and lower bounds for $L^2(\lambda)$-approximation of $\Theta(1)$-Lipschitz functions on the unit hypercube by random ReLU networks. In particular, they provide a lower bound on the width of the network required for polynomial approximation rates. Their result is stated below:

> [!theorem]+ Hsu, Sanford, Servedio & Gkaragkounis
>
> Given any $\varepsilon, \delta, L > 0$ with $\varepsilon \leq \frac{L}{2}$ and $h_0 = e^{\min \left\{ \frac{L^2}{\varepsilon^2} \log{\left(\frac{n \varepsilon^2}{L^2} + 2 \right)}, \space n \log{\left(\frac{L^2}{n \varepsilon^2} + 2 \right)} \right\}}$
>
> 1. $\forall \space f: [-1,1]^n \to \mathbb{R}$ $L$-Lipschitz $\exists \space f_{NN} \in \mathcal{F}_{NN}^{h,1}([-1,1]^n, \mathbb{P}_\theta)$ with $h \gtrsim O(h_0)$ and $$\mathbb{P}_\theta\left(\left\{||f - f_{NN}||_{L^2(\mu)}^2 \leq \varepsilon \right\}\right) \geq 0.9$$
> 2. $\exists \space f: [-1,1]^n \to \mathbb{R}$ $L$-Lipschitz $\forall \space f_{NN} \in \mathcal{F}_{NN}^{h,1}([-1,1]^n, \mathbb{P}_\theta)$ with $h \gtrsim \Omega(h_0)$ and $$\mathbb{P}_\theta\left(\left\{||f - f_{NN}||_{L^2(\mu)}^2 \leq \varepsilon \right\}\right) \geq \frac12$$
> 3.

The proof of the upper bound (1) follows by first approximating an $L$-Lipschitz function $f$ by a low-degree trigonometric polynomial $P$ via the Fourier transform of $f$; the bounded Lipschitz constant implies that the Fourier coefficients decay rapidly. Next, $P$ can be represented exactly by an infinite mixture of ReLUs by noting that each sinusoidal component of $P$ is a ridge function. Finally, the result holds by a concentration argument, showing that the empirical average of random ReLUs gives a close approximation to $P$ with high probability.

Specifically, $P$ is constructed by taking the following orthonormal basis of $L^2([-1,1]^n)$:

$$e_k(x) := \begin{cases} 1 & k = (0,\cdots,0) \\ \sqrt{2} \sin{(\pi k^Tx)} & k \in \mathcal{I}_{\text{odd}} \\ \sqrt{2} \cos{(\pi k^Tx)} & k \in \mathcal{I}_{\text{even}} \end{cases}$$

with the index sets defined over the [[Multi-Index Notation|multi-indices]] $k \in \mathbb{Z}^n$

$$\begin{aligned} \mathcal{I}_\text{odd} := \{k \in \mathbb{Z}^n \space | \space k_i > 0, \space i = \min\{j \in \{1, \cdots, n\} \space | \space k_j \neq 0\} \} \\ \mathcal{I}_\text{even} := \{k \in \mathbb{Z}^n \space | \space k_i < 0, \space i = \min\{j \in \{1, \cdots, n\} \space | \space k_j \neq 0\} \} \end{aligned}$$

Given an $L$-Lipschitz function $f$, one first approximates it by $\tilde{f} \in C_c^\infty([-1,1]^n)$ with Lipschitz constant at most $L$, which implies $||\nabla f(x)|| \leq L \space \forall \space x \in [-1,1]^n$, and so the coefficients of $f = \sum_{k \in \mathbb{Z}^n} \langle f, e_k\rangle e_k$ are bounded by $L$:

$$\begin{aligned} L^2 &\geq \int_{[-1,1]^n} || \nabla f(x) ||^2 \space dx \\ &= \sum_{i=1}^n \int_{[-1,1]^n} \left( \sum_{k \in \mathbb{Z}^n} \langle f, e_k\rangle \partial_{x_i} e_k \right)^2 \space dx \\ &= \sum_{i=1}^n \sum_{k \in \mathbb{Z}^n} \big| \langle f, e_k\rangle \big|^2 \big|\big| \partial_{x_i} e_k \big|\big|_{L^2([-1,1]^n)}^2 \\ &= \sum_{i=1}^n \sum_{k \in \mathbb{Z}^n} k_i^2 \pi^2 \big| \langle f, e_k\rangle \big|^2 \\ &= \pi^2 \sum_{k \in \mathbb{Z}^n} \big| \langle f, e_k\rangle \big|^2 \big|\big| k \big|\big|^2 \\ \end{aligned}$$

which follows by orthogonality of the $\partial_{x_i} e_k$ and the fact that $\partial_{x_i} f \in L^2([-1, 1]^n)$. From this we see that $\big|\langle f, e_k\rangle \big| \leq \frac{L}{\pi} \leq \frac{L}{2}$ for $k \neq 0$. We then take our approximation to be the trigonometric polynomial with coefficients indexed by lattice points lying in some ball of radius $r$, $P(x) := \sum_{k \in B_r(\mathbb{Z}^n)} \langle f, e_k\rangle e_k$. Therefore, the error for this *linear* approximation is

$$\begin{aligned} ||f-P||_{L^2([-1,1]^n)}^2 &= \sum_{k \in \mathbb{Z}^n \setminus B_r(\mathbb{Z}^n)} \big|\langle f, e_k\rangle \big|^2 \\ &\leq \sum_{k \in \mathbb{Z}^n \setminus B_r(\mathbb{Z}^n)} \big|\langle f, e_k\rangle \big|^2 \frac{||k||^2}{r^2} \\ &\leq \frac{L^2}{\pi^2 r^2} \\ &\leq \left( \frac{L}{2 r} \right)^2 \end{aligned}$$

We then take $r = \frac{L}{2 \varepsilon}$ to prove the upper bound. We immediately see that the dimension-dependent bound originates from the curse of dimensionality of this linear approximation, determined by the number of lattice points lying in a ball in $\mathbb{Z}^n$. The lower bound (2) is shown similarly to Barron's (using $n$-widths?): in essence, given enough highly oscillatory, orthonormal functions in $L^2([-1,1]^n)$, no $h$-dimensional subspace can approximate them all well.
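The random-features mechanism itself is easy to exercise numerically; a minimal, hypothetical sketch (the sampling distribution, with directions uniform on the sphere and biases uniform on $[-1,1]$, is an assumption for illustration and not necessarily the law $\mathbb{P}_\theta$ used in the paper): only the outer weights are fit, by least squares, while the hidden parameters stay at their random draw.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)

# target: a 1-Lipschitz function on [-1,1]^n
n, m = 5, 2000
f = lambda X: np.linalg.norm(X, axis=1)
X = rng.uniform(-1.0, 1.0, size=(m, n))
y = f(X)

for h in [10, 100, 1000]:
    W1 = rng.normal(size=(n, h))
    W1 /= np.linalg.norm(W1, axis=0, keepdims=True)   # random directions on the sphere S^{n-1}
    b1 = rng.uniform(-1.0, 1.0, size=h)               # random biases
    Phi = relu(X @ W1 + b1)                           # (m, h) random ReLU features
    w2, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # fit only the outer weights w2
    err = np.sqrt(np.mean((Phi @ w2 - y) ** 2))       # empirical L^2 error on the sample
    print(h, err)
```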
### 3.1.2 Bubeck & Sellke (2021)

This paper demonstrates a universal tradeoff between the number of parameters and the ability to approximately fit data by $O(1)$-Lipschitz functions. In particular, they prove that overparameterization by a factor of $n$ is *necessary* for approximate interpolation by a function with sufficient regularity, under the assumption of a Lipschitz parameterization of the model class with parameters bounded by $O(\text{poly}(n, m))$, where $m$ is the number of datapoints. The result hinges on concentration of measure for Lipschitz functions in high dimensions, in particular that Lipschitz functions of suitably distributed (isoperimetric) random vectors are subgaussian around their means. The formal result is as follows:

> [!theorem]+ Bubeck & Sellke
> Given $\{(X_k, Y_k) \in \mathbb{R}^n \times [-1,1]\}_{k=1}^m$ *i.i.d.* samples with distribution $\mathbb{P}$ such that
> 1. ($\mathbb{P}_x$ is $c$-isoperimetric for Lipschitz functions): $\forall \space f: \mathbb{R}^n \to \mathbb{R}$ $L$-Lipschitz, $\varepsilon \geq 0$ $$\mathbb{P}_x\left(\left\{|f(X) - \mathbb{E}[f(X)]| \geq \varepsilon\right\}\right) \leq 2 e^{-\frac{n}{2c}(\frac{\varepsilon}{L})^2}$$
> 2. (Lipschitz parameterization with bounded parameters): $$\mathcal{F} = \left\{f_\theta: \mathbb{R}^n \to \mathbb{R} \space \Big| \space |f_{\theta_1}(x) - f_{\theta_2}(x)| \leq J|| \theta_1 - \theta_2|| \space \text{and} \space ||\theta|| \leq O(\text{poly}(n,m)) \space \forall \space \theta, \theta_1, \theta_2 \in \Theta \subseteq \mathbb{R}^p\right\}$$
> 3. (Noisy data): $$\sigma^2 := \mathbb{E}_x[\text{Var}(Y \space | \space X)] > 0$$
>
> then
> $$\mathbb{P}\left(\left\{ \exists \space f \in \mathcal{F} \space L\text{-Lipschitz with} \space \frac1m \sum_{k=1}^m (f(X_k) - Y_k)^2 \leq \sigma^2 - \varepsilon \right\} \right) \leq 4 e^{-\frac{m \varepsilon^2}{8^3}} + 2e^{\log{|\mathcal{F}|} - \frac{mn \varepsilon^2}{9^4 c L^2}}$$
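The isoperimetry assumption is also easy to visualize numerically; a small sketch (my own illustration): for points uniform on the sphere of radius $\sqrt{n}$, a $1$-Lipschitz function has a range that grows like $\sqrt{n}$ but fluctuations around its mean that stay $O(1)$, independent of the dimension, which is exactly the concentration that forces interpolating noisy data to cost a large Lipschitz constant.

```python
import numpy as np

rng = np.random.default_rng(3)

def sphere_sample(m, n):
    """m points uniform on the sphere of radius sqrt(n) in R^n."""
    G = rng.normal(size=(m, n))
    return np.sqrt(n) * G / np.linalg.norm(G, axis=1, keepdims=True)

# a 1-Lipschitz test function; its range over the sphere is [-sqrt(n), sqrt(n)] ...
f = lambda X: X[:, 0]

for n in [10, 100, 1000]:
    vals = f(sphere_sample(10000, n))
    # ... but its fluctuations around the mean stay O(1), independent of n
    print(n, float(np.std(vals)))
```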