Thursday, July 18, 2024

Fourier analytic Barron space theory

I recently gave a talk about applications of Fourier analytic Barron space theory (or Barron-E theory for short) at the Erice International School on Complexity, during the XVIII Course "Machine Learning approaches for complexity", and have released a toolbox to enable such applications: gbtoolbox. But what is Fourier analytic Barron space theory?

Fourier analytic Barron space theory combines a theoretical understanding of the approximation error of neural networks (the difference between the prediction of the best neural network in the function space of neural networks with some given set of hyperparameters and the true value) with a theoretical understanding of the estimation error of the machine learning algorithm (how much data is required to distinguish one function from another in that function space), using the path norm. I refer to it as a theory, but really there are several subtly different theories in the literature, and I have not seen a final theory yet (I intend to shortly submit a paper with my own slightly different version, which I also don't think is the final theory).

The theory was first presented in a completed form in "A priori estimates of the population risk for two-layer neural networks" by E, Ma, and Wu (see also "Towards a Mathematical Understanding of Neural Network-Based Machine Learning: What We Know and What We Don't" with Wojtowytsch), but reading it requires a good understanding of machine learning theory (I recommend "Understanding Machine Learning: From Theory to Algorithms" by Shalev-Shwartz and Ben-David), and I think the original works by Barron (later with collaborator Klusowski) are also needed to understand the theory: "Risk Bounds for High-dimensional Ridge Function Combinations Including Neural Networks" and "Universal approximation bounds for superpositions of a sigmoidal function". The path norm used by E, Ma, and Wu was introduced in "Norm-Based Capacity Control in Neural Networks" by Neyshabur, Tomioka, and Srebro. The specific purpose of the theory was to develop an a priori bound on the generalization error.

One obvious way in which the theory is incomplete is the absence of optimization error (the difference between the best possible neural network and the neural network actually found by the training procedure defined by the hyperparameters).

In this first post, I will discuss how I think of the approximation part of the theory. My slides are also available at the link above.

Consider the task of a neural network. Basically, you have some data \begin{equation}\{\mathbf{x}_k,y_k\}\end{equation} where k identifies the data point and \begin{equation}\mathbf{x}_k\in [-1,1]^d~,\end{equation} where d is the number of features. We are essentially assuming that there is some function f such that \begin{equation}f(\mathbf{x}_k)=y_k~.\end{equation}

In Barron-E theory, the task for the neural network is to approximate the effective target function f*, which is the extension of f(x) to all of \begin{equation}\mathbb{R}^d~.\end{equation} This extension both enlarges the domain and is selected so as to minimize the Barron norm \begin{equation} \gamma(f^*) = \inf_{f^*} \int \|\mathbf{\omega}\|_1^2 |\tilde{f}^*(\mathbf{\omega})|\, d\mathbf{\omega} < \infty~,\end{equation} where \begin{equation}\tilde{f}^*(\mathbf{\omega})\end{equation} is the Fourier transform of the effective target function and \begin{equation}\|\mathbf{\omega}\|_1=\sum_j |\omega_j| \end{equation} is the Manhattan norm.
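As a concrete illustration, here is a minimal numerical sketch (my own, not part of gbtoolbox) that estimates the Barron norm of a one-dimensional Gaussian by direct quadrature of its Fourier transform. The grid sizes and the transform convention, which fixes the overall factors of 2π, are my assumptions; the point is simply that the integral is finite for a smooth, rapidly decaying function.

```python
import numpy as np

# Sketch: numerically estimate the Barron norm
#   gamma(f) = integral of |w|^2 |f~(w)| dw
# for the 1-d function f(x) = exp(-x^2 / 2), ignoring overall 2*pi conventions
# as in the text (assumed convention: f~(w) = integral of f(x) exp(-i w x) dx).

x = np.linspace(-20.0, 20.0, 4001)   # spatial grid, wide enough for the Gaussian tails
dx = x[1] - x[0]
w = np.linspace(-20.0, 20.0, 2001)   # frequency grid
dw = w[1] - w[0]

f = np.exp(-0.5 * x**2)              # a smooth target function with light Fourier tails

# continuous Fourier transform by direct quadrature
f_tilde = np.array([np.sum(f * np.exp(-1j * wk * x)) * dx for wk in w])

# Barron norm: finite here, so this f is a Barron function
gamma = np.sum(np.abs(w) ** 2 * np.abs(f_tilde)) * dw
print(gamma)                         # ~ 2*pi with this transform convention
```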

There are many mathematical subtleties here that I am going to ignore. One could imagine selecting some set of points from some true function that is nowhere continuous. The effective target function defined above would likely not match the true function anywhere, and Barron-E theory would not be applicable. But for a true function with some finite number of discontinuities, we would expect that, given the number of data points we wish to use to define the effective target function, we could find a function that both has a finite Barron norm and is almost certain to match whatever test points we would want to apply the function to. This representativeness of the training data and test data is a concern of the machine learning theory of estimation error, and we will move on for the moment (but may return to it in a later post).

Since we can define a Fourier transform for the effective target function, \begin{equation}\tilde{f}^*(\mathbf{\omega})~,\end{equation} we then have \begin{equation} f^*(\mathbf{x}) \simeq \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-i\mathbf{x}\cdot \mathbf{\omega}} e^{i\mathbf{y}\cdot \mathbf{\omega}} f^*(\mathbf{y})\, d\mathbf{y}\, d\mathbf{\omega}~,\end{equation} where we have left off factors of 2π. Then, for well-behaved effective target functions, \begin{equation}  \int \|\mathbf{\omega}\|_1^2 \sigma(\mathbf{\hat{\omega}}\cdot \mathbf{x} + \alpha )e^{i \|\mathbf{\omega}\|_1 \alpha } d\alpha \simeq -e^{-i \mathbf{\omega} \cdot \mathbf{x}} ~, \end{equation} where σ(y) is the ramp (ReLU) function and \begin{equation}\mathbf{\hat{\omega}} =  \mathbf{\omega} /  \|\mathbf{\omega}\|_1 ~. \end{equation} This identity follows from substituting \begin{equation}u=\mathbf{\hat{\omega}}\cdot\mathbf{x}+\alpha\end{equation} and using \begin{equation}\|\mathbf{\omega}\|_1\, \mathbf{\hat{\omega}}\cdot\mathbf{x} = \mathbf{\omega}\cdot\mathbf{x}~,\end{equation} so that \begin{equation}\int \sigma(\mathbf{\hat{\omega}}\cdot \mathbf{x} + \alpha )e^{i \|\mathbf{\omega}\|_1 \alpha } d\alpha = e^{-i\mathbf{\omega}\cdot\mathbf{x}}\int_0^{\infty} u\, e^{i \|\mathbf{\omega}\|_1 u}\, du~,\end{equation} and the remaining oscillatory integral, suitably regularized, equals \begin{equation}-1/\|\mathbf{\omega}\|_1^2~,\end{equation} which is where the factor \begin{equation}\|\mathbf{\omega}\|_1^2\end{equation} and the overall sign come from. Applying this, once more leaving off factors of 2π, we have \begin{equation}   f^*(\mathbf{x}) \simeq -\int \int \tilde{f}^*(\mathbf{\omega})\, \|\mathbf{\omega}\|_1^2\, \sigma(\mathbf{\hat{\omega}}\cdot \mathbf{x} + \alpha )e^{i \|\mathbf{\omega}\|_1 \alpha }\, d\alpha\, d\mathbf{\omega} ~.\end{equation}

This is very suggestive, especially when we recall that we aren't interested in functions defined on all of the Reals but rather only in functions defined over the domain of the data. We have \begin{equation}f^*(\mathbf{x}) \simeq \nu\int\int h(\mathbf{\omega},\mathbf{x},\alpha)\, p(\mathbf{\omega},\alpha)\, d\mathbf{\omega}\, d\alpha\end{equation} where \begin{equation}h(\mathbf{\omega},\mathbf{x},\alpha) \simeq -\mathrm{sgn}(\cos{(\|\mathbf{\omega}\|_1\alpha+\phi(\mathbf{\omega}))})\, \sigma(\mathbf{\hat{\omega}}\cdot\mathbf{x}+\alpha)\end{equation} and \begin{equation}p(\mathbf{\omega},\alpha)\simeq \|\mathbf{\omega}\|_1^2 |\tilde{f}^*(\mathbf{\omega})| |\cos{(\|\mathbf{\omega}\|_1\alpha + \phi(\mathbf{\omega}))}|/\nu \end{equation} and \begin{equation} \tilde{f}^*(\mathbf{\omega})=|\tilde{f}^*(\mathbf{\omega})|e^{i \phi(\mathbf{\omega})} ~,\end{equation} with ν the normalization constant that makes p a probability density, satisfying \begin{equation}\nu\leq 2\gamma(f^*)~.\end{equation} Using this we can define a Monte Carlo estimator, \begin{equation}     f_m(\{\mathbf{\omega},\alpha\},\mathbf{x}) \simeq \frac{\nu}{m}\sum_j^m h(\mathbf{\omega}_j,\mathbf{x},\alpha_j)~,  \end{equation} which approximates the effective target function for \begin{equation}\{ \mathbf{\omega}_j,\alpha_j \} \end{equation} drawn from the probability density function p(ω,α). The variance of such a simple Monte Carlo estimator is easy to calculate, and so we have a bound on the approximation error of this Monte Carlo estimator of \begin{equation} 4 \gamma^2(f^*)/m ~.\end{equation}
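To make the construction concrete, here is a minimal sketch (my own, not gbtoolbox) of evaluating the Monte Carlo estimator above. It assumes that samples (ω_j, α_j) drawn from p, the normalization ν, and the phase φ(ω) are supplied from elsewhere; the random numbers in the example call are placeholders that only show the shapes, not draws from p.

```python
import numpy as np

# Monte Carlo estimator sketch:
#   f_m(x) = (nu / m) * sum_j h(omega_j, x, alpha_j)
#   h(omega, x, alpha) = -sgn(cos(||omega||_1 * alpha + phi(omega))) * ramp(omega_hat . x + alpha)

def ramp(z):
    return np.maximum(z, 0.0)

def h(omega, x, alpha, phi):
    """One node h(omega, x, alpha) evaluated on a batch of inputs x with shape [n, d]."""
    l1 = np.sum(np.abs(omega))                      # ||omega||_1
    omega_hat = omega / l1                          # direction with Manhattan norm 1
    sign = -np.sign(np.cos(l1 * alpha + phi(omega)))
    return sign * ramp(x @ omega_hat + alpha)

def f_m(x, omegas, alphas, nu, phi):
    """Monte Carlo estimate of the effective target function at the inputs x."""
    m = len(omegas)
    terms = [h(omega_j, x, alpha_j, phi) for omega_j, alpha_j in zip(omegas, alphas)]
    return (nu / m) * np.sum(terms, axis=0)

# Example call with placeholder samples (a real application would draw
# (omega_j, alpha_j) from p and use nu <= 2 * gamma(f*)):
rng = np.random.default_rng(0)
d, m, n = 3, 100, 5
omegas = rng.normal(size=(m, d))
alphas = rng.uniform(-1.0, 1.0, size=m)
x = rng.uniform(-1.0, 1.0, size=(n, d))
print(f_m(x, omegas, alphas, nu=1.0, phi=lambda omega: 0.0))
```

Each term plays the role of one hidden node with a constant outer weight of magnitude ν/m, which is the observation that the next paragraph builds on.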

This Monte Carlo estimator looks very similar to a shallow neural network with m hidden nodes (which we will call a Barron-E neural network). There are some important differences between Barron-E neural networks and those we work with in standard practice, some of which we can argue would give smaller approximation errors than those given by Barron-E theory. First and most simply, the inner weight parameters of a Barron-E neural network have a Manhattan norm of 1. However, this can be addressed with an easy scale-invariant transform of a neural network (a sketch of such a rescaling is given below). Also, the integral over the bias is generally going to be much less than 2, but this only results in a smaller bound. Most importantly, the outer weights of a Barron-E neural network are constants that depend on the Barron norm. In practice, this suggests using Barron-E theory for applications such as inner weight initialization rather than as a generalization bound (see my slides above or the gbtoolbox), perhaps with a second step where we take some large number of nodes with constant outer weights and interpolate to a smaller number of nodes with non-constant outer weights. Or we improve the theory.
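For concreteness, here is a minimal sketch (my own) of the scale-invariant transform mentioned above, assuming a plain shallow ramp-function (ReLU) network \begin{equation}y=\sum_j a_j\, \sigma(\mathbf{w}_j\cdot\mathbf{x}+b_j)~.\end{equation} Because σ(cz) = cσ(z) for c > 0, each hidden node can be rescaled so that its inner weights have Manhattan norm 1 without changing the network output.

```python
import numpy as np

# Scale-invariant rescaling of a shallow ReLU network: divide each node's inner
# weights and bias by c_j = ||w_j||_1 and multiply its outer weight by c_j.
# Positive homogeneity of the ramp function leaves the output unchanged.

def normalize_inner_weights(W, b, a):
    """W: [m, d] inner weights, b: [m] biases, a: [m] outer weights."""
    c = np.sum(np.abs(W), axis=1)        # ||w_j||_1 for each hidden node
    return W / c[:, None], b / c, a * c

def shallow_relu(x, W, b, a):
    return np.maximum(x @ W.T + b, 0.0) @ a

# quick check that the output is unchanged by the rescaling
rng = np.random.default_rng(1)
W, b, a = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=4)
x = rng.uniform(-1.0, 1.0, size=(5, 3))
print(np.allclose(shallow_relu(x, W, b, a),
                  shallow_relu(x, *normalize_inner_weights(W, b, a))))   # True
```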

The development that I present here isn't meant to be the best version of the theory (which doesn't exist yet), but it is a useful way to present Fourier analytic Barron space theory. This presentation was adapted from reports that I wrote. I intend to write additional blog posts on this topic, and I have a draft on applications of Fourier analytic Barron space theory that should be approved this summer.
