
Thursday, July 18, 2024

Fourier analytic Barron Space theory

I recently gave a talk about applications of Fourier analytic Barron space theory (or Barron-E theory for short) at the Erice International School on Complexity (XVIII Course, "Machine Learning approaches for complexity") and have released a toolbox to enable such applications: gbtoolbox. But what is Fourier analytic Barron space theory?

Fourier analytic Barron space theory combines a theoretical understanding of the approximation error of neural networks (the difference between the prediction of the best neural network in the function space of neural networks with some given set of hyperparameters and the true value) with a theoretical understanding of the estimation error of the machine learning algorithm (how much data is required to distinguish one function from another in that function space), using the path norm. I refer to it as a theory, but really there are several subtly different theories in the literature, and I have not seen a final theory yet (I intend to shortly submit a paper with my own slightly different version, which I also don't think is the final theory). The theory was first presented in a completed form in "A priori estimates of the population risk for two-layer neural networks" by E, Ma, and Wu (see also "Towards a Mathematical Understanding of Neural Network-Based Machine Learning: What We Know and What We Don't" with Wojtowytsch). A good understanding of machine learning theory is required to follow it (I recommend "Understanding Machine Learning: From Theory to Algorithms" by Shalev-Shwartz and Ben-David), and I think the original works by Barron (later with collaborator Klusowski) are also required: "Universal approximation bounds for superpositions of a sigmoidal function" and "Risk Bounds for High-dimensional Ridge Function Combinations Including Neural Networks". The path norm used by E, Ma, and Wu was introduced in "Norm-Based Capacity Control in Neural Networks" by Neyshabur, Tomioka, and Srebro. The specific purpose of the theory is to develop an a priori bound on the generalization error.

One obvious way that the theory is incomplete is the absence of optimization error (the difference between the best possible neural network and the neural network actually found by the training procedure defined by the hyperparameters).

In this first post, I will discuss how I think of the approximation part of the theory. My slides are also available at the link above.

Consider the task of a neural network. Basically, you have some data \begin{equation}\{\mathbf{x}_k,y_k\}\end{equation} where k indexes the data points and \begin{equation}\mathbf{x}_k\in [-1,1]^d\end{equation} where d is the number of features. We are essentially assuming that there is some function \begin{equation}f(\mathbf{x}) \text{ such that } f(\mathbf{x}_k)=y_k~.\end{equation}

In Barron-E theory, the task for the neural network is to approximate the effective target function, which is an extension of f(x) to all of \begin{equation}\mathbb{R}^d~.\end{equation} This extension is both an extension of the domain and chosen, among all such extensions, to minimize the Barron norm \begin{equation} \gamma(f^*) = \inf_{f^*} \int \|\mathbf{\omega}\|_1^2 |\tilde{f}^*(\mathbf{\omega})| d\mathbf{\omega} < \infty\end{equation} where \begin{equation}\tilde{f}^*(\mathbf{\omega})\end{equation} is the Fourier transform of the effective target function and \begin{equation}\|\mathbf{\omega}\|_1=\sum_j |\omega_j| \end{equation} is the Manhattan norm.
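As a concrete (and entirely illustrative) check of the definition, the Barron norm of a simple one-dimensional function can be estimated numerically from its Fourier transform. The Gaussian test function, the Fourier convention, and the grid sizes below are my own choices for this sketch, not part of the theory:

```python
import numpy as np

# Numerically estimate the 1-d Barron norm  gamma(f) = ∫ |w|^2 |f_hat(w)| dw
# for f(x) = exp(-x^2/2), using the convention
# f_hat(w) = (1/2π) ∫ f(x) e^{-iwx} dx, for which the analytic value is 1.

def barron_norm_1d(f, x_max=20.0, n=2 ** 14):
    x = np.linspace(-x_max, x_max, n, endpoint=False)
    dx = x[1] - x[0]
    w = np.fft.fftshift(np.fft.fftfreq(n, d=dx)) * 2 * np.pi
    # Only the modulus of f_hat enters the Barron norm, so the phase
    # factor from the shifted grid origin can be dropped.
    f_hat_abs = np.fft.fftshift(np.abs(np.fft.fft(f(x)))) * dx / (2 * np.pi)
    return np.trapz(w ** 2 * f_hat_abs, w)

gamma = barron_norm_1d(lambda x: np.exp(-x ** 2 / 2))
print(gamma)  # ≈ 1.0 for this convention and test function
```

With a different 2π convention the numerical value rescales, but the finiteness of the norm (the integrand decays rapidly) is what the theory actually uses.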

There are many mathematical subtleties here that I am going to ignore. One could imagine selecting some set of points from a true function that is nowhere continuous; the effective target function defined above would likely not match the true function anywhere, and Barron-E theory would not be applicable. But for a true function with some finite number of discontinuities, we would expect that, for the number of data points we wish to use to define the effective target function, we could find a function that both has a finite Barron norm and is almost certain to match the test points to which we would want to apply the function. This representativeness of the training data and test data is a concern of the machine learning theory of estimation error, and we will move on for the moment (but may return to it in a later post).

Since we can define a Fourier transform of the effective target function, \begin{equation}\tilde{f}^*(\mathbf{\omega})~,\end{equation} we then have \begin{equation} f^*(\mathbf{x}) \simeq \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-i\mathbf{x}\cdot \mathbf{\omega}} e^{i\mathbf{y}\cdot \mathbf{\omega}} f^*(\mathbf{y})  d\mathbf{y}\, d\mathbf{\omega}\end{equation} where we have left off factors of 2π. Then, for well-behaved effective target functions, \begin{equation}  \int \|\mathbf{\omega}\|_1^2 \sigma(\mathbf{\hat{\omega}}\cdot \mathbf{x} + \alpha )e^{i \|\mathbf{\omega}\|_1 \alpha } d\alpha \simeq e^{-i \mathbf{\omega} \cdot \mathbf{x}} ~, \end{equation} where σ(y) is the ramp function and \begin{equation}\mathbf{\hat{\omega}} =  \mathbf{\omega} /  \|\mathbf{\omega}\|_1 ~. \end{equation} Applying this, once more leaving off factors of 2π, we have \begin{equation}   f^*(\mathbf{x}) \simeq \int \int \tilde{f}^*(\mathbf{\omega}) \|\mathbf{\omega}\|_1^2 \sigma(\mathbf{\hat{\omega}}\cdot \mathbf{x} + \alpha )e^{i \|\mathbf{\omega}\|_1 \alpha } d\alpha\, d\mathbf{\omega} ~.\end{equation}
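The Fourier inversion step can be sanity-checked numerically. Below is a sketch with the 2π factors restored, evaluating both integrals by direct quadrature for a one-dimensional Gaussian test function; the grids and the test function are illustrative choices:

```python
import numpy as np

# Check  f(x) = (1/2π) ∫ e^{-i x ω} [ ∫ e^{i y ω} f(y) dy ] dω
# by brute-force quadrature for f(y) = exp(-y^2/2), evaluated at a few
# points x in the data domain [-1, 1].

y = np.linspace(-8.0, 8.0, 801)
w = np.linspace(-10.0, 10.0, 801)
dy, dw = y[1] - y[0], w[1] - w[0]

f = np.exp(-y ** 2 / 2)                             # test function f(y)
f_hat = (np.exp(1j * np.outer(w, y)) @ f) * dy      # ∫ e^{i y ω} f(y) dy

x = np.array([-1.0, 0.0, 0.5, 1.0])
recon = (np.exp(-1j * np.outer(x, w)) @ f_hat) * dw / (2 * np.pi)

err = np.max(np.abs(recon.real - np.exp(-x ** 2 / 2)))
print(err)  # tiny: the round trip recovers f on the data domain
```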

This is very suggestive, especially when we recall that we aren't interested in functions defined on all of the Reals but only on functions defined over the domain of the data. We have \begin{equation}f^*(\mathbf{x}) \simeq \nu \int\int h(\mathbf{\omega},\mathbf{x},\alpha) p(\mathbf{\omega},\alpha) d\mathbf{\omega}\, d\alpha\end{equation} where \begin{equation}h(\mathbf{\omega},\mathbf{x},\alpha) \simeq -\mathrm{sgn}(\cos{(\|\mathbf{\omega}\|_1\alpha+\phi(\mathbf{\omega}))}) \sigma(\mathbf{\hat{\omega}}\cdot\mathbf{ x}+\alpha)\end{equation} and \begin{equation}p(\mathbf{\omega},\alpha)\simeq \|\mathbf{\omega}\|_1^2 |\tilde{f}^*(\mathbf{\omega})| |\cos{(\|\mathbf{\omega}\|_1\alpha + \phi(\mathbf{\omega}))}|/\nu \end{equation} is a probability density function, with \begin{equation} \tilde{f}^*(\mathbf{\omega})=|\tilde{f}^*(\mathbf{\omega})|e^{i \phi(\mathbf{\omega})} \end{equation} and normalization \begin{equation}\nu\leq 2\gamma(f^*)~.\end{equation} Using this we can define a Monte Carlo estimator, \begin{equation}     f_m(\{\mathbf{\omega},\alpha\},\mathbf{x}) \simeq \frac{\nu}{m}\sum_j^m h(\mathbf{\omega}_j,\mathbf{x},\alpha_j)  \end{equation} which approximates the effective target function for \begin{equation}\{ \mathbf{\omega}_j,\alpha_j \} \end{equation} drawn from the probability density function p(ω,α). The variance of such simple Monte Carlo estimators is easy to calculate, and so we have a bound on the approximation error of this Monte Carlo estimator of \begin{equation} 4 \gamma^2(f^*)/m ~.\end{equation}
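The 1/m behavior of the approximation error is just the usual Monte Carlo variance scaling. Here is a toy sketch of that argument, not the full Barron construction: p is simply a uniform density on [0,1], h is a single ramp feature, and ν = 1, all illustrative choices standing in for the Fourier-derived quantities above:

```python
import numpy as np

# Estimate I = ν·E_p[h] by (ν/m) Σ_j h(u_j, x) with u_j ~ p, and check
# that the mean-squared error falls like 1/m.

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)       # the ramp function σ

x = 0.7
exact = x ** 2 / 2                        # ∫_0^1 σ(x - u) du = x²/2

def mc_error(m, trials=2000):
    u = rng.uniform(0.0, 1.0, size=(trials, m))   # samples from p
    est = relu(x - u).mean(axis=1)                # (ν/m) Σ_j h, with ν = 1
    return np.mean((est - exact) ** 2)

e1, e2 = mc_error(100), mc_error(400)
print(e1 / e2)  # ≈ 4: quadrupling m cuts the mean-squared error by 4
```

In the real construction the variance prefactor is what the Barron norm controls, giving the 4γ²/m bound quoted above.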

This Monte Carlo estimator looks very similar to a shallow neural network with m hidden nodes (which we will call a Barron-E neural network). There are some important differences between Barron-E neural networks and those we work with in standard practice, some of which we can argue would give smaller approximation errors than those given by Barron-E theory. First and most simply, the inner weight parameters of a Barron-E neural network have a Manhattan norm of 1. This can be addressed with an easy scale-invariant transform of a standard neural network. Also, the integral over the bias is generally going to be much less than 2, but this only results in a smaller bound. Most importantly, the outer weights of a Barron-E neural network are constants that depend on the Barron norm. In practice, this suggests using Barron-E theory for applications such as inner weight initialization rather than as a generalization bound (see my slides above or the gbtoolbox), perhaps with a second step where we take some large number of nodes with constant outer weights and interpolate down to a smaller number of nodes with non-constant outer weights. Or we improve the theory.
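The scale-invariant transform mentioned above can be sketched as follows: because the ramp function is positively homogeneous, σ(cz) = cσ(z) for c > 0, each hidden node can be rescaled so its inner weights have Manhattan norm 1 without changing the network function. The array shapes and random weights here are illustrative:

```python
import numpy as np

# Rescale each hidden node  a_j σ(w_j·x + b_j)  to
# (a_j c_j) σ((w_j/c_j)·x + b_j/c_j)  with  c_j = ||w_j||_1 > 0,
# which leaves the network function (and the path norm Σ_j |a_j|·||w_j||_1)
# unchanged while making every inner weight vector unit Manhattan norm.

def normalize_inner_weights(W, b, a):
    """W: (m, d) inner weights, b: (m,) biases, a: (m,) outer weights."""
    c = np.abs(W).sum(axis=1)                  # Manhattan norm of each w_j
    return W / c[:, None], b / c, a * c

def net(W, b, a, x):
    return a @ np.maximum(W @ x + b, 0.0)      # shallow ramp network

rng = np.random.default_rng(1)
m, d = 8, 3
W, b, a = rng.normal(size=(m, d)), rng.normal(size=m), rng.normal(size=m)

W2, b2, a2 = normalize_inner_weights(W, b, a)
x = rng.normal(size=d)
print(np.abs(W2).sum(axis=1))                  # all ones after the transform
print(abs(net(W, b, a, x) - net(W2, b2, a2, x)))  # the function is unchanged
```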

The development I present here isn't meant to be the best theory (which doesn't exist yet), but it is a useful way of presenting Fourier analytic Barron space theory. This presentation was adapted from reports that I wrote. I intend to write additional blog posts on this topic, and I have a draft on Fourier analytic Barron space theory applications that should be approved this summer.

Friday, January 25, 2019

Artificial Intelligence

As someone who has worked closely with machine learning algorithms over the past decade, I have been very suspicious of calling them Artificial Intelligence. The ones that I use: Support Vector Machines, clustering, nearest neighbors, Boosted Decision Trees and (Deep Convolutional) Neural Networks do not look anything like intelligence as we define it among humans. At least to me.

I have only followed at a distance the game-playing machine learning algorithms like AlphaGo. That seemed interesting, but the state space was still so small that it didn't seem so different to me. It could just be memorizing precise patterns, matching them, and applying the correct response. These systems use reinforcement learning, which is something that I haven't spent much time thinking about.

I haven't had the time (with parenting, politics and my own projects in mathematics and physics) to follow the field closely. I was interested, though, when they announced that they were going to tackle Starcraft 2. I think that the best players end up knowing the game very well, but for me as an amateur it was an example where I needed to deal with new information and display some intelligence. Although, when I was at my best in competitive play, it was because I had learned responses that could be considered sort of automated.

While the DeepMind AI did get defeated when hampered similarly to a human, it is still an impressive showing.

What I would like to see is taking the same AI (not the same network, but the same reinforcement-trained AI) and having it play Warcraft 3 or Defense of the Ancients or even Civilization 6. There would need to be some mapping of controls and limitations, but if intelligence is actually being trained, then the AI should be successful there after being trained on Starcraft 2.

After all, the state space of real life is, by some considerations, effectively infinite. The fact that a computer can be trained to find patterns at a much faster rate than humans doesn't necessarily make it more intelligent; rather, intelligence is when a trained algorithm can be put into a real-time situation, adapt, and find and relate patterns to be successful in a new situation.

I haven't really read the papers, so when I get time to do so I should be able to think more intelligently on this topic.

There is a Vox article, https://www.vox.com/future-perfect/2019/1/24/18196177/ai-artificial-intelligence-google-deepmind-starcraft-game, which argues that the Starcraft 2 state space is (probably?) too large to be completely mapped by the machine learning algorithm, so it is displaying strategy and tactics and not just exact situational responses.