Perspectives in physics, mathematics, economics, politics, philosophy and theology.
Thursday, July 18, 2024
Fourier analytic Barron Space theory
I recently gave a talk about applications of Fourier analytic Barron space theory (or Barron-E theory for short) at the Erice International School on Complexity: the XVIII Course "Machine Learning approaches for complexity", and I have released a toolbox to enable such applications: gbtoolbox. But what is Fourier analytic Barron space theory?
Fourier analytic Barron space theory combines a theoretical understanding of the approximation error of neural networks (the difference between the prediction of the best neural network in the function space of neural networks with some given set of hyperparameters and the true value) with a theoretical understanding of the estimation error of the machine learning algorithm (how much data is required to distinguish one function from another in that function space), using the path norm to connect the two. I refer to it as a theory, but really there are several subtly different theories in the literature, and I have not seen a final theory yet (I intend to shortly submit a paper with my own slightly different version, which I also don't think is the final theory). The theory was first presented in a completed form in "A priori estimates of the population risk for two-layer neural networks" by E, Ma, and Wu (see also "Towards a Mathematical Understanding of Neural Network-Based Machine Learning: What We Know and What We Don't" with Wojtowytsch), but a good understanding of machine learning theory is required (I recommend "Understanding Machine Learning: From Theory to Algorithms" by Shalev-Shwartz and Ben-David), and I think the original works by Barron (later with collaborator Klusowski) are also required to understand the theory: "Universal approximation bounds for superpositions of a sigmoidal function" and "Risk Bounds for High-dimensional Ridge Function Combinations Including Neural Networks". The path norm used by E, Ma, and Wu was introduced in "Norm-Based Capacity Control in Neural Networks" by Neyshabur, Tomioka, and Srebro. The specific purpose of the theory was to develop an a priori bound on the generalization error.
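To make the role of the path norm a little more concrete, here is a hedged sketch in my own notation (not necessarily that of the papers above, and up to conventions about how the bias is handled): for a two-layer ReLU network \begin{equation} f_\theta(\mathbf{x}) = \sum_{j=1}^m a_j\, \sigma(\mathbf{w}_j\cdot\mathbf{x} + b_j)~, \end{equation} the path norm is \begin{equation} \|\theta\|_{\mathcal{P}} = \sum_{j=1}^m |a_j| \left(\|\mathbf{w}_j\|_1 + |b_j|\right)~, \end{equation} i.e., a sum over all input-to-output paths of the product of the absolute values of the weights along each path. The estimation error bounds in the works above scale with this quantity rather than with the number of parameters.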
One obvious way in which the theory is incomplete is the absence of optimization error (the difference between the best possible neural network and the neural network actually found by the training procedure defined by the hyperparameters).
In this first post, I will discuss how I think of the approximation part of the theory. My slides are also available at the link above.
Consider the task given to a neural network. Basically, you have some data \begin{equation}\{\mathbf{x}_k,y_k\}\end{equation} where k identifies the data point and \begin{equation}\mathbf{x}_k\in [-1,1]^d\end{equation} where d is the number of features. We are essentially assuming that there is some function \begin{equation}f(\mathbf{x})\end{equation} such that \begin{equation}f(\mathbf{x}_k)=y_k~.\end{equation}
In Barron-E theory, the task for the neural network is to approximate the effective target function \begin{equation}f^*(\mathbf{x})~,\end{equation} which extends f(x) from the data domain to all of the reals and is selected, among all such extensions, to minimize the Barron norm \begin{equation} \gamma(f^*) = \inf_{f^*} \int \|\mathbf{\omega}\|_1^2 |\tilde{f}^*(\mathbf{\omega})| d\mathbf{\omega} < \infty~,\end{equation} where \begin{equation}\tilde{f}^*(\mathbf{\omega})\end{equation} is the Fourier transform of the extension and \begin{equation}\|\mathbf{\omega}\|_1=\sum_j |\omega_j| \end{equation} is the Manhattan norm.
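As a quick worked example (my own, with the same loose treatment of factors of 2π as below): for a single plane wave restricted to the data domain, say \begin{equation} f^*(\mathbf{x}) = \cos(\mathbf{\omega}_0\cdot\mathbf{x} + \phi_0)~,\end{equation} the Fourier transform is, loosely, a pair of delta functions of weight 1/2 at \begin{equation}\pm\mathbf{\omega}_0~,\end{equation} so the Barron norm is \begin{equation}\gamma(f^*) = \tfrac{1}{2}\|\mathbf{\omega}_0\|_1^2 + \tfrac{1}{2}\|-\mathbf{\omega}_0\|_1^2 = \|\mathbf{\omega}_0\|_1^2~.\end{equation} High-frequency content is penalized quadratically, which is the sense in which the Barron norm measures how hard a function is for a shallow network to approximate.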
There are many mathematical subtleties here that I am going to ignore. One could imagine sampling points from a true function that is nowhere continuous; the effective target function defined above would then likely match the true function nowhere, and Barron-E theory would not be applicable. But for a true function with a finite number of discontinuities, we would expect that, given the data points used to define the effective target function, we could find a function that has a finite Barron norm and that we would be almost certain matches any set of test points to which we would want to apply the function. This representativeness of the training and test data is a concern of the machine learning theory of estimation error, and we will move on for the moment (but may return to it in a later post).
Since we can define a Fourier transform for the effective target function, \begin{equation}\tilde{f}^*(\mathbf{\omega})~,\end{equation} we then have \begin{equation} f^*(\mathbf{x}) \simeq \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-i\mathbf{x}\cdot \mathbf{\omega}} e^{i\mathbf{y}\cdot \mathbf{\omega}} f^*(\mathbf{y}) d\mathbf{y}\,d\mathbf{\omega}~,\end{equation} where we have left off factors of 2π. Then, for well-behaved effective target functions, \begin{equation} \int \|\mathbf{\omega}\|_1^2\, \sigma(\mathbf{\hat{\omega}}\cdot \mathbf{x} + \alpha )\,e^{i \|\mathbf{\omega}\|_1 \alpha }\, d\alpha \simeq e^{-i \mathbf{\omega} \cdot \mathbf{x}} ~, \end{equation} where σ(y) is the ramp (ReLU) function and \begin{equation}\mathbf{\hat{\omega}} = \mathbf{\omega} / \|\mathbf{\omega}\|_1 ~. \end{equation} Applying this, once more leaving off factors of 2π, we have \begin{equation} f^*(\mathbf{x}) \simeq \int \int \tilde{f}^*(\mathbf{\omega})\, \|\mathbf{\omega}\|_1^2\, \sigma(\mathbf{\hat{\omega}}\cdot \mathbf{x} + \alpha )\,e^{i \|\mathbf{\omega}\|_1 \alpha }\, d\alpha\, d\mathbf{\omega} ~.\end{equation}
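To see heuristically where the ramp identity comes from (this is my own sketch, ignoring overall signs and distributional terms supported at zero frequency): substituting \begin{equation}t = \mathbf{\hat{\omega}}\cdot\mathbf{x} + \alpha\end{equation} in the integral over the bias factors out the plane wave, \begin{equation} \int \|\mathbf{\omega}\|_1^2\, \sigma(\mathbf{\hat{\omega}}\cdot \mathbf{x} + \alpha )\,e^{i \|\mathbf{\omega}\|_1 \alpha }\, d\alpha = e^{-i\mathbf{\omega}\cdot\mathbf{x}}\, \|\mathbf{\omega}\|_1^2 \int \sigma(t)\, e^{i \|\mathbf{\omega}\|_1 t }\, dt ~, \end{equation} since \begin{equation}\|\mathbf{\omega}\|_1 (\mathbf{\hat{\omega}}\cdot\mathbf{x}) = \mathbf{\omega}\cdot\mathbf{x}~.\end{equation} The remaining integral is the (distributional) Fourier transform of the ramp function, whose magnitude scales as \begin{equation}1/\|\mathbf{\omega}\|_1^2~,\end{equation} which is exactly what the \begin{equation}\|\mathbf{\omega}\|_1^2\end{equation} prefactor compensates for; the terms supported at zero frequency only contribute affine pieces to the reconstruction, which are handled separately in the rigorous versions of the theory.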
This is very suggestive, especially when we recall that we aren't interested in functions defined on all of the reals but only in their values over the domain of the data. We have \begin{equation}f^*(\mathbf{x}) \simeq \nu\int\int h(\mathbf{\omega},\mathbf{x},\alpha)\, p(\mathbf{\omega},\alpha)\, d\mathbf{\omega}\, d\alpha\end{equation} where \begin{equation}h(\mathbf{\omega},\mathbf{x},\alpha) \simeq -\mathrm{sgn}(\cos{(\|\mathbf{\omega}\|_1\alpha+\phi(\mathbf{\omega}))})\, \sigma(\mathbf{\hat{\omega}}\cdot\mathbf{x}+\alpha)\end{equation} and \begin{equation}p(\mathbf{\omega},\alpha)\simeq \|\mathbf{\omega}\|_1^2 |\tilde{f}^*(\mathbf{\omega})| |\cos{(\|\mathbf{\omega}\|_1\alpha + \phi(\mathbf{\omega}))}|/\nu \end{equation} and \begin{equation} \tilde{f}^*(\mathbf{\omega})=|\tilde{f}^*(\mathbf{\omega})|e^{i \phi(\mathbf{\omega})} \end{equation} and ν is the normalization constant of p, with \begin{equation}\nu\leq 2\gamma(f^*)~.\end{equation} Using this we can define a Monte Carlo estimator, \begin{equation} f_m(\{\mathbf{\omega},\alpha\},\mathbf{x}) \simeq \frac{\nu}{m}\sum_j^m h(\mathbf{\omega}_j,\mathbf{x},\alpha_j)~, \end{equation} which approximates the effective target function for \begin{equation}\{ \mathbf{\omega}_j,\alpha_j \} \end{equation} drawn from the probability density function p(ω,α). The variance of such simple Monte Carlo estimators is easy to calculate, and so we have a bound on the approximation error of this Monte Carlo estimator of \begin{equation} 4 \gamma^2(f^*)/m ~.\end{equation}
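To make the structure concrete, here is a minimal sketch of the estimator in Python, assuming that samples \begin{equation}\{\mathbf{\omega}_j, \alpha_j\}\end{equation} drawn from p, the phase function φ(ω), and the normalization constant ν are supplied; the names below are illustrative only and are not part of the gbtoolbox API.

```python
import numpy as np

# Minimal sketch (not gbtoolbox) of the Monte Carlo estimator f_m.
# Assumes samples (omega_j, alpha_j) from p, the phase function phi,
# and the normalization constant nu are provided by the user.

def h(omega, x, alpha, phi):
    """One Monte Carlo term: a signed ramp (ReLU) feature."""
    w1 = np.sum(np.abs(omega))                        # Manhattan norm ||omega||_1
    omega_hat = omega / w1                            # direction with unit Manhattan norm
    sign = -np.sign(np.cos(w1 * alpha + phi(omega)))  # sign of the Fourier phase factor
    return sign * np.maximum(omega_hat @ x + alpha, 0.0)

def f_m(omegas, alphas, x, phi, nu):
    """Monte Carlo estimate of the effective target function at a point x."""
    m = len(alphas)
    return (nu / m) * sum(h(w, x, a, phi) for w, a in zip(omegas, alphas))
```

Written this way, f_m is literally a one-hidden-layer ReLU network with m hidden nodes, inner weights of unit Manhattan norm, biases α_j, and outer weights of constant magnitude ν/m, which is the point of the next paragraph.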
This Monte Carlo estimator looks very similar to a shallow neural network with m hidden nodes (which we will call a Barron-E neural network). There are some important differences between Barron-E neural networks and those we work with in standard practice, some of which we can argue would give smaller approximation errors than those given by Barron-E theory. First and most simply, the inner weight parameters of a Barron-E neural network have a Manhattan norm of 1; however, this can be addressed with an easy scale-invariant transform of a standard neural network (a sketch is given below). Also, the integral over the bias is generally going to be much less than 2, but this only results in a smaller bound. Most importantly, the outer weights of a Barron-E neural network are constants whose magnitude depends on the Barron norm. In practice, this suggests using Barron-E theory for applications such as inner-weight initialization rather than as a generalization bound (see my slides above or the gbtoolbox), unless we add a second step in which a large number of nodes with constant outer weights is interpolated down to a smaller number of nodes with non-constant outer weights, or unless we improve the theory.
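Here is a minimal sketch of that scale-invariant transform, assuming a standard two-layer ReLU network with inner weights W, biases b, and outer weights a (again, illustrative names only, not the gbtoolbox API). Because the ReLU is positively homogeneous, each node's inner weights can be rescaled to unit Manhattan norm with the scale absorbed into the outer weight, leaving the network function unchanged:

```python
import numpy as np

# Sketch of the scale-invariant transform: relu(c*z) = c*relu(z) for c > 0,
# so rescaling node j's inner weights by 1/c_j and its outer weight by c_j
# leaves the network unchanged while giving the inner weights unit Manhattan norm.
# (Nodes with all-zero inner weights would need special handling.)

def normalize_inner_weights(W, b, a):
    """W: (m, d) inner weights, b: (m,) biases, a: (m,) outer weights."""
    c = np.sum(np.abs(W), axis=1)   # per-node Manhattan norm of the inner weights
    W_hat = W / c[:, None]          # inner weights with unit Manhattan norm
    b_hat = b / c                   # biases rescaled by the same factor
    a_hat = a * c                   # scale absorbed into the outer weights
    return W_hat, b_hat, a_hat
```

After this transform, a trained network has the same inner-weight normalization as the Monte Carlo estimator above, with the scales pushed into the outer weights.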
The development that I present here isn't meant to be the best possible version of the theory (which doesn't exist yet), but it is a useful presentation of Fourier analytic Barron space theory. This presentation was adapted from reports that I wrote. I intend to write additional blog posts on this topic, and I have a draft on Fourier analytic Barron space theory applications that should be approved this summer.
Tuesday, July 16, 2024
Science and Scientists
I left academia back in 2019 to become a research scientist at Onto Innovation. My title was research scientist, but was I still a scientist?
I read papers and thought up approaches using machine learning and the theory of machine learning to solve problems in semiconductor metrology (primarily for optical critical dimension (OCD) and a little X-ray critical dimension (XCD) metrology). I also worked on simulation. I mostly did applied research, looking to put ideas together from papers that could be used for OCD.
However, my work's results were primarily trade secrets. Some were turned into preliminary patents, and some were handed off to engineers to be put into products, but none of my results were communicated to other scientists. This was true even internally: I made presentations, but the other research scientists at Onto Innovation were not interested unless the results were relevant to their own work at the time. I did not go to conferences.
And I felt frustrated.
In January of 2022, I left to focus on Euler Scientific. Our main project was research, with a strong (maybe too strong) basic research component. But since our main customer was the Department of Defense, and because our main goals were to produce an application (the basic research was supposed to turn into applied research and then into an alpha application) and to build a successful company (and so gain more customers), there was no discussion outside of the small number of involved scientists (really only myself and two from Fermi National Laboratory with minimal time commitments), and I didn't attend any conferences. There was an agreement to write papers, but without dedicated time and effort, they have been slow: two are in the editing process (one needs to be re-edited for submission to another journal and one needs to be submitted for the first time), two more are almost finished, and an additional one requires more work.
Was I still a scientist?
I spent the last eight months focused on finding new employment. Despite limited funds, I attended a conference I was invited to: the Erice International School on Complexity: the XVIII Course "Machine Learning approaches for complexity". There I realized what I had been missing. Since the fall of 2017 (when I took parental leave, which became a parental-leave sabbatical in the spring of 2018 and continued until I left academia in 2019), I had not communicated my research to other scientists outside my collaborators. Science cannot be done alone; it must be communicated. That was what I had been missing.
I am not sure whether I will still be a scientist in my next career step. I think that being a scientist outside of laboratories and academia is a privilege, and one that I cannot maintain. I look forward to bringing my work to customers, and my title will be engineer.
Sunday, July 14, 2024
Entrepreneurship
For most of the past three years, I have been engaged in a new adventure: trying to start a company. I was not, and am not, a natural for this endeavor as I am a scientist and not business-minded.
I left my former employer, Onto Innovation, in January 2022 to lead Euler Scientific and its efforts, primarily to develop a toolbox to enable the interpretability of neural networks, including a bound on the generalization error (the difference between the neural network's performance on the training data and its performance on some unseen test dataset).
In 2023, we attempted to pivot, and in the winter of 2023, I shifted my focus to finding new employment (after releasing an alpha toolbox). Finally, in the summer of 2024, I found a new position.
Looking back at my time as an entrepreneur, I really needed to be more customer-focused in 2022. I was focused on solving technical problems, which were interesting and genuinely needed my attention, but I also needed to be focused on customers. Doing both required more time than I had available.
The other thing I needed, and which I have also needed to find new positions (both now and in the past), is a good network. Most entrepreneurs, especially those focused on business-to-business sales, use their networks to find their customers. My network is too international and too academic to be useful for finding business-to-business sales.
Being an entrepreneur at this time is not right for me. Before I consider stepping out into entrepreneurship again, I think that I need to have a large network, not just ideas and technical ability (and even investors), to serve as a seed for business-to-business sales.