Thursday, July 18, 2024

Fourier analytic Barron Space theory

I recently gave a talk about applications of Fourier analytic Barron space theory (or Barron-E theory for short) at the Erice International School on Complexity (XVIII Course, "Machine Learning approaches for complexity") and have released a toolbox to enable such applications: gbtoolbox. But what is Fourier analytic Barron space theory?

Fourier analytic Barron space theory combines a theoretical understanding of the approximation error of neural networks (the difference between the prediction of the best neural network in the function space of neural networks with some given set of hyperparameters and the true value) with a theoretical understanding of the estimation error of the machine learning algorithm (how much data is required to distinguish one function from another in that function space) by way of the path norm. I refer to it as a theory, but really, there are several subtly different theories in the literature, and I have not seen a final theory yet (I intend to shortly submit a paper with my own slightly different version, which I also don't think is the final theory). The specific purpose of the theory was to develop an a priori bound on the generalization error.

The theory was first presented in a complete form in "A priori estimates of the population risk for two-layer neural networks" by E, Ma, and Wu (see also "Towards a Mathematical Understanding of Neural Network-Based Machine Learning: What We Know and What We Don't" with Wojtowytsch). A good understanding of machine learning theory is required to follow it (I recommend "Understanding Machine Learning: From Theory to Algorithms" by Shalev-Shwartz and Ben-David), and I think the original works by Barron (later with his collaborator Klusowski) are also required: "Universal approximation bounds for superpositions of a sigmoidal function" and "Risk Bounds for High-dimensional Ridge Function Combinations Including Neural Networks". The path norm used by E, Ma, and Wu was introduced in "Norm-Based Capacity Control in Neural Networks" by Neyshabur, Tomioka, and Srebro.
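To make the path norm concrete: for a two-layer ReLU network, the path norm is the sum over hidden nodes of the absolute outer weight times the Manhattan norm of that node's inner weights. Here is a minimal sketch in NumPy (my own illustration, not code from the gbtoolbox; whether the bias is included in the norm varies by convention, and I include it here):

import numpy as np

def path_norm(W, b, a):
    # Path norm of a two-layer ReLU network
    #   f(x) = sum_j a[j] * relu(W[j] . x + b[j])
    # computed as sum_j |a[j]| * (||W[j]||_1 + |b[j]|).
    # W: (m, d) inner weights, b: (m,) biases, a: (m,) outer weights.
    return np.sum(np.abs(a) * (np.sum(np.abs(W), axis=1) + np.abs(b)))

# toy example: m = 3 hidden nodes, d = 2 features
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
b = rng.normal(size=3)
a = rng.normal(size=3)
print(path_norm(W, b, a))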

One obvious way that the theory is incomplete is the absence of optimization error (the difference between the best possible neural network and the neural network actually found by the training procedure defined by the hyperparameters).

In this first post, I will discuss how I think of the approximation part of the theory. My slides are also available at the link above.

Consider the task of a neural network. Basically, you have some data \begin{equation}\{\mathbf{x}_k,y_k\}\end{equation} where k identifies the data point and \begin{equation}\mathbf{x}_k\in [-1,1]^d\end{equation} where d is the number of features. We are essentially assuming that there is some function \begin{equation}f(\mathbf{x}) \text{ such that } f(\mathbf{x}_k)=y_k~.\end{equation}

In Barron-E theory, the task for the neural network is to approximate the effective target function, which is an extension of f(x) from the domain of the data to all of \begin{equation}\mathbb{R}^d~,\end{equation} selected so as to minimize the Barron norm \begin{equation} \gamma(f^*) = \inf_{f^*} \int \|\mathbf{\omega}\|_1^2 |\tilde{f}^*(\mathbf{\omega})| d\mathbf{\omega} < \infty\end{equation} where the infimum runs over all extensions consistent with the data, \begin{equation}\tilde{f}^*(\mathbf{\omega})\end{equation} is the Fourier transform of the extension, and \begin{equation}\|\mathbf{\omega}\|_1=\sum_j |\omega_j| \end{equation} is the Manhattan norm.
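As a quick sanity check of the definition (a worked example of my own, with the factors of 2π left off as below): for a single plane wave, the Fourier transform is a pair of Dirac deltas, and the Barron norm can be read off directly, \begin{equation} f^*(\mathbf{x}) = \cos(\mathbf{\omega}_0\cdot\mathbf{x}) \implies |\tilde{f}^*(\mathbf{\omega})| = \tfrac{1}{2}\left[\delta(\mathbf{\omega}-\mathbf{\omega}_0)+\delta(\mathbf{\omega}+\mathbf{\omega}_0)\right] \implies \gamma(f^*) \leq \|\mathbf{\omega}_0\|_1^2 ~.\end{equation} Low-frequency functions are cheap in this norm; high-frequency functions are expensive.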

There are many mathematical subtleties here that I am going to ignore. One could imagine selecting some set of points from some true function that is nowhere continuous. The effective target function defined above would likely not match the true function anywhere, and Barron-E theory would not be applicable. But for a true function with a finite number of discontinuities, we would expect that, for the data points we use to define the effective target function, we could find an extension that both has a finite Barron norm and is almost certain to match the test points to which we want to apply the function. This representativeness of the training data and test data is a concern of the machine learning theory of estimation error, so we will move on for now (but may return to it in a later post).

Since we can define a Fourier transform of the effective target function, \begin{equation}\tilde{f}^*(\mathbf{\omega})~,\end{equation} we then have \begin{equation} f^*(\mathbf{x}) \simeq \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-i\mathbf{x}\cdot \mathbf{\omega}} e^{i\mathbf{y}\cdot \mathbf{\omega}} f^*(\mathbf{y})  d\mathbf{y}d\mathbf{\omega}\end{equation} where we have left off factors of 2π. Then, for well-behaved effective target functions, \begin{equation}  \int \|\mathbf{\omega}\|_1^2 \sigma(\mathbf{\hat{\omega}}\cdot \mathbf{x} + \alpha )e^{i \|\mathbf{\omega}\|_1 \alpha } d\alpha \simeq e^{-i \mathbf{\omega} \cdot \mathbf{x}} ~, \end{equation} where σ(y) is the ramp function (the ReLU activation) and \begin{equation}\mathbf{\hat{\omega}} =  \mathbf{\omega} /  \|\mathbf{\omega}\|_1 ~. \end{equation} Applying this, once more leaving off factors of 2π, we have \begin{equation}   f^*(\mathbf{x}) \simeq \int \int \tilde{f}^*(\mathbf{\omega}) \|\mathbf{\omega}\|_1^2 \sigma(\mathbf{\hat{\omega}}\cdot \mathbf{x} + \alpha )e^{i \|\mathbf{\omega}\|_1 \alpha } d\alpha\, d\mathbf{\omega} ~.\end{equation}
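The way I convince myself of the ramp-function identity (a heuristic of my own, ignoring boundary terms, with the overall sign eventually absorbed into the sign factor in h below): integrate by parts twice in α and use the fact that the second distributional derivative of the ramp function is a Dirac delta, \begin{equation} \frac{d^2}{d\alpha^2}\sigma(\mathbf{\hat{\omega}}\cdot\mathbf{x}+\alpha) = \delta(\mathbf{\hat{\omega}}\cdot\mathbf{x}+\alpha)~, \end{equation} so that \begin{equation} \int \|\mathbf{\omega}\|_1^2\, \sigma(\mathbf{\hat{\omega}}\cdot\mathbf{x}+\alpha)\, e^{i\|\mathbf{\omega}\|_1\alpha} d\alpha \simeq -\int \delta(\mathbf{\hat{\omega}}\cdot\mathbf{x}+\alpha)\, e^{i\|\mathbf{\omega}\|_1\alpha} d\alpha = -e^{-i\|\mathbf{\omega}\|_1\,\mathbf{\hat{\omega}}\cdot\mathbf{x}} = -e^{-i\mathbf{\omega}\cdot\mathbf{x}}~. \end{equation}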

This is very suggestive, especially when we recall that we aren't interested in functions defined on all of \begin{equation}\mathbb{R}^d\end{equation} but only in functions defined over the domain of the data. We have \begin{equation}f^*(\mathbf{x}) \simeq \int\int h(\mathbf{\omega},\mathbf{x},\alpha) p(\mathbf{\omega},\alpha) d\mathbf{\omega} d\alpha\end{equation} where \begin{equation}h(\mathbf{\omega},\mathbf{x},\alpha) \simeq -\mathrm{sgn}(\cos{(\|\mathbf{\omega}\|_1\alpha+\phi(\mathbf{\omega}))}) \sigma(\mathbf{\hat{\omega}}\cdot\mathbf{x}+\alpha)\end{equation} and \begin{equation}p(\mathbf{\omega},\alpha)\simeq \|\mathbf{\omega}\|_1^2 |\tilde{f}^*(\mathbf{\omega})| |\cos{(\|\mathbf{\omega}\|_1\alpha + \phi(\mathbf{\omega}))}|/\nu \end{equation} and \begin{equation} \tilde{f}^*(\mathbf{\omega})=|\tilde{f}^*(\mathbf{\omega})|e^{i \phi(\mathbf{\omega})} \end{equation} and \begin{equation}\nu\leq 2\gamma(f^*)~,\end{equation} with ν the normalization of p. Using this we can define a Monte Carlo estimator, \begin{equation}     f_m(\{\mathbf{\omega},\alpha\},\mathbf{x}) \simeq \frac{1}{m}\sum_j^m h(\mathbf{\omega}_j,\mathbf{x},\alpha_j)  \end{equation} which approximates the effective target function for \begin{equation}\{ \mathbf{\omega}_j,\alpha_j \} \end{equation} drawn from the probability density function p(ω,α). The variance of such simple Monte Carlo estimators is easy to calculate, and so we have a bound on the (squared) approximation error of this Monte Carlo estimator of \begin{equation} 4 \gamma^2(f^*)/m ~.\end{equation}
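That 1/m scaling is just the generic behavior of simple Monte Carlo. Here is a minimal, self-contained illustration (a toy integrand of my own choosing, not the h and p above) showing the mean squared error of an m-sample Monte Carlo estimate of an expectation falling off as 1/m:

import numpy as np

rng = np.random.default_rng(0)

# toy target: E[relu(U)] for U ~ Uniform(-1, 1) is exactly 1/4
true_value = 0.25

for m in (10, 100, 1000, 10000):
    # 500 independent m-sample Monte Carlo estimates
    samples = rng.uniform(-1.0, 1.0, size=(500, m))
    estimates = np.maximum(samples, 0.0).mean(axis=1)
    mse = np.mean((estimates - true_value) ** 2)
    print(f"m = {m:6d}   MSE = {mse:.2e}   m * MSE = {m * mse:.3f}")

The last column settles to a constant (the variance of the integrand), which is the role that γ²(f*) plays in the bound above.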

This Monte Carlo estimator looks very similar to a shallow neural network with m hidden nodes (which we will call a Barron-E neural network). There are some important differences between Barron-E neural networks and those we work with in standard practice, some of which we can argue would give smaller approximation errors than those given by Barron-E theory. First and most simply, the inner weight parameters of a Barron-E neural network have a Manhattan norm of 1. However, this can be addressed with an easy scale-invariant transform of a standard neural network. Also, the integral over the bias is generally going to be much less than 2, but that only results in a smaller bound. Most importantly, the outer weights of a Barron-E neural network are constants that depend on the Barron norm. In practice, this suggests using Barron-E theory for applications such as inner weight initialization rather than as a generalization bound (see my slides above or the gbtoolbox), unless we add some second step where we take a large number of nodes with constant outer weights and interpolate down to a smaller number of nodes with non-constant outer weights. Or we improve the theory.
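The scale-invariant transform relies on the positive homogeneity of the ramp function, σ(cu) = cσ(u) for c > 0: divide each node's inner weights and bias by that node's Manhattan norm and multiply its outer weight by the same factor, leaving the network's output unchanged. A minimal sketch of this rescaling (my own, excluding the bias from the norm to match the convention for ω̂ above):

import numpy as np

def normalize_inner_weights(W, b, a):
    # Rescale a two-layer ReLU network
    #   f(x) = sum_j a[j] * relu(W[j] . x + b[j])
    # so that each row of W has Manhattan norm 1, without changing f(x).
    # Uses relu(c * u) = c * relu(u) for c > 0 (assumes no all-zero rows).
    c = np.sum(np.abs(W), axis=1)          # per-node Manhattan norms
    return W / c[:, None], b / c, a * c

# quick check that the network's output is unchanged
rng = np.random.default_rng(0)
W, b, a = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=3)
W2, b2, a2 = normalize_inner_weights(W, b, a)
x = rng.normal(size=2)
f = lambda W, b, a: np.sum(a * np.maximum(W @ x + b, 0.0))
print(np.isclose(f(W, b, a), f(W2, b2, a2)))   # True
print(np.sum(np.abs(W2), axis=1))              # all ones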

The development I present here isn't meant to be the best theory (which doesn't exist yet), but it is a useful way of presenting Fourier analytic Barron space theory. It was adapted from reports that I wrote. I intend to write additional blog posts on this topic, and I have a draft paper on applications of Fourier analytic Barron space theory that should be approved this summer.

Tuesday, July 16, 2024

Science and Scientists

I left academia back in 2019 to become a research scientist at Onto Innovation. My title was research scientist, but was I still a scientist?

I read papers and thought up approaches using machine learning and the theory of machine learning to solve problems in semiconductor metrology (primarily for optical critical dimension (OCD) and a little X-ray critical dimension (XCD) metrology). I also worked on simulation. I mostly did applied research, looking to put ideas together from papers that could be used for OCD.

However, my work's results were primarily trade secrets. Some were turned into provisional patents, and some were handed off to engineers to be put into products. None of my results were communicated to other scientists. This was true even internally: I made presentations, but the other research scientists at Onto Innovation were not interested unless the results were relevant to their work at the time. I did not go to conferences.

And I felt frustrated.

In January of 2022, I left to focus on Euler Scientific. Our main project was research, with a strong (maybe too strong) basic research component. But since our main customer was the Department of Defense, and since our main goals were to produce an application (the basic research was supposed to turn into applied research and then into an alpha application) and to have a successful company (and so more customers), there was no discussion outside the limited number of involved scientists (really only myself and two others with minimal time commitments at Fermi National Laboratory), and I didn't attend any conferences. There was an agreement to write papers, but without dedicated time and effort, they have been slow: two are in the editing process (one needs to be re-edited for submission to another journal and one needs to be submitted for the first time), two more are almost finished, and an additional one requires more work.

Was I still a scientist?

I spent the last 8 months focused on finding new employment. Despite funds being limited, I attended a conference I was invited to: the Erice International School on Complexity (XVIII Course, "Machine Learning approaches for complexity"). There I realized what I had been missing. Since the fall of 2017 (when I took parental leave, which became a parental leave sabbatical in the spring of 2018 that continued until I left academia in 2019), I had not communicated my research to other scientists (non-collaborators). Science cannot be done alone; it must be communicated. That was what I had been missing.

I am not sure whether I will still be a scientist in my next career step. I think that being a scientist outside of laboratories and academia is a privilege, and one that I cannot maintain. I look forward to bringing my work to customers, and my title will be engineer.

Sunday, July 14, 2024

Physics heroes, classes and graduate school

I wrote most of this five years ago but didn't publish it because I was still thinking about it. I am sharing it now, including my original opinions, even though my opinions have since changed. My new opinions are given in the last three paragraphs.

Over the years, I have thought about physics heroes. A lot of people love Feynman, and while I enjoyed his autobiography and a professor I TAed for once compared me to him, I never really considered him my hero. The same goes for many other physicists. I think that Einstein and Newton were my heroes as I started to pursue physics, but over the last 15 years, I have discovered that I consider Freeman Dyson a hero, and I have since read several of his books. While I understand the argument that heroes hurt science, I also think that they can do a lot of good, and not just by providing inspirational role models like Jim Gates.

When I was a freshman, Freeman Dyson visited my college. He taught a class for non-majors and gave a couple of lectures for the physics students. One I attended had several of us, including Dyson, leave the lecture hall to go to the theater and watch The Matrix. One thing he said at the time stuck with me, at least as a concept (since the exact words didn't): physics is something you do, not just something you study; you need to get involved in research and not just take classes.

I didn’t truly understand and internalize this idea until I almost dropped out of my third year of graduate school. It has become one of my guiding philosophies as a physicist and physics professor. 

I have observed that online graduate degrees are popular (universities withstood MOOCs but risk being outwitted by OPMs). I don't see the point of them. Even a non-lab undergraduate degree loses a lot of its value by being online only, and graduate degrees lose most of their value. I think a good undergraduate degree should be 70-80% coursework, a master's degree should be 30-50% coursework, and a PhD should be around 10% coursework. The non-coursework component can be done with industrial mentors instead of academic mentors, but the good industrial mentors will generally be at the same locations as the good academic mentors. Who will do the legwork of finding industrial mentors in a location without academic mentors, and how will that mentorship be validated?

I think the real signal with these online graduate degrees is that new things have been learned. But that isn’t the purpose of a graduate degree.

Since I graduated with my PhD, I have continually learned new things and worked in new fields. I have never taken a course; I just read papers (and sometimes books) to understand where a field is or to find a good technique. I think that instead of doing this, many people are taking a master's degree (and spending money on it). They do get a certificate that others can see, but they don't get the deep knowledge that traditionally comes from a master's (or PhD).

This opinion of mine has changed.

In the last 8 months, I have searched for a new position in industry. The requirements for finding a software-engineering-adjacent position have changed since I left academia for industry in 2019. I did not get the interviews I expected and ran into rounds of coding assessments that were well beyond my level (especially 8 months ago, when I received my first interview at a top AI startup).

I didn't pursue an online graduate degree, but if I had had the funds, I would have, and it would have benefited me: both as a signal for recruiters and hiring managers and because, while I have self-studied a lot and followed free online self-study courses (without reputable certificates) like those found at CodeSignal and NeetCode, it would have been helpful to have the direction of a professor.

So, my position has changed because I now think the signal is important and valuable. I may still never do an online master's. But if I had had the assets to do one in the last year, I would have, and it would have been beneficial for me.

Entrepreneurship

For most of the past three years, I have been engaged in a new adventure: trying to start a company. I was not, and am not, a natural at this endeavor, as I am a scientist and not business-minded.

I left my former employer, Onto Innovation, in January 2022 to lead Euler Scientific and its efforts, primarily to develop a toolbox to enable the interpretability of neural networks, including a bound on the generalization error (the difference between the neural network's error on the training data and its error on some unseen test dataset).

In 2023, we attempted to pivot, and in the winter of 2023, I shifted my focus to finding new employment (after releasing an alpha toolbox). Finally, in the summer of 2024, I found a new position.

Looking back at my time as an entrepreneur, I really needed to be more customer-focused in 2022. I was focused on solving technical problems, which were compelling and did need solving, but I also needed to focus on customers. Doing both required more time than I had available.

The other thing I needed, both as an entrepreneur and to find new positions (now and in the past), is a good network. Most entrepreneurs, especially those focused on business-to-business sales, use their networks to find customers. My network is too international and too academic to be useful for business-to-business sales.

Being an entrepreneur at this time is not right for me. Before I consider stepping out into entrepreneurship again, I need a large network to serve as a seed for business-to-business sales, not just ideas and technical ability (or even investors).