fwilliams 5 years ago

It's worth noting that the primary result of this paper concerns only the error on the training data under empirical risk minimization. Zero training error =/= a model that generalizes. For any optimization problem, you can always add enough parameters to achieve zero error over a finite training set (imagine introducing enough variables to fully memorize the map from inputs to labels; a sketch of that construction follows below).
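
A minimal sketch of that memorization construction (the sizes and the RBF-style "bump" units are my arbitrary choices, not the paper's): one hidden unit per training point drives the training error to exactly zero, whatever the labels.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))    # 100 training points in 5 dimensions
    y = rng.normal(size=100)         # arbitrary real-valued labels

    def memorizer(x, sharpness=1e3):
        d2 = ((x - X) ** 2).sum(axis=1)          # squared distance to each training point
        h = np.maximum(1.0 - sharpness * d2, 0)  # one ReLU "bump" unit per sample
        return h @ y                             # output layer reads the stored label back

    print(max(abs(memorizer(X[i]) - y[i]) for i in range(100)))  # 0.0 for this seed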

The major contribution of the work is showing that a ResNet needs a number of parameters polynomial in the dataset size to converge to a global optimum, in contrast to traditional neural nets, which require an exponential number of parameters.

  • sytelus 5 years ago

    There is a bit of a difference between fitting a dataset to some convenient parameterized function and finding the global minimum of a non-convex function. Also, the paper claims that this can be done in polynomial time.

    > The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet).

    • p1esk 5 years ago

      > There is a bit of a difference between fitting a dataset to some convenient parameterized function and finding the global minimum of a non-convex function

      What's the difference? Any point where the loss is zero is a global minimum.

  • MAXPOOL 5 years ago

    The way to tackle the problem you state would be to find similar bounds with regularization.

iandanforth 5 years ago

My hope is that as these bounds are refined we can start to do back-of-the-envelope calculations such as, "I have 50k training images of 512x512x3 and 1k classes; this means I'll need a convolutional ResNet of at most 12 layers and 12M params to fit the training data, so let's start at half that," rather than today's approach of "let's use ResNet-101 and see if that works."

  • trevyn 5 years ago

    You have to include some sense of what the classes encode for this to have meaning — for example, “pictures of correct mathematical proofs” vs “pictures of incorrect mathematical proofs” is going to require a much different architecture than “pictures of squares” vs “pictures of circles”.

    • iandanforth 5 years ago

      Interesting classes! To use the heuristic Andrew Ng proposed: if a human can tell the difference between correct and incorrect proofs in 1 second, then this problem is likely no harder than most image-recognition problems. If, instead, we're talking about analysis that requires symbolic manipulation, then we're pretty far outside the current capabilities of the convolutional/residual/fully connected nets for which the paper provides bounds.

ramgorur 5 years ago

I did not understand the paper very well.

1. It's theoretically impossible to guarantee convergence to a global optimum using gradient descent if the function is non-convex.

2. The only way to guarantee it is to start gradient descent from different points in the search space, or to try different step sizes if the algorithm always starts from the same point.

3. Also, does "achieving zero training loss" mean the network has converged to the global optimum? I was under the impression that you can get zero training loss even at a local minimum.

Please correct me if I am wrong.

  • LolWolf 5 years ago

    1. > It's theoretically impossible to guarantee a convergence to global optima using gradient descent if the function is non-convex.

    This is false. See, e.g., [0][1].

    2. I'm not really sure what the question is here.

    3. If your loss is bounded from below by 0 (it is a squared norm) and you achieve 0 loss, this means that 0 is a global optimum, since, by definition, no other objective value can be smaller.

    ---

    [0] Theorem A.2 in Udell's Generalized Low-Rank models paper https://arxiv.org/pdf/1410.0342.pdf

    [1] B&V Convex Optimization (https://web.stanford.edu/~boyd/cvxbook/), Appendix B.1. In fact, I can't find the reference right now, but you can easily prove that GD with an appropriate step-size converges to a global optimum on this problem when initialized at (0,0), even though the problem is non-convex.

    • DoctorOetker 5 years ago

      that reference you can't find right now seems rather pertinent?

      I think the OP intended and should have written:

      "It's theoretically impossible to guarantee a convergence to global optima using gradient descent for an arbitrary non-convex function."

      For example, consider the function f(x) = sin^2(pi*x) + sin^2(pi*N/x). This function has multiple global minima at the divisors of N, where f(x) == 0; if x or N/x is non-integer, f(x) is guaranteed to be positive... (a quick numerical sketch follows below).

      I am not taking a stance on whether gradient descent does or does not guarantee finding global minima and is thus able to factorize cryptographic-grade RSA products of primes, but the claim does appear to imply it.

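      A quick numerical sketch of that function (N and the step size are arbitrary choices here): plain gradient descent lands in whichever basin it starts in, and reaches f(x) = 0 only if it happens to start near a divisor of N.

          import numpy as np

          N = 15  # divisors 1, 3, 5, 15 are the global minima of f

          def f(x):
              return np.sin(np.pi * x) ** 2 + np.sin(np.pi * N / x) ** 2

          def grad_f(x):
              # d/dx sin^2(pi x) = pi sin(2 pi x); chain rule for the second term
              return np.pi * np.sin(2 * np.pi * x) - (np.pi * N / x ** 2) * np.sin(2 * np.pi * N / x)

          def gd(x, lr=1e-3, steps=20000):
              for _ in range(steps):
                  x -= lr * grad_f(x)
              return x

          for x0 in (2.6, 4.0, 7.2):
              x = gd(x0)
              print(f"start {x0} -> x = {x:.4f}, f(x) = {f(x):.2e}")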

      • LolWolf 5 years ago

        > that reference you can't find right now seems rather pertinent?

        Here it is: https://arxiv.org/pdf/1707.08706.pdf (This isn't quite the one I was thinking of, so I'll dig a little deeper, but it covers the idea).

        Some slightly more technical conditions have to hold in order for vanilla GD to work (since the function is non-differentiable at some points), but a (very!) slightly generalized variant for convex functions—sub-gradient descent—works.

        > for an arbitrary non-convex function.

        Sure, but this is also obvious, since it is NP-hard to reach global optima in arbitrary non-convex problems. Additionally, specifically in the case of GD, I can give simple examples that always fail (consider f(x) = 1 everywhere except f(0) = 0: GD always fails whenever the initial point x_0 ≠ 0, since the gradient is zero everywhere except at one point. Additionally, picking initializations randomly, we reach the global minimum with probability 0 whenever the initialization distribution has support with non-empty interior).

        I'm afraid I disagree that this is what the OP intended, though, and I also disagree that the paper's claim implies what you've said, since they only study a very specific subproblem (e.g. minimization of empirical loss on ResNets).

        The relative "ease" of this task vs solving arbitrary NP-hard problems is not difficult to believe since, given a bunch of training examples, I can always generate a ResNet that fits those examples perfectly (i.e. with zero loss) in poly-space in a very dumb way: first, generate a circuit that matches the look-up table of the training samples (which is poly-size in the number of samples and can be done in poly-time), then map that circuit to an NN.

        • DoctorOetker 5 years ago

          >Some slightly more technical conditions have to hold in order to have vanilla GD work (since the function is non-differentiable at points)

          Which function is non-differentiable at some points? If you refer to my example, it is only non-differentiable at x = 0 and x = inf, which are both uninteresting points since they aren't divisors of N; everywhere else, the f(x) I gave is differentiable (indeed infinitely differentiable).

          This is in contrast to your pathological example of f(x) = { 1 (x != 0); 0 (x == 0) }... of course GD cannot work there, and I wouldn't fault the paper for it...

          Don't misunderstand me, the paper is interesting, but the title and certain phrasings are very misleading IMHO.

          Still, I think the approach taken by others is more interesting: fix an underlying NN as "ground truth", train a second NN of the same architecture to match it, and look at the error between the trained NN and the ground-truth NN.

          • LolWolf 5 years ago

            > which function is non-differentiable at points?

            Sorry, this was referring to the construction provided in the paper referenced.

            I do agree that the title is somewhat misleading, since, when I first read it (and thought, "this is probably wrong"), I imagined that it proved that, given any ResNet, you can show convergence to the global optimum via GD, not just "a ResNet of a given size converges to a global optimum, via GD, for a specific training set."

            That being said, the paper does not prove (nor claim to prove) general, globally-optimal convergence of GD, which is what I think you're saying (given, for example, what you mentioned about finding the factorization of a semiprime in the GGP and your specific function construction)—which is what I was pushing back against a bit. In particular, even in the title, they only claim to prove this for a specific class of problems (i.e. NNs).

            > Still I think the approach by others is more interesting: by looking at the absolute error between a fixed underlying NN as "ground truth" and observing the error of the training NN (of same architecture as ground truth NN) trained to match the underlying NN

            I'm afraid I haven't seen this approach, but it would be interesting. Do you have references?

            • DoctorOetker 5 years ago

              Thanks for your added comments, it's really helpful to see a more candid breakdown of others' views of a paper.

              >I'm afraid I haven't seen this approach, but it would be interesting. Do you have references?

              They are referenced in the paper in the section:

              > Another way to attack this problem is to study the dynamics of a specific algorithm for a specific neural network architecture. Our paper also belongs to this category. Many previous works put assumptions on the input distribution and assume the label is generated according to a planted neural network. Based on these assumptions, one can obtain global convergence of gradient descent for some shallow neural networks [Tian, 2017, Soltanolkotabi, 2017, Brutzkus and Globerson, 2017, Du et al., 2018a, Li and Yuan, 2017, Du et al., 2017b]. Some local convergence results have also been proved [Zhong et al., 2017a,b, Zhang et al., 2018]. In comparison, our paper does not try to recover the underlying neural network. Instead, we focus on the empirical loss minimization problem and rigorously prove that randomly initialized gradient descent can achieve zero training loss.

              I had this idea independently but never pursued it due to lack of time. It's another reason I like this paper: for referencing this approach. I still haven't had time to read those references, but from the description in this section it appears they investigate nearly, if not exactly, what I wanted to investigate "some day" :)

  • nshm 5 years ago

    The title of the paper is really misleading. The comments here are even more misleading. The key is their theorem, where they say "with high probability over random initialization". They initialize many times, and sometimes it converges. A single initialization can get stuck in a local minimum, of course.

    • DoctorOetker 5 years ago

      But then there is very little of interest: (assuming enough smoothness and Lipschitz continuity) one expects every global minimum to have a convex neighbourhood such that gradient descent starting within that neighbourhood reaches the global minimum. "Initialize many times and sometimes it converges" is just saying the obvious: "there exist initial positions for which GD succeeds in finding a global minimum"... is my interpretation correct?

      • nshm 5 years ago

        The contribution of the paper is the estimation of the convergence speed and of the number of parameters needed in the neural network; that seems a valuable point.

        • DoctorOetker 5 years ago

          The paper states for the main result:

          >In this section, we show gradient descent with a constant positive step size converges to the global minimum with a linear rate.

          This is rather ambiguous: it sounds like it guarantees "it WILL converge to the global minimum, and at a linear rate", but I suspect they are really saying "IF it converges to the global minimum, THEN it converges at a linear rate", phrased in a way that purposefully sounds like the first statement.

          Could you comment on whether GD does or does not find the global minimum of the integer-factorization cost function in the following comment?

          https://news.ycombinator.com/item?id=18439287

          • nshm 5 years ago

            Another sin they have is that they write "with high probability" in the theorem but do not strictly define it in the main section of the paper. If you look into the appendix you'll see that they guarantee probability 1 - \delta, and delta affects the convergence. It means that if you add many, many parameters, then you just need a couple of attempts to converge well. So "IF it converges" becomes the much better "VERY OFTEN it converges". Sorry, no real passion to read the paper in detail, so this is just an intuition from a quick look.
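
            (To make that concrete with made-up numbers: if a single random initialization converges with probability 1 - delta, then k independent initializations all fail with probability delta^k.)

                # Hypothetical per-init success probability 1 - delta:
                delta = 0.3
                for k in (1, 2, 5):
                    print(k, 1 - delta ** k)  # 0.7, 0.91, 0.99757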

  • TTPrograms 5 years ago

    1) Just because you can prove convergence for convex functions does not mean you can't prove it for non-convex functions.

    2) This is "a way" not "The only way".

    (If A then B) does not imply (if not A then not B)

    • ramgorur 5 years ago

      >Just because you can prove convergence for convex functions does not mean you can't prove it for non-convex functions.

      I don't understand. How do you prove gradient descent is guaranteed to escape local minima?

      • heyitsguay 5 years ago

        Convex functions aren't the only functions with a single local (and global) minimum - consider sqrt(|x|) for a simple 1d example.
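
        A quick numeric check of that example (the points are chosen arbitrarily): sqrt(|x|) has its single minimum at 0, yet it violates the convexity inequality.

            import numpy as np

            f = lambda x: np.sqrt(np.abs(x))
            x, y, t = 0.0, 1.0, 0.5
            # Convexity would require f((1-t)x + t*y) <= (1-t)f(x) + t*f(y):
            print(f((1 - t) * x + t * y), "vs", (1 - t) * f(x) + t * f(y))  # 0.707 > 0.5: not convex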

        • ramgorur 5 years ago

          Yes, that's true. But in the optimization domain, the concept of "convexity" is understood in terms of sets, not always via the 2nd derivative of a function, because you might have search spaces where you are not able to differentiate the objective function at all. In those cases "convex" means a "convex set".

          • LolWolf 5 years ago

            Sure: you can define convexity for a function that is equivalent to the 2nd derivative definition in the case that the function is twice-differentiable, using only the definition of the convexity of a set.

            Define the epigraph of a function to be the set given by {(x, t) | f(x) ≤ t}. Then, we say f is a convex function iff the epigraph is a convex set.

            This is equivalent (exercise for the reader!) to the usual definition that a function f is convex iff f((1-t)x + ty) ≤ (1-t)f(x) + tf(y), for all 0 ≤ t ≤ 1, with x, y in the domain of f.

            Note that neither of these two definitions require differentiability (or twice-differentiability), but the definitions are equivalent in this case.[0]

            ---

            [0] For proofs of all of these statements see B&V's Convex Optimization.

          • TTPrograms 5 years ago

            When talking about "convex optimization" one nearly always means that both the function and the domain are convex.

            • srean 5 years ago

              No need to define 'convex optimization' here; that follows from the definition of a convex function: F(ax + (1-a)y) <= aF(x) + (1-a)F(y) for 0 <= a <= 1 and x, y in the domain of F. For the inequality to be satisfied, ax + (1-a)y has to be in the domain. That means the domain is convex.

  • RandyRanderson 5 years ago

    FF NNs of even one hidden layer are universal approximators. That is, they do find the global min. What this doesn't tell you is that it's likely a huge graph and will take a looong time to optimize for even trivial data sets. There are lots of proofs around. That's why SGD is used, and for only a small subset of training points at a time.

    Re 2: No.

    Re 3: Yes.

    • simonhughes22 5 years ago

      They are universal approximators, although the term "approximator" means there's an upper bound on how accurately they can emulate a given function (based on the size of the network). However, just because they are universal approximators doesn't mean you can automatically infer the optimal number of connections, and the weight for each of those connections, needed to minimize the loss over some dataset (derived from some function). Being a universal approximator does not imply you can automatically learn the best approximation of a given function. It's the difference between being capable of learning something and having learned it. If that makes sense.

    • srean 5 years ago

      I don't think you understand what universal approximation means. It means there exist parameter settings that reduce the approximation error as much as you want. It's an existential property. It does not mean that those parameters can be found.

      Anyway, this universal property of neural networks gets a lot of airtime and people go gaga over it. It's a complete red herring. It's not the first example of universal approximation, and it won't be the last: there is no scarcity of universal approximators, and there was no such scarcity even hundreds of years ago. The explanation of the success of DNNs lies elsewhere. (A concrete illustration of the existential point follows below.)
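
      To make the existential point concrete, here is a sketch (the target function and grid are arbitrary choices): for a 1-d target you can write down, by hand, a one-hidden-layer ReLU net that interpolates it on a grid; no training involved, which is exactly why universality says nothing about learnability.

          import numpy as np

          f = np.sin                              # any continuous 1-d target
          knots = np.linspace(-3, 3, 25)          # hidden-unit "kink" locations
          relu = lambda z: np.maximum(z, 0.0)

          # Piecewise-linear interpolant of f as f(k_0) + sum_i c_i * relu(x - k_i):
          # each coefficient c_i changes the slope at knot k_i.
          slopes = np.diff(f(knots)) / np.diff(knots)
          coeffs = np.diff(slopes, prepend=0.0)

          def net(x):
              return f(knots[0]) + relu(np.subtract.outer(x, knots[:-1])) @ coeffs

          x = np.linspace(-3, 3, 1000)
          print(np.abs(net(x) - f(x)).max())      # small; shrinks as the grid refines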

charleshmartin 5 years ago

An excellent paper, which uses (some of the) results we have also found studying the weight matrices of neural networks... namely, that they rarely undergo rank collapse:

https://calculatedcontent.com/2018/09/21/rank-collapse-in-de...

But they miss something... the weight matrices also display power-law behavior:

https://calculatedcontent.com/2018/09/09/power-laws-in-deep-...

This is also important because it was suggested in the early 90s that heavy-tailed spin glasses would have a single local minimum.

This fact is the basis of my earlier suggestion that DNNs would exhibit a spin funnel.

gogut 5 years ago

This paper appeared in November, but in fact Allen-Zhu (MSR, http://people.csail.mit.edu/zeyuan/ ) already posted his result in October. This is their first paper, from October: https://arxiv.org/pdf/1810.12065.pdf, and this is their second paper, from November: https://arxiv.org/pdf/1811.03962.pdf. In the October paper, they proved how to train an RNN (which is even harder than a DNN). In the November paper, they proved how to train a DNN. Compared to the October one, the November one is actually much easier. The reason is that in an RNN every layer has the same weight matrix, while in a DNN every layer can have different weight matrices. Originally, they were not planning to write the DNN paper; they did it because someone complained that an RNN is not a multilayer neural network.

In summary, the difference between the MSR paper and this paper is: let H denote the number of layers and m the number of hidden nodes. The MSR paper shows that one only needs to assume m > poly(H) for SGD to find the global optimum. In Du et al.'s work, they have a similar result, but they have to assume m > 2^{O(H)}. Compared to the MSR paper, Du et al.'s paper is actually pretty trivial. (A toy sense of that gap is sketched below.)
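
For a sense of what that gap means, a toy calculation (depth H = 20 and the cubic are arbitrary stand-ins for "poly"):

    # Width requirement under the two assumptions, for depth H = 20:
    H = 20
    print(H ** 3)  # polynomial (e.g. cubic): 8,000 hidden nodes
    print(2 ** H)  # exponential: 1,048,576 hidden nodes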

brentjanderson 5 years ago

Although I'm no expert, isn't this result an incredibly important contribution? This paper claims to prove that:

> The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet).

If this variant of gradient descent is able to reach global minima in polynomial time, and if neural networks are proven to approximate any function, then ostensibly this technique could be used to guarantee the lowest error possible in approximating any function. This seems incredibly important. Can someone correct my reading of the abstract?

  • cosmic_ape 5 years ago

    This is not necessarily a trivial fact, but I wouldn't call it incredible. It says a net trained with gradient descent can fit the input perfectly if it's much larger than the input.

    But this says nothing about generalization, i.e. performance on test set -- which is what we really want.

  • hooloovoo_zoo 5 years ago

    Well, without reading the whole paper, two important things strike me.

    1. Zero training loss is impossible in most networks because the last layer can only reach the targets asymptotically.

    2. Zero training loss means nothing from a practical standpoint. We've had algorithms capable of this for a long time (kNN [k=1], decision trees, etc.).
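
    A quick illustration of that point (sizes are arbitrary; assumes scikit-learn is available): 1-NN trivially reaches zero training error, since every training point is its own nearest neighbor, even with pure-noise labels.

        import numpy as np
        from sklearn.neighbors import KNeighborsClassifier

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 10))
        y = rng.integers(0, 2, size=1000)   # random labels: nothing to learn

        knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
        print(knn.score(X, y))              # 1.0: zero training error, no generalization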

    • cosmic_ape 5 years ago

      Yeah. And they require that the number of parameters grows at least polynomially with the size of the input. In this regime, we also have another algorithm capable of zero loss: linear regression!

    • skeptic_69 5 years ago

      1. People overfit the baby datasets (MNIST) to zero training loss all the time. Maybe you meant a "hard" dataset.

      2. You clearly have no idea what you are talking about. This paper is trying to argue a bit about why neural networks generalize well, by showing with math that a NN under their conditions converges to zero training loss. It isn't remotely meant to be practical. IT IS A THEORETICAL PAPER.

      And comparing it to 1-nearest-neighbors is so so so so so silly it isn't even wrong.

      Edit: #1 is actually an entire research direction in the theory of machine learning, FYI.

      It is possible to get neural networks that massively overfit but still generalize (which is weird).

      https://arxiv.org/pdf/1611.03530.pdf

      That paper was really famous. It showed you can get zero training loss on data when you replace the labels with random noise.

      Edit 2: I am sorry to be harsh. It is just hard to read such arrant nonsense.

      • nilkn 5 years ago

        I don't see how you really addressed the concerns. You say that the paper is "trying to argue a bit about why neural networks generalize well," but in fact I don't see anything in this paper about generalization or test error. The first line of future research under section seven is to look into test error instead of training error:

        "The current paper focuses on the train loss, but does not address the test loss. It would be an important problem to show that gradient descent can also find solutions of low test loss. In particular, existing work only demonstrate that gradient descent works under the same situations as kernel methods and random feature methods [Daniely, 2017, Li and Liang, 2018]."

        • skeptic_69 5 years ago

          Have you heard of something called ERM? Uniform convergence?

          The typical way of showing generalization in ML is to show that if we have some low- or zero-error solution on the training dataset, then, for a large enough dataset, with high probability, the error on our training dataset is close to the error on the real (and unknown) distribution. The first step, which is basically "find a low-error hypothesis on the training data", is called the ERM principle.
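
          (Schematically, the uniform-convergence statement being invoked; this is the standard finite-hypothesis-class bound, not anything specific to the paper. With probability at least 1 - \delta over a training sample S of size m,

              \sup_{h \in \mathcal{H}} \left| L_S(h) - L_{\mathcal{D}}(h) \right|
                  \le \sqrt{ \frac{\log|\mathcal{H}| + \log(2/\delta)}{2m} }

          where L_S is the empirical loss on S and L_{\mathcal{D}} is the loss on the true distribution.)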

          In practice we observe stochastic gradient descent works pretty well in solving the ERM problem and the solutions generalize well (perform well when deployed).

          This is very weird since neural networks are really weird objects with very non-linear and non-convex behavior and gradient descent shouldn't play well with weird bumps and curves and valleys.

          People want to show mathematically that stochastic gradient descent does well on neural networks.

          This paper claims gradient descent is effective at minimizing quadratic loss on the training data.

          If we could improve the results to show that we also have low loss on the true distribution, that might be compelling evidence that gradient descent converges to the minimum-error solution.

          None of this is explicitly stated, since it is a well-understood part of the basic literature in learning theory.

          Showing an algorithm can do ERM on the hypothesis class is the first (and easier) part of showing generalization.

          If you want a good reference that explains this in a more coherent way, I recommend looking at the first 4 chapters of Understanding Machine Learning by Shai Shalev-Shwartz and Shai Ben-David.

          If you still think the comments I was responding to are not totally incoherent, take note of the fact that the very first sentence of the paper is "One of the mysteries in deep learning is random initialized first order methods like gradient descent achieve zero training loss".

      • hooloovoo_zoo 5 years ago

        This is an intriguingly aggressive comment.

        1. No, it's impossible. Actually, the theorems in this paper do not claim to reach zero loss either, as they're all inequalities on the size of the loss. The paper you cite refers to converging to zero loss, as do you in point 2. Perhaps you're referring to error, which is not the loss that is directly optimized.

        2. This paper certainly isn't talking about generalization. It doesn't appear to be mentioned once. Your other paper is talking about generalization. The parent asked if this paper is super important. I gave a reason why it isn't super important for most people.

        3. Massively overfitting is antithetical to generalizing. Overfitting means fitting to the extent that you're generalizing less well.

        • skeptic_69 5 years ago

          1. Mmmmmmmmm, ok, I am willing to accept you meant the quadratic loss instead of 0-1 error. That seems reasonable.

          2. This paper is centered in a research thrust that IS focused on generalization. See my comment below.

          I don't know who "most people" are, but this paper COULD be important in understanding why stochastic gradient descent works well in practice.

          Personally I doubt it very much.

          3. Massively overfitting to the training dataset BUT generalizing well is a real phenomenon, and yes, it is very weird. It happens in deep nets and, I believe, AdaBoost, i.e. continuing to train after you have zero 0-1 loss. I agree this is a weird way to communicate this idea, but that is what the community uses.

  • Xcelerate 5 years ago

    The paper claims to reach the global minimum of a given neural network in polynomial time. The time complexity of constructing a neural network that approximates any function is an entirely different matter. I'm not even sure how one would begin to approximate a highly algorithmic process (e.g., a hash function) using a neural network.

    • LolWolf 5 years ago

      Re: "I’m not even sure how one would begin to approximate a highly algorithmic process (e.g., a hash function) using a neural network"

      You build the circuit corresponding to the function and map it to an NN. Can this be discovered easily via GD? Absolutely no clue (though this paper says "yes"), but is it possible to approximate it? Yes, you can nail it exactly in a polynomial number of layers (if the algorithm takes poly-space, which is a necessary condition for it to run in poly-time).
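
      As a toy illustration of that circuit-to-NN mapping (the gate encodings and the XOR example are my choices): each boolean gate becomes a single threshold unit, so any poly-size circuit becomes a poly-size network.

          # Each gate is one unit computing step(w.x + b).
          step = lambda z: float(z > 0)  # Heaviside threshold activation

          AND = lambda a, b: step(a + b - 1.5)
          OR  = lambda a, b: step(a + b - 0.5)
          NOT = lambda a:    step(0.5 - a)

          # XOR as a two-layer sub-network: (a OR b) AND NOT(a AND b)
          XOR = lambda a, b: AND(OR(a, b), NOT(AND(a, b)))

          for a in (0, 1):
              for b in (0, 1):
                  print(a, b, "->", int(XOR(a, b)))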

  • jey 5 years ago

    I interpret it as saying: "non-convexity ain't no big deal in this context", and this is the latest contribution in a line of other "no spurious local minima" machine learning theory research. It's interesting for showing that it applies to a rather complex model like ResNet.

    • cosmic_ape 5 years ago

      It's quite important that it's not so much about "non-convexity"; it's more about "random initialization". All these results, and there are others, more sophisticated, basically say that after random init everything is already good, and gradient descent just has to train the highest layer without screwing up the previous ones. They are all pretty naive that way.

juskrey 5 years ago

How do you gradually detect a Dirac stick?

pj_mukh 5 years ago

With such a wide distribution of ML algorithms in use, it's still funny to see papers begin with "One of the mysteries of deep learning is that..." and then go on to lay out the multiple ways we have no idea why some of these DL techniques work.

orf 5 years ago

> One of the mysteries in deep learning is random initialized first order methods like gradient descent achieve zero training loss, even if the labels are arbitrary

Can someone expand on this? I've never heard of this before, at least not in the general case.

  • currymj 5 years ago

    The paper “Understanding Deep Learning Requires Rethinking Generalization” was where this was first pointed out, I think.

    They shuffled the labels on their datasets, so there can't possibly be anything to learn, yet they got zero training loss, meaning the network must be severely overfitting. Yet the same network trained with the actual labels shows quite good generalization. So the usual intuition about overfitting and the bias-variance tradeoff doesn't seem to apply. (A toy version of the experiment is sketched below.)
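
    A toy version of that randomization test (sizes and hyperparameters are made up; assumes scikit-learn): an over-parameterized MLP still drives training error to zero on pure-noise labels by memorizing them.

        import numpy as np
        from sklearn.neural_network import MLPClassifier

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 20))
        y = rng.integers(0, 2, size=200)    # labels are pure noise

        mlp = MLPClassifier(hidden_layer_sizes=(512,), max_iter=5000, tol=0.0, random_state=0)
        mlp.fit(X, y)
        print("train accuracy:", mlp.score(X, y))  # typically 1.0: pure memorization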

    • elcomet 5 years ago

      This seems quite intuitive to me:

      When you have nothing to learn, you need to memorize the data. But when there is structure, it is easier to memorize the structure, so the network will learn this first (and will memorize after).

      • srean 5 years ago

        But that's the crux of the question: why and how does it not just memorize, when we know it can do so easily?

        • elcomet 5 years ago

          Maybe just because it's easier to find patterns than to memorize (if you have a lot of data).

          • currymj 5 years ago

            That sounds like it’s probably right to me. But so do lots of things that turn out to be wrong. I wish we had a better grasp of what is happening, not just plausible stories. I’m already sick of doing alchemical tinkering to find a model that works.

            • elcomet 5 years ago

              Yeah, I agree it would be nice to have some theoretical guarantees on the architecture we need, based on the problem and the size of the data.

lbj 5 years ago

I can't claim to fully understand the proof, but these guys have done an amazing job in terms of furthering our understanding of deep nets.