Newton’s Method — Best Optimization technique in Neural networks and Why?
Newton’s method is an alternative to gradient descent, and it is also very fast and powerful.
In principle, Newton’s method is used to find the zeros of a function, but we can adapt it slightly for optimization. It also extends to multiple variables.
Consider a function of one variable; its graph is shown below.
Our goal is to find the zero, the point where the curve crosses the x-axis, much as in gradient descent our goal is to find the minimum.
Pick any point on this function, say x₀, draw the tangent line there, and see where it hits the horizontal x-axis. Call this point x₁, and notice that x₁ is much closer to the zero than x₀ was. Iterating in the same way, by the 3rd iteration we are essentially at the point we want.
How do we find x₁? The slope of the tangent at the first point is f’(x₀). Writing the slope as rise over run, f’(x₀) = f(x₀) / (x₀ − x₁), and rearranging, we get x₁ = x₀ − f(x₀)/f’(x₀).
We already have x₀, so we can calculate x₁ from this. Generalizing this formula, we get the update xₙ₊₁ = xₙ − f(xₙ)/f’(xₙ).
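To make the update concrete, here is a minimal sketch in Python; the function f(x) = x² − 2 and its derivative are illustrative choices, not taken from the article.

```python
# Newton's method for root finding: x_{n+1} = x_n - f(x_n) / f'(x_n)

def newton_root(f, df, x0, tol=1e-8, max_iter=50):
    """Iterate the Newton update until the step size becomes negligible."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)      # rise over run, rearranged
        x = x - step
        if abs(step) < tol:      # the tangent is barely moving us: done
            break
    return x

# Illustrative use: the zero of f(x) = x^2 - 2 is sqrt(2) ~ 1.41421356
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))
```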
Now that we are clear on Newton’s method, how do we use it for optimization?
We know that for optimization we minimize g(x), which means we find the zeros of g’(x). So we apply Newton’s method, whose goal is to find the zero of f(x), with f(x) = g’(x); the derivative we then need, f’(x) = (g’(x))’, is the second derivative, and the update becomes xₙ₊₁ = xₙ − g’(xₙ)/g’’(xₙ).
Repeat the step until we find the minimum.
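As a sketch of that optimization loop (assuming we can supply both g’(x) and g’’(x); the quadratic used in the example call is only an illustration):

```python
# Newton's method for 1-D optimization: x_{n+1} = x_n - g'(x_n) / g''(x_n)

def newton_minimize(dg, d2g, x0, tol=1e-8, max_iter=50):
    """Find a stationary point of g by running Newton's root finding on g'."""
    x = x0
    for _ in range(max_iter):
        step = dg(x) / d2g(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Illustrative use: g(x) = (x - 3)^2 has g'(x) = 2(x - 3) and g''(x) = 2;
# Newton lands exactly on the minimizer x = 3 in a single step.
print(newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=10.0))
```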
Consider an example,
Now we find the derivative of g(x) and use it to locate the minimum.
We know that g’(x) is
Now take x₀ = 0.05. Substituting this x into g’(x) and (g’(x))’ in Newton’s formula gives x₁, and we repeat the process until we reach the minimum, which we know for this equation is at x ≈ 0.567. We get there in 6 iterations, as shown in the diagram below, which is very fast.
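The article gives g(x) only as a figure; a function consistent with a minimizer near 0.567 and a start of x₀ = 0.05 is g(x) = eˣ − log(x), so the sketch below assumes that form purely for illustration, and it does reach ≈ 0.567 in 6 iterations.

```python
import math

# Assumption: g(x) = e^x - log(x), so g'(x) = e^x - 1/x and g''(x) = e^x + 1/x^2.
# This choice is an illustration consistent with the numbers quoted in the text.
dg = lambda x: math.exp(x) - 1.0 / x
d2g = lambda x: math.exp(x) + 1.0 / x ** 2

x = 0.05
for i in range(1, 7):
    x = x - dg(x) / d2g(x)       # Newton update applied to g'
    print(i, round(x, 4))        # iteration 6 prints ~0.567, the minimizer
```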
That covers one variable; what about two variables, i.e., when the rate of change is with respect to two variables? In that case we use the Hessian matrix.
where H(x, y) is the Hessian matrix: the 2×2 matrix whose diagonal entries are the second partials ∂²f/∂x² and ∂²f/∂y², and whose off-diagonal entries are the mixed partials ∂²f/∂x∂y and ∂²f/∂y∂x.
If the second derivative is positive, the function is concave up and we have a local minimum. If the second derivative is negative, the function is concave down and we have a local maximum.
For multiple variables, it is similar.
Using the Hessian matrix, consider the example f(x, y) = 2x² + 3y² − xy. Here ∂²f/∂x² = 4, ∂²f/∂y² = 6, and the mixed partial (in either order) is −1. Hence the Hessian is the 2×2 matrix with rows [4, −1] and [−1, 6].
Let us calculate the eigenvalues: they come out to 5 ± √2, approximately 3.59 and 6.41.
As you can see, both eigenvalues are positive, which means the function is concave up in every direction at the critical point, so we can conclude that (0, 0) is a minimum.
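A quick NumPy check of this eigenvalue argument (a sketch; the Hessian entries are the second partials computed above):

```python
import numpy as np

# Hessian of f(x, y) = 2x^2 + 3y^2 - xy (constant, since f is quadratic)
H = np.array([[4.0, -1.0],
              [-1.0, 6.0]])

eigenvalues = np.linalg.eigvalsh(H)   # symmetric matrix -> real eigenvalues
print(eigenvalues)                    # ~[3.586, 6.414], i.e. 5 ± sqrt(2)
# Both are positive, so f is concave up in every direction and (0, 0) is a minimum.
```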
Consider another example,
Calculate the gradient and the Hessian matrix. Computing the eigenvalues, both are negative, which means the function is concave down in every direction, so (0, 0) is a maximum.
Consider the situation below, which is a saddle point: it is neither a minimum nor a maximum.
Since the eigenvalues are neither both positive nor both negative (one of each sign), the test gives neither a minimum nor a maximum: the point is a saddle point.
What happens with multiple variables?
So, to conclude, we remember this: if all eigenvalues of the Hessian are positive, the critical point is a minimum; if all are negative, it is a maximum; if the signs are mixed, it is a saddle point; and if any eigenvalue is zero, the test is inconclusive.
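That rule fits in a few lines of code; here is a sketch, where the Hessians passed in the example calls are illustrative assumptions rather than the article’s figures:

```python
import numpy as np

def classify_critical_point(H):
    """Classify a critical point from the signs of its Hessian's eigenvalues."""
    eig = np.linalg.eigvalsh(np.asarray(H, dtype=float))
    if np.all(eig > 0):
        return "local minimum"    # concave up in every direction
    if np.all(eig < 0):
        return "local maximum"    # concave down in every direction
    if np.any(eig == 0):
        return "inconclusive"     # a zero eigenvalue: the test says nothing
    return "saddle point"         # mixed signs

print(classify_critical_point([[4, -1], [-1, 6]]))   # local minimum (example above)
print(classify_critical_point([[-2, 0], [0, -3]]))   # local maximum (illustrative)
print(classify_critical_point([[2, 0], [0, -3]]))    # saddle point (illustrative)
```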
Now let us write down Newton’s method for multiple variables.
The update is xₙ₊₁ = xₙ − H(xₙ)⁻¹ ∇f(xₙ), where H is the Hessian matrix of f and ∇f is its gradient.
This is the generalization to multiple variables.
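Here is a minimal sketch of this multivariate update, applied to the f(x, y) = 2x² + 3y² − xy example from earlier; the starting point is an arbitrary assumption.

```python
import numpy as np

def newton_multivariate(grad, hess, x0, tol=1e-8, max_iter=50):
    """Newton's method in n dimensions: x <- x - H(x)^(-1) grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # Solve H * step = grad rather than forming the inverse explicitly.
        step = np.linalg.solve(hess(x), grad(x))
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# f(x, y) = 2x^2 + 3y^2 - xy: gradient and (constant) Hessian
grad = lambda v: np.array([4 * v[0] - v[1], 6 * v[1] - v[0]])
hess = lambda v: np.array([[4.0, -1.0], [-1.0, 6.0]])

# Because f is quadratic, Newton jumps to the minimum (0, 0) in one step.
print(newton_multivariate(grad, hess, x0=[3.0, -2.0]))
```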
From this we can conclude that Newton’s method is the fastest way to get to the minimum, taking fewer steps than gradient descent, which is an important property for neural networks, where the focus is always on minimizing the model’s error.
Reference: DeepLearning.ai