Conditioning of a function = ratio between the largest and smallest eigenvalues of its Hessian, κ = λ_max / λ_min (often written L/μ).
Higher conditioning => the function is harder to minimize.
Gradient Descent gets faster as the conditioning L/μ of the function decreases 👇
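A minimal sketch of this effect, assuming a 2-D quadratic f(x) = ½ xᵀAx and plain gradient descent with step size 1/L (the `gradient_descent` helper and the specific eigenvalues below are illustrative choices, not from the original post):

```python
import numpy as np

def gradient_descent(A, x0, lr, n_steps=200):
    """Run plain gradient descent on f(x) = 0.5 * x^T A x (minimum at 0)."""
    x = x0.copy()
    history = [np.linalg.norm(x)]
    for _ in range(n_steps):
        x = x - lr * (A @ x)          # gradient of 0.5 * x^T A x is A x
        history.append(np.linalg.norm(x))
    return np.array(history)

# Two quadratics with the same largest eigenvalue L but different smallest eigenvalue mu,
# i.e. different condition numbers kappa = L / mu.
L = 1.0
for mu in (0.5, 0.01):                # kappa = 2 and kappa = 100
    A = np.diag([L, mu])
    hist = gradient_descent(A, x0=np.array([1.0, 1.0]), lr=1.0 / L)
    print(f"kappa = {L / mu:6.1f}  ->  ||x_200|| = {hist[-1]:.2e}")
```

With step size 1/L, the error along the μ-eigendirection contracts by a factor (1 − μ/L) per iteration, so the κ = 100 quadratic converges far more slowly than the κ = 2 one.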
Comments
Also, I have a slight feeling that well-conditioned functions might lead to a smaller generalization gap when using GD or SGD to fit ML/statistical models...