Concept of Momentum While Using Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is an optimization algorithm used in neural networks to minimize the loss function by updating model parameters iteratively. However, vanilla SGD can be slow to converge and can get stuck in local minima. The concept of momentum was introduced to address these issues.

What is momentum? It is a technique that speeds up the convergence of SGD by accumulating a velocity vector in the direction of the gradients, and this velocity is then used to update the parameters.

At each iteration, momentum computes an exponentially weighted average of past gradients. This average is used to update the model parameters (instead of using only the current gradient).

The model parameters are then updated using this velocity vector, which smooths out the updates and speeds up convergence.
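
In code, the update rule looks like the following minimal sketch (plain NumPy, with illustrative names such as params, grads, velocity, lr and beta; the exact formulation differs slightly between frameworks):

```python
import numpy as np

def sgd_momentum_step(params, grads, velocity, lr=0.01, beta=0.9):
    """One SGD-with-momentum update: the velocity accumulates an
    exponentially weighted average of past gradients, and the parameters
    move along the velocity rather than along the raw current gradient."""
    velocity = beta * velocity - lr * grads   # accumulate past gradients
    params = params + velocity                # update using the velocity
    return params, velocity

# toy usage: minimize f(w) = w**2 starting from w = 5
w, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    grad = 2 * w                              # gradient of w**2
    w, v = sgd_momentum_step(w, grad, v)
print(w)                                      # close to the minimum at 0
```

Here beta is the momentum coefficient discussed below; setting it to 0 recovers plain SGD.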

Because of momentum, convergence is faster, oscillations in the updates are reduced (the loss surface of a neural network can be highly non-convex), and the stochasticity of mini-batch gradients is handled better, making optimization more stable.

Most deep learning frameworks, such as TensorFlow, PyTorch and Keras, have built-in support for SGD with momentum.
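
For instance, in PyTorch enabling momentum is a single argument on the optimizer (a minimal sketch with a toy model; the hyperparameter values are just common defaults):

```python
import torch

# toy model, only so there are parameters to optimize
model = torch.nn.Linear(10, 1)

# SGD with a momentum coefficient of 0.9
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```

Keras exposes the same option, e.g. tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9).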

In regions with shallow gradients, plain SGD updates are very small. Momentum lets updates accumulate in these regions, so the model is propelled through the flat areas towards steeper slopes.
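
A small worked example (with assumed values) makes this concrete: if the gradient stays at a constant small value and the momentum coefficient is 0.9, the velocity settles at roughly ten times the step plain SGD would take.

```python
# velocity build-up on a flat slope with a constant gradient of 0.01
beta, lr, grad = 0.9, 0.1, 0.01
velocity = 0.0
for _ in range(50):
    velocity = beta * velocity - lr * grad
print(velocity)   # about -0.01, ten times the plain SGD step of -lr * grad = -0.001
```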

Modern optimizers such as Adam and RMSprop build upon SGD with momentum and incorporate additional features such as adaptive per-parameter learning rates.
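
As an illustration (assuming PyTorch; the hyperparameter values are just common defaults), the momentum-like term appears as the first beta in Adam and as an optional momentum argument in RMSprop:

```python
import torch

model = torch.nn.Linear(10, 1)   # toy model, only to have parameters

# Adam: betas[0] is the coefficient of the first-moment (momentum-like) average;
# the second moment provides per-parameter adaptive step sizes
adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# RMSprop: adaptive learning rates plus an optional classical momentum term
rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-2, momentum=0.9)
```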

Momentum enhances SGD's effectiveness in neural network training. Still, some tuning is required, and SGD with momentum is not necessarily the best optimizer for every problem.

Momentum introduces a memory term that takes past gradients into account. It acts like a moving average, smoothing out the updates and giving them a consistent direction.

Momentum requires tuning a hyperparameter called the momentum coefficient, which controls how strongly past gradients influence the current update; values around 0.9 are a common starting point.
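
A rough rule of thumb (an approximation, not an exact statement): with coefficient beta, the velocity behaves like an average over roughly the last 1 / (1 - beta) gradients, which is one way to reason about typical values:

```python
# approximate number of past gradients the velocity effectively averages over
for beta in (0.5, 0.9, 0.99):
    print(f"momentum = {beta}: roughly the last {1 / (1 - beta):.0f} gradients")
# momentum = 0.5  -> last ~2 gradients
# momentum = 0.9  -> last ~10 gradients
# momentum = 0.99 -> last ~100 gradients
```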
