Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent is an optimization technique used in machine learning and other fields to minimize an objective function, such as an error measure, with respect to its parameters. The method is iterative: it starts with an initial guess for the parameters and improves upon them step by step until a minimum is reached.

Each iteration computes the gradient of the objective function with respect to the parameters. The gradient points in the direction of steepest ascent, so SGD updates the parameters in the opposite direction to decrease the objective.

To update the parameters, a scaled version of the gradient is subtracted from their current values. The scaling factor, called the learning rate, determines the size of the step taken along the negative gradient direction.

These two steps are repeated until the objective function converges to a minimum or a stopping criterion is met.
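The following is a minimal sketch of mini-batch SGD applied to linear regression with a mean squared error objective; the function name, hyperparameter defaults, and data layout are illustrative assumptions rather than a prescribed implementation.

import numpy as np

def sgd_linear_regression(X, y, lr=0.01, batch_size=32, epochs=100):
    # Mini-batch SGD for linear regression with mean squared error (illustrative).
    n_samples, n_features = X.shape
    w = np.zeros(n_features)   # initial guess for the parameters
    b = 0.0
    for epoch in range(epochs):
        # Shuffle so each epoch visits the mini-batches in a new random order
        idx = np.random.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            batch = idx[start:start + batch_size]
            error = X[batch] @ w + b - y[batch]
            # Gradient of the mean squared error on this mini-batch only
            grad_w = 2 * X[batch].T @ error / len(batch)
            grad_b = 2 * error.mean()
            # Update step: subtract the gradient scaled by the learning rate
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b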

SGD is often faster than other optimization algorithms, especially on large datasets, because the gradient is computed on only a small subset of the data points (a mini-batch) in each iteration rather than on the whole dataset.
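As a rough sketch of the difference in per-step cost, the full-batch gradient below touches every row of the data, while the mini-batch gradient uses only a handful of rows to form a noisy estimate of the same quantity; the dataset here is synthetic and purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 10))        # illustrative large dataset
y = X @ np.ones(10) + 0.1 * rng.normal(size=1_000_000)
w = np.zeros(10)

# Full-batch gradient: touches every row of X on every step
grad_full = 2 * X.T @ (X @ w - y) / len(X)

# Mini-batch gradient: a cheap, noisy estimate of the same quantity
batch = rng.choice(len(X), size=64, replace=False)
grad_mini = 2 * X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)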

It is scalable, since it extends naturally to large, complex models with many parameters.

Its stochastic updates also make it comparatively robust to noise in the data.

One limitation of SGD is that it can converge to a local minimum of the objective function instead of the global minimum. Another is tuning: choosing a good learning rate and other hyperparameters typically requires a fair amount of experimentation.
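As a sketch of that experimentation, one common approach is a simple sweep over candidate learning rates, keeping the one with the lowest validation loss; the data, the candidate values, and the helper function here are all illustrative assumptions.

import numpy as np

def validation_loss(lr, X_train, y_train, X_val, y_val, epochs=50, batch_size=32):
    # Train a linear model with plain SGD at the given learning rate,
    # then report the mean squared error on held-out validation data.
    w = np.zeros(X_train.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(X_train))
        for start in range(0, len(X_train), batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X_train[b].T @ (X_train[b] @ w - y_train[b]) / len(b)
            w -= lr * grad
    return np.mean((X_val @ w - y_val) ** 2)

# Synthetic data split into training and validation sets (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
X_train, X_val, y_train, y_val = X[:800], X[800:], y[:800], y[800:]

candidates = [0.3, 0.1, 0.03, 0.01, 0.003]
losses = {lr: validation_loss(lr, X_train, y_train, X_val, y_val) for lr in candidates}
best_lr = min(losses, key=losses.get)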

To help escape local minima, the momentum variant of SGD is often used. The Adagrad variant adapts the learning rate for each parameter individually to improve convergence.
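Below is a minimal sketch of the two update rules just mentioned, written as standalone step functions; the hyperparameter values are common illustrative defaults, not recommendations.

import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # Momentum: accumulate an exponentially decaying average of past gradients,
    # so consistent directions build up speed and oscillations damp out.
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def adagrad_step(w, grad, grad_sq_sum, lr=0.01, eps=1e-8):
    # Adagrad: scale each parameter's step by the history of its squared
    # gradients, so frequently updated parameters take smaller steps.
    grad_sq_sum = grad_sq_sum + grad ** 2
    return w - lr * grad / (np.sqrt(grad_sq_sum) + eps), grad_sq_sum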

SGD is used in machine learning, particularly for training neural networks, as well as in optimization problems arising in signal processing and finance.

SGD is a variant of the gradient descent algorithm. Unlike standard gradient descent, which computes the gradient over the entire dataset, SGD estimates it from only a small batch of data points (a mini-batch). This stochastic approach makes it more efficient and scalable.

It can be combined with regularization techniques, such as L2 weight decay, to help prevent overfitting.
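A minimal sketch of one such combination, assuming L2 regularization folded into the update step; the regularization strength lam is an illustrative value.

def sgd_step_with_l2(w, grad, lr=0.01, lam=1e-4):
    # Plain SGD step with L2 regularization: lam * w is added to the gradient,
    # shrinking the weights toward zero on every update (weight decay).
    # w and grad are assumed to be NumPy arrays of the same shape.
    return w - lr * (grad + lam * w)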

The method is called stochastic because the mini-batches are chosen at random, so each update uses an estimate of the true gradient rather than the exact gradient. In practice it is often paired with a learning rate schedule in which the rate decreases over time.
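A minimal sketch of one common schedule, inverse time decay; the constants are illustrative, and alternatives such as step or exponential decay are equally common.

def inverse_time_decay(initial_lr, step, decay_rate=0.01):
    # The learning rate shrinks as training progresses:
    # lr_t = lr_0 / (1 + decay_rate * t)
    return initial_lr / (1 + decay_rate * step)

# Example: the rate decays from 0.1 toward zero over the course of training
learning_rates = [inverse_time_decay(0.1, t) for t in range(0, 1000, 100)]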

The randomness allows it to explore the parameter space more effectively; aside from this, the algorithm follows a fixed set of update rules.
