The Vanishing Gradient Problem

In neural network training we often run into the vanishing gradient problem: the gradients become extremely small as they are backpropagated through the network. This hinders training, especially in networks with many layers, because the weights of the earlier layers are not effectively updated.

Consequently, these layers learn very slowly or stop learning altogether, which results in suboptimal performance or a failure to converge.
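
As a toy illustration (not from the original article), the sketch below multiplies sigmoid derivatives layer by layer; weights are ignored for simplicity, so the numbers only show how quickly a chain of saturating activations shrinks the backpropagated signal.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The backpropagated gradient picks up a factor sigmoid'(z) <= 0.25 at every
# layer (weights omitted for simplicity), so the product decays with depth.
grad = 1.0
for depth in range(1, 31):
    z = rng.normal()                          # hypothetical pre-activation
    grad *= sigmoid(z) * (1.0 - sigmoid(z))   # sigmoid derivative at z
    if depth % 10 == 0:
        print(f"after {depth} layers: gradient factor ~ {grad:.2e}")
```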

Several techniques are used to overcome this issue, including careful weight initialization, batch normalization, and skip (residual) connections.
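
A minimal sketch of the last two ideas, assuming PyTorch: a hypothetical residual block that combines a skip connection with batch normalization, so gradients have a direct identity path back to earlier layers.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block: output = activation(x + F(x)).

    The identity shortcut gives gradients a direct path back to earlier
    layers, and batch normalization keeps activations in a well-scaled range.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.bn1 = nn.BatchNorm1d(dim)
        self.fc2 = nn.Linear(dim, dim)
        self.bn2 = nn.BatchNorm1d(dim)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.act(self.bn1(self.fc1(x)))
        out = self.bn2(self.fc2(out))
        return self.act(out + x)   # skip connection

block = ResidualBlock(64)
x = torch.randn(32, 64)
print(block(x).shape)  # torch.Size([32, 64])
```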

The issue commonly arises when training recurrent neural networks (RNNs). The gradients used to update the network's weights become vanishingly small as they are backpropagated through the layers and, in RNNs, through time. This is particularly pronounced when saturating activation functions such as the sigmoid or tanh are used, because their derivatives are small over most of their input range.
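
The following made-up example uses PyTorch's built-in nn.RNN with a tanh nonlinearity and backpropagates a loss computed only at the last time step; the gradient norms at early time steps typically come out orders of magnitude smaller than at late ones.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=8, hidden_size=32, nonlinearity='tanh')

# Sequence of length 100, batch size 1, 8 input features per step.
x = torch.randn(100, 1, 8, requires_grad=True)
out, _ = rnn(x)

# Loss depends only on the final time step's output.
loss = out[-1].sum()
loss.backward()

# Gradient magnitude with respect to the input at each time step; early
# steps typically receive a far smaller gradient than late ones.
grad_norms = x.grad.norm(dim=(1, 2))
print("grad norm at t=0:  ", grad_norms[0].item())
print("grad norm at t=50: ", grad_norms[50].item())
print("grad norm at t=99: ", grad_norms[99].item())
```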

ReLU and leaky ReLU activations are less prone to vanishing gradients. Weights should also be initialized so that gradients can flow easily through the network, for example with Xavier (Glorot) or He (Kaiming) initialization. Gradient clipping can additionally be applied to bound the magnitude of the gradients, though this mainly guards against gradients becoming too large rather than too small.
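
A brief sketch of these ideas in PyTorch; the layer sizes and the clipping threshold are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# He (Kaiming) initialization suits ReLU layers; Xavier (Glorot) initialization
# (nn.init.xavier_uniform_) would be the analogous choice for tanh/sigmoid.
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
        nn.init.zeros_(layer.bias)

# Dummy forward/backward pass so there are gradients to clip.
loss = model(torch.randn(16, 128)).sum()
loss.backward()

# Clip the global gradient norm; this bounds how large gradients can get
# (mainly a defence against exploding gradients).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```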

Because the ReLU derivative is either 1 (for positive inputs) or 0, repeated multiplication during backpropagation does not shrink the gradient for active units, which mitigates the vanishing effect.
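
As a quick numerical check (illustrative values only), the derivatives of ReLU and leaky ReLU can be compared at a few points:

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.5, 2.0])

# ReLU derivative: 1 for positive inputs, 0 otherwise.
relu_grad = (z > 0).astype(float)

# Leaky ReLU derivative: 1 for positive inputs, a small slope (here 0.01)
# otherwise, so even "inactive" units pass a little gradient back.
leaky_grad = np.where(z > 0, 1.0, 0.01)

print("ReLU'(z):      ", relu_grad)    # [0. 0. 1. 1.]
print("LeakyReLU'(z): ", leaky_grad)   # [0.01 0.01 1.   1.  ]
```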
