Activations in large language models are normalized to ensure stable and efficient training. The commonly used normalization techniques are:
1 Layer normalization: The activations of each layer are normalized independently for each sample, across the feature dimension; the statistics are not computed over the batch. It is used in RNNs and transformers, where batch sizes vary and can be small (see the first sketch after this list).
2 Batch normalization: The activations of each layer are adjusted and scaled to have zero mean and unit variance over the mini-batch during training. It stabilizes and speeds up training by reducing internal covariate shift (a sketch follows the list).
3 Instance normalization: Activations are normalized over the spatial dimensions, independently for each sample and channel. It is used for style transfer and image generation (illustrated below).
4 Group normalization: The channels are divided into groups, and the activations within each group are normalized separately for each sample. It is useful when batch sizes are small or batch normalization is otherwise unsuitable, such as when fine-tuning pre-trained models (see the last sketch below).
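A minimal sketch of layer normalization using PyTorch; the framework choice and the transformer-style input shape (batch, sequence length, hidden size) are illustrative assumptions, not prescribed by the list above.

```python
import torch
import torch.nn as nn

# Transformer-style activations: (batch, sequence length, hidden size).
x = torch.randn(4, 16, 512)

# LayerNorm normalizes over the last (feature) dimension of each sample,
# independently of the other samples in the batch.
layer_norm = nn.LayerNorm(normalized_shape=512)
y = layer_norm(x)

# Each (sample, position) slice now has ~zero mean and ~unit variance.
print(y.mean(dim=-1).abs().max())  # close to 0
print(y.std(dim=-1).mean())        # close to 1
```

Because the statistics are computed per sample, the result is identical whether the batch contains one sequence or a thousand, which is why it suits variable and small batch sizes.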
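A hedged sketch of batch normalization; the image-style input shape (batch, channels, height, width) is assumed purely for illustration. Statistics are computed per channel over the whole mini-batch during training.

```python
import torch
import torch.nn as nn

# Image-style activations: (batch, channels, height, width).
x = torch.randn(32, 64, 28, 28)

# BatchNorm2d computes mean and variance per channel over the entire
# mini-batch (and spatial dimensions) during training, and keeps
# running estimates for use at inference time.
batch_norm = nn.BatchNorm2d(num_features=64)
batch_norm.train()
y = batch_norm(x)

# Per-channel statistics over the batch are ~zero mean, ~unit variance.
print(y.mean(dim=(0, 2, 3)).abs().max())            # close to 0
print(y.var(dim=(0, 2, 3), unbiased=False).mean())  # close to 1
```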
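An instance-normalization sketch under the same illustrative assumptions (PyTorch, image-shaped tensors); each (sample, channel) feature map is normalized over its own spatial dimensions only.

```python
import torch
import torch.nn as nn

# One feature map per (sample, channel) pair: (batch, channels, H, W).
x = torch.randn(8, 3, 64, 64)

# InstanceNorm2d normalizes each channel of each sample over its
# spatial dimensions, so statistics never mix across the batch.
instance_norm = nn.InstanceNorm2d(num_features=3)
y = instance_norm(x)

# Every (sample, channel) map has ~zero mean and ~unit variance.
print(y.mean(dim=(2, 3)).abs().max())            # close to 0
print(y.var(dim=(2, 3), unbiased=False).mean())  # close to 1
```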
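Finally, a group-normalization sketch; the group count and channel count are arbitrary choices for illustration. Because the statistics are computed per sample within channel groups, it behaves the same even with a batch size of one.

```python
import torch
import torch.nn as nn

# Works with any batch size, even a single sample.
x = torch.randn(1, 64, 28, 28)

# GroupNorm splits the 64 channels into 8 groups of 8 channels and
# normalizes each group per sample, independent of the batch size.
group_norm = nn.GroupNorm(num_groups=8, num_channels=64)
y = group_norm(x)

print(y.shape)  # torch.Size([1, 64, 28, 28])
```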