Batch normalisation (BN) was proposed in a paper [S. Ioffe & C. Szegedy, 2015] published in February 2015. The method applies to mini-batch gradient descent. The paper addresses the problem that updating the parameters of a neural net changes the distribution of the neurons' activations (what the authors call "internal covariate shift"), which slows down training. The solution they propose is to normalize the activations over each mini-batch.
Normalisation in neural networks had been investigated before, in [S. Wiesler & H. Ney, 2011] and [LeCun et al., 2012], where layer inputs were normalized. The new method, however, introduces two simplifications for higher performance and easier parallelization. The first is normalizing each activation in a layer independently rather than jointly whitening them all together; even though the activations are not decorrelated, convergence of the objective function is still expected to be accelerated. The second is estimating the mean and variance of each activation from the current mini-batch rather than from the whole training set.
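The two simplifications can be sketched in a few lines of NumPy: each activation (column) is normalized independently using statistics computed over the mini-batch (rows), then scaled and shifted by the learnable parameters gamma and beta from the paper. Function and variable names here are illustrative, not from the paper's code.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Sketch of the BN forward pass for a mini-batch x of shape
    (batch_size, num_activations). Each activation is normalized
    independently using mini-batch statistics."""
    mu = x.mean(axis=0)                    # per-activation mean over the batch
    var = x.var(axis=0)                    # per-activation variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each activation independently
    return gamma * x_hat + beta            # learnable scale and shift

# Example: mini-batch of 4 samples with 3 activations each
x = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [3., 6., 9.],
              [4., 8., 12.]])
y = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
```

With gamma = 1 and beta = 0, each column of the output has approximately zero mean and unit variance; in practice gamma and beta are trained so the network can recover the original representation if that is optimal.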
The authors report reaching the accuracy of a baseline network with up to 14 times fewer training steps, without any loss of precision. Of course, BN by itself is not solely responsible for this: the gains also come from changing the training parameters to take full advantage of the new method. The most important change was an increased learning rate: the authors were able to raise it by up to 30(!) times, and even to remove Dropout without overfitting.
The BN method looks interesting and can greatly reduce training time in Deep Neural Networks.
- Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” arXiv preprint arXiv:1502.03167 (2015).
- LeCun, Yann A., Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. “Efficient backprop.” In Neural networks: Tricks of the trade, pp. 9-48. Springer Berlin Heidelberg, 2012.
- Wiesler, Simon, and Hermann Ney. “A convergence analysis of log-linear training.” In Advances in Neural Information Processing Systems, pp. 657-665. 2011.