Paper: Speed Learning on the Fly, P.-Y. Massé and Y. Ollivier, 2015
The paper describes a method for adapting the step size (learning rate) of online SGD at relatively low computational cost. The learning rate can have a significant impact on training, and this matters even more given the nature of online learning and the challenges it imposes. The method addresses the problem by unrolling the model updates through multiple steps of online learning. Using the previous steps, it accumulates a sum of losses whose gradient with respect to the learning rate can then be approximated, which makes it possible to perform gradient descent on the learning rate itself.
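To make the idea concrete, here is a minimal sketch of gradient descent on the learning rate for online SGD. This is not the paper's exact algorithm: instead of unrolling through multiple steps, it uses the common one-step approximation, where the chain rule through the previous update w_t = w_{t-1} - eta * g_{t-1} gives dL/d(eta) ≈ -g_t · g_{t-1}. The toy problem, the meta learning rate `beta`, and all constants are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy online problem: a stream of (x, y) pairs from a noisy linear model.
true_w = np.array([2.0, -1.0])

def sample():
    x = rng.normal(size=2)
    y = true_w @ x + 0.01 * rng.normal()
    return x, y

w = np.zeros(2)
eta = 0.001           # initial learning rate, adapted online
beta = 1e-4           # meta learning rate for the step size (hypothetical value)
prev_grad = np.zeros(2)

for t in range(2000):
    x, y = sample()
    err = w @ x - y
    grad = err * x                     # gradient of 0.5 * err**2 w.r.t. w
    # One-step approximation of the hypergradient: since
    # w_t = w_{t-1} - eta * prev_grad, the chain rule gives
    # dL/d(eta) ≈ -grad · prev_grad, so gradient descent on eta is:
    eta = max(eta + beta * (grad @ prev_grad), 1e-6)
    w -= eta * grad
    prev_grad = grad

print(np.round(w, 2))  # should end up close to true_w
```

When consecutive gradients point in similar directions the step size grows, and when they oscillate it shrinks; the paper's multi-step unrolling refines this one-step estimate.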
If you read the paper “Training recurrent networks online without backtracking”, published in July 2015 by Y. Ollivier and G. Charpiat, which I briefly reviewed a few months ago, you may notice many similarities to this idea. In that earlier paper the authors also described unrolling through time with gradient approximation for online learning. Likewise, the paper “Gradient-based Hyperparameter Optimization through Reversible Learning” by D. Maclaurin et al., published in February 2015, describes a similar idea of applying gradient descent to the learning rate by unrolling the previous steps. If you are interested, you can check my short review of that paper.
The paper's findings suggest that the adaptive step size performs better than a fixed or naively varied one. Overall, the method looks promising and is a good addition to the rather limited set of solutions for online SGD training. In my opinion, online SGD learning does not get the attention it deserves, even though it fits some real-life cases better than mini-batch SGD.