Paper: Gradient-based Hyperparameter Optimization through Reversible Learning, D. Maclaurin et al., 2015
The method introduced in the paper takes an interesting view of hyper-parameter selection: it computes gradients with respect to the hyper-parameters themselves. The computation involves unrolling the past training steps and differentiating through them, so that gradient descent can be performed on, for example, the learning rate. Keeping the complete training history is expensive, hence the need to limit the number of steps that are unrolled. The authors also describe a few proof-of-concept experiments using their method.
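To make the unrolling idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a toy one-dimensional problem with a made-up training loss 0.5*(w - 3)^2 and a made-up validation loss 0.5*(w - 2.5)^2, and the names train, val_loss and hypergradient are hypothetical. The sketch stores every intermediate weight during training, then walks the steps backwards with the chain rule to get the derivative of the final validation loss with respect to the learning rate.

```python
def train(lr, w0=0.0, steps=100):
    """Plain gradient descent on the toy training loss 0.5*(w - 3)^2.

    Returns the whole trajectory [w_0, ..., w_T]; keeping it is the
    memory cost of unrolling.
    """
    ws = [w0]
    for _ in range(steps):
        grad = ws[-1] - 3.0            # d(train loss)/dw at the current weight
        ws.append(ws[-1] - lr * grad)  # one descent step
    return ws


def val_loss(w):
    """Toy validation loss (an assumption made for this example)."""
    return 0.5 * (w - 2.5) ** 2


def hypergradient(lr, w0=0.0, steps=100):
    """d(val_loss(w_T))/d(lr), obtained by walking the stored steps backwards."""
    ws = train(lr, w0, steps)
    d_w = ws[-1] - 2.5                 # d(val loss)/d(w_T)
    d_lr = 0.0
    for t in reversed(range(steps)):
        grad = ws[t] - 3.0
        d_lr += d_w * (-grad)          # direct effect of lr on w_{t+1}
        d_w *= 1.0 - lr                # d(w_{t+1})/d(w_t); the toy Hessian is 1
    return d_lr
```

A cheap sanity check for a sketch like this is to compare hypergradient(lr) against a central finite difference of val_loss(train(lr)[-1]) for a small perturbation of lr.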
The goal of the experiments was to find optimal training schedules for the network by optimizing specific hyper-parameters at a fine granularity. For example, every training iteration had its own learning rate. Moreover, within each layer the weights and the biases had separate learning rates. This level of control helped the authors discover some interesting patterns. For example, the optimal initialization of a layer's weights appears to correlate with the number of neurons in that layer, so it can be calculated. Another pattern is a correlation between a layer's position (e.g. hidden, output, etc.) and the optimal learning rate scale for that layer.
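The per-iteration learning rates can be illustrated with another small sketch. It is an illustration under assumptions, not the paper's experiment: the same kind of toy one-dimensional losses are made up for the example, and run_schedule, schedule_gradient and meta_optimize are hypothetical names. Each training step gets its own learning rate, one backward pass yields the gradient for every entry of the schedule, and a simple outer ("meta") loop of gradient descent adjusts the schedule.

```python
def run_schedule(lrs, w0=0.0):
    """Gradient descent on 0.5*(w - 3)^2 with one learning rate per step."""
    ws = [w0]
    for lr in lrs:
        ws.append(ws[-1] - lr * (ws[-1] - 3.0))
    return ws


def val_loss(w):
    """Toy validation loss (an assumption made for this example)."""
    return 0.5 * (w - 2.5) ** 2


def schedule_gradient(lrs, w0=0.0):
    """Gradient of the final validation loss w.r.t. every per-step learning rate."""
    ws = run_schedule(lrs, w0)
    d_w = ws[-1] - 2.5                       # d(val loss)/d(w_T)
    d_lrs = [0.0] * len(lrs)
    for t in reversed(range(len(lrs))):
        d_lrs[t] = -d_w * (ws[t] - 3.0)      # step t only touches w_{t+1}
        d_w *= 1.0 - lrs[t]                  # propagate back to w_t
    return d_lrs


def meta_optimize(lrs, meta_lr=0.01, meta_steps=20):
    """Outer loop: plain gradient descent on the whole schedule."""
    lrs = list(lrs)
    for _ in range(meta_steps):
        grads = schedule_gradient(lrs)
        lrs = [lr - meta_lr * g for lr, g in zip(lrs, grads)]
    return lrs
```

For instance, meta_optimize([0.05] * 10) starts from a flat ten-step schedule and returns an adjusted one; comparing val_loss(run_schedule(...)[-1]) before and after shows whether the meta-updates helped.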
The experiments in the paper used 100 iterations of stochastic gradient descent for each “meta” iteration. However, unrolling even 100 steps may not be practical for large networks. This may explain why the authors used a rather small 3-layer network with 50 hidden units. The authors have already optimized how weight changes are encoded, which reduces the memory footprint of the algorithm, and they outline a few further possible performance optimizations. Regardless of its performance limitations, the method may provide useful insight for designing new networks. For example, one could optimize the hyper-parameters on a smaller prototype network and reuse them in the larger network.
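The "reversible learning" in the paper's title points at one way to avoid storing the trajectory: run the optimizer backwards from its final state. The sketch below shows only that general idea for momentum SGD on a made-up one-dimensional loss; the function names and constants are hypothetical. It ignores the information lost to finite-precision arithmetic, which is tolerable here only because the toy run is short, and it is not the authors' encoding scheme.

```python
def grad(w):
    """Gradient of the toy training loss 0.5*(w - 3)^2."""
    return w - 3.0


def forward(w, v, lr=0.1, gamma=0.9, steps=20):
    """Momentum SGD: update the velocity, then the weight. No history is kept."""
    for _ in range(steps):
        v = gamma * v - (1.0 - gamma) * grad(w)
        w = w + lr * v
    return w, v


def backward(w, v, lr=0.1, gamma=0.9, steps=20):
    """Undo the same updates in reverse order, starting from the final (w, v)."""
    for _ in range(steps):
        w = w - lr * v                              # undo the weight update
        v = (v + (1.0 - gamma) * grad(w)) / gamma   # undo the velocity update
    return w, v
```

Running backward on the output of forward(0.0, 0.0) should land very close to the starting point (0.0, 0.0). The division by gamma amplifies rounding error at every reversed step, which is why a longer run would need extra bookkeeping to stay exact.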