Paper: Speed Learning on the Fly, P.-Y. Massé and Y. Ollivier, 2015 The paper describes a method for adapting the step size (learning rate) during online SGD training at relatively low computational cost. The learning rate can have a significant impact on the training process, and this matters even more given the nature of online training and the challenges it imposes. The method in the paper attempts to solve this problem by unrolling model updates through multiple steps of online learning.
Paper: Training recurrent networks online without backtracking, Y. Ollivier and G. Charpiat, 2015 While working with Recurrent Neural Nets (RNNs) I have run into the problem of training such networks online. Current solutions include limiting backpropagation to a small number of the most recent steps. This paper proposes an algorithm that does not require backpropagation through time, which can be extremely expensive for RNNs. The paper “Gradient-based Hyperparameter Optimization through Reversible Learning” by D.
Paper: Semi-Supervised Learning with Ladder Networks, A. Rasmus et al., 2015 It was nice to read a paper from my alma mater, Aalto University. The proposed network model does a job similar to a stack of denoising autoencoders, but with a more optimized approach. The model applies autoencoding to all network layers together, which opens up more opportunities for optimizing their reconstruction. In contrast, regular stacked autoencoders attempt to reconstruct one layer at a time.
[Update from 17.04.2016]: The code examples were updated to Keras 1.0 and should work with future 1.0.x versions of Keras. In the previous part I covered the basic concepts that will be used in the application. Today I will show how to implement it with Keras. I will try to keep only the parts of the code related to Keras and not overburden the reader with infrastructure-related code. In my example I use a dataset of labeled movie reviews from IMDB, used in “Learning Word Vectors for Sentiment Analysis” (Maas et al., 2011).
Introduction I always strive to make my posts accessible and easy to understand. However, you may need a basic understanding of Neural Networks and NLP to read this article. I highly recommend the Stanford course CS 224d Deep Learning for Natural Language Processing. The course materials are available online, including lecture videos on YouTube and lecture slides on the course website. It is worth watching the whole course if you have time.
Recently, a few of my friends have asked me about Docker containers. After answering several times, I noticed a few common questions that interest people the most. Today I will try to answer these questions in a user-friendly way. 1. What is the difference between virtual machines and Docker containers? Docker uses Linux container technology (LXC), which lets applications run inside the host OS.
Today I will share my thoughts on selecting a library for creating Deep Learning models for R&D purposes not limited to Computer Vision (e.g. text analytics). My use cases include experimenting with different network architectures, modifying models for better results and applying them in production. Because of the specifics of “developing” Neural Networks, modifying a model's architecture and parameters is tightly coupled with training it, which shifts the effort towards research rather than engineering.
Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe & C. Szegedy, 2015 Batch normalisation (BN) was proposed in a paper [S. Ioffe & C. Szegedy, 2015] published in February. The method is applied to mini-batch gradient descent. The paper considers the problem that updating the parameters of a neural net changes the activation distributions of its neurons, which in turn slows down the training of the net.
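To make the idea concrete, here is a minimal sketch of the normalisation step for a single feature over one mini-batch, in plain Python. The function name `batch_norm` and the default values of `gamma`, `beta` and `eps` are my own assumptions for illustration. It covers only the training-time forward pass and leaves out the running statistics used at inference.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature's activations over a mini-batch, then scale and shift.

    gamma and beta stand in for the learned scale and shift parameters;
    eps is a small constant that avoids division by zero.
    """
    n = len(batch)
    mean = sum(batch) / n
    # Biased (population) variance computed over the mini-batch.
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
```

With the default `gamma` and `beta`, the output has roughly zero mean and unit variance regardless of how the input activations were distributed, which is what keeps the distributions stable as the parameters change.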
Paper: Gradient-based Hyperparameter Optimization through Reversible Learning, D. Maclaurin et al., 2015 The method introduced in the paper takes an interesting view on selecting hyper-parameter values by computing gradients with respect to the hyper-parameters. The computation involves unrolling past training steps in order to perform gradient descent on the learning rate. Computing over the complete history would be expensive, hence the need to limit the number of most recent steps. The authors also describe a few proof-of-concept experiments using their method.
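As a rough illustration of taking a gradient with respect to the learning rate, the sketch below unrolls just one SGD step on a toy quadratic loss and applies the chain rule by hand. This is not the reversible-learning algorithm from the paper. The toy loss, the function names and the `meta_lr` parameter are all assumptions made for the example.

```python
def loss_grad(w):
    # Gradient of the toy loss f(w) = 0.5 * w**2 (chosen only for illustration).
    return w

def hypergradient_step(w, lr, meta_lr=0.01):
    """One SGD step on w, followed by one gradient step on the learning rate."""
    g_old = loss_grad(w)
    w_new = w - lr * g_old
    # Chain rule through the single unrolled step:
    # d f(w_new) / d lr = f'(w_new) * d w_new / d lr = loss_grad(w_new) * (-g_old)
    d_lr = -loss_grad(w_new) * g_old
    return w_new, lr - meta_lr * d_lr

w, lr = 5.0, 0.1
for _ in range(20):
    w, lr = hypergradient_step(w, lr)
```

On this toy problem the learning rate grows from its small initial value while successive gradients keep pointing the same way. Unrolling more than one step follows the same chain-rule pattern but needs the stored (or reconstructed) history of updates, which is where the cost comes from.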
Paper: “Neural Turing Machines”, A. Graves et al., 2014 The paper introduces a neural network architecture that can selectively write information to, and read it from, external memory. The architecture bears a resemblance to the von Neumann architecture of computers, proposed by J. von Neumann et al. in 1945. In short, the von Neumann model has CPU, memory and I/O units and is an instance of the Universal Turing Machine proposed by A. Turing in 1936.