Why I prefer Theano for Deep Learning

Today I will share my thoughts on selecting a library for creating Deep Learning models for R&D purposes, not limited to Computer Vision (e.g. text analytics). My use cases include experimenting with different network architectures, modifying models for better results and applying them in production. Because of the way Neural Networks are “developed”, modifying the model architecture and parameters is tightly coupled with training, which shifts the effort towards research rather than engineering. Therefore, the library should be fit for research purposes and also be usable in production.

There are several popular Deep Learning libraries that support GPU-accelerated computing: Theano, Caffe and Torch. Let’s start with Torch, a scientific framework based on Lua that is used by a few large companies, including Twitter and Facebook. Advantages of Torch include the availability of pre-trained models, modules and more built-in functionality for creating Neural Nets compared to competitors like Theano. On the other hand, the Lua language is common neither among researchers nor in the industry. Moreover, its ecosystem and toolchain cannot compete with Python’s. This may be an obstacle to the growth of adoption of this library. Finally, Torch does not support Recurrent Neural Networks (RNNs), which are commonly used in text analysis tasks.

Another alternative is Caffe, which is developed in C++ with a main focus on convolutional networks and is used mostly for computer vision. The library started as a port of Matlab code for Convolutional Neural Networks (CNNs) to C and C++ for performance reasons. As a result, the library is highly optimized for a specific kind of architecture, which makes it a good choice for standard CNN models. Creating such models is quite simple and involves describing the network and its solver in Protocol Buffers files. Unfortunately, it also means the library is less flexible for creating other Neural Net architectures, e.g. Recurrent Nets, because that was not the original intention. If you need to go beyond standard models and their functionality, you will need to use C++/CUDA, which has a steeper learning curve and is less popular among researchers than Python. Additionally, even for professional C++ developers it may be challenging to maintain the code base or examine the source code of the library compared to Theano. Similarly to Torch, Caffe does not support RNNs, which is a serious disadvantage in my case.

Theano

Theano is a library for scientific computing written in Python. While it is not focused only on Deep Learning, it simplifies the creation of Neural Network models and offers flexibility in model architecture. There are a few frameworks built on top of Theano (see my post about Keras) that target DL specifically and simplify the creation of standard networks, but many scientists still use pure Theano. The advantages of Theano are a simple and popular scripting language (Python), support for Python libraries and flexibility. The most important factor in my opinion is the availability of a wide range of Python libraries, especially for scientific computing and visualization. For example, NumPy, SciPy and Scikit-learn are widely used by the scientific community. The use of Python does not necessarily result in low performance, because performance-critical parts of Theano are either implemented as native code modules or generated as C code dynamically. Thus, it is possible to get performance comparable to C applications while enjoying the benefits of Python. Finally, Theano supports the creation and efficient execution of RNN models thanks to its computational graph and a special Scan command that I will describe later. With a number of advantages over competitors, Theano is one of the most popular libraries among researchers at the moment.

Theano was initially created by a machine learning research group at the University of Montreal and is currently actively developed by the Theano community. An essential part of Theano is operations on multidimensional arrays that can be easily parallelized. High parallelization is especially important for efficient utilization of GPUs, which typically have thousands of simple computing units. This is addressed with a computational graph, which is the core concept of computing in Theano. The graph represents all computations in the code of the model, where graph nodes are either variables (tensors, i.e. multidimensional arrays of data) or operations on tensors (e.g. a dot product or the sigmoid function). During compilation Theano optimizes the graph and generates high-performance C code for each operation node. If your hardware has CUDA support, Theano can generate CUDA code with a simple change in configuration. After the model is compiled, Python code can use it for efficient computations, passing data to variable nodes in the computation graph and receiving results. In Figure 1 you can see the computation graph for a simple Neural Network model that learns the logical XOR operation.

Figure 1: Computation graph example
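
To make the graph concept more concrete, here is a minimal sketch (separate from the XOR example) that declares symbolic variables, builds a small expression and compiles it into a callable function; the variable names and the dot-product-plus-sigmoid expression are purely illustrative.

# a minimal computation graph: symbolic inputs, an expression, and compilation
import numpy as np
import theano
import theano.tensor as T

a = T.dmatrix('a')                       # variable node: matrix of doubles
b = T.dvector('b')                       # variable node: vector of doubles
y = T.nnet.sigmoid(T.dot(a, b))          # operation nodes: dot product and sigmoid

f = theano.function(inputs=[a, b], outputs=y)   # optimize the graph, generate native code

# the compiled function is called with plain NumPy data
print(f(np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([0.5, -0.5])))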

Earlier, I mentioned the flexibility of Theano. One of the factors contributing to the flexibility of models is support for symbolic loops with a special Scan operation. The command allows Theano to treat such an operation node as another computation graph, isolated from the main graph. Similarly, the gradient of the loop is also computed with a Scan operation. This elegant solution bypasses the limitation of a fixed number of computation steps in the model and extends the variety of supported models. For example, the Scan command allows us to create an RNN in a compact and straightforward way, as demonstrated in the code fragment below. Another factor is the ability to generate optimized code for CPU or GPU without changing the code of the model. After developing a model on a regular machine without a GPU, it is possible to automatically generate high-performance CUDA code on a server with a powerful GPU with a simple flag device=gpu.

# using the Scan command for creating an RNN in Theano
[h_values, y_values], _ = theano.scan(fn=rec_function,
                                      sequences=X,
                                      outputs_info=[h0, None],
                                      n_steps=100
                                      )
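
The fragment above assumes a step function rec_function that Scan calls once per element of the sequence X. Below is a minimal sketch of what such a step function and its setup could look like for a simple RNN; the weight names, dimensions and the tanh/sigmoid choices are illustrative assumptions of mine, not code from the original model.

# illustrative sketch: a per-step function for theano.scan
import numpy as np
import theano
import theano.tensor as T

n_in, n_hidden, n_out = 3, 5, 2    # hypothetical sizes
W_xh = theano.shared(np.random.randn(n_in, n_hidden).astype(theano.config.floatX))
W_hh = theano.shared(np.random.randn(n_hidden, n_hidden).astype(theano.config.floatX))
W_hy = theano.shared(np.random.randn(n_hidden, n_out).astype(theano.config.floatX))

X = T.matrix('X')                  # input sequence, one row per time step
h0 = T.zeros((n_hidden,))          # initial hidden state

def rec_function(x_t, h_prev):
    # receives the current input and the previous hidden state,
    # returns the new hidden state and the output for this step
    h_t = T.tanh(T.dot(x_t, W_xh) + T.dot(h_prev, W_hh))
    y_t = T.nnet.sigmoid(T.dot(h_t, W_hy))
    return h_t, y_t

[h_values, y_values], _ = theano.scan(fn=rec_function,
                                      sequences=X,
                                      outputs_info=[h0, None])
rnn = theano.function(inputs=[X], outputs=y_values)

The same script can be switched to GPU execution without changing the model code, for example by launching it with THEANO_FLAGS=device=gpu,floatX=float32.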

One of the few issues you can encounter in Theano is slow compilation of complex models. The issue is becoming less pronounced with newer releases and compilation time is likely to decrease further in the future. However, it may still matter at the experimentation phase, when a researcher tries different architectures or modifies parameters of an existing model. While there is a way to speed up compilation of new models with some tradeoffs using the config.optimizer setting, it is also possible to re-use a compiled model using the standard Pickle module from Python; the config.reoptimize_unpickled_function option allows you to do it. Besides experimentation, re-using models is also useful for production-ready models, for example by storing the compiled model on the server so that subsequent launches are much quicker, or by sharing pre-compiled models across multiple server instances with the same hardware.
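
As a rough sketch of that workflow (with a toy expression standing in for an expensive-to-compile model and an illustrative file name):

# compile once, store the compiled function, and re-use it later
import pickle
import theano
import theano.tensor as T

# optional tradeoff during experimentation: faster compilation, slower runtime
# theano.config.optimizer = 'fast_compile'

x = T.dvector('x')
f = theano.function([x], (x ** 2).sum())   # stand-in for a real, expensive model

with open('model.pkl', 'wb') as out:       # store the compiled function on disk
    pickle.dump(f, out)

# skip re-optimization when the pickled function is loaded back
# (can also be set through THEANO_FLAGS)
theano.config.reoptimize_unpickled_function = False
with open('model.pkl', 'rb') as inp:
    f_restored = pickle.load(inp)

print(f_restored([1.0, 2.0, 3.0]))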

Here is a small example (less than 50 lines) of training a Neural Network to do logical XOR with Theano v0.7rc2. See Figure 1 for the computation graph generated for this code.

import theano
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy as np


x = T.dvector()
y = T.dscalar()


def layer(x, weight):
    # append a constant bias unit to the input, then apply weights and a sigmoid
    bias = np.array([1], dtype=theano.config.floatX)
    new_x = T.concatenate([x, bias])
    dot_product = T.dot(weight.T, new_x)
    output = nnet.sigmoid(dot_product)
    return output

def grad_descent(cost, theta):
    alpha = 0.1 #learning rate
    return theta - (alpha * T.grad(cost, wrt=theta))

# randomly initialize parameters
theta1 = theano.shared(np.array(np.random.rand(3,3), dtype=theano.config.floatX))
theta2 = theano.shared(np.array(np.random.rand(4,1), dtype=theano.config.floatX))

hidden_layer = layer(x, theta1)
output_layer = T.sum(layer(hidden_layer, theta2))
cost_function = (output_layer - y) ** 2

cost = theano.function(inputs=[x, y], outputs=cost_function,
                       updates=[
                            (theta1, grad_descent(cost_function, theta1)),
                            (theta2, grad_descent(cost_function, theta2))])

run_forward = theano.function(inputs=[x], outputs=output_layer)
# training data X
inputs = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
# training data Y
correct_y = np.array([1, 1, 0, 0])
epochs = 5000
current_cost = 0
for i in range(epochs):
    for k in range(len(inputs)):
        current_cost = cost(inputs[k], correct_y[k])
    if i % 100 == 0:
        print('Epochs passed %d. Cost: %s' % (i, current_cost,))
print('Final cost: %s' % (current_cost))
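
Continuing the script, the compiled run_forward function can then be queried directly; a quick illustrative check might look like this:

# query the trained network on each training pattern
for sample in inputs:
    print(sample, run_forward(sample))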