Why Stanford course CS231n may be a good introduction to Deep Learning

Two years ago I wrote about the Stanford course CS229: Machine Learning by Andrew Ng and why it was still one of the best introductory Machine Learning (ML) courses. Today I will review another Stanford course, CS231n: Convolutional Neural Networks for Visual Recognition, and explain why it may be one of the best introductory courses in Deep Learning (DL). Even though the course’s name implies a focus on Computer Vision (CV), it may be useful for people outside the CV domain.

A large part of the course covers Machine Learning and Artificial Neural Networks (ANN) material that is not limited to computer vision. The course starts with basic ML methods such as classification and Nearest Neighbor search, then moves to Support Vector Machines (SVM) before going through the basics of ANN in general and CNN in particular. The authors also introduce core concepts like working with a dataset, loss functions, hyper-parameters, regularization, and optimization of the model. These concepts are relevant for all ML tasks and in my opinion should be a pre-requisite for such a course. Nevertheless, their introduction makes the course friendly for beginners and makes it a good introduction to ANN and Deep Learning in general. The foundation of ANN remains the same whether you build a Recursive, Recurrent, or Convolutional network, and the course provides that foundation. Overall, the course is a great candidate for introducing people to the field of Deep Learning.

Competitors

There are not many competitors to this course at the moment. For example, the course Neural Networks by G. Hinton (University of Toronto, 2012) is more advanced and challenging. It covers ANN in more detail and offers good knowledge of different network models without too much emphasis on particular architectures. However, we need to consider that even though some of its parts are still relevant, there have been many changes since 2012 in such a rapidly changing field as ANN. You can spot the outdated information if you keep track of the field, but not if you are just starting your journey. Additionally, it does not include methods that were adopted in the last few years, such as Batch Normalization, Variational Autoencoders, and new network weight initialization techniques (e.g. K. He et al., 2015 for ReLU). Thus, I would recommend using it with a teacher or a colleague who already has the necessary expertise to guide you through it.

Another competitor is the Deep Learning course by V. Vanhoucke on Udacity, published earlier this year. Personally, I find the course rather controversial and am not sure about its target audience. It seems too short (less than 1 hour) to cover the included topics and feels rushed. Many concepts are mentioned briefly and are not well explained. For example, the topic of word embeddings is covered in less than 5 minutes, which is very little even for a high-level explanation. In contrast, Stanford CS224d dedicated two whole lectures, each over an hour, to word embeddings. Of course, there is no need to go into such detail in an overview course. However, even if you extract only the most essential parts of the topic, it is difficult to squeeze 2+ hours of material into less than 5 minutes. If I were a beginner without prior experience with word embeddings, I am sure I wouldn’t understand most of the things thrown at me. This feeling stayed with me throughout the course, where the author tries to explain each concept in only a few minutes. And this leads to a conclusion: if you are a beginner, you will face difficulties learning from this short course. Getting a lot of new information with brief and hasty explanations may lead to confusion. On the other hand, experienced people will understand these short videos because they already know the material; they can potentially use the course to recap it, but even in that case the course omits too much information and lacks depth. Thus, I am confused about the target audience of the course and do not see it as a competitor to the much larger and deeper CS231n.

Another great course from Stanford is CS224d: Deep Learning for NLP from 2015. It focuses more on text processing and may be less useful if you are looking for general lectures on Deep Learning. Even though the course covers Recurrent and Recursive networks, these architectures are introduced in the context of text processing. Because my work involves text processing projects that regularly use ANN in production, I really liked the course. However, CS231n has more domain-neutral material and describes it in more detail and depth. In my opinion CS231n is friendlier for beginners and takes time to prepare a student for ANN models and their training in general, while seamlessly including CV-related information before moving fully into the CV domain. Additionally, CS231n includes the latest findings in the field. Even though less than a year passed between the two courses, CS231n already mentions new interesting methods and optimizations. To summarize, CS224d is great if you are interested in the NLP context of DL; otherwise CS231n may suit you better.

Course highlights

The course covers a number of topics in ML and DL, both in theory and in practice. Here are a few that stood out to me:

  • Detailed explanation of backpropagation of error in Neural Networks. Even though it has been covered many times elsewhere, the authors go through backpropagation in great detail and in multiple steps, showing how the gradient is computed in an ANN. This is demonstrated on a simple example of a network and its computational graph that is easy to follow and visualize. In addition to the network diagram, the slides include the corresponding source code in Python (a minimal sketch in the same spirit appears after this list).

  • Information about a few popular activation functions, their advantages, and their areas of application. For example, the authors discuss why tanh is better than sigmoid and why in some cases ReLU (vanilla, leaky, etc.) may be superior to tanh. The authors also go into important details like weight initialization techniques (e.g. Xavier, 2010 for tanh and K. He et al., 2015 for ReLU) that significantly affect the training of an ANN.

  • Gradient update rules, from RMSProp and Nesterov momentum to the more recent Adam (D. P. Kingma and J. Ba, 2014), which works similarly to RMSProp with momentum (a one-step sketch appears after this list).

  • The recent trend of DL models with an increasing number of hidden layers, and how researchers use Residual Learning to make models deeper, with examples from C. Szegedy et al., 2014 and K. He et al., 2015.

  • The importance of monitoring the training process and interpreting accuracy and loss values on training and test data, and of using this information to adjust hyper-parameters for better accuracy and less overfitting. These are practical tips that many introductory courses omit, even though they can have a great impact on a model’s training. This is especially important for beginners who don’t have experience in the field and are not aware of common pitfalls.

  • Explanation of why the Dropout method (N. Srivastava et al., 2014) works as regularization and what it has in common with training an ensemble of models. The method is described in detail and explained step by step on a simple example (a short sketch of the usual “inverted” implementation appears after this list). Even though I had read the original paper, I found the lecturer’s explanation easier to follow.

  • Information about practical aspects of training and running ANN, including hardware (GPU, memory, and storage) and distributed training.

  • The course even includes the Batch Normalization method (S. Ioffe and C. Szegedy, 2015), which was relatively new at the time CS231n started and which I briefly reviewed last year as a very promising method. Less than a year later, it has gained significant popularity both in industry and among researchers, and the paper has already been cited a few hundred times (a simplified forward pass appears after this list).

  • Traditional Autoencoders and the more recent Variational Autoencoders (VAE) (D. P. Kingma and M. Welling, 2013). The latter use Variational Bayesian methods that allow us to generate data from a trained model. The course gives enough information about these methods to convey their capabilities and enough mathematical reasoning to explain the idea behind them. In my opinion, the authors could have moved some of the math behind VAE into the reading materials, because it affected the pace of the lecture and felt out of place compared to other lectures. This was one of the few cases in this course where reading the original paper was quicker and easier to understand.
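
To make a few of these highlights more concrete, here are some minimal sketches in plain Python/NumPy. They are my own simplified illustrations rather than code from the course slides, so the function and parameter names are hypothetical and many practical details are skipped.

First, backpropagation on a tiny computational graph: a single sigmoid neuron f(w, x) = sigmoid(w0*x0 + w1*x1 + w2). The forward pass caches intermediate values; the backward pass walks the graph in reverse, multiplying local gradients according to the chain rule.

```python
import math

w = [2.0, -3.0, -3.0]   # weights; w[2] plays the role of the bias
x = [-1.0, -2.0]        # inputs

# forward pass
dot = w[0] * x[0] + w[1] * x[1] + w[2]
f = 1.0 / (1.0 + math.exp(-dot))              # sigmoid output

# backward pass: start from df/df = 1 and apply the chain rule
ddot = (1.0 - f) * f                          # local gradient of the sigmoid w.r.t. dot
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]   # gradients w.r.t. w0, w1, w2
dx = [w[0] * ddot, w[1] * ddot]               # gradients w.r.t. x0, x1
```

Next, a single Adam step: an RMSProp-like second moment scales the update per parameter, a momentum-like first moment smooths the gradient, and both are bias-corrected. The default hyper-parameters below are the ones suggested in the paper.

```python
import numpy as np

def adam_step(w, dw, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w given gradient dw; t starts at 1."""
    m = beta1 * m + (1 - beta1) * dw            # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * dw ** 2       # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Dropout is usually implemented in its “inverted” form: at training time a random keep-mask is sampled and the surviving activations are rescaled by 1/p, so the expected activation stays the same and nothing extra has to be done at test time.

```python
import numpy as np

def dropout_forward(h, p=0.5, train=True):
    """Inverted dropout on activations h; p is the probability of keeping a unit."""
    if not train:
        return h                                 # test time: identity
    mask = (np.random.rand(*h.shape) < p) / p    # drop and rescale in one step
    return h * mask
```

Finally, a simplified Batch Normalization forward pass for a fully-connected layer (training mode only, ignoring the running statistics needed at test time): each feature is normalized over the mini-batch and then scaled and shifted by the learnable parameters gamma and beta.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x has shape (batch, features); gamma and beta are learnable per-feature."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift
```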

Conclusion

In my opinion CS231n is probably the best introductory course in Deep Learning for beginners, or for specialists who want to refresh their memory of the theory. Even though the name implies a focus on CV, it provides good material on general methods and concepts of Deep Learning. You will benefit from the course even if your area of interest lies outside CV (e.g. text processing or speech recognition). Content specific to CV is introduced gradually in the first half of the course and intensifies in the second half. However, the CV-related content does not require deep prior knowledge to follow and still provides valuable information about Convolutional and Recurrent Neural Networks. The course also includes recent findings in the field, which is especially important in such a rapidly changing area of research. Therefore, I recommend the course to everyone interested in entering the field of Deep Learning.