Introduction
In October I took part in the Mozilla bug classification competition on Topcoder and was awarded a prize for 3rd place. Without a doubt, it was a valuable experience and an opportunity to work on a challenging problem. Due to the licensing agreement, I will not describe the exact solution that I submitted, but an alternative version with slightly lower accuracy.
Problem overview
The dataset contained data from the Bugzilla bug tracking system for Mozilla products.
In the previous parts we created a few different models to generate recommendations for scientific papers. Today we will evaluate the quality of the recommendations and select the best-performing model to use in our application. We will compare the built-in ElasticSearch recommendations from part 1, the LSA unigram model from part 2, and the LSA trigram and LDA trigram models from part 3. For evaluation we will use measures of relevance from Information Retrieval.
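As a rough illustration of what a relevance measure from Information Retrieval looks like, the sketch below computes precision at k: the share of the top k recommendations that are relevant. The excerpt does not say which measures the series uses, so the choice of precision at k, the function name `precision_at_k`, and the paper identifiers are all assumptions made for this example.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]  # ranked list, best recommendation first
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical paper ids: a ranked recommendation list and a set of
# papers judged relevant for the query paper.
recommended = ["p1", "p2", "p3", "p4"]
relevant = {"p2", "p4", "p9"}

score = precision_at_k(recommended, relevant, k=2)
```

Averaging such a score over many query papers gives one number per model, which makes the models directly comparable.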
In the previous part we talked about LSA in detail. Today we will consider a few more methods that are used for finding semantic similarity. The new methods, however, still use a term-document matrix and create latent topics, so they are similar to LSA. It would therefore be unnecessary to describe them at the same level of detail as we did for LSA in part 2.
Capturing more features for LSA
In the previous example each term in our TF-IDF matrix was a single word (unigram).
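To make the difference between unigram terms and longer terms concrete, here is a minimal sketch that extracts word n-grams from a sentence. The function name `word_ngrams` and the naive whitespace tokenization are assumptions made for this example, not the code used in the series.

```python
def word_ngrams(text, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens over the sentence.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


terms = word_ngrams("latent semantic analysis works")
```

With `n_max=1` the function yields only single words, which corresponds to the unigram setting; raising `n_max` to 3 adds bigrams and trigrams as extra terms for the matrix.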
In the previous part we used ElasticSearch for finding similar documents. Today we will implement document recommendations with Latent Semantic Analysis, a popular method that is used in 70% of research paper recommenders according to the survey in [J. Beel et al., 2016]. First, however, we need a brief introduction to term-document matrices and TF-IDF.
TF-IDF
TF-IDF (Term Frequency–Inverse Document Frequency) is a statistical method of finding weights of words (terms) in a document with regard to a list of documents (corpus) [K.
My daily job is developing machine learning (ML) systems for NLP and related fields. Due to the nature of my job I read scientific papers on a regular basis and share some of them on my blog (not so much lately). There are multiple sources that I use for finding new papers: arXiv, Twitter posts, personal blogs and university labs’ websites. Over time I realized the need for a system that can keep it all together and make the papers accessible to me.
Recently, I have been looking for methods of language identification for a large collection of heterogeneous documents. After experimenting with different models over the weekend I decided to share one that is simple to implement and can be used as a baseline. It does not require expert knowledge in NLP and can also be used as a foundation for more advanced models. The method we will use today is N-Gram classification, introduced in the “N-Gram-Based Text Categorization” paper by W.
Paper: One-shot Learning with Memory-Augmented Neural Networks, A. Santoro et al., 2016 The topic of memory-augmented Artificial Neural Networks (MANN) is particularly interesting to me and I am glad to see growing attention to it. Back in 2014 I reviewed the “Neural Turing Machines” (NTM) paper by A. Graves et al. that introduced a model of Artificial Neural Network (ANN) capable of Turing-complete computing. Today, I will review another interesting paper from DeepMind that presents an updated version of NTM.
Paper: Dialog-based Language Learning, J. Weston, 2016 The paper introduces a model that aims to learn from natural language sentences. The ability to learn from natural language input is an important goal of Machine Learning that is actively pursued by many researchers. Such an ability will increase the opportunities of supervised learning by overcoming the necessity of labeling large quantities of data samples. Data labeling may require enormous resources even though some of the work can be automated with libraries such as Stanford NLP that parse and annotate the text.
Two years ago I wrote about the Stanford course CS229: Machine Learning by Andrew Ng and why it was still one of the best introductory Machine Learning (ML) courses. Today I will review another Stanford course, CS231n: Convolutional Neural Networks for Visual Recognition, and explain why it may be one of the best introductory courses in Deep Learning (DL). Even though the course’s name implies a focus on Computer Vision (CV), it may be useful for people outside the CV domain.
Paper: Infinite Dimensional Word Embeddings, E. Nalisnick, S. Ravi, 2015 The paper describes the Infinite Skip-Gram (iSG) model, which can learn the number of dimensions for word embeddings rather than using a fixed number of dimensions. This approach is inspired by the idea of infinite dimensions in the Infinite Restricted Boltzmann Machine (iRBM), which was introduced earlier this year by Cote and Larochelle. The infinite size is possible because the layer size is part of the energy function of the iRBM.