In October I took part in the Mozilla bug classification competition on Topcoder and was awarded the prize for third place. Without a doubt, it was a valuable experience and an opportunity to work on a challenging problem. Due to the licensing agreement, I will not describe the exact solution that I submitted but rather an alternative version with slightly lower accuracy.
The dataset contained data from the Bugzilla bug-tracking system for Mozilla products. Given the high number of products and components in the system, correctly assigning a bug's category is time-consuming, especially for bug reporters outside the company. Therefore, the goal of the competition was to automate the process by predicting a software component from the bug ticket data.
Looking at the data
If we look at the dataset, we can see that the number of unique products is 146, which is indeed quite high. Moreover, we have 2134 unique components in the system, with the following distribution of the number of components per product.
However, some of these components are probably no longer active, because only 1788 components appear in our training set. Nevertheless, building a classifier for 1788 classes may be quite challenging and requires a considerable amount of data. Our training set has 443,270 samples, which are highly imbalanced, as can be seen in the histogram in Figure 2 below.
For example, if we count the classes with only 1 sample and those with over 1000 samples per class, we get 108 in both cases. There is a high chance that such imbalance may affect the performance of the models.
Multiple parts of the dataset contain different information about the ticket and the product/component. Judging by their structure, I assume they were exported from a relational database. Thus, it was easy to insert them into SQLite tables and use SQL queries to build additional features.
During experiments, I saw that the short text description of the tickets was the most important source of data, contributing the most to my models’ accuracy. Unfortunately, long text descriptions were too noisy to use because they also contained details like system information, steps to reproduce the bug, references to other software products not related to the components, etc.
The first step was building a baseline model. I used Complement Naive Bayes [Rennie et al., 2003], which handles imbalanced classes rather well, with TF-IDF features from the short description texts. After that, I experimented with word embeddings and Bi-LSTM networks [Schuster et al., 1997] to beat the baseline. Due to the high number of domain-specific words, standard pre-trained embeddings had a high number of out-of-vocabulary cases. To address this issue, I trained a small word embedding model with FastText [Bojanowski et al., 2017] and tuned its parameters to account for the small size of our dataset. In the end, I had a model with a vector size of 32 which performed quite well considering the limited dataset.
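A minimal sketch of such a baseline, assuming scikit-learn; the short descriptions and component names below are toy stand-ins for the competition data, not actual tickets:

```python
# Baseline sketch: TF-IDF over short descriptions -> Complement Naive Bayes.
# The data here is illustrative only.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB

short_descriptions = [
    "crash when opening preferences dialog",
    "bookmark toolbar icons not rendering",
    "crash on startup after update",
    "toolbar customization resets on restart",
]
components = ["Preferences", "Toolbars", "Startup", "Toolbars"]

baseline = make_pipeline(
    # Word unigrams and bigrams; sublinear TF damps very frequent terms.
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    ComplementNB(),
)
baseline.fit(short_descriptions, components)
print(baseline.predict(["toolbar icons missing"]))
```

Complement Naive Bayes estimates each class's parameters from the *complement* of that class's documents, which is why it degrades more gracefully than standard multinomial NB when classes are heavily imbalanced.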
After training the custom embeddings, it was time to use them together with pre-trained embeddings to improve the performance of the model. The improvement was significant, but even an ensemble of different models (Bi-LSTM with embeddings, character-level CNN [Zhang et al., 2015], char n-gram Complement Naive Bayes, an SGD classifier for non-text data, etc.) was still under the accuracy threshold required to get a prize. So it was time to experiment more with different architectures and methods.
Things that helped to improve performance further:
Cyclical Learning Rates [Smith, 2017]
Additional statistical features from text
Proxy labeling [McClosky et al., 2006; Zhou et al., 2005]
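Of these, the cyclical learning rate is the easiest to sketch. Below is a minimal triangular schedule in the spirit of Smith [2017]; the `base_lr`, `max_lr`, and `step_size` values are illustrative, not the ones I tuned for the competition:

```python
# Triangular cyclical learning rate: the LR sweeps linearly from base_lr
# up to max_lr and back down, once per cycle of 2 * step_size iterations.
def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    cycle = iteration // (2 * step_size)
    # x goes 1 -> 0 -> 1 within each cycle
    x = abs(iteration / step_size - 2 * cycle - 1)
    return base_lr + (max_lr - base_lr) * (1 - x)

# Start of a cycle: base_lr; midpoint of a cycle: max_lr.
print(triangular_clr(0), triangular_clr(2000))
```

The schedule would typically be plugged into the optimizer as a per-iteration callback rather than called by hand like this.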
In the end, the main model in the ensemble, carrying the highest weight, was a Bi-LSTM network fed with both the pre-trained FastText embeddings and the custom small 32-dimensional FastText embeddings. After the Bi-LSTM layer, its output is reduced with a 1-dimensional convolution layer and max pooling, then combined with statistical features of the tickets' text descriptions.
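A rough sketch of that architecture in PyTorch. All dimensions here (vocabulary size, hidden size, number of statistical features, filter count) are made-up placeholders, not the values from my actual model:

```python
import torch
import torch.nn as nn

class TicketClassifier(nn.Module):
    """Bi-LSTM over concatenated embeddings -> Conv1d + max pool -> classifier."""

    def __init__(self, vocab_size=1000, pretrained_dim=300, custom_dim=32,
                 num_stats=8, num_classes=1788, hidden=64):
        super().__init__()
        # Two embedding tables (pre-trained FastText and the custom 32-d one),
        # concatenated per token. In practice the pre-trained table would be
        # initialized from FastText vectors and possibly frozen.
        self.emb_pre = nn.Embedding(vocab_size, pretrained_dim)
        self.emb_custom = nn.Embedding(vocab_size, custom_dim)
        self.bilstm = nn.LSTM(pretrained_dim + custom_dim, hidden,
                              batch_first=True, bidirectional=True)
        # Reduce the Bi-LSTM sequence output with a 1-D convolution + max pooling.
        self.conv = nn.Conv1d(2 * hidden, 64, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(64 + num_stats, num_classes)

    def forward(self, tokens, stats):
        x = torch.cat([self.emb_pre(tokens), self.emb_custom(tokens)], dim=-1)
        x, _ = self.bilstm(x)             # (batch, seq, 2 * hidden)
        x = self.conv(x.transpose(1, 2))  # (batch, 64, seq)
        x = self.pool(x).squeeze(-1)      # (batch, 64)
        x = torch.cat([x, stats], dim=-1) # append text statistics
        return self.fc(x)

model = TicketClassifier()
logits = model(torch.randint(0, 1000, (2, 20)), torch.randn(2, 8))
print(logits.shape)  # torch.Size([2, 1788])
```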
In the final stage of the competition, I added proxy labeling to my pipeline and was able to improve the accuracy well above the 50% threshold in local cross-validation. Of course, this meant I had to be careful in interpreting results on the public leaderboard. The situation was exacerbated by a relatively small public validation set, which was only 10% of the test set. Therefore, I relied only on my cross-validation metrics for selecting the final model. As a result, I got a much better score on the final leaderboard and climbed from 8th place to 3rd with a score of 0.462.
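The proxy-labeling step can be sketched roughly as follows, assuming a scikit-learn classifier and toy data. The 0.5 confidence threshold is only for this two-class illustration; a real pipeline would use a much stricter cutoff:

```python
# Proxy labeling (self-training) sketch: predict on unlabeled tickets, keep
# only high-confidence predictions, fold them back into the training set.
# All data below is illustrative.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB

labeled = ["crash on startup", "toolbar icons missing", "startup hang after update"]
labels = ["Startup", "Toolbars", "Startup"]
unlabeled = ["crash right after startup", "weird toolbar rendering"]

vec = TfidfVectorizer()
X_all = vec.fit_transform(labeled + unlabeled)
X_lab, X_unlab = X_all[:len(labeled)], X_all[len(labeled):]

clf = ComplementNB().fit(X_lab, labels)

# Keep only predictions the model is confident about. With two classes the
# maximum probability is always >= 0.5, so here everything passes.
proba = clf.predict_proba(X_unlab)
confident = proba.max(axis=1) >= 0.5
proxy_labels = clf.classes_[proba.argmax(axis=1)][confident]

# Retrain on the augmented training set.
X_aug = vstack([X_lab, X_unlab[confident]])
y_aug = list(labels) + list(proxy_labels)
clf = ComplementNB().fit(X_aug, y_aug)
print(len(y_aug))
```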
Alternative solution to the problem
As I mentioned earlier, the main goal was to automate the time-consuming task of categorizing bug tickets across more than a thousand components. Unfortunately, even the best solutions submitted to the competition could not cross the 0.50 score threshold originally set by the organizers. Therefore, it was reasonable to assume that the single-label classification approach had exhausted most of the options for improvement that could be implemented in the short time frame of a few weeks. On the other hand, there could be other ways to solve this problem.
After looking at the available data, I thought of alternative ways of addressing this business problem. Instead of single-label classification, we can treat it as a problem similar to recommendation systems. In this case, when reviewing a ticket, an engineer could be presented with a few suggestions of the most relevant components based on the ticket description. Similarly, such component suggestions could be offered to users filing tickets.
To test my hypothesis, I took some of the models that I used for the challenge and measured their Top-N accuracy. In Top-N accuracy, we take the N classes with the highest predicted probability and check whether they contain the correct class. I selected N = 3, which seems like a reasonable trade-off between usability and accuracy for a potential system, while Top-1 is just the regular accuracy metric for single-label classification. The test included two models: (1) Complement Naive Bayes with unigram word-level TF-IDF and (2) the ensemble described in the previous section. They were the least and the most computationally intensive models, respectively, and I was interested to see if there is a significant difference in accuracy change between them.
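Top-N accuracy itself is straightforward to compute. A small NumPy sketch on made-up probabilities for 4 samples and 5 classes:

```python
import numpy as np

def top_n_accuracy(proba, y_true, n=3):
    """Fraction of samples whose true class is among the n most probable."""
    top_n = np.argsort(proba, axis=1)[:, -n:]  # indices of the n best classes
    hits = [y in row for y, row in zip(y_true, top_n)]
    return float(np.mean(hits))

# Illustrative predicted class probabilities, one row per sample.
proba = np.array([
    [0.10, 0.50, 0.20, 0.15, 0.05],
    [0.30, 0.05, 0.40, 0.10, 0.15],
    [0.05, 0.05, 0.10, 0.20, 0.60],
    [0.25, 0.25, 0.15, 0.20, 0.15],
])
y_true = [1, 4, 0, 3]

# Top-1 is ordinary accuracy; Top-3 credits any of the three best guesses.
print(top_n_accuracy(proba, y_true, n=1), top_n_accuracy(proba, y_true, n=3))
```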
Both models were evaluated with 10-fold cross-validation, and the statistical significance of the difference in mean accuracy was tested with the Wilcoxon signed-rank test. First, I tested Complement Naive Bayes, which went from 0.280 Top-1 accuracy to 0.416 Top-3 accuracy. Next, I ran the same test on the ensemble that included a few neural networks (described in the previous section); it went from 0.454 Top-1 to 0.615 Top-3 accuracy. Without a doubt, in both cases there was a considerable improvement achieved simply by changing the approach to the problem. It is also worth noting that the simpler model's gains were significantly higher than the ensemble's, with mean accuracy improvements of 48.5% and 35.4% respectively. Therefore, further investigation into a Top-N component-recommendation solution may be warranted to evaluate the extent of possible accuracy improvements with different models on this problem.
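For reference, the fold-wise comparison can be run with SciPy's Wilcoxon signed-rank test. The fold accuracies below are made up purely for illustration; they are not the competition numbers:

```python
# Paired, fold-wise comparison of Top-1 vs Top-3 accuracy with the
# Wilcoxon signed-rank test. The fold scores are illustrative only.
from scipy.stats import wilcoxon

top1_folds = [0.27, 0.29, 0.28, 0.28, 0.27, 0.29, 0.28, 0.27, 0.28, 0.29]
top3_folds = [0.41, 0.42, 0.41, 0.42, 0.42, 0.41, 0.42, 0.41, 0.42, 0.42]

stat, p_value = wilcoxon(top1_folds, top3_folds)
print(p_value)  # a small p-value suggests the Top-3 gain is not chance
```

The test is paired (same folds under both metrics) and non-parametric, which is why it is a reasonable choice for only 10 cross-validation folds.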
Rennie, Jason D. M., Lawrence Shih, Jaime Teevan and David R. Karger. “Tackling the Poor Assumptions of Naive Bayes Text Classifiers.” ICML (2003).
Schuster, Mike and Kuldip K. Paliwal. “Bidirectional recurrent neural networks.” IEEE Trans. Signal Processing 45 (1997): 2673-2681.
Bojanowski, Piotr, Edouard Grave, Armand Joulin and Tomas Mikolov. “Enriching Word Vectors with Subword Information.” Transactions of the Association for Computational Linguistics 5 (2017): 135-146.
Zhang, Xiang, Junbo Jake Zhao and Yann LeCun. “Character-level Convolutional Networks for Text Classification.” NIPS (2015).
Smith, Leslie N. “Cyclical Learning Rates for Training Neural Networks.” 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (2017): 464-472.
McClosky, David, Eugene Charniak and Mark Johnson. “Effective Self-Training for Parsing.” HLT-NAACL (2006).
Zhou, Zhi-Hua and Ming Li. “Tri-Training: Exploiting Unlabeled Data Using Three Classifiers.” IEEE Trans. on Knowl. and Data Eng. 17 (2005): 1529-1541.