Organizing scientific papers with a personal recommendation system. Part 4: Evaluation & NDCG

In the previous parts we created several models to generate recommendations for scientific papers. Today we will evaluate the quality of those recommendations and select the best performing model to use in our application. The candidates are the ElasticSearch built-in recommendations from part 1, the LSA unigram model from part 2, and the LSA and LDA trigram models from part 3. For the evaluation we will use relevance measures from Information Retrieval.

There are two types of measures we can use: binary relevance and multi-level (graded) relevance. With binary relevance a paper recommendation has two states, relevant and not relevant, which makes collecting feedback from users simple. Examples of such measures include Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR). However, in the context of scientific papers binary relevance may not be the best choice: when you get a list of recommended papers, chances are most of them are at least somewhat relevant to the document in question. Thus, it is beneficial to distinguish several levels of relevance between papers. Since I built the recommendation system for personal use, the end user (me) was willing to rate the generated recommendations on a multi-level scale, so I decided to use Normalized Discounted Cumulative Gain (NDCG) [K. Järvelin & J. Kekäläinen, 2000].
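As a point of comparison with the graded measures used below, here is a minimal sketch of Mean Reciprocal Rank over binary relevance judgments. The function name and the example lists are illustrative and not part of the actual evaluation code.

def mean_reciprocal_rank(result_lists):
    # Each result list holds binary labels (1 = relevant, 0 = not relevant)
    # in ranked order; a query's score is the reciprocal of the position
    # of its first relevant item.
    total = 0.0
    for results in result_lists:
        for position, is_relevant in enumerate(results, start=1):
            if is_relevant:
                total += 1.0 / position
                break
    return total / len(result_lists)

# Two hypothetical queries: the first relevant item appears at
# positions 2 and 1 respectively, so MRR = (1/2 + 1/1) / 2 = 0.75.
print(mean_reciprocal_rank([[0, 1, 0, 1], [1, 0, 0, 0]]))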

NDCG

The main component of NDCG is Discounted Cumulative Gain (DCG). When computing DCG we assume that documents with high relevance are more useful (higher gain) to users and should be ranked higher in a query result. Similarly, less relevant (lower gain) documents should be ranked lower. Given the relevance rel_i of the document at rank i, we can compute the total accumulated gain at rank position p according to Equation 1.

DCG_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i + 1)}

Equation 1: DCG

The code below shows an implementation of DCG in Python. Computing DCG on a list of relevance ratings gives a non-normalized value, e.g. 9.05.

import math

def calc_dcg(items):
    # Classic DCG: each item's gain is its relevance rating,
    # discounted by the base-2 log of its rank position.
    dcg = 0
    for i, item in enumerate(items, start=1):
        dcg += item / math.log(i + 1, 2)
    return dcg

rel_ratings = [2, 4, 5, 3, 1, 1]
dcg = calc_dcg(rel_ratings)
print(dcg)
> 9.05880868285

However, the raw number is not easy to interpret when comparing queries, especially if the query results have varying lengths. Thus, it is better to normalize the DCG value to the range from 0.0 to 1.0, which is achieved with Normalized DCG (NDCG). To compute NDCG we calculate the ideal DCG (IDCG) and divide DCG by it: NDCG = DCG/IDCG. Essentially, IDCG is the DCG of the most relevant items for the query sorted by relevance in descending order. If we know the relevance of all items in the query, we can simply sort them by relevance to compute IDCG. The code below demonstrates the computation of DCG, IDCG and NDCG.

rel_ratings = [2, 4, 5, 3, 1, 1]
dcg = calc_dcg(rel_ratings)
# IDCG is the DCG of the same ratings sorted in descending order.
rel_ratings.sort(reverse=True)
idcg = calc_dcg(rel_ratings)
ndcg = dcg / idcg
print("DCG: %f, IDCG: %f, NDCG: %f" % (dcg, idcg, ndcg))
>DCG: 9.058809, IDCG: 10.628132, NDCG: 0.852342

There are other variants of NDCG, with different discount functions or stronger emphasis on highly relevant documents. For example, the variant in [C. Burges et al., 2005] uses an exponential gain, 2^rel - 1, so highly relevant documents contribute much more, see Equation 2.

DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}

Equation 2: DCG alternative formula

This modification is quite popular in search engine evaluation, where the first few results are extremely important, but it may not be the best choice for our case. In our recommendation system it is possible to have recommendations rated as irrelevant (relevance score 0) because we evaluate multiple models with varying accuracy. Thus, the penalty for irrelevant recommendations should not be overshadowed by a few misplaced documents in the top positions. Let me illustrate this with an example, where model A recommends a list of items with relevance ratings [4, 3, 3, 4, 2, 2, 0, 0] while the ideal list of relevance ratings is [4, 4, 3, 3, 2, 2, 2, 1]. In this example the model missed two relevant documents at the end of the list. If we compare the classic and modified versions of NDCG, we see that the modified version penalizes the model less for missing those two documents because of their low positions. The following code compares the two variants of NDCG.

def calc_modified_dcg(items):
    # Modified DCG: exponential gain (2^relevance - 1) puts much more
    # weight on highly relevant documents.
    dcg = 0
    for i, item in enumerate(items, start=1):
        dcg += (math.pow(2, item) - 1) / math.log(i + 1, 2)
    return dcg

def calc_ndcg(relevances, ideal=None, is_classic=True):
    # Use the classic or the modified DCG formula for both DCG and IDCG.
    dcg_func = calc_dcg if is_classic else calc_modified_dcg
    dcg = dcg_func(relevances)
    if ideal is None:
        ideal = relevances
    # IDCG is the DCG of the ideal ratings sorted in descending order.
    ideal = sorted(ideal, reverse=True)
    idcg = dcg_func(ideal)
    return dcg / idcg

rel_scores = [4, 3, 3, 4, 2, 2, 0, 0]
ideal      = [4, 4, 3, 3, 2, 2, 2, 1]
class_res = calc_ndcg(rel_scores, ideal, is_classic=True)
mod_res = calc_ndcg(rel_scores, ideal, is_classic=False)
print("classic %f" % class_res)
print("modified %f" % mod_res)

>classic 0.899662
>modified 0.915492

As we can see, the classic NDCG formula is better suited for our application because retrieving irrelevant documents is also an important factor in the evaluation.

Evaluation protocol and results

First of all, we have to acknowledge that we do not have ground truth labels for our dataset of 40k papers. The number of possible recommendation pairs grows quadratically (n^2) with the number of documents in the dataset, which is prohibitively high to rate entirely. As a result, we will measure the relative performance of the models on a limited set of rated recommendation pairs. The most naive approach would be to select the top N (e.g. 15) recommendations from one particular model and use them as a baseline. However, there is a chance that only a few of the recommendations rated for, say, LSA would also appear among the recommendations of the other methods, e.g. LDA. Consequently, we cannot evaluate recommendations we have never seen. This raises a question: should we mark unseen recommendations as irrelevant (relevance rating 0)? If so, our labeled dataset will be biased toward one particular method. Therefore, I decided to aggregate recommendations by selecting the top 8 recommendation pairs from each of our models (ElasticSearch, LSA and LDA).

For each reviewed document the recommendations will be merged together and duplicates will be removed. The merged recommendations will be randomly shuffled to avoid potential bias based on their position: if I knew that the position of a recommendation reflected its relevance score, it might influence the rating I give it. The list of models used to generate recommendations is shown in Figure 1; a rough sketch of this merging step follows the figure. Here ES abs. and ES text are ElasticSearch recommendations based on the abstract and the full document text respectively; LSA 300, 200 and 120 are LSA trigram models with the number of topics set to 300, 200 and 120; LDA 300, 200 and 120 are LDA trigram models with the number of topics set to 300, 200 and 120.

Figure 1: Recommendations list
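Here is a minimal sketch of how the rating pool for one document could be assembled. It assumes each model exposes a callable that returns a ranked list of paper ids; the function name build_rating_pool and the data layout are illustrative rather than the application's actual code.

import random

def build_rating_pool(doc_id, models, top_n=8):
    # models: mapping of model name -> callable that returns a ranked
    # list of recommended paper ids for the given target document.
    pool = []
    seen = set()
    for recommend in models.values():
        # Take the top-N recommendations of each model and merge them,
        # skipping papers already added by another model.
        for paper_id in recommend(doc_id)[:top_n]:
            if paper_id not in seen:
                seen.add(paper_id)
                pool.append(paper_id)
    # Shuffle so the display order gives no hint of any model's ranking.
    random.shuffle(pool)
    return pool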

For evaluation I added a page to my web application that allows rating the recommendations on a scale of 1 to 5 stars, where 1 stands for an irrelevant document, 2 for marginally relevant, 3 for moderately relevant, 4 for relevant and 5 for highly relevant. Below is the rating procedure I followed; it is also demonstrated in Animation 1.

For each paper in the recommended papers:
    Scroll to the next unrated paper
    Read the abstract of the recommended paper
    If I have read the paper before:
        Rate it
    Else:
        Open the paper and read it
        Rate it

Animation 1: Rating process

All ratings were performed by a single person (myself) over a period of three weeks. The ratings reflect my subjective perception of recommendation relevance, which suits the goal of evaluating recommender models for personal use. The topics of the documents were Machine Learning and Natural Language Processing (for more information please check part 1). Only documents that I had read before were used as targets for generating recommendations. The total number of such documents was 65 and the total number of recommendation pairs was 1280, an average of 19.7 recommendation pairs per document. The size of the labeled dataset was limited by the amount of free time I had for rating recommendations. While it is relatively small compared to the set of all documents, it should be enough to answer the main question: which model should I keep for generating recommendations in my application? The results of the evaluation can be found in Table 1.

Model      NDCG, %
ES abs.    57.2
ES text    51.5
LSA 300    66.4
LSA 200    68.3
LSA 120    68.1
LDA 300    75.4
LDA 200    74.1
LDA 120    72.8

Table 1: Evaluation results
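For completeness, here is a minimal sketch of how a per-model score like those in Table 1 could be computed from the star ratings, reusing calc_ndcg from above. It assumes the 1-5 star ratings map to relevance values 0-4 and that per-document NDCG values are averaged per model; the argument names and data structures are assumptions for illustration, not the actual evaluation code.

def average_ndcg(model_rankings, pooled_ratings, is_classic=True):
    # model_rankings: {doc_id: [paper_id, ...]} -- a model's ranked
    # top-8 recommendations for each rated target document.
    # pooled_ratings: {doc_id: {paper_id: stars}} -- 1-5 star ratings
    # collected for the merged recommendation pool of each document.
    scores = []
    for doc_id, ranking in model_rankings.items():
        ratings = pooled_ratings[doc_id]
        # Assumed star-to-relevance mapping: 1-5 stars -> 0-4 relevance.
        rel = [ratings[paper_id] - 1 for paper_id in ranking]
        # Ideal ordering: the best-rated papers in the pool,
        # truncated to the length of the model's list.
        ideal = sorted((stars - 1 for stars in ratings.values()),
                       reverse=True)[:len(rel)]
        scores.append(calc_ndcg(rel, ideal, is_classic=is_classic))
    # Average the per-document NDCG values to get the model's score.
    return sum(scores) / len(scores)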

As we can see, trigram LDA with 300 topics performed the best while the ElasticSearch (ES) recommendations based on the full document text performed the worst. I suspect the poor performance of the full-text ES variant is caused by noise in the document text, which makes the Bag-of-Words TF-IDF model less accurate; this would also explain why it loses to the ES recommendations based on the abstract only. Both ES methods perform significantly worse than their trigram competitors. Even though I did not check the ES recommendation source code, I suspect it only uses unigrams, which could explain its performance. LDA's strong advantage over LSA could be explained by the fact that our dataset is not diverse enough to produce orthogonal topics even with a relatively low topic count of 120; indeed, LSA with fewer topics (120 and 200) actually outperforms LSA with 300 topics. LDA, on the other hand, does not assume topics to be orthogonal and benefits from a larger number of topics, capturing more fine-grained differences between them.

This concludes part 4. We evaluated our recommendation models and selected the best performing method. In the next parts we may look at supervised learning methods for improving our system.

References

  • Järvelin, Kalervo, and Jaana Kekäläinen. “IR evaluation methods for retrieving highly relevant documents.” In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 41-48. ACM, 2000.
  • Burges, Chris, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. “Learning to rank using gradient descent.” In Proceedings of the 22nd international conference on Machine learning, pp. 89-96. ACM, 2005.