Organizing scientific papers with personal recommendation system. Part 2: TF-IDF & LSA

In the previous part we used Elasticsearch for finding similar documents. Today we will implement document recommendations with Latent Semantic Analysis, a popular method that, according to the survey in [J. Beel et al., 2016], is used in about 70% of research paper recommenders. First, though, we will need some brief background on term-document matrices and TF-IDF.

TF-IDF

TF-IDF (Term Frequency–Inverse Document Frequency) is a statistical method of finding weights of words (terms) in a document with regards to a list of documents (corpus) [K. Jones, 1972]. It has two components: TF and IDF.

TF (Term Frequency). In the simplest case, term frequency is just the number of times a word occurs in a document divided by the total number of terms in the document. There are other variations of TF besides raw counting, e.g. augmented term frequency, where this number is normalized by the maximum frequency of any term in the particular document. The normalization is intended to reduce the impact of document length on TFs.
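As a quick sketch of the augmented variant (using a naive whitespace tokenizer; the helper name is just for illustration):

def augmented_tf(term, document):
    # Naive tokenization: lowercase the text and split on whitespace
    terms = document.lower().split()
    raw_count = terms.count(term)
    max_count = max(terms.count(t) for t in set(terms))
    # 0.5 + 0.5 * raw/max keeps the score in [0.5, 1] regardless of document length
    return 0.5 + 0.5 * raw_count / max_count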

IDF (Inverse Document Frequency). Using TF alone for calculating document word weights may not be optimal because it only operates within the scope of a single document. It is more reasonable to also use information about the term frequencies of the other documents in the corpus. In this case we can adjust term weights, e.g. reduce the impact of terms that are common across the corpus. For example, the word “data” will probably be more common in research papers than the word “lemmatization”. Such scale adjustment is calculated in IDF as shown in Equation 1, where N is the number of documents in the corpus D, t is a term and d is a document.

$$\mathrm{IDF}(t, D) = \log\frac{N}{\lvert \{\, d \in D : t \in d \,\} \rvert}$$

Equation 1: IDF
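For instance, if our corpus has N = 3 documents and the word “data” occurs in all three of them while “lemmatization” occurs in only one, then (using the natural logarithm) IDF(“data”) = log(3/3) = 0 and IDF(“lemmatization”) = log(3/1) ≈ 1.1, so the common word is down-weighted and the rare one is boosted.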

When we multiply TF and IDF we get TFIDF(t, d, D) = TF(t, d) * IDF(t, D). Here is source code in pure Python that implements TF-IDF:

import math

def tokenize(document):
    # Naive tokenizer: lowercase the text and split on whitespace
    return document.lower().split()

def tf(term, document):
    terms = tokenize(document)
    terms_count = float(len(terms))
    return terms.count(term) / terms_count

def get_num_docs_contain(term, corpus):
    occurrence_sum = 0
    for doc in corpus:
        if term in tokenize(doc):
            occurrence_sum += 1
    return occurrence_sum

def idf(term, corpus):
    num_docs = get_num_docs_contain(term, corpus)
    # +1 so we avoid division by 0
    return math.log(len(corpus) / (1.0 + num_docs))

def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)
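To sanity-check these functions, here is a small usage example on a toy corpus (the corpus is made up just for illustration):

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "lemmatization reduces words to lemmas"]

# A word that occurs in most documents gets a weight of (almost) zero...
print(tfidf("the", corpus[0], corpus))            # 0.0, since idf("the") = log(3/3) = 0
# ...while a rare, more specific word gets a higher weight
print(tfidf("lemmatization", corpus[2], corpus))  # ~0.08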

However, if you want to use TF-IDF in practice, you may want to use a library that takes care of managing the term-document matrix in an efficient way. While there are a few different Python libraries that implement this, I decided to use Scikit-learn (Sklearn) in my project. Scikit-learn is a popular, mature and well documented library that implements a number of Machine Learning algorithms. You can install it with a single command: pip install scikit-learn. In the following code example I demonstrate how to create a term-document matrix.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I would like to buy a book.",
          "I bought a book yesterday.",
          "That book was really good."]
vectorizer = TfidfVectorizer(min_df=0, stop_words="english")
matrix = vectorizer.fit_transform(corpus)
# Show the tfidf scores of the words of the first document in the corpus
print(matrix[0])

Output:

  (0, 0)	0.385371627466
  (0, 2)	0.652490884513
  (0, 4)	0.652490884513

You may notice a few things in the output. First, we have more than three words in our first document but got only three word weights. This is because widely common words such as “I”, “a” and “to” do not have much value for representing documents, since they appear in the majority of documents. Secondly, we do not see any words in the row of our term-document matrix, only numbers. This is because each word is assigned a number in a vocabulary. We can see the vocabulary of our corpus in vectorizer.vocabulary_, as shown in the code below.

print (vectorizer.vocabulary_)
>{u'book': 0, u'good': 3, u'buy': 2, u'like': 4, u'really': 5,
 u'yesterday': 6, u'bought': 1}

Now if we look at the output for our document (the first row of the term-document matrix), we will see that the words with indices 0, 2 and 4 in the vocabulary are book, buy and like respectively.
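If you prefer to see the words directly, you can invert the vocabulary and walk over the sparse row; a small sketch using the matrix from above:

# Invert the vocabulary: index -> word
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

# Print the term and score for every non-zero entry of the first document's row
row = matrix[0].tocoo()
for index, score in zip(row.col, row.data):
    print(index_to_word[index], score)
# book 0.385..., buy 0.652..., like 0.652...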

Make TF-IDF more production ready

Instead of handling our text files manually, we will pass the list of file names to TfidfVectorizer. The code example below also includes some basic preprocessing.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(input='filename',
                             encoding='utf-8', decode_error='replace', strip_accents='unicode',
                             lowercase=True, analyzer='word', stop_words='english',
                             token_pattern=r'(?u)\b[a-zA-Z_][a-zA-Z0-9_]+\b',
                             ngram_range=(1, 1), norm='l2', use_idf=True, smooth_idf=True,
                             sublinear_tf=False)
tfidf_matrix = vectorizer.fit_transform(filenames)
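Here filenames is simply a list of paths to the text files of the papers. For example, assuming the extracted texts live as .txt files in a papers/ directory (the directory name is only a placeholder):

import glob

# Collect the paths of all paper text files
filenames = sorted(glob.glob("papers/*.txt"))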

How can we use TF-IDF?

When we calculate TF-IDF for every word in a particular document, we get a list of word scores for this document. It may look like this: [word1: 0.11, word2: 0.35, …, word45: 0.47]. If we calculate such word scores for every document in our corpus, we can use these lists to represent our documents. From a mathematical standpoint these lists are vectors, which means we can apply Vector Algebra operations to them. One such operation is cosine similarity, which measures the similarity of two vectors (each representing a document) based on the cosine of the angle between them. For vectors A and B we calculate it according to Equation 2.

$$\mathrm{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

Equation 2: Cosine similarity
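Equation 2 translates directly into a few lines of NumPy; this sketch is only meant to illustrate the formula, below we will use the Sklearn helper instead:

import numpy as np

def cosine_sim(a, b):
    # Dot product of the two vectors divided by the product of their norms (Equation 2)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))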

Here is how we will calculate cosine similarity with Sklearn, using our term-document matrix:

from sklearn.metrics.pairwise import cosine_similarity

# Vectors of the first and second documents of the corpus
vectorA = matrix[0]
vectorB = matrix[1]
result = cosine_similarity(vectorA, vectorB)
print(result)
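The same helper also accepts whole matrices, so we can score one paper against the entire corpus in a single call and pick its nearest neighbours. A sketch, assuming the tfidf_matrix built from the file names earlier:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Similarities between document 0 and every document in the corpus, shape (1, m)
scores = cosine_similarity(tfidf_matrix[0], tfidf_matrix)[0]

# Indices of the 5 most similar documents, skipping the top hit (the document itself)
most_similar = np.argsort(scores)[::-1][1:6]
print(most_similar)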

In the same way we can compute the similarity for every document pair in our corpus and use it to find the most similar papers. However, instead of comparing long sparse vectors, we can reduce their dimensions by using Latent Semantic Analysis (LSA).

LSA

Latent Semantic Analysis can find associations between terms and documents by uncovering hidden (latent) semantics [T. Landauer et al., 1998]. LSA transforms the original term-document matrix by mapping terms and documents to concepts. We will take the TF-IDF matrix that we computed earlier and apply the Singular Value Decomposition (SVD) method to it: we decompose our term-document matrix X, with m documents and n terms, into three matrices U, Σ and V^T, with their rank reduced to k concepts, as shown in Equation 3.

$$X \approx X_k = U_k \, \Sigma_k \, V_k^{T}$$

Equation 3: SVD with the rank reduced to k concepts

In this equation Σ is a diagonal matrix whose entries are the singular values of X (the square roots of the eigenvalues of XX^T), and V^T is the transpose of V.

If you are interested in a brief refresher on eigenvalues and matrix transposition, here are links to Khan Academy (1, 2). And here is code where you can try different matrices and gain additional intuition about how matrix transposition works:

import numpy as np
a = np.array([
               [1,1,1,1],
               [2,2,2,2],
               [3,3,3,3]
])

a_t = a.transpose()
print (a_t)

>[[1 2 3]
 [1 2 3]
 [1 2 3]
 [1 2 3]]
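In the same spirit, you can experiment with SVD itself on a small matrix using numpy.linalg.svd and verify that the three factors multiply back into the original matrix:

import numpy as np

x = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.],
              [0., 0., 1.]])

u, s, vt = np.linalg.svd(x, full_matrices=False)
print(u.shape, s.shape, vt.shape)   # (4, 3) (3,) (3, 3)

# Reconstruct x from the factors: U * diag(s) * V^T
x_reconstructed = u.dot(np.diag(s)).dot(vt)
print(np.allclose(x, x_reconstructed))   # True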

Sklearn implements truncated SVD in the TruncatedSVD class. It has an important parameter, n_components, which sets the number of concepts (components, topics). Finding an optimal number of components is a Topic Modeling task that is out of the scope of this article. However, to select this parameter we can look at relevant research. For example, in a related study [T. Magerman et al., 2010] the authors reported success with using 300 concepts for LSA on the task of detecting similarity between patent documents and scientific publications. Thus, we will use this number as the baseline. Now we will apply TruncatedSVD to our matrix and output the top 10 words from the first five concepts to see how it works.

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=300)
# Only fit (no transformation yet) so we can inspect the term weights of each concept
lsa = svd.fit(matrix)
feature_names = vectorizer.get_feature_names()
for i in range(5):
    component = lsa.components_[i]
    features = [(index, weight) for index, weight in enumerate(component)]
    top_features = sorted(features, key=lambda x: x[1], reverse=True)[:10]
    top_words = [feature_names[index] for index, weight in top_features]
    print("\n Concept %d: " % i)
    print("\n".join(top_words))

Output:


Concept 0:
image
recognition
cnn
object
convolutional
layer
training
segmentation
deep
features
detection

Concept 1:
network
neuron
spike
synaptic
neural
firing
brain
activity
dynamics
layer

Concept 2:
learning
training
layer
deep
convolutional
cnn
layers
neural
lstm
network
method

Concept 3:
word
image
language
words
sentence
translation
sentences
lemma
images
log
corpus
 
Concept 4:
policy
regret
action
agent
reward
user
actions
object
state
reinforcement
video

We can see that concept 4 has more highly rated words related to Reinforcement Learning, while concept 3 has more words about NLP. After examining the concepts in more detail, it appeared that the number of concepts was probably too high for my dataset: the documents in my collection may not be diverse enough for the 300 concepts we initially chose. In theory, LSA performs better if the concepts are orthogonal. Thus, it makes sense to also experiment with fewer than 300 concepts, e.g. 200 and 120.
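One rough way to guide this choice is to check how much of the variance of the TF-IDF matrix each candidate number of concepts preserves; TruncatedSVD exposes this via its explained_variance_ratio_ attribute (a sketch; the candidate values are just the ones discussed above):

from sklearn.decomposition import TruncatedSVD

# Compare how much variance different numbers of concepts retain
for k in (120, 200, 300):
    svd = TruncatedSVD(n_components=k)
    svd.fit(matrix)
    print(k, svd.explained_variance_ratio_.sum())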

Before the transformation, each component is a row of n weights, one weight per term, describing how strongly that term relates to the concept. Now, if we replace svd.fit(matrix) with svd.fit_transform(matrix), we will get an (m, k) array in which every document is represented by a k-dimensional vector:

from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=200)
lsa = svd.fit_transform(matrix)
print (lsa.shape)
print (lsa[0])

Output:

(40153, 200)
[ 0.10479444 -0.03941835  0.00583014  0.03606947 -0.04222058  0.01530614
 -0.0002207   0.01456949  0.00835069 -0.00239415  0.01303619  0.00307306
 -0.02835313 -0.0223248  -0.00326648 -0.01348519 -0.0025782  -0.01182429
  ...
  0.00650632  0.03462778  0.00064334  0.02050118 -0.0220552  -0.01314632
  0.00801342 -0.00356966  0.0183404   0.02474565 -0.02011975 -0.01169044
  0.0063045   0.00635549]

We can compare documents in the lower-dimensional space using cosine similarity:

from sklearn.metrics.pairwise import cosine_similarity

# Vectors of the first and second documents of the corpus,
# reshaped to 2D because cosine_similarity expects 2D arrays
vectorA = lsa[0].reshape(1, -1)
vectorB = lsa[1].reshape(1, -1)
similarity = cosine_similarity(vectorA, vectorB)
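Putting the pieces together, here is a minimal sketch of a recommendation helper over the LSA vectors (the function name and the top_n default are just for illustration):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recommend(doc_index, lsa_vectors, top_n=5):
    # Similarities between the chosen document and every document, shape (m,)
    scores = cosine_similarity(lsa_vectors[doc_index].reshape(1, -1), lsa_vectors)[0]
    # Rank by similarity, highest first, and drop the document itself
    ranked = np.argsort(scores)[::-1]
    return [i for i in ranked if i != doc_index][:top_n]

# Indices of the 5 papers most similar to the first paper in the collection
print(recommend(0, lsa))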

In addition to reducing the dimensionality of document vectors, LSA tends to handle polysemy and synonymy better than a simple TF-IDF model by capturing relationships between words that may not be present in the original term-document matrix [T. Landauer et al., 1998]. In part 3 we will add more information to our term-document matrix and try a new model for document representation.

References

  • Beel, Joeran, Bela Gipp, Stefan Langer, and Corinna Breitinger. “Research-paper recommender systems: a literature survey.” International Journal on Digital Libraries 17.4 (2016): 305-338.
  • Spärck Jones, Karen. “A statistical interpretation of term specificity and its application in retrieval.” Journal of Documentation 28.1 (1972): 11-21.
  • Landauer, Thomas K., Peter W. Foltz, and Darrell Laham. “An introduction to latent semantic analysis.” Discourse Processes 25.2-3 (1998): 259-284.
  • Magerman, Tom, Bart Van Looy, and Xiaoyan Song. “Exploring the feasibility and accuracy of Latent Semantic Analysis based text mining techniques to detect similarity between patent documents and scientific publications.” Scientometrics 82.2 (2010): 289-306.