Organizing scientific papers with a personal recommendation system. Part 3: n-gram & LDA

In the previous part we talked about LSA in detail. Today we will consider a few more methods that are used for finding semantic similarity. These methods still build on a term-document matrix and also create latent topics, so they have a lot in common with LSA, and there is no need to describe them at the same level of detail as we did for LSA in part 2.

Capturing more features for LSA

In the previous example each term in our TF-IDF matrix was a single word (unigram). With unigrams our model does not capture multi-word expressions, which may be an issue. For example, in our Bag-of-Words model a document that has the term machine in one paragraph and the term learning in another paragraph will be treated similarly to a document that contains the two-word expression (bigram) “machine learning”. Similarly, the trigram artificial neural network will look the same as the words artificial, neural and network occurring in different parts of a document. Thus, we may lose a lot of valuable information by not including n-grams in our TF-IDF vocabulary.
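As a quick toy illustration (the two example sentences below are made up), this is how ngram_range changes the extracted terms:

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: one document uses the bigram "machine learning",
# the other mentions the two words far apart.
docs = ["machine learning for papers",
        "the machine is learning nothing"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
ngrams = CountVectorizer(ngram_range=(1, 3)).fit(docs)

print(sorted(unigrams.vocabulary_))  # single words only
print(sorted(ngrams.vocabulary_))    # also contains 'machine learning', 'machine learning for', ...

Only the second vocabulary lets the model distinguish the document that actually contains the expression “machine learning”.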

The simplest way to include bigrams and trigrams in the TF-IDF matrix is to pass ngram_range=(1, 3) to TfidfVectorizer. However, this dramatically increases the number of terms in our vocabulary and the size of the co-occurrence matrix, because the feature space grows rapidly with every additional n-gram order. This can have a negative impact on our application’s performance. If we look at TfidfVectorizer we will see that it uses CountVectorizer to get token counts and passes the result to TfidfTransformer. With CountVectorizer the dramatic vocabulary growth requires more memory for each additional term. As a result, even with our small dataset of 40k documents it may use more RAM than a typical workstation has (32 GB). Another issue with CountVectorizer is that its vocabulary cannot be accessed in parallel, which also limits the performance of our application. Unfortunately, the situation will only get worse as our document collection grows. Therefore, it is better to use HashingVectorizer.

HashingVectorizer is memory efficient because it does not keep a mapping of terms to feature indices in memory. Instead it uses a hash function to compute feature indices. As a result, growth of the corpus and vocabulary does not cause a proportional growth in memory consumption. Additionally, the application becomes more horizontally scalable because it does not need to scan the whole corpus to build a term mapping. Moreover, the stateless nature of the hashing vectorizer simplifies parallel vectorization. The hashing algorithm is the signed 32-bit MurmurHash3, which is commonly used in the industry for hash lookups and data partitioning, including in distributed computing systems such as Apache Hadoop, Apache Spark and Apache Cassandra. Apache Spark in particular uses a similar approach and the same hashing algorithm for term frequency mapping in the class pyspark.mllib.feature.HashingTF (github link) from the spark.mllib library (documentation).

A hash function may cause collisions between different features, but the probability of a collision can be reduced by increasing the number of features in the hash table. In our case 2^22 ~ 4e6 features should be enough, considering that the unigram vocabulary of our corpus is close to 216,000 terms.
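To make the hashing trick more concrete, here is a rough sketch of the idea using scikit-learn’s murmurhash3_32 helper: hash the token, take the index modulo n_features, and keep the sign of the hash to reduce the effect of collisions. This mirrors the general approach of HashingVectorizer/FeatureHasher rather than reproducing their exact internals:

from sklearn.utils import murmurhash3_32

n_features = 2 ** 22

def hashed_index(token, n_features=n_features):
    # Signed 32-bit MurmurHash3 of the token (seed 0).
    h = murmurhash3_32(token, seed=0)
    # Column index in the hashed feature space and the sign of the hash.
    return abs(h) % n_features, (1 if h >= 0 else -1)

print(hashed_index("machine learning"))
print(hashed_index("neural"))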

The following code example shows how to use HashingVectorizer for our application.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# 'filenames' is the list of paths to our documents (see part 2).
hasher = HashingVectorizer(input='filename',
                           encoding='utf-8', decode_error='replace', strip_accents='unicode',
                           lowercase=True, analyzer='word', stop_words='english',
                           token_pattern=r'(?u)\b[a-zA-Z_][a-zA-Z0-9_]+\b',
                           ngram_range=(1, 3), norm='l2', n_features=(2 ** 22),
                           non_negative=True)  # in scikit-learn >= 0.19 use alternate_sign=False instead
# Hashed term counts followed by TF-IDF weighting in a single pipeline.
vectorizer = make_pipeline(hasher, TfidfTransformer())
td_matrix = vectorizer.fit_transform(filenames)
# LSA: truncated SVD on the TF-IDF matrix, keeping 200 concepts.
lsa = TruncatedSVD(n_components=200)
lsa_result = lsa.fit_transform(td_matrix)

PLSA and LDA

In LSA topics are assumed to be orthogonal, so our LSA model may not perform optimally on our dataset even with a lower number of concepts. We can mitigate this potential issue with Probabilistic LSA (PLSA) [T. Hofmann, 1999]. In the PLSA model a document is associated with a probability distribution over a set of topics (concepts). Topics can be non-orthogonal because they are treated as distributions over words. For a word w, document index d and topic (concept) c, their co-occurrence probability is given by Equation 1.

P(w, d) = P(d) · Σ_c P(c|d) · P(w|c)

Equation 1: PLSA
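As a tiny numeric illustration of Equation 1 (two topics, with probabilities made up for the example):

# Toy illustration of Equation 1: P(w, d) = P(d) * sum_c P(c|d) * P(w|c)
p_d = 0.01                    # P(d): probability of picking document d
p_c_given_d = [0.7, 0.3]      # P(c|d): topic mixture of document d
p_w_given_c = [0.05, 0.002]   # P(w|c): probability of word w under each topic

p_wd = p_d * sum(pc * pw for pc, pw in zip(p_c_given_d, p_w_given_c))
print(p_wd)  # 0.01 * (0.7*0.05 + 0.3*0.002) = 0.000356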

There is an issue with PLSA: the per-document distributions P(c|d) are parameters of the model, so the number of parameters grows with the number of documents, potentially leading to overfitting. Also, it is unclear how to assign topic probabilities to a new document that has not been seen by the model before. These issues are addressed by Latent Dirichlet Allocation (LDA) [D. Blei et al., 2003]. LDA is quite similar to PLSA, but it places Dirichlet priors over both the topic-term and the document-topic distributions. The graphical model representation of LDA is shown in Figure 1.

Figure 1: LDA graphical model

Here α is the parameter of the Dirichlet prior on the per-document topic distributions, β is the parameter of the Dirichlet prior on the per-topic word distributions, θ is the topic distribution for document m, M is the number of documents, N is the number of words in a document, z is the topic assigned to a word, and w is the observed word.
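The plate diagram corresponds to the following generative process; here is a minimal numpy sketch of it, where the number of topics, vocabulary size, document length and the symmetric priors are arbitrary toy values chosen for illustration:

import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, doc_len = 5, 1000, 50  # toy sizes
alpha, beta = 0.1, 0.01                      # symmetric Dirichlet priors

# Per-topic word distributions: one draw from Dirichlet(beta) per topic.
phi = rng.dirichlet([beta] * vocab_size, size=n_topics)

# Generate one document m:
theta_m = rng.dirichlet([alpha] * n_topics)               # topic distribution of the document
z = rng.choice(n_topics, size=doc_len, p=theta_m)         # topic assignment for each word
words = [rng.choice(vocab_size, p=phi[zi]) for zi in z]   # observed words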

We can compute LDA with scikit-learn, using the same term-document matrix that we used for LSA (see part 2):

from sklearn.decomposition import LatentDirichletAllocation

# 'vectorizer' and 'filenames' are the same as in part 2.
td_matrix = vectorizer.fit_transform(filenames)
n_topics = 300
# Note: in scikit-learn >= 0.19 the n_topics parameter is called n_components.
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=10,
                                learning_method='batch')
lda_result = lda.fit_transform(td_matrix)
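Since our goal is a recommendation system, the rows of lda_result (and likewise lsa_result) can be compared directly, for example with cosine similarity; doc_id below is just a placeholder for the paper we want recommendations for:

from sklearn.metrics.pairwise import cosine_similarity

doc_id = 0  # index of the query paper (placeholder)
similarities = cosine_similarity(lda_result[doc_id:doc_id + 1], lda_result).ravel()
# Indices of the 10 most similar documents, excluding the document itself.
top_similar = similarities.argsort()[::-1][1:11]
print(top_similar)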

Because LDA treats topics differently (as term distributions), it may achieve higher accuracy with a relatively high number of topics (300) compared to LSA. In part 4 we will evaluate our recommendation models with different numbers of topics.

References

Hofmann, Thomas. “Probabilistic latent semantic indexing.” In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50-57. ACM, 1999.

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent dirichlet allocation.” Journal of Machine Learning Research 3, no. Jan (2003): 993-1022.

Arun, Rajkumar, Venkatasubramaniyan Suresh, C. Veni Madhavan, and M. Narasimha Murthy. “On finding the natural number of topics with latent dirichlet allocation: Some observations.” Advances in Knowledge Discovery and Data Mining (2010): 391-402.