Organizing scientific papers with personal recommendation system. Part 1: Elasticsearch

My daily job is developing machine learning (ML) systems for NLP and related fields. Due to the nature of my job I read scientific papers on a regular basis and share some of them on my blog (not so much lately). There are multiple sources that I use for finding new papers: arXiv, Twitter posts, personal blogs and university labs’ websites. Over time I realized the need for a system that could keep it all together and make the papers accessible to me. One of the most important requirements for the system was a flexible full-text search engine that indexes the documents’ content. Another important feature that I needed was a recommendation system for papers relevant to my interests. After examining existing solutions I decided to build my own system.

First of all, I needed a search engine that could handle my growing set of documents. The simplest solution was to use Apache Lucene for indexing. I was familiar with the library and even had a couple of posts about it here and here. Performance-wise it would also work, even with an index of 40k papers a few GB in size, given decent hardware and a proper Lucene configuration. However, there was another solution that offered more functionality and was also familiar to me: Elasticsearch.

Elasticsearch (ES) is a search and analytics engine based on Apache Lucene. For my project, its main advantages over plain Lucene were the JSON API and the analytics capabilities. First of all, the JSON API was important for simple integration into a web application, since it gave me freedom to choose server technologies beyond Java. Even though I have been using Java on the server side for years, I find Python more suitable for small web applications, especially for a pet project. Thus, the JSON API was a big plus for ES.

Another advantage of ES is its ability to build analytical reports with near-real-time performance. Some of the analytics capabilities of ES will be demonstrated later in this post.

Building a search application with Python

For this example you will need ES 5.x. It is available for download on the official website. Alternatively, you can use the official Docker image.
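For example, a 5.x container can be started with a command along these lines (the exact image tag is your choice; note that the official 5.x images ship with X-Pack security enabled by default):

docker run -p 9200:9200 docker.elastic.co/elasticsearch/elasticsearch:5.6.16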

ES has two Python client libraries: elasticsearch-py and elasticsearch-dsl. Elasticsearch-dsl is built on top of elasticsearch-py and provides higher-level abstractions, while elasticsearch-py itself is a thin, flexible wrapper around the REST API. Even so, it offers enough convenience over manually handling plain HTTP requests in Python. In this example we will use elasticsearch-py.

First we install it with pip. Note that the major version of the client should match the major version of your ES server, so for ES 5.x:

pip install "elasticsearch>=5.0.0,<6.0.0"

Next we need to fill in the information about our ES server and index. If you don’t have authentication on your server, you should delete the http_auth parameter.

from elasticsearch import Elasticsearch
es = Elasticsearch(hosts=["ip_address:9200"],
                   http_auth=("username", "password"))
index_name = "test-paper-index"
type_name = "paper"

Here we defined the index name test-paper-index and the type name paper. The next step would typically be creating a mapping for our index. However, this step is optional, because a new index will be created automatically by ES when you add the first document, and the mappings generated by ES 5 will suffice for our simple case. For example, its default mapping for text fields allows both full-text search and aggregation operations thanks to the added keyword subfield. Here is an example of the title field mapping generated by ES:

 "title" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
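Alternatively, if you want control over the mapping, here is a minimal sketch of creating the index explicitly (create_index is my own helper name; the field names match the documents we index below, and I map the author fields as keyword since we will only filter and aggregate on them):

def create_index():
    mapping = {
        "mappings": {
            type_name: {
                "properties": {
                    "title": {"type": "text"},
                    "author": {"type": "keyword"},
                    "authors": {"type": "keyword"},
                    "abstract": {"type": "text"},
                    "published_date": {"type": "date"},
                    "text": {"type": "text"}
                }
            }
        }
    }
    es.indices.create(index=index_name, body=mapping)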

You can examine the resulting mapping by simply viewing it in a browser at http://ip_address:9200/test-paper-index?pretty=true

Next we will create a function for adding papers to the index:

def add_paper_to_index(paper):
    doc = {
        "title": paper.title,
        "author": paper.author,
        "authors": paper.authors,
        "abstract": paper.abstract,
        # named published_date to match the queries and aggregations below
        "published_date": paper.published_date,
        "text": paper.text
    }
    res = es.index(index=index_name, doc_type=type_name, id=paper.id, body=doc)
    print(res)
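The paper argument can be any object with these attributes. For illustration, here is a hypothetical container (the field values are made up):

from collections import namedtuple

# hypothetical container for illustration; any object with these attributes works
Paper = namedtuple("Paper", ["id", "title", "author", "authors",
                             "abstract", "published_date", "text"])

paper = Paper(id="0000.00000", title="A Sample Title", author="J. Doe",
              authors=["J. Doe", "R. Roe"], abstract="A sample abstract.",
              published_date="2016-12-01", text="Full text of the paper.")
add_paper_to_index(paper)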

If you are going to add tens of thousands of documents to the index, there is a simple way to speed this up: disabling the automatic index refresh. By default ES refreshes the index automatically at the interval defined by the refresh_interval setting, which in ES 5.x typically defaults to 1s. We can disable it before a bulk update of the index and re-enable it afterwards. Of course, this optimization is completely optional and you can skip it.

def modify_refresh_interval(val):
    settings = {
        "index": {
            "refresh_interval": val
        }
    }
    es.indices.put_settings(index=index_name, body=settings)

# disable automatic refresh
modify_refresh_interval("-1")
# bulk insert
# enable automatic refresh
modify_refresh_interval("1s")

After adding a few papers we will need a search function. But instead of creating multiple functions for different types of search, we can create one general function like this:

def search_paper(query, source=["title"], query_type="match_phrase"):
    request = {
        "_source": source,
        "query": {
            query_type: query
        }
    }
    res = es.search(index=index_name, doc_type=type_name, body=request)
    print("Got %d Hits:" % res['hits']['total'])
    for hit in res["hits"]["hits"]:
        print(hit)
    return res

By default this function looks for a matching phrase and returns only the title of the paper. This is for development purposes, so it does not overwhelm you with text from the abstract or paper content fields. You can easily add more fields to the result or just omit this filtering.

Here is an example of using this function for searching documents by title:

def search_by_title(title):
    search_paper(query={"title":title})

More advanced queries

Now that we have defined functions for adding and querying documents, we can look at more advanced features and analytics capabilities of ES. For this and the other examples I used a dataset with all arXiv papers from a few categories that interest me. Here is the list:

  • cs.IR: Computer Science - Information Retrieval
  • cs.CL: Computer Science - Computation and Language
  • cs.CV: Computer Science - Computer Vision and Pattern Recognition
  • cs.AI: Computer Science - Artificial Intelligence
  • cs.LG: Computer Science - Learning
  • cs.NE: Computer Science - Neural and Evolutionary Computing
  • stat.ML: Statistics - Machine Learning
  • q-bio.NC: Quantitative Biology - Neurons and Cognition

First, let’s find the number of papers published every month. Instead of iterating through the whole index, we can ask ES to compute it on the server using the aggregations framework. With this framework you can quickly compute basic statistics over the documents in your index or build complex analytical reports with nested aggregations and multiple filters. For this particular task we will use a bucket aggregation, which groups documents based on their properties or certain conditions. To be more specific, it will be a date histogram aggregation that groups papers into buckets based on their published date.

def get_agg_buckets_by_month():
    request = {
        "aggs": {
            "papers_count": {
                "date_histogram": {
                    "field": "published_date",
                    "interval": "month"
                }
            }
        }
    }
    res = es.search(index=index_name, doc_type=type_name, body=request)
    print(res["aggregations"])
    return res

Here is a fragment of the output of this query, showing how many papers were added per month in the categories I follow:

      {
        "doc_count": 1011,
        "key": 1475280000000,
        "key_as_string": "2016-10-01T00:00:00.000Z"
      },
      {
        "doc_count": 1235,
        "key": 1477958400000,
        "key_as_string": "2016-11-01T00:00:00.000Z"
      },
      {
        "doc_count": 1028,
        "key": 1480550400000,
        "key_as_string": "2016-12-01T00:00:00.000Z"
      },
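As another quick illustration of the aggregations framework, here is a sketch of a terms aggregation that counts papers per author (get_top_authors is my own helper name; it relies on the keyword subfield that the default dynamic mapping adds to text fields):

def get_top_authors(size=10):
    request = {
        "size": 0,  # we only need the aggregation, not the hits
        "aggs": {
            "top_authors": {
                "terms": {
                    "field": "author.keyword",
                    "size": size
                }
            }
        }
    }
    res = es.search(index=index_name, doc_type=type_name, body=request)
    print(res["aggregations"]["top_authors"]["buckets"])
    return res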

These monthly counts are definitely higher than I expected, and it’s clear that even skimming through all those papers would take a lot of time. We will need some sort of recommendation system. But before implementing the recommendation logic in code, we can try the ES More Like This query for finding similar documents. It takes the terms from specific fields of the input document(s) and looks for documents with similar term frequencies. This is a useful feature that will help us discover other interesting papers with only a few lines of code.

def search_similar_paper(paper_id):
    query_type = "more_like_this"
    q = {
        "fields": ["text"],
        "like": [{
            "_index": index_name,
            "_type": type_name,
            "_id": paper_id
        }]
    }
    # pass query_type as a keyword argument, otherwise it would be
    # interpreted as the source parameter of search_paper
    res = search_paper(q, query_type=query_type)
    return res

Before we test it, let’s find a fresh ML paper published this month. For this we need a slightly more complex query that also filters the documents by a date range.

def search_with_filter():
    request = {
        "query": {
            "bool": {
                "must": {
                    "match_phrase": {
                        "title": "Machine Learning"
                    }
                },
                "filter": {
                    "range": {
                        "published_date": {
                            "from": "now-1M/M"
                        }
                    }
                }
            }
        }
    }

    res = es.search(index=index_name, doc_type=type_name, body=request)
    print("Got %d Hits:" % res['hits']['total'])
    for hit in res["hits"]["hits"]:
        print(hit)
    return res
Output:
...
u'1612.04858', u'_source': {u'author': u'Scott Clark', u'title': u'Bayesian Optimization for Machine Learning : A Practical Guidebook', 
...

Let’s pick the paper ‘Bayesian Optimization for Machine Learning: A Practical Guidebook’ by I. Dewancker et al., 2016, with id 1612.04858, and pass that id to the search_similar_paper function:
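search_similar_paper("1612.04858")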

#Top 5 results using text field:
{"_score": 42.259304, "_type": "paper", "_id": "1609.01088", "_source": {"title": "GTApprox: surrogate modeling for industrial design"}, "_index": "test-paper-index"}
{"_score": 39.35621, "_type": "paper", "_id": "1412.1114", "_source": {"title": "Easy Hyperparameter Search Using Optunity"}, "_index": "test-paper-index"}
{"_score": 38.96662, "_type": "paper", "_id": "1608.00704", "_source": {"title": "Identifiable Phenotyping using Constrained Non-Negative Matrix\n  Factorization"}, "_index": "test-paper-index"}
{"_score": 38.704517, "_type": "paper", "_id": "1611.06213", "_source": {"title": "GaDei: On Scale-up Training As A Service For Deep Learning"}, "_index": "test-paper-index"}
{"_score": 37.322968, "_type": "paper", "_id": "1610.01874", "_source": {"title": "Neural-based Noise Filtering from Word Embeddings"}, "_index": "test-paper-index"}

The results make some sense, but they are all over the place. However, if we use only the papers’ abstracts instead of the full texts (by switching the fields parameter of the More Like This query from text to abstract), ES gives much more reasonable results.

#Top 5 results using abstract field:
{"_score": 15.3194475, "_type": "paper", "_id": "1506.02080", "_source": {"title": "Local Nonstationarity for Efficient Bayesian Optimization"}, "_index": "test-paper-index"}
{"_score": 14.616457, "_type": "paper", "_id": "1612.08915", "_source": {"title": "Bayesian Optimization with Shape Constraints"}, "_index": "test-paper-index"}
{"_score": 14.238169, "_type": "paper", "_id": "1501.04080", "_source": {"title": "Differentially Private Bayesian Optimization"}, "_index": "test-paper-index"}
{"_score": 14.169479, "_type": "paper", "_id": "1211.4888", "_source": {"title": "A Traveling Salesman Learns Bayesian Networks"}, "_index": "test-paper-index"}
{"_score": 13.894984, "_type": "paper", "_id": "1506.01349", "_source": {"title": "Bayesian optimization for materials design"}, "_index": "test-paper-index"}

Let’s try another example with the paper ‘Teaching Machines to Read and Comprehend’ by K. Hermann et al., 2015, with id 1506.03340.

#Top 5 results using text field:
{"_score": 62.88277, "_type": "paper", "_id": "1603.01547", "_source": {"title": "Text Understanding with the Attention Sum Reader Network"}, "_index": "test-paper-index"}
{"_score": 59.178444, "_type": "paper", "_id": "1511.02301", "_source": {"title": "The Goldilocks Principle: Reading Children's Books with Explicit Memory\n  Representations"}, "_index": "test-paper-index"}
{"_score": 46.119095, "_type": "paper", "_id": "1602.04341", "_source": {"title": "Attention-Based Convolutional Neural Network for Machine Comprehension"}, "_index": "test-paper-index"}
{"_score": 45.457447, "_type": "paper", "_id": "1512.00965", "_source": {"title": "Neural Enquirer: Learning to Query Tables with Natural Language"}, "_index": "test-paper-index"}
{"_score": 39.0097, "_type": "paper", "_id": "1512.01409", "_source": {"title": "What Makes it Difficult to Understand a Scientific Literature?"}, "_index": "test-paper-index"}

#Top 5 results using abstract field:
{"_score": 22.579983, "_type": "paper", "_id": "1107.1322", "_source": {"title": "Text Classification: A Sequential Reading Approach"}, "_index": "test-paper-index"}
{"_score": 21.763605, "_type": "paper", "_id": "1311.3175", "_source": {"title": "Architecture of an Ontology-Based Domain-Specific Natural Language\n  Question Answering System"}, "_index": "test-paper-index"}
{"_score": 19.692902, "_type": "paper", "_id": "1410.3916", "_source": {"title": "Memory Networks"}, "_index": "test-paper-index"}
{"_score": 19.400835, "_type": "paper", "_id": "1203.5084", "_source": {"title": "A Data Driven Approach to Query Expansion in Question Answering"}, "_index": "test-paper-index"}
{"_score": 19.173578, "_type": "paper", "_id": "1511.02570", "_source": {"title": "Explicit Knowledge-based Reasoning for Visual Question Answering"}, "_index": "test-paper-index"} 

In this example, both recommendation variants provided reasonable results. After experimenting with different papers, I got the impression that a larger number of terms in the source field (full text vs. abstract) may decrease the quality of recommendations in ES. However, we will need a more objective evaluation before drawing conclusions. Regardless, we can use this as a baseline for comparison with more advanced models for computing semantic similarity in part 4.

Conclusion

In this post we created functions for populating an index and querying documents. We also used ES to find similar papers, filter papers by date, and get statistics on published papers.

In part 2 I will describe how to build personalized paper recommendations using semantic similarity models.