Indexing and searching text with Apache Lucene, Part 1

Apache Lucene is a popular open-source library for Information Retrieval tasks. The library is implemented in Java and has ports or API bindings in multiple languages such as Python, Ruby, and C#. If you need full-text search functionality in your application, you can add it with Lucene. Text search is the most common application of the library, but not the only one. For example, you can compute text similarity with Lucene using the different measures provided with the library, or add custom similarity algorithms. This blog post focuses on the text search scenario and the typical indexing and searching operations; a complete code example follows in Part 2.

First of all, a short digression into the concept of indexing. Let's say we have a large set of text documents and need to search for words or phrases inside them. The simplest approach would be to scan each document for the word or phrase in question. Unfortunately, the performance of this solution is poor and does not scale, especially with HDD storage, where reading a large number of small files carries high overhead. As a real-world analogy, imagine if a dictionary didn't have its words ordered alphabetically and you had to read through the whole dictionary every time you wanted to find a definition. Instead of this inefficient approach, we can analyze the documents once and create a data structure that stores compact, ordered information about their content. This condensed data increases search speed by dramatically reducing the amount of data you need to scan on each query. This is the basic idea behind indexing.
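To make the idea concrete, here is a toy sketch in plain Java (not Lucene) of an inverted index that maps each token to the IDs of the documents containing it; the class and variable names are purely illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyInvertedIndex {
    public static void main(String[] args) {
        String[] docs = {"lucene indexes text", "search text fast"};

        // token -> IDs of the documents that contain it
        Map<String, List<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String token : docs[docId].toLowerCase().split("\\s+")) {
                index.computeIfAbsent(token, t -> new ArrayList<>()).add(docId);
            }
        }

        // A lookup is now a single map access instead of a scan over every document.
        System.out.println(index.get("text")); // prints [0, 1]
    }
}
```

Real Lucene indexes are far more sophisticated (they store positions, frequencies, norms, and live on disk in segments), but the core lookup idea is the same.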

Indexing

The Lucene API is well designed and has a low learning curve for its core functionality. The central concept in Lucene is a document object that contains fields. The fields can be indexed or used for other purposes, e.g. storing metadata. The document objects are populated with text provided by the client application. Extracting text from various document formats (e.g. PDF, HTML, DOC) is not provided by Lucene but left to the client. This approach makes Lucene independent of the data source and offers flexibility in processing various text formats without limiting developers to a fixed set of formats. For example, you may have a client application in Python that uses a wide range of third-party parsing libraries and sends the extracted text to Lucene for indexing. The text is then used to populate the fields of a Lucene Document object, which can later be processed by a Lucene Analyzer. The Analyzer splits the text into small elements called tokens, a process known as tokenization. Lucene offers multiple Analyzers that take different approaches to creating tokens, e.g. using different stemmers for extracting word roots. If the built-in Analyzers do not solve your particular problem, you can combine built-in tokenizers and token filters or add your own Analyzer implementation.
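As a rough sketch of how this looks in code (assuming a recent Lucene version and the StandardAnalyzer; the field names, sample text and class name are illustrative), a Document is filled with fields and an Analyzer can be used to inspect the tokens it would produce:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class DocumentSketch {
    public static void main(String[] args) throws Exception {
        // The client application supplies already-extracted plain text.
        String extractedText = "Lucene is a search library written in Java.";

        // A Document is a collection of fields; TextField is analyzed, StringField is not.
        Document doc = new Document();
        doc.add(new StringField("id", "doc-1", Field.Store.YES));
        doc.add(new TextField("title", "Intro to Lucene", Field.Store.YES));
        doc.add(new TextField("content", extractedText, Field.Store.NO));

        // Peek at the tokens a StandardAnalyzer would produce for the content field.
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream stream = analyzer.tokenStream("content", extractedText)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());
            }
            stream.end();
        }
    }
}
```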

After the documents have been analyzed, they can be stored in the index structure. This is performed by the IndexWriter object, which is responsible for creating and modifying the index. The particular index data structure used in Lucene is an inverted index. The index is called inverted because its keys are tokens that are mapped to the documents in which they occur. The reason for using an inverted index is that finding the documents containing a given token is much faster than with a forward index. The index is stored in multiple segments, each segment being an index in itself. Segments are created when the index writer flushes its buffer and contain the changed documents. Each segment is stored in several files responsible for different parts of the index. Segments are later merged to improve index performance. Speaking of performance, Lucene has a number of tuning options, such as IndexWriter buffer-related policies, index deletion and merging policies, and others.
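A minimal indexing sketch might look like the following (again assuming a recent Lucene version; the index path, sample document and buffer size are arbitrary illustrative values):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexingSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("lucene-index"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setRAMBufferSizeMB(64.0); // one of the buffer-related tuning knobs

        try (IndexWriter writer = new IndexWriter(dir, config)) {
            Document doc = new Document();
            doc.add(new TextField("content",
                    "Lucene stores tokens in an inverted index.", Field.Store.YES));
            writer.addDocument(doc); // buffered in memory, flushed to a segment later
            writer.commit();         // makes the new segment visible to readers
        }
    }
}
```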

Searching

After we have populated the index with documents, we can run search queries against it. A query is parsed by a QueryParser that allows constructing fairly complex search logic: using wildcards, increasing the importance of particular search terms (a.k.a. boosting), fuzzy and proximity searches based on word distance algorithms (e.g. Levenshtein distance), specifying search ranges, etc. Moreover, you can combine search conditions for different fields in one query, effectively using multiple queries. For example, you can use one query for the document title and another for the document content, joined together with a logical AND operator. If the default QueryParser does not offer the functionality you need, it is possible to customize it. On the other hand, in the most basic case you can simply pass a word or a phrase to the query and it will look for matching documents. Similarly to indexing, Lucene gives you great flexibility in query construction and can cater to simple use cases as well as more advanced needs. The main steps of indexing and searching are shown in Figure 1.

Figure 1: Interaction with Lucene
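A minimal search sketch, assuming the index created in the indexing example above and a recent Lucene version (the query string and field names are illustrative), could look like this:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearchSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("lucene-index"));
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // "content" is the default field; the query syntax also supports
            // other field prefixes, boosting with ^, wildcards and fuzzy (~) matching.
            QueryParser parser = new QueryParser("content", new StandardAnalyzer());
            Query query = parser.parse("title:lucene^2 OR invert*");

            TopDocs hits = searcher.search(query, 10);
            for (ScoreDoc sd : hits.scoreDocs) {
                Document hit = searcher.doc(sd.doc);
                System.out.println(sd.score + "  " + hit.get("content"));
            }
        }
    }
}
```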

The search algorithm used in Lucene is a topic for at least a separate article, or rather a series of articles. In short, it combines the vector space model and the Boolean model of search. The idea of the vector space model is to represent documents and queries as high-dimensional vectors. The vectors are used for finding relevant documents by computing the distance between them. In the case of Lucene, the distance measure is cosine similarity. The Boolean model works with the Boolean logic in the query and narrows the set of documents to be scored by the vector space model. If you are interested in more details, the Lucene documentation website and its wiki have more information on the topic.
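For reference, the standard cosine similarity between a document vector d and a query vector q is

$$\mathrm{cosine\_similarity}(\vec{d}, \vec{q}) = \frac{\vec{d} \cdot \vec{q}}{\lVert \vec{d} \rVert \, \lVert \vec{q} \rVert}$$

Lucene's practical scoring function builds on this conceptual core and refines it with term frequency, inverse document frequency, field norms and boosts, so treat the formula above only as the basic idea rather than Lucene's exact scoring formula.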

Products based on Lucene

There are a number of projects using the Lucene search engine. Two popular search servers that extend Lucene's functionality and add a number of enterprise features are Apache Solr and Elasticsearch. Both of them started in 2004 and aim to solve similar problems. At the moment, they are the two most popular enterprise search engines according to the DB-Engines ranking.

Apache Solr is a server solution built around Lucene that offers rich functionality. Features such as dynamic clustering, distributed index replication, a plugin system, a web UI, and database integration make it attractive for enterprise use. Additionally, it can extract text from various document formats (e.g. PDF, Microsoft Word) and has a REST API out of the box. The server is configured with text files, and you can get a working search server for your company without writing a line of Java code.

Elasticsearch (ES) is another enterprise server solution with functionality similar to Apache Solr. Compared to Solr, ES seems to have more options for data import and snapshot management, which can be important for some enterprise users. From personal experience, the ES Java API gives the impression that you are supposed to use the REST API instead, and the ES REST API receives more attention from the company. In contrast, the Solr Java API is more powerful than the Solr REST API. Another difference is the lower degree of search customization in ES, which currently does not have an alternative to Solr's SearchComponents. Finally, I experienced some issues with the ES documentation, which could be due to the higher rate of change of the product. This may not be a problem in a simple out-of-the-box use case, but it can hinder customization beyond that.

Why use Lucene when Lucene-based servers are available?

If your application needs deep customization of Lucene and you want total control over indexing and search, you might want to go with pure Lucene. In that case you will need Java experience and a better understanding of how the search works. Additionally, you will be responsible for managing infrastructure tasks such as scaling and replication. In this case, the requirements for search customization outweigh the benefits of the more manageable enterprise solutions.

Maybe exactly the opposite is true, and even the most basic server is overkill for your scenario. Maybe you simply want to add a small, lightweight library and forget about it. This is also possible, because the Lucene library is only a few megabytes and uses resources efficiently in terms of CPU, memory, and storage space. This simplifies integration with existing systems because there is no need for a separate server that has to be configured and maintained. The integration can be seamless: you just add it like any other library in your project. For example, you can embed it into a desktop or web application.

Regardless of your reasons, you will get a library with a straightforward and flexible API: a library that is simple on the surface and highly customizable for advanced tasks if you venture into its depths. You can use the default components that provide search functionality out of the box and never look under the hood. Even without knowledge of NLP tokenization algorithms or Information Retrieval, you get a working solution with decent performance.

In Part 2, I will create a sample Java application that demonstrates indexing documents and searching them with Apache Lucene.