Indexing and searching text with Apache Lucene, Part 2

In the previous part I described the main functionality of Lucene and how it can be used. Today I will create a Java application that demonstrates two basic operations in Lucene: adding documents to an index and searching for documents. The code example is intended to give an idea of how to get started with Lucene. My goal was to include only the essential code without overwhelming readers with unnecessary configuration and features. Please do not treat it as production-ready code.

To get started you will need the Java SDK (JDK, version 7 or later) and Maven (version 3 or later) installed. First, create a Maven project and add the following dependencies for Apache Lucene:

        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>4.8.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>4.8.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>4.8.1</version>
        </dependency>

Next, create a Java class LuceneExample and add the following code there:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.IOException;
import java.nio.file.Paths;

/**
 * Example of adding documents to a Lucene index and finding a document with the index.
 * @author Walter Volodenkov
 */
public class LuceneExample {

    public static void main(String[] args) throws IOException, ParseException {

        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_48);
        String indexDirectoryPath = "put a path to index directory here";
        Directory indexDirectory = FSDirectory.open(Paths.get(indexDirectoryPath).toFile());
        writeToIndex(analyzer, indexDirectory);
        String queryString = "Lucene";
        findInIndex(queryString, analyzer, indexDirectory);

    }

    private static void writeToIndex(StandardAnalyzer analyzer, Directory indexDir) throws IOException {
        IndexWriterConfig indexWriterConf = new IndexWriterConfig(Version.LUCENE_48, analyzer);
        indexWriterConf.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        IndexWriter indexWriter = new IndexWriter(indexDir, indexWriterConf);
        addExamples(indexWriter);
        indexWriter.close();
    }

    private static void findInIndex(String queryString, StandardAnalyzer analyzer, Directory indexDir) throws ParseException, IOException {
        Query query = new QueryParser(Version.LUCENE_48, "title", analyzer).parse(queryString);
        IndexReader indexReader = DirectoryReader.open(indexDir);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        int hitsPerPage = 10;
        TopDocs results = indexSearcher.search(query, hitsPerPage);
        ScoreDoc[] matchedDocs = results.scoreDocs;
        System.out.println("Documents matched: " + matchedDocs.length);
        for (int i = 0; i < matchedDocs.length; ++i) {
            int documentId = matchedDocs[i].doc;
            Document document = indexSearcher.doc(documentId);
            System.out.println(String.format("Document title: %s , id: %s",
                    document.get("title"), document.get("id")));
        }
        indexReader.close();
    }

    private static void addExamples(IndexWriter indexWriter) throws IOException {
        addDoc(indexWriter, "TPS report", "111111111111");
        addDoc(indexWriter, "Lucene manual", "222222222222");
        addDoc(indexWriter, "Contract with ACME Ltd", "333333333333");
        addDoc(indexWriter, "Some test text", "444444444444");
    }



    private static void addDoc(IndexWriter indexWriter, String title, String id) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new StoredField("id", id));
        indexWriter.addDocument(doc);
    }
}

The example class is almost ready; the only thing left is to specify the directory where the Lucene index will be created. Assign the path of the index directory to the indexDirectoryPath variable in the main method and run the class. On the first run against an empty directory, the output should look similar to the following:

Documents matched: 1
Document title: Lucene manual , id: 222222222222

This means the application has created a Lucene index, added a few documents to it, processed the search query and found a matching document. All this was done in one small class with just a few methods (a Java method is the equivalent of a function in other programming languages). Now, let’s go through each method in the example, starting with main. If you are familiar with Java programming, you know that the main method is the one called first when the example app is run. In this method we create an analyzer that uses the default implementation and works without additional configuration. We also specify an index directory where Lucene will store the index files. Both the analyzer and the index directory are necessary for operations that involve the index. In our case there are two such methods: writeToIndex, which adds documents to the index, and findInIndex, which searches for documents in the index.

    public static void main(String[] args) throws IOException, ParseException {

        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_48);
        String indexDirectoryPath = "put a path to index directory here";
        Directory indexDirectory = FSDirectory.open(Paths.get(indexDirectoryPath).toFile());
        writeToIndex(analyzer, indexDirectory);
        String queryString = "Lucene";
        findInIndex(queryString, analyzer, indexDirectory);

    }
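
The placeholder string in main has to be replaced with a real directory before the example can run. As a minimal sketch, assuming a throwaway index is acceptable, the following standalone class creates an empty temporary directory whose path could be assigned to indexDirectoryPath. The class IndexPathSketch and its method name are made up for illustration and are not part of the example. A fresh directory is also handy because the example opens the index with CREATE_OR_APPEND: running it again against the same directory adds the four documents again, so the number of matches grows with every run.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical helper, not part of LuceneExample: prepares an empty directory for the index.
class IndexPathSketch {

    // Creates a new, empty directory under the system temp location and returns it.
    static File createTempIndexDir() {
        try {
            return Files.createTempDirectory("lucene-example-index").toFile();
        } catch (IOException e) {
            // Rethrown unchecked only to keep this sketch short.
            throw new IllegalStateException("Could not create a temporary index directory", e);
        }
    }

    public static void main(String[] args) {
        File indexDir = createTempIndexDir();
        // The printed path can be pasted into indexDirectoryPath in LuceneExample.main.
        System.out.println(indexDir.getAbsolutePath());
    }
}
```

The temporary directory is not deleted automatically, so the index files stay there for inspection until you remove them.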

The writeToIndex method creates the index if it does not exist and populates it with a few documents. The documents are generated in the addExamples method. They are deliberately kept very simple to demonstrate the basic case and have only two fields: title and id. The title field is processed by the Lucene analyzer and added to the index. The id field, however, has a different type and is only stored, not indexed, because it is a metadata field that we do not intend to search on. Instead, we can use it for tasks such as fetching a record from a database. Another example of a non-indexed field could be a url field that references a text document, e.g. a PDF or MS Word file. If necessary, you can add more documents or longer text in the addExamples method, and more document fields in the addDoc method.

    private static void writeToIndex(StandardAnalyzer analyzer, Directory indexDir) throws IOException {
        IndexWriterConfig indexWriterConf = new IndexWriterConfig(Version.LUCENE_48, analyzer);
        indexWriterConf.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        IndexWriter indexWriter = new IndexWriter(indexDir, indexWriterConf);
        addExamples(indexWriter);
        indexWriter.close();
    }


    private static void addExamples(IndexWriter indexWriter) throws IOException {
        addDoc(indexWriter, "TPS report", "111111111111");
        addDoc(indexWriter, "Lucene manual", "222222222222");
        addDoc(indexWriter, "Contract with ACME Ltd", "333333333333");
        addDoc(indexWriter, "Some test text", "444444444444");
    }



    private static void addDoc(IndexWriter indexWriter, String title, String id) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new StoredField("id", id));
        indexWriter.addDocument(doc);
    }
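
To make the difference between an indexed field and a stored-only field more concrete, here is a toy model in plain Java. It is only a sketch of the idea and not how Lucene works internally: the class IndexedVsStoredSketch, its two maps and the whitespace-based splitting are all assumptions made for illustration. The title is both broken into searchable terms and kept as a stored value, while the id is kept as a stored value only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model of indexed vs. stored fields; NOT Lucene's actual implementation.
class IndexedVsStoredSketch {

    // Searchable part: term -> numbers of the documents whose title contains it.
    private final Map<String, Set<Integer>> titleTerms = new HashMap<String, Set<Integer>>();
    // Stored part: document number -> field values that can be read back from a hit.
    private final List<Map<String, String>> storedFields = new ArrayList<Map<String, String>>();

    void addDoc(String title, String id) {
        int docNumber = storedFields.size();
        Map<String, String> stored = new HashMap<String, String>();
        stored.put("title", title); // kept for display, similar to Field.Store.YES
        stored.put("id", id);       // kept for display only, similar to StoredField
        storedFields.add(stored);
        // Only the title is split into lowercase terms and made searchable.
        for (String term : title.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            Set<Integer> docs = titleTerms.get(term);
            if (docs == null) {
                docs = new LinkedHashSet<Integer>();
                titleTerms.put(term, docs);
            }
            docs.add(docNumber);
        }
    }

    // Returns the stored ids of all documents whose title contains the given word.
    List<String> findIds(String word) {
        List<String> ids = new ArrayList<String>();
        Set<Integer> docs = titleTerms.get(word.toLowerCase(Locale.ROOT));
        if (docs != null) {
            for (int docNumber : docs) {
                ids.add(storedFields.get(docNumber).get("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        IndexedVsStoredSketch index = new IndexedVsStoredSketch();
        index.addDoc("TPS report", "111111111111");
        index.addDoc("Lucene manual", "222222222222");
        System.out.println(index.findIds("Lucene"));       // prints [222222222222]
        System.out.println(index.findIds("222222222222")); // prints []: id values are not searchable
    }
}
```

The second lookup comes back empty because the id was never turned into terms. The value can still be read from a document that was found through its title, which is the same division of labor the example relies on.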

Once the index has been populated, we can search it. The simplest query is a single word that we want to find in the document fields. In our case the search query is the word Lucene, which is passed to the findInIndex method. The query is parsed by a QueryParser object and then used by an IndexSearcher object to find matching documents. The code specifies that the title field is used as the default search field. If you change the field names in the example, make sure you change the field name passed to the QueryParser as well. If you look back at the addExamples method, you can see four documents added to the index. One of them has the title Lucene manual and matches our search query Lucene.

    private static void findInIndex(String queryString, StandardAnalyzer analyzer, Directory indexDir) throws ParseException, IOException {
        Query query = new QueryParser(Version.LUCENE_48, "title", analyzer).parse(queryString);
        IndexReader indexReader = DirectoryReader.open(indexDir);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        int hitsPerPage = 10;
        TopDocs results = indexSearcher.search(query, hitsPerPage);
        ScoreDoc[] matchedDocs = results.scoreDocs;
        System.out.println("Documents matched: " + matchedDocs.length);
        for (int i = 0; i < matchedDocs.length; ++i) {
            int documentId = matchedDocs[i].doc;
            Document document = indexSearcher.doc(documentId);
            System.out.println(String.format("Document title: %s , id: %s",
                    document.get("title"), document.get("id")));
        }
        indexReader.close();
    }
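
The query Lucene matches the title Lucene manual because the same analyzer is applied both to the title at indexing time and to the query string at search time, so both sides end up as the same lowercase term. The sketch below imitates that with plain Java. It is a simplified stand-in and not the real StandardAnalyzer, which does more (for example, it also removes common English stop words). The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified stand-in for text analysis; NOT the real StandardAnalyzer.
class AnalysisSketch {

    // Splits text on anything that is not a letter or digit and lowercases each token.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // A single-word query matches when its analyzed term is among the title's terms.
    static boolean matches(String title, String queryWord) {
        List<String> queryTerms = analyze(queryWord);
        return queryTerms.size() == 1 && analyze(title).contains(queryTerms.get(0));
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene manual"));           // prints [lucene, manual]
        System.out.println(matches("Lucene manual", "LUCENE")); // prints true: case is ignored
        System.out.println(matches("Lucene manual", "Luc"));    // prints false: whole terms only
        System.out.println(matches("TPS report", "Lucene"));    // prints false
    }
}
```

This is also why a plain word query finds whole words only: a fragment such as Luc produces a different term than lucene and therefore does not match.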

Here is a small exercise you can do. Currently only one document in the index matches the search query. However, the code can display multiple documents if there is more than one match. Modify the method that generates the examples to include more documents whose title contains the word Lucene, so that the search query has multiple matches. Alternatively, you can change the search query to another word and add documents that have this word in the title.

To summarize, in this example we have created a Lucene index, populated it with a few documents, searched for a document whose title matches our query and displayed the results.