Apache Lucene: a brief developer's guide

There is a lot of material on Lucene freely available on the internet already, though most of it is either formal (lengthy, too detailed, and dull) or informal (mostly incomplete and often scattered). Developers want documentation they can trust.

Apache Lucene is a free and open-source information retrieval (IR) software library, originally written entirely in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache License.

This VOX DC post is based on my years of experience using the Lucene library, and should provide a quick and pointed guide to using Lucene.

Why Lucene: understanding the need for Lucene

Today, we’re surrounded by data. We use computers, the internet, intranets, and mobile phones extensively. We upload documents, send text messages, update social channels, send emails, publish blogs, and so forth. Each of these operations produces a large volume of data. Machines, too, are generating and keeping more and more data.

All of these things lead to the exponential growth of data. Some 10 years back, this growing data presented challenges to cutting-edge businesses such as Google, Yahoo, Amazon, and Microsoft. At that time, they had to go through terabytes and petabytes of data to identify which websites were popular, what books were in demand, and what kinds of ads appealed to people. These companies felt the existing tools were becoming inadequate for processing large data sets, so most of them created proprietary products.

Google was the first to publicize MapReduce, a system used to scale its data processing needs. As time passed, the explosion of data was no longer limited to cutting-edge technology companies. These challenges are now faced by almost every organization, each of which deals with huge amounts of data every day.

With time, the amount of data available has become so vast that we need more dynamic ways of finding information. There were two classical ways of storing and retrieving data. The first is to classify the data into categories and subcategories, and then search through hundreds of these categories and subcategories, but this is not an efficient way to find information. The second is to use a structured database (RDBMS), but an RDBMS is not an efficient way to manage unstructured, messy, and unpredictable data that grows exponentially.

Another important modern requirement is the ability to make flexible, free-form, ad hoc queries. These queries should run beyond category boundaries and find exactly what we’re looking for while requiring the least effort possible.

To address this need for efficient information retrieval in a sea of data, information retrieval software came into existence. Lucene is one of these offerings.

What is Lucene?

Lucene is a mature, open source project implemented in Java. You can obtain the complete source and distribute it with your application. There are no dependencies. Currently, 7.2 is the latest version available, with many searchable libraries available now. Lucene is based on the concept of indexing. To understand the fundamentals of Lucene, we need to understand indexing first.

[Image 1: Lucene.png]

Details on indexing

Suppose you want to search for certain words in a large number of files. One approach would be to go through each file sequentially and look for the text. This will work, but it is inefficient, especially if you must search numerous documents. Here, indexing can help.

An efficient solution to this problem is to create an index and search inside that index. Indexing can be compared to the “index” at the end of a book, from which a reader can quickly look up a topic of interest. The concept of indexing is at the heart of every search engine: indexing means processing the original data into a highly efficient cross-reference that facilitates searching.
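The book-index analogy can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Lucene's implementation or API: the class `BookIndexSketch` and its methods are hypothetical names chosen for this example, and the word splitting is deliberately naive.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene): maps each word to the ids of the files containing it. */
final class BookIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private final List<String> files = new ArrayList<>();

    /** Stores the text, records every word it contains, and returns the new file id. */
    int add(String text) {
        int id = files.size();
        files.add(text);
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    /** Indexed lookup: a single map access, however many files were added. */
    SortedSet<Integer> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    /** Sequential scan, for comparison: reads every file on every query. */
    SortedSet<Integer> scan(String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        SortedSet<Integer> hits = new TreeSet<>();
        for (int id = 0; id < files.size(); id++) {
            for (String w : files.get(id).toLowerCase(Locale.ROOT).split("\\W+")) {
                if (w.equals(needle)) {
                    hits.add(id);
                    break;
                }
            }
        }
        return hits;
    }
}
```

Both `lookup` and `scan` return the same file ids. The difference is that `scan` does work proportional to the total amount of text on every query, while `lookup` pays that cost once, at indexing time.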

How Lucene employs indexing

This picture captures a high-level view of how Lucene works. Lucene allows you to add indexing and searching capabilities to your application. Lucene can index and make searchable any data that can be converted to text format.

[Image 2: Lucene.png]

This figure demonstrates a typical application integrated with the Lucene library: an application that gets data from various sources (such as a file system, a database, or the web). It collects the data, indexes it using Lucene, and stores the indexes. Users can then search the collected data, and the application presents a report to the user.

Lucene does not care about the source of the data, its format, or its language, as long as you can convert it to text. With Lucene you can index and search any type of document, for example, web pages on a remote web server or a document stored in a local file system, among many others.
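To make the "anything that converts to text" point concrete, here is a minimal sketch of how an application might normalize different sources before handing them to an indexer. Everything in it is an assumption for illustration: `TextSource` and `TextSources` are hypothetical names, not part of Lucene, and the HTML handling is a crude regular expression where a real application would use a proper parser.

```java
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical adapter: anything that can render itself as plain text can be indexed. */
interface TextSource {
    String toText();
}

/** Illustrative converters for a few kinds of input (not a Lucene API). */
final class TextSources {
    private TextSources() {}

    /** Crude HTML-to-text conversion: replace tags with spaces, then collapse whitespace. */
    static TextSource fromHtml(String html) {
        return () -> html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Renders a database row as "column: value" pairs, in the map's iteration order. */
    static TextSource fromRow(Map<String, Object> row) {
        return () -> row.entrySet().stream()
                .map(e -> e.getKey() + ": " + e.getValue())
                .collect(Collectors.joining("; "));
    }

    /** Plain text needs no conversion. */
    static TextSource fromPlainText(String text) {
        return () -> text;
    }
}
```

With this shape, the indexing code only ever sees a `String`, so supporting a new source means writing one more converter and nothing else.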

Lucene information retrieval

The picture shows two main processes: indexing and searching. The left-hand side shows indexing, and the right-hand side shows searching. Let’s assume we have numerous documents and our application wants to search them. The first step in doing so is indexing.

[Image 3: Lucene.png]

Lucene can take a huge number of documents as input. It builds “document” objects for its internal use and then analyzes those documents. Lucene then creates tokens from the documents and ultimately creates an “index” table of those documents.
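The analyze, tokenize, and index steps described above can be sketched roughly as follows. This is a simplified stand-in in plain Java, not Lucene's analyzer or index format: the class `AnalysisSketch`, its stop-word list, and the shape of the index table (term to document id to occurrence count) are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical analysis and indexing pipeline (not Lucene's implementation). */
final class AnalysisSketch {
    /** Illustrative stop-word list, chosen for this example only. */
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    private AnalysisSketch() {}

    /** Analysis step: lowercase, split on anything that is not a letter or digit, drop stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    /** Index step: builds a table of term -> (document id -> number of occurrences). */
    static Map<String, Map<Integer, Integer>> buildIndex(List<String> documents) {
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            for (String token : analyze(documents.get(docId))) {
                index.computeIfAbsent(token, k -> new TreeMap<>()).merge(docId, 1, Integer::sum);
            }
        }
        return index;
    }
}
```

Here the document id is simply the position of the text in the input list. Keeping the occurrence count per document is one common design choice, because it gives the search side something to rank results by.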

Once indexing is done, we are ready to search those documents. On the right-hand side of the picture, we see that users can search for a specific term among millions of documents with relative ease.
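As a rough sketch of the search side, the example below answers a multi-word query by intersecting the entries for each term and ordering the matches by how often the terms occur. It is a toy under stated assumptions: `SearchSketch` is a hypothetical class, every query term is required to match, and the count-based ordering is far simpler than Lucene's actual scoring.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical term-lookup search over a toy index (not Lucene's API or scoring). */
final class SearchSketch {
    /** term -> (document id -> occurrence count) */
    private final Map<String, Map<Integer, Integer>> index = new HashMap<>();

    /** Indexes one document under the caller-supplied id. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new HashMap<>()).merge(docId, 1, Integer::sum);
            }
        }
    }

    /** Returns ids of documents containing every query term, highest summed count first, ties by id. */
    List<Integer> search(String query) {
        Map<Integer, Integer> scores = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            Map<Integer, Integer> postings = index.getOrDefault(term, new HashMap<>());
            if (scores == null) {
                scores = new HashMap<>(postings);
            } else {
                // Keep only documents that also contain this term, then add its count.
                scores.keySet().retainAll(postings.keySet());
                for (Map.Entry<Integer, Integer> e : scores.entrySet()) {
                    e.setValue(e.getValue() + postings.get(e.getKey()));
                }
            }
        }
        if (scores == null) {
            return new ArrayList<>();
        }
        final Map<Integer, Integer> ranked = scores;
        List<Integer> hits = new ArrayList<>(ranked.keySet());
        hits.sort((a, b) -> {
            int byScore = Integer.compare(ranked.get(b), ranked.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return hits;
    }
}
```

The query never touches the document text itself. It only reads the per-term entries built at indexing time, which is why a search over a very large collection can stay fast.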

I hope this introductory information on Lucene proves to be helpful. I recommend accessing the complete documentation at the link below, and please reach out to me in the VOX DC @AnandKayande; I'd welcome the chance to discuss this post, or development at Veritas, generally!

For complete documentation and to download Lucene, you can visit https://lucene.apache.org/core/
