The Case for Lucene
Last Updated Mar 2009
By: Mark Bennett (Editorial and Technical Advisory), NIE Enterprise Search - Issue 03 - July 2003
Should you use this Open Source tool in your next Enterprise Search project?
Lucene, part of the Apache Group's Jakarta Project, is a powerful open source search engine implemented in Java (see http://jakarta.apache.org/lucene/docs/index.html). It seems to be gaining visibility, and many of our clients have either heard of it or are already using it.
This article talks about the business aspects of Lucene: items like ease of implementation, support, scalability, etc. For a more in-depth technical look, including source code, please see the links at the end of this article.
Questions About Lucene: (to be covered one at a time below)
- What is it? Where did it come from?
- How much does it cost?
- How does it compare to commercial offerings?
- What skills would a staff need in order to make full use of it?
- What are some projects that might be appropriate for Lucene?
- What are some projects that might be better served by other offerings?
- Does it scale?
- In Summary
- Where can I read more about it? / Resources
What is it? Where did it come from?
Lucene is an open source search engine written in Java. Though powerful and extensible, it is currently more of a toolkit than a turnkey searching solution. Its author, Doug Cutting, previously worked for the search vendor Excite. Lucene's first open source release was in 2000. Please see the resources section for helpful links to more detailed information.
How much does it cost?
Total cost of ownership is a complex question.
On the surface, Lucene is "free". Its generous open source license is the Apache Software License (ASL). In particular, this license does NOT require companies to ship their source code, which had been a common complaint about the GNU-style GPL.
However, since this is a toolkit rather than a turnkey product, some programming or consulting will be required to implement it. The initial and ongoing maintenance costs will need to be factored in.
How does it compare to commercial offerings?
In terms of "quality", we've been very impressed with what we've seen. What's in Lucene seems of high quality, and is well documented. Since it is intended to be extensible, great effort has been taken to document it.
But since this is not a turnkey product, by default you will not have things like:
- No out-of-the-box turnkey solution. There is no "installer" or "setup wizard".
- No out-of-the-box administration or command line tools. The demo code does offer some command line demos which could be leveraged.
- No "spider" to index your web site, though you may be able to find some code to do it
- No built-in support for HTML files, though one of the demos shows how to add it
- No built-in support for office documents such as MS Word, though you could add it
- No support for advanced XML queries, though some articles have been written about this topic.
- No support for Adobe's PDF format, but again, this could be implemented.
- No database gateway, though you could build one.
- No built-in Web Interface for searching, though they do have some sample JSP code
- No "help desk" or Tech Support phone number to call, though it does have a very active and enthusiastic user base. If you read the FAQs, or post a polite question, I'd say the odds are you will get some help.
To their credit, the Lucene folks have posted links to pursue the implementation of many of these features. And other members of the community may already have some rough code that implements a particular feature and that they may be willing to share.
In terms of performance vs. commercial offerings, we can't comment on specific vendor comparisons. I would personally characterize it as "decent" to "pretty good", and for many projects, likely on a par with some commercial products.
What skills would a staff need in order to make full use of it?
You will likely need a Java programmer or two on staff or contract to make your project a success. The good news is there is a LOT of Java talent available, and we are even starting to see Lucene listed on some resumes. Experience with JSP would also be very helpful. And Lucene does ship with some good demo code.
A prototype project might even be appropriate for a Java-savvy intern; you do not need advanced threading or other high-end coding skills to get some use out of Lucene - it is geared more towards mere mortals. Extending Lucene's functionality would require more advanced skills.
What are some projects that might be appropriate for Lucene?
- Excellent for student projects.
- Embedding into an existing application to add search capability, such as adding search capability to an email or messaging client.
- Creating a highly customized Intranet portal.
- Search-enabling a database application that has a relatively low rate of record updates.
- Applications that do not fit well into existing search engine offerings.
What are some projects that might be better served by other offerings?
- Adding generic search to your web site. Even if you are on a budget, there are much more cost effective ways of solving this problem.
- Enterprise systems needing built-in support for Office document formats, PDF, advanced XML searching, etc.
- Very high volume systems.
- Systems where the data to be searched is constantly updated.
- Mission critical projects that require vendor certification and quality of service commitments.
- Systems that will be deployed and administered at multiple locations by busy IT staff members.
Does it scale?
Generally speaking, for systems with light to moderate traffic with reasonably simple queries on datasets up to 100,000 documents, our current impression is that Lucene should be adequate. We have seen reports of Lucene performing well on a 300,000 document dataset, and we have run queries on 800,000 document sets. Simple queries still performed reasonably.
If your datasets are routinely in the 100,000 document range, or if you will ever be searching more than 1 million records, you should investigate performance carefully.
If you require an average of more than 10 queries per second, we encourage you to at least do some performance testing before making decisions. This holds true for commercial vendors as well. Lucene does support some amount of threading.
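As a starting point for that kind of testing, here is a minimal sketch of a load test harness in plain Java. It is not tied to Lucene or to any vendor: the class name QueryLoadTest, the method measureQps, and the stand-in search function in main are all our own hypothetical names, and you would replace the stand-in with a call into whichever engine you are evaluating.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical harness: these names are illustrative, not part of Lucene.
class QueryLoadTest {

    // Runs every query in `queries` `rounds` times on each of `threads` worker
    // threads and returns the observed throughput in queries per second.
    static double measureQps(Consumer<String> searchFn, List<String> queries,
                             int threads, int rounds) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                tasks.add(() -> {
                    int done = 0;
                    for (int r = 0; r < rounds; r++) {
                        for (String q : queries) {
                            searchFn.accept(q); // the engine under test goes here
                            done++;
                        }
                    }
                    return done;
                });
            }
            long start = System.nanoTime();
            int total = 0;
            // invokeAll blocks until every worker has finished.
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                total += f.get();
            }
            long elapsedNanos = Math.max(1L, System.nanoTime() - start);
            return total / (elapsedNanos / 1_000_000_000.0);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("load test failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in search function (an assumption): swap in a real search call.
        Consumer<String> fakeSearch = q -> q.toLowerCase().hashCode();
        List<String> queries = Arrays.asList("lucene", "enterprise search", "open source");
        double qps = measureQps(fakeSearch, queries, 4, 1000);
        System.out.println("Observed queries per second: " + qps);
    }
}
```

Run it with a realistic mix of queries against a realistically sized index, and compare the result with the rate your project requires. A stub like the one in main only exercises the harness itself and says nothing about a real engine.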
Lucene does not do as well for systems with highly volatile data. When source data changes, the Lucene indices must be updated to reflect the new terms present in the modified content. For each "update", Lucene requires a "delete" transaction followed by an "add" transaction, and the "add" will only be visible to newly opened search sessions. This can cause search synchronization and/or latency issues if not properly handled.
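To make that behavior concrete, here is a toy model in plain Java. It is only an illustration of the pattern described above, not Lucene code: ToyIndex, ToySearcher, and their methods are hypothetical names we made up, and a real index is far more involved. The point it shows is that an update is a delete plus an add, and that a search session opened before the update keeps seeing the old content.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: none of these classes or methods are Lucene APIs.
class ToyIndex {
    // Live state of the index: document id -> document text.
    private final Map<String, String> docs = new TreeMap<>();

    void add(String id, String text) { docs.put(id, text); }

    void delete(String id) { docs.remove(id); }

    // An "update" is modeled as a delete followed by an add.
    void update(String id, String text) {
        delete(id);
        add(id, text);
    }

    // A search session works on a snapshot taken at the moment it is opened.
    ToySearcher openSearcher() { return new ToySearcher(new TreeMap<>(docs)); }
}

class ToySearcher {
    private final Map<String, String> snapshot;

    ToySearcher(Map<String, String> snapshot) { this.snapshot = snapshot; }

    // Returns the ids of documents containing the term as a whole word,
    // ignoring case.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        String needle = term.toLowerCase();
        for (Map.Entry<String, String> e : snapshot.entrySet()) {
            List<String> words = Arrays.asList(e.getValue().toLowerCase().split("\\W+"));
            if (words.contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}

class ToyIndexDemo {
    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("doc1", "quarterly sales report");

        ToySearcher oldSession = index.openSearcher();   // opened before the update
        index.update("doc1", "quarterly marketing report");
        ToySearcher newSession = index.openSearcher();   // opened after the update

        System.out.println("old session, 'marketing': " + oldSession.search("marketing"));
        System.out.println("new session, 'marketing': " + newSession.search("marketing"));
    }
}
```

In an application built this way, someone has to decide when search sessions get reopened. Reopening too rarely leaves users looking at stale results, while reopening on every change adds overhead, which is why constantly updated data deserves careful design.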
In Summary
Lucene is a reasonable toolkit for programmers, and for some projects may be a real alternative to commercial offerings. Companies wishing to embed a search engine into their product should certainly consider it.
Small companies or departments with limited IT resources should probably avoid it. Companies who just want a "free" or inexpensive site search solution should look elsewhere.
Where can I read more about it? / Resources
Main Lucene Webpage
http://jakarta.apache.org/lucene/docs/index.html
Background on original coder Doug Cutting:
http://lucene.sourceforge.net/background.html
Licensing FAQ
http://www.apache.org/foundation/licence-FAQ.html
Technical FAQs:
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cg
http://www.jguru.com/faq/Lucene
Technical Articles
http://www.onjava.com/onjava/2003/01/15/lucene.html
http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene.html
Other Free or Inexpensive Site Search Options
In some cases a free offering may require ads and/or branding. Many of the free engines offer an inexpensive upgrade to remove their ads.
http://www.freefind.com
http://picosearch.com
http://www.google.com/services/websearch.htm (some free services)
http://www.master.com/texis/master/app/home.html (Thunderstone / Texis)