The Case for Lucene: Should you use this Open Source tool in your next Enterprise Search project?

Editorial and Technical Advisory by Mark Bennett - NIE Enterprise Search - Issue 3 - July 2003

Lucene, part of the Apache Group's Jakarta Project, is a powerful open source search engine implemented in Java (see http://jakarta.apache.org/lucene/docs/index.html). It seems to be gaining visibility, and many of our clients have either heard of it or are in fact using it. This article covers the business aspects of Lucene: ease of implementation, support, scalability, and so on. For a more in-depth technical look, including source code, please see the links at the end of this article.

Questions About Lucene: (to be covered one at a time below)
What is it? Where did it come from?

Lucene is an open source search engine written in Java. Though powerful and extensible, it is currently more of a toolkit than a turnkey search solution. Its author, Doug Cutting, previously worked for the search vendor Excite. Lucene's first open source release was in 2000. Please see the resources section for links to more detailed information.

How much does it cost?

Total cost of ownership is a complex question. On the surface, Lucene is "free". Its generous open source license is the Apache Software License (ASL). In particular, this license does NOT require companies to ship their source code, which had been a common complaint about the GNU-style GPL. However, since Lucene is a toolkit rather than a turnkey product, some programming or consulting will be required to implement it. Both the initial implementation cost and the ongoing maintenance cost need to be factored in.

How does it compare to commercial offerings?

In terms of "quality", we've been very impressed with what we've seen. What's in Lucene seems to be of high quality. Because it is intended to be extensible, great effort has gone into documenting it, and it is well documented. But since this is not a turnkey product, by default you will not have things like:
To their credit, the Lucene folks have posted links for pursuing the implementation of many of these features. Other members of the community may already have rough code that implements a particular feature and may be willing to share it.

In terms of performance vs. commercial offerings, we can't comment on specific vendor comparisons. I would personally characterize it as "decent" to "pretty good", and for many projects, likely on a par with some commercial products.

What skills would a staff need in order to make full use of it?

You will likely need a Java programmer or two, on staff or on contract, to make your project a success. The good news is that there is a LOT of Java talent available, and we are even starting to see Lucene listed on some resumes. Experience with JSP would also be very helpful, and Lucene ships with some good demo code. A prototype project might even be appropriate for a Java-savvy intern: you do not need advanced threading or other high-end coding skills to get some use out of Lucene, which is geared more towards mere mortals. Extending Lucene's functionality would require more advanced skills.

What are some projects that might be appropriate for Lucene?
What are some projects that might be better served by other offerings?
Does it scale?

Generally speaking, for systems with light to moderate traffic and reasonably simple queries on datasets of up to 100,000 documents, our current impression is that Lucene should be adequate. We have seen reports of Lucene performing well on a 300,000 document dataset, and we have run queries on 800,000 document sets, where simple queries still performed reasonably. If your datasets are routinely in the 100,000 document range, or if you will ever be searching more than 1 million records, you should investigate performance carefully. If you require an average of more than 10 queries per second, we encourage you to at least do some performance testing before making decisions. This holds true for commercial vendors as well. Lucene does support some amount of threading.

Lucene does not do as well for systems with highly volatile data. When source data changes, the Lucene indices must be updated to reflect the new terms present in the modified content. For each "update", Lucene requires a pair of transactions, a "delete" followed by an "add", and the "add" will only be visible to newly opened search sessions. If not properly handled, this can cause search synchronization and/or latency issues.

In Summary

Lucene is a reasonable toolkit for programmers, and for some projects may be a real alternative to commercial offerings. Companies wishing to embed a search engine into their product should certainly consider it. Small companies or departments with limited IT resources should probably avoid it. Companies that just want a "free" or inexpensive site search solution should look elsewhere.

Where can I read more about it? / Resources

Main Lucene Webpage
Background on original coder Doug Cutting:
Licensing FAQ
Technical FAQs:
Technical Articles

Other Free or Inexpensive Site Search Options

In some cases a free offering may require ads and/or branding. Many of the free engines offer an inexpensive upgrade to remove their ads.
http://picosearch.com
http://www.google.com/services/websearch.htm (some free services)
http://www.master.com/texis/master/app/home.html (Thunderstone / Texis)