Lucene: An Interview with Otis Gospodnetic
Last Updated Mar 2009
By Theresa Shafer, New Idea Engineering, Inc. - Volume 5 Number 3 - April/May 2008
Lucene is a free, open-source search and retrieval application available through the Apache Software Foundation (ASF) and is released under the Apache Software License. Lucene itself is an application-agnostic text indexing and search toolkit; it doesn't contain functionality such as crawling and document parsing. The Apache project Nutch is a search package based on Lucene that includes crawling, parsing, and indexing in addition to searching. It's built for large-scale web-wide search engine installations, but it's also often used for building vertical search engines. Solr, another Apache project, is an open-source search server built on top of Lucene. We refer to the Lucene/Nutch/Solr trio as LNS.
We interviewed Otis Gospodnetic about the opportunities and challenges to consider when evaluating an open source search technology. Otis is the co-founder of Sematext, a Lucene expert, co-author of Lucene in Action, and a long-time Lucene and Solr developer with over 10 years of experience in search and related technologies. Sematext implements open-source search, linguistic, and text analytics technology in the enterprise. They focus on the development of scalable and high-performance search solutions.
Q: Why do corporations choose open source engines like Lucene/Nutch/Solr?
A: There are numerous good reasons for choosing one of these packages. The key advantage of an open source search engine is the price. The flexibility to customize the code is another key reason for companies to choose an open source tool. Would you rather wait for the next version of a commercial search solution and hope that it has the feature that you really need, or would you rather get the open-source software today and pay somebody to add exactly the functionality that you need next week? While I can't speak for all open-source projects, the LNS trio has been tried and tested in some very high-profile and high-traffic companies (e.g. Amazon, Friendster, Technorati, Digg, MySpace ...). With so many people testing it around the world and around the clock, the quality of the code is very high.
Q: What are some of the issues?
A: When working with open-source software there is a little bit of culture one should try to understand and work with. Sometimes clients ask us to make changes in LNS. This is perfectly fine, but they should always at least consider contributing their changes back to the open source project. They need to know what the cost to them might be if they fork in a way that makes it hard for them to maintain their changes.
Q: What is the true cost of this decision?
A: If you do not contribute the changes back to the open source project, you will bear the cost of ongoing upgrades and porting of the changes. If the changes are not made in a way that makes it easy to upgrade, you might get stuck with an old version of the code unless you keep porting them to each new release. Sometimes that's easy, sometimes it's hard, depending upon how many changes were made between versions and what these changes involved. Part of what makes LNS so stable is the release cycle: Lucene, Solr, and Nutch all have infrequent releases. Releases are typically months apart and sometimes it can be a year between releases. There are people working on LNS night and day, so in a year's time a lot of changes are made. Luckily, developers on all three projects are experienced software engineers and pay very close attention to backwards compatibility of both the API and the index format. Changes to APIs and indices always go through a deprecation phase and release, which simplifies upgrades. There is also a cost to not upgrading. Newer code is often faster; therefore staying with an older release could require more hardware. Because of these costs and trade-offs, we usually recommend contributing back as much as you can afford and as much as makes sense from the IP standpoint.
Q: It sounds like your strong advice is to contribute changes back?
A: Unless you have some change that's extremely proprietary and really worth keeping that way, our advice is to contribute it back to the open-source project. By doing so, you shed the maintenance responsibility and get improvements to your changes over time and for free. Companies are typically not doing rocket science, but are implementing solutions to common problems. If they show interest in contributing back to the project, the project developers will often jump in and help with integration. This is especially true when the proposed changes handle a common use case and scratch many people's itches. Such help from project developers often leads to improvements of the code on its way into the project.
Q: What's involved in contributing changes?
A: Just because a person or a company contributes something, it doesn't mean that it's going to be accepted. This is where one has to have a good understanding of the project and the people who invest their "extra" time in it. The LNS developers have very high standards for the quality of the accepted code. This is one reason why contributions often get improved even on their way in. For instance, any significant changes need to come with appropriate unit tests. These unit tests need to show that the newly contributed functionality indeed works as described and that it doesn't break any other existing functionality. The code needs to be sufficiently documented. Before the code is accepted, it gets looked at by several pairs of eyes, and when the contribution is missing something or could be improved, there are typically several back-and-forths until things are ironed out. While this seems like (and is) extra work in the short run, it pays off in the long run. Another thing to keep in mind is the complexity of one's contribution and the ease of its application and testing. If the contribution is complex, do everything you can to make it easy for somebody to read, review, and understand your contribution. Make it easy for a project developer to apply your change on their own workstation and see your changes in action. The easier one makes this process, the higher the chances of the contribution getting integrated into the project. If you are making changes to the open-source project, make them on the development version, not on some old version of the code. This, too, will make it easier for project developers to test and accept your work. Finally, working in a vacuum when making changes to an open-source project is risky. If you have a change to make and contribute, the best thing to do is discuss it in the open, typically on a mailing list. You do not want to spend precious time on a change that will later be rejected for whatever reason (e.g. bad approach, bad implementation, somebody already implemented the same functionality last week ...).
Q: What sources do you regularly consult for practical information?
A: Believe it or not, I still have a copy of Lucene in Action handy and I do consult it periodically. Each of the LNS projects has a Wiki with a decent amount of documentation, and Solr and Nutch have tutorials on their Wikis.
- Lucene homepage: http://lucene.apache.org/
- Solr homepage: http://lucene.apache.org/solr/
- Solr tutorial: http://lucene.apache.org/solr/tutorial.html
- Solr wiki: http://wiki.apache.org/solr/
- Nutch homepage: http://lucene.apache.org/nutch/
- Nutch wiki: http://wiki.apache.org/nutch/
Our take....
We're seeing a good deal of interest in Lucene and Solr, and to a lesser extent Nutch, from many of the companies we work with. Our thanks to Otis for sharing his experience, and thanks in advance for the additional interviews we expect to have with him over the coming months.