Where Have All the Filters Gone?
Introduction
Through various mergers and acquisitions, the three main vendors of commercial document filters are now owned by companies that already sell their own search products. We worry that new search engines may face problems when they try to license these filter packages, since they will be bargaining with direct competitors. This could also hurt users of enterprise search technology, as it may limit choices and allow prices to rise. The situation is not hopeless, however: other solutions are available, though they take some additional effort.

Background: What are "Filters"? And how are they related to search engines?
In order to create a full-text index, which is the heart of every full-text search engine, the engine needs to read the text inside every single document. It then tabulates and catalogs these words and stores that information in a highly optimized binary index. Later, search terms are quickly checked against this compact index to find matching documents (vs. rescanning each individual document). But remember, to create this index the engine must read all the documents, and those documents come in many formats, such as Microsoft Word and PowerPoint, Adobe PDF, HTML/XML, FrameMaker, etc. In order to access the text efficiently, the engine needs a good set of "document filters" to read those various formats and convert the contents into plain old text that the search engine can then index.
Restating: document filters convert documents from various formats into plain text that the search engine can then read and index. This "conversion" is usually temporary and does not affect the original document. Virtually every modern search engine needs a set of document filters.

Three Main Sources
The 3 top commercial filters being used now are:
The Old Guard
For those of you who've been around the industry for a while, there are a couple other names you might vaguely remember, but through mergers and acquisitions these old guard players are either gone, relabeled or otherwise subsumed into other offerings:
Functionality: Filters vs. Viewers vs. Converters
If your only intent was to build a simple search engine then all you'd need is to just get the words out of a document as a generic list. However, documents contain structure and format, and for some applications this actually matters.
Here is the general progression of functionality, from simplest to most complex, and why you might care. Many filters can only perform the simplest types of conversion.
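To make the simplest level concrete, here is a minimal sketch in Python of a text-only filter feeding a tiny index. It is an illustration under stated assumptions, not how any commercial filter works: the class and function names (TextOnlyFilter, filter_html, build_index, search) and the two sample documents are made up for this example, and only HTML is handled, using the standard library's html.parser.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re


class TextOnlyFilter(HTMLParser):
    """Hypothetical minimal 'filter': keeps the text, discards all markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; structure and formatting are dropped.
        self.chunks.append(data)


def filter_html(html):
    """Convert one HTML document to plain text (the 'filter' step)."""
    parser = TextOnlyFilter()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Split plain text into a generic list of lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for doc_id, html in docs.items():
        for word in tokenize(filter_html(html)):
            index[word].add(doc_id)
    return index


def search(index, term):
    """Look a single term up in the index instead of rescanning documents."""
    return sorted(index.get(term.lower(), set()))


# Made-up sample documents, for illustration only.
docs = {
    "a.html": "<html><body><h1>Document Filters</h1><p>Filters extract text.</p></body></html>",
    "b.html": "<p>Search engines build an <b>index</b> of text.</p>",
}
index = build_index(docs)
print(search(index, "text"))
print(search(index, "filters"))
```

Note what this sketch throws away: the heading in a.html and the bold word in b.html become ordinary words, so an engine built on it could not weight titles or emphasized terms differently. It also does not skip script or style content. That loss of structure is exactly where the more capable filters start to matter.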
Summary of Document Filtering Strategies
This is our take on the general strategies companies can follow to provide document filtering. But we'd also love to hear your ideas! Remember, if you are using a mainstream search vendor, they have already taken care of this for you. But if you're writing your own engine, or working to deploy an open source engine, you will need to choose a strategy. Some specialized applications also need their own filters, since they must access documents that may not be inside a search index; examples include security and compliance, tagging/automated classifiers, eDiscovery, and archiving software. General filtering strategies include:
Issues With Each of These Strategies
Like so many other times in engineering, every choice here poses potential compromises. We're trying to be as factual as possible here. If you feel that we're mistaken or have additional information, please do email us. This is not meant as an "attack" on any set of filters.

Potential Stellent/Oracle Issues
Stellent has excellent market share, and we believe they are the market leader for commercial standalone filters at this time. They are used by many of the search engine vendors.
From a business standpoint we've already pointed out that, in theory, Oracle might not be motivated to give good deals to competitors. However, we haven't heard of any actual problems at this time, and industry folks we've talked to seem to think that Oracle is such a big company, and this is such a small product relatively speaking, that they will likely continue to play nice. It's even been suggested that they could put a positive spin on it, listing all of the search engine competitors that are now licensing Oracle technology. The only remaining business concern is cost; neither Stellent nor KeyView is cheap.
From a technical standpoint, we have heard lingering complaints about the stability of Stellent's filters. Though details are sketchy, the general theme is that Stellent either crashes or skips some problematic documents. Some coders have insulated themselves from this by launching Stellent in a separate process or thread, and then also double-checking the output. To be fair, we don't get the impression that this happens very often; it may be that it only affects engines that are handling millions of documents. It may also be that Stellent's filters have been getting more stable in recent versions. We confess to not having a lot of hard numbers on this. Also, we can personally attest to seeing some really mangled documents on the web; we could write an entire article just on that.
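The separate-process approach mentioned above can be sketched in a few lines of Python. This is a generic illustration, not Stellent's API or any vendor's recommended integration: the function name run_filter_isolated is made up, and the "filter" commands below are stand-ins that simply launch the Python interpreter, since we can't assume any particular filter executable is installed. The one design point it shows is that a crash or hang in the child process cannot take the indexer down with it.

```python
import subprocess
import sys


def run_filter_isolated(command, timeout_seconds=30):
    """Run an external filter command in a child process.

    Returns the extracted text, or None if the filter could not start,
    crashed, hung, or produced no usable output.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,      # collect stdout/stderr instead of inheriting them
            timeout=timeout_seconds,  # the child is killed if it runs too long
        )
    except subprocess.TimeoutExpired:
        return None  # the filter hung on this document
    except OSError:
        return None  # the filter executable could not be started

    if result.returncode != 0:
        return None  # the filter crashed or reported failure

    text = result.stdout.decode("utf-8", errors="replace")
    if not text.strip():
        # Double-check the output: an empty result is treated as a skipped
        # document rather than silently indexed as "no words".
        return None
    return text


# Stand-in commands (assumptions for illustration): one that "extracts" text
# successfully and one that exits with an error code, as a crashing filter would.
good_filter = [sys.executable, "-c", "print('extracted text')"]
crashing_filter = [sys.executable, "-c", "import sys; sys.exit(3)"]

print(run_filter_isolated(good_filter))
print(run_filter_isolated(crashing_filter))
```

In a real pipeline the caller would log each None result with the document's name, so that skipped documents can be counted and retried rather than quietly disappearing from the index.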
From a user's standpoint, if the original viewer can open it, it's not corrupt. While that's certainly an understandable position, the reality is that documents vary so widely that anyone not privy to the internal codebase cannot design for every contingency. All filters will occasionally trip over a bad document; Stellent may be mentioned more often simply because they have more users.

Potential KeyView/Autonomy Issues
We suspect Autonomy may be less generous in licensing their filters to competitors, based in part on our own firsthand experience.
Long before the Autonomy/Verity merger, when we were shopping for technology for SearchButton.com, we did talk to Verity about potentially licensing their KeyView filters. One of their first responses to us was something like "Well, we'd need to look at your business plan first." Yes, SearchButton was a potential competitor to Verity, though we were a small startup at that time. We also had a bit of sticker shock from their initial pricing and didn't pursue it any further. For core search we just used the filters that were bundled with Search97. Granted, this was many years ago, and the technology has a new owner now. To be clear, we have no direct reports of Autonomy mistreating any of their competitors over filters.
From a technical standpoint, KeyView seems like a pretty solid technology. However, KeyView doesn't thoroughly maintain every aspect of a document's structure, so some advanced search relevancy tuning based on structure would not be possible.

Microsoft's IFilter Framework
Years ago Microsoft needed filters for its own early search engine efforts, first for the original Search Server product and later for the Content Indexing Service. The early focus was on Microsoft Office related formats, but other formats have been added over the years by both Microsoft and 3rd party vendors.
The formal IFilter Framework became well established when Microsoft used it in its own MSN Desktop Search offering.
The basic filters are "free", since the DLLs ship with many Microsoft products. But free doesn't always mean "open source". We haven't lumped them in with open source filters because there is a real company maintaining them, and they are only shipped in binary form. In contrast, open source software includes the source code, which in fact is the "source" in open source. The good news is that the filters do work and are widely available, and the price is right! However, potential issues include:
Limitations of the Other Strategies Mentioned Above
Recapping, these are the other potential solutions we had mentioned above. We're relisting them here with a summary of the type of problems you might encounter.
IBM to the Rescue?
We had one wacky idea that might help this situation. Maybe IBM will generously step in and fix this by supporting one of the open source filter solutions. Granted, IBM is also in the search engine space, so technically a competitor as well, but they have done other great things for the open source community, and there might be some benefits to them in doing this. For example, IBM has worked to promote Linux, and they were a major source of coding resources for the Eclipse platform. Having a great set of filters would also help any future Lucene efforts they might have in mind. Some benefits to IBM of doing an excellent packaging of an open source filter set:
Other Commercial and Open Source Filters
There are other options out there, both commercial and open source. But there doesn't seem to be any convenient direct replacement for Stellent or KeyView. Disclaimer: The rest of this article is taken from my raw notes, so it is a bit terse, may be out of date, and possibly wrong.

Commercial
Open Source
Final Thoughts
Even if you're just a user of a commercial search engine, we hope we've at least given you some appreciation for what goes on behind the curtains with regard to document filtering.
If you find yourself actually needing a set of filters, we hope we've given you some fresh ideas. If you're not inclined to buy one of the two big commercial packages (Stellent or KeyView), then the other options may all seem a bit complex. If the right path is not clear, we'd remind you of an old engineering proverb: "Technology selection has three main axes: Fast, Good and Cheap - Pick Two". We would modify them to 1: Time/Complexity, 2: Quality/Functionality, and 3: Money/TCO.

Copyright New Idea Engineering, Inc 1996 - 2008