Where Have All the Filters Gone?

Introduction

Through various mergers and acquisitions, the three main vendors for commercial document filters are now owned by companies who are already selling their own search products. We worry that new search engines may face problems when they try to license these filter packages. They will be bargaining with direct competitors. This could also negatively impact users of enterprise search technology as it may limit choices and allow prices to rise.

This is not a hopeless situation however, as some other solutions are available, though at some additional effort.

Background: What are "Filters"? And how are they related to search engines?

In order to create a full-text index, which is the heart of every full-text search engine, the engine needs to read the text inside of every single document. It then tabulates and catalogs these words and stores that information into a highly optimized binary index. Later, search terms are quickly checked against this compact index to find matching documents (vs. rescanning each individual document).
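
To make the idea concrete, here is a minimal sketch of such an index in Python. It is only an illustration: a real engine stores a highly optimized binary structure on disk, whereas this toy version keeps a plain in-memory dictionary, and the function names and sample documents are our own inventions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, already in plain text.
docs = {
    "a.txt": "Document filters convert formats into plain text.",
    "b.txt": "The search engine builds a compact index.",
}
index = build_index(docs)
print(sorted(search(index, "plain text")))  # prints ['a.txt']
```

Note that the search step only consults the index; the original documents are never rescanned.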

But remember, to create this index the engine must read all the documents, and those documents come in many formats, such as Microsoft Word and PowerPoint, Adobe PDF, HTML/XML, FrameMaker, etc. In order to access the text efficiently, the engine needs a good set of "document filters" to read those various formats and convert the contents into plain old text that the search engine can then index.

Restating, Document Filters convert documents from various formats into plain text that the search engine can then read and index. This "conversion" is usually temporary and does not impact the original document.
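
As a small illustration, the sketch below is a bare-bones filter for one format, HTML, built on Python's standard `html.parser` module. The class and function names are our own, and a production filter would handle far more (encodings, malformed markup, meta data); this only shows the basic "format in, plain text out" contract.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the plain text of an HTML string; the input is left untouched."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = "<html><body><h1>Title</h1><p>Hello &amp; welcome</p></body></html>"
print(html_to_text(page))
```

The extracted text is what gets handed to the indexer; the original file stays as it was.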

Virtually every modern search engine needs a set of document filters.

Three Main Sources

The 3 top commercial filters being used now are:

1: Stellent: Now part of Oracle, who now sells "search"
2: KeyView: Now part of Autonomy, a Tier-1 search vendor
3: Microsoft IFilter Framework (part of some Microsoft products): Microsoft is also pushing their own search technology.

Number 3 is a bit of an oddity, as it's not really a stand-alone "product" per se; Microsoft includes this as part of Windows and many search engines are starting to leverage it, including Microsoft's own search engines. We'll get to the other commercial and open source players in a bit.

The Old Guard

For those of you who've been around the industry for a while, there are a couple other names you might vaguely remember, but through mergers and acquisitions these old guard players are either gone, relabeled or otherwise subsumed into other offerings:

INSO / OutsideIn: Now part of Oracle via Stellent acquisition.
Mastersoft: Now also part of Oracle, by a rather circuitous route: Mastersoft was bought by Frame, who was then bought by Adobe. The filters were then sold to Inso, and so they are now also part of Oracle via the Stellent acquisition.

Functionality: Filters vs. Viewers vs. Converters

If your only intent was to build a simple search engine then all you'd need is to just get the words out of a document as a generic list. However, documents contain structure and format, and for some applications this actually matters.

Here is the general progression of functionality, from simplest to most complex, and why you might care. Many filters can only perform the simplest types of conversion.

Just "get the text": All the words are brought through as a simple unstructured list of words. This is sufficient for simple word indexing.
Get the text, with paragraph boundaries and meta data: Even simple filters will often give an extra line break (or some other token) wherever a paragraph break was detected. A search engine can then give more weight to search terms that appear together in the same paragraph. Meta data such as Title, Author and date are useful to have when displaying, sorting and filtering results. Many documents have their meta data set in the File / Properties... dialog box of the application that created them.
Get the text and document structure: Get all of the words and retain their location within the basic document structure such as headings, paragraph and page boundaries, tables, meta data, etc. This level of detail allows for zone-based searches and more precise searching within long documents. For example, extra weight can be given if a search term appears in a main chapter or section heading. In large documents, the engine can be told to require all search terms to be within the same chapter or even on the same page. In PowerPoint presentations, slide boundaries represent important logical boundaries. Sadly many filters fall short of this very useful level of detail, which is particularly problematic when trying to find relevant text in long documents. We'd like to see all vendors support this level of detail.
Retain rich document structure with the ability to convert into other formats: This functionality is very useful if a site has many different document formats, and users are unlikely to have all those applications or viewers installed on their machines. You may have seen this on Google where they say "view as HTML". This technique can also be used to highlight search terms within a matching document (see the sidebar below). This conversion to another format is usually temporary, but in a few cases it may be permanent. An example of a temporary conversion would be an MS Word document temporarily converted to HTML to show to a user. Conversely, a few applications might want to convert all documents to a common format such as XML; having all documents in that common format could facilitate further automated processing or custom tagging.
Exact Document Preservation and WYSIWYG Viewing: There were a handful of vendors who supported document formats so well that they could display an almost perfect version of the document to the user for any of the supported document formats, without needing the original application installed. This was popular back in the 1990s during the "client/server" software era. However, despite their best efforts, these companies had trouble rendering a pixel-for-pixel image of the document that would be identical to how it would look in the native application. For example, a very complex Microsoft Word document might not look exactly the same in the universal viewer as it would if it were loaded into Microsoft Word, no matter how hard they tried. This technique eventually became less important as applications requiring high fidelity formatting switched to PDF, and other applications switched to a web based model and settled for an HTML approximation of the original document. The lesson learned was that generally, if you really need high fidelity, either use PDF or have the user install document-specific applications or viewers.
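
To show why the structure-preserving levels matter for relevancy, here is a minimal sketch of zone-based weighting in Python. It assumes a filter that hands back a list of (zone, text) pairs; that representation, the zone names and the weight values are all hypothetical choices of ours, not the output of any particular filter package.

```python
import re

# Hypothetical weights: a hit in a title or heading counts more than a body hit.
ZONE_WEIGHTS = {"title": 5.0, "heading": 3.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def zone_score(zones, query):
    """Sum, over all zones, the zone weight times the number of query-term hits."""
    terms = set(tokenize(query))
    score = 0.0
    for zone_name, text in zones:
        weight = ZONE_WEIGHTS.get(zone_name, 1.0)
        hits = sum(1 for word in tokenize(text) if word in terms)
        score += weight * hits
    return score


# A made-up document as a structure-aware filter might deliver it.
doc = [
    ("title", "Filter licensing"),
    ("heading", "Background"),
    ("body", "Each filter converts a document into text."),
]
print(zone_score(doc, "filter"))  # prints 6.0 (5.0 for the title hit + 1.0 for the body hit)
```

With a filter that only delivers an unstructured word list, every zone collapses into "body" and this kind of tuning is not possible.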

Document Highlighting: An important use for temporary HTML conversion

When somebody mentions "search term highlighting" they usually mean 1 of 2 things:

Highlighting the search terms in the results list, in the document summaries.
- or -
Highlighting the terms inside the complete document when the user opens it up.

Many engines do the former: they show the terms that the user searched for highlighted in the results list.

But some engines also highlight the search terms when viewing the document. Google does this when you click the "view as HTML" link or when you search news groups. Some search vendors also offer this as an option.

These highlights may include hyperlinks to jump to the next instance of a highlighted term, which can be useful in longer documents. Some engines will even open the document to the exact spot where the first instance of a search term appears. This can be a very handy feature, especially when dealing with the longer documents that enterprises often have.

With the exception of PDF, most search engines convert documents into HTML if they are going to show highlights inside the actual document. Once the document is converted into HTML, additional HTML tags can easily be added to show the highlights and extra links.
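
The markup step can be sketched in a few lines of Python. This is a simplification under our own assumptions: it starts from text already extracted by a filter rather than from a full HTML rendering, it matches whole words only, and the choice of the `<mark>` tag and the function name are ours.

```python
import html
import re


def highlight_terms(plain_text, terms):
    """Escape text for HTML and wrap whole-word matches of the terms in <mark> tags."""
    cleaned = [t for t in terms if t]
    if not cleaned:
        return html.escape(plain_text)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in cleaned) + r")\b",
        re.IGNORECASE,
    )
    pieces = []
    last = 0
    for match in pattern.finditer(plain_text):
        # Escape the text between matches, then wrap the match itself.
        pieces.append(html.escape(plain_text[last:match.start()]))
        pieces.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    pieces.append(html.escape(plain_text[last:]))
    return "".join(pieces)


print(highlight_terms("Filters & filter packages", ["filter"]))
```

Escaping each piece before adding the tags keeps characters such as `&` and `<` in the document text from being mistaken for markup.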

PDF files are a special case, as the PDF viewer supports search term highlighting directly in the viewer, while viewing the original PDF document, so there is no need to convert to HTML just to highlight search terms. Of course a user still needs to have the PDF viewer installed; to avoid that requirement PDF documents can also be converted to HTML. So PDF documents can be highlighted either in the native Adobe PDF viewer, or by converting to HTML and marking up with highlights in the same way other document formats are highlighted. The choice is up to the site designer.

Most applications can get by with very basic filtering. Search applications that are dealing with longer documents or need to tweak document relevancy algorithms should consider filters that preserve document structure and formatting. Sites that need to display term highlighting within the documents should use filters that can also convert to HTML, or consider using PDF.

Summary of Document Filtering Strategies

This is our take on the general strategies companies can follow to provide document filtering. But we'd also love to hear your ideas!

Remember, if you are using a mainstream search vendor, they have already taken care of this for you. But if you're writing your own engine, or working to deploy an open source engine, you will need to choose a strategy. Some specialized applications also need to come up with their own filters, since they need to access documents that may not be inside a search index; examples of these specialized applications might be security and compliance, tagging/automated-classifiers, eDiscovery, and archiving software.

General filtering strategies include:

Go ahead and license Stellent or KeyView filters; they are both certainly still on the market and you may get a good deal.
Make use of Microsoft's IFilter platform (more on that later)
Use lesser known commercial filters (see last section of article)
Use Open Source filters (see last section of article)
Write Your own (usually a bad idea)
Use the "strings" command (certainly a hack, read on)
Use a Per-Document strategy
Piggy-back off of another application that has filters
If possible, use only simple or open formats for all of your documents

Issues With Each of These Strategies

Like so many other times in engineering, every choice here poses potential compromises. We're trying to be as factual as possible here. If you feel that we're mistaken or have additional information, please do email us. This is not meant as an "attack" on any set of filters.

Potential Stellent/Oracle Issues

Stellent has excellent market share, and we believe they are the market leader for commercial standalone filters at this time. They are used by many of the search engine vendors.

From a business standpoint we've already pointed out that, in theory, Oracle might not be motivated to give good deals to competitors. However, we haven't heard of any actual problems at this time, and industry folks we've talked to seem to think that Oracle is such a big company, and this is such a small product relatively speaking, that they will likely continue to play nice. It's even been suggested that they could put a positive spin on it, listing all of the search engine competitors that are now licensing Oracle technology. The only remaining business concern is cost; neither Stellent nor KeyView is cheap.

From a technical standpoint, we have heard lingering complaints about the stability of Stellent's filters. Though details are sketchy, the general theme is that Stellent either crashes or skips some problematic documents. Some coders have insulated themselves from this by launching Stellent in a separate process or thread, and then also double checking the output.
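
The separate-process approach might look roughly like the Python sketch below. It is generic and makes no claims about how any vendor's filter is actually invoked: the command line is a placeholder, the timeout value is arbitrary, and treating empty output as a failure is just one possible way to "double check" the result.

```python
import subprocess
import sys


def run_filter(command, document_path, timeout_seconds=30):
    """Run an external filter in its own process; return its text, or None on failure."""
    try:
        completed = subprocess.run(
            command + [document_path],
            capture_output=True,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None  # the filter hung or could not be started
    if completed.returncode != 0:
        return None  # the filter crashed or reported an error
    text = completed.stdout.decode("utf-8", errors="replace")
    if not text.strip():
        return None  # double check: empty output is treated as a skipped document
    return text


# Stand-in for a real filter executable: a tiny Python one-liner that ignores
# its document argument. A real deployment would name the filter program here.
fake_filter = [sys.executable, "-c", "print('extracted text')"]
print(run_filter(fake_filter, "report.doc"))
```

Because the filter runs in a child process, a crash or hang on one bad document costs a single `None` result instead of taking down the whole indexing run.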

To be fair, we don't get the impression that this happens very often; it may be that it only impacts engines that are handling millions of documents. It may also be that Stellent's filters have been getting more stable in recent versions. We confess to not having a lot of hard numbers on this.

Also, we can personally attest to seeing some really mangled documents on the web; I could write an entire article just on that. From a user's standpoint, if the original viewer can open it, it's not corrupt. While that's certainly an understandable position, the reality is that some documents vary so widely that anyone not privy to the internal codebase cannot design for every contingency. All filters will occasionally trip over a bad document; Stellent may be mentioned more often simply because they have more users.

Potential KeyView/Autonomy Issues

We suspect Autonomy may be less generous in licensing their filters to competitors, based in part on our own first hand experience.

Long before the Autonomy/Verity merger, when we were shopping for technology for SearchButton.com, we did talk to Verity about potentially licensing their KeyView filters. One of their first responses to us was something like "Well, we'd need to look at your business plan first." Yes, SearchButton was a potential competitor to Verity, though we were a small startup at that time. We also had a bit of sticker shock from their initial pricing and didn't pursue it any further. For core search we just used the filters that were bundled with Search97.

Granted this was many years ago, and the technology has a new owner now. To be clear, we have no direct reports of Autonomy mistreating any of their competitors over filters.

From a technical standpoint, KeyView seems like a pretty solid technology. KeyView doesn't thoroughly maintain every aspect of a document's structure, so some advanced search relevancy tuning based on structure would not be possible.

Microsoft's IFilter Framework

Years ago Microsoft needed filters for one of its own early search engine efforts, for the original Search Server product and later for the Content Indexing Service. The early focus was on Microsoft Office related formats, but other formats have been added over the years by both Microsoft and 3rd party vendors. The formal IFilter Framework became well established when Microsoft used it in its own MSN Desktop Search offering.

The basic filters are "free", since the DLLs ship with many Microsoft products. But free doesn't always mean "open source". We haven't lumped them in with open source filters because there is a real company maintaining them, and they are only shipped in binary form. In contrast, open source software includes the source code, which in fact is the "source" in open source.

The good news is that the filters do work and are widely available, and the price is right!

However, potential issues include:

Complexity of integration. Using the filters with your software will require some investigation and some coding.
Somewhat limited document format support. You'll need to check your requirements against the list for formats; in some cases 3rd party vendors may have the filter you need that will plug-in to the IFilter framework.
Tightly linked to the Windows operating system, which may or may not be an issue, depending on your OS requirements. We're not sure if there is any chance of Linux support for IFilters, although some .NET components do run on non-Windows operating systems.
Licensing and distribution rights. We have not investigated this thoroughly, but the safest course of action might be to have customers separately install a Microsoft product that includes the filters, and then install your product. But we're not lawyers, so you'll need to do your own homework.

Delving into this subject further is beyond the scope of this article, but here are some links for the curious:

Limitations of the Other Strategies Mentioned Above

Recapping, these are the other potential solutions we had mentioned above. We're relisting them here with a summary of the type of problems you might encounter.

Use lesser known commercial filters (may require multiple vendors)
Use Open Source filters (complexity)
Write Your own (more complexity)
Use the "strings" command (doesn't preserve any format or structure, and may not work on some formats, can also produce many false tokens)
Use a Per-Document strategy (complexity)
Piggy-back off of another application that has filters (usually unsupported by provider, and complex)
Embedded filters (same as above)
If possible, use only simple or open formats for all of your documents (may not be feasible if not in control of content creation, or handling legacy data)
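
To see why the "strings" approach is a hack, here is a rough Python approximation of what that command does: it pulls out runs of printable ASCII bytes of a minimum length. This is our own simplified stand-in, not the actual utility, and the sample bytes are made up.

```python
import re


def extract_strings(data, min_length=4):
    """Return runs of printable ASCII bytes that are at least min_length long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [run.decode("ascii") for run in pattern.findall(data)]


# Made-up binary content: real text mixed with control bytes and format markers.
sample = b"\x00\x01Quarterly report\x00\xff\x02ab\x00%PDF\x00"
print(extract_strings(sample))  # prints ['Quarterly report', '%PDF']
```

The real text comes through, but so does the `%PDF` marker, a false token, and there is no trace of paragraphs, headings or meta data. Formats that compress or encode their text will yield little or nothing useful.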

IBM to the Rescue?

We had one whacky idea that might help this situation. Maybe IBM will generously step in and fix this by supporting one of the open source filter solutions.

Granted, IBM is also in the search engine space, so technically a competitor as well, but they have done other great things for the Open Source community, and there might be some benefits to doing this as well. As examples, IBM has worked to promote Linux, and they were a major source of coding resources on the Eclipse platform. Having a great set of filters would help any future Lucene efforts they might have in mind.

Some benefits to IBM of doing an excellent packaging of an open source filter set:

Generally good publicity for IBM, especially in the search and content mining communities, and open source folks in general.
Could be customized to provide enhanced capabilities within their own search product offerings
Might be a complement to their other open source activities
A good start on entity extraction and other content mining activities
A bit of a shot over the bow at Autonomy, Oracle and Microsoft, the keepers of the current filter packages
Would be multi-platform (Microsoft's is free, but only for Windows)

Other Commercial and Open Source Filters

There are other options out there, both commercial and open source. But there doesn't seem to be any convenient direct replacement for Stellent or KeyView.

Disclaimer: The rest of this article is taken from my raw notes, so it is a bit terse, may be out of date, and possibly wrong.

Commercial

Commercial / General
- Davisor Offisor 4.1
  Java Based, converts many formats into XML
  http://www.davisor.com/offisor/index.html
  Filtrix??
- Blueberry Software Filtrix
  http://www.blueberry.com
  To/from many publishing oriented formats; missing some Office formats such as Excel. Does handle FrameMaker MIF files. Can output to HTML. Windows and Solaris. Oddly, no support in/out for XML or PDF.
  http://www.blueberry.com/formats.htm
- WordPort
  From acii.com (http://www.acii.com/wpt.htm), also claims to read various Multimate formats.
- LogicTran
  http://www.logictran.com/index.html#r2net
- YAWC Pro
  http://www.yawcpro.com/
- Defunct? WvWare
  Filter for Microsoft Word documents (old)
  http://www.wvware.com
Commercial Uni-Format
- Antenna House to/from PDF/XML (Japan)
  Seems to only handle NEW XML-based Word format, WordML
  XML and PDF munging
  http://www.antennahouse.com/aboutus.htm
- DocSoft Word to XML
  W2XML Word to XML, even DocBook.org format
  http://www.docsoft.com/w2xmlv2.htm
  W2XML may have been called Wordplay at some point in the past.
- Infinity Loop upCast
  Microsoft Word only, to/from XML; upCast and downcast respectively.
  http://www.infinity-loop.de/products/upcast/index.html
  Some good technical info
  http://www.infinity-loop.de/products/upcast/support.html
Commercial PDF
- XPump
  NIE's own PDF to XML reader (part of XPump)
- Some email based converter?
  http://preprints.cern.ch/Convert?emailGuide
- $65 shareware
  http://www.sanface.com/jpg2pdf.html
- Java Classes for PDF, free, "big faceless"
  http://big.faceless.org/products/pdf/index.jsp

Open Source

Open Source / General
- Antiword
  http://www.winfield.demon.nl/
- Abiword command line conversion
  http://linuxhelp.blogspot.com/2005/08/use-abiword-to-convert-filetypes-on.html
  http://en.wikipedia.org/wiki/AbiWord
  http://www.abisource.com/
  http://portableapps.com/apps/office/word_processors/portable_abiword
- Lius Lucene Index Update and Search
  http://sourceforge.net/projects/lius
Open Source Office Formats
- MS Word document format
  http://en.wikipedia.org/wiki/Microsoft_Word#File_formats
  http://visualbasic.about.com/od/learnvba/l/blecvbai0204.htm
- Open Office developer page
  Article to convert from the command line
  http://www.xml.com/pub/a/2006/01/11/from-microsoft-to-openoffice.html
  http://development.openoffice.org/
  http://wiki.services.openoffice.org/mwiki/index.php?title=Filter&action=edit
- Jakarta POI Java API for Microsoft formats
  For OLE 2 Compound Document Format
  HWPF for Word Documents
  HPSF for Document Properties
  http://jakarta.apache.org/poi/
  http://jakarta.apache.org/poi/hwpf/index.html
- Koffice Microsoft Word Filter KWord
  http://www.koffice.org/filters/1.4/kword/msword97.php
- Doc2XML (in Python)
  http://pair.mbl.ca/doc2xml/
- Nutch MS Word (Lucene)
  http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/msword/package-summary.html
  http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/msword/chp/package-summary.html
- EgoThor Microsoft Office formats
  http://egothor.sourceforge.net/documentation/api/msft/
- Open Office
  early on was xmerge to get away from licensing ... some relation to Inso
  Proposed http://xml.openoffice.org/xmerge/docs/XMerge_sdk.pdf
  For small devices http://xml.openoffice.org/xmerge/
  Open office java port for Mac
  http://neowiki.sixthcrusade.com/index.php/NeoOffice/J_File_Formats
- Apache Office Format Project
  This is the leading candidate for OLE2 Compound Documents http://jakarta.apache.org/poi/
- TEXT Mining (Word)
  http://www.textmining.org/modules.php?op=modload&name=Downloads&file=index&req=viewdownload&cid=2
  Part of Apache, uses some POI for some MS formats, better than straight POI
- Suggested to also see code from wais and digg oxml tools
Open Source HTML / XML
- Nutch HTML to DOM
  http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/html/package-summary.html
- Tidy / JTidy
  http://sourceforge.net/projects/jtidy
Open Source PDF
- Nutch PDF
  http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/pdf/package-summary.html
- EgoThor PDF
  http://egothor.sourceforge.net/documentation/api/pdf/
- another PDF
  http://www.pdfbox.org
- XPDF and PDF2HTML
- XPDF
  http://www.foolabs.com/xpdf/
- PDF2HTML
  http://pdftohtml.sourceforge.net/
- Jpedal
  http://www.jpedal.org/

Final Thoughts

Even if you're just a user of a commercial search engine, I hope we've at least given you some appreciation for what goes on behind the curtains, with regard to document filtering.

If you find yourself actually needing a set of filters we hope we've given you some fresh ideas. If you're not inclined to buy one of the two big commercial packages (Stellent or KeyView), then the other options may all seem a bit complex. If the right path is not clear, I'd remind you of an old engineering proverb: "Technology selection has three main axes: Fast, Good and Cheap - Pick Two." We would modify them to 1: Time/complexity, 2: Quality/Functionality, and 3: Money/TCO.