Search Term Highlighting
Last Updated Jan 2009
By Mark Bennett - Volume 2 Number 3 - September / October 2004
Like the blind men in John Godfrey Saxe's poem The Blind Men and the Elephant, people envision many different things when they talk about "search term highlighting". And, not to be outdone, enterprise search vendors support the feature in a number of different ways. In coming months we will go into detail about how you can implement the different styles of search term highlighting in your environment, but this month we will start with an overview.
Generally speaking, "highlighting" means marking up a document to visually indicate the words that a site visitor used to perform a search. If you searched for "Java", the matching documents would show all occurrences of the word, typically rendered in a visual style that stands out.
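As a rough illustration of the basic idea, here is a minimal Python sketch that wraps each occurrence of a search term in an HTML tag. The function name highlight_terms and the choice of the <mark> tag are our own assumptions for the example; real search engines do this internally and in their own ways.

```python
import html
import re

def highlight_terms(text, terms, tag="mark"):
    """Wrap whole-word, case-insensitive matches of each term in <tag>...</tag>.

    `text` is treated as plain text, so it is HTML-escaped on the way out.
    """
    cleaned = [t for t in terms if t]
    if not cleaned:
        return html.escape(text)
    # Longest terms first, so a longer term wins when two terms overlap.
    alternatives = "|".join(
        re.escape(t) for t in sorted(cleaned, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

    pieces = []
    last = 0
    for match in pattern.finditer(text):
        pieces.append(html.escape(text[last:match.start()]))
        pieces.append(f"<{tag}>{html.escape(match.group(0))}</{tag}>")
        last = match.end()
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)

print(highlight_terms("Java and java beans", ["java"]))
```

Whole-word matching means a search for "Java" does not light up "JavaScript". The simple \b word boundary used here will not suit terms that begin or end with punctuation, such as "C++", so a production version would need a smarter tokenizer.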
Currently there are four types of highlighting that most vendors support:
1: Highlighting the summaries in the results list
This is what Google and many enterprise search engines do. If your search matches 10 web pages or documents, you'll see 10 titles listed. After each title will be a small excerpt from that document. Any search terms that are in that excerpt will be highlighted.
Some engines even go out of their way to select summaries that contain the highest "hit count" of search terms; we tend to prefer using either static summaries or a dynamic document summary based on the query.
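To make the "hit count" idea concrete, the following Python sketch slides a fixed-size window of words across a document and keeps the window containing the most search terms. The function name best_excerpt and the default window size are hypothetical; this is one plausible approach, not a description of how any particular vendor selects its summaries.

```python
import re

def best_excerpt(text, terms, window=30):
    """Return the run of `window` consecutive words with the most term hits."""
    words = text.split()
    wanted = {t.lower() for t in terms}
    # 1 where the word, stripped of punctuation, is a search term, else 0.
    hits = [1 if re.sub(r"\W+", "", w).lower() in wanted else 0 for w in words]

    if len(words) <= window:
        return " ".join(words)

    # Slide the window one word at a time, updating the count incrementally.
    count = sum(hits[:window])
    best_count, best_start = count, 0
    for start in range(1, len(words) - window + 1):
        count += hits[start + window - 1] - hits[start - 1]
        if count > best_count:
            best_count, best_start = count, start
    return " ".join(words[best_start:best_start + window])

sample = "one two three java four five java java six"
print(best_excerpt(sample, ["java"], window=3))
```

On ties, the earliest window wins, which tends to favor text near the top of the document. The chosen excerpt can then be run through whatever highlighting step the results page uses.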
2: Highlighting the actual HTML page when viewed
In this case, the search terms are highlighted when the document is viewed.
An advantage of this type of highlighting, with some search engines, is that when you open the document, it "jumps" to the first instance of a search term. This can be a real plus, especially when the first occurrence is several pages into a long document.
Some vendors even let you jump around inside the page, from one highlighted term to the next, by clicking with your mouse.
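One way to picture this style of highlighting is a filter that rewrites the HTML as it is served, marking terms only in visible text and giving each hit a numbered id. The Python sketch below uses the standard library's html.parser for this. The class name HitHighlighter, the function highlight_html, and the "hit1", "hit2", ... id scheme are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser

class HitHighlighter(HTMLParser):
    """Re-emit an HTML page, wrapping search terms found in text in <mark>."""

    # Elements whose text should never be altered.
    SKIP = {"script", "style", "title", "textarea"}

    def __init__(self, terms):
        super().__init__(convert_charrefs=False)
        if not terms:
            raise ValueError("at least one search term is required")
        alternatives = "|".join(re.escape(t) for t in terms)
        self.pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
        self.out = []
        self.hits = 0
        self.skip_depth = 0

    def _mark(self, match):
        self.hits += 1
        return f'<mark id="hit{self.hits}">{match.group(0)}</mark>'

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # original tag, verbatim

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            self.out.append(data)
        else:
            self.out.append(self.pattern.sub(self._mark, data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

def highlight_html(page, terms):
    """Return (highlighted_html, number_of_hits)."""
    parser = HitHighlighter(terms)
    parser.feed(page)
    parser.close()
    return "".join(parser.out), parser.hits

page = '<p class="java">Java &amp; more java</p>'
print(highlight_html(page, ["java"]))
```

Because every hit carries an id, a link to the page ending in "#hit1" makes the browser scroll to the first highlight, and the returned hit count is enough to build "next" and "previous" links between hits. Working on text nodes only keeps terms inside tag attributes, scripts and the page title from being mangled.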
3: Highlight terms in an Adobe PDF document within the Adobe Viewer
This is a bit more sophisticated, but we have seen this work well with Verity's K2 product. Essentially, the Adobe Acrobat PDF viewer is passed the URL of the source PDF document, but that URL is in a special format: it instructs the viewer to retrieve additional information from the server about which words in the document to highlight. If you look carefully at one of these URLs, you'll see a special flag of "...#xml=..."; this is the "magic" flag that makes the highlighting work.
Inside the PDF viewer, terms are highlighted, and the user can click buttons to jump to the next highlight. The viewer will also jump automatically to the first page with a highlight, which can save a lot of time in a long document.
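The general shape of such a link can be sketched in a few lines of Python. Everything except the "#xml=" flag itself is an assumption here: the function name pdf_highlight_url, the highlight service address, and its "doc" and "q" parameters are hypothetical stand-ins for whatever your search engine actually exposes, and whether the embedded URL needs further escaping depends on the viewer and server involved.

```python
from urllib.parse import urlencode

def pdf_highlight_url(pdf_url, highlight_service, doc_id, terms):
    """Build a PDF link whose fragment points the viewer at highlight data.

    `highlight_service` is a hypothetical server endpoint that would return
    the highlight information for the given document and query.
    """
    query = urlencode({"doc": doc_id, "q": " ".join(terms)})
    return f"{pdf_url}#xml={highlight_service}?{query}"

print(pdf_highlight_url(
    "http://example.com/docs/guide.pdf",
    "http://example.com/hits",
    "42",
    ["java", "beans"],
))
```

The part before the "#" is an ordinary link to the PDF, so a viewer that does not understand the flag should still simply open the document without highlights.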
4: Highlight key words in proprietary formats
Microsoft Office documents, such as Word, Excel and PowerPoint files, and other formats such as WordPerfect store text in proprietary formats.
In this case, the search engine converts the proprietary "binary" document into HTML "on the fly", dynamically adding highlights. The search engine then displays the document in an HTML window as in (2) above.
Note that when you convert from one of these formats to HTML, some formatting will be lost; how good the output document looks depends on the document itself and on the specific converter being used. As an example, Verity K2 offers highlighting using the Verity KeyView filters, which generally do an excellent job of maintaining document fidelity in most documents and formats. The loss of formatting in complex documents may be unacceptable in some applications; in that case, native PDF might be more appropriate.
An advantage of this dynamic conversion to HTML, even for PDF, is that no special viewer needs to be installed on the user's machine.
Some search results pages give the user a choice of which view of the document to display when a result is selected. This lets an educated user select what works best; but sometimes an educated user is hard to find.
Summary
Search highlighting is a complex subject; in coming issues we will address specific details of implementing various types of highlighting in different environments.