Intelligent Query Pre-Processing

Last Updated Mar 2009

By: Mark Bennett & Miles Kehoe - NIE Tech Staff - Issue 4 - August/September, 2003

Introduction

If you’ve been monitoring the search activity on your web site, you’ve probably noticed that there are two kinds of searcher: those who use one or two words to locate a particular document, and those who use very long queries to try to home in on exactly the document they want. Unfortunately, neither type of query is very good at actually locating answers.

Luckily, most enterprise search technology allows you to pre-process user queries before they reach your search engine. By processing those queries intelligently, you can greatly improve the chances that the right document will show up near the top of the result list.

Different Approaches based on Different Queries

Let’s break the problem into its two parts: too few useful terms, and too many extraneous terms.

Not Enough Information

When most users search on a web site, they enter one word. It may be a product (“printers”); a department (“sales”); or support information (“updates”). The obvious answer is to use a product that lets you define the correct document for your most popular queries, but if you really want to try it on your own, you need to evaluate the query in your search pre-processing script to decide how to identify what the user wants.

The Dissertation Query

The opposite extreme is the user who decides to type a question that he or she might ask one of your tech support reps (“I’m looking for the little attachment that sits on the right side of the input tray, and that guides the paper into the printer when I use the envelope printer attachment”). The other variation of this long-winded user types every word that might be related to the document in question, in hopes that your search engine will miraculously know what to return (“printer input tray attachment plastic small guide paper feed”). To keep the user from getting no results whatsoever, you need to do some query pre-processing.

The Problem

The real problem is that your search engine offers a number of different relevancy tools, but some work best on short queries while others work best on long queries. To solve the problem, you need to pre-process the user query before you pass it to the search engine.

Short and Concise

When a user enters a single term, you can apply extensive pre-processing to the query before you send it to your search engine. Some of the operations you can apply to single-term queries include:

  • Thesaurus and Synonyms
  • Soundex
  • Stemming (plurals, etc.)
  • User term in the title
  • User term in the URL

Long Winded

For longer queries, operators like Soundex and thesaurus are much less useful and more processing-intensive, and it’s much less likely that all of the terms will appear in a title or in a URL. For longer queries, you might consider using operators such as:

  • Many
  • Phrase
  • Near/Proximity
  • “Like” or “Accrue”

Sample Implementation

As we discussed in the article Adjusting Search Engine Relevancy in the June Issue of Enterprise Search, you can expand the user query in your search ASP or JSP code before you pass it to the search engine itself.

To implement the sort of algorithm described above, you would first need to examine the user query to determine whether you have a short query or a long query.
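As a rough illustration of that first step, the sketch below classifies a query by counting whitespace-separated terms. It is written in Python rather than ASP, and the function name `classify_query` and the `max_short_terms` cutoff are our own assumptions; the right cutoff depends on the queries you see on your site.

```python
def classify_query(user_query: str, max_short_terms: int = 1) -> str:
    """Label a query 'empty', 'short', or 'long' by its term count.

    max_short_terms is a hypothetical tuning knob: the default of 1
    treats only single-term queries as short, matching the examples
    in this article.
    """
    terms = user_query.split()  # split on any run of whitespace
    if not terms:
        return "empty"
    if len(terms) <= max_short_terms:
        return "short"
    return "long"


if __name__ == "__main__":
    for q in ["printers", "cat food", "   "]:
        print(repr(q), "->", classify_query(q))
```

Returning a separate "empty" label lets the search script skip expansion entirely when the user submits a blank form.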

Next, you want to carefully use the syntax for your engine to expand the user query to include the expansions you want. For example, in a single term query, you want to look for not only the user term, but also the term within the title, etcetera. It is critical that your expansion generate valid syntax for your search engine, or your user will see an error, and will either go away unhappy or will call your support team to find the answer.
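One common way to end up with invalid syntax is to paste raw user input that already contains operator characters into the expanded query. The Python sketch below backslash-escapes such characters before expansion. The character list is an assumption modeled on the classic Lucene query parser; it does not cover Verity, and you should check it against the documentation for your engine and version. The name `escape_lucene_term` is hypothetical.

```python
import re

# Assumed set of characters with special meaning to the classic Lucene
# query parser: \ + - ! ( ) : ^ [ ] " { } ~ * ? | & /
_SPECIAL_CHARS = re.compile(r'([\\+\-!():^\[\]"{}~*?|&/])')


def escape_lucene_term(term: str) -> str:
    """Prefix each special character with a backslash so it is treated
    as literal text rather than as query syntax."""
    return _SPECIAL_CHARS.sub(r"\\\1", term)


if __name__ == "__main__":
    for raw in ["cat", "c++", "title:cat", "(cat)"]:
        print(raw, "->", escape_lucene_term(raw))
```

Escaping before expansion means the operators your script adds afterward (such as a trailing wildcard or a field prefix) are the only syntax the engine sees.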

Finally, once you have expanded the query, don’t confuse the user by showing the expanded query; just show the user input and expand it again if your user should perform another search.
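One simple way to keep the two apart is to carry the user's input and the engine query as separate values. The following Python sketch assumes a hypothetical `PreparedQuery` record and a caller-supplied `expand` function; neither name comes from any search product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PreparedQuery:
    user_input: str    # what the results page echoes back to the user
    engine_query: str  # what is actually sent to the search engine


def prepare(user_input: str, expand: Callable[[str], str]) -> PreparedQuery:
    """Expand the query for the engine while preserving the original text."""
    cleaned = user_input.strip()
    return PreparedQuery(user_input=cleaned, engine_query=expand(cleaned))


if __name__ == "__main__":
    prepared = prepare(" cat ", lambda q: q + " OR title:" + q)
    print("Show the user:", prepared.user_input)
    print("Send to engine:", prepared.engine_query)
```

Because the search form is always repopulated from `user_input`, the next search starts from what the user typed and is expanded afresh.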

Single Term Query Processing Example

What follows are sample ASP scripts that expand single-term user queries in Verity and in Lucene.

Using Verity Query Language (VQL) syntax, and assuming the query form field is named userQuery, you might expand the query as follows:

    ' Append each expansion to the original term, separated by commas
    expandedQuery = userQuery + ",<THESAURUS>(" + userQuery + ")"
    expandedQuery = expandedQuery + ",<SOUNDEX>(" + userQuery + ")"
    expandedQuery = expandedQuery + ",'" + userQuery + "'"
    expandedQuery = expandedQuery + ",title<CONTAINS>" + userQuery
    expandedQuery = expandedQuery + "," + userQuery + "<IN>url"

If the user entered “cat”, the expanded query here would be:

    cat,<THESAURUS>(cat),<SOUNDEX>(cat),'cat',title<CONTAINS>cat,cat<IN>url

Most major search technologies feature a rich query syntax, but some offer only a subset of these operators by default. To implement the same type of capability in Lucene, which does not yet feature a THESAURUS or SOUNDEX operator, you may need to limit the query expansion:

    ' OR together the term, a wildcard form, and title and URL field matches
    expandedQuery = userQuery
    expandedQuery = expandedQuery + " OR " + userQuery + "*"
    expandedQuery = expandedQuery + " OR title:" + userQuery
    expandedQuery = expandedQuery + " OR url:" + userQuery

If the user entered “cat”, the expanded query here would be:

    cat OR cat* OR title:cat OR url:cat

You can see that the results and relevance might not be the same between the two engines, but the query expansion approximates the same single term query handling.
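For readers who do not work in ASP, the same single-term expansions can be sketched in Python. The function names are our own, and the operators are simply those used in the two listings; the sketch assumes the term has already been checked for characters your engine treats as syntax.

```python
def expand_single_term_vql(term: str) -> str:
    """Build a comma-separated VQL expansion for a single term."""
    parts = [
        term,
        f"<THESAURUS>({term})",
        f"<SOUNDEX>({term})",
        f"'{term}'",
        f"title<CONTAINS>{term}",
        f"{term}<IN>url",
    ]
    return ",".join(parts)


def expand_single_term_lucene(term: str) -> str:
    """Build an OR-joined Lucene expansion for a single term."""
    parts = [term, f"{term}*", f"title:{term}", f"url:{term}"]
    return " OR ".join(parts)


if __name__ == "__main__":
    print(expand_single_term_vql("cat"))
    print(expand_single_term_lucene("cat"))
```

Collecting the pieces in a list and joining them once keeps the separators consistent, which makes it harder to generate a stray comma or a dangling OR.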

Multiple Term Query Processing Example

What follows are sample ASP scripts that expand multiple-term user queries in Verity and in Lucene.

Implementing the multiple-term expansion requires a bit more coding, and perhaps more care with parentheses and grouping. Using Verity Query Language, your search script might contain the following code segment (remember that MANY is the default in VQL, so it does not need to be specified explicitly):

    ' MANY is the default, so the raw query comes first without an operator
    expandedQuery = userQuery
    expandedQuery = expandedQuery + ",<PHRASE>(" + userQuery + ")"
    expandedQuery = expandedQuery + ",<NEAR>(" + userQuery + ")"
    expandedQuery = expandedQuery + ",<ACCRUE>(" + userQuery + ")"

When processing the query “cat food”, this code would yield the query string:

    cat food,<PHRASE>(cat food),<NEAR>(cat food),<ACCRUE>(cat food)

Similar query processing in Lucene might include the following code. Since Lucene does not support the ACCRUE or LIKE operator except as the default, we only specify the phrase and proximity operators. Note that CHR(34) is a double-quote character, required for Lucene phrase syntax. Note also the proximity operator ~5, which follows the closing quote and matches when the terms appear within 5 words of each other.

    ' CHR(34) supplies the double quotes Lucene needs around a phrase
    expandedQuery = CHR(34) + userQuery + CHR(34)
    expandedQuery = expandedQuery + " OR " + CHR(34) + userQuery + CHR(34) + "~5"

For the query “cat food”, this preprocessing code generates the following query:

    "cat food" OR "cat food"~5
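The multiple-term expansions can be sketched in Python in the same spirit. The function names and the `slop` parameter are our own assumptions. Note that the proximity value is placed after the closing quote, which is where Lucene's query syntax expects it, and that stripping embedded double quotes is just one simple way to keep the generated phrase well formed.

```python
def expand_multi_term_vql(query: str) -> str:
    """Raw query (MANY by default) plus PHRASE, NEAR, and ACCRUE forms."""
    operators = ["PHRASE", "NEAR", "ACCRUE"]
    parts = [query] + [f"<{op}>({query})" for op in operators]
    return ",".join(parts)


def expand_multi_term_lucene(query: str, slop: int = 5) -> str:
    """Exact phrase OR the same terms within `slop` words of each other."""
    # Assumption: drop embedded double quotes so the phrase stays balanced.
    phrase = '"' + query.replace('"', "") + '"'
    return f"{phrase} OR {phrase}~{slop}"


if __name__ == "__main__":
    print(expand_multi_term_vql("cat food"))
    print(expand_multi_term_lucene("cat food"))
```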

Summary

Query preprocessing can be a useful tool if you make sure to create valid syntax for your engine; process the query intelligently, based on the standards specific to your site; and test the code thoroughly before you roll it into production.