How do you create your own thesaurus file in K2? - Ask Dr. Search
Last Updated Mar 2009
By: Mark Bennett
This month's question: How do you create your own thesaurus file in K2?
The Verity K2 engine supports a user-defined thesaurus that lets you define synonyms for specific terms. Once the thesaurus is enabled, a search that uses the <THESAURUS> operator expands to match documents containing the synonyms defined in the custom thesaurus.
Creating a Thesaurus File
Creating and using a thesaurus for Verity K2 is a multi-step process:
- identify the 'equivalent terms' and create the thesaurus 'source' file
- compile the source file
- move the compiled source file into the Verity directory structure
Let's look at how to implement these steps.
1. Identify the equivalent terms and create the file
Verity's thesaurus provides a way to define terms that are essentially equivalent. You create the association between synonymous terms, and enter them into a 'source thesaurus file'.
The format for each detail line of the source file is:
list: "cars, automobile, cars, corvette, dodge charger, mustang"
In this case, when the user does a search for "cars", documents with "automobile", "corvette", the phrase "dodge charger", and "mustang" will come back as well. In the same way, when a user does a search for "mustang", documents with "cars" and "automobile" will be returned.
Note that if a term appears in multiple circular lists, the terms in both lists become synonymous:
list: "cars, automobile, cars, corvette, dodge charger, mustang"
list: "mustang, ford, convertible, sports car"
A thesaurus search for "mustang" becomes an OR search for each list:
(cars, automobile, cars, corvette, "dodge charger", mustang) <or>
(mustang, ford, convertible, "sports car")
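The expansion described above can be modeled in a few lines of Python. This is only an illustrative sketch of the behavior, not Verity's implementation: the function names (parse_list, expand, to_query) are hypothetical, and the model drops the repeated "cars" entry when it parses a line.

```python
def parse_list(line):
    """Parse one detail line, e.g. list: "cars, automobile, mustang"."""
    body = line.split(":", 1)[1].strip().strip('"')
    terms = []
    for raw in body.split(","):
        term = raw.strip()
        if term and term not in terms:  # drop repeated entries such as the second "cars"
            terms.append(term)
    return terms


def expand(term, lists):
    """Return one group per list that contains the term (circular lists only)."""
    groups = [group for group in lists if term in group]
    return groups or [[term]]  # a term with no synonyms expands to itself


def to_query(groups):
    """Render the groups as an OR of lists, quoting multi-word phrases."""
    def quote(term):
        return f'"{term}"' if " " in term else term
    return " <or> ".join(
        "(" + ", ".join(quote(t) for t in group) + ")" for group in groups
    )


if __name__ == "__main__":
    lists = [
        parse_list('list: "cars, automobile, cars, corvette, dodge charger, mustang"'),
        parse_list('list: "mustang, ford, convertible, sports car"'),
    ]
    print(to_query(expand("mustang", lists)))
```

Because "mustang" appears in both lists, the sketch returns two groups joined by <or>; a term such as "ford" that appears in only one list returns a single group.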
All of the terms are, by default, circular. It's possible to specify a 'one way' relationship for the terms on a given line by adding a /keys modifier:
list: "manufacturers, bmw, ford, toyota, honda"
/keys = "manufacturers"
Once this thesaurus file is active, searches for "manufacturers" would find "bmw" and "ford" as evidence; but a search for "toyota" would not find any documents based on "honda".
Generally, we recommend using the thesaurus as a synonym file with each item being equivalent. Also, we suggest that lists be self-contained, with no terms spanning multiple lists unless necessary.
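To make the difference between a circular list and a 'one way' list with /keys concrete, here is a small Python model of a single list. It is a sketch of the behavior described above, not of how the engine works internally, and the function name expand_one_list is hypothetical.

```python
def expand_one_list(term, terms, keys=None):
    """Model how one thesaurus list expands a search term.

    terms: the entries on the list line.
    keys:  the entries named in /keys, or None for a plain circular list.
    """
    if term not in terms:
        return [term]          # the list does not apply to this term
    if keys is None or term in keys:
        return list(terms)     # circular list, or the term is a key: expand fully
    return [term]              # one-way list and the term is not a key


if __name__ == "__main__":
    makers = ["manufacturers", "bmw", "ford", "toyota", "honda"]
    print(expand_one_list("manufacturers", makers, keys=["manufacturers"]))
    print(expand_one_list("toyota", makers, keys=["manufacturers"]))
    print(expand_one_list("toyota", makers))  # same list without /keys
```

With the key in place, only "manufacturers" pulls in the other entries; without it, "toyota" would expand to the whole list, including "honda".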
2. Compile the source file
Once you have created a thesaurus source file, you are ready to compile it.
Open a command window (cmd.exe on Windows servers), and change to the working directory where you saved the source thesaurus file.
Verify that the Verity binary directory is in your path, and enter:
mksyd -f src_file_name.ctl -syd vdk30.syd
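If you rebuild the thesaurus regularly, the compile step can be scripted. The Python sketch below simply wraps the command shown above; the function names are hypothetical, and it makes no assumption about mksyd's exit codes, so you should still read the captured output for errors or warnings.

```python
import shutil
import subprocess


def mksyd_compile_cmd(source_file, syd_file="vdk30.syd"):
    """Build the compile command exactly as given in the article."""
    return ["mksyd", "-f", source_file, "-syd", syd_file]


def compile_thesaurus(source_file, syd_file="vdk30.syd"):
    """Run mksyd from the current working directory and return its result."""
    if shutil.which("mksyd") is None:
        # Mirrors the manual check that the Verity binary directory is in the path.
        raise FileNotFoundError("mksyd not found; add the Verity binary directory to PATH")
    # The meaning of the exit code is not assumed here; inspect stdout and stderr.
    return subprocess.run(
        mksyd_compile_cmd(source_file, syd_file),
        capture_output=True,
        text=True,
    )
```

A call such as compile_thesaurus("src_file_name.ctl") returns an object whose stdout and stderr attributes hold whatever mksyd reported.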
3. Move the compiled source file into the Verity directory structure
Once the SYD file is compiled with no errors or warnings, you can copy the binary file (vdk30.syd) into the active Verity directory. Normally, this is the verity\k2\common\english directory.
Before you copy the file, be sure to make a backup copy of the existing vdk30.syd file in the Verity directory structure. Because Verity keys on the file type/extension SYD, it is best to either move the existing SYD file into a different directory; or to rename it to a file with a different extension.
What we generally recommend is to rename older SYD files based on a version number, prepended to the file extension SYD. For example, we would name the backup of our original vdk30.syd as vdk30.v0syd.
Once you have saved the previous version of the file, simply copy the new binary file into the verity\k2\common\english directory. Any <THESAURUS> searches will now use the new file.
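The backup-and-copy routine can also be scripted so the rename is never skipped. This is a minimal Python sketch under a few assumptions: the function name install_syd is hypothetical, the version number is chosen by the caller, and target_dir is whatever your installation uses for the verity\k2\common\english directory.

```python
import shutil
from pathlib import Path


def install_syd(new_syd, target_dir, version):
    """Back up the live vdk30.syd as vdk30.v<version>syd, then copy in the new file."""
    target_dir = Path(target_dir)
    live = target_dir / "vdk30.syd"
    backup = None
    if live.exists():
        # Use an extension other than .syd so Verity does not pick up the backup.
        backup = target_dir / f"vdk30.v{version}syd"
        if backup.exists():
            raise FileExistsError(f"{backup} already exists; choose a new version number")
        live.rename(backup)
    shutil.copy2(new_syd, live)
    return backup
```

For the first replacement you might call install_syd("vdk30.syd", target_dir, 0), which leaves the original file beside the new one as vdk30.v0syd.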
Un-compiling Thesaurus Files
Verity supports the ability to extract the original source from an existing binary SYD file. To do so, open a command window and, with the Verity binary in your path, run:
mksyd -dump -syd vdk30.syd -f src_file_name.ctl
You can edit the source, adding or deleting terms as you see fit.
We recommend against using or extending the standard Verity English thesaurus, because most sites want their own specific vocabulary in the thesaurus file. For example, one entry in the standard English file is a line that includes:
list: "large,general,broad,overall,extended,extensive,global"
Thus any search that included large (as in "large sizes") would also weight documents that contain the terms "general", "overall" and "broad".
Remember, thesaurus terms will only impact your results if you use the <THESAURUS> operator in your user query. Since very few of your users will want to do that themselves, be sure your search script applies the <THESAURUS> operator to the user's query terms. See our article on Intelligent Query Pre-Processing in our August/September Enterprise Search last year.
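As a rough illustration of that pre-processing step, the Python sketch below prefixes each term the user typed with the <THESAURUS> operator. The function name add_thesaurus is hypothetical, and combining the terms with <AND> is an assumption made for the example; use whatever operator your search script already applies between terms.

```python
def add_thesaurus(user_query, joiner="<AND>"):
    """Prefix each plain query term with <THESAURUS> before sending it to K2."""
    if "<" in user_query:
        # Assumption: the user already wrote explicit operators, so leave the query alone.
        return user_query
    terms = user_query.split()
    return f" {joiner} ".join(f"<THESAURUS>{term}" for term in terms)


if __name__ == "__main__":
    print(add_thesaurus("large sizes"))
```

A query typed as large sizes is sent to the engine as <THESAURUS>large <AND> <THESAURUS>sizes, while a query that already contains operators passes through unchanged.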
Write us with your Enterprise Search question at support@ideaeng.com.