Stemming isn't working in K2 V6.5! - Ask Dr. Search
Last Updated Aug 2009
This month's Dr Search issue addresses a bug that the doctor discovered while working with a customer who had recently updated from Verity K2 V5.5 to Autonomy K2 V6.5. Note this is not the Autonomy IDOL-based K2. It is the most recent - and perhaps last - version of K2 based on the Verity kernel. This was also posted last week to the Autonomy sub-group of SearchDev.org.
Question:
After I upgraded to K2 Version 6.5, stemming doesn't work! The testqp utility shows that <many><stem> is in use, but even when I use rcvdk and enter my search using <many><stem>, my search finds exact matches only. What have I done wrong?
Dr. Search answers:
Well, this one stumped the old doctor. I assumed it was user error until I actually built a test collection on my own and discovered that - lo! - stemming did not work. Just to convince myself I was sane - technically - I went back and created the same collection in version 5.5 and version 6.1 and stemming worked as expected.
I've got to tip my hat to the Autonomy support person who took the initial bug report. After a day or so, with a copy of our replication collection, he came back with the suggestion: use the english or englishx locale, not the uni locale that all of the K2 tools assume is default.
Sure enough, we created a collection using english, and all of a sudden, stemming worked again! Some words may have been muttered under the doctor's breath.
Warning: Even with the english locale, you may still see stemming problems with short words, perhaps words of four or fewer characters. We've also seen some odd word parsing. Check the SearchDev site for updates, or email Dr Search.
Changing Locales
Remember, when you change locales, you need to specify the new locale in any place K2 might need to know: every command line utility, and the dashboard screen as you initially attach the collection. Oh, and don't forget to set the locale properly in your C# (or Java) code. Fortunately, K2 gives you a warning, not just an empty result set that could lead to cardiac arrest!
When you initially build the english collection, remember to specify the locale on the command line:
mkvdk -collection testcoll -style styles/fsusec -locale english -create
This brings the doctor to one more little frustration. In the command line tools, you can (almost) always type the command name followed by '/?' to see the usage. For rcvdk, the usage message reports that the proper syntax for specifying a locale is:
rcvdk /locale=english testcoll
Not so, grasshopper. You have to use a hyphen, not a forward slash, in rcvdk:
rcvdk -locale english testcoll
It looks like other tools, including testqp, accept either the slash syntax from the usage message or the hyphenated syntax you see here.
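Since the locale has to match everywhere, one way to avoid a mismatch is to keep it in a single place and build every command from it. Here is a minimal Python sketch of that idea. The helper names (mkvdk_create_args, rcvdk_args) and the K2_LOCALE constant are hypothetical, and the only flags used are the ones shown in the commands above; the script just prints the command lines and does not run the K2 tools.

```python
# Sketch only: helper names are made up for illustration. The flags are
# copied from the mkvdk and rcvdk commands shown in this article.

K2_LOCALE = "english"  # assumption: one locale shared by every tool


def mkvdk_create_args(collection, style_dir, locale=K2_LOCALE):
    """Build the argument list for creating a collection with mkvdk."""
    return ["mkvdk", "-collection", collection, "-style", style_dir,
            "-locale", locale, "-create"]


def rcvdk_args(collection, locale=K2_LOCALE):
    """Build the argument list for rcvdk, using the hyphenated form."""
    return ["rcvdk", "-locale", locale, collection]


if __name__ == "__main__":
    # Print the commands rather than executing them.
    print(" ".join(mkvdk_create_args("testcoll", "styles/fsusec")))
    print(" ".join(rcvdk_args("testcoll")))
```

If you later switch to englishx, changing K2_LOCALE changes every generated command at once, which is the whole point.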
We've verified that this bug occurs in Version 6.5 on Windows. If you use that version on Solaris or Linux, we'd suggest you test your own results: you may be losing data that you didn't know you were missing, which can be important in marbles, horseshoes, and nuclear power plants.
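If you want to make that test repeatable, the check itself is simple: run the exact-word search and the <many><stem> search, note which documents come back, and compare against documents you know contain only an inflected form of the word. The Python sketch below assumes you have already collected the document keys yourself (for example, by hand from rcvdk); the function name check_stemming and the document keys are hypothetical, and nothing here parses K2 output.

```python
def check_stemming(exact_hits, stem_hits, variant_docs):
    """Compare hand-collected result sets for one test word.

    exact_hits   -- keys returned when searching for the exact word
    stem_hits    -- keys returned for the <many><stem> search
    variant_docs -- keys of documents known to contain only a variant
                    of the word (e.g. "running" when searching "run")
    """
    exact, stem, variants = set(exact_hits), set(stem_hits), set(variant_docs)
    missed = sorted(variants - stem)  # variants the stem search failed to find
    return {
        "missed_variants": missed,
        "exact_only": stem == exact,  # the symptom described in the question
        "ok": not missed,
    }


if __name__ == "__main__":
    # Hypothetical keys: doc1 has the exact word, doc2 and doc3 only variants.
    broken = check_stemming(["doc1"], ["doc1"], ["doc2", "doc3"])
    working = check_stemming(["doc1"], ["doc1", "doc2", "doc3"], ["doc2", "doc3"])
    print(broken["ok"], working["ok"])  # prints: False True
```

A result with exact_only set to True and a non-empty missed_variants list matches the symptom in the question: the stemmed search is behaving like an exact match.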
Don't forget to send your technical questions to Ask Dr Search!