NIE Newsletter
What's the Role of "Connectors" in Enterprise Search? - Ask Dr. Search

By: Mark Bennett, Volume 4 Number 2 - April 2007 (Last Updated Mar 2009)

This month a concerned reader contacted Dr. Search with the following email:

I am trying to get a better understanding of how the different enterprise search products fit within an architecture. The most obvious is GSA, in that it is simply a box that contains a central index of all searchable content within the enterprise. That index is built by either crawling or metadata tagging. Is there a set time at which the index is refreshed, or is that customizable? Now, Autonomy and FAST seem to be a bit different, as they each have a concept of "connectors" that are specific to a particular data type. Each connector "pushes" data to the central index, which aggregates the results and makes them available for user queries. Does that "push" process occur dynamically when the user search is initiated? Or does the user search the central index that is populated by each individual connector? If that is the case, can you customize how often the central index is refreshed?

Dr. Search replies: That's a great question! Actually, it's three questions, so I guess there's no ducking out of the office early today!
Let's tackle question 2 first: most modern web spiders use similar techniques for dealing with web content; they differ a bit in which pages they revisit and when, but these are details. Some of them attempt "incremental" indexing of web content by using a more elaborate revisit schedule. Most spiders use one or more revisit techniques to decide when to re-fetch a page.
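As a minimal sketch of one common revisit heuristic, consider a spider that adaptively widens or narrows a page's revisit interval depending on whether the page changed since the last visit. The interval bounds and growth factors here are illustrative assumptions, not any specific vendor's algorithm:

```python
from dataclasses import dataclass

# Illustrative bounds, not from any real spider's configuration.
MIN_INTERVAL_H = 1        # revisit at most hourly
MAX_INTERVAL_H = 7 * 24   # revisit at least weekly

@dataclass
class PageSchedule:
    """Tracks how often the spider should revisit one page."""
    url: str
    interval_hours: float = 24.0  # start with a daily revisit

    def record_visit(self, content_changed: bool) -> None:
        if content_changed:
            # Page looks volatile: check back sooner next time.
            self.interval_hours = max(MIN_INTERVAL_H, self.interval_hours / 2)
        else:
            # Page looks static: back off to save crawl bandwidth.
            self.interval_hours = min(MAX_INTERVAL_H, self.interval_hours * 1.5)

page = PageSchedule("http://example.com/news.html")
page.record_visit(content_changed=True)
print(page.interval_hours)  # 12.0 (halved after a detected change)
```

This kind of schedule is why "incremental" spidering still re-fetches pages that haven't changed: the spider can only discover a change by visiting, so its responsiveness is bounded by the revisit interval.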
The administrator can usually adjust how often a spider revisits pages, though different spiders offer different levels of control. Most of these methods have some serious limitations. Also notice that different vendors use the word "incremental" to refer to different techniques; if incremental spidering is important to your application, make sure each vendor that offers it clearly explains what they mean.

Now for questions 1 and 3: what's up with these "connectors" anyway? A more fundamental difference, as you suggest, is the "connector" architecture that some vendors offer. Generally speaking, native repository connectors offer better integration than generic web crawling. Let's assume I have a CMS (Content Management System) application, which I'll call "XYZ Super CMS". There are at least 8 ways I could theoretically search that data!
Phew! So many choices! There are limitations and tradeoffs for all of these methods. I'll summarize the pros and cons here, but if anybody needs more details on a particular method, please drop us a line.

Method 1 misses a lot of metadata, including document-level security. Also, method 1 must re-poll to check whether a document has changed. Method 2, on the other hand, will have access to all the data, and will also know precisely when a document has changed. And yes, as you suggest, most connectors know precisely when content has changed, and will immediately send that change to the search engine. Technically speaking, some connectors still use "polling" rather than a direct "push", but it's likely to be a very concise form of efficient polling, so it is still a vast improvement over the normal polling a generic web spider would use. If available, method 2, using a specific connector, would be the preferred method.

Method 3, using a generic database connector, will be more complicated to set up. Doable, but more complicated. You would still need to get your search vendor's database connector; they might also refer to it as their ODBC connector or database gateway. Method 3 would also be a potential workaround if the search engine vendor doesn't offer a specific connector for the XYZ CMS. See Clinton Allen's article, also in this issue, for an example of this technique using K2 and ODBC.

All of the other methods mentioned above have more serious limitations or added complexity, and might prove problematic in a production environment. Methods 4 and 5 are batch oriented, so they would likely be even less responsive, and also somewhat complicated to set up. Methods 6 and 7 are generally complicated and will usually require some formal programming; they should be considered only when all other techniques are infeasible. Method 8 requires the coincidence of you and your CMS vendor both using the same search engine, and usually the same version.
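To make the "efficient polling" of method 2 concrete, here is a minimal sketch of a connector that asks the repository for its change log since the last poll and pushes only the changed documents (with their metadata) to the index, rather than re-crawling everything. `XYZCmsClient` and `SearchIndex` are hypothetical stand-ins for vendor APIs, not any real product's interface:

```python
class SearchIndex:
    """Stand-in for the search engine's ingestion API (hypothetical)."""
    def __init__(self):
        self.docs = {}

    def push(self, doc_id, content, metadata):
        # Connectors can carry repository metadata (ACLs, authors, dates)
        # that a generic web spider would miss.
        self.docs[doc_id] = (content, metadata)

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

class XYZCmsClient:
    """Stand-in for the CMS API, assumed to expose a change log."""
    def __init__(self, change_log):
        # change_log: list of (timestamp, action, doc_id, content, metadata)
        self._change_log = change_log

    def changes_since(self, last_poll):
        return [c for c in self._change_log if c[0] > last_poll]

def sync(cms, index, last_poll):
    """One poll cycle: apply every change newer than the high-water mark."""
    latest = last_poll
    for ts, action, doc_id, content, meta in cms.changes_since(last_poll):
        if action == "delete":
            index.delete(doc_id)
        else:  # "add" or "update"
            index.push(doc_id, content, meta)
        latest = max(latest, ts)
    return latest  # new high-water mark for the next poll

cms = XYZCmsClient([
    (100, "add", "doc1", "Quarterly report", {"author": "jsmith"}),
    (105, "update", "doc1", "Quarterly report v2", {"author": "jsmith"}),
    (110, "delete", "doc2", None, None),
])
index = SearchIndex()
high_water = sync(cms, index, last_poll=0)
print(high_water)  # 110
```

The key design point is the high-water mark: each poll only asks "what changed since timestamp T?", so the cost per cycle is proportional to the number of changes, not the size of the repository. That is why even a polling connector is a vast improvement over a spider's full revisit.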
Even if all that worked out, there may be licensing restrictions or other technical issues that make it a long shot. Therefore method 2, using a connector, would be the preferred approach.
There are a couple of issues you might face with connectors, however.
The details of installing connectors vary widely by search engine vendor and by the specific connector, but a connector would generally be expected to perform better than the web spider / crawler. Given the potential complexity, you may want to get some help setting it up.

We hope this has been useful to you; feel free to contact Dr. Search directly if you have any follow-up or additional questions. Remember to send your enterprise search questions to Dr. Search. Every entry (with name and address) gets a free cup and a pen, and the thanks of Dr. Search and his readers.