Role of "connectors" in Enterprise Search

What's the Role of "connectors" in Enterprise Search? - Ask Dr. Search

This month a concerned reader contacted Dr. Search with the following email:

I am trying to get a better understanding on how the different enterprise search products fit within an architecture. The most obvious is GSA, in that it is simply a box that contains a central index of all searchable content within the enterprise. That index is built by either crawling or metadata tagging. Is there a set time that the index is refreshed or is that customizable?

Now, Autonomy and FAST seem to be a bit different, as they each have a concept of "connectors" that are specific to a particular data type. Each connector "pushes" data to the central index that aggregates the results and makes them available for use queries. Does that "push" process occur dynamically when the user search is initiated? Or does the user search the central index that is populated by each individual connector? If that is the case, can you customize how often the central index is refreshed?

Dr Search replies:

That's a great question! Actually it's three questions, so I guess no ducking out of the office early today!

Some vendors talk about "connectors" and others do not. Is there some big advantage to using one?
How often do search engines go back to look for new content?
Does the use of Connectors change this?

Let's tackle question 2 first:

Most modern web spiders have similar techniques for dealing with web content; they differ a bit in which pages they revisit and when, but these are details. Some of them attempt to use "incremental" indexing for web content by using a more elaborate revisit schedule.

Most spiders will use one or more of these revisit techniques:

Revisit all the pages and reindex everything. This is the "brute force" way
Revisit all the pages, but carefully check to see if they have changed. Although the same amount of content is downloaded, the system saves processing time by not reindexing pages that have not changed. Some vendors call this "incremental indexing"
Query the web server about each page, to see if it has changed. Older spiders did this with the HTTP "HEAD" command; if the page did change, then they re-request it from the server in a second transaction. Newer spiders use a more sophisticated method that effectively says "give me this page again if it has changed since this date", and the spider provides that last date it fetched the page.
Re-download or re-check each page on its own schedule. The system maintains a database of guesses for when it believes each page is likely to change, and then adjusts future guesses for each page when it finds out how far off it was. This is what some vendors call "incremental" indexing. Notice that if a page had been unchanged for many spider visits, and then is suddenly changed, the spider may take a long time to notice.
Re-download or re-check certain parts of the web site more often than others. For example, check the home page every hour, but only redo the archives once a week.
Have a special way for the web server to tell the spider what has changed. Ultraseek does this with the sitelist.txt file, and modern blogs do this with "blog pings"

The administrator can usually adjust how often a spider revisits pages, though different spiders offer different levels of control. Most of these methods have some serious limitations. Also notice that different vendors use the word "incremental" to refer to different techniques. If incremental spidering is important to your application, make you each vendor that offers it to you clearly explains what they mean.

Now for questions 1 and 3, what's up with these "Connectors" anyway?

A more fundamental difference, as you suggest, is the "connector" architecture that some vendors offer. Generally speaking, native repository connectors offer better integration than generic "web crawling".

Let's assume I have CMS (Content Management System) application, which I'll call "XYZ Super CMS" There are at least 8 ways I could theoretically search that data!

Continue to just use the web spider, pretending to be a web browser user. The spider will see and index the same HTML content that a user would see.
Use the proper XYZ connector from the search vendor. This keeps the search engine index up to date all the time.
Go after the tables with the generic "database" connector; since most CMS systems use a database as the back end
Export a "dump" of XYZ into XML or some text format, then have the search engine index those exported files.
Import the data from the CMS into a search enabled database such as Oracle.
Write a custom connector with the search vendor's API to "inject" data.
Don't index. Just use "Federated Search". Use the CMS's built in search capability, and then just combine those search results with all the other search results.
Try to cheat and use the CMS's search engine index directly. For example, if your search engine used K2, and you also use K2, then try to attach to their K2 collection and search it along with your other mounted collections.

Phew! So many choices! There are limitations and tradeoffs for all of these methods. I'll summarize the pros and cons here, but if anybody needs more details on a particular method, please drop us a line.

Method 1 misses a lot of Meta data, including document level security. Also, method 1 must re-poll to check if a document has changed.

Method 2, on the other hand, will have access to all the data, and also know precisely when a document has changed. And yes, as you suggest, most connectors know precisely when content has changed, and will immediately send that change to the search engine. Technically speaking, some connectors still use "polling" vs. direct "push", but it's likely to be a very concise form of efficient polling, so is still a vast improvement over the normal polling a generic web spider would use.

If available, method 2, using a specific connector, would be the preferred method.

Method 3, using a generic database connector, will be more complicated to set up. Do-able, but more complicated. You would still need to get your search vendor's database connector; they might also refer to it as their ODBC connector or database gateway. Method 3 would also be a potential workaround if the search engine vendor doesn't offer a specific connector for the XYZ CMS. See Clinton Allen's article, also in this issue, for an example of this technique using K2 and ODBC.

All of the other methods mentioned above have more serious limitations or added complexity, and might prove problematic in a production environment. Methods 4 and 5 are "batch oriented" so would likely be even less responsive, and also somewhat complicated to setup. Methods 6 and 7 are generally complicated and will usually require some formal programming; therefore they should be considered only when all other techniques are unfeasible. Method 8 requires a coincidence of you and your CMS vendor both using the same search engine, and usually the same version. Even if all that worked out, there may be enforced licensing issues or other technical issues that make it a long shot.

Therefore method 2, using a connector, would be preferred because:

It has access to complete meta data, including ACL info.
It knows precisely when to grab a document (either via push or very efficient poll methods)

Two issues you might face however would be:

The specific connector you need may not exist. Remember, it needs to be specific to both the XYZ CMS system and your particular search engine platform. Generally you would contact the search vendor, vs. the CMS vendor. If it doesn't exist, consider option 3, using a more generic database connector, as a workaround.
It might be an extra cost option from your search vendor. For some vendors it could be in the range of $25,000 to $100k. On the bright side, some vendors bundle in one connector license with the base engine, so check your original purchase paperwork to see if you've already paid for it.

The details of installing connectors do vary widely by search engine vendor and by the specific connector, but it would generally be expected to perform better than using the web spider / crawler. Given the potential complexity, you may want to get some help setting it up.

We hope this has been useful to you; feel free to contact Dr. Search directly if you have any follow-up or additional questions. Remember to send your enterprise search questions to Dr. Search. Every entry (with name and address) gets a free cup and a pen, and the thanks of Dr. Search and his readers.