Ask Dr. Search: Creating Collections with the Google Appliance
Last Updated Mar 2009
Volume 5 Number 2 - February/March 2008
A reader asks:
We love our new Google Appliance, but sometimes we only want to search over part of our web site, vs. everything. For example, maybe let somebody search within just the Tech Support pages. In our old search engine I would have created different "collections".
Dr. Search replies:
Sure, the Google Appliance (we call it the GSA) supports this type of thing, and they even call it "collections" too.
When you're logged into your GSA Admin screen, the menus are on the left. These are actually nested menus, and when you click on a top level item, it expands to show all the choices. In this case you'll want to click the main "Crawl and Index" menu. When this opens up, click on the Collections link (it's at the very bottom). Take a look at this first image and you can see how we arranged the different sets of data on our site.
Logically speaking, we have three main sets of data:
- Our web site
- Our blog
- This Newsletter
But you'll notice that's not quite what's listed. Since our Newsletter lives on our web site, a full index of the site will include its contents as well. We'll get to how to handle that in a minute.
Creating a New Collection
When you want to create a new collection, you start the process on this same screen, in the list of collections. Look for where it says "Create New Collection" and fill in a new collection name. Let's pretend I typed in the name "foo" and clicked the Create button. And then... I am returned to the list of collections, with a new foo entry in the list. This may seem a bit odd; you then need to click Edit to get to the settings screen.
Important Note # 1: GSA Admin Behavior
The GSA Admin UI tends to bring you back to the same screen you were just looking at. That's OK! If there's no error, it means whatever you just did worked. And the Create New Collection task is a great example.
As we said above, you need to now click on the Edit link to the right of foo in the Collections list to get to the next screen!
As you can see, I just gave it the starting URL. This can point to the top level of a site, or even a subdirectory. I could have just as easily said http://acme.com/support/.
Important Note # 2: Collections Not Instant
When you define collections, the effect is not immediate. It takes the GSA a while to re-index those pages and tag them with the new collection.
You might expect, since the collections are based on URLs, and Google lets you search by URL, that this change would take effect immediately.
But instead, Google handles this the next time it indexes the pages, by assigning a specific collection tag to each spidered page. It's more efficient for Google to search against this collection tag than to compare long URLs to prefixes over and over again every time a search is run.
Although URL searching is OK for ad-hoc searches, using collection tags is better for speed and testing. The only tradeoff is that you do have to wait for the pages to be reindexed.
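To make that tradeoff concrete, here is a minimal sketch in Python of the general idea: tag each page once at index time, then answer collection searches with a simple lookup. This is not how the GSA is implemented internally. The collection names "website" and "newsletter", the page file names, and all function names are hypothetical, and matching is simplified to plain URL prefixes.

```python
# Hypothetical sketch only: names and prefixes are illustrative, not GSA internals.
COLLECTION_PREFIXES = {
    "website": ["http://ideaeng.com/"],
    "newsletter": ["http://ideaeng.com/pub/entsrch/"],
}


def tag_page(url, prefixes=COLLECTION_PREFIXES):
    """Return the sorted names of every collection whose prefix matches url."""
    return sorted(
        name
        for name, starts in prefixes.items()
        if any(url.startswith(p) for p in starts)
    )


def build_index(urls):
    """Tag each crawled URL once, at index time, and group URLs by tag."""
    index = {}
    for url in urls:
        for name in tag_page(url):
            index.setdefault(name, set()).add(url)
    return index


def search_collection(index, collection):
    """Query-time lookup by tag; no URL prefix comparisons are needed here."""
    return index.get(collection, set())
```

Notice that a page under the newsletter directory picks up both tags, and that a newly defined collection stays empty until build_index runs again. That is the same waiting period described above.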
Testing The Results
After you've given Google a chance to re-spider the content, you can run test searches for a common word against the various collections to see the different matching counts. I like to have a separate browser window open to do the testing, so I can have the Google admin up at the same time. Having 3 screens certainly comes in handy.
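If you would rather script those test searches than click through them, something like the following Python sketch may help. It assumes the appliance answers search requests at /search and accepts the q, site (collection), client (front end), and output parameters, and that the XML results carry the estimated match count in an M element inside RES. Check these against the search protocol documentation for your software version. The hostname gsa.example.com and the function names are made up for illustration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

GSA_HOST = "gsa.example.com"  # hypothetical appliance hostname


def collection_search_url(query, collection, frontend="default_frontend"):
    """Build a search URL restricted to one collection via the site parameter."""
    params = {
        "q": query,
        "site": collection,       # the collection to search
        "client": frontend,       # the front end to use
        "output": "xml_no_dtd",   # ask for raw XML results
    }
    return "http://%s/search?%s" % (GSA_HOST, urlencode(params))


def estimated_matches(xml_text):
    """Read the estimated result count, assuming it lives in <RES><M>...</M></RES>."""
    root = ET.fromstring(xml_text)
    count = root.find("./RES/M")
    return int(count.text) if count is not None else 0
```

Fetching the URL for the same common word against each collection and comparing the counts gives you the same check as the manual test, just in a form you can repeat after every recrawl.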
An even more interesting way to check the results is to look at the Crawl Diagnostics screen, under the Status and Reports menu.
Important Note # 3: Check the Drop-Down list
In several places in the GSA Admin, there is a drop down list in the upper right corner that controls which item you're looking at. For example, in the Crawl Diagnostics screen above, notice the drop down list for "Show Diagnostics for Collection" that I have set to default_collection. It's an easy thing to miss if you're new to the UI. If your numbers don't look right, this is something to double check.
Excluding Pages from a Collection
I mentioned before that this newsletter is part of our web site. But if I wanted to split that into two collections, there's a bit of a trick to it.
It's easy enough to create a collection for the Newsletter itself; the starting URL is just http://ideaeng.com/pub/entsrch/. But the main collection, starting at http://ideaeng.com/, will also contain those pages. To fix this, in the screen below, I've defined a collection that is "all of the NIE web site, EXCEPT the Newsletter".
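The logic behind "everything EXCEPT the Newsletter" can be sketched in a few lines of Python. This is only an illustration of the include-then-exclude idea, simplified to plain URL prefixes; the appliance's own pattern syntax is richer, and the function name in_collection and the sample page names are hypothetical.

```python
def in_collection(url, include_prefixes, exclude_prefixes=()):
    """Return True if url matches an include prefix and no exclude prefix."""
    # Exclusions win: a page under an excluded prefix is never a member.
    if any(url.startswith(p) for p in exclude_prefixes):
        return False
    return any(url.startswith(p) for p in include_prefixes)


# "All of the web site, EXCEPT the Newsletter", expressed as prefixes.
SITE_INCLUDE = ["http://ideaeng.com/"]
SITE_EXCLUDE = ["http://ideaeng.com/pub/entsrch/"]
```

With these two lists, in_collection("http://ideaeng.com/about.html", SITE_INCLUDE, SITE_EXCLUDE) is True, while any URL under http://ideaeng.com/pub/entsrch/ comes back False and so belongs only to the Newsletter collection.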
Handling Duplicate Pages (www vs. non-www URLs)
And finally, like any good doctor, I can't help offering additional advice that you never asked for.
A common problem that may come up on your site, and that Google has a really easy fix for, is how to handle duplicate results that are caused by URL variations.
For example, our home page can be reached from both of these URLs: http://ideaeng.com and http://www.ideaeng.com. Some sites handle this with 301 redirects, but in our case, our web server will simply return the page for either URL you type in.
To fix this, we return to the Crawl and Index menu and select the Duplicate Hosts screen, about 2/3rds of the way down. Here you can see I've told it that the preferred URL is the one WITH the www prefix. It will now translate any URL it sees starting with just ideaeng.com as if it were www.ideaeng.com. Also notice that you do NOT use the http:// prefix or any trailing slashes on this screen.
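If it helps to see the translation spelled out, here is a rough Python sketch of what a duplicate-host mapping does to each URL. It illustrates the idea only and is not the appliance's own code; the table PREFERRED_HOSTS and the function canonical_url are hypothetical names. As on the Duplicate Hosts screen, the mapping holds bare host names, with no http:// prefix and no trailing slash.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping: duplicate host name -> preferred host name.
PREFERRED_HOSTS = {"ideaeng.com": "www.ideaeng.com"}


def canonical_url(url, preferred=PREFERRED_HOSTS):
    """Rewrite the host of url to its preferred form, leaving the rest alone."""
    parts = urlsplit(url)
    host = parts.hostname or ""          # hostname is lowercased by urlsplit
    target = preferred.get(host, host)   # unknown hosts pass through unchanged
    # Keep an explicit port if one was given (any user:password part is dropped).
    netloc = target if parts.port is None else "%s:%d" % (target, parts.port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Both http://ideaeng.com/pub/entsrch/ and http://www.ideaeng.com/pub/entsrch/ come out as the www form, so the two variations collapse into a single result.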
I'm glad you're enjoying your new Google Appliance; it's a great fit for some companies. Keep those emails coming!