Login Security
Forms-based security is another way to hide content from your search
engine, even if you don't mean to. Consider the page shown in Figure 8.
Figure 8: Login Page
The content owner certainly wants only valid users to see the content; but
in many cases, wouldn't they like their search engine to at least show
teasers for secured content to a searcher who has not yet logged in?
Of course, the solution to this problem is to configure your spider to process
challenge-response forms security; but that's another thing you need to consider
as you set up your spider.
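To make the idea concrete, here is a minimal Python sketch of what "processing forms security" amounts to: submit the login form once, keep the session cookie, and reuse it for every later fetch. The login URL and the field names (username, password) are hypothetical placeholders, not taken from any particular site or spider; real login forms often add hidden fields or tokens, and most commercial spiders expose this as a configuration setting rather than code.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical values: substitute your site's login URL and form field names.
LOGIN_URL = "https://www.example.com/login"
USER_FIELD = "username"
PASS_FIELD = "password"


def build_login_request(login_url, username, password):
    """Build the POST request that submits the login form."""
    form = urllib.parse.urlencode({USER_FIELD: username, PASS_FIELD: password})
    return urllib.request.Request(login_url, data=form.encode("utf-8"), method="POST")


def make_authenticated_opener(login_url, username, password):
    """Log in once and return an opener that replays the session cookie."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # The server's Set-Cookie response is stored in the jar for later requests.
    opener.open(build_login_request(login_url, username, password), timeout=10)
    return opener
```

A spider built this way would call make_authenticated_opener(LOGIN_URL, "spider", "...") once, then fetch secured pages with the returned opener's open() method.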
Which leads me to one final issue related to spiders and web servers: the error page.
As you probably know, when a user asks for a URL that does not exist, the web server
will generally return a status of "404" - Page not found - as opposed to a normal
status "200", which indicates no errors.
Companies like to make their web sites friendly, so many have started creating a "friendly" error page for the times that a URL doesn't exist. Figure 9 shows one such page.
Figure 9: Friendly Error Page
As with so many other things, there is a right way and a wrong way to handle
this situation. One way, of course, is to configure your web server to redirect to
a friendly page whenever there is an error. But often this "error" page returns
a status of "200", because, in fact, the error reporting page displayed properly.
It's better to go into the guts of your web server and customize the default error page, so that the web server returns a "404" status on the friendly message page.
Otherwise, your spider will happily index as many non-existent pages as
you may have, and all your bad links will show up as perfectly fine pages,
all with the same friendly error message reporting to the user that the
page does not exist.
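The difference is easy to see in a small sketch. The Python handler below serves a friendly page for unknown URLs while keeping the "404" status, and probe_status reports the status code a spider would receive. The handler, the KNOWN_PATHS set, and the page bodies are illustrative assumptions; on a production web server you would get the same effect through its custom error page setting rather than through code like this.

```python
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler

FRIENDLY_BODY = b"<html><body><h1>Sorry, we can't find that page.</h1></body></html>"
KNOWN_PATHS = {"/", "/index.html"}  # hypothetical list of pages that exist


class FriendlyErrorHandler(BaseHTTPRequestHandler):
    """Serve a friendly page for unknown URLs, but keep the 404 status."""

    def do_GET(self):
        if self.path in KNOWN_PATHS:
            self._reply(200, b"<html><body>Home</body></html>")
        else:
            # Friendly message for people, honest status code for spiders.
            self._reply(404, FRIENDLY_BODY)

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def probe_status(url):
    """Return the HTTP status code a spider would see for url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is carried on the exception.
        return err.code
```

To audit a live site, call probe_status on a URL you know does not exist. If it comes back as 200 rather than 404, the friendly error page is hiding bad links from your spider.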
Call to Action
What can you do to see if any of these bad things are happening on your web site right now?
First, perform a "data audit".
Verify that all of the content that you have on your web site was successfully
indexed and added into the search index. Don't assume that, because the spider
finished and didn't report any errors, it got all of your content and
metadata - especially if you use any of the methods described here.
If you've indexed your public web site, use Google to check the number
of pages it has found, using the "site:" search modifier. For example,
to check our site, I can go to Google and search for:
site:www.ideaeng.com
Remember that Google considers www.ideaeng.com/index.html and
ideaeng.com/index.html as different pages, so to be safe check for both.
If your engine has far fewer pages than Google, you have a problem.
Next, try a few searches. Use your search engine to look for 'error 404'; hits
might indicate that your 'Page not found' display page is giving your spider
a normal "200" status code.
Do you have pull-down menus on some of your pages? Find one, and click on one
of the pages in the list. Then search your site for that page, and see whether
any links other than the pull-down menu lead to it.
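The same check can be roughed out in code. The Python sketch below parses one page's HTML and lists the URLs that appear only as pull-down menu entries, with no ordinary link pointing at them. It assumes the menu stores each target URL in the option's value attribute, which is common but not universal, and it looks at a single page; a fuller audit would collect ordinary links from across the whole site before comparing.

```python
from html.parser import HTMLParser


class LinkSourceParser(HTMLParser):
    """Collect URLs from <a href> links and from <option value> menu entries."""

    def __init__(self):
        super().__init__()
        self.anchor_urls = set()
        self.menu_urls = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.anchor_urls.add(attributes["href"])
        elif tag == "option" and attributes.get("value"):
            # Assumption: the menu keeps the target URL in the option's value.
            self.menu_urls.add(attributes["value"])


def menu_only_urls(html):
    """Return, sorted, the URLs reachable only through a pull-down menu."""
    parser = LinkSourceParser()
    parser.feed(html)
    return sorted(parser.menu_urls - parser.anchor_urls)
```

Any URL this returns is a candidate for a links page or landing page, since a spider that ignores menus has no other path to it.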
In Summary
As you can see, a good links page or landing page approach solves many of
these spider woes. Go through your code and identify content "behind"
these coding methods, and make sure it is all covered by some other
technique. And run those tests - you can't improve what you
don't measure.