Enterprise Search Blog

5 ways to hide content from your search engine (and why you shouldn't)

Ten years ago, the web was a wild place: new content development systems everywhere, booming expansion of standard HTML and scripting languages - not to mention cool graphic environments like Flash. Many of today's corporate web developers learned their trade during that dot-com boom.

Why is that a problem?

Back then, search was new on the internet. Google was still offering 'Stanford Search' and 'Linux Search' on its early homepage, with a rather clunky-looking graphic compared to today's look (Figure 1, thanks to Archive.org's Wayback Machine and Google Blogoscoped).

Figure 1: Google circa 1998

Few people were interested in having the deep content on their web sites indexed - often, just having a search engine find your front page was exciting.

Those of us who were really into search - yes, some of us had already worked for search companies for more than five years - saw the problem coming, but it was hard to get anyone interested in it. After all, people could find your home page! After that, it was up to the webmaster to lead them to the right content.

Skip ahead to 2008.

The problem is that many web developers and web agencies didn't understand that designing a web site in certain ways makes it impossible for a search engine to FIND the content. It didn't work then, and it doesn't work now. And I'll bet you didn't spend all that money on search technology just to have frustrated users.

Common Problems

So what are the common errors that keep your search engine from finding your content? Here they are, in no particular order:

- Pull-down lists
- Text in images or video clips
- Badly configured robots.txt
- Dynamic URLs and JavaScript
- Login security

There are many, many more ways to keep your content out of your company search indices. But patience, grasshopper: Rome wasn't searched in a day.

Pull-down Lists

Pull-down lists are a great way to show your users just what areas your site contains. You can provide direct links to Support, Marketing, HR, Development - whatever parts of your site have content that might be helpful. The problem is that your search engine spider not only doesn't see the text in your pull-down list, it also cannot follow those links to index the great content you have hiding behind the pull-down menus.

Jakob Nielsen and others have suggested that half of all web site users will use search if it is available - and we think that number may even be higher. If you run an intranet site, do you want to block your content from half of your employees? And if you run a customer-facing site, can you afford to lose half of your potential customers right away?

Figure 2 shows a home page on an IBM site designed to deliver the user to the correct page. It serves its function well - you see nothing until you select a country. But it also keeps Google, Yahoo!, and other major search engines out. Inside companies, we've seen this sort of thing not on the home page, but on support sites where a visitor is asked to pick a product or a category of questions.

Figure 2: Pull-Down Lists
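
The figure itself isn't reproduced here, but a country selector of this kind typically boils down to something like the sketch below (the option values and URLs are hypothetical). The destination pages exist only inside the JavaScript, so a spider that doesn't execute scripts never sees them:

     <select id="country" onchange="gotoCountry(this.value)">
       <option value="">Select your country</option>
       <option value="us">United States</option>
       <option value="de">Germany</option>
       <option value="jp">Japan</option>
     </select>

     <script type="text/javascript">
       // Redirects the browser to the page for the chosen country.
       // The URL is built at selection time, so it never appears as a
       // normal <a href="..."> link that a spider could follow.
       function gotoCountry(code) {
         if (code) {
           window.location.href = "/" + code + "/home.html";
         }
       }
     </script>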

If your site has pull-down options, you can solve your search engine problem without redesigning your site.

Identify all of the pages that your pull-down menus link to - do so for every type of link that hides your content, including the HTML OPTION tag and JavaScript 'onClick' links. Create a link page, or landing page, that links to all of those pages, and include it in your web crawler's 'start pages'. Note that you do not need to link to this page from any of your normal content, so no users will actually see it. But your spider will crawl the landing page you've created, and because of that all of your content gets indexed.
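
As a minimal sketch, assuming the hypothetical country pages from the example above, such a landing page simply lists every destination as an ordinary static link:

     <!-- spider-links.html: not linked from any user-visible page.      -->
     <!-- Add its URL to the crawler's list of start pages so the spider -->
     <!-- can reach the content hidden behind the pull-down menus.       -->
     <html>
       <body>
         <a href="/us/home.html">United States home</a>
         <a href="/de/home.html">Germany home</a>
         <a href="/jp/home.html">Japan home</a>
       </body>
     </html>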



Text in images or video clips

Some Internet sites that want to be perceived as modern tend to use a lot of images. Pictures of products in GIF or JPG format or, increasingly, animated scenes that really want you to know that the company and its products are cool. We're not against nice-looking web sites - in fact, we're often jealous of them - but images and Flash or Silverlight animations are not always friendly to web crawlers, both on the Internet and within your own search engine.

The good news is that companies like Adobe, which owns Flash, have realized that being friendly to search engines is important, and they've started including textual metadata within the animations. Silverlight, by the way, has been able to do so since its introduction.

And when it comes to static images, HTML has always had the ability to associate text with an image, using the "ALT" attribute. As an added benefit, having detailed ALT text helps your site comply with the Americans with Disabilities Act, because vision-impaired people using screen readers can understand what you are showing.
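
For example (the file name and descriptive text here are made up):

     <!-- Without ALT text, a spider sees nothing but a file name. -->
     <img src="/images/widget-2000.jpg">

     <!-- With descriptive ALT text, both spiders and screen readers
          know what the picture shows. -->
     <img src="/images/widget-2000.jpg"
          alt="Widget 2000 industrial control panel, front view">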

Consider the site in Figure 3. Very cool looking; very marketing and sales oriented; but unless they are using the latest Flash, no spiders are going to find the content hidden behind their home page.

Figure 3: Graphic Home Pages

You can view the page source in Figure 4 below. It's simple, clean, and easy to create and maintain: the work is in creating the graphics. But will a spider ever get past this page? No.

Figure 4: HTML for a Flash Page
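
The source isn't reproduced here, but a typical all-Flash home page of that era amounted to something like this sketch (file and company names are hypothetical): one OBJECT/EMBED block and no crawlable text or links at all.

     <html>
       <head><title>Acme Corporation</title></head>
       <body>
         <!-- All of the navigation and content live inside the .swf file,
              where a conventional spider cannot see or follow them. -->
         <object width="800" height="600">
           <param name="movie" value="home.swf">
           <embed src="home.swf" width="800" height="600"></embed>
         </object>
       </body>
     </html>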

As with the pull-down lists above, one solution is to create a landing page that links to all of the primary content on your site, and include that page in your spider/crawler configuration.



Badly configured robots.txt

There is a de facto standard on the web for excluding polite, participating web spiders and crawlers from parts of your site. But if you have a robots.txt file and don't have it configured correctly, you may be missing content in your internal search index.

First, note that robots.txt is not an internationally accepted standard: there is a proposed draft, 'A Method for Web Robots Control', that dates from 1997, but we believe no formal standardization effort is active as of 2008.

Next, note that honoring robots.txt is optional for every spider - the ones from places like Google and Yahoo! play nice; those from more nefarious operators may ignore your requests anyway.

So how can this affect your internal spider?

If you have a robots.txt file in the root of your web server's document directory to keep the good guys away from certain areas of your site, or to ask them to be nice and delay a bit between page requests, it's quite possible that your internal spider is honoring those requests as well - and skipping content you want indexed.

Consider the snippet from a robots.txt file we captured from a large search technology company shown in Figure 5.

 

Figure 5: robots.txt
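
The captured file isn't reproduced here, but the kind of snippet described would look something like this sketch (the paths are hypothetical):

     User-agent: *
     Crawl-delay: 10
     Disallow: /internal/
     Disallow: /partners/
     Disallow: /downloads/beta/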

You can see that the last three lines do request that external spiders not crawl parts of the site, but because the 'User-agent' field specifies all spiders, there is a good chance this company's internal crawler is going to skip that content as well.

There are two things you can do to overcome the problem.

First, almost every commercial web crawler lets you override its default behavior of following the directives in robots.txt. This is easy to do, applies globally to the pages your internal spider crawls, and generally has no downside.

Another choice would be to change robots.txt to specify potentially different rules for different spiders. For example, Figure 6 shows an expanded robots.txt that allows the internal spider, 'nie_crawl', to access the areas disallowed to all other (polite, participating) spiders.

 

Figure 6: robots.txt updated for internal spider
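
Again as a sketch with hypothetical paths, the updated file gives the internal spider its own section; crawlers that honor robots.txt apply the most specific matching 'User-agent' record, so 'nie_crawl' follows its own (empty) Disallow list and ignores the rules meant for everyone else:

     # Rules for the internal crawler: nothing is disallowed.
     User-agent: nie_crawl
     Disallow:

     # Rules for all other (polite, participating) spiders.
     User-agent: *
     Crawl-delay: 10
     Disallow: /internal/
     Disallow: /partners/
     Disallow: /downloads/beta/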



Dynamic URLs and JavaScript

Modern web browsers can handle a great deal of local processing, and popular technologies such as JavaScript and ASP provide that dynamic behavior. These technologies are not a problem in themselves: how developers use them can be.

Consider the code snippet in Figure 7.

 

Figure 7: Building Dynamic URLs in JavaScript
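
The original snippet isn't reproduced here, but code of the kind described usually looks something like this sketch (the function, field, and path names are hypothetical):

     <script type="text/javascript">
       // Builds a product-detail URL from form values and redirects the
       // browser to it. The finished URL exists only in the 'url'
       // variable at run time - it never appears in the page as a
       // static <a href="..."> link a spider could follow.
       function showProduct() {
         var category = document.getElementById("cat").value;
         var productId = document.getElementById("prod").value;
         var url = "/products/" + category + "/detail.html?id=" + productId;
         window.location.href = url;
       }
     </script>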

In this code, the function is creating a URL to which to redirect the user's request. That's not a hard thing to do; but you can see that the URL is never actually visible as a static link, so your spider will probably not be able to find the page.

As with images and pull-down lists, there are a couple of possible solutions. One is to avoid using dynamic URLs built in JavaScript and stored in variables. When you absolutely need to have such dynamically created links, go ahead and add the link you expect to create on the links or landing page, so the spider can find it there.



Login Security

Forms-based security is another way to hide content from your search engine, even if you don't mean to. Consider the page shown in Figure 8.

 

Figure 8: Login Page
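
The page isn't reproduced here, but a forms-based login of the kind shown typically amounts to this (the field names and action URL are hypothetical). A spider requesting the protected URLs gets bounced to this form instead of the content, unless it has been configured to fill it in:

     <form method="post" action="/login">
       Username: <input type="text" name="username">
       Password: <input type="password" name="password">
       <input type="submit" value="Log in">
     </form>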

The content owner must certainly want only valid users to see the content; but in many cases, don't you think they'd like their search engine to at least show teasers for secured content to a searcher who may not yet have logged in?

Of course, the solution to this problem is to configure your spider to handle forms-based (challenge-response) security; but that's one more thing you need to consider as you set up your spider.

Which leads me to one final issue related to spiders and web servers: the error page. As you probably know, when a user asks for a URL that does not exist, the web server will generally return a status of "404" - Page Not Found - as opposed to a normal status of "200", which indicates no errors.

Companies like to make their web sites friendly, so many have started creating a "friendly" error page for the times a URL doesn't exist. Figure 9 shows one such page.

 

Figure 9: Friendly Error Page

As with so many other things, there is a right way and a wrong way to handle this situation. One way, of course, is to configure your web server to redirect to a friendly page whenever there is an error. But often this "error" page returns a status of "200" because, in fact, the error-reporting page displayed properly. It's better to go into the guts of your web server and customize the default error page, so that the web server returns a "404" status along with the friendly message. Otherwise, your spider will happily index as many non-existent pages as you have bad links to - and they will all show up as perfectly fine pages, each containing the same friendly message telling the user that the page does not exist.
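
How you do this depends on your web server. As one example, on Apache the ErrorDocument directive serves a friendly page while preserving the real status code, provided you point it at a local path rather than a full URL (a full URL makes Apache issue a redirect, and the error status is lost). The file path below is hypothetical:

     # httpd.conf (or .htaccess): serve a custom page for missing URLs
     # while still returning the 404 status to browsers and spiders.
     ErrorDocument 404 /errors/not-found.html

     # Avoid this form: a full URL causes a redirect, and the final
     # response comes back as a "200" instead of a "404".
     # ErrorDocument 404 http://www.example.com/errors/not-found.html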



Call to Action

What can you do to see if any of these bad things are happening on your web site right now?

First, perform a "data audit".

Verify that all of the content on your web site was successfully crawled and added to the search index. Don't assume that, because the spider finished and didn't report any errors, it got all of your content and metadata - especially if you use any of the methods described here. If you've indexed your public web site, use Google to confirm the number of pages it has found, using the 'site:' search modifier. For example, to check our site, I can go to Google and search for:

     site:www.ideaeng.com

Remember that Google considers www.ideaeng.com/index.html and ideaeng.com/index.html to be different pages, so to be safe, check for both. If your engine has far fewer pages than Google does, you have a problem.

Next, try a few searches. Use your search engine and look for 'error 404' or the text of your friendly error page; finding it in the index might indicate your 'Page not found' page is giving your spider a normal "200" status code.

Do you have pull-down menus on some of your pages? Find one, and follow the pages in its list. Then search your site for each of those pages and see whether there are any non-pull-down-menu links to them.

In Summary

As you can see, a links or landing page is a good solution to many of these spider woes. Go through your code, identify content "behind" the coding methods described here, and make sure it is all reachable by some other technique. And run those tests - you can't improve what you don't measure.
