Enterprise Search Blog

5 ways to hide content from your search engine (and why you shouldn't)

Ten years ago, the web was a wild place: new content development systems everywhere, booming expansion of standard HTML and scripting languages - not to mention cool graphic environments like Flash. Many of today's corporate web developers learned their trade during that dot-com boom.

Why is that a problem?

Back then, search was new on the internet. Google was still offering 'Stanford Search' and 'Linux Search' on its early homepage, with a rather clunky-looking graphic compared to today's look (Figure 1, thanks to Archive.org's Wayback Machine and Google Blogoscoped).

Figure 1: Google circa 1998

Few people were interested in having the deep content on their web sites indexed - often, just having a search engine find your front page was exciting.

Those of us who were really into search - yes, some of us had already worked for search companies for more than five years - saw the problem coming, but it was hard to get anyone interested in it. After all, people could find your home page! After that, it was up to the webmaster to lead them to the right content.

Skip ahead to 2008.

The problem is that many web developers and web agencies didn't understand that designing a web site in certain ways makes it impossible for a search engine to FIND the content. It didn't work then, and it doesn't work now. And I'll bet you didn't spend all that money on search technology just to have frustrated users.

Common Problems

So what are the common errors that keep your search engine from finding your content? Here they are, in no particular order:

- Pull-down lists
- Text in images or video clips
- Badly configured robots.txt
- Dynamic URLs and JavaScript
- Login security

There are many, many more ways to keep your content out of your company search indices. But patience, grasshopper: Rome wasn't searched in a day.

Pull-down Lists

Pull-down lists are a great way to show your users just what areas your site contains. You can provide direct links to Support, Marketing, HR, Development - whatever parts of your site have content that might be helpful. The problem is that your search engine spider not only doesn't see the text in your pull-down list, it also cannot follow those links to index the great content you have hiding behind the pull-down menus.

Jakob Nielsen and others have suggested that half of all web site users will use search if it is available - and we think that number may even be higher. If you run an intranet site, do you want to block your content from half of your employees? And if you run a customer-facing site, can you afford to lose half of your potential customers right away?

Figure 2 shows a home page on an IBM site designed to deliver the user to the correct page. It serves its function well - you see nothing until you select a country. But it also keeps Google, Yahoo!, and other major search engines out. Inside companies, we've seen this sort of thing not on the home page, but on support sites where a visitor is asked to pick a product or a category of questions.

Figure 2: Pull-Down Lists
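
The figure itself isn't reproduced here, but a country selector of this kind typically boils down to something like the sketch below (the option values and URLs are hypothetical). The destination pages exist only inside the JavaScript, so a spider that doesn't execute scripts never sees them:

     <select id="country" onchange="gotoCountry(this.value)">
       <option value="">Select your country</option>
       <option value="us">United States</option>
       <option value="de">Germany</option>
       <option value="jp">Japan</option>
     </select>

     <script type="text/javascript">
       // Redirects the browser to the page for the chosen country.
       // The URL is built at selection time, so it never appears as a
       // normal <a href="..."> link that a spider could follow.
       function gotoCountry(code) {
         if (code) {
           window.location.href = "/" + code + "/home.html";
         }
       }
     </script>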

If your site has pull-down options, you can solve your search engine problem without redesigning your site.

Identify all of the pages that your pull-down menus link to - do so for every type of link that hides your content, including the HTML OPTION tag and JavaScript 'onClick' links. Create a link page, or landing page, that links to all of those pages, and include it in your web crawler's 'start pages'. Note that you do not need to link to this page from any of your normal content, so no users will actually see it. But your spider will crawl the landing page you've created, and because of that all of your content gets indexed.
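
As a minimal sketch, assuming the hypothetical country pages from the example above, such a landing page simply lists every destination as an ordinary static link:

     <!-- spider-links.html: not linked from any user-visible page.      -->
     <!-- Add its URL to the crawler's list of start pages so the spider -->
     <!-- can reach the content hidden behind the pull-down menus.       -->
     <html>
       <body>
         <a href="/us/home.html">United States home</a>
         <a href="/de/home.html">Germany home</a>
         <a href="/jp/home.html">Japan home</a>
       </body>
     </html>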



Text in images or video clips

Some Internet sites that want to be perceived as modern tend to use a lot of images. Pictures of products in GIF or JPG format or, increasingly, animated scenes that really want you to know that the company and its products are cool. We're not against nice-looking web sites - in fact, we're often jealous of them - but images and Flash or Silverlight animations are not always friendly to web crawlers, both on the Internet and within your own search engine.

The good news is that companies like Adobe, which owns Flash, have realized that being friendly to search engines is important, and they've started including textual metadata within the animations. Silverlight, by the way, has been able to do so since its introduction.

And when it comes to static images, HTML has always had the ability to associate text with an image, using the "ALT" attribute. As an added benefit, having detailed ALT text helps your site comply with the Americans with Disabilities Act, because vision-impaired people using screen readers can understand what you are showing.
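
For example (the file name and descriptive text here are made up):

     <!-- Without ALT text, a spider sees nothing but a file name. -->
     <img src="/images/widget-2000.jpg">

     <!-- With descriptive ALT text, both spiders and screen readers
          know what the picture shows. -->
     <img src="/images/widget-2000.jpg"
          alt="Widget 2000 industrial control panel, front view">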

Consider the site in Figure 3. Very cool looking; very marketing and sales oriented; but unless they are using the latest Flash, no spiders are going to find the content hidden behind their home page.

Figure 3: Graphic Home Pages

You can view the page source in Figure 4 below. It's simple, clean, and easy to create and maintain: the work is in creating the graphics. But will a spider ever get past this page? No.

Figure 4: HTML for a Flash Page
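
The source isn't reproduced here, but a typical all-Flash home page of that era amounted to something like this sketch (file and company names are hypothetical): one OBJECT/EMBED block and no crawlable text or links at all.

     <html>
       <head><title>Acme Corporation</title></head>
       <body>
         <!-- All of the navigation and content live inside the .swf file,
              where a conventional spider cannot see or follow them. -->
         <object width="800" height="600">
           <param name="movie" value="home.swf">
           <embed src="home.swf" width="800" height="600"></embed>
         </object>
       </body>
     </html>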

As with the pull-down lists above, one solution is to create a landing page that links to all of the primary content on your site, and include that page in your spider/crawler configuration.



Badly configured robots.txt

There is a de facto standard on the web for excluding polite, participating web spiders and crawlers from parts of your site. But if you have a robots.txt file and don't have it configured correctly, you may be missing content in your internal search index.

First, note that robots.txt is not an internationally accepted standard: there is a proposed draft, 'A Method for Web Robots Control', that dates from 1997, but we believe no formal standardization effort is active as of 2008.

Next, note that honoring robots.txt is optional for every spider - the ones from places like Google and Yahoo! play nice; those from more nefarious operators may ignore your requests anyway.

So how can this affect your internal spider?

If you have a robots.txt file in the root of your web server's document directory to keep the good guys away from certain areas of your site, or to ask them to be nice and delay a bit between page requests, it's quite possible that your internal spider is honoring those requests as well - and skipping content you want indexed.

Consider the snippet from a robots.txt file we captured from a large search technology company shown in Figure 5.

 

Figure 5: robots.txt
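
The captured file isn't reproduced here, but the kind of snippet described would look something like this sketch (the paths are hypothetical):

     User-agent: *
     Crawl-delay: 10
     Disallow: /internal/
     Disallow: /partners/
     Disallow: /downloads/beta/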

You can see that the last three lines do request that external spiders not crawl parts of the site, but because the 'User-agent' field specifies all spiders, there is a good chance this company's internal crawler is going to skip that content as well.

There are two things you can do to overcome the problem.

First, almost every commercial web crawler lets you override its default behavior of following the directives in robots.txt. This is easy to do, applies globally to the pages your internal spider crawls, and generally has no downside.

Another choice would be to change robots.txt to specify potentially different rules for different spiders. For example, Figure 6 shows an expanded robots.txt that allows the internal spider, 'nie_crawl', to access the areas disallowed to all other (polite, participating) spiders.

 

Figure 6: robots.txt updated for internal spider
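
Again as a sketch with hypothetical paths, the updated file gives the internal spider its own section; crawlers that honor robots.txt apply the most specific matching 'User-agent' record, so 'nie_crawl' follows its own (empty) Disallow list and ignores the rules meant for everyone else:

     # Rules for the internal crawler: nothing is disallowed.
     User-agent: nie_crawl
     Disallow:

     # Rules for all other (polite, participating) spiders.
     User-agent: *
     Crawl-delay: 10
     Disallow: /internal/
     Disallow: /partners/
     Disallow: /downloads/beta/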



Dynamic URLs and JavaScript

Modern web browsers can handle a great deal of local processing, and popular technologies such as JavaScript and ASP provide that dynamic behavior. These technologies are not a problem in themselves: how developers use them can be.

Consider the code snippet in Figure 7.

 

Figure 7: Building Dynamic URLs in JavaScript
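
The original snippet isn't reproduced here, but code of the kind described usually looks something like this sketch (the function, field, and path names are hypothetical):

     <script type="text/javascript">
       // Builds a product-detail URL from form values and redirects the
       // browser to it. The finished URL exists only in the 'url'
       // variable at run time - it never appears in the page as a
       // static <a href="..."> link a spider could follow.
       function showProduct() {
         var category = document.getElementById("cat").value;
         var productId = document.getElementById("prod").value;
         var url = "/products/" + category + "/detail.html?id=" + productId;
         window.location.href = url;
       }
     </script>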

In this code, the function is creating a URL to which to redirect the user's request. That's not a hard thing to do; but you can see that the URL is never actually visible as a static link, so your spider will probably not be able to find the page.

As with images and pull-down lists, there are a couple of possible solutions. One is to avoid using dynamic URLs built in JavaScript and stored in variables. When you absolutely need to have such dynamically created links, go ahead and add the link you expect to create on the links or landing page, so the spider can find it there.



Login Security

Forms-based security is another way to hide content from your search engine, even if you don't mean to. Consider the page shown in Figure 8.

 

Figure 8: Login Page
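
The page isn't reproduced here, but a forms-based login of the kind shown typically amounts to this (the field names and action URL are hypothetical). A spider requesting the protected URLs gets bounced to this form instead of the content, unless it has been configured to fill it in:

     <form method="post" action="/login">
       Username: <input type="text" name="username">
       Password: <input type="password" name="password">
       <input type="submit" value="Log in">
     </form>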

The content owner must certainly want only valid users to see the content; but in many cases, don't you think they'd like their search engine to at least show teasers for secured content to a searcher who may not yet have logged in?

Of course, the solution to this problem is to configure your spider to handle forms-based (challenge-response) security; but that's one more thing you need to consider as you set up your spider.

Which leads me to one final issue related to spiders and web servers: the error page. As you probably know, when a user asks for a URL that does not exist, the web server will generally return a status of "404" - Page Not Found - as opposed to a normal status of "200", which indicates no errors.

Companies like to make their web sites friendly, so many have started creating a "friendly" error page for the times a URL doesn't exist. Figure 9 shows one such page.

 

Figure 9: Friendly Error Page

As with so many other things, there is a right way and a wrong way to handle this situation. One way, of course, is to configure your web server to redirect to a friendly page whenever there is an error. But often this "error" page returns a status of "200" because, in fact, the error-reporting page displayed properly. It's better to go into the guts of your web server and customize the default error page, so that the web server returns a "404" status along with the friendly message. Otherwise, your spider will happily index as many non-existent pages as you have bad links to - and they will all show up as perfectly fine pages, each containing the same friendly message telling the user that the page does not exist.
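
How you do this depends on your web server. As one example, on Apache the ErrorDocument directive serves a friendly page while preserving the real status code, provided you point it at a local path rather than a full URL (a full URL makes Apache issue a redirect, and the error status is lost). The file path below is hypothetical:

     # httpd.conf (or .htaccess): serve a custom page for missing URLs
     # while still returning the 404 status to browsers and spiders.
     ErrorDocument 404 /errors/not-found.html

     # Avoid this form: a full URL causes a redirect, and the final
     # response comes back as a "200" instead of a "404".
     # ErrorDocument 404 http://www.example.com/errors/not-found.html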



Call to Action

What can you do to see if any of these bad things are happening on your web site right now?

First, perform a "data audit".

Verify that all of the content on your web site was successfully crawled and added to the search index. Don't assume that, because the spider finished and didn't report any errors, it got all of your content and metadata - especially if you use any of the methods described here. If you've indexed your public web site, use Google to confirm the number of pages it has found, using the 'site:' search modifier. For example, to check our site, I can go to Google and search for:

     site:www.ideaeng.com

Remember that Google considers www.ideaeng.com/index.html and ideaeng.com/index.html to be different pages, so to be safe, check for both. If your engine has far fewer pages than Google does, you have a problem.

Next, try a few searches. Use your search engine and look for 'error 404' or the text of your friendly error page; finding it in the index might indicate your 'Page not found' page is giving your spider a normal "200" status code.

Do you have pull-down menus on some of your pages? Find one, and follow the pages in its list. Then search your site for each of those pages and see whether there are any non-pull-down-menu links to them.

In Summary

As you can see, a links or landing page is a good solution to many of these spider woes. Go through your code, identify content "behind" the coding methods described here, and make sure it is all reachable by some other technique. And run those tests - you can't improve what you don't measure.
