- Missing Titles
- Many Titles are Identical
- The first part of many titles looks the same
- Titles with invalid characters
Intermission: a briefing on document summaries
- No document summaries provided
- Identical document summaries
- "Stuffed" Summaries
- Spurious HTML tags in Summaries Trash the Rest of the Page.
- Poor automated summaries
Your search returns a list of documents where the title for
each document is just the name of the file.
Likely cause: your search vendor or open source
search solution can't extract or parse simple titles.
Where are titles typically extracted from?
- HTML Documents: typically these are extracted
from the <title> element.
- E-mails and News Postings: the subject line is typically used.
- Binary document formats, such as Microsoft Word: titles
are taken from the document "meta data" (File menu /
Properties).
- For plain old text files, which have no predefined title
meta field, some engines will simply take the first line or
block of text as the title.
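To make the extraction rules above concrete, here is a minimal sketch in Python. It is not how any particular search engine does it; the `extract_title` function, the `kind` labels ("html" and "text"), and the fall-back to the file name are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(content, kind, filename):
    """Return a display title; `kind` is 'html' or 'text' (hypothetical labels)."""
    title = ""
    if kind == "html":
        parser = TitleParser()
        parser.feed(content)
        parser.close()
        # Collapse runs of whitespace left over from the markup.
        title = " ".join("".join(parser.parts).split())
    elif kind == "text":
        # Plain text has no title field, so use the first non-blank line.
        for line in content.splitlines():
            if line.strip():
                title = line.strip()
                break
    # Fall back to the file name only when nothing better was found.
    return title or filename
```

If your results show nothing but file names, it suggests your engine is only ever reaching that last line.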
Fix: it's time to upgrade.
Likely cause: You have a number of Microsoft Office
documents in your collection whose title property was never filled in.
Fix: make sure your content creators are checking
document properties and entering a reasonable title before
submitting the document.
Your search returns a list of documents where the titles for
many of the documents are identical.
Likely cause: Use of templates with a boilerplate title.
Fixes:
- Talk to your document authors.
- Use a content "tweaker" as part of your spider process to
look for better titles in the document body.
Your search returns a list of documents where the titles for
many of the documents are identical. Then you look more closely
and it's just that they share the same (long) beginning text.
For example:
- Acme's World Wide Web Customer Service Site Frequently Asked Questions: Igniting Your Rocket Propelled Skates
- Acme's World Wide Web Customer Service Site Frequently Asked Questions: Avoiding Ferrous Materials when Operating the Super Mega Magnet
- Acme's World Wide Web Customer Service Site Frequently Asked Questions: Invisibility Paint Safety Precautions
The long title prefixes clutter the results list and obscure the real content.
Likely cause: This is also usually caused by the
use of templates, or perhaps a misguided corporate Look and
Feel policy.
Fix: modify your templates or policies to
encourage shorter prefixes: in the example cited a prefix of
"FAQ" or, at most, "Acme FAQs:" would be preferable.
Long titles, if they are truly unique and informative, are
not necessarily bad. We mentioned in
Omit or Truncate the Display of Long URLs in the Search Results
from "Top 10 Tips for Better Search Results"
that long URLs in a results list can push out the right edge
of the results table. Long titles typically will not do
this, because they should word wrap properly within the table. An
exception, and a potential problem, is when the titles
are bracketed by <nobr> tags: they cannot wrap, so they can
break the formatting of your search results tables. For that
case we suggest either removing the <nobr> tags from
around the title, or truncating the title at some preset
limit.
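Truncating at a preset limit could look something like the sketch below. The `truncate_title` name, the 80-character default, and the choice to break at a word boundary are all assumptions; pick whatever limit suits your results table.

```python
def truncate_title(title, limit=80):
    """Shorten `title` to at most `limit` characters, ending with '...'."""
    if len(title) <= limit:
        return title
    if limit <= 3:
        # No room for an ellipsis; just cut.
        return title[:limit]
    cut = title[: limit - 3]
    # Prefer to break at the last space so a word is not cut in half.
    space = cut.rfind(" ")
    if space > 0:
        cut = cut[:space]
    return cut.rstrip() + "..."
```

Titles that already fit are returned unchanged, so this is safe to apply to every result.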
The titles returned for documents have one or two gibberish characters.
Likely cause: This is often caused by a malfunctioning document filter,
bad meta data, or character set encoding issues.
Fixes:
- Adjust your spider or talk to your search vendor.
- Put a tweak in your results templates that does a sanity
check on the title. If it's too short (e.g. fewer than 5
characters in length) assume that it's invalid and default
to a secondary title or perhaps the file name. Using the
filename is not great but less ugly than using a handful of
control characters.
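Here is one way the sanity check might be sketched. The function names and the idea of stripping control characters before measuring the length are assumptions; the 5-character threshold is the one suggested above.

```python
import unicodedata

MIN_TITLE_LENGTH = 5  # threshold suggested in the text above


def clean_title(title):
    """Replace control and other non-printable characters with spaces, then tidy up."""
    chars = [
        " " if unicodedata.category(ch).startswith("C") else ch
        for ch in title
    ]
    return " ".join("".join(chars).split())


def choose_title(title, secondary_title, filename):
    """Return the first candidate that passes the length check, else the file name."""
    for candidate in (title, secondary_title):
        cleaned = clean_title(candidate or "")
        if len(cleaned) >= MIN_TITLE_LENGTH:
            return cleaned
    return filename
```

Be aware that a legitimately short title such as "FAQ" would also fail this check, so you may want a lower threshold if your collection has many of those.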
Likely cause: for HTML documents, this can be caused by
HTML elements being included in the title.
Fix: Remove extraneous tags.
Next we turn our attention to document summaries.
There are three common types of summaries offered by search
engines. Don't fret too much about which of these methods
your engine uses - any of these methods is typically better
than nothing at all. Some engines offer choices - make sure
to read up on what you have available and test the various
settings.
- Explicit summaries
Summary is specifically stated by
the document author. In HTML, this is the description meta
tag in the <head> section of the document. For binary
documents such as Microsoft Word, this is set as a "document
property", often under the File menu. A benefit of this
type of summary, when used properly, is that content authors
have the control needed to present a precise document
summary, which may prove more useful than the automatically
derived summaries (described below). However, this does
take consistency and discipline to fully realize the
benefits.
- Index-time derived summaries
A block of text is extracted from the document when it is
being indexed, and used as the summary. When this document
matches a search, it will always be displayed with the same
summary, regardless of what the search terms were. For HTML
documents, most engines will prefer to use the fixed meta
tag summary, if it's present, and will only resort to this
as a fallback. Vendors often have settings for how much
text to extract, either measured in characters or words or
sentences. As for which text from the document to use, some
engines take text near the top of the document, presuming
it's likely to be relevant; this can cause problems in HTML
pages if the top (or left edge) of the document contains
lots of navigation links that the engine mistakes for
pertinent text. A symptom of this is lots of summaries with
"Home | Products | Services | About Us..." type text; if
you see them, investigate tuning your settings. Some vendors
support embedded tags that help demarcate central document
content from extraneous content within each page. Other
engines try to extract statistically "important" sections of
the document for the summary, where "statistically
important" is determined by each vendor's algorithms. Some
vendors also allow adjustments to this selection by letting
the administrator specify words that are known NOT to be of
interest, such as terms that appear in navigational parts of
the page.
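As a rough illustration of an index-time summary, the sketch below prefers an author-supplied description and otherwise takes the first words of the body while skipping lines that look like navigation. It does not reflect any vendor's algorithm; the `NAV_WORDS` list, the 40-word default, and the "half the words are navigation terms" rule are assumptions.

```python
import re

# Hypothetical list of words an administrator has flagged as navigation text.
NAV_WORDS = frozenset({"home", "products", "services", "about", "us", "contact"})


def looks_like_navigation(line, nav_words=NAV_WORDS):
    """Guess whether a line is a menu bar such as 'Home | Products | Services'."""
    words = re.findall(r"[A-Za-z]+", line.lower())
    if not words:
        return True  # nothing but separators or whitespace
    nav_hits = sum(1 for word in words if word in nav_words)
    return nav_hits * 2 >= len(words)


def index_time_summary(meta_description, body_text, max_words=40):
    """Build one fixed summary for a document at indexing time."""
    # Prefer the explicit summary when the author supplied one.
    if meta_description and meta_description.strip():
        return " ".join(meta_description.split())
    words = []
    for line in body_text.splitlines():
        if looks_like_navigation(line):
            continue
        words.extend(line.split())
        if len(words) >= max_words:
            break
    return " ".join(words[:max_words])
```

Because the result does not depend on the search terms, it can be computed once and stored with the document.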
- Search-time derived summaries
These are the fancy summaries some engines have that show
the part of the document that contains the key search terms;
the document will have a different summary each time it is in
the results list, depending on the search terms in each
search, sometimes even highlighting the actual terms in
bold. This is certainly the "sexiest" type of summary.
Overall, if this is working well, then you should consider
using it.
We now continue with what can go wrong with summaries and how
to fix it.
This has become more rare, but some search engines do not
provide summaries: instead they provide a long list of
clickable URLs. A few power users might like this, so they
can see 50 results on their screen at once, but mere mortals
like to see summaries.
Likely cause: Document summaries not turned on.
Fix: Turn them on. If they are not available consider
upgrading / changing search engines.
Likely cause: This is also usually caused by the
use of boilerplate templates.
Fix: revisit your templates.
In the old days of the Internet, back in the late 1990s,
webmasters tried stuffing their summaries with key words to
boost their ranking on Internet portals. Portal indexing
spiders got wise to this a long time ago, so this practice
is essentially futile (in terms of portal rankings). But,
if you still have these summaries in place, you probably
will succeed in confusing your own enterprise search engine,
and provide very poor summaries in your results. If you've
got any of this cruft still hanging around, get rid of it.
Example: A summary includes an opening <b> tag for
bolding, but does not include the closing </b> tag.
Result: the rest of the results are bold too! This can be
particularly disturbing if it pushes the "Next Page" link
off of the visible page.
A more serious example is when a summary contains an opening table-related element
such as <table> or <td>, but does not contain the closing element. This
can garble the layout and may even prevent the rest of the page from displaying.
Fixes:
- If the vendor supports it, ask for a summary that
does not include HTML tags, but that DOES include
entities. Fortunately this seems to be the default for
many modern search engines.
- If the search vendor does not offer this, you can
still write a script to remove the tags, based on searching
for < and >.
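Such a script can be quite small. The sketch below removes anything between < and > and leaves entities such as &amp; untouched; the `strip_tags` name is ours, and the pattern is deliberately naive, so it would also eat a literal "<" and ">" pair that was never meant as a tag.

```python
import re

# Anything from '<' to the next '>' is treated as a tag.
TAG_PATTERN = re.compile(r"<[^<>]*>")
# A tag that was cut off before its closing '>' (e.g. by summary truncation).
DANGLING_TAG_PATTERN = re.compile(r"<[^<>]*$")


def strip_tags(summary):
    """Remove anything that looks like an HTML tag; entities are kept as-is."""
    # Replace tags with a space so adjacent words are not glued together.
    without_tags = TAG_PATTERN.sub(" ", summary)
    without_tags = DANGLING_TAG_PATTERN.sub("", without_tags)
    # Collapse the extra whitespace the substitutions leave behind.
    return " ".join(without_tags.split())
```

Run it on each summary in your results template, before the summary is written into the page.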
As we mentioned in our briefing on document summaries above,
fancy search engines offer a dynamic keyword specific
summary.
Here are some things to look out for if you use this type of feature:
- Some vendors don't include enough surrounding terms to give context, so your summaries wind up with strings of ellipses, gaps, and lots of small text fragments - this is not particularly useful in conveying context and is visually unattractive. See if your engine has adjustments for how much text to include.
- A better option is for the engine to display complete
sentences with the term highlighted.
- Be careful not to let summaries get too long: limit
them to two to four sentences.
- Performance issues: some engines slow down noticeably
because they re-fetch and re-analyze a document EACH
TIME it is displayed in a results list. They do this
in order to calculate an optimal summary, but on some
systems this can slow everything down.
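To show what "complete sentences with the term highlighted" might look like in practice, here is a minimal sketch. It is not any engine's actual method: the `search_time_summary` name, the simple punctuation-based sentence splitter, and the default of three sentences are assumptions, and it works on plain text that has already been extracted from the document.

```python
import html
import re


def highlight(sentence, pattern):
    """Escape the sentence for HTML and wrap each matched term in <b>...</b>."""
    # With one capturing group, split() alternates plain text and matches.
    pieces = pattern.split(sentence)
    out = []
    for i, piece in enumerate(pieces):
        escaped = html.escape(piece, quote=False)
        out.append("<b>" + escaped + "</b>" if i % 2 else escaped)
    return "".join(out)


def search_time_summary(text, terms, max_sentences=3):
    """Return up to `max_sentences` whole sentences that mention a search term."""
    terms = [term for term in terms if term]
    if not terms:
        return ""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chosen = [s for s in sentences if pattern.search(s)][:max_sentences]
    return " ".join(highlight(s, pattern) for s in chosen)
```

For example, `search_time_summary(text, ["skates"], max_sentences=2)` returns the first two sentences of `text` that mention "skates", each with the term in bold. Capping the sentence count keeps the summary within the two-to-four-sentence range suggested above.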