20+ Differences Between Internet vs. Enterprise Search - And Why You Should Care (Part 3)

Last Updated Feb 2009

By Mark Bennett, New Idea Engineering, Inc. - Volume 5 Number 4 - Summer 2008

Read Part 1 and Part 2 of this series.

Part 3: Strategic and Business Considerations

The previous two installments of this series have focused on the "bits and bytes" of search.  In this final section we're going to step back and look at features in a broader context.  In a few places we may cover a couple of topics again, but this time in the context of "why" vs. "how".

An Analogy: Food vs. Cuisine

You can run a fine French restaurant or you can feed an army of a million soldiers – they are both impressive feats - but the level of care and detail afforded to each individual diner is quite different.  The drastic differences between these services are a lot like the situation facing IT departments managing corporate and customer facing search engines that were originally designed to search the Internet.  In a French restaurant, you can fuss over every diner at every table.  Are the potatoes warm enough?  Did the sauce break?  Does table 3 need more butter?  This is Enterprise Search.  Every document and product listing matters, and every meta data item should be as close to perfection as possible.

Part 3: Outline / Contents     [Part 1] [Part 2]
Intro to Part 3
Food vs. Cuisine Analogy
ROI: Myth or Fact?
The Business Intelligence Perspective of Search
Compliance and eDiscovery: Do You Need Deep Iteration and Deep Facets?
Pairing Data with Navigators
Comparing Faceted Search / Parametric Search and Taxonomies
Data Cleanup
Entity Extraction and Fact Extraction
Taxonomies / Ontologies / Topic Sets / "Road Maps"
User Driven Navigators: Tag Clouds / Folksonomy / Community Driven Content
Automatic Clustering/Unsupervised Clustering
Summary of Strategic and Technical Search Points
High Level / Strategic
Technical
Wrap-up

Editor's Note:  This series of newsletter articles is a summary of a new White Paper that we're working on.  If you'd like a copy of the final version, please email us at info@ideaeng.com

In contrast, search engines designed to index and search the entire Internet are like feeding a million-man army.  Some soldiers get potatoes, some get rice.  Soldiers on patrol will get prepackaged ready-to-eat meals (MREs).  And in the heat of battle vegetarian soldiers might have to compromise at times.  Similarly, if an Internet spider misses a hundred pages on a particular subject, it will likely pick up millions more from other web sites, and normalizing custom meta data is simply out of the question.

ROI: Myth or Fact?

A detailed explanation of the Return on Investment of Search is beyond the scope of this article, but we would like to make a few points.

First off, it's true that if you improve search, you may earn more money from increased sales, or save money by improving employee efficiency and customer retention, depending on the type of improvements you make.  Improvements that can be measured are considered Hard ROI, whereas the more intangibles would be Soft ROI.

However, predicting and accurately measuring how much money you'll earn or save can be difficult.  Some of the popular ROI studies were done in the late 1990s, and even studies that appear to be newer are often citing the earlier work, so much of the ROI data is almost 10 years old!

And beware of vendor ROI figures for improved employee productivity.  When vendors present their ROI calculations, they usually multiply the hours per day spent searching times the number of employees, etc. to arrive at some astronomical annual amount of wasted time.  The implied and flawed assumption is that if you upgraded to their solution you would magically recapture all of that lost productivity – which is simply not true.  A good search engine might save 5 to 10% of that wasted time, maybe in the extreme case 30%, but the point is that it's not 100%!  Of recent note, Q-Go stands apart in modern search engine ROI in offering an ROI money back guarantee to qualifying customers.  We'd like to see other vendors be so confident in their ROI numbers.
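As a rough illustration of the arithmetic described above, the following sketch compares the vendor-style headline number with what a 5%, 10% or 30% recapture would actually be worth.  All inputs (one hour per day, 1,000 employees, $50 per hour, 230 working days) are hypothetical values chosen only for the example, and the function names are our own.

```python
def annual_search_cost(hours_per_day, employees, hourly_cost, work_days=230):
    """Yearly cost of all time spent searching (the vendor-style headline number)."""
    return hours_per_day * employees * hourly_cost * work_days


def realistic_savings(total_cost, recapture_rate):
    """Portion of that cost a better engine might actually recover."""
    if not 0.0 <= recapture_rate <= 1.0:
        raise ValueError("recapture_rate must be between 0 and 1")
    return total_cost * recapture_rate


# Hypothetical inputs, chosen only for illustration.
total = annual_search_cost(hours_per_day=1.0, employees=1000, hourly_cost=50.0)
for rate in (0.05, 0.10, 0.30, 1.00):
    print(f"recapture {rate:>4.0%}: ${realistic_savings(total, rate):,.0f}")
```

The last line of output (100%) is the figure a vendor slide would typically show; the first three are closer to what the paragraph above suggests is plausible.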

Summarizing the commonly cited ROI benefits of improving search:

  • Revenue from helping customers find things to buy quicker
  • Revenue from using search to suggest additional related products
  • Reduce Support costs with self service, reducing emails and phone calls
  • Saving time by helping employees find information faster
  • Saving time by not having employees recreate things that already exist
  • Improving customer and employee satisfaction and retention

The Business Intelligence Perspective of Search

Most people just think of search in terms of helping users find things.  A business person may then consider the ROI impacts of search, using it to sell more inventory or saving employees' time.  But search has benefits at a more strategic level as well, that appeal to Marketing and corporate management.

We say that search has three levels of benefits:

1: The direct benefit to users:

With a good search engine, employees or customers can find what they're looking for quickly.  This is the aspect of search that everybody is aware of.

2: Financial benefits / ROI:

This can be from the direct generation of additional revenue and cost savings, improving efficiency, i.e. "hard ROI" vs. "soft ROI".  We talked about this in the previous section.

Although the ROI of search gets mentioned quite a bit in the press, we don't think it justifies new search projects to management very often, except in the case of a customer facing B2C or B2B commerce site.

3: Strategic / BI (Business intelligence):

Spotting search and content trends, and being able to respond more quickly.

Here are some examples of the potential BI benefits of search:

  • What users are looking for and the CHANGES in this interest over time.
  • What they are not finding, either because of misspellings, "vocabulary mismatch issues" (where the words used in the content don't match up with the search terms users type in), or perhaps products that you don't yet offer.
  • Customer Service can spot a spike in complaints about a particular product glitch or searches from an important customer.
  • Getting a handle on the content you own.  Spiders and their related tools can actually teach things about your data that you didn't know.  Preparing for search can also inspire an audit of silos and meta data.
  • Content owners can check that the terminology they are using is matching up with the search terms being used.
  • Improving site navigation.
  • Keeping track of competitors.

Old school "click-tracking" of web site analytics shows you which links a user follows and the number of seconds spent on each page, leaving you to guess why a user clicked on certain links and whether it answered their question or not.  The more modern approach uses Search Analytics to give a much clearer view.  Search Analytics shows you exactly what the user wanted, because you know what they typed in!  And you can certainly see which searches produced zero results, which is a very good indicator that they were not satisfied.  These analytics can also spot trends and changes in behavior and spot vocabulary mismatch between the search terms typed in and the language used on your web pages.
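A minimal sketch of the kind of report described above, assuming the search log is available as simple (query, result count) pairs.  The log format and the function names are assumptions for illustration; real engines each have their own log layout.

```python
from collections import Counter


def normalize(query):
    """Lowercase and collapse whitespace so trivially different queries group together."""
    return " ".join(query.lower().split())


def summarize_log(log, top_n=3):
    """Return (most common queries, queries that returned zero results)."""
    counts = Counter()
    zero_hits = Counter()
    for query, result_count in log:
        q = normalize(query)
        counts[q] += 1
        if result_count == 0:
            zero_hits[q] += 1
    return counts.most_common(top_n), zero_hits.most_common()


# Hypothetical log entries: (query as typed, number of results returned).
log = [
    ("vacation policy", 12),
    ("Vacation  Policy", 12),
    ("pto", 0),
    ("PTO", 0),
    ("expense report", 40),
    ("expnse report", 0),
]
top, failed = summarize_log(log)
print("top queries:  ", top)
print("zero results: ", failed)
```

In this made-up log, "pto" fails because of a vocabulary mismatch (the content says "vacation policy") and "expnse report" fails because of a misspelling, which are the two cases called out in the list of BI benefits.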

Modern search engines can look at search terms, phrases and sentences at a statistical level.  This can be applied to both submitted searches and to recently authored content, possibly including tech support incident descriptions and bug reports, mailing list and blog postings, and other highly dynamic internal content.  Modern software can detect statistically significant changes, but assigning meaning and action to these changes is still best left to human experts within the company.  We have ideas about how this can all be coordinated and turned into concrete actions, but most organizations are still busy working on more basic search upgrades.

When justifying search projects, we encourage clients to think in terms of all three levels of benefits.

When thinking about the BI benefits of search, we suggest including additional stakeholders even in the earliest parts of planning.  Most companies already involve IT and site designers in their planning process.  But these BI benefits will also be of interest to upper management, content creators, customer service / tech support and Marketing.  Planning of Enterprise Search projects (behind the firewall) should also include Human Resources, helpdesk staff, corporate librarians, sales engineers and professional services, security and compliance officers, CFO and legal staff, and any knowledge workers central to the company's core competence.

We're not suggesting you design by committee; this isn't about governance, it's about gathering input.   Some companies formalize search teams into an SCOE (Search Center of Excellence), and maintain a list of other stakeholders in the company to routinely communicate with.

And finally, an area where search vendors are still mostly quiet is what to do once you make these discoveries: how do you turn them into actions that will improve the situation?  For example, very few search analytics tools directly tie search reports into the content promotion engine so that an identified problem can be immediately fixed.  There are manual, procedure-based Best Practices for further acting on these discoveries, and products like our own SearchTrack integrate analytics and the defining of suggestions into a single unified interface.

Compliance and eDiscovery: Do You Need Deep Iteration and Deep Facets?

A subject that isn't talked about much, because it doesn't affect casual searchers, is the question of whether or not a search engine can return every single document that matches a query, even if there are a million matches.  A corollary is whether or not the terms and counts presented in document navigators reflect every single matching document, or are just an estimate based on looking at the first few hundred or thousand docs.

Many engines do not allow you to see every matching document.  This is normally fine, since even a determined human will typically give up after looking at 20 or so pages of results.  But if your company is responding to a subpoena to produce all documents related to a particular set of terms, the judge is likely to mean ALL matching documents, no matter how many there are.  Similarly, a particular term may appear in thousands of articles written by a hundred different authors.  It might happen that none of a particular author's articles show up in the first thousand matches, but that he has written many articles on the subject.  A search engine that shows author as a facet, based only on the first thousand matches, would not even display his name in the list.  On the Internet we could argue that this author's articles must not have been that relevant so it doesn't matter, but a researcher in a specialized field might have recognized that author's name, had it been listed, and might have been very interested in what he had written.

Such stringent requirements are rare in general usage, but if they do apply they are likely to be very important, and may not be easy to find in vendor literature.  Both Endeca and Dieselpoint claim to be capable of handling this.
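The author-facet problem described above can be shown with a small sketch.  The ranked result list, the author names and the 1,000-match cutoff are all hypothetical; the point is only the difference between counting over the first N matches and counting over every match.

```python
from collections import Counter


def facet_counts(matches, field, limit=None):
    """Count facet values over the ranked matches; limit=None scans every match."""
    scanned = matches if limit is None else matches[:limit]
    return Counter(doc[field] for doc in scanned)


# Hypothetical ranked result list: 1,000 matches by two prolific authors rank
# first, followed by 25 lower-ranked matches by a specialist.
matches = (
    [{"author": "Adams"} for _ in range(600)]
    + [{"author": "Baker"} for _ in range(400)]
    + [{"author": "Chen"} for _ in range(25)]
)

shallow = facet_counts(matches, "author", limit=1000)  # estimate from the top 1,000
deep = facet_counts(matches, "author")                 # every matching document
print("shallow:", dict(shallow))
print("deep:   ", dict(deep))
```

With the shallow count, "Chen" never appears in the author navigator even though 25 matching articles exist; the deep count lists all three authors.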

Pairing Data with Navigators

OK… this is a little techie… but we need to revisit the subject of your data one more time, because it relates to which search engine features are likely to work well.  Search engine vendors offer a confusing selection of clickable results list gadgets for users to drill down and refine their results, technologies with names like "parametric search", "faceted navigation", "tag clouds", "taxonomies", "automatic clustering"… the list goes on and on.  Since all these clickable links look about the same, casual users assume they are all very similar.  But it's important to understand that the implementations of these techniques are radically different, and each is best paired with different types of data.

A vendor who is proud of one particular method will tend to see all search problems in terms of their patents and PhDs, whereas companies are better served by looking at their data first, and then pairing it up with the ideal navigator technology.

Public Internet search portals tend to not use these advanced techniques because of various technical reasons, including the wider variety and amount of data they must contend with, and the less sophisticated nature of their users.  Corporations can actually surpass the level of search functionality offered by Internet search, because companies have more control over these technical issues.   Yahoo does use taxonomies, some portals offer clustering, and some social sites run on user submitted tags, but public Internet search actually lags behind in these areas.  Although IT departments are used to the question "Why can't our Intranet search be just like Google?", we think management should be asking "How can our search be better than Yahoo, MSN and Google?"

There are five general methods of results list navigators, which we organize into three levels of effectiveness:

Level 1: Likely to provide optimal results

  • Faceted/Parametric Navigation
  • Traditional Taxonomies

Level 2: Not quite as accurate

  • Simple Navigators, sometimes paired with Entity Extraction and Fact Extraction

Level 3: Effectiveness varies widely

  • Automatic Clustering/Unsupervised Clustering
  • User Driven (both explicit and statistically derived)

For best results, follow one of these general rules:

  1. Match your data with the highest order navigation technology available to you.
  2. Or "Upgrade" your data, using various techniques, so that you can use one of the higher order navigators.  Tools and techniques do exist for doing so.
  3. If you can't use one of the higher order navigators, and upgrading the data is not an option, settle for one of the lower order navigators.

Comparing Faceted Search / Parametric Search and Taxonomies:

These two are not the same, though there is some overlap.  You've probably used Facets on one of the large consumer electronic sites, and Taxonomies on the Yahoo or DMOZ.org search portals.  Some data is better suited for Facets; other systems may work better with Taxonomies.  Both assume that the raw content or data has some structure or organization.

If you have documents with a lot of high quality meta data, or database records which have well defined fields, then the data is structured at the document level and would typically be paired with Faceted Navigation / Parametric Search.

Content that lacks that individual document structure but still has an overall organization (the "corpus" level), would typically be paired with a content-based taxonomy, or perhaps a very simple facet.

Parametric Search / Faceted Navigation

Overall, this provides some of the best results for users, but requires the documents (or database records) to have quality meta data (or database fields).  Vendors differ on their meanings of the terms Faceted Navigation and Parametric Search.  Generally these are very similar techniques, although some experts define additional functionality for Facets.  We will cover this in future articles or blog postings.

If content does not have this quality meta data, these techniques won't work very well.  Another option for content that lacks meta data is to "upgrade the data" by parsing it from the text or otherwise deriving meta data.  This topic is discussed under Data Cleanup, below.

Data Cleanup

Some businesses find themselves in the awkward spot of having some meta data, but perhaps not enough to drive faceted search.  Or their database fields are not populated consistently enough, or with high enough quality, to power search facets.  We certainly agree that data quality is a big concern; the "garbage in, garbage out" computer saying still holds.  However, we counsel such clients not to give up so easily.  Document meta data can be normalized and improved, or source database fields cleaned up.  Some of the search vendors now offer tools to do this, as do various third parties.  Content with marginal meta data, but that exists in an overall structure, can have additional meta data derived from that structure.  And finally, Entity Extraction can be used to generate additional meta data.  So in this case, if at all possible, we would advise upgrading the data to fit facets, vs. going with one of the lesser methods.
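As a simplified sketch of two of the cleanup ideas above, the code below normalizes an inconsistently populated field and, when the field is empty, derives a value from the document's location in an overall structure.  The "department" field, the alias table and the paths are hypothetical; commercial cleanup tools are considerably more sophisticated.

```python
import re

# Hypothetical mapping of variant spellings to one canonical value.
DEPARTMENT_ALIASES = {
    "hr": "Human Resources",
    "human resources": "Human Resources",
    "eng": "Engineering",
    "engineering": "Engineering",
}


def normalize_department(raw):
    """Map a messy department value to a canonical label, or None if unknown."""
    if raw is None:
        return None
    key = re.sub(r"[^a-z ]", "", raw.lower())  # drop punctuation and digits
    key = " ".join(key.split())                # collapse whitespace
    return DEPARTMENT_ALIASES.get(key)


def department_from_path(path):
    """Derive a department from the folder structure when the field is missing."""
    for part in path.lower().split("/"):
        canonical = normalize_department(part)
        if canonical:
            return canonical
    return None


def clean_record(record):
    """Return a copy of the record with a normalized 'department' field."""
    dept = normalize_department(record.get("department"))
    if dept is None:
        dept = department_from_path(record["path"])
    return {**record, "department": dept}


print(clean_record({"path": "/docs/a.pdf", "department": " H.R. "}))
print(clean_record({"path": "/intranet/eng/specs/b.doc", "department": None}))
```

Once every record carries the same canonical values, the field becomes usable as a facet.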

Entity Extraction and Fact Extraction:

These methods are somewhat less predictable, but do allow you to find people, places, companies and other well-understood objects in your content, then present a very simple set of navigators based on those items.

Even simpler navigators can be had by just showing the number of matches from each data source and letting the user click on a source to further narrow the search.
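To make the two ideas above concrete, here is a deliberately naive sketch: entities are found by matching against a small hand-made list (a "gazetteer"), and both entity counts and per-source counts are tallied across a result set.  The entity names, sources and documents are invented for the example, and real extraction products rely on much larger dictionaries and statistical models.

```python
import re
from collections import Counter

# Hypothetical gazetteer; real extractors use far larger lists and models.
KNOWN_ENTITIES = {
    "company": ["Acme", "Globex"],
    "place": ["Paris", "London"],
}


def extract_entities(text):
    """Return the set of (entity_type, name) pairs found by whole-word matching."""
    found = set()
    for entity_type, names in KNOWN_ENTITIES.items():
        for name in names:
            if re.search(r"\b" + re.escape(name) + r"\b", text):
                found.add((entity_type, name))
    return found


def build_navigators(results):
    """Count entities and data sources across a result set, for display as navigators."""
    entity_counts = Counter()
    source_counts = Counter()
    for doc in results:
        source_counts[doc["source"]] += 1
        entity_counts.update(extract_entities(doc["text"]))
    return entity_counts, source_counts


# Hypothetical result set from two data sources.
results = [
    {"source": "wiki", "text": "Acme opened an office in Paris."},
    {"source": "wiki", "text": "Globex and Acme both exhibited."},
    {"source": "crm", "text": "Customer meeting in London."},
]
entities, sources = build_navigators(results)
print(entities.most_common())
print(sources.most_common())
```

Each count would be rendered as a clickable link, such as "Acme (2)" or "wiki (2)", that narrows the result list to the matching documents.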

Taxonomies / Ontologies / Topic Sets / "Road Maps"

Some products and demos assume taxonomies will be used to leisurely browse a set of content, without doing a specific search.  This was the earlier usage model for them.  When we talk about Taxonomies in relation to search, we mean the clickable trees that show up next to a result list, that allow you to narrow your search results by clicking on a node and finding the matches just within that category.  By some definitions this could even be thought of as a specialized type of faceted navigation, though that distinction is not particularly important, as the tools for working with taxonomies tend to be different from those used for parametric and faceted data.
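A small sketch of the "clickable tree next to a result list" idea, assuming each document has already been assigned a category path in the taxonomy.  The categories, titles and function names are hypothetical.

```python
from collections import Counter


def in_subtree(category, node):
    """True if the category path equals node or lies beneath it."""
    return category[:len(node)] == node


def narrow(results, node):
    """Keep only the results filed under the chosen taxonomy node."""
    return [doc for doc in results if in_subtree(doc["category"], node)]


def child_counts(results, node):
    """Count results under each immediate child of node, for the clickable tree."""
    depth = len(node)
    return Counter(
        doc["category"][depth]
        for doc in results
        if in_subtree(doc["category"], node) and len(doc["category"]) > depth
    )


# Hypothetical search results, each already placed in a taxonomy.
results = [
    {"title": "Laptop warranty", "category": ("Products", "Hardware", "Laptops")},
    {"title": "Printer drivers", "category": ("Products", "Hardware", "Printers")},
    {"title": "License FAQ", "category": ("Products", "Software")},
    {"title": "Holiday schedule", "category": ("HR",)},
]

print(child_counts(results, ()))              # top level of the tree
print(child_counts(results, ("Products",)))   # after clicking "Products"
print([d["title"] for d in narrow(results, ("Products", "Hardware"))])
```

Each click on a node both narrows the result list and recomputes the counts for the next level down.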

Taxonomies can be great if you have a taxonomy and your data has been organized into it.  If it hasn't, it may be possible to upgrade the data with some automated tool, placing all of your documents into a taxonomy.  And if you don't have a taxonomy at all, other tools exist to help create it, usually available from the same vendor.  This subject is so broad that there are even Taxonomy Bootcamps offered by some companies.

Taxonomies have three basic flavors:

  1. Subject Based / Domain Based:
    Thoroughly organized subjects, championed early on by experts from library sciences and researchers
  2. Content Based:
    Organize the data that you already have, perhaps with an automated tool, without worrying about the overall theoretical subject matter.
  3. Behavior Based:
    Organized by the searches that users are actually doing, or by tags that they assign to specific items. The latter is sometimes referred to as a "Folksonomy".

The first two are more traditional, more predictable.  The third, organizing the site or content by what users are actually searching for, can provide a quick fix while longer term improvements are being worked on.

How data gets into Taxonomies is an interesting subject.  There are newer tools on the market that take completely unstructured data and try to coax it into a taxonomy, effectively "upgrading" the data by adding a logical structure; older tools had humans manually create the rules.  Taken together these tools are sometimes referred to as automatic classification systems, taggers or profiling systems.  Even now a few high value industries use human experts to manually assign documents to categories in a taxonomy, or carefully supervise automated tools.  A simpler cousin of this manual input is the tagging that is popular on many web sites, typically adding descriptive tags to photos or videos.  However, in most systems these tags are not presented in any sort of nested way, so we would not tend to call them a proper taxonomy.  Some have used the term "Folksonomy" to describe this type of site.  But again, users are adding descriptive data to documents, thus upgrading the meta data of the content.

User Driven Navigators: Tag Clouds / Folksonomy / Community Driven Content

Editor's Note: This is an extension of the third type of taxonomy, "behavior based", but is a large enough topic that we break it out separately.

Tag clouds can be great if you have enough active users.  On a big public site, even a participation rate of about 1% still yields a critical mass of contributors; but in organizations with only hundreds or thousands of users, there may not be enough contributors to get adequate tags in place.  There are techniques that retool socially-driven content for smaller groups, for example by using inferred tags, but the effectiveness of doing so has detractors and it is generally not a "canned" feature in mainstream search software.

A further driver is if the items you want to search do not have text, such as photos or videos.  In such a case user-driven navigators may be the only reasonable option, though some vendors now offer audio and video mining.

Tags and behavior can be used to drive both relevancy and results list navigators.  Further, like taxonomies, Tag Clouds are sometimes used only for browsing or an initial search.  In other words, all visitors to the site see the same tag cloud prior to doing a search, can then click into a tag to see matching documents, then click on other tags.  To be considered a navigator, tags would need to appear next to search results and be specific to the set of matching documents so that you could "drill down" into them to further narrow results.  Users would see different tag clouds for each search; this has been a point of confusion for some companies when comparing different types of navigators.
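The distinction above, between one site-wide tag cloud and a tag navigator that is specific to each result set, can be sketched as follows.  The documents, tags and the trivial substring "search" are placeholders for illustration only.

```python
from collections import Counter


def search(docs, term):
    """Stand-in for a real engine: case-insensitive substring match on the text."""
    return [d for d in docs if term.lower() in d["text"].lower()]


def tag_navigator(matches, selected=()):
    """Count tags across the given documents only, excluding tags already selected."""
    counts = Counter(tag for d in matches for tag in d["tags"])
    for tag in selected:
        counts.pop(tag, None)
    return counts


def drill_down(matches, tag):
    """Narrow the current result set to documents carrying the clicked tag."""
    return [d for d in matches if tag in d["tags"]]


# Hypothetical user-tagged content.
docs = [
    {"text": "Quarterly sales report", "tags": ["finance", "report"]},
    {"text": "Sales training video", "tags": ["training", "video"]},
    {"text": "Office party photos", "tags": ["photos", "social"]},
    {"text": "Sales kickoff photos", "tags": ["photos", "sales-event"]},
]

site_wide_cloud = tag_navigator(docs)      # same for every visitor, before any search
matches = search(docs, "sales")
result_cloud = tag_navigator(matches)      # specific to this search
narrowed = drill_down(matches, "photos")
print(sorted(result_cloud))
print([d["text"] for d in narrowed])
```

Here "social" appears in the site-wide cloud but not in the navigator for the "sales" search, because no matching document carries that tag.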

Automatic Clustering/Unsupervised Clustering:

This is a more recent development which leverages advanced statistics and may or may not provide acceptable results.  It goes by other more obscure names as well.  These mathematically rooted techniques are still relatively new and rather unpredictable, though vendors are quite proud of the PhDs and Patents that power these new engines.  Vendors who offer this technology tend to think it's applicable for any problem you might have – it even removes tough coffee and tea stains!  As you can probably tell, we're a tad more skeptical.

There are good implementations and bad ones.  Usually we suggest these techniques only for data that has no structure or meta data and that cannot be upgraded, and that is therefore not a candidate for facets or taxonomies.

Google is betting heavily on this type of navigator, over the more traditional techniques.  If it can be done well, it is certainly more consistent with the "appliance" model of radically simplified administration.  We promise to keep an open mind and report back on this; Google's technical prowess has certainly surprised us all in the past.

Summary of Strategic and Technical Search Points

Here we present a quick summary of this three part series, and how technical design decisions can affect your search engine.  Sort of a mini Enterprise Search Manifesto.

High Level / Strategic:

  • Expecting any computer to consistently return the correct document *you* were looking for, every time you type in a 1 or 2 word query, is simply not reasonable, regardless of the "relevance" claims that vendors make.  There is no "HAL 9000 for search" at this time.
  • "Single Shot Relevancy" is not enough – the engine may not be able to reliably guess which is the "best" page to answer your query on the first shot – it needs to engage users in a conversation to help them refine their search.
  • Ownership equals control.  Since you own (or rent) your search engine, you have the power to control it (or replace it).  Similarly, if you own your own data, you can precisely test your search engine and hold it to a higher standard for spider coverage and proper meta data handling.
  • The public Internet search caters to a wide and often casual audience, whereas employees have real work to get done and searches are often goal and task driven.
  • Decide whether search is secondary to your particular business, or a core component.  If it's core, then you will want an engine that can be heavily tuned and adjusted, and where you have good control over relevancy.
  • Decide whether your search needs are generic or will require heavy customization and/or complex integration.   If it's the latter, then you will want an engine that can be heavily customized and that has many API and integration hooks.  Advanced engines can even apply additional business rules or call-outs when displaying the results list.
  • Decide whether you're hoping to make more money by using better search, or whether you are trying to save money.  Direct revenue generation with search may require more complex business rules and tighter ecommerce integration.
  • Search has 3 main benefits: 1: User experience: the direct benefit to users, finding things they need, 2: generation of additional revenue and/or cost savings, and 3: BI / business intelligence, spotting search and content trends, and being able to respond more quickly.
  • Measuring the success of search engine projects based solely on the reduction in complaints is risky.  Employees that lose faith in the search infrastructure will simply stop using it, and frustrated customers will take their business elsewhere, both of which will also reduce complaints.
  • New employees will have high expectations for search, based on their experience using popular Internet portals.  Not only should you plan to meet those expectations, but consider going well past that benchmark.
  • Employees may be more motivated to use an advanced search interface if it means getting their job done better and is something they'd need to use every day.  Direct training is even an option.  However, even indentured employees will not put up with confusing UIs, wrong answers, irrelevant or duplicate results, slow searches, or other "bad" search systems.  This is an opportunity to give them better search, not an excuse to force them to put up with inferior search; some companies still don't understand the difference.
  • In theory, you know more about your employees, their job functions, and previous searches than public portals know about their users; therefore your search engine should be able to make use of this information to perform better.
  • Web 2.0 techniques that require explicit user participation may not work behind the firewall because of low participation rates.  Web 2.0 techniques often need to be retooled to serve corporate search.
  • More data means more responsibility and liability.  If all your data is accessible by search, then you will need advanced security.  And as users type in more and more searches, we believe it's only a matter of time before those search logs become discoverable.
  • "Taxonomy as a symptom" - If an organization has been considering using Taxonomies for more than 2 years they may have an even larger problem.  Taxonomies may or may not be appropriate, but more importantly this organization has obviously been unhappy with search for quite a while and still isn't sure what to do about it.  The mythical "taxonomy" might represent a generic wish to "fix the search engine" which has been accidentally attached to a specific positive sounding industry buzzword.  The team needs to reevaluate whether a taxonomy is even what they need, and possibly seek outside advice.

Technical:

  • Match your data to navigators, using the most precise type of navigator you can.
  • In some cases data can be "upgraded", by Entity Extraction or by normalizing meta data, or by automatic profiling tools, so that it can be used with one of the better navigator types.
  • Most engines lack adequate tools to audit spider coverage and certify indexing QOS metrics.
  • Compound and Composite Documents usually require custom handling.
  • Records from a Database need different handling than HTML Web Pages.
  • Companies often benefit by customizing their internal search engine's thesaurus to reflect their terminology.
  • Data changes frequently in some specialized applications, which presents a problem for many engines.
  • Many engines handle meta data inadequately, either by failing to normalize it and perform quality checks, or by ignoring it altogether.
  • User interface, query syntax, and Entity Extraction usually require heavy customization.
  • Database-like operations such as transactions, joins and arbitrary key-set filters are not standard.
  • Although it's very difficult to identify missing data on public search engines, it is possible to detect missed data in privately owned search.
  • Internet spiders can rank documents based on hyperlinks, but eCommerce and Enterprise Search engines must look for other signals.
  • The Internet is predominately HTML, but corporations have more document types, such as word processing, spreadsheets, graphic, corporate database, video, audio, and more PDF files.
  • The Internet is mostly public information, but corporate data must be protected more carefully, sometimes even at a document by document level, based on each employee's role.
  • Private data sets in some organizations are now larger than the entire Internet was back in the mid 1990s, and are now also having some of the same problems.
  • Spiders created for the Internet need some tuning in order to work effectively inside of corporate firewalls or to power your customer facing sites, because the data and usage models are different than what the engine was originally designed for.
  • Google's PageRank relevancy isn't as effective in the Enterprise, so they use other document signals, which is similar to what other vendors do.  Some parties still believe Google's relevancy is better than that of its competitors.
  • Businesses may need precise control over relevancy rules, or custom business logic in the results list.
  • Enterprise Search needs document level security.
  • Federated Search can be used to address complex security requirements across multiple silos.
  • Federated Search in the enterprise is more complex than the demos typically shown using only public data sources.
  • Unlike the Internet, every document and database record matters, so spiders need to be monitored.  This is particularly true when you are complying with a subpoena.  Many spiders lack precise document-by-document indexing and monitoring controls.
  • Know what your top 100 searches are, and what the search engine returns for each.  Check this at least every 90 days; monthly, weekly or even daily for revenue generating search.  Most engines now also support directed results, where specific documents are suggested in response to particular search terms.  If any of the top 100 searches are problematic, this system can be used to suggest a best bet.
  • Adding a custom thesaurus can be an inexpensive way to fix many search problems.
  • Site navigation works in conjunction with search.  Generally users will click on clearly laid out navigators first for common tasks, and only resort to search if the site navigators don't immediately address their needs.  Popular searches can suggest additions to your site navigation menus.
  • Better data means you can use better results list navigators, to let users drill down into search results.
  • Good SEO-friendly habits used on a company's public web will likely also help Enterprise Search.  This includes spider friendly content and URLs.
  • Although Intranet content may be varied and scattered, at least your employees are not intentionally trying to mislead the search engine with false meta tags. Make sure the spider knows this and indexes them.  And there is no need to worry about coworkers "outbidding" each other for ranking.
  • Your employees are not looking for celebrities or YouTube videos on your corporate Intranet, so searches will be a bit more focused on business related tasks.
  • Although there is no "bidding" for keyword ads, other good SEO practices will help Enterprise Search engines work better and are still advised behind the firewall.
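The "know your top 100 searches" check from the list above lends itself to a simple periodic audit.  The sketch below assumes you can obtain a list of submitted queries and can call your engine from a script; the `run_search` function, the index and the best-bets table here are hypothetical stand-ins, not any particular vendor's API.

```python
from collections import Counter


def top_queries(query_log, n=100):
    """Return the n most frequent queries, ignoring case and extra whitespace."""
    counts = Counter(" ".join(q.lower().split()) for q in query_log)
    return [q for q, _ in counts.most_common(n)]


def needs_attention(queries, run_search, best_bets, min_results=1):
    """List queries that return too few results and have no best bet assigned."""
    return [
        q for q in queries
        if len(run_search(q)) < min_results and q not in best_bets
    ]


# Hypothetical stand-ins for a real engine and its directed-results table.
INDEX = {"expense report": ["/finance/expenses.html"], "vpn": ["/it/vpn-setup.html"]}
BEST_BETS = {"pto": "/hr/vacation-policy.html"}


def run_search(query):
    return INDEX.get(query, [])


log = ["VPN", "vpn", "pto", "PTO ", "expense report", "org chart", "org chart"]
top = top_queries(log, n=100)
print(needs_attention(top, run_search, BEST_BETS))  # prints ['org chart']
```

In this made-up data, "pto" returns nothing but is already covered by a best bet, while "org chart" is popular, returns nothing, and has no suggestion defined, so it is the one flagged for follow-up.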

In Closing

The subject of Enterprise and Customer Facing Search, and how they differ from public Internet search, could fill volumes.  But there is one overriding point to take away: Feeding an army of a million soldiers is different than running a fine French Bistro because details matter!  Most engines can be adjusted to meet your critical business needs - if you know where to look.

New Idea Engineering always welcomes your questions and input.  Feel free to contact us at info@ideaeng.com

[Read Part 1 and Part 2 of this series.]
 