new idea ENGINEERING         Home  | Products  | Services  | Newsletter  | Resources  | About Us | Contact Info | Privacy Policy        

  Specializing in Enterprise Search since 1996 - including FAST, Autonomy, Google, Endeca, Dieselpoint and Lucene

Free Subscription to Enterprise Search, an e-newsletter published 8 times a year that covers search, categorization, personalization and taxonomies. Sign up at http://www.ideaeng.com/subscribe.html

An Introduction to Taxonomies and Categorization

Introduction

Taxonomies are a way to organize documents or web pages into logical groupings, based on their contents. Ideally, documents discussing the same subject will be grouped together into one of the taxonomy's categories.

Taxonomies are often organized into "trees" to make them easier to navigate; the subject-related categories and subcategories form the "branches" of the tree. Near the "root" of the tree are very broad subject categories, such as "financial news", "computer technology", "history" and "travel". As a user navigates down a particular "branch" of a tree, the subject categories get more and more specific. For example, a user navigating down a "Financial News" branch might select "Mergers and Acquisitions", then "AOL / Time Warner" within it, and finally "Key Executives".
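To make the tree idea concrete, here is a minimal sketch in Python. The Category class and its add_path and find methods are hypothetical names chosen for illustration; they do not come from any vendor's product. The sketch only shows how broad categories near the root lead to more specific ones further down a branch.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """One node in a taxonomy tree (hypothetical structure for illustration)."""
    name: str
    children: dict = field(default_factory=dict)

    def add_path(self, *names):
        """Create (or reuse) the chain of subcategories named by `names`."""
        node = self
        for name in names:
            node = node.children.setdefault(name, Category(name))
        return node

    def find(self, *names):
        """Follow a branch from this node; return None if it does not exist."""
        node = self
        for name in names:
            node = node.children.get(name)
            if node is None:
                return None
        return node


# Broad subjects sit directly under the root.
root = Category("root")
root.add_path("Financial News", "Mergers and Acquisitions",
              "AOL / Time Warner", "Key Executives")
root.add_path("Computer Technology")
root.add_path("History")
root.add_path("Travel")

# Navigating down one branch: each step narrows the subject.
branch = root.find("Financial News", "Mergers and Acquisitions")
print(sorted(branch.children))  # ['AOL / Time Warner']
```

A dictionary of children keyed by name keeps lookups simple; a production system would likely also store identifiers and descriptions for each category.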

Probably the best-known example of a taxonomy is the Yahoo Internet portal. Yahoo has logically grouped the millions of web pages they index into convenient categories and subcategories. Taxonomies are also sometimes referred to as "knowledge trees" or "topics", depending on the vendor.

Once a taxonomy tree has been created, all the documents in the system are tagged as belonging to one or more specific taxonomy categories. This process is typically referred to as "categorization", "tagging" or "profiling", again depending on the vendor. Users can then browse and search within specific categories.
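The tagging step can be sketched as follows. This is an illustrative Python example under simple assumptions: category paths are written as tuples, and the tag and browse functions, along with the document ids, are invented for this sketch. It shows a document being tagged with more than one category and a user browsing everything at or below a chosen category.

```python
from collections import defaultdict

# Category paths are written as tuples, from the root of the tree downward.
MERGERS = ("Financial News", "Mergers and Acquisitions")
AOL_TW = MERGERS + ("AOL / Time Warner",)
TRAVEL = ("Travel",)

# category path -> set of document ids tagged with that category
tags = defaultdict(set)


def tag(doc_id, *paths):
    """Record that a document belongs to one or more categories."""
    for path in paths:
        tags[path].add(doc_id)


def browse(path):
    """Return documents tagged with `path` or with any category beneath it."""
    found = set()
    for tagged_path, doc_ids in tags.items():
        if tagged_path[:len(path)] == path:
            found |= doc_ids
    return found


tag("doc-1", AOL_TW)
tag("doc-2", MERGERS)
tag("doc-3", TRAVEL, MERGERS)   # a document may sit in several categories

print(sorted(browse(MERGERS)))  # ['doc-1', 'doc-2', 'doc-3']
print(sorted(browse(AOL_TW)))   # ['doc-1']
```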

The Value of Taxonomies

Taxonomies are becoming very important to companies as they struggle to organize their ever-increasing mountains of electronic data.

Early search engines had no problem searching through a few thousand documents that might have been stored in a single large repository; almost any search engine works fine if you only have 1,000 documents! As the number of indexed documents grew, search vendors tried again and again to improve their search engines; some even added Artificial Intelligence to parse user queries and locate pertinent documents.

But the average user search is composed of just 1.4 words! These short one- and two-word queries thwarted most of those advanced algorithms. More importantly, it has become clear that users sometimes prefer an iterative experience. They enter a one-word search, and then look at the results. Based on the results, they may edit their search and try again. The early search engines were much more "one shot" oriented. Taxonomies provide a well-understood structure for more modern, targeted searches.

Taxonomies provide several key benefits:

  • Documents are partitioned into logical groupings which are easier to navigate

  • Users can locate information even if they start with a single-word search term

  • Taxonomies facilitate iterative, drill-down searches which both advanced and beginning users can quickly traverse

  • A taxonomy category can be used to limit the scope of a search, thus reducing the number of irrelevant documents returned

  • A well-organized taxonomy adds "context" to documents that are returned in a search result; the category a document is listed in can convey concepts such as "relevance", "source", "authority", "public vs. private" and chronological indicators

  • Taxonomies help avoid problems with common English language peculiarities, such as similar-sounding words or words with multiple meanings.

    For example, does the search term "sun" refer to the center of our solar system, or to the company Sun Microsystems? A user typing in this search might be presented with two branches, one labeled "Science / Astronomy / Solar System" and a second branch labeled "Business / Computer Companies / Sun Microsystems" - it would be very clear to the user which documents dealt with which concept. They could then investigate the appropriate branch further.

  • Taxonomies also give customer service pages and corporate portals a more professional, organized look, and an improved navigational structure.
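Two of the benefits listed above, limiting the scope of a search to a category and separating the two meanings of "sun", can be illustrated with a small Python sketch. The DOCS list, the search function and its scope parameter are all hypothetical and stand in for a real search index; the matching is deliberately naive.

```python
from collections import defaultdict

# Hypothetical mini-index: (document id, category path, text).
DOCS = [
    ("d1", "Science / Astronomy / Solar System",
     "The sun is the star at the center of our solar system."),
    ("d2", "Business / Computer Companies / Sun Microsystems",
     "Sun announced a new line of servers."),
    ("d3", "Business / Computer Companies / Sun Microsystems",
     "Quarterly results for Sun beat expectations."),
    ("d4", "Travel / Beaches",
     "Pack a hat for the long afternoons."),
]


def in_scope(category, scope):
    """True if `category` is the scope itself or lies somewhere beneath it."""
    return category == scope or category.startswith(scope + " / ")


def search(term, scope=None):
    """Return hits for a one-word query, grouped by category.

    If `scope` is given, only categories at or below that branch are searched.
    """
    term = term.lower()
    hits = defaultdict(list)
    for doc_id, category, text in DOCS:
        if scope is not None and not in_scope(category, scope):
            continue
        words = text.lower().replace(".", " ").split()
        if term in words:
            hits[category].append(doc_id)
    return dict(hits)


# An unscoped one-word query comes back split into two clearly labeled branches.
for category, doc_ids in search("sun").items():
    print(category, doc_ids)

# Restricting the scope drops the astronomy documents entirely.
print(search("sun", scope="Business"))
```

Grouping the hits by category is what lets the user see at a glance which branch matches the meaning they had in mind.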

When properly implemented, taxonomies speed up employee access to critical data, dramatically increasing their productivity. Ultimately, this is the main reason companies implement the technology.

Creating Taxonomies and Categorizing Documents

Once customers understand what a taxonomy is, their next question is typically "So where do these taxonomies come from?" That's an excellent question!

Different vendors have different methods for creating and maintaining taxonomy trees. Some vendors separate the creation of the trees from the process of categorizing documents, whereas other vendors combine these two processes.

There are three general types of taxonomy creation, with some vendors offering tools that span more than one type:

  • Automatic Taxonomy Creation and Document Categorization

    Some vendors use statistical models to automatically categorize documents and arrange the subject groups into taxonomies. Vendors such as Verity and Semio offer this capability. Most vendors also allow you to modify or create categorization rules to gain tighter control over which categories a document is placed in.

  • Assisted Taxonomy Creation and Document Categorization

    This is the most common type of categorization and taxonomy creation. During a highly interactive and iterative process, knowledgeable personnel act as trainers who monitor the categorization of hundreds (or thousands) of documents and take actions to modify the rules the system is using. Some vendors allow trainers to directly input and override key words and phrases that the system is using, while other vendors simply have the trainers indicate which documents should and should not go into each category. Trainers can also indicate that a category should be further subdivided into subcategories, which gives more precise categorization of documents.

    Two of the main vendors in this space are Quiver and Autonomy. It's important to note that some vendors offer both automatic and interactive categorization, including Verity and Semio. Each vendor's product has its unique strengths and benefits, and style of interaction.

  • Professional Taxonomy Creation

    Though many advances have been made in automatic or semi-automatic taxonomy creation, there is no substitute for a professionally created taxonomy. For certain applications this is still the only acceptable route. The typical motive for selecting a professional taxonomy is either the need for a very high quality tree or the desire to deploy a project quickly and avoid a lengthy setup and training period. Vendors typically offer libraries of taxonomies pertaining to specific industries such as financial, legal and medical information. They can also work with a client to create a specific taxonomy targeted to that client's exact needs.

    Sageware is, by far, the leader in this space. They have a large library of existing taxonomies and a staff to create specific new ones.
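The categorization rules and trainer overrides described for the automatic and assisted approaches can be pictured with a deliberately simplified Python sketch. It uses plain keyword rules rather than the statistical models the vendors employ, and the rules table, categorize and trainer_override are hypothetical names invented here; no vendor's actual method or API is implied.

```python
import re

# Hypothetical keyword rules: category -> words that suggest it.
rules = {
    "Financial News / Mergers and Acquisitions": {"merger", "acquisition", "takeover"},
    "Computer Technology": {"software", "server", "network"},
}


def categorize(text, min_hits=1):
    """Return every category whose rule matches at least `min_hits` distinct words."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(category for category, keywords in rules.items()
                  if len(words & keywords) >= min_hits)


def trainer_override(category, add=(), remove=()):
    """A trainer adjusts a rule directly by adding or removing key words."""
    rules.setdefault(category, set()).update(add)
    rules[category].difference_update(remove)


doc = "The buyout of the server maker closed on Friday."
print(categorize(doc))  # ['Computer Technology']

# The trainer notices the document was missed by the M&A category
# and teaches the rule a new key word.
trainer_override("Financial News / Mergers and Acquisitions", add={"buyout"})
print(categorize(doc))
# ['Computer Technology', 'Financial News / Mergers and Acquisitions']
```

The loop of categorizing, reviewing and adjusting the rule is the iterative part; in a real product the adjustment might instead come from marking documents as belonging or not belonging to a category.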

Where Taxonomies Are Used

Electronic taxonomies have been used by library science professionals for several decades. During adoption of the World Wide Web in the mid-1990s, taxonomies were deployed as part of large Internet portals. In the late 1990s, large corporations began deploying them as part of their internal corporate portals.

This industry has continued to grow and now, collectively, offers a wide range of technology and products at various price points. Small to midsized companies are now also able to offer their employees and customers these same benefits. In addition to public Internet and corporate portals, taxonomies are also finding their way into vertical portals, customer and partner extranet sites, and even very specialized document repositories for knowledge workers.

New Idea Engineering and Taxonomies

New Idea Engineering has partnered with many of the industry's leading technology providers in order to help our customers retrofit existing knowledge systems, or to create completely new ones. For new systems, we can help you sort through the many vendor claims and product demos. For existing systems, we can help you do a technical audit and plan your integration. We can also help with the actual taxonomy creation process. And, of course, we can also do the actual integration for you, if you choose. Even if the search engine currently deployed on an intranet site doesn't offer native support for taxonomies, taxonomies can often still be retrofitted to create a seamless user experience.

Please contact our Sales Team if you would like to discuss any of this in more detail.


Copyright New Idea Engineering, Inc 1996 - 2008