new idea ENGINEERING         Home  | Products  | Services  | Newsletter  | Resources  | About Us | Contact Info | Privacy Policy        

  Specializing in Enterprise Search since 1996 - including FAST, Autonomy, Google, Endeca, Dieselpoint and Lucene

Contrasting Relational and Full-Text Engines

By Mark Bennett, New Idea Engineering, Inc. - Issue 9 - June 2004

Introduction

Full-text search engines evolved much later than traditional database engines, as corporations and governments found themselves with more and more unstructured textual data in electronic format. These new text documents didn't fit well into the old table-style databases, so the need for unstructured full-text searching was apparent.

Since it was developed later, search engine technology borrowed heavily from the database world, and many search engines still employ some type of traditional table structures in their underlying architecture. Some text retrieval companies were even staffed with employees who came from traditional database company backgrounds. Many of the key RDBMS paradigms have also migrated into search engine technology, though often renamed or recast.

Overview of Full-Text Architecture

Figure 1 shows a generalized, high-level architecture of most of the commercial quality full-text engines currently available.

Figure 1

The indexing process begins when an application inserts data into a row in the main document index. In the simplest case, this index contains one row per document, and at a minimum contains the name of the external file or document stored in the key field. Additional field values – for example, document title – can be inserted at the same time. If the source of the data resides in a relational database, the primary key in the relational table or view goes in the main document index key field.

Once the data is inserted, the indexing engine opens the external document and creates an ordered word list to load into the main word index. The engine repeats this process for every record in the document index.
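The two structures described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the names `document_index`, `word_index`, `tokenize` and `index_document` are hypothetical, the sample file names are invented, and the document text is passed in directly rather than read from an external file.

```python
import re
from collections import defaultdict

# Main document index: one row per document, keyed by file name or primary key.
document_index = {}
# Main word index: word -> {document key: [positions of the word in that document]}.
word_index = defaultdict(dict)

def tokenize(text):
    """Lowercase the text and return its words in document order."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_document(key, text, **fields):
    """Insert a row into the document index, then load its words into the word index."""
    document_index[key] = dict(fields)
    for position, word in enumerate(tokenize(text)):
        word_index[word].setdefault(key, []).append(position)

# Hypothetical sample documents.
index_document("reports/q1.txt", "Sales rose in the first quarter", title="Q1 Report")
index_document("reports/q2.txt", "Sales fell in the second quarter", title="Q2 Report")

print(sorted(word_index["sales"]))              # ['reports/q1.txt', 'reports/q2.txt']
print(word_index["quarter"]["reports/q1.txt"])  # [5]
```

Keeping word positions in the word index is one common design choice; it is what makes exact-phrase matching possible later, at the cost of a larger index.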

During the indexing process, each vendor will typically create other indices intended to provide additional features, such as Soundex ("sounds like") matching.
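As an illustration of what such a secondary index might hold, the sketch below implements the classic American Soundex code and groups words by it. The article does not say how any particular vendor builds its phonetic index, so the function name `soundex` and the `phonetic_index` structure are assumptions made for this example.

```python
def soundex(word):
    """Return the four-character American Soundex code for a word."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for letter in letters:
            codes[letter] = digit

    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = codes.get(letters[0], "")
    for c in letters[1:]:
        digit = codes.get(c, "")
        # Skip vowels and repeats of the immediately preceding code.
        if digit and digit != previous:
            result += digit
        # h and w do not separate two letters that share a code; vowels do.
        if c not in "hw":
            previous = digit
    return (result + "000")[:4]

# Hypothetical phonetic index: Soundex code -> words that share it.
phonetic_index = {}
for word in ["Robert", "Rupert", "Ashcraft"]:
    phonetic_index.setdefault(soundex(word), []).append(word)

print(phonetic_index)  # {'R163': ['Robert', 'Rupert'], 'A261': ['Ashcraft']}
```

At search time, an engine built this way would compute the code of the query term and look it up in the phonetic index instead of the main word index.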

Searching is generally available during the indexing process, but only completed records are searchable. For performance, most engines batch a number of records together and index them as a group.
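The effect of batching on what is searchable can be shown with a small sketch. The class name `BatchIndexer`, its `batch_size` parameter and the commit-when-full policy are assumptions for illustration; real engines choose their own batching and commit rules.

```python
class BatchIndexer:
    """Toy indexer that only makes records searchable once a batch is committed."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.pending = []      # records accepted but not yet searchable
        self.searchable = {}   # word -> set of document keys

    def add(self, key, text):
        """Queue a record and commit automatically when the batch is full."""
        self.pending.append((key, text))
        if len(self.pending) >= self.batch_size:
            self.commit()

    def commit(self):
        """Index every pending record in one pass, then empty the queue."""
        for key, text in self.pending:
            for word in text.lower().split():
                self.searchable.setdefault(word, set()).add(key)
        self.pending.clear()

    def search(self, word):
        return sorted(self.searchable.get(word.lower(), set()))

indexer = BatchIndexer(batch_size=2)
indexer.add("a", "red apple")
print(indexer.search("apple"))  # [] (record "a" is still pending)
indexer.add("b", "green apple")
print(indexer.search("apple"))  # ['a', 'b'] (the batch has been committed)
```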

When a query arrives, either programmatically or as a result of a user request, the full-text engine accesses the sorted and optimized word index to identify which documents contain the requested term(s). The engine creates a list of documents that qualify, typically provided as a list of pointers into the main document index.

This permits the engine to retrieve any fields stored in the main document index for each qualifying document, calculate a relevance weight, and display a list of results.
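The query path described above can be sketched as follows. The data is invented, the function name `search` is hypothetical, and the relevance weight is a deliberately naive assumption (the sum of term occurrences); commercial engines use their own, usually proprietary, weighting.

```python
# Word index: word -> {document key: number of occurrences}.
word_index = {
    "sales":   {"q1.txt": 2, "q2.txt": 1},
    "quarter": {"q1.txt": 1, "q2.txt": 1, "faq.txt": 1},
    "refund":  {"faq.txt": 3},
}
# Main document index: document key -> stored fields.
document_index = {
    "q1.txt":  {"title": "Q1 Report"},
    "q2.txt":  {"title": "Q2 Report"},
    "faq.txt": {"title": "Billing FAQ"},
}

def search(terms):
    """Return (key, title, weight) for every document containing all terms."""
    postings = [word_index.get(term.lower(), {}) for term in terms]
    if not postings:
        return []
    # Qualifying documents are those present in every term's posting list.
    matching = set(postings[0])
    for posting in postings[1:]:
        matching &= set(posting)
    results = []
    for key in matching:
        weight = sum(posting[key] for posting in postings)  # naive relevance weight
        results.append((key, document_index[key]["title"], weight))
    # Highest weight first; ties are broken by key so the order is stable.
    return sorted(results, key=lambda r: (-r[2], r[0]))

print(search(["sales", "quarter"]))
# [('q1.txt', 'Q1 Report', 3), ('q2.txt', 'Q2 Report', 2)]
```

Note that the word index is only used to find document keys; the titles shown in the result list come from the main document index.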

The description of the internal architecture will be extended throughout the report.

Advantages Over Traditional Techniques

This section summarizes some of the advantages of search engines over traditional database engines. These advantages are typical of high-end search engine products.

Technical Similarities

While relational database systems and full-text search engines are optimized to process fundamentally different types of data, there are a number of similarities between the two.

Technical Differences

Because the two technologies index fundamentally different types of data and offer different degrees of retrieval flexibility, there are also a number of differences between them. These differences can present some challenges, but they also present an opportunity to use the key features of full-text search to provide an innovative solution to the problem at hand.

Vocabulary Comparison Summary

Table 1 maps relational database vocabulary to its closest full-text equivalents. Because there are so many vendors and the terms are used so broadly, this table can only serve as an approximation.

RDBMS Term(s) | Full-Text Term(s) | Notes
database | collection, document index or catalog | Varies widely.
table | segment or partition | This is typically transparent to the casual search engine administrator; you typically do not address individual partitions or segments.
record | document, record, page, web page or result | Traditional search engines deal in terms of "documents"; more modern Internet engines talk of "web pages".
field | field, document field, meta field, zone | Search engines often have two different ways of storing data. When it is stored in the document index, it is usually called a "field". When it is stored in the word index, it is usually called a "zone". Each type of storage has its own benefits.
blob | zone | Larger segments of text are typically stored as zones.
index (noun) | collection, document index or word index | In both worlds it typically refers to a large binary data store residing on disk.
index (verb) | index or spider | The tabulating and storing of data into the binary indices.
query | query or search | Same terminology in both.
join | n/a | Full-text engines do not usually do "joins" at search time.
import/export | n/a | Most full-text engines do not offer robust import and export capabilities, though some vendors do offer import tools. Documents are indexed, but they are typically not imported directly into a full-text database. The process of "indexing" or "spidering" can be thought of as a type of import, although the original source documents are left where they were.
SQL | n/a | Though there are some full-text query language standards, they are not widely supported or implemented. The closest semi-standard is the "Internet syntax" of some vendors, where + and - serve as AND and NOT, quotation marks delimit exact phrases, and parentheses are often recognized to convey precedence.
ODBC | n/a | There is no widely used standard. Most modern full-text engines do offer access via the HTTP protocol's CGI mechanism, though the specific field names to use vary widely from vendor to vendor.

Table 1
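To make the "Internet syntax" mentioned in the SQL row of Table 1 concrete, here is a minimal sketch of a parser for it. It handles only the + and - prefixes and quoted phrases; parentheses are not handled, and the function name `parse_query` and the three bucket names are assumptions for this example rather than any vendor's API.

```python
import re

# An optional + or - sign, followed by either a quoted phrase or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Split a query into required (+), excluded (-) and optional terms or phrases."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for sign, phrase, word in TOKEN.findall(query):
        term = (phrase or word).lower()
        if not term:
            continue  # e.g. an empty pair of quotation marks
        bucket = {"+": "required", "-": "excluded"}.get(sign, "optional")
        parsed[bucket].append(term)
    return parsed

print(parse_query('+lucene -oracle "full text" search'))
# {'required': ['lucene'], 'excluded': ['oracle'], 'optional': ['full text', 'search']}
```

A real engine would go on to evaluate the three groups against its word index; since each vendor's handling of precedence differs, that step is left out here.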

Summary

Because full-text search engines evolved after, and borrowed heavily from, traditional database engines, administrators should feel right at home. Internalizing the new vocabulary will help complete the transition. If you find yourself still thinking about “inner and outer joins”, remember that these things need to happen up front, at index time, not at search time.
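One way to picture a join that happens "up front" is to flatten related rows into a single document before indexing. The sketch below is illustrative only: the tables, column names and the `build_documents` helper are all hypothetical, and a missing author is handled the way an outer join would, with an empty value.

```python
# Two hypothetical relational tables, represented as lists of rows.
authors = [
    {"author_id": 1, "name": "A. Author"},
]
articles = [
    {"article_id": 10, "author_id": 1, "title": "Indexing Basics"},
    {"article_id": 11, "author_id": 99, "title": "Orphaned Note"},  # no matching author
]

def build_documents(articles, authors):
    """Perform the join once, at index time, producing one flat document per article."""
    authors_by_id = {row["author_id"]: row for row in authors}
    documents = {}
    for article in articles:
        # Outer-join behaviour: keep the article even when no author row matches.
        author = authors_by_id.get(article["author_id"], {})
        documents[article["article_id"]] = {
            "title": article["title"],
            "author": author.get("name", ""),
        }
    return documents

documents = build_documents(articles, authors)
print(documents[10])  # {'title': 'Indexing Basics', 'author': 'A. Author'}
print(documents[11])  # {'title': 'Orphaned Note', 'author': ''}
```

Each flattened document is keyed by the primary key of the source table, so a search on the author field needs no join at query time.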


Copyright New Idea Engineering, Inc 1996 - 2008