Do You Need Synthetic Tokens? (part 1)
Last Updated Aug 2009
By: Mark Bennett
This month we have a bit of a propeller-head article, but we think you’ll enjoy it. Although casual users of search don’t need to worry about the internals, the senior search engine folks at any company should. Like so many other markets, a well-informed consumer will ask better questions and cut through vendor hype. Plus, think how smart you’ll sound at the next company party!
Review of Regular Tokens and Evolving Fulltext Indices
To understand where we’re going, you should double check that you understand the existing technology it builds on. We've talked about how virtually all modern search engines have, at their heart, a set of highly optimized binary files which we call the "fulltext search engine index", or fulltext index for short. They do this because search engines would be very inefficient if they had to scan every line of every document for every search; instead they create an index of all the words in every document, kind of like a librarian's card catalog, and users' query terms can then be quickly looked up in this index. The specifics of the index are vendor specific and often proprietary.
So for a given word, say "jacket", the system consults its fulltext index and immediately sees a list of documents that match. A primitive fulltext index, where the word jacket appears in four documents, would literally look like:
jacket = { 5, 8, 24, 31 }
This "instance vector" has a head, the actual word or token being looked up, and then a compact list of document IDs that contain that word. In reality the instance vector would also include information about where, within each document, the term appeared. You'd need that information to do phrase and proximity searching. So our instance vector might actually look like:
jacket = { doc5-offset7, doc5-offset28, doc5-offset271 ... , doc8-offset19, doc8-offset91 ... doc31-offset842 ... }
For the rest of this article we're going to stick with the simpler form in our examples, but keep in mind that it is a simplification. And of course the actual index would be stored in an efficient binary format, along with additional bookkeeping data, but this is the basic structure of a fulltext index.
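To make that structure a bit more concrete, here is a minimal Python sketch of a positional inverted index. The sample documents and the phrase_match helper are our own invention for illustration, not any vendor's actual on-disk format:

from collections import defaultdict

# term -> {doc_id: [word offsets]}; a tiny in-memory stand-in for the
# binary instance vectors a real engine would keep on disk.
index = defaultdict(lambda: defaultdict(list))

docs = {
    5: "the red jacket and the blue jacket",
    8: "a waterproof jacket for hiking",
}

for doc_id, text in docs.items():
    for offset, word in enumerate(text.split()):
        index[word][doc_id].append(offset)

print(sorted(index["jacket"].keys()))   # simple term lookup -> [5, 8]

# Phrase lookup: "blue jacket" requires adjacent offsets within the same doc,
# which is exactly why the instance vectors record positions at all.
def phrase_match(first, second):
    hits = []
    for doc_id in index[first].keys() & index[second].keys():
        if any(off + 1 in index[second][doc_id] for off in index[first][doc_id]):
            hits.append(doc_id)
    return sorted(hits)

print(phrase_match("blue", "jacket"))   # -> [5]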
And a hint of things to come: we said that an instance vector has, at its head, the word that is being indexed. Those words are actually just the simplest type of token! Moving on…
So what are “Regular” Tokens?
At the simplest level, search engine tokens are usually just the words that are extracted and indexed from the source documents.
Modern search engines don’t do a linear scan of every document each time a search is submitted – this would be WAY too slow! Instead, they break up documents into words, and then catalog those words into a binary search index. That fulltext index is then consulted whenever a search is received: each word in the query is looked up in the index to quickly find the documents that contain it. When you hear a vendor talk about “indexing documents”, this is generally what they’re referring to.
Details… details… Punctuation and Stemming
Of course “the devil’s in the details” – the exact format of the search index is different for every vendor, and usually a closely guarded secret.
Most engines go beyond just breaking the text up into words. At a minimum, many engines will also store copies of words converted to lowercase, and perhaps replace accented characters with their non-accented forms. Typically the original words are still indexed as-is; the extra normalized forms are generated once at index time, at the small cost of some additional index size.
And there are other decisions to be made when breaking a document into its constituent tokens. Should the term “AT&T” be stored as the literal string “at&t”, or with the punctuation removed, as “at” and “t”, as “at (space) t”, or even as “att”? Different engines make different choices about which form of token to store, and some engines will store it multiple ways for maximum flexibility.
And should the word “cars” be stored as the singular form “car”, or perhaps be expanded to include car, automobile and vehicle? Should we normalize the tokens “running” and “ran” to the base form “run”? We’ve talked about stemming and lemmatization before, so we won’t bore you again. This also has an impact on thesaurus-based search, as the system decides which synonyms of a word to store as tokens.
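As a rough illustration of those choices, here is a small Python sketch, entirely our own and not any vendor's tokenizer, that emits several candidate token forms for a raw term such as "AT&T" or "Cafés", so that both the literal and the normalized variants could end up in the index:

import re
import unicodedata

def token_variants(raw):
    """Return the different token forms an engine might index for one raw term."""
    variants = {raw}                      # 1. the literal form, as-is
    lower = raw.lower()
    variants.add(lower)                   # 2. lowercased
    folded = unicodedata.normalize("NFKD", lower)
    folded = "".join(c for c in folded if not unicodedata.combining(c))
    variants.add(folded)                  # 3. accents stripped (café -> cafe)
    variants.add(re.sub(r"[^\w\s]", "", folded))   # 4. punctuation removed (at&t -> att)
    variants.update(re.split(r"[^\w]+", folded))   # 5. split on punctuation (at, t)
    if folded.endswith("s"):              # 6. a toy "stemmer"; real engines use Porter
        variants.add(folded[:-1])         #    stemming or a lemma dictionary instead
    return sorted(v for v in variants if v)

print(token_variants("AT&T"))    # ['AT&T', 'at', 'at&t', 'att', 't']
print(token_variants("Cafés"))   # ['Cafés', 'cafe', 'cafes', 'cafés']

A real engine would pick one or two of these policies per field rather than indexing every variant, but the sketch shows the kind of trade-offs each vendor has to make.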
Tokenizing Non-Western Text
And breaking up documents into words, AKA “Tokenization”, gets even trickier in some Asian languages because they don’t always have easily identifiable word breaks. Some vendors treat each Japanese or Chinese graphic symbol as a separate word token. Other engines use a dictionary to break apart the text into proper words, and store those as tokens.
Even with Western text, some vendors create sub-word tokens based on character N-grams, so that the word “Texas” might be stored as a series of tokens (“te”, “ex”, “xa”, “as”) and (“tex”, “exa”, “xas”). These types of tokens can help with wildcard and regular expression matching, and they also work reasonably well for Asian text, though they are not considered state of the art.
As an example, users of the open source search engines based on Lucene, such as Solr and Nutch, have two choices for handling Japanese text. They can use the CJKAnalyzer, which uses the NGram technique, or the “SEN” toolkit, which uses a dictionary to do linguistically proper word breaks.
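Here is a rough sketch of the character N-gram idea itself, a simplification of our own rather than what CJKAnalyzer literally does: "texas" is broken into overlapping 2- and 3-character tokens, and a wildcard fragment can then be matched against those sub-word tokens instead of scanning every indexed word:

def char_ngrams(word, n):
    """Overlapping character n-grams of a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

word = "texas"
print(char_ngrams(word, 2))   # ['te', 'ex', 'xa', 'as']
print(char_ngrams(word, 3))   # ['tex', 'exa', 'xas']

# A wildcard query like *exa* can be answered by looking up the n-gram
# tokens of the fragment rather than scanning every indexed word.
fragment = "exa"
indexed = set(char_ngrams(word, 2) + char_ngrams(word, 3))
print(set(char_ngrams(fragment, 2)) <= indexed)   # True -> "texas" is a candidate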
But all these choices lead to a specific set of base tokens, and then the list of occurrences for them in each document.
Recapping Regular Tokens
So the process of Tokenizing a document means breaking up the TEXT of the document into little chunks (Tokens) that can be efficiently stored in an index. At a later time, queries can then be similarly broken apart and matched against the index. The details are surprisingly varied and often held as closely guarded secrets.
In summary, tokens are the building blocks of the special index that drives a search engine.
Search Engines and their Evolving Data Structures
To review, search engines break documents up into Tokens, and then each unique token forms the head of an Instance Vector. Put all these instance vectors together into a big binary file and you can quickly lookup any words in a query.
The Traditional Dual-Index Structure
But there's more to a search index than just these vectors. Even in the early days of search, deep inside those binary files, there were actually two fundamental subsets of index data. There were the main "word inversion" instance vectors we just mentioned, and a secondary, more traditional-looking database that held the documents' metadata, such as the title, author, and the file path or URL of the source document. The primary use of this secondary database was to provide data for displaying results: although you'd find matching documents using the main index, when you want to display the title and URL, you need to get that data from a more traditional database. So a search for ordinary words used the primary inverted word index to find the matching documents, then the secondary index to grab the metadata for the results list.
Primary Word Index:
This is the primary set of Instance Vectors we talked about.
abbey = { 1, 4, 9, 12 ... 2279 }
...
baker = { 2, 19, 42, .... 1673 }
...etc....
And a more traditional Metadata Table:
Secondary index with document attributes stored in a table, similar to a traditional database engine.
ID, author, pubdate, price, title
1, Jones, 6/1/2005, 0.95, What I did on my Summer Vacation
2, Stevens, 2/28/2001, 7.95, How to be a Better Baker
.....
2085, Sellers, 5/1/2002, 5.95, Follow the Horses
And just like other traditional database engines, this table oriented index can have additional indices associated with it, to make the row oriented lookup run faster. Even traditional databases employed multiple levels of optimization and indices, and smart search engine vendors borrowed these ideas.
Some of these early search engines put that secondary table index to good use. They allowed you to do searches that looked more like traditional SQL searches, like author='Jones' or pubdate>=1/1/2001. These fielded searches used the secondary database to find the matching documents, and then used those same tables to pull the fields to display the results.
These advanced engines even let you mix the two types of queries, so you could do a phrase search for "summer vacation" with a filter of pubdate>=1/1/2001, etc. These "hybrid searches", as we've always called them, provided the best of both worlds, combining traditional fielded search with fulltext search in a single unified query language. These secondary metadata tables also allowed for sorting, drilldown, and early parametric search.
But an important distinction here is that if you searched for the word "jones" you would be using the fulltext index. Whereas if you said author='jones' you'd be using the secondary table oriented index to find matches. Both would use the secondary tables to bring back titles and such.
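A minimal sketch of that hybrid flow, with invented documents and doc IDs: the fulltext index finds documents containing the words, the metadata table answers the fielded filter, the two sets are intersected, and the metadata table is consulted again to build the results list:

from datetime import date

# Primary inverted word index (word -> doc IDs), as before.
word_index = {
    "summer":   {1, 7, 19},
    "vacation": {1, 19, 42},
}

# Secondary metadata table, one row per document.
metadata = {
    1:  {"author": "Jones", "pubdate": date(2005, 6, 1),
         "title": "What I did on my Summer Vacation"},
    19: {"author": "Smith", "pubdate": date(1999, 3, 14),
         "title": "Summer Vacation on a Budget"},
}

# Fulltext side: documents containing both words.
text_hits = word_index["summer"] & word_index["vacation"]

# Fielded side: a SQL-like filter over the metadata table.
field_hits = {doc for doc, row in metadata.items()
              if row["pubdate"] >= date(2001, 1, 1)}

# Hybrid search: intersect, then pull display fields from the metadata table.
for doc in sorted(text_hits & field_hits):
    row = metadata[doc]
    print(doc, row["author"], row["title"])   # 1 Jones What I did on my Summer Vacation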
Zoned Searching
Vendors evolved this concept a bit since searches using the fulltext index were usually much faster. Queries like "'jones' IN author" or author:jones were now possible. In this case the tokens were expanded to include the section of the document they came from. There were several ways to represent this.
Embedded Zone Information
abbey = { 1, 4, 9, 12(title) ... 2279 }
....
baker = { 2(title), 19, 42, .... 1673 }
....etc....
Engines that tracked every occurrence of a word within each document, which is required if you want to do phrase searches for example, simply extended that offset information to include zone information. We won't bore you with a diagram for that!
But a more modern way to represent this is to include the name of the document section as part of the base instance vector token.
Zone-Specific Token Vectors
“body:abbey” = { 1, 4, 9 ... 2279 }
“body:baker” = { 19, 42, ... 1673 }
....
“title:abbey” = { ... 12 ... }
“title:baker” = { 2, .... }
....
Notice how instances of the word "baker" that appear in the title are stored as the base of a separate instance vector from when it appears in the main body of the document. This is a key point.
Originally this was used to find words anywhere within a particular subsection of the document. So you could look for title:jones to find the term anywhere in the title, which is slightly different from saying title='jones'. But in the case of a metadata item that holds, for example, just a 2-letter state abbreviation, the fielded query state=CA and the fulltext query state:CA are logically identical, and the fulltext version is much faster!
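Conceptually, those zone-prefixed vectors can be modeled as ordinary tokens whose names simply include the field, as in this invented Python sketch (the state:CA entries are our own addition to echo the example above):

# Zone-specific instance vectors: the zone name is simply folded into the token.
index = {
    "body:abbey":  {1, 4, 9, 12},
    "body:baker":  {19, 42},
    "title:abbey": {12},
    "title:baker": {2},
    "state:CA":    {7, 12, 44},
}

def zone_search(zone, word):
    """A fielded lookup (e.g. title:baker) is just an ordinary token lookup."""
    return sorted(index.get(f"{zone}:{word}", set()))

print(zone_search("title", "baker"))   # [2]
print(zone_search("body", "baker"))    # [19, 42]
print(zone_search("state", "CA"))      # [7, 12, 44] -- same speed as any word lookup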
Extending Zone Tokens to Non-Textual Data Types
These zone-specific tokens can be extended even further. What about something like price >= 5, which certainly looks more database-like, and certainly not textual? More modern search engines can still use the fulltext token index to look this up. Let's say a product catalog had thousands of products under $10, so there were many products at 4.50, dozens at 4.55, some at 4.60, and so on, all the way up to products at the 4.95, 5 and 5.05 price levels. The search engine can quickly scan its compressed list of price tokens, pull out only the tokens of 5, 5.05, 5.10, etc., and look those up just like any other tokens. And for each of those tokens, it can immediately see which documents match.
So, instead of storing price in a tabular field, it becomes a set of instance vectors as well!
“price:4.50” = { 83, 96, 122, ... }
“price:4.55” = { 84, 173, 428, ... }
...
“price:5” = { 12, 44, 87 }
....
“price:5.95” = { ..., 2085, ... }
....
Notice that the price of $5.95 for one of our books, previously stored in a tabular way, is now stored as a token, in the same way as words from the title and body are stored!
Of course representing the numbers as "5.95" is probably not how it would actually be stored in a binary index, we’re speaking conceptually at this point.
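Here is a toy sketch of that range-as-tokens idea, with invented document IDs; a real engine would walk its sorted term dictionary rather than scanning every token the way this little loop does:

# Price values stored as tokens, exactly like words from the title or body.
price_index = {
    "price:4.50": {83, 96, 122},
    "price:4.55": {84, 173, 428},
    "price:5":    {12, 44, 87},
    "price:5.05": {301},
    "price:5.95": {2085},
}

def price_at_least(minimum):
    """Answer price >= minimum by enumerating the qualifying price tokens."""
    hits = set()
    for token, docs in price_index.items():
        if float(token.split(":", 1)[1]) >= minimum:
            hits |= docs
    return sorted(hits)

print(price_at_least(5))   # [12, 44, 87, 301, 2085]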
But this is a good starting point for the main section of our article.
So what are Synthetic Tokens?
Synthetic Tokens are another class of tokens that can be added to the search index, but that are not derived directly from text. Synthetic tokens are generated from other sources of information, such as document attributes, external data, or higher level patterns that are detected in the document.
Some initial examples are in order. A search engine could take the Mime Type of the document, like whether it’s a text file, an HTML file, or possibly a Microsoft Word document, and store that document type as a token. Or the date a document was last modified could be turned into a set of tokens; one special token could represent the year the document was modified, another could represent the calendar quarter and year, such as “2009Q2”. Another synthetic token could encode how RECENTLY the document was modified, having one value if it was modified in the past day, others representing the past week, the past month, past 90 days, the past year, etc. Another synthetic token could be generated based on how popular a document is; perhaps 1-5 votes gives it one value, 6-25 votes another, 26-100 a third value, etc. We’ll get to more advanced examples later on.
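As a concrete sketch, and purely our own invention, here is how those examples might be turned into synthetic tokens for a single document at index time (the token names and bucket boundaries are arbitrary):

from datetime import date

def synthetic_tokens(mime_type, modified, votes, today=date(2009, 8, 1)):
    """Generate synthetic tokens for one document from its non-textual attributes."""
    tokens = [f"mime:{mime_type}"]

    # Calendar-based tokens: the year, and a quarter-of-year token like "2009Q2".
    tokens.append(f"year:{modified.year}")
    tokens.append(f"quarter:{modified.year}Q{(modified.month - 1) // 3 + 1}")

    # Recency buckets: one token per window the document falls inside.
    age = (today - modified).days
    for label, days in [("day", 1), ("week", 7), ("month", 30), ("90days", 90), ("year", 365)]:
        if age <= days:
            tokens.append(f"fresh:{label}")

    # Popularity buckets based on vote counts.
    if 1 <= votes <= 5:
        tokens.append("votes:1-5")
    elif 6 <= votes <= 25:
        tokens.append("votes:6-25")
    elif 26 <= votes <= 100:
        tokens.append("votes:26-100")
    elif votes > 100:
        tokens.append("votes:100+")
    return tokens

print(synthetic_tokens("text/html", date(2009, 6, 15), 18))
# ['mime:text/html', 'year:2009', 'quarter:2009Q2', 'fresh:90days', 'fresh:year', 'votes:6-25']

Each of these tokens would then get its own instance vector in the fulltext index, right alongside the ordinary word tokens.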
And if you’ve dabbled a bit in statistical methods and machine learning, you might also be thinking about “feature extraction” – very good! You’re getting a bit warmer, though there’s a specific spin on that here as well.
Areas of Overlap?
Frequent readers would be correct in thinking that this sounds a lot like Meta Data, which can also encode document dates, Mime Type, popularity, etc. There is definitely some overlap! But tokens are stored in very specific ways in a search index; in some search engines Meta Data could also be stored, at least in part, as tokens, but there’s a deeper difference here we’ll try to explain. And yes, there’s also some overlap with Entity Extraction, another perennial topic in our articles, but again this isn’t quite the same thing. Low level tokens can do things that normal fielded data can’t.
And before we go any further…. Why should you Care?
This is a question we try to think of in almost everything we write. We feel like we owe you that before we drag you deep into this topic.
For simple search applications, Synthetic Tokens may very well be overkill. If you’ve just got a few thousand generic HTML pages, or a simple product database, you may not need this.
However, as search technology is used for more advanced applications, such as research tools, automated classification, or decision-making aids, the engine needs better data to work with. Fancy patents and PhDs can only do so much; fundamentally, advanced features are much easier to implement, and have a better chance of actually working, if you feed the engine better data up front.
Many advanced search features that APPEAR to be “magic” at search time are actually implemented in large part at INDEX time. People see advanced features at search time, so there’s a tendency to think something really special is happening right then and there. Since vendors want to look really incredible, peddling their high-tech “magic”, they have little incentive to correct this perception. But the vast majority of this “magic” happens at index time and is enshrined in these “fulltext indexes” we keep talking about.
In the past we’ve talked about improving data quality, and about things like metadata and entity extraction. Synthetic Tokens are yet another way of enhancing the data that is fed into the search engine’s index, so that it has more to work with. Like that pivotal scene from “The Wizard of Oz”, the fulltext index really is “the man behind the curtain”.
Why not just use Hybrid Search / SQL-like Filters?
Since we freely admit that some of these tokens overlap with Meta Data, which is in many respects an extension of traditional database fields, why not just use that technology? Engines from Verity and Fulcrum offered what we call “hybrid search” back in the 1990’s, where you could combine a fulltext search with a field search. In Verity’s language you could say something like:
<word>(java) <and> <filter>( date > 1-jan-1998 )
The query term “java” would be turned into a word token and compared against the fulltext word index of all documents. But the “date” field would be scanned more like a traditional database search to find documents matching that criterion. Logically, the two sets of matching documents would then be AND’d together to find the intersection: documents that met both criteria. The innovative thing here was that, unlike a traditional database, Verity could still apply relevancy ranking to the matching documents, so that documents where “java” appeared more densely would be ranked closer to the top of the results list.
Many other high-end search engines had similar abilities, including Semio and Ultraseek. The implementations were different; some engines used techniques much closer to their relational database forebears, while others took a more modern approach, including using some aspects of tokenized searching.
But looking back at the example, although documents where the term “java” appears more densely would be ranked higher, no variable relevancy was given to the “date > 1-jan-1998” clause. Documents either met that criteria or they didn’t.
High-end engines of the day also let you sort by date, so if you were really more interested in recent documents about Java, you could bring them to the top by sorting by date in descending order (most recent document first). This was a nice option, but notice that you were now totally trading away fulltext relevancy (based on Java word density) for date sorting, with the most recent document first even if the word Java only appeared once in it.
But what if you wanted to keep the normal fulltext relevancy, and just boost the fresher documents a bit higher? This could be done as well, but it was a bit complicated. Assuming it was January 1st, 2000, you could create a query something like:
<accrue>( [.7]<word>(java), [.3]<accrue>( date>1/1/1998, date>1/1/1999, date>7/1/1999, date>12/1/1999 ) )
Don’t get hung up on the syntax, but notice we’ve got 4 date clauses here, one looking back 2 years, the second looking back 1 year, then 6 months and 1 month. The <accrue> operator will essentially <OR> all of these together, but give a bit of extra credit to documents that match multiple criteria. A document from Feb 1st, 1998 would only match one of the four criteria, whereas a document from “last week” (remember, we said we’re performing this query on January 1st, 2000) would match all four date criteria. Verity Query Language (VQL) mavens could point out other, more precise VQL constructs, but I want to keep the example relatively simple.
So you could approximate folding “freshness” into the query without totally swamping the normal fulltext relevancy of the words you search for. Similar accommodations could be made in other query languages, and could include not only temporal freshness but also geographic proximity, document popularity, etc. But the point is that it gets rather complicated very quickly, to the point where trying to fold in a half dozen different factors would be beyond the patience of most search application developers.
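In spirit, that query is just a hand-built weighted blend of text relevancy and freshness, something like this rough Python sketch (the scores, weights and dates are invented for illustration):

from datetime import date

def blended_score(text_score, pubdate, today=date(2000, 1, 1),
                  text_weight=0.7, fresh_weight=0.3):
    """Approximate the <accrue> trick: blend word relevancy with date freshness."""
    thresholds = [date(1998, 1, 1), date(1999, 1, 1), date(1999, 7, 1), date(1999, 12, 1)]
    # Freshness credit: the fraction of the date clauses this document satisfies.
    freshness = sum(pubdate >= t for t in thresholds) / len(thresholds)
    return text_weight * text_score + fresh_weight * freshness

# A dense-but-old Java document vs. a thinner-but-recent one.
print(round(blended_score(0.9, date(1998, 2, 1)), 3))    # 0.705 (matches 1 of the 4 date clauses)
print(round(blended_score(0.6, date(1999, 12, 20)), 3))  # 0.72  (matches all 4)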
In more recent times, engines like FAST ESP let you add named concepts like “freshness” and “authority” directly into the rank profile. And Google is well known for carefully folding in link based popularity into their rankings. These engines freed developers from the drudgery of creating extremely complex queries, and were another step forward in this process. But you’ll have to trust us that this is not the end of the line. Synthetic tokens allow for even more power and flexibility, though we admit you’ll need to strap on your nerd-cap to really appreciate it.
For all we know Google might already be using techniques very similar to the synthetic tokens we’re discussing here, but of course none of us will ever know for sure!
Things Tokens do More Efficiently!
If your view of a search engine is just, well… for searching, you’re missing out on some of the fun. Search technology, and related technologies, usher in a whole world of new features. If you’ve got a fulltext engine and related tools you can also:
- Automatically cluster documents
- Automatically categorize documents based on a sample training set
- Perform advanced duplicate detection
- Perform spam detection
- Do specialized data mining and Business Intelligence
- Create specialized research robots
- Create search “agents”
- Spot meaningful trends and changes in content and searches
- Perform entity and page recognition
- Do content profiling
- Provide other machine based (statistics based) pseudo-AI tasks
- Incorporate content summarization, consolidation, fact extraction and cross referencing
- Perform sentiment analysis
Most of these techniques rely on the highly specialized search indices and the tokens that live inside them. If your traditional fielded data, or your document Meta Data, isn’t deeply folded into this index, it can’t fully participate in all the fun!
It may be that you don’t need or want to do any of these fancy things; in all fairness, we did make the title of this article a question! Not everyone has these requirements, and basic systems may just have a “queries in -> results out” usage model.
Coming up in Part 2: Specific Applications of Synthetic Tokens
What can you actually do with all this stuff? We've got some great examples including Spam and Duplicate Detection, Web 2.0 Apps, Image Search, Unsupervised Clustering, Actuarial Decision Making, Quality of Healthcare Assessment, Business Listings, Investigations and AML (Anti-Money Laundering), eDiscovery / Legal War Room, IP (Intellectual Property) Research, Judgments and Citations, and Tech Support / Help Desk / Customer Service.