
Do You Need Synthetic Tokens? (parts 1 and 2 combined)

By: Mark Bennett

This month we have a bit of a propeller head article, but we think you’ll enjoy it. Although casual users of search don’t need to worry about the internals, the senior search engine folks at any company should. Like so many other markets, a well informed consumer will ask better questions and cut through vendor hype. Plus, think how smart you’ll sound at the next company party!

Review of Regular Tokens and Evolving Fulltext Indices

To understand where we’re going, you should double check that you understand the existing technology it builds on.

The Search Engine Index

We've talked about how virtually all modern search engines have, at their heart, a set of highly optimized binary files which we call the "fulltext search engine index", or fulltext index for short.  This is because search engines would be very inefficient if they had to scan every line of every document for every search, so instead they create an index of all words in every document, kind of like a librarian's card catalog, and then users' query terms can be quickly looked up in this index.  The specifics of the index are vendor specific and often proprietary.

Contents:
  • Review: Regular Tokens and the Search Index
  • Evolving Data Structures
  • The Traditional Dual-Index Structure
  • Table Oriented Search
  • Zoned Searching
  • Extending to Non-Textual Data Types
  • Synthetic Tokens
  • Areas of Overlap?
  • Why should you Care?
  • Why not just use Hybrid Search / SQL-like Filters?
  • Things Tokens do More Efficiently!
  • Spam Detection
  • Web 2.0 Apps
  • Image Search
  • Duplicate Detection & Unsupervised Clustering
  • Actuarial Decision Making
  • Quality of Healthcare Assessment
  • Business Listings
  • Investigations / AML
  • eDiscovery
  • IP Research, Judgments & Citations
  • Tech Support, Help Desk, Customer Service
  • In Summary
  • Instance Vectors

    So for a given word, say "jacket", the system consults its fulltext index and immediately sees a list of documents that match.  A primitive fulltext index, where the word jacket is in four documents, would literally look like:

    jacket = { 5, 8, 24, 31 }

    This "instance vector" has a head, the actual word or token being looked up, and then a compact list of document IDs that contain that word.  In reality the instance vector would also include information about where, within each document, the term appeared.  You'd need that information to do phrase and proximity searching.  So our instance vector might actually look like:

    jacket = { doc5-offset7, doc5-offset28, doc5-offset271 ... , doc8-offset19, doc8-offset91 ... doc31-offset842 ... }

    For the rest of this article we're going to stick with the simpler form in our examples, but understand that it is a simplification.  The actual index would be stored in an efficient binary format, and would contain additional information such as each word's positions within the documents, but this is the basic structure of a fulltext index.
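
    To make the structure concrete, here is a minimal sketch in Python of a toy inverted index that records both document IDs and word offsets. It is purely illustrative; the sample documents and layout are our own, not any vendor's binary format.

        # A toy inverted index: token -> list of (doc_id, offset) postings.
        # Real engines store this in a compressed, binary, vendor-specific form.
        from collections import defaultdict

        def build_index(docs):
            """docs: dict of doc_id -> text.  Returns token -> [(doc_id, offset), ...]."""
            index = defaultdict(list)
            for doc_id, text in docs.items():
                for offset, word in enumerate(text.lower().split()):
                    index[word].append((doc_id, offset))
            return index

        docs = {
            5: "the red jacket and the blue jacket",
            8: "jacket weather is coming",
            24: "a warm winter jacket",
        }
        index = build_index(docs)
        print(index["jacket"])   # [(5, 2), (5, 6), (8, 0), (24, 3)]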

    And a hint of things to come: we said that an instance vector has, at its head, the word that is being indexed. Those words are actually just the simplest type of a token! Moving on…

    So what are “Regular” Tokens?

    At the simplest level, search engine tokens are usually just the words that are extracted and indexed from the source documents.

    Modern search engines don’t do a linear scan of every document each time a search is submitted – this would be WAY too slow! Instead, they break up documents into words, and then catalog those words into a binary Search Index. That fulltext search index is then consulted whenever a search is received. Each word in the search is looked up in the index to quickly find the documents that contain those terms. When you hear a vendor talk about “indexing documents”, this is generally what they’re referring to.

    Details… details… Punctuation and Stemming

    Of course “the devil’s in the details” – the exact format of the search index is different for every vendor, and usually a closely guarded secret.

    Most engines go beyond just breaking the text up into words. At a minimum, many engines will also store copies of words converted to lowercase, and perhaps replace accented characters with their non-accented forms. Typically the original words are indexed 'as is' as well; this extra processing happens at index time, at the small cost of a slightly larger index.

    And there are other decisions to be made when breaking up a document into its constituent tokens. Should the term “AT&T” be stored as the literal string “at&t”, or maybe without the punctuation, as in “at” and then “t”, or “at (space) t”, or even “att”? Different engines make different choices about which type of token to store. Some engines will store it multiple ways for maximum flexibility.

    And should the word “cars” be stored as the singular form “car”, or perhaps be expanded to include car, automobile and vehicle? Should we normalize the tokens “running” and “ran” to the base form “run”?  We’ve talked about stemming and lemmatization before, so we won’t bore you again. This also has an impact on thesaurus-based search, as the system decides which synonyms of a word to use as tokens.

    Tokenizing Non-Western Text

    And breaking up documents into words, AKA “Tokenization”, gets even trickier in some Asian languages because they don’t always have easily identifiable word breaks. Some vendors treat each Japanese or Chinese graphic symbol as a separate word token. Other engines use a dictionary to break apart the text into proper words, and store those as tokens.

    Even with Western text, some vendors create sub-word tokens based on character N-grams, so that the word “Texas” might be stored as a series of tokens (“te”, “ex”, “xa”, “as”) and (“tex”, “exa”, “xas”) – these types of tokens can help wildcard and regular expression matching, and they also work reasonably well for Asian text, though they are not considered state of the art.

    As an example, users of the open source search engines based on Lucene, such as Solr and Nutch, have two choices for handling Japanese text. They can use the CJKAnalyzer, which uses the NGram technique, or the “SEN” toolkit, which uses a dictionary to do linguistically proper word breaks.
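
    Character N-gram tokenization is easy to sketch. Here is a small Python function that reproduces the “Texas” example above; the lower-casing is our own assumption.

        def char_ngrams(word, n):
            """Generate overlapping character n-grams from a word."""
            word = word.lower()
            return [word[i:i + n] for i in range(len(word) - n + 1)]

        print(char_ngrams("Texas", 2))   # ['te', 'ex', 'xa', 'as']
        print(char_ngrams("Texas", 3))   # ['tex', 'exa', 'xas']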

    But all these choices lead to a specific set of base tokens, and then the list of occurrences for them in each document.

    Recapping Regular Tokens

    So the process of Tokenizing a document means breaking up the TEXT of the document into little chunks (Tokens) that can be efficiently stored in an index. At a later time, queries can then be similarly broken apart and matched against the index. The details are surprisingly varied and often held as closely guarded secrets.

    In summary, tokens are the building blocks of the special index that drives a search engine.

    Search Engines and their Evolving Data Structures

    To review, search engines break documents up into Tokens, and then each unique token forms the head of an Instance Vector. Put all these instance vectors together into a big binary file and you can quickly lookup any words in a query.

    The Traditional Dual-Index Structure

    But there's more to a search index than just these vectors.  Even in the early days of search, deep inside those binary files, there were actually two fundamental subsets of index data.  There were the main "word inversion" instance vectors we just mentioned, and a secondary, more traditional-looking database that held the documents' metadata like the title, author, and the file path or URL to the source document.  The primary use of this secondary database was to provide data for displaying the results.  Although you'd find matching documents using the main index, when you want to display the title and URL, you need to get that data from a more traditional database.  So searches that were just normal words used the primary inverted word index to find the matching documents, then the secondary index to grab the metadata for the results list.

    Primary Word Index:

    This is the primary set of Instance Vectors we talked about.

    abbey = { 1, 4, 9, 12 ... 2279 }

    ...

    baker = { 2, 19, 42, .... 1673 }

    ...etc.... 

    And a more traditional Metadata Table:

    Secondary index with document attributes stored in a table, similar to a traditional database engine.

    ID, author, pubdate, price, title

    1, Jones, 6/1/2005, 0.95, What I did on my Summer Vacation

    2, Stevens, 2/28/2001, 7.95, How to be a Better Baker

    .....

    2085, Sellers, 5/1/2002, 5.95, Follow the Horses

    And just like other traditional database engines, this table-oriented index can have additional indices associated with it to make the row oriented lookup run faster. Even traditional databases employed multiple levels of optimization and indices, and smart search engine vendors borrowed these ideas.

    Table Oriented Search

    Some of these early search engines put that secondary table index to good use.  They allowed you to do searches that looked more like traditional SQL searches, like author='Jones' or pubdate>=1/1/2001.  These fielded searches used the secondary database to find the matching documents, and then used those same tables to pull the fields to display the results.

    These advanced engines even let you mix the two types of queries, so you could do a phrase search for "summer vacation" with a filter of pubdate>=1/1/2001, etc.  These "hybrid searches", as we've always called them, provided the best of both worlds, combining traditional fielded search with fulltext search in a single unified query language.  These secondary metadata tables also allowed for sorting, drilldown and early parametric search.

    But an important distinction here is that if you searched for the word "jones" you would be using the fulltext index.  Whereas if you said author='jones' you'd be using the secondary table oriented index to find matches.  Both would use the secondary tables to bring back titles and such.

    Zoned Searching

    Vendors evolved this concept a bit since searches using the fulltext index were usually much faster.  Queries like "'jones' IN author" or author:jones were now possible.  In this case the tokens were expanded to include the section of the document they came from.  There were several ways to represent this.

    Embedded Zone Information

    abbey = { 1, 4, 9, 12(title) ... 2279 }

    ...

    baker = { 2(title), 19, 42, .... 1673 }

    ...etc....

    Engines that tracked every occurrence of a word within each document, which is required if you want to do phrase searches for example, simply extended that offset information to include zone information.  We won't bore you with a diagram for that!

    But a more modern way to represent this is to include the name of the document section as part of the base instance vector token.

    Zone-Specific Token Vectors

    “body:abbey” = { 1, 4, 9 ... 2279 }

    “body:baker” = { 19, 42, ... 1673 }

    ....

    “title:abbey” = { ... 12 ... }

    “title:baker” = { 2, .... }

    ....

    Notice how instances of the word "baker" that appear in the title are stored as the base of a separate instance vector from when it appears in the main body of the document.  This is a key point.

    Originally this was used to find words anywhere within a particular subsection of the document.  So you could look for title:jones, to find the term anywhere in the title, which is slightly different than saying title='jones'.  But in the case of a metadata item that has, for example, just a 2 letter state abbreviation, the fielded query state=CA and a fulltext query of state:CA are logically identical, and the fulltext version is much faster!
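
    A hypothetical sketch of this idea in Python: prefix each token with the zone it came from at index time, and a fielded lookup becomes an ordinary token lookup. The zone names and sample documents are our own, not any particular vendor's format.

        # Zone-prefixed tokens: a conceptual sketch, not any vendor's actual index layout.
        from collections import defaultdict

        def index_zones(doc_id, zones, index):
            """zones: dict of zone name -> text, e.g. {"title": ..., "body": ...}."""
            for zone, text in zones.items():
                for word in text.lower().split():
                    index[zone + ":" + word].add(doc_id)

        index = defaultdict(set)
        index_zones(2, {"title": "How to be a Better Baker", "body": "flour water salt and yeast"}, index)
        index_zones(19, {"title": "Village Life", "body": "the butcher the baker the candlestick maker"}, index)

        # A fielded lookup is now just an ordinary token lookup:
        print(index["title:baker"])   # {2}
        print(index["body:baker"])    # {19}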

    Extending Zone Tokens to Non-Textual Data Types

    These zone specific tokens can be extended even further.  What about something like price >= 5, something that certainly looks more database-like, and certainly not textual?  More modern search engines can still use the fulltext token index to look this up.  Let's say a product catalog had thousands of products that were under $10.  So there were many products at 4.50, dozens at 4.55, some at 4.60, etc, all the way up to products at the 4.95, 5 and 5.05 price levels.  The search engine can quickly scan the compressed list of tokens, pulling out only the tokens of 5, 5.05, 5.10, etc, and look those up quickly, just like any other tokens.  And for each of those tokens, it can quickly see which documents match.

    So, instead of storing price in a tabular field, it becomes a set of instance vectors as well!

    “price:4.50” = { 83, 96, 122, ... }

    “price:4.55” = { 84, 173, 428, ... }

    ....

    “price:5” = { 12, 44, 87 }

    ....

    “price:5.95” = { ..., 2085, ... }

    ....

    Notice the price of $5.95 for one of our books, which was previously stored in a tabular way, is now stored as a token, in the same way as words from the title and body are stored!

    Of course representing the numbers as "5.95" is probably not how it would actually be stored in a binary index, we’re speaking conceptually at this point.

    But to be clear, when we consider the search:

    price >= 5

    There are now two distinct ways an engine can accomplish this:

    1. Use the secondary table-based index, like a traditional database would; OR

    2. Use the primary token-based index, find all token roots that match the criterion, and collect all of their document offsets.

    To actually display matching documents in a results list, including the title and price, etc, most engines will fall back to the secondary table index. A few engines can actually interleave displayable metadata inside the fulltext index, or have various options for how to configure it.
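
    Here is the second strategy sketched in Python, using the conceptual price tokens from above. A real engine would walk a sorted term dictionary rather than scanning every token, but the idea is the same; the structures shown are illustrative only.

        # Strategy 2 above: treat "price >= 5" as a union of the postings for
        # every price:* token whose numeric value satisfies the criterion.
        index = {
            "price:4.50": {83, 96, 122},
            "price:4.55": {84, 173, 428},
            "price:5":    {12, 44, 87},
            "price:5.95": {2085},
        }

        def price_at_least(index, minimum):
            matches = set()
            for token, doc_ids in index.items():
                if token.startswith("price:") and float(token.split(":", 1)[1]) >= minimum:
                    matches |= doc_ids
            return matches

        print(sorted(price_at_least(index, 5)))   # [12, 44, 87, 2085]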

    Note that we’ve shown the colon used both in queries and as part of a token. It’s understood that in a query, title:winter means “look for the term ‘winter’ somewhere in the title”, whereas in the fulltext index title:winter represents a base token for the word winter that was found in any title, which is followed by the list of offsets. Engines vary on the punctuation they use in their query language and stored tokens, but colons are commonly used in both.

    In the broadest of definitions in this article, a token of the form “price:4.95” could be considered a "synthetic token" as well, although in practice nobody calls it that. But this is a good starting point for the main section of our article.

    Synthetic Tokens

    Synthetic Tokens are another class of tokens that can be added to the search index, but that are not derived directly from text. Synthetic tokens are generated from other sources of information, such as document attributes, external data, or higher level patterns that are detected in the document.

    Some initial examples are in order. A search engine could take the Mime Type of the document, like whether it’s a text file, an HTML file, or possibly a Microsoft Word document, and store that document type as a token. Or the date a document was last modified could be turned into a set of tokens; one special token could represent the year the document was modified, another could represent the calendar quarter and year, such as “2009Q2”. Another synthetic token could encode how RECENTLY the document was modified, having one value if it was modified in the past day, others representing the past week, the past month, past 90 days, the past year, etc. Another synthetic token could be generated based on how popular a document is; perhaps 1-5 votes gives it one value, 6-25 votes another, 26-100 a third value, etc. We’ll get to more advanced examples later on.
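
    As a minimal sketch of what such a token generator might look like, here is some Python; every field name, token name and bucket boundary below is invented for illustration.

        from datetime import date

        def synthetic_tokens(doc, today=date(2009, 6, 1)):
            """Turn document attributes into synthetic tokens.  All names are invented."""
            mod = doc["modified"]
            tokens = ["mimetype:" + doc["mime_type"],
                      "year:%d" % mod.year,
                      "quarter:%dQ%d" % (mod.year, (mod.month - 1) // 3 + 1)]
            age_days = (today - mod).days
            for label, days in [("day", 1), ("week", 7), ("month", 30), ("quarter", 90), ("year", 365)]:
                if age_days <= days:
                    tokens.append("modified_within:" + label)
            for label, low, high in [("1-5", 1, 5), ("6-25", 6, 25), ("26-100", 26, 100)]:
                if low <= doc["votes"] <= high:
                    tokens.append("popularity:" + label)
            return tokens

        doc = {"mime_type": "text/html", "modified": date(2009, 5, 20), "votes": 12}
        print(synthetic_tokens(doc))
        # ['mimetype:text/html', 'year:2009', 'quarter:2009Q2', 'modified_within:month',
        #  'modified_within:quarter', 'modified_within:year', 'popularity:6-25']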

    And if you’ve dabbled a bit in statistical methods and machine learning, you might also be thinking about “feature extraction” – very good! You’re getting a bit warmer, though there’s a specific spin on that here as well.

    Areas of Overlap?

    Frequent readers would be correct in thinking that this sounds a lot like Meta Data, which can also encode document dates, Mime Type, popularity, etc. There is definitely some overlap! But tokens are stored in very specific ways in a search index; in some search engines Meta Data could also be stored, at least in part, as tokens, but there’s a deeper difference here we’ll try to explain. And yes, there’s also some overlap with Entity Extraction, another perennial topic in our articles, but again this isn’t quite the same thing. Low level tokens can do things that normal fielded data can’t.

    And before we go any further…. Why should you Care?

    This is a question we try to think of in almost everything we write. We feel like we owe you that before we drag you deep into this topic.

    For simple search applications, Synthetic Tokens may very well be overkill. If you’ve just got a few thousand generic HTML pages, or a simple product database, you may not need this.

    However, as search technology is used for more advanced applications, such as research tools, automated classification, or decision making aids, the engines need better data to work with. Fancy patents and PhDs can only do so much; fundamentally, advanced features are much easier to implement, and have a better chance of actually working, if you feed the engine better data up front.

    Many advanced search features, that APPEAR to be “magic” at search time, are actually implemented in large part at INDEX time. People see advanced features at search time so there’s a tendency to think something really special is happening right then and there. Since vendors want to look really incredible, peddling their high tech “magic”, they have little incentive to correct this perception. But the vast majority of this “magic” happens at index time and is enshrined in these “fulltext indexes” we keep talking about.

    In the past we’ve already talked about improving data quality, and things like metadata and entity extraction. Synthetic Tokens are yet another way of enhancing the data that is fed into the search engine index, so that it has more to work with. Like that pivotal scene from “The Wizard of Oz”, the fulltext index really is “the man behind the curtain”.

    Why not just use Hybrid Search / SQL-like Filters?

    Since we freely admit that some of these tokens overlap with Meta Data, which is in many respects an extension of traditional database fields, why not just use that technology? Engines from Verity and Fulcrum offered what we call “hybrid search” back in the 1990’s, where you could combine a fulltext search with a field search. In Verity’s language you could say something like:

    <word>(java) <and> <filter>( date > 1-jan-1998 )

    The query term “java” would be turned into a word token and compared against the fulltext word index of all documents. But the “date” field would be scanned more like a traditional database search to find documents matching those criteria. Logically, the two sets of matching documents would then be AND’d together to find the intersection, documents that met both criteria. The innovative thing here was that, unlike a traditional database, Verity could still apply relevancy ranking to the matching documents, so that documents where “java” appeared more densely would be ranked closer to the top of the results list.

    Many other high end search engines had similar abilities as well, including Semio and Ultraseek. The implementations were different; some engines used techniques much closer to their relational database forebears, while others took a more modern approach, including using some aspects of tokenized searching.

    But looking back at the example, although documents where the term “java” appears more densely would be ranked higher, no variable relevancy was given to the “date > 1-jan-1998” clause. Documents either met that criteria or they didn’t.

    High-end engines of the day also gave you the ability to sort by date, so if you were really more interested in recent documents about Java, you could bring them to the top, sorting by date in descending order (most recent document first). This was a nice option, but notice that now you were totally trading in fulltext relevancy (based on Java word density) for date sorting, with the most recent document first, even if the word Java only appeared once in the most recent document.

    But what if you wanted to keep the normal fulltext relevancy, and just boost the fresher documents a bit higher? This could be done as well, but it was a bit complicated. Assuming it was January 1st, 2000, you could create a query something like:

    <accrue>( [.7]<word>(java), [.3]<accrue>( date>1/1/1998, date>1/1/1999, date>7/1/1999, date>12/1/1999 ) )

    Don’t get hung up on the syntax, but notice we’ve got 4 date clauses here, one looking back 2 years, the second looking back 1 year, then 6 months and 1 month. The <accrue> operator will essentially <OR> all of these together, but give a bit of extra credit to documents that match multiple criteria. A document from Feb 1st, 1998 would only match one of the four criteria, whereas a document from “last week” (remember, we said we’re performing this query on January 1st, 2000) would match all four date criteria. Verity Query Language (VQL) mavens could point out other, more precise VQL constructs, but I want to keep the example relatively simple.

    So you could approximate folding “freshness” into the query, without totally swamping the normal fulltext relevancy of the words you search for. Similar accommodations could be made in other languages, and could include not only temporal freshness, but also geographic proximity or document popularity, etc. But the point is that it gets rather complicated very quickly, to the point where trying to fold in a half dozen different factors would be beyond the patience of most search application developers.
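
    For readers who prefer code to query syntax, here is a rough Python sketch of the same idea: blend a document's normal text relevancy with credit for each look-back window it falls inside. The 70/30 split mirrors the example above; everything else is our own simplification.

        from datetime import date

        def blended_score(text_score, pub_date, today=date(2000, 1, 1)):
            """Roughly 70% text relevancy, 30% freshness, where freshness is the fraction
            of the four look-back windows (2 years, 1 year, 6 months, 1 month) that the
            document's publication date falls inside."""
            thresholds = [date(1998, 1, 1), date(1999, 1, 1), date(1999, 7, 1), date(1999, 12, 1)]
            freshness = sum(1 for t in thresholds if pub_date > t) / len(thresholds)
            return 0.7 * text_score + 0.3 * freshness

        print(round(blended_score(0.9, date(1998, 2, 1)), 3))    # older doc, dense "java": 0.705
        print(round(blended_score(0.6, date(1999, 12, 20)), 3))  # fresher doc, thinner match: 0.72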

    In more recent times, engines like FAST ESP let you add named concepts like “freshness” and “authority” directly into the rank profile. And Google is well known for carefully folding in link based popularity into their rankings. These engines freed developers from the drudgery of creating extremely complex queries, and were another step forward in this process. But you’ll have to trust us that this is not the end of the line. Synthetic tokens allow for even more power and flexibility, though we admit you’ll need to strap on your nerd-cap to really appreciate it.

    For all we know Google might already be using techniques very similar to the synthetic tokens we’re discussing here, but of course none of us will ever know for sure!

    Things Tokens do More Efficiently!

    If your view of a search engine is just, well… for searching, you’re missing out on some of the fun. Search technology, and related technologies, usher in a whole world of new features. If you’ve got a fulltext engine and related tools you can also:

    • Automatically cluster documents
    • Automatically categorize documents based on a sample training set
    • Perform advanced duplicate detection
    • Perform spam detection
    • Do specialized data mining and Business Intelligence
    • Create specialized research robots
    • Create search “agents”
    • Spot meaningful trends and changes in content and searches
    • Perform entity and page recognition
    • Do content profiling
    • Provide other machine based (statistics based) pseudo-AI tasks
    • Incorporate content summarization, consolidation, fact extraction and cross referencing
    • Perform sentiment analysis

    Most of these techniques rely on the highly specialized search indices and the tokens that live inside. If your traditional fielded data, or your document Meta Data, isn’t deeply folded into this index, it can’t fully participate in all the fun!

    It may be that you don’t need or want to do any of these fancy things… in all fairness we did make the title of this article a question! Not everyone has these requirements - basic systems may just have a “queries in -> results out” type of usage model.

    Spam Detection

    We’re going to start with spam detection - systems that try to spot junk email. In most forms these are not technically search engines, but advanced spam systems have made effective use of synthetic tokens, and there’s more overlap with search than you might initially think.

    Early Rule-Based Spam Detection

    A quick bit of history. Early spam detection systems just used simple rules. They’d look at the domain of the From address, or the IP address of the sending computer, or other electronic hints of spam. These rules worked OK for a while, but spammers got wise and worked around all these rules.

    Really Old Rules – Easily Thwarted

    • The From address, as in “we know this guy sends a lot of spam” – so spammers stopped using their own account
    • The domain of the From email address – spammers forged email addresses from other sites
    • Look at the number of people on the To and CC lines, because spammers send to lots of users – the servers were not fooled by BCC for very long either – the workaround here was to send lots of little spam transactions instead of one giant transaction.

    Some examples of evolving rules:

    • Look at the IP of the server that connected to send, the so called “black lists” – spammers learned to use lots of different computers to send out their spam; meanwhile innocent users who happened to be on big ISPs that had spammers at one time found themselves blocked.
    • Look for emails with specific suspicious attachments – spammers quickly changed, but legitimate users found their emails blocked.
    • Demand pre-verified communications or other “white lists” of approved senders. - Spammers were able to forge the email addresses of large legitimate emailers. Meanwhile legitimate users rarely bothered to set up approved lists and found their legitimate email blocked.
    • Encryption and authenticated senders, using digital signatures and other watermarks to prove the sender’s identity. This works when used, but is rarely practiced by casual legitimate users, so can’t be used as a primary filter.

    The Arms Race that was Never Won

    These rule based systems were difficult to keep up with. Administrators had to study the spam, adjust or create new rules, deploy them, and then find that spammers had simply changed their methods 2 weeks later.

    Early Bayesian Classifiers Used Text

    A better approach was to look at the actual text of the message, the actual verbiage being used. Users give the system feedback on which emails are, and are NOT, spam, and over time it learns and gets better and better at it. The core of these systems is Bayesian mathematics, a branch of statistics used in machine learning. If you’ve used SpamBayes on the PC, or some of the Mac email clients, you’ve probably used a Bayesian system. These early Bayesian spam systems looked at the text of an email as tokens, just like a search engine would. Emails with words like “free”, “Viagra” and “Nigerian” tended to be spam, and emails with your boss’s name and text relating to company meetings and budgets tended to not be spam. If the spammers get wise and change their wording, users provide immediate feedback, and the spam detection automatically adapts to the new language.

    So far, so good. Instead of creating a fixed set of rules, Bayesian filtering looks at the text of emails and quickly adapts. Spammers tried to misspell words, but the systems quickly adapted to even that practice, with millions of users training the systems to recognize the new obfuscated spellings.

    But spammers got wise to even that, and started avoiding text completely, or including blobs of reasonable looking text. They would send emails that had images attached, and the images contained the advertisement. The images would be blurred or speckled in such a way that OCR could not decipher them. You may remember the fad of speckled investment pitches from a few years back.

    Old Spam Rules Re-Cast as Tokens!

    This is finally where our Synthetic Tokens enter the story. The old spam rules were partially revived. But instead of going back to fixed logic, these rules were used to inject additional tokens into the Bayesian learning algorithm. If an email had an attachment, an extra token was created to represent that. If the email had nobody on the To or CC list, and was sent via BCC, this generated another token. If the ratio of hyperlinks to non-hyperlink text passed a certain threshold, another token would be inserted.

    The point is that these rules were employed to generate new tokens, in addition to the tokens extracted from the text. These additional tokens could then be included in the AUTOMATIC rules the Bayesian system would come up with as it was trained. If spammers switched to having plain looking text, but embedded their spam in a PDF attachment, then a PDF_attachment token would be injected into the system. As soon as enough users flagged the new spam, the presence of the PDF_attachment token would automatically evolve into a statistically important rule.

    However, it wasn’t a fixed rule. Many employees would get work related emails with PDF attachments as well. Even if the system accidentally flagged a few of them as being “spam”, users could correct the system and tell it that those messages were not spam. The system could automatically re-balance the PDF_attachment token against other tokens, such as particular email authors or names in the body of the email. The point is that, at no time did somebody have to create a rule about PDF attachments, and nobody else had to go back in and undo that rule when legitimate emails were falsely flagged. The fact that a PDF file was attached was simply available to the algorithm, along with all the other attributes and text of the emails, and the algorithm was free to adjust over time based on new inputs.
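
    A minimal sketch of this kind of token injection, in Python. The attribute names and thresholds are invented; the point is simply that the synthetic tokens ride along with the word tokens, and whatever Bayesian (or other) classifier you use sees them all on equal footing.

        def email_tokens(email):
            """Combine word tokens from the body with synthetic tokens describing the message.
            The field names and thresholds are invented for illustration."""
            words = email["body"].split()
            tokens = [w.lower() for w in words]
            if email["attachments"]:
                tokens.append("has_attachment")
                for mime in email["attachments"]:
                    tokens.append("attachment:" + mime)        # e.g. attachment:application/pdf
            if not email["to"] and not email["cc"]:
                tokens.append("bcc_only")                      # sent via BCC only
            if email["link_count"] / max(len(words), 1) > 0.5:
                tokens.append("high_link_ratio")               # mostly hyperlinks, little plain text
            return tokens

        msg = {"body": "see the attached offer", "attachments": ["application/pdf"],
               "to": [], "cc": [], "link_count": 3}
        print(email_tokens(msg))
        # ['see', 'the', 'attached', 'offer', 'has_attachment',
        #  'attachment:application/pdf', 'bcc_only', 'high_link_ratio']

    The combined token list is what gets trained on; rules about attachments or link ratios then emerge, or fade away, from user feedback rather than being hand-coded.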

    Tokens Enable Algorithms

    The point here is not how cool Bayesian algorithms are, (though they certainly are cool!) but that synthetic tokens allowed for metadata and other non-textual attributes to be mixed in with regular tokens derived from the text. Your search engine, regardless of whatever fancy algorithm it uses, could probably benefit from a similar boost.

    Examples of Spam Detection Synthetic Tokens

    Here are some examples of email attributes that could be encoded as tokens, along with the text of the email message. No one single rule is important here, they are all available to lend statistical credence or suspicion to an email, right beside the actual text. As users identify spam, the system can reweight all factors automatically in near real time.

    • Sender
    • Domain of sender
    • Receiver
    • Number of receivers
    • Number of people on To, CC, BCC
    • Whether the receiver is actually listed in the To header
    • IP Address of connecting server
    • Mime types
    • Number of attachments
    • Mix and ratio of attachment types
    • Ratio of text to images
    • Ratio of link text to plain text
    • Image dimensions and ratios
    • Ratio of text to JavaScript
    • Number or ratio of spelling mistakes
    • Ratio of punctuation to text
    • Ratio of word breaks to text
    • Combinations of foreground and background text colors (sometimes to hide spam, or show text that appears legitimate)
    • Obfuscated text with background images
    • Character encoding and encoding errors
    • Presence or ratio of non-primary-language Unicode characters

    There are many more that truly smart people have thought of, to be sure.

    Web 2.0 Apps

    One of the hallmarks of Web 2.0 applications is the participation of eager users, to rate and tag content.  A web page may have good text that a search engine can digest, but popularity among users is an important additional piece of data.

    A search engine could use some set of rules based on popularity to re-order the list of documents, shoving popular documents up a bit higher in the list. But that type of logic would be an overlay, a separate mechanism, a different part of the engine.

    But if some gauge of popularity were turned into one or more tokens and added to the search engine’s main index then its main underlying engine could directly include popularity alongside all of the other fulltext rules. The boost provided by popularity wouldn’t need to be a separate system that comes along after the fact to simply shuffle documents around.

    Other types of tokens could be injected. One easy candidate would be the tags that users specifically assign to documents, such as on Flickr. These are already in the form of text, so very easy to add. In fact, when a user does a search and then clicks on a document, they are implicitly associating their search terms with that document, what we call implicit tagging; those words could also be added.

    However, we believe that the words provided by users, either explicitly by tagging or implicitly by click-throughs, should NOT be mixed in with the document’s original text to the point where they cannot be distinguished. So the word budget might appear in the text of a document and generate the token “budget”, whereas if a user tags it with budget a token of usertag:budget could be added. If a user searched for budget, and then clicked on a document, a token of clickthrough:budget might be added. Some fulltext systems allow for different sections or zones of a document, so that the token “budget” can be added to either the “body” zone, or “usertags”, or “clickthrough” zone. These are implementation details that are logically equivalent. The point is that tokens arising from specific user tagging should probably be ranked much higher than implicit tokens added by clickthroughs. This is similar to how engines can rank a term appearing in a title or bolded heading higher than if the term appears in the plain text of the document.

    We’re not advocating any particular relative set of rankings here, the point is simply that words arriving into the document from different places should probably be represented by different tokens, so that you can change your mind as needed!  Tokens = flexibility.
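
    As a rough Python sketch of this separation (the zone names and weights are ours, purely for illustration):

        from collections import defaultdict

        index = defaultdict(set)

        def add_tokens(doc_id, zone, words, index):
            for w in words:
                index[zone + ":" + w.lower()].add(doc_id)

        # The word "budget" reaches document 42 by three different routes:
        add_tokens(42, "body", ["quarterly", "budget", "review"], index)   # the document's own text
        add_tokens(42, "usertag", ["budget"], index)                       # explicit user tag
        add_tokens(42, "clickthrough", ["budget"], index)                  # implicit, from a search click

        # Hypothetical zone weights; tune these later without touching the index:
        weights = {"title": 3.0, "usertag": 2.0, "body": 1.0, "clickthrough": 0.3}

        def score(word, doc_id, index, weights):
            return sum(wt for zone, wt in weights.items()
                       if doc_id in index[zone + ":" + word])

        print(score("budget", 42, index, weights))   # 2.0 + 1.0 + 0.3 = 3.3

    Because the provenance is preserved in the token itself, re-weighting user tags versus click-throughs later is just a scoring change, not a re-indexing project.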

    Synthetic Tokens and Image Search

    Image search continues to be something that even the big guys haven’t “solved” yet. Most well known image search tools still look for text surrounding images, such as alt tags, textual metadata within the image header, the name of the image file itself, or text occurring on the web page that linked to the image. Web 2.0 applications add user submitted textual tags. Popularity of an image can certainly be stored as a token as well. This is better than nothing of course, but these techniques aren’t looking at the actual pixels within an image.

    In terms of looking at the actual pixels within an image, there’s been some progress. One simple approach we’ve seen is to just look at the majority of colors in each image and categorize them by hue and intensity. Facial recognition has been used in security applications for some time, with mixed success. License plate recognition is well understood. OCR (Optical Character Recognition) has been able to extract text from clean paper scans for years, and at least one company now allows photos to be OCR’d for text. On the specialized image processing side of things some consumer grade cameras can recognize human faces these days. Some even claim to distinguish between adult and children’s faces, and can tell when a person smiles. And Apple’s latest iPhoto version can recognize the faces of individual friends and family members in different photos and automatically associate them, a pretty impressive feat. But these are very specialized applications.

    Am I now going to claim that Synthetic Tokens bridge this gap and make all things possible? No - of course not! This is all about substantially lowered expectations – synthetic tokens would help make some small incremental progress in this area, an iterative approach with reasonable expectations that starts with very simple tasks, but one that would be well matched with open source engines such as Lucene / Solr and Nutch. Let’s look a bit further.

    The basic idea is various attributes of images would be pre-calculated and stored as tokens in a fulltext search engine. These tokens could be searched directly, or used in clustering and other machine learning techniques as previously described. New images could have similar tokens calculated, and then used as a seed search.

    People involved in image processing and machine learning will immediately recognize this as “feature extraction” or “feature vectors”. This is a good starting point for those in the know, but our slight extension to this is to package these various feature vectors up in such a way that they can be stored and managed as tokens in a fulltext search engine, right alongside all of the text based tokens. In Solr and Lucene, quite a few pages of documentation already talk about encoding other non-text data types as retrievable tokens, such as dates and floating point numbers. Unlike commercial vendors, since it is open source you’re free to look at the various implementations. This makes for interesting reading even if you ultimately use a commercial engine.

    The advantage of cohabitating image features in a search engine is that fulltext search software is getting quite a bit of attention and funding again, and includes quite a bit of the other machinery that a good image search and classification system would need.

    For those readers with a more traditional database and text retrieval background, we’ll now talk a bit more about feature extraction and feature vectors, and give some trivial examples of how they might be transcoded into tokens.

    As a first, perhaps overly trivial, example let’s talk about color. The shade of color in a digital photo can be thought of in terms of its hue, saturation and brightness, instead of the normal red, green and blue we’ve all heard so much about. The hue is of particular interest; it can be expressed as an angle on a virtual circle from 0 to 360 degrees. Zero degrees can be considered red, 120 degrees green, and 240 degrees blue. On this scale angles between about 30 and 90 would give you various shades of orange and yellow, and at 180 degrees you’d have cyan.

    Suppose as a first pass we tabulated all of the pixels in a photo and selected the largest batch with approximately the same hue. We could actually assign that hue to the nearest named color on a standardized color wheel, such as “red” or “magenta”, and emit that as a token in the “color” meta data field, so we’d have color:orange as a token. Not particularly useful, but if you were looking for pictures with lots of “red” it would do the trick.

    For the predominant color in the photo, we could emit the color token more than once, which would give a search engine the ability to do rudimentary relevancy ranking. We could also include tokens for a few of the other colors as well. We’d scale the number of tokens logarithmically, based on the percentage of the photo that had the color. So a picture of Fall foliage in New England would automatically generate color:{orange orange orange orange orange yellow yellow brown green blue}. Still a pretty simple example, but even this is slightly more useful than the previous paragraph’s example, and is starting to combine pixel values with search engine relevancy.

    The color tokens I’ve shown so far are easy to understand because they use English words. However it’s also possible to generate searchable tokens that, although not so human friendly, are perfectly acceptable to a search engine. Lucene and Solr do this with dates and times now – they take dates and encode them into a normalized numerical representation. They’d be tough to read if you looked at them, but they are only used internally. I’ll use the term “encoded tokens” to denote tokens that are not really meant to be directly viewed or input by a human, but that can still be easily stored and searched.

    With encoded tokens in mind, we can take our color example a bit further. Some colors are very saturated, like bright red, whereas other colors are more pastel, such as pink. In digital photos we talk about “saturation”, and this is expressed as a percentage. Further, we could force all colors into “buckets” that would be the nearest 5 degrees on our color circle, and the nearest 5 percent on the saturation scale. We could then encode them, similar to how dates are encoded in Lucene, and even vary the number of times the token appears based on density, as we did before. So a photo with bright red and pale blue might generate tokens like encoded_color:{0_95 0_95 0_90 5_95 5_90 … 240_20 245_15 …} Sure, these are very cryptic to a human; we would not expect a person to ever search for these specific tokens. But if they gave us a photo that we ran similar rules on, we’d get similarly formatted tokens, which could then be fed directly into the search engine. There is actually a service that does something very similar for public domain photos on Flickr – it’s marketed to publishers who are looking for stock photos to match certain color schemes. The point being that, as simple as this is, it’s still useful!
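
    Here is a small Python sketch of both flavors of color token, using only the standard library and a list of RGB pixel values. The color names, bucket sizes and log-style scaling are our own choices, meant only to mirror the examples above.

        import colorsys
        from collections import Counter

        NAMED_HUES = [(0, "red"), (30, "orange"), (60, "yellow"), (120, "green"),
                      (180, "cyan"), (240, "blue"), (300, "magenta"), (360, "red")]

        def nearest_named(hue_degrees):
            return min(NAMED_HUES, key=lambda pair: abs(pair[0] - hue_degrees))[1]

        def color_tokens(pixels):
            """pixels: list of (r, g, b) tuples, 0-255.  Emits named-color tokens repeated
            roughly logarithmically (one per digit of the pixel count), plus encoded
            hue/saturation bucket tokens.  Token names and bucket sizes are made up."""
            names, buckets = Counter(), Counter()
            for r, g, b in pixels:
                h, s, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
                names[nearest_named(h * 360)] += 1
                buckets["%d_%d" % (round(h * 360 / 5) * 5, round(s * 100 / 5) * 5)] += 1
            tokens = []
            for name, count in names.items():
                tokens += ["color:" + name] * len(str(count))
            tokens += ["encoded_color:" + bucket for bucket in buckets]
            return tokens

        # A toy "photo": 1000 bright red pixels and 10 pale blue ones.
        pixels = [(255, 0, 0)] * 1000 + [(200, 200, 255)] * 10
        print(color_tokens(pixels))
        # ['color:red', 'color:red', 'color:red', 'color:red', 'color:blue', 'color:blue',
        #  'encoded_color:0_100', 'encoded_color:240_20']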

    Some other simple tokens we could generate could be based on the dimensions. Instead of just recording the raw dimensions as strings, say width:2048 height:1536, we could generate meta tokens like “resolution:hires”, “medres”, “lowres”, ratio-based tokens like “portrait”, “landscape”, “square”, “tall”, “panoramic”, etc, or even numeric ones such as overall_ratio:4to3, overall_ratio:5by9, etc. Sizes could be binned into the nearest number of megapixels, and so on. The point is that the dimensions would be stored in several different, searchable ways.
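
    A quick sketch of that idea; the thresholds and token names below are arbitrary placeholders, not a standard of any kind.

        def dimension_tokens(width, height):
            """Emit shape and resolution tokens instead of (or alongside) raw pixel sizes."""
            tokens = []
            megapixels = width * height / 1_000_000
            if megapixels >= 5:
                tokens.append("resolution:hires")
            elif megapixels >= 1:
                tokens.append("resolution:medres")
            else:
                tokens.append("resolution:lowres")
            ratio = width / height
            if abs(ratio - 1) < 0.05:
                tokens.append("shape:square")
            elif ratio > 2:
                tokens.append("shape:panoramic")
            elif ratio > 1:
                tokens.append("shape:landscape")
            else:
                tokens.append("shape:portrait")
            tokens.append("megapixels:%d" % round(megapixels))
            return tokens

        print(dimension_tokens(2048, 1536))   # ['resolution:medres', 'shape:landscape', 'megapixels:3']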

    You’ll notice some examples have a prefix and a colon before the token’s value. This is actually quite common in high end search engines. It allows us to formulate tokens that have the same value but a different context. As a contrived example, the word “blue” could apply to an image’s color, or a word in the title, or even its mood (if humans had tagged it as such). We can handle this with color:blue, title:blue and mood:blue, respectively. This gets even more important with encoded tokens or numeric values; a document can have a creation date, a last-modified date, and a publication date, and we want to be able to clearly separate these. In various contexts this can be referred to as Metadata, fielded data, document attributes or zone searches. Modern engines often simply add a prefix to the token’s value to denote the context.

    Back to image processing and search. There are even more interesting techniques that can generate tokens. For example, a technique called a Hough Transform can give you numerical measures for how many specific horizontal and vertical lines appear in a photo. A different transform can give numerical scores for how many circles and arches there are. Other numerical techniques can easily give contrast ratios and measures of symmetry. The extension here is to encode those measures of vertical and horizontal lines, or circles and arches, into tokens that can coexist in a search engine index. FFT and Wavelet techniques can spot texture and gradients; these could also be encoded in various forms.

    Even more complex features can be extracted. As we said before, face recognition has entered the realm of digital cameras and desktop software. It’s relatively easy for a computer to detect two spots that are sized and spaced appropriately to possibly be a set of eyes. A simple extension can try to locate a third point that might represent a nose. These two classes of matches would also be encoded as various tokens. Of course such simple algorithms will be easily fooled by a polka dot shirt – you’d get a ton of false positives – a form of Electronic Pareidolia to be sure. But if you combined a search for these tokens with skin tones made possible by the color oriented tokens we talked about before, with the appropriate weighting, you’d likely start to find some faces. The good news is that UI studies show users are much more patient with false image matches than they are with false textual matches – probably because they can more quickly scan a page of thumbnails.

    I believe this is the way forward for open source image search efforts. It allows different contributors to clearly focus on the technologies they are strongest in, and then easily combine their efforts into a larger system. Some teams will focus on image analysis and token generation. Others will focus on search relevancy and clustering algorithms. Other groups can focus on the substantial UI challenges for web and mobile phones.

    The toolkits to do this type of image processing also exist in the open source arena. SciPy, a scientific extension to the Python language, has many advanced statistical features. And the excellent MatLab open source clone Octave has improved image and graphics support. MatLab code examples show up in many image processing papers, and now Octave’s open source license extends that reach. Of course mixing Python or Octave’s core Fortran libraries with Lucene/Solr’s Java implementation presents some technical challenges.

    Mahout is an open source project, run by a rather hard core team of enthusiasts, that is taking machine learning much further, beyond search. You can also find excellent machine learning lectures on YouTube; Stanford has uploaded all the excellent lectures from CS229 by Andrew Ng, and has also posted the class materials.

    These are just some examples, but the seeds for marrying search engines with image processing and machine learning via feature extraction and synthetic tokens are certainly available. If expectations were kept LOW enough, some real progress could be made.

    Duplicate Detection and Unsupervised Clustering

    Clustering similar documents is a subset of fulltext search and machine learning that vendors, entrepreneurs and open source enthusiasts keep talking about, over and over again. Though it excites us too, we almost never get asked for this by clients.

    So give your clustering search engine a million emails, and it’ll automatically sort them into some type of (hopefully) logical grouping. Pick any particular email, either by browsing or from the results of a search, and the engine can show you all other documents that are related.

    If you think about it, advanced duplicate detection can be viewed as a special form of clustering. If I cluster together documents with similar words, then certainly identical documents should wind up in extremely close proximity to each other. Using these powerful statistical techniques is superior to older primitive literal comparisons because it isn’t foiled by small differences in the documents, such as different headers or footers, and can even spot subsets of duplicate content, such as a section of plagiarized paragraphs.

    We won’t delve into the deep statistics involved in performing this magic, nor try to explain the apparent customer apathy, in this article. However, if document meta data and other generated features are stored as synthetic tokens, then they too can play a role in these clustering algorithms. You can’t cluster on things that don’t make it into the vector space!
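
    To make that last point concrete, here is a toy Python example of cosine similarity over token counts, where invented synthetic tokens (mimetype, author, quarter) sit in the same vector as the word tokens and therefore influence which documents cluster together.

        import math
        from collections import Counter

        def cosine(a, b):
            """Cosine similarity between two token Counters."""
            dot = sum(a[t] * b[t] for t in a)
            norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
            return dot / norm if norm else 0.0

        # Two near-duplicate memos and one unrelated page; the synthetic tokens
        # (invented names) sit in the same vector as the word tokens.
        doc_a = Counter("quarterly budget review for the widget division".split()
                        + ["mimetype:application/pdf", "author:jones", "quarter:2009Q2"])
        doc_b = Counter("quarterly budget review for the gadget division".split()
                        + ["mimetype:application/pdf", "author:jones", "quarter:2009Q2"])
        doc_c = Counter("cafeteria menu for next week".split()
                        + ["mimetype:text/html", "author:smith", "quarter:2009Q2"])

        print(round(cosine(doc_a, doc_b), 2))   # 0.9   near-duplicates land very close together
        print(round(cosine(doc_a, doc_c), 2))   # 0.22  unrelated content stays far apart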

    Actuarial Decision Making

    Insurance brokers need to make efficient decisions about who gets insurance coverage and who doesn’t, or what a fair rate should be. We understand this could be a sensitive area for some readers and may seem a bit “cold and calculating”. As a small fig leaf, it is nice to see safe drivers and healthy living people get low rates, even if they’ve made 1 or 2 claims over the years. Factors that could be encoded as tokens might include:

    • Length of current coverage
    • Percentage of adult life covered
    • Absolute and relative claim activity for recent and lifetime
    • Size of family and coverage group
    • Education, health club membership and credit score
    • Financial arrests and bankruptcies
    • And of course pre-existing conditions and hospitalizations

    Of course insurance companies already have sophisticated systems that take all these things into account and probably thousands of other factors we can’t imagine. And search engines are not as adept at linear regression and other traditional statistical techniques as a pure statistics based decision support system. So if all the data is known, using a search engine might be a bit of a stretch.

    But it turns out companies often don’t have all the information they need. Suppose an agent receives 1,000 pages of poorly scanned PDF files representing a patient’s medical records for the past seven years, and having a data entry clerk key in all the key values is not financially justified. Automation can be applied to attempt to extract all of the variables, but it will likely need some supervision and will never achieve 100% accuracy.

    In an imperfect world search engines might provide an acceptable compromise between completeness and timeliness. They routinely handle a mix of structured and unstructured data. Search engine relevancy algorithms routinely handle unnormalized and sparse vector spaces. Search engines also often include entity extraction engines that can assist in this data extraction task, which could then feed a token synthesis process.

    Quality of Healthcare Assessment

    As we said, search engines and machine learning are being re-tooled to work in other non-traditional applications. We just talked about one application using medical records, actuarial insurance coverage decisions, but let’s consider a completely different medical application, possibly consuming the very same medical records, but with a very different goal in mind.

    Imagine you worked for a government agency and were tasked with assessing the quality of healthcare delivered to patients receiving benefits. You might want to search for specific types of issues, or find other cases similar to a selected case.

    Traditionally this would be done with a formal statistics engine or perhaps an expert system. But these systems have rather heavy setup and staffing requirements; the types of people who can design and maintain them don’t grow on trees. And those systems also require high quality, carefully input data sets; you might be surprised at how often that type of data is NOT readily available. How could a search engine be leveraged?

    We’ll assume the normal Metadata would be turned into tokens, such as patient name, age, ethnic group, geographic region, and of course medical diagnostic codes! Some of this data could be pulled from a database, other items could be extracted from text using entity extraction.

    But additional tokens that *might* relate to “quality” could also be generated and included. We might not know in advance which ones will turn out to be the most critical, so we’d cast a broad net by introducing as many tokens as we can think of.

    Diagnosis codes and specialist codes make great tokens, assuming they can be normalized.

    • Tokens for heart attack, stroke, diabetes, specific cancers, etc.
    • Tokens for cardiologist, oncologist, neurologist, dietitian, etc.
    • Tokens for CAT scans, specific blood tests, EKG’s, EEG’s, etc.

    Notice with these tokens it’s OK to repeat them. If the patient has had 3 CAT scans in the past 5 years, it’s fine to have 3 tokens for it. Fulltext search engines automatically understand and balance different word densities all the time – by turning these codes into tokens they inherit that ability.
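
    A hypothetical sketch of such a token generator in Python, showing both the repeated code tokens just described and one example of the bucketed ratios discussed next; all record fields and bucket boundaries are invented.

        def patient_tokens(record):
            """Turn a (hypothetical) patient summary into tokens.  Codes repeat once per
            occurrence so normal term-density scoring applies; ratios are bucketed."""
            tokens = []
            for code, count in record["procedures"].items():
                tokens += ["procedure:" + code] * count
            for code in record["diagnoses"]:
                tokens.append("diagnosis:" + code)
            per_year = record["admissions_past_5_years"] / 5.0
            for label, upper in [("0-1", 1), ("2-3", 3), ("4+", float("inf"))]:
                if per_year <= upper:
                    tokens.append("admissions_per_year:" + label)
                    break
            return tokens

        record = {"procedures": {"cat_scan": 3, "ekg": 1},
                  "diagnoses": ["diabetes"],
                  "admissions_past_5_years": 8}
        print(patient_tokens(record))
        # ['procedure:cat_scan', 'procedure:cat_scan', 'procedure:cat_scan', 'procedure:ekg',
        #  'diagnosis:diabetes', 'admissions_per_year:2-3']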

    Numeric quantities and RATIOS can be lumped into various buckets, and then each entry could generate one or more tokens:

    • Number of hospital admissions per year, for past year, for past 5 years, etc.
    • Emergency room admissions per year, past year, past 5, etc.
    • Average days per hospital stay
    • Total number of doctors seen
    • Number of doctors per year
    • Doctors seen per hospital admission

    Other types of numbers and ratios might help spot abuse or neglect:

    • Numbers of tests performed
    • Length of medical records per incident, or per day institutional care
    • Numbers of prescriptions issued
    • Number of pharmacies used by the patient
    • Accepted and rejected claims

    Of course government agencies, pharmaceutical companies, insurance companies and medical researchers have been using advanced statistics and expert systems for years. A search engine based re-implementation could be used as an adjunct to an existing system, perhaps helping to manage textual and scanned medical records that are not accurately keyed into the main database. Search engines do bring powerful document processing filters and entity extraction to the table. And whereas a statistical system requires a highly trained statistician to ask numerical questions, the hybrid system we’ve described is also accessible to non-statisticians using the now widely familiar search engine paradigm. If certain new trends are uncovered by searchers, those could perhaps then be migrated into the formal statistical / decision rule set.

    A Few Other Ideas for using Synthetic Tokens for Business Problems

    Here are some other examples where synthetic tokens can be folded in with traditional textual search to create more flexible and extensible systems. In all cases it’s assumed normal text fields will also be included. Rather than repeat ourselves, we’ll focus on what type of synthetic tokens could be generated.

    The trick is replicating the metadata in various forms and storing the encoded tokens in the fulltext index. Even seemingly simple items like dates and locations can be represented in multiple ways. Then, instead of just using traditional fielded filters, more advanced algorithms can be applied. Advanced online services are likely already doing this in some form or another.

    Business Listings

    Some pretty obvious ones as a first example.

    • Standard business categories and synonyms
    • Hierarchal product and service categories
    • Popularity
    • Recentness
    • Multiple geographic characterizations

    At some point engines that don’t gracefully combine these different types of data will be gone.

    Investigations / AML (Anti-Money Laundering)

    Specialized search engines are already being used for financial and criminal investigations.

    • Membership in associations and groups
    • Association with other individuals, possibly to second or third level (stored as different token types of course, so for example they could be weighted differently)
    • Multiple geographic characterizations
    • Amounts of transactions and routes of payments
    • Forms of payment and denominations
    • Number of arrests, pleads, convictions, highest bail posted
    • Civil and fiduciary actions, settlements, etc.

    eDiscovery / Legal War Room

    In previous articles we’ve talked about the problem companies face when they are involved in a lawsuit. Suddenly millions of documents are subpoenaed and produced, then the “needle in the haystack” search for incriminating or defensive evidence begins. And as you can imagine specialized search applications are already in use.

    Ideas for synthetic tokens might include:

    • Advanced, multiple timeline tokens, corresponding to calendar days, months, quarters and years, project timeline, tokens representing time relative to particular events named in the lawsuit, etc.
    • Number of persons who are believed to have received or viewed the document, or knew of its existence. Email tends to have more of this data built in, but it could possibly be gathered elsewhere or entered manually.
    • Highest level of management with believed knowledge.
    • People and companies referenced. Again, an extension to general Meta Data and Entity Extraction.
    • Amounts of money mentioned.
    • Type of document (we mean “contract” vs. “email” vs. “press release”, and not .doc vs. PDF)
    • Reviewed by paralegals, partners, etc. A standard eDiscovery field, just tweaked to be available for relevancy and other statistical calculations.
    • Etc.

    Intellectual Property, Research, Judgments and Citations

    Entity Extraction is being used in legal, patent, financial and investigatory systems. As an example, a legal database can recognize the pattern of a citation in one legal decision referencing another previous, precedent setting decision. Patent applications can reference other patents, medical texts cite other texts, and financial disclosures have their own patterns of cross referencing. Customized search applications leverage the engine’s effortless handling of various text formats and its agility at parsing to find these references easily. These systems make knowledge workers in the respective fields more efficient.

    A higher order benefit can be built up by backtracking along all these cross references. Instead of looking at a legal decision to see the precedents it cites, the earlier precedent itself can be considered. A core legal decision can be the basis for many other decisions. Normally that earlier decision couldn’t “predict the future”; it has no links to the later decisions that will cite it. But a thorough bidirectional citation database makes exactly that view possible. Which patents have been cited the most by other patents? Which research publications have been pivotal in advancing other research? And of course this can be extended to cover multi-hop chains of citations, with some weighting added in. This reminds me a bit of the modern “Bacon number” phenomenon, but that too would be another whole article.
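
    Here’s a small sketch of that backtracking step, inverting the forward citations we can extract from each document into an inbound-citation map. The case IDs are invented:

        from collections import defaultdict

        # Forward citations found at parse time: decision -> decisions it cites.
        cites = {
            "case_C": ["case_A", "case_B"],
            "case_D": ["case_A"],
            "case_E": ["case_A", "case_C"],
        }

        # Invert them; the earlier decision could never have listed these itself.
        cited_by = defaultdict(list)
        for source, targets in cites.items():
            for target in targets:
                cited_by[target].append(source)

        # "Which decisions have been cited the most?"
        for case, sources in sorted(cited_by.items(), key=lambda kv: -len(kv[1])):
            print(case, len(sources), sources)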

    These systems could be built with traditional databases, but again fulltext technology already has many of the core tools. In reality you’d want a hybrid system. The fulltext Entity Extraction is great for finding links, but you’d want a database somewhere to manage them all efficiently.

    Revisiting the main thrust of this article, what tokens can this citation data produce?

    • Direct normalization and encoding of the citation itself; you’d want to represent both the source and the target of the citation, each in the other’s token stream.
    • The links might include some qualifiers, such as “direct proof” vs. “overlapping subject area”, etc.
    • Also the timeliness of the links might be good to know. Was it days, months or years between the original publication and appearance of this citation of it?
    • I would also consider, for efficiency’s sake, encoding second and third generation links in the documents, in a different field that can be ranked at a lower weight (see the sketch after these lists). A purist would say that this denormalizes the data, but it is a common practice in fulltext systems.
    • Numbers of outbound and inbound citation links
    • Aspects of findings or conclusions (“success/inconclusive”, in favor of the defendant or plaintiff, prior art judgments, etc.)

    And of course you’d have access to encode the standard Meta Data:

    • Dates in general, possibly encoded to several resolutions.
    • Authors
    • Publications
    • Geographic Region
    • Jurisdiction, university / research group, or patent office
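
    Pulling a few of those bullets together, the per-document token payload might look something like the sketch below. The field names and prefixes are invented; the point is that direct links and the denormalized second-generation links live in separate fields so the engine can weight them differently:

        def citation_token_doc(doc_id, outbound, second_gen):
            # Build the synthetic-token portion of one document's index record.
            return {
                "id": doc_id,
                "cites_direct": [f"cites_{t}" for t in outbound],      # full weight
                "cites_2nd_gen": [f"cites2_{t}" for t in second_gen],  # lower weight
                "cite_count_out": f"outlinks_{len(outbound)}",
            }

        print(citation_token_doc("case_E", ["case_A", "case_C"], ["case_B"]))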

    Tech Support, Help Desk, Customer Service

    We saved one of our favorites for last. Tech Support and Customer Service are near and dear to many of us here at NIE. Of course a well-trained and professional staff is paramount. But efficient information systems can also make a big difference, and fulltext search ought to play a central role. The first really complex search application I ever worked on was “TSDB”, the “Tech Support Database”, which pulled in and cross-linked records from call tracking and bug tracking systems, documentation, FAQs, faxes, newsletter articles and a number of other sources. Not very impressive these days, but this was more than 15 years ago.

    Back then the obvious benefit was to customers; Tech Support folks could respond more quickly, and happier customers presumably bought more (or paid for the software faster!). It also made those valuable employees a bit happier, since they could find things more quickly. And then of course there’s the presumed ROI (Return on Investment): being able to handle more customers without linearly scaling up your staff.

    But now there’s a potential for higher level benefits at a more strategic level. Properly designed incident and defect tracking systems employing advanced fulltext engines can help spot important business trends that might not be obvious to the front line reps. Unlike traditional database centric BI (Business Intelligence) and Data Mining applications, fulltext based systems have convenient access to all the verbiage being used. On the public Internet we talk about Web 2.0 and how all those dynamic blogs can track the pulse of what the public is thinking about. But in the Enterprise, you have a lot more content being constantly updated in tech support call logs and bug reports, not to mention all the emails going back and forth. In the Enterprise you’ve always had LWP (Lightweight Publishing), it just hasn’t got all the hype that public blogs and wikis do!

    You want to know what’s really going on? Sure, sales figures and Marketing surveys are important, but if the terms “crash” and “bluescreen” start popping up in your Tech Support calls and search activity logs right after a new release, you’ve likely got a problem! Bugs and calls can be augmented with lots of extra Meta Data, which can also be represented as tokens (a sketch follows the list below). A systems architect might be skeptical at first, because it’s hard to imagine somebody explicitly filtering or searching on an item, or even clicking on it in a set of facets. But the point is that advanced algorithms might find statistical significance in a bunch of variables that nobody really thought to look at. It’s hard to know ahead of time, so in this case “more” might be better.

    Examples of tokenizable Meta Data:

    • Number of calls
    • Number of repeat calls
    • Number of escalations
    • Most recent version deployed relative to currently shipping version
    • Other account characteristics: length of relationship, number of installs, attended training, etc.
    • Total revenue from each customer, and recent revenue
    • Interaction with specific employees
    • And of course we’ve already mentioned the text in call logs, emails, bug reports and those ever important search activity logs. So you get those regular tokens for free!
    • Fulltext add-ons can also spot the Sentiment of past and recent customer interactions.
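
    As a sketch of how a few of those bullets might become tokens, here is one way to flatten a support/account record into bucketed synthetic tokens. The field names, bucket edges and sentiment values are invented for illustration:

        def support_tokens(record):
            # Flatten one support/account record into bucketed synthetic tokens.
            tokens = []
            calls = record.get("calls_90d", 0)
            tokens.append("calls_" + ("none" if calls == 0 else
                                      "few" if calls <= 3 else "many"))
            lag = record.get("versions_behind", 0)
            tokens.append(f"version_lag_{min(lag, 5)}")   # cap the bucket
            if record.get("escalated"):
                tokens.append("escalated")
            tokens.append("sentiment_" + record.get("sentiment", "unknown"))
            return tokens

        print(support_tokens({"calls_90d": 5, "versions_behind": 2,
                              "escalated": True, "sentiment": "negative"}))
        # ['calls_many', 'version_lag_2', 'escalated', 'sentiment_negative']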

    Ironically your database administrator might have lots more ideas about data that could be exported to your search engine’s index. In larger organizations fulltext BI would be an adjunct to traditional Data Mining, not a replacement.

    And of course once you make these amazing discoveries, what do you actually DO about it? Well… that’s fodder for another day!

    In Summary

    Despite the flashy name, Synthetic Tokens are not “magic”. They’re simply another way of storing Meta Data, deeper inside the fulltext index.

    As fulltext engines break out of their old “type a query in, get a results list back” usage model, they are being used in very nontraditional ways. A search engine can be reworked into a BI engine or a clustering engine, or endowed with advanced machine learning capabilities.

    In these high-end systems, the algorithms like to consider all available data, not just text. So getting your other metadata and entities into the fulltext index as tokens makes that data available to these new techniques.

    What happens after that point depends a lot on the business problems you’re trying to solve, and the particular uber-nerds you happen to hire. “Search” is just the beginning.