
Hooking into the Document Pipeline the FAST way

Last Updated Mar 2009

By Mark Bennett & Miles Kehoe, New Idea Engineering, Inc. - Volume 3 Number 2 - February - March 2006

Introduction to the FAST Document Pipeline

We've said for years that the secret to great search relevance is great indexing. The problem is that with many enterprise search engines it takes a good deal of effort to intercept and improve the document indexing process. Sometimes you want to add extra metadata from an external source, or clean up the field values that will be populated automatically during indexing. Some products, like Hummingbird's Fulcrum, let developers write C code that sits in the indexing process; and Ultraseek has long had its patches.py hook for adding custom code to pre-process each document. But generally it's been a challenge.

FAST has gone one step further in creating the "Document Pipeline" architecture, a series of Python processing stages that can be inserted virtually anywhere along the Document Pipeline to manipulate data, to add look-aside content, or even just to keep an eye on the indexing process. The bad news is that there are not many Python experts in the corporate world, so it's largely left to FAST and its partners - including long-time Python coders New Idea Engineering - to create custom stages for the Document Pipeline.

The good news is that FAST has included a Document Pipeline stage that will let you run just about any program, in any language, in the pipeline. With the dubious name ExternalDataFilterTimeout, this little-known Document Pipeline stage lets you plug in just about any executable - C, C++, Java, even shell scripts and batch files. It provides the FAST Document Pipeline with the kind of "external system call" capability present in many programming languages.

An Overview of the FAST Indexing Architecture

The overall FAST architecture, shown in Figure 1, gives you an idea of how FAST approaches indexing and searching and where the Document Pipeline fits in. The Document Pipeline in FAST is roughly analogous to the gateways and filters of the Autonomy/Verity K2 engine, or to patches.py in Ultraseek.

Figure 1 - How Content Becomes Searchable in FAST

The figure presents much more than simply the pipeline, but we'll have to come back to those other elements of the technology in future issues.

Customizing the Pipeline

Now that you see where the Document Pipeline fits in the overall indexing architecture, we can get back to the issue at hand: creating a pipeline stage that doesn't need to be written in Python by using the standard ExternalDataFilterTimeout stage template. The FAST description for ExternalDataFilterTimeout is:

Process an attribute with an external program, subject to a timeout.

The external program is run for each document, using the attribute named in the Input configuration parameter as input, and placing the program output in the attribute named in the Output configuration parameter.

The external program must terminate with a successful exit code (0), within the number of seconds configured in the Timeout configuration parameter. If the configured timeout is exceeded, the program will be forcibly terminated (TerminateProcess()/SIGKILL) and the pipeline will be aborted.

Input and output to the external program will be provided through temporary files on disk. The formatting codes %(input)s and %(output)s must be present in the Command configuration parameter, and will be substituted with the names of the input and output temporary files.

If the external program requires a temporary working directory, one will be created and passed to the program if the %(tmpdir)s formatting code is present in the Command configuration parameter.

Standard Unix shell input and output redirection characters may be used in the Command configuration parameter, for example if the external program takes input from stdin and sends output to stdout. Errors written to stderr will be picked up and logged if the process exits with an unsuccessful (non-zero) exit code.

That pretty much says it all, doesn't it?
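Before we walk through our own example below, here is a bare-bones sketch of what an external program driven by this stage might look like. We've written it in Python purely for illustration - any language works - and the script name and Command line in the comments are our own invention, but the file-handling contract is exactly what the description above spells out.

# augment_field.py - a hypothetical external program for the
# ExternalDataFilterTimeout stage, assuming a Command parameter of:
#   python augment_field.py %(input)s %(output)s
# FAST replaces %(input)s and %(output)s with temporary file names.
import sys

def main():
    input_path, output_path = sys.argv[1], sys.argv[2]

    # The value of the attribute named in the Input parameter arrives
    # as the contents of the input temporary file.
    value = open(input_path).read()

    # Whatever we write to the output temporary file becomes the value
    # of the attribute named in the Output parameter.
    out = open(output_path, "w")
    out.write(value.upper())      # trivial transformation for illustration
    out.close()

    # Exit 0 before the configured Timeout, or the pipeline is aborted.
    sys.exit(0)

if __name__ == "__main__":
    main()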

Making It Work

You may find that the above description, extracted verbatim from the FAST console, is all the information you need to make it work. But if, like us, you find the text a bit confusing, perhaps an example will make it clear.

In this example, we're going to perform a simple task: using a standard Windows batch file, we want to insert the value of an environment variable into one of the unused 'generic' fields defined in the default index profile. We could pick any environment variable - PATH, USERNAME - but for this simple example, let's use the name of the server we are indexing on: COMPUTERNAME.

Getting started

To do the job, we will follow these steps:

  1. Create the batch file
  2. Create a document pipeline stage
  3. Create a document pipeline using the new stage
  4. Create a collection that uses our new pipeline and submit a single document
  5. Verify that our stage works by using the FAST Query Tool
It may sound complicated, but it's not rocket science.

1. Create the batch file

Our first step is to create a simple batch file that will echo the value of the environment variable. This is a trivial batch program, listed in Figure 2. Call it SYSNAME.BAT and save it in the BIN directory within the FAST home directory - check the environment variable FASTSEARCH if you're not sure.

@echo off
echo %COMPUTERNAME%

Figure 2 - SYSNAME.BAT

You could leave the first line off - the real meat is the second line. But as you will see, if you omit the @ECHO OFF line, you will populate the field with the Windows CMD prompt as well as the system name.

In a CMD window, test your new batch file to verify that it displays something like the output shown in Figure 3.

C:\datasearch\bin>SYSNAME
FIREBOLT
C:\datasearch\bin>

Figure 3 - Command Line Output from SYSNAME.BAT

Now we're ready to move on.

2. Create a document pipeline stage

Start the FAST Administrative Console and click on Document Pipeline. If you see a link that reads "Advanced Mode" to the right of the Overview label, click it so you can see all of the stages in the document pipeline we will create.

Scroll down the list of pipelines and stages until you see one named ExternalDataFilterTimeout in the section labeled Default Stages. Click the plus-sign (+) to the right of the row, and you will see a screen like the one in Figure 4. If you are using InStream or the newest version of FAST ESP you may see a slightly different screen.

Figure 4 - Create Pipeline Stage

Fill out the form as follows:

  • Name: myExternalStage
  • Input: size
  • Command: c:\datasearch\bin\sysname.bat <%(input)s >%(output)s
  • Output: generic1
The name is easy - and it makes sense that we want to put the output from the script into the generic1 field. We've found that you need to provide a non-null field name for the Input to make the stage work, and in almost any data source you should find that size is defined and populated. By the way, we've found that URL, COLLNAME and even CONTENTID are not available at this stage of the pipeline.

The syntax for the command looks odd, but you need to match it exactly. When FAST executes this stage, it writes the input field value to a temporary file and expects the output field value to come back in a different temporary file. FAST passes your program the names of those input and output files, and the syntax its Python code is looking for is exactly what is shown above.
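If the %(input)s and %(output)s notation looks familiar, that's because it is ordinary Python dictionary-style string formatting - the pipeline itself is written in Python, and it fills in the temporary file names for you. The rough sketch below shows the idea; the temporary file names are made up for illustration, since the real ones are generated by FAST at run time.

# How the Command string gets its file names filled in - a rough
# illustration of Python dictionary-style formatting, not FAST's source.
command_template = r"c:\datasearch\bin\sysname.bat <%(input)s >%(output)s"

substitutions = {
    "input":  r"c:\temp\docproc_in.tmp",    # hypothetical temp file holding the 'size' value
    "output": r"c:\temp\docproc_out.tmp",   # hypothetical temp file FAST reads into 'generic1'
}

print(command_template % substitutions)
# c:\datasearch\bin\sysname.bat <c:\temp\docproc_in.tmp >c:\temp\docproc_out.tmp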

When you have created your new stage, click the Submit button at the lower right of the form, then click OK to return to the Document Pipeline screen.

3. Create a document pipeline using the new stage

Now that you have a stage defined, we need to create a document pipeline to experiment with.

At the top of the Document Pipeline screen you'll see the Generic pipeline. Click the plus-sign (+) at the right and you'll see a screen like that shown in Figure 5. Depending on whether you are using FDS, InStream, or ESP you may see a slightly different screen, likely with a pull-down list of stages.

Figure 5 - Create Document Pipeline

Name your new document pipeline - perhaps myExternalDP. Select the stage you created in step 2 above, and click the right arrow to move it into the list of stages in the current pipeline. Use the up and down arrows on the right to move your stage between the FastHTMLParser and the TeaserGenerator. (Hint: If you do not see a long list of stages, you are not in 'Advanced' mode. Click Cancel, then at the top of the Document Pipeline screen click the Advanced Mode link and start at the beginning of this step.)

When you have named your document pipeline and inserted your custom stage in the right place, click 'Submit' and then 'OK'. Time for the next step.

4. Create a collection that uses our new pipeline and submit a single document

Now we've created a custom stage, and added that stage to a new document pipeline. It's time to create a collection using our new document pipeline.

Click Create Collection, then enter a name and brief description for your new collection. Click 'Next'.

You may need to select a Cluster if you have more than one; generally, though, you will see a prompt saying that webcluster is your only option. Click 'Next'.

At the Pipeline Configuration screen, pull down the list to see the available document pipelines; yours should be visible. Select it, click 'add selected', and then click 'Next'. If your document pipeline is not in the pull-down list, go back to step 3 to make sure you created the pipeline properly; when you view the Document Pipeline screen you should see your pipeline in the list.

Now you have the option to select Data Sources - do not! Rather than select one, click 'OK' to return to the collection Details screen. Click the 'Add Document' link, and enter a URL for a web page you have access to - your company home page, or a public site you know. Click 'Next' and FAST will spider your URL and attempt to parse it. If you did everything right, you will see a status indicating the document was submitted. Return to the Collection Menu and wait until it shows the collection is searchable; you may need to refresh the screen every now and then. Once you see that the collection is searchable, go on to the final step.

5. Verify that our stage works by using the FAST Query Tool

Once you see the collection is searchable, click on the Search View link at the top of the screen. Select the collection, press the Advanced bar and select 'All Fields'. Press 'Submit Query' and you should see a page like that shown in Figure 6.

Figure 6 - Query Results

Scroll down if necessary and confirm that the system name is displayed in the generic1 field - success!

Summary

Here you've seen how you can use something as simple as a Windows batch file to customize your FAST document pipeline. You can accomplish the same thing in just about any command-line-enabled language: C, Perl, WSH, or any of the popular shells on Unix or Linux. When you want to use Python, you don't need to execute an external program at all; you can integrate it right into the pipeline natively. We'll make a point of showing you how in an upcoming issue.
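In the meantime, to illustrate the "any language" point, here is a sketch of what a stand-alone replacement for SYSNAME.BAT might look like as an external Python script, run through the very same Command line with stdin/stdout redirection. The script name is our own, and it assumes a Python interpreter is installed on the indexing server.

# sysname.py - a hypothetical cross-platform stand-in for SYSNAME.BAT,
# run with a Command such as:
#   python c:\datasearch\bin\sysname.py <%(input)s >%(output)s
# It ignores stdin and writes the machine name to stdout, which the
# redirection captures into the Output attribute (generic1 in our example).
import os
import socket

# COMPUTERNAME only exists on Windows; fall back to the hostname elsewhere.
print(os.environ.get("COMPUTERNAME", socket.gethostname()))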