How do I convert K2 BIF to IDOL IDX files?

Ask Doctor Search: Converting K2 BIF to IDOL IDX Files

A customer writes:

This month a subscriber asks: We are in the process of moving from K2 to IDOL, but we would like to move some of our K2 collections directly to IDOL to understand how the native IDOL relevance may impact our existing ranking algorithms. Is there a way to do this quickly, even before we are actually ready to perform a full index run within IDOL?

Dr Search replies:

Dr. Search replies: Since the K2 and IDOL collections are not binary compatible, you cannot simply move a collection between K2 and IDOL. However, there is a way to use text-based interchange formats to create an IDOL collection directly.

In fact, like many search technologies, both K2 and IDOL support text interchange formats. In K2, the interchange format is the 'bulk insert file', or 'BIF'. In IDOL, the corresponding file is called an "IDX file". Like the collections themselves, these files are not 100% compatible; some assembly is necessary. But it is far easier to create a process to convert BIF to IDX than to attempt to convert binary files that are not fully documented.

The bulk insert file format has been part of Verity since its earliest days, even pre-dating K2 itself. Figure 1 shows the format of a very simple bulk insert file.

  title:      bennett
  vdkvgwkey:  /docs/M_Bennett_Projects.doc
  <<EOD>>
  title:      Projects_Fall_2007
  vdkvgwkey:  /docs/C_Allen_proj.doc
  <<EOD>>
  title:      kehoe - booth layout
  vdkvgwkey:  /docs/Show_Booth_06.xls
  <<EOD>>

Figure 1 : Simple BIF File

The general format is a field name followed by the value for the field, separated by a colon character. There should be an entry for each field you want to populate, as well as the special key field VdkVgwKey. This special field uniquely identifies the document, whether it holds the full filename or URL of the document to index, or even a database key value. What this key contains is a function of the gateway in use. For our purposes here, file or web based documents are much easier than database or custom gateway applications.

Note: It is possible to use the DOC_FN field in some cases instead of the VdkVgwKey, but even then the K2 indexing process will define a VdkVgwKey.

If you are including more than one document in a bulk insert file, specify the record separator constant "<<EOD>>" after the last field definition for each document.
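
The first step of any conversion is reading the BIF back into records. Here is a minimal Python sketch of that step, not an official Verity parser. It assumes every field fits on one "name: value" line and that field names can be lowercased safely; the function name parse_bif is made up for illustration.

```python
def parse_bif(text):
    """Split BIF text into a list of {field: value} dicts, one per document."""
    records = []
    current = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "<<EOD>>":
            # Record separator: close the current document.
            if current:
                records.append(current)
            current = {}
            continue
        # Split on the first colon only, so URLs and drive letters survive.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "name: value" line; ignored in this sketch
        # Assumption: field names are treated as case-insensitive.
        current[name.strip().lower()] = value.strip()
    if current:  # tolerate a final record with no trailing <<EOD>>
        records.append(current)
    return records


SAMPLE = """\
title:      bennett
vdkvgwkey:  /docs/M_Bennett_Projects.doc
<<EOD>>
title:      Projects_Fall_2007
vdkvgwkey:  /docs/C_Allen_proj.doc
<<EOD>>
"""

for record in parse_bif(SAMPLE):
    print(record["vdkvgwkey"], "->", record["title"])
```

If your BIFs carry multi-line values or inline document text, this sketch would need to be extended before you rely on it.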

Some other things to remember about BIFs

Fields defined in the bulk file take precedence over fields defined in the actual document. The document records in Figure 1 define a value for TITLE, so even if the Office files have a title defined in their properties, the title in the BIF will be the one stored in the K2 collection. On the other hand, if a field has no value in the bulk file, the standard filters will apply, and if the field is defined in the document properties, that value will be indexed into the collection.
Even when you use Windows, you can use the forward slash "/" character as part of a path. However, if you use the standard Windows backslash separator, you must escape each path separator with an extra backslash, i.e., docs\\Show_Booth_06.xls. Remember, BIFs normally use the VdkVgwKey to define the actual document.
A BIF is submitted to the standard K2 indexing tools, normally mkvdk or the K2 index API. You can choose to provide the full document text within a BIF, but normally bulk files are used to define metadata field contents and the K2 indexer processes the full document. Thus, with BIFs, you do not need to create the filtered stream of text to include in the bulk file.
Finally, any metadata fields defined in the style files will be populated normally. Note: Defining a field in the BIF will override index-time field extraction.
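
If you generate BIFs from a script, the path and precedence rules above can be sketched in a few lines of Python. This is an illustration only: the helper names escape_bif_path and format_bif_record are hypothetical, and the choice to skip empty fields (so the standard filters can supply them) is an assumption about how you want the BIF to behave.

```python
def escape_bif_path(path, forward_slashes=True):
    """Make a Windows path safe for a BIF, using either style described above."""
    if forward_slashes:
        return path.replace("\\", "/")
    # Otherwise keep backslashes, doubling each one.
    return path.replace("\\", "\\\\")


def format_bif_record(key, fields):
    """Build one BIF record; 'fields' maps field names to values."""
    # Fields with no value are left out, so the value found in the
    # document properties (if any) is indexed instead.
    lines = [f"{name}: {value}" for name, value in fields.items() if value]
    lines.append(f"vdkvgwkey: {escape_bif_path(key)}")
    lines.append("<<EOD>>")
    return "\n".join(lines) + "\n"


print(format_bif_record("docs\\Show_Booth_06.xls",
                        {"title": "kehoe - booth layout", "author": ""}))
```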

Extracting data from a Collection

K2 has a command line tool called extract that can dump the contents of a collection into a full bulk insert file; you can choose to include the text of the actual document if you want. The April 2005 Enterprise Search newsletter article "Using the EXTRACT Utility" describes this handy tool in detail.

IDOL IDX Files

Like K2, IDOL supports an interchange format which Autonomy calls an IDX file. Like BIFs, you can specify field name/value pairs, although the syntax is slightly different. Unlike BIFs in K2, an IDX file must contain all of the content you want to index for a given document, including both metadata and full-body text. Figure 2 shows a simple IDX file.

  #DREDBNAME Collection Name
  #DREFILENAME C:\Autonomy\OracleFetch\Example\49\Example1.htm
  #DREFIELD DRETITLE="Sample Document"
  #DRECONTENT
  This is the full document
  #DREENDDOC

Figure 2: Simple IDX File

You can see from Figure 2 that each field line in an IDX file starts with a # character, and that #DREFIELD entries use an = to split the name/value pair. IDX files also use the #DREENDDOC tag rather than the <<EOD>> tag to end a record.
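
To make the mapping concrete, here is a hedged Python sketch that turns one parsed BIF record into an IDX record laid out like Figure 2. The function name bif_record_to_idx and the FIELD_MAP table are assumptions made for this example, the document text must be supplied by the caller because a BIF normally does not carry it, and quote characters inside values are not escaped. Check the tag layout against your IDOL documentation before using it.

```python
# Assumed mapping from K2 field names to IDOL field names (illustrative only).
FIELD_MAP = {"title": "DRETITLE"}


def bif_record_to_idx(record, content, db_name):
    """Format one BIF record (a dict) plus its text as an IDX record."""
    lines = [
        f"#DREDBNAME {db_name}",
        f"#DREFILENAME {record['vdkvgwkey']}",
    ]
    for name, value in record.items():
        if name == "vdkvgwkey":
            continue  # already written as the file name above
        idx_name = FIELD_MAP.get(name, name.upper())
        lines.append(f'#DREFIELD {idx_name}="{value}"')
    lines.append("#DRECONTENT")
    lines.append(content)
    lines.append("#DREENDDOC")
    return "\n".join(lines) + "\n"


record = {
    "title": "Sample Document",
    "vdkvgwkey": "C:\\Autonomy\\OracleFetch\\Example\\49\\Example1.htm",
}
print(bif_record_to_idx(record, "This is the full document", "Collection Name"))
```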

In the K2 world, each document is fully defined within a single "record" for a particular filename or URL, terminated by the <<EOD>> tag. However, IDOL allows multiple records, each terminated by the #DREENDDOC tag, to refer to the same physical document. If you want to use more than a single record to describe a single document, you need to add a #DRESECTION tag and define a unique section number for each part of the document.

This comes in handy when you learn that unlike BIFs, an IDX file has to contain the full text of the document to be added in the #DRECONTENT field. Each record ("section" in IDOL parlance) permits no more than 500 words, so you need to split the document into multiple sections; Figure 3 shows the first two sections that you might use to index this article.

  #DREREFERENCE=E:\entsrch\2008\numbner_01\drsearch.html
  #DREFIELD authortitle="Dr "
  #DREFIELD authorname1="Search"
  #DREDATE 2008/01/06
  #DRETITLE "Ask Dr Search"
  #DRETYPE text
  #DRESTORECONTENT yes
  #DRESECTION 0
  #DRECONTENT
  This month a subscriber asks: We are in the process of moving from
  K2 to IDOL, but we would like to move some of our K2 collections
  directly to IDOL to understand how the native IDOL relevance may
  impact our existing ranking algorithms. Is there a way to do this
  quickly, even before we are actually ready to perform a full index
  run within IDOL?
  #DREENDDOC
  #DREREFERENCE=E:\entsrch\2008\numbner_01\drsearch.html
  #DREFIELD authortitle="Dr "
  #DREFIELD authorname1="Search"
  #DREDATE 2008/01/06
  #DRETITLE "Ask Dr Search"
  #DRETYPE text
  #DRESTORECONTENT yes
  #DRESECTION 1
  #DRECONTENT
  Dr. Search replies: Since the K2 and IDOL collections are not
  binary compatible, you cannot simply move a collection between K2 and
  IDOL. However, there is a way to use text-based interchange formats
  to create an IDOL collection directly. 
  #DREENDDOC

Figure 3: A Multi-Section Document
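
Splitting a long document by hand is tedious, so you will probably want to script it. The Python sketch below is one way to do that under stated assumptions: it splits on whitespace (so original line breaks are lost), uses the 500-word limit mentioned above as a default, and copies only a few of the tags from Figure 3, in the same syntax the figure uses. The function names are hypothetical.

```python
def split_into_sections(text, max_words=500):
    """Break text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def format_sections(reference, title, text, max_words=500):
    """Emit one IDX record per section, numbered from 0 as in Figure 3."""
    records = []
    for number, chunk in enumerate(split_into_sections(text, max_words)):
        records.append("\n".join([
            f"#DREREFERENCE={reference}",   # same reference in every section
            f'#DRETITLE "{title}"',
            f"#DRESECTION {number}",        # unique section number
            "#DRECONTENT",
            chunk,
            "#DREENDDOC",
        ]))
    return "\n".join(records) + "\n" if records else ""


# A tiny limit is used here only to keep the demonstration short.
print(format_sections("E:\\docs\\example.html", "Example",
                      "one two three four five", max_words=2))
```

Repeating the reference and metadata in every section follows the layout of Figure 3; only the section number and content change from record to record.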

Once you have converted your BIF into an IDX file, you can submit it to the IDOL server via an HTTP request that directs the IDOL server to load and index the contents of the IDX file. The format of the command, assuming the IDX file is named index.idx and resides on the same system as the IDOL server, might look like this:

http://bean.ideaeng.com:9001/DREADD?d:\docs\index.idx

This should index your content, which can then be searched.
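
If you want to submit the file from a script rather than a browser, a Python sketch along these lines may help. It simply rebuilds the URL shown above and sends it with the standard library. The function names are hypothetical, the path is passed through unencoded exactly as in the example URL (so a path containing spaces would need extra handling), and the meaning of the server's response text is not interpreted here.

```python
import urllib.request


def build_dreadd_url(host, port, idx_path):
    """Build a DREADD request URL in the same shape as the example above."""
    return f"http://{host}:{port}/DREADD?{idx_path}"


def submit_idx(host, port, idx_path, timeout=30):
    """Send the DREADD request and return whatever text the server replies with."""
    url = build_dreadd_url(host, port, idx_path)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Only the URL is printed here; call submit_idx(...) against your own server.
print(build_dreadd_url("bean.ideaeng.com", 9001, "d:\\docs\\index.idx"))
```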