Ask Doctor Search: Converting K2 BIF to IDOL IDX Files
Last Updated Mar 2009
Volume 5 Number 1 - January 2008
A customer writes:
This month a subscriber asks: We are in the process of moving from K2 to IDOL, but we would like to move some of our K2 collections directly to IDOL to understand how the native IDOL relevance may impact our existing ranking algorithms. Is there a way to do this quickly, even before we are actually ready to perform a full index run within IDOL?
Dr Search replies:
Dr. Search replies: Since the K2 and IDOL collections are not binary compatible, you cannot simply move a collection between K2 and IDOL. However, there is a way to use text-based interchange formats to create an IDOL collection directly.
In fact, like many search technologies, both K2 and IDOL support text interchange formats. In K2, the interchange format is the "bulk insert file", or "BIF". In IDOL, the corresponding file is called an "IDX file". Like the collections themselves, these files are not 100% compatible, so some assembly is necessary. But it is far easier to create a process to convert BIF to IDX than to attempt to convert binary files that are not fully documented.
The bulk insert file format has been part of Verity since its earliest days, even pre-dating K2 itself. Figure 1 shows the format of a very simple bulk insert file.
Figure 1: Simple BIF File
title: bennett
vdkvgwkey: /docs/M_Bennett_Projects.doc
<<EOD>>
title: Projects_Fall_2007
vdkvgwkey: /docs/C_Allen_proj.doc
<<EOD>>
title: kehoe - booth layout
vdkvgwkey: /docs/Show_Booth_06.xls
<<EOD>>
The general format is a field name followed by the value for the field, separated by a colon character. There should be an entry for each field you want to populate, as well as the special key field VdkVgwKey. This special field uniquely identifies the document, whether it holds the full filename or URL of the document to index, or even a database key value. What the key contains depends on the gateway in use. For our purposes here, file or web based documents are much easier than database or custom gateway applications.
Note: It is possible to use the DOC_FN field in some cases instead of the VdkVgwKey, but even then the K2 indexing process will define a VdkVgwKey.
If you are including more than one document in a bulk insert file, specify the record separator constant "<<EOD>>" after the last field definition for each document.
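The record layout described above can be read with a few lines of Python. This is only a minimal sketch, not a complete BIF reader: it assumes one "name: value" pair per line, treats field names as case-insensitive by lowercasing them, and ignores any line that does not fit that pattern (for example, inline document text). The function name parse_bif is hypothetical.

```python
def parse_bif(text):
    """Parse BIF text into a list of {field: value} dicts, one per document.

    Assumptions (not taken from Verity documentation): one field per line,
    field names lowercased, lines without a colon are skipped.
    """
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "<<EOD>>":
            # Record separator: close the current document record.
            if current:
                records.append(current)
            current = {}
            continue
        # Split on the FIRST colon only, so values such as URLs or
        # Windows drive paths keep their own colons.
        name, sep, value = line.partition(":")
        if not sep:
            continue
        current[name.strip().lower()] = value.strip()
    if current:
        # Tolerate a final record that lacks a trailing <<EOD>>.
        records.append(current)
    return records


sample = """title: bennett
vdkvgwkey: /docs/M_Bennett_Projects.doc
<<EOD>>
title: Projects_Fall_2007
vdkvgwkey: /docs/C_Allen_proj.doc
<<EOD>>
"""
records = parse_bif(sample)
```

Each dictionary in records then holds the metadata for one document, keyed by field name, which is a convenient starting point for writing the same data out in another format.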
Some other things to remember about BIFs
- Fields defined in the bulk file take precedence over fields defined in the actual document. Each record in Figure 1 defines a value for TITLE, so even if the Office files have a title defined in their properties, the title in the BIF will be the one in the K2 collection. On the other hand, if a field has no value in the bulk file, the standard filters will apply, and if there is a field defined in the document properties, that value will be indexed into the collection.
- Even on Windows, you can use the forward slash "/" character as part of a path. However, if you use the standard Windows backslash separator, you must escape each path separator with an extra backslash, i.e., docs\\Show_Booth_06.xls. Remember, BIFs normally use the VdkVgwKey to identify the actual document.
- A BIF is submitted to the standard K2 indexing tools, normally mkvdk or the K2 index API. You can choose to provide the full document text within a BIF, but normally bulk files are used to define metadata field contents and the K2 indexer processes the full document. Thus, with BIFs, you do not need to create the filtered stream of text to include in the bulk file.
- Finally, any metadata fields defined in the style files will be populated normally. Note: Defining a field in the BIF will override index-time field extraction.
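The path rule in the list above is easy to get wrong when BIFs are generated by a script. The following is a small sketch of both options the article mentions, written in Python; the function name and its parameter are hypothetical, and whether a rewritten key still matches what your gateway expects is something to verify against your own collection.

```python
def bif_key_for_windows_path(path, use_forward_slashes=True):
    """Return a Windows path in a form suitable for a BIF field value.

    Either switch to forward slashes, or keep backslashes and double
    each one, as the article describes. Hypothetical helper.
    """
    if use_forward_slashes:
        return path.replace("\\", "/")
    return path.replace("\\", "\\\\")


forward = bif_key_for_windows_path(r"docs\Show_Booth_06.xls")
escaped = bif_key_for_windows_path(r"docs\Show_Booth_06.xls", use_forward_slashes=False)
```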
Extracting data from a Collection
K2 has a command line tool called extract that can dump the contents of a collection into a full bulk insert file; you can choose to include the text of the actual document if you want. The Enterprise Search newsletter article Using the EXTRACT Utility, from April 2005, describes this handy tool in detail.
IDOL IDX Files
Like K2, IDOL supports an interchange format which Autonomy calls an IDX file. Like BIFs, you can specify field name/value pairs, although the syntax is slightly different. Unlike BIFs in K2, an IDX file must contain all of the content you want to index for a given document, including both metadata and full-body text. Figure 2 shows a simple IDX file.
Figure 2: Simple IDX File
#DREDBNAME Collection Name
#DREFILENAME C:\Autonomy\OracleFetch\Example\49\Example1.htm
#DREFIELD DRETITLE="Sample Document"
#DRECONTENT
This is the full document
#DREENDDOC
You can see from Figure 2 that IDX files start each tag line with a # character and use an = to split the name/value pair in a #DREFIELD entry. IDX files also use the #DREENDDOC tag rather than the <<EOD>> tag to end a record.
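To make the layout concrete, here is a minimal Python sketch that writes one record in the shape of Figure 2. The tag names and their order are copied from that figure only; the function name idx_record is hypothetical, and the sketch makes no attempt to escape quote characters inside field values, so check the IDOL documentation before relying on it.

```python
def idx_record(dbname, filename, fields, content):
    """Build one IDX record laid out like Figure 2.

    Tag names and ordering follow the figure; this is an illustrative
    sketch, not a validated IDX writer.
    """
    lines = [f"#DREDBNAME {dbname}", f"#DREFILENAME {filename}"]
    for name, value in fields.items():
        # Name/value pairs use '=' with the value in double quotes.
        lines.append(f'#DREFIELD {name}="{value}"')
    lines.append("#DRECONTENT")
    lines.append(content)
    lines.append("#DREENDDOC")
    return "\n".join(lines) + "\n"


record = idx_record(
    "Collection Name",
    r"C:\Autonomy\OracleFetch\Example\49\Example1.htm",
    {"DRETITLE": "Sample Document"},
    "This is the full document",
)
```

Because an IDX record carries the body text as well as the metadata, a converter needs the filtered text of each document at this point, not just the fields found in the BIF.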
In the K2 world, each document is fully defined within a single "record" for a particular filename or URL, terminated by the <<EOD>>. However, IDOL allows multiple records, each terminated by the #DREENDDOC tag, to refer to the same physical document. If you want to use more than a single record to describe a single document, you need to add a #DRESECTION tag and define a unique section number for each part of the document.
This comes in handy when you learn that unlike BIFs, an IDX file has to contain the full text of the document to be added in the #DRECONTENT field. Each record ("section" in IDOL parlance) permits no more than 500 words, so you need to split the document into multiple sections; Figure 3 shows the first two sections that you might use to index this article.
Figure 3: A Multi-Section Document
#DREREFERENCE=E:\entsrch\2008\numbner_01\drsearch.html
#DREFIELD authortitle="Dr "
#DREFIELD authorname1="Search"
#DREDATE 2008/01/06
#DRETITLE "Ask Dr Search"
#DRETYPE text
#DRESTORECONTENT yes
#DRESECTION 0
#DRECONTENT
This month a subscriber asks: We are in the process of moving from
K2 to IDOL, but we would like to move some of our K2 collections
directly to IDOL to understand how the native IDOL relevance may
impact our existing ranking algorithms. Is there a way to do this
quickly, even before we are actually ready to perform a full index
run within IDOL?
#DREENDDOC
#DREREFERENCE=E:\entsrch\2008\numbner_01\drsearch.html
#DREFIELD authortitle="Dr "
#DREFIELD authorname1="Search"
#DREDATE 2008/01/06
#DRETITLE "Ask Dr Search"
#DRETYPE text
#DRESTORECONTENT yes
#DRESECTION 1
#DRECONTENT
Dr. Search replies: Since the K2 and IDOL collections are not
binary compatible, you cannot simply move a collection between K2 and
IDOL. However, there is a way to use text-based interchange formats
to create an IDOL collection directly.
#DREENDDOC
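Splitting a long document to respect the 500-word section limit mentioned above can be sketched in a few lines of Python. This assumes that a "word" is simply a whitespace-delimited token, which may not match how IDOL counts words, and it discards the original line breaks; the function name is hypothetical.

```python
def split_into_sections(text, max_words=500):
    """Split text into chunks of at most max_words whitespace-delimited words.

    The position of each chunk in the returned list can serve as its
    section number, starting at 0 as in Figure 3.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# A 1,200-word stand-in document yields sections of 500, 500 and 200 words.
long_text = " ".join(f"w{i}" for i in range(1200))
sections = split_into_sections(long_text)
```

Iterating with enumerate(sections) gives the section number and the body text for each record, while the reference and metadata fields are repeated unchanged in every section, as Figure 3 shows.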
Once you have converted your BIF into an IDX file, you can submit it to the IDOL server via an HTTP request that directs the server to load and index the contents of the IDX file. Assuming the IDX file is named INDEX.IDX and resides on the same system as the IDOL server, the request might look like this:
http://bean.ideaeng.com:9001/DREADD?d:\docs\index.idx
This should index your content, which can then be searched.
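If you want to issue that request from a script rather than a browser, something like the following Python sketch may serve. It builds the URL in exactly the form shown above and sends it with the standard library; the function names are hypothetical, the path is passed through without URL encoding because that is how the example above is written, and the shape of the server's response is not covered here.

```python
from urllib.request import urlopen


def dreadd_url(host, index_port, idx_path):
    """Build a DREADD request URL in the form shown in the article."""
    return f"http://{host}:{index_port}/DREADD?{idx_path}"


def submit_idx(host, index_port, idx_path, timeout=30):
    """Send the DREADD request and return the raw response text.

    idx_path is a path as seen by the IDOL server, not by this script.
    """
    with urlopen(dreadd_url(host, index_port, idx_path), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


url = dreadd_url("bean.ideaeng.com", 9001, r"d:\docs\index.idx")
```

Keeping the URL construction separate from the network call makes it easy to log or inspect the request before anything is sent to the server.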