Raritan Technologies eRoom Collection Builder
Last Updated Feb 2009
By: Mark Bennett & Ted Sullivan, Raritan Technologies, Volume 4 Number 3 - June 2007
(Editor's Note: We have been seeing Web 2.0 technologies like wikis, blogs, and other lightweight publishing and collaboration tools beginning to take hold in corporations, and Enterprise Search 2.0 is a critical element in the success of these great new tools. Our partner, Raritan Technologies in New Jersey, has developed a tool to extract Documentum eRoom workspace files and metadata for search indexing. We thought you might want to learn about what they've done - mbk)
Overview:
The Raritan Technologies eRoom Collection Builder is a configurable process that extracts files and metadata objects stored in Documentum (EMC) eRooms and submits them to search engine indexing processors. The eRoom build process can be configured as a semi-automated process that periodically queries the eRoom servers to obtain the set of eRooms, user access lists and indexable content.
This document describes the architecture, security model, components, configuration, deployment and execution procedures of the eRoom collection builder.
Architecture:
The following diagram shows the basic architecture of the eRoom collection builder:
Figure 1: eRoom Collection Builder Architecture
The main component in this architecture is the ERoomGateway. This component is responsible for querying the eRoom server and extracting file and metadata objects for processing by the indexing GatewayOutputProcessor component. The GatewayOutputProcessor is a configurable module that maps the eRoom files and metadata fields extracted and transmitted by the ERoomGateway to the API of the search engine's index server. Implementations of this module are available for several of the major search engines (Autonomy, Fast, Verity) as well as for other repository types (such as relational databases and file systems). Finally, the ERoomBuildProcess is a schedulable process that queries the eRoom server to determine the set of available eRooms, maintains a persistent record of eRoom indexing operations and coordinates the operations of the ERoomGateway and the GatewayOutputProcessor.
Distributed Build Architecture:
The ERoomBuildProcess can be run as a single stand-alone Java process. It can also be packaged as a Web Application that can be deployed in a J2EE web application server such as BEA WebLogic or Apache Tomcat. This enables the indexing process to be distributed to multiple machines in a coordinated fashion. The distributed architecture requires that the master eRoom build process be defined and run on a central or master Web Application server. As it runs on a configurable schedule, the eRoom build process creates individual eRoom crawler jobs that can then be dispatched to other machines using a Web Services connection protocol, so that large builds can be spread across several machines.
The following diagram illustrates the basic architecture. The ERoomBuildProcess runs in a master Web Application server. It is responsible for querying the eRoom servers to determine the set of eRooms that need to be indexed, and it keeps track of this information in a Build Log Database. The ERoomBuildProcess then creates a set of jobs that are distributed through a Web Services (SOAP) protocol to other machines on which a RemoteDispatcher process is running. The remote dispatcher downloads jobs from the master ERoomBuildProcess. Each job contains information on the eRoom server, the specific eRoom to index, the location of the output index server (e.g. Autonomy, Fast, Verity …) and the collection to which the eRoom data is to be indexed. The remote process then executes the eRoom crawl to extract content from the eRoom and, depending on the type of build (create or update), sends the extracted data to the index server for indexing.
Figure 2: eRoom Distributed Build Architecture
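The job hand-off described above can be pictured with a small Java sketch. This is an illustration only, not the Raritan API: the class names BuildJobQueueSketch and CrawlJob, the field names and the example URLs are all hypothetical, and an in-memory queue stands in for the Web Services (SOAP) transport used between the master and the remote dispatchers.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

/** Hypothetical sketch of the master/remote job hand-off; not the Raritan API. */
final class BuildJobQueueSketch {

    enum BuildType { CREATE, UPDATE }

    /** Carries the job details listed in the text: eRoom server, eRoom, index server, collection. */
    static final class CrawlJob {
        final String eRoomServerUrl;
        final String eRoomName;
        final String indexServerUrl;
        final String collection;
        final BuildType type;

        CrawlJob(String eRoomServerUrl, String eRoomName,
                 String indexServerUrl, String collection, BuildType type) {
            this.eRoomServerUrl = eRoomServerUrl;
            this.eRoomName = eRoomName;
            this.indexServerUrl = indexServerUrl;
            this.collection = collection;
            this.type = type;
        }
    }

    // In-memory stand-in for the SOAP transport (an assumption of this sketch).
    private final Queue<CrawlJob> pending = new ArrayDeque<>();

    /** Master side: queue one job per eRoom that needs indexing. */
    synchronized void submit(CrawlJob job) {
        pending.add(job);
    }

    /** Remote side: a dispatcher asks for the next job; empty when none is waiting. */
    synchronized Optional<CrawlJob> next() {
        return Optional.ofNullable(pending.poll());
    }

    public static void main(String[] args) {
        BuildJobQueueSketch master = new BuildJobQueueSketch();
        master.submit(new CrawlJob("http://eroom.example/", "SampleERoom",
                "http://index.example/", "eroomCollection", BuildType.CREATE));

        // A remote dispatcher drains the queue and would run one crawl per job.
        Optional<CrawlJob> job;
        while ((job = master.next()).isPresent()) {
            System.out.println("crawl " + job.get().eRoomName
                    + " -> " + job.get().collection + " (" + job.get().type + ")");
        }
    }
}
```

Having the remote machines pull jobs from the master, rather than having the master push them, matches the "downloads jobs" wording above and lets each machine take work at its own pace.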
Security Model:
Access to eRooms is obtained by using a valid member user id and password assigned by an eRoom administrator. The eRoom security model enables administrators to assign security privileges to users for whole eRooms or for specific items within an eRoom. The security levels are edit access and open (read) access. Secured eRooms and eRoom items contain member lists that specify the set of user ids that have access to the eRoom or to the specific item. The security model is thus both granular and simple: each user belongs to a single access control list that corresponds to that user's id. Items that do not have open access member lists are public or universal access items; that is, they can be opened by any user who has access rights to the containing eRoom.
The Raritan eRoom build process extracts the open access member lists for secure items and adds them as a metadata field for indexing. A separate filtering operation can be configured to collect all of the user ids that have member privileges within the eRoom for submission to security access modules such as Fast ESP. Alternatively, the Access Control field can be used with custom query cooking front-ends to provide item-level security in search results.
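As a rough illustration of item-level filtering on the indexed access control field, the Java sketch below applies the rule stated above: an item with no open access member list can be opened by any member of the containing eRoom, while a secured item can be opened only by the users on its list. The class AccessFilterSketch, the Hit type and the sample user ids are hypothetical, and the sketch does not reflect how Fast ESP or any other security module works.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of item-level filtering on an indexed access control field. */
final class AccessFilterSketch {

    /** A search hit carrying the open access member list that was indexed with it. */
    static final class Hit {
        final String title;
        /** Empty means the item has no member list of its own. */
        final Set<String> openAccessMembers;

        Hit(String title, String... members) {
            this.title = title;
            this.openAccessMembers = new HashSet<>(Arrays.asList(members));
        }
    }

    /** A user must belong to the eRoom; secured items also require membership in the item list. */
    static boolean canOpen(Hit hit, String userId, Set<String> roomMembers) {
        if (!roomMembers.contains(userId)) {
            return false;
        }
        return hit.openAccessMembers.isEmpty() || hit.openAccessMembers.contains(userId);
    }

    /** Keeps only the hits that the given user is allowed to open, in result order. */
    static List<String> visibleTitles(List<Hit> hits, String userId, Set<String> roomMembers) {
        List<String> titles = new ArrayList<>();
        for (Hit hit : hits) {
            if (canOpen(hit, userId, roomMembers)) {
                titles.add(hit.title);
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        Set<String> roomMembers = new HashSet<>(Arrays.asList("alice", "bob"));
        List<Hit> hits = Arrays.asList(new Hit("Meeting notes"), new Hit("Budget", "alice"));
        System.out.println(visibleTitles(hits, "bob", roomMembers));
    }
}
```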
Component Configuration:
The eRoom build process is configured using a single XML configuration file. The following shows the basic structure of this configuration for a single ERoomGateway / GatewayOutputProcessor processing pipeline that can index a set of eRooms in a single pass:
<ERoomConfiguration>
<CollectionGateway
class="com.raritantechnologies.federated.eRoom.ERoomGateway">
<!-- See configuration details for ERoomGateway below -->
</CollectionGateway>
<!-- Indexing output processor: Fast, Lucene etc. -->
<!-- Configuration details are type-dependent (see below) -->
<GatewayOutputProcessor>
<!-- Implementation dependent configuration parameters -->
</GatewayOutputProcessor>
</ERoomConfiguration>
ERoomBuildProcess:
The ERoomBuildProcess configuration provides a more automated eRoom build process that can monitor the eRoom server to automatically detect new eRooms, update collections associated with previously indexed eRooms and delete records associated with eRooms that have been removed from the eRoom server. The basic structure of the configuration XML is shown below. The ERoomBuildProcess is an implementation of the Raritan Framework IJobProcess interface. Combining this module with an RTI Framework IJobScheduler module creates a schedulable Job. Details on the module types and configuration details for IJobScheduler modules can be found in the online Raritan Framework Documentation found at:
http://demos2.raritantechnologies.com/FrameworkDocumentation/index.html.
<ERoomBuildConfiguration>
<Scheduler pollingInterval="10000"
queueLimit="10" numDispatchers="1" >
<Jobs>
<Job>
<JobSchedule class="[ class IJobScheduler ]" >
<!-- Implementation dependent configuration parameters -->
</JobSchedule>
<JobProcess
class="com.raritantechnologies.federated.eRoom.ERoomBuildProcess" >
<ERoomGateway >
<!-- see ERoomGateway configuration below -->
</ERoomGateway>
<!-- Typically a DatabaseOutputProcessor or FileOutputProcessor -->
<LoggingOutputProcessor class="[ class of IGatewayOutputProcessor ]">
<!-- Implementation dependent configuration parameters -->
</LoggingOutputProcessor>
<!-- Indexing output processor: Fast, Verity etc. -->
<IndexOutputProcessor class="[ class of IGatewayOutputProcessor ]">
<!-- Implementation dependent configuration parameters -->
</IndexOutputProcessor>
<!-- Map of logging parameters to index output processor setup -->
<LogInputMap>
<Param logField="[ field name in logging database ]"
indexParam="[ index output processor initialization parameter ]"/>
</LogInputMap>
</JobProcess>
</Job>
</Jobs>
</Scheduler>
</ERoomBuildConfiguration>
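The monitoring behaviour described for the ERoomBuildProcess (detect new eRooms, update previously indexed ones, delete records for removed ones) amounts to comparing two sets. The Java sketch below shows that comparison under simple assumptions: eRooms are identified by name, and the build log is represented as a plain set of names. The class BuildPlanSketch and its fields are hypothetical and are not part of the Raritan Framework.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: decide what to create, update and delete on each scheduled run. */
final class BuildPlanSketch {

    final Set<String> toCreate = new TreeSet<>();
    final Set<String> toUpdate = new TreeSet<>();
    final Set<String> toDelete = new TreeSet<>();

    /**
     * @param onServer   eRooms currently reported by the eRoom server
     * @param inBuildLog eRooms recorded as indexed by earlier runs
     */
    static BuildPlanSketch plan(Set<String> onServer, Set<String> inBuildLog) {
        BuildPlanSketch plan = new BuildPlanSketch();
        for (String room : onServer) {
            if (inBuildLog.contains(room)) {
                plan.toUpdate.add(room);   // indexed before: refresh its collection
            } else {
                plan.toCreate.add(room);   // never indexed: build a new collection
            }
        }
        for (String room : inBuildLog) {
            if (!onServer.contains(room)) {
                plan.toDelete.add(room);   // removed from the server: drop its records
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        Set<String> onServer = new HashSet<>(Arrays.asList("RoomA", "RoomB"));
        Set<String> inBuildLog = new HashSet<>(Arrays.asList("RoomB", "RoomC"));
        BuildPlanSketch plan = plan(onServer, inBuildLog);
        System.out.println("create=" + plan.toCreate
                + " update=" + plan.toUpdate + " delete=" + plan.toDelete);
    }
}
```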
ERoomGateway:
The ERoomGateway configuration consists of several sections. One or more <ERoomSource> tags define the eRooms and their starting location paths. The <eRoomAllContainerFields> tag defines the common fields that all eRoom objects will have. The <ExcludeContent> tag lists the file extension types that are not to be indexed. Finally, <eRoomItem> tags define how each type of eRoom object is to be extracted, formatted and indexed. These tags may reference custom <Structure> and <FieldMap> tags that define various XML structures in the eRoom server output and how they are to be formatted for indexing.
<CollectionGateway
class="com.raritantechnologies.federated.eRoom.ERoomGateway" >
<!-- ERoomSource tags define locations in ERoom Facility -->
<ERoomSource endPointURL="[ url to ERoom facility ]"
startPoint="[ path to eRoom start point ]" >
<LoginProcess UserName="[ user name ]"
Password="[ password ]"
PasswordEnc="[ option: DES encrypted Password ]" />
</ERoomSource>
<!-- ExcludeContent tags define file extensions that -->
<!-- are to be excluded from indexing -->
<ExcludeContent>
<!-- One or more ExcludeExtension tags -->
<ExcludeExtension ext="[ file extension of excluded content ]" />
</ExcludeContent>
<!-- eRoomAllContainerFields define common eRoom fields -->
<eRoomAllContainerFields>
<!-- one or more eRoomAllContainerField tags -->
<eRoomAllContainerField>[ name of common eRoom field ]
</eRoomAllContainerField>
</eRoomAllContainerFields>
<!-- eRoomItem tags define how to extract and format
individual eRoom item types -->
<eRoomItems>
<eRoomItem name="[ name of eRoom principal Item type ]">
<!-- eRoomFields tag can be used to describe component -->
<!-- fields that are nested immediately below the Item -->
<!-- tag in the eRoom SOAP response. -->
<eRoomFields>[ comma separated list of fields in item ]
</eRoomFields>
<!-- More complex XML structures can be described using -->
<!-- a Structure and a FieldMap tag. The Structure tag -->
<!-- describes the SOAP envelope request needed to extract -->
<!-- the item data. The FieldMap tag describes how to format -->
<!-- the extracted data into indexable fields. -->
<Structure>
<!-- nested tags that define the SOAP envelope structure of -->
<!-- the web services request (see example) -->
</Structure>
<FieldMap>
<!-- simple field mappings consist of a field name and an -->
<!-- XPath expression locating the field value in the -->
<!-- eRoom SOAP response. -->
<Field ID="[ field ID in indexed object ]"
xPath="[ location of data in SOAP response ]" />
<!-- complex fields can be mapped with a CompositeField tag -->
<CompositeField ID="[ field ID in indexed object ]">
<Template>[ template for composing a single value from a set
of component values. {} braces are used to mark
placeholders for component data,
e.g. {Question} = {DisplayContent} ]
</Template>
<!-- two or more Field tags that define how individual -->
<!-- components of the composite field are to be -->
<!-- extracted from the response and formatted. -->
<Field ID="[ component ID - matches placeholder of
composite template ]"
xPath="[ location of component data ]" >
<Template>[ template for formatting component sub tags ]
</Template>
</Field>
</CompositeField>
</FieldMap>
</eRoomItem>
</eRoomItems>
</CollectionGateway>
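The {} placeholder convention used by the Template tags can be illustrated with a short Java sketch. It shows one plausible way to fill a template such as {Question}={DisplayContent} from a map of component values. The class TemplateSketch is hypothetical, and replacing an unknown placeholder with an empty string is an assumption of this sketch rather than documented ERoomGateway behaviour.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of {placeholder} substitution as used by the Template tags. */
final class TemplateSketch {

    // Matches {Name} where Name is made of letters, digits or underscores.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    /** Replaces each {Name} with values.get("Name"); unknown names become "" (an assumption). */
    static String fill(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.getOrDefault(m.group(1), "");
            // quoteReplacement keeps '$' and '\' in the data from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("Question", "Preferred release date?");
        values.put("DisplayContent", "End of Q3");
        System.out.println(fill("{Question}={DisplayContent}", values));
    }
}
```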
GatewayOutputProcessors
GatewayOutputProcessors are Raritan Framework modules that implement the IGatewayOutputProcessor interface. This is an abstract interface that enables varied types of processing to be applied to metadata objects (RTI IResultSet objects). GatewayOutputProcessors are used by the ERoomBuildProcess to store logging information and to push extracted content to search engine index servers (e.g. Autonomy, Lucene, Fast, Verity K2). For logging, the typical choices are to a relational database using the DatabaseOutputProcessor or to a file system using the FlatFileOutputProcessor.
Details on the configuration of these output processors are available in the online Raritan Framework Documentation:
http://demos2.raritantechnologies.com/FrameworkDocumentation/index.html
The GatewayOutputProcessor module may comprise a single index processor or it may include additional processing using a composite GatewayOutputProcessor such as the SequentialOutputProcessor. This may be used for example in systems that have a separate security server to which the eRoom valid user ids are collected and submitted as access control lists.
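The idea of a composite processor that passes each record through several steps in order can be sketched in Java as follows. The Processor interface here is a hypothetical stand-in: the actual IGatewayOutputProcessor and SequentialOutputProcessor signatures are not shown in this article, so the names and the Map-based record type are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a sequential (composite) output processor. */
final class SequentialSketch {

    /** Stand-in for an output processor; the real framework interface is not shown here. */
    interface Processor {
        void process(Map<String, String> record);
    }

    /** Returns a processor that hands each record to every step, in the order given. */
    static Processor sequence(Processor... steps) {
        final List<Processor> ordered = new ArrayList<>(Arrays.asList(steps));
        return record -> {
            for (Processor step : ordered) {
                step.process(record);
            }
        };
    }

    public static void main(String[] args) {
        // First collect access control data, then push the record to the index.
        Processor security = r -> System.out.println("acl   <- " + r.get("ID"));
        Processor index = r -> System.out.println("index <- " + r.get("ID"));
        sequence(security, index).process(Collections.singletonMap("ID", "item-1"));
    }
}
```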
Deployment
The eRoom collection builder can be run as an executable process from a command line, or from a Web Application deployed to a J2EE application server. A custom J2EE administration interface to view collection indexing logs, start/stop processes and to edit configuration parameters can also be developed using Raritan Framework components.
Command Line Process:
The eRoom builder can be deployed as a command line process. The process requires a Java runtime to be installed on the machine on which it is invoked. The following diagram shows the structure of the eRoom build deployment:
Figure 3: eRoom Build Deployment Directory Structure
The jars folder contains the Java binaries needed to run the eRoom builder. The Config.properties and ERoomBuildConfig.xml files are used to initialize the process. The CollectionBuild.lic file is required to run the process.
The temp and OutputFiles folders are used to store temporary data before it is passed to the indexer, and the logs folder holds the generated build logs.
build.bat file:
The build.bat file sets up the collection build process and the Java classpath needed to run the builder, and then initiates the build process. This example is set up to perform a single-pass collection build. To run the scheduled option, change the class to com.raritantechnologies.searchApp.scheduler.Scheduler and configure ERoomFileBuild.xml as described for the ERoomBuildProcess above.
@set class=com.raritantechnologies.searchApp.dataCollection.CollectionBuilder
@set configFile=ERoomFileBuild.xml
@set cp=.
@set cp=%cp%;jars\raritan.jar
@set cp=%cp%;jars\axis.jar
@set cp=%cp%;jars\commons-discovery.jar
@set cp=%cp%;jars\commons-logging.jar
@set cp=%cp%;jars\jaxrpc.jar
@set cp=%cp%;jars\log4j-1.2.8.jar
@set cp=%cp%;jars\saaj.jar
@set cp=%cp%;jars\servlet.jar
@set cp=%cp%;jars\xerces.jar
@set cp=%cp%;jars\Distribution.jar
@set cp=%cp%;jars\gnu-regexp-1.1.3.jar
@set java_args=-Xms128m -Xmx128m
@date /t
@time /t
java %java_args% -cp %cp% %class% %configFile% >logs/eroom.txt 2>logs/eroom.err
@date /t
@time /t
Web Application Deployment
The eRoom build can also be deployed as a J2EE web application so that it can be run in an application server environment. This deployment requires that the scheduled ERoomBuildProcess configuration be used. The deployment uses the same files as the command line process, mapped to the standard J2EE directory structure.
Figure 4: eRoom Web Server Directory Structure
As shown above, the CollectionBuild.lic and the Config.properties files are deployed in the WEB-INF/classes directory and the ERoomBuildConfig.xml file in the WEB-INF/conf directory.
Sample ERoom Gateway configuration:
<CollectionGateway name="ERoomCG" class="com.raritantechnologies.federated.eRoom.ERoomGateway"
dataDir="BASE_PATH/temp"
dataFileField="fullText"
resultSetSize="10"
maxFileSize="10000000" >
<!-- One or more ERoomSource tags: -->
<ERoomSource
endPointURL="http://eroom:80/eroomxml/facilities/SampleERoom/"
startPoint="/C1/" >
<LoginProcess UserName="adminUser"
Password="hackerSecurePassword" />
</ERoomSource>
<ExcludeContent>
<ExcludeExtension ext="mpp"/>
<ExcludeExtension ext="jpg"/>
<ExcludeExtension ext="gif"/>
<ExcludeExtension ext="psd"/>
<ExcludeExtension ext="mpg"/>
<ExcludeExtension ext="mdb"/>
<ExcludeExtension ext="zip"/>
</ExcludeContent>
<CommentStructure>
<Comments>
<Item>
<ModifyDate/>
<PollEntries>
<Item>
<Ballots>
<Item>
<Votes>
<Vote>
<WriteIn/>
</Vote>
</Votes>
</Item>
</Ballots>
<Choices>
<Choice/>
</Choices>
</Item>
</PollEntries>
</Item>
</Comments>
</CommentStructure>
<CommentFieldMap>
<CompositeField ID="Comments">
<Template>
{CommentData} {Polls} ({Choices})({WriteIns})
</Template>
<Field ID="CommentData" xPath="Comments/Item" >
<Template>*Comment:{Name}={Body}</Template>
</Field>
<Field ID="Polls" xPath="Comments/Item/PollEntries/Item" >
<Template>{Question}={DisplayContent}</Template>
</Field>
<Field ID="Choices"
xPath="Comments/Item/PollEntries/Item/Choices" >
<Template>{Choice}</Template>
</Field>
<Field ID="WriteIns"
xPath="Comments/Item/PollEntries/Item/Ballots/Item/Votes/Vote">
<Template>{WriteIn}</Template>
</Field>
</CompositeField>
<DateField ID="ModifyDate"
xPath="Comments/Item/ModifyDate" select="LATEST" />
</CommentFieldMap>
<eRoomAllContainerFields>
<eRoomAllContainerField>Name</eRoomAllContainerField>
<eRoomAllContainerField>ID</eRoomAllContainerField>
<eRoomAllContainerField>Size</eRoomAllContainerField>
<eRoomAllContainerField>SizeUnits</eRoomAllContainerField>
<eRoomAllContainerField>URL</eRoomAllContainerField>
<eRoomAllContainerField>RelativeURL</eRoomAllContainerField>
<eRoomAllContainerField>CreateDate</eRoomAllContainerField>
<eRoomAllContainerField>ModifyDate</eRoomAllContainerField>
<eRoomAllContainerField>IsContainer</eRoomAllContainerField>
<eRoomAllContainerField>Path</eRoomAllContainerField>
<eRoomAllContainerField>IsPage</eRoomAllContainerField>
<eRoomAllContainerField>IsTopic</eRoomAllContainerField>
<eRoomAllContainerField>IsUnread</eRoomAllContainerField>
<eRoomAllContainerField>EffectiveIsUnread</eRoomAllContainerField>
<eRoomAllContainerField>HasExternalLink</eRoomAllContainerField>
<eRoomAllContainerField>Description</eRoomAllContainerField>
</eRoomAllContainerFields>
<eRoomItems>
<eRoomItem name="erItemTypeFolderPage" >
<eRoomFields>BundleFiles</eRoomFields>
</eRoomItem>
<eRoomItem name="erItemTypeInboxPage" >
<Structure tagName="CommentStructure" />
<FieldMap tagName="CommentFieldMap" />
</eRoomItem>
<eRoomItem name="erItemTypeDiscussionPage" >
<Structure tagName="CommentStructure" />
<FieldMap tagName="CommentFieldMap" />
</eRoomItem>
<eRoomItem name="erItemTypeTopicPage" >
<Structure tagName="CommentStructure" />
<FieldMap tagName="CommentFieldMap" />
</eRoomItem>
<eRoomItem name="erItemTypeLink" />
<eRoomItem name="erItemTypeCalendarEventPage" >
<Structure tagName="CommentStructure" />
<FieldMap tagName="CommentFieldMap" />
</eRoomItem>
<eRoomItem name="erItemTypeFile">
<eRoomFields>BaseName,Extension,IsVirtualDocument,SupportsMinorVersions,
SupportsVersionInformation,TrackVersions</eRoomFields>
</eRoomItem>
<eRoomItem name="erItemTypeNotePage" />
<eRoomItem name="erItemTypeDBPage"
itemsPath="/Database/Rows/Item" >
<Structure>
<Database>
<Rows>
<Item>
<Name />
<ID />
<RelativeURL />
<URL />
<Size />
<Cells>
<DBCell />
</Cells>
</Item>
</Rows>
</Database>
</Structure>
<FieldMap>
<CompositeField ID="DBData">
<Template>{DBCells} {Comments}</Template>
<Field ID="DBCells" xPath="Database/Rows/Item/Cells/DBCell" >
<Template>{Name}={DisplayContent}</Template>
</Field>
<Field ID="Comments" xPath="Database/Rows/Item/Cells/DBCell/Content/Item" >
<Template>*Comment:{Name}={Body}</Template>
</Field>
</CompositeField>
</FieldMap>
</eRoomItem>
<eRoomItem name="erItemTypeDBRow"
itemsPath="/Cells/DBCell/Content/Item" >
<Structure>
<Cells>
<DBCell />
</Cells>
</Structure>
<FieldMap>
<CompositeField ID="DBData">
<Template>{DBCells} {Comments}</Template>
<Field ID="DBCells" xPath="Cells/DBCell" >
<Template>{Name}={DisplayContent}</Template>
</Field>
<Field ID="Comments" xPath="Cells/DBCell/Content/Item" >
<Template>*Comment:{Name}={Body}</Template>
</Field>
</CompositeField>
</FieldMap>
</eRoomItem>
<eRoomItem name="erItemTypeProjectSchedulePage"
itemsPath="/ProjectSchedule/Tasks/Item" >
<Structure>
<ProjectSchedule>
<StartDate />
<SummaryStatus />
<Tasks>
<Item/>
</Tasks>
</ProjectSchedule>
</Structure>
</eRoomItem>
<eRoomItem name="erItemTypeProjectTaskPage" >
<Structure>
<ProjectTask>
<Category />
<Resources>
<Member>
<DisplayName />
</Member>
</Resources>
<Notes />
</ProjectTask>
</Structure>
<Structure tagName="CommentStructure" />
<FieldMap>
<Field ID="Category" xPath="ProjectTask/Category" />
<Field ID="Resource" xPath="ProjectTask/Resources/Member/DisplayName" />
<Field ID="Notes" xPath="ProjectTask/Notes" />
<CompositeField ID="Comments">
<Template>{CommentData} {Polls} ({Choices})({WriteIns})</Template>
<Field ID="CommentData" xPath="Comments/Item" >
<Template>*Comment:{Name}={Body}</Template>
</Field>
<Field ID="Polls" xPath="Comments/Item/PollEntries/Item" >
<Template>{Question}={DisplayContent}</Template>
</Field>
<Field ID="Choices" xPath="Comments/Item/PollEntries/Item/Choices" >
<Template>{Choice}</Template>
</Field>
<Field ID="WriteIns"
xPath="Comments/Item/PollEntries/Item/Ballots/Item/Votes/Vote" >
<Template>{WriteIn}</Template>
</Field>
</CompositeField>
</FieldMap>
</eRoomItem>
<eRoomItem name="erItemTypeCalendarPage"
itemsPath="/Calendar/CalendarEvents/Item" >
<Structure>
<Calendar>
<CalendarEvents>
<Item/>
</CalendarEvents>
</Calendar>
</Structure>
</eRoomItem>
<eRoomItem name="erItemTypePollPage" >
<ItemPaths>
<Path container="erItemTypePollEntry" itemsPath="/Poll/PollEntries/Item" />
</ItemPaths>
<Structure>
<Poll>
<PollEntries>
<Item>
<Ballots>
<Item>
<Votes>
<Vote>
<WriteIn/>
</Vote>
</Votes>
</Item>
</Ballots>
<Choices>
<Choice/>
</Choices>
</Item>
</PollEntries>
</Poll>
</Structure>
<FieldMap>
<CompositeField ID="Comments">
<Template>{Choices} {WriteIns}</Template>
<Field ID="Choices" xPath="Poll/PollEntries/Item/Choices" >
<Template>{Choice}</Template>
</Field>
<Field ID="WriteIns" xPath="Poll/PollEntries/Item/Ballots/Item/Votes/Vote" >
<Template>{WriteIn}</Template>
</Field>
</CompositeField>
</FieldMap>
</eRoomItem>
<eRoomItem name="erItemTypePollEntry"
itemsPath="/Ballots/Item" >
<Structure>
<Ballots>
<Item/>
</Ballots>
</Structure>
<eRoomFields>Question</eRoomFields>
</eRoomItem>
</eRoomItems>
</CollectionGateway>
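To make the Field mappings in the sample more concrete, the Java sketch below evaluates the Comments/Item path against a small XML fragment and formats each match in the style of the *Comment:{Name}={Body} template. It uses the standard javax.xml.xpath API. The sample XML, the class XPathFieldSketch and the choice to evaluate the path relative to the document's root element are assumptions made for illustration; the real eRoom SOAP response and the gateway's internal handling may differ.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch: apply a Field xPath and template to an XML response. */
final class XPathFieldSketch {

    /** Returns one "*Comment:Name=Body" string per node matched by Comments/Item. */
    static List<String> comments(String responseXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(responseXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Evaluate the field path relative to the root element of the response.
            NodeList items = (NodeList) xpath.evaluate(
                    "Comments/Item", doc.getDocumentElement(), XPathConstants.NODESET);

            List<String> out = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Node item = items.item(i);
                String name = xpath.evaluate("Name", item);
                String body = xpath.evaluate("Body", item);
                out.add("*Comment:" + name + "=" + body);
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read response XML", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample data; not an actual eRoom server response.
        String xml = "<Item><Comments>"
                + "<Item><Name>Ann</Name><Body>Looks good</Body></Item>"
                + "<Item><Name>Raj</Name><Body>Agreed</Body></Item>"
                + "</Comments></Item>";
        for (String line : comments(xml)) {
            System.out.println(line);
        }
    }
}
```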