7
Understanding the ConText Data Dictionary: Indexing

This chapter introduces the concepts necessary for understanding the objects in the ConText data dictionary.

The following topics are discussed in this chapter:

Policies

This section provides conceptual, as well as reference, information about policies, which are stored in the ConText data dictionary.

What is a Policy?

A policy is a logical grouping of six indexing preferences (one preference for each of the supported categories), assigned to a column in the database. A policy specifies the options used by ConText to create the index for the text in the column. It is also used to generate linguistic information for use in ConText applications.

Note:

A policy must exist for a column before a ConText server can create an index or generate linguistic output for the column.

Policies can be created by any ConText user with the CTXAPP role and are stored in the ConText data dictionary. In addition to the preferences for a policy, users specify a name for the policy, the text column for the policy, and a number of other policy attributes.

Each policy name must be unique among the policies owned by a user. As a result, a single policy cannot be assigned to more than one column.

Column Policies

A column policy is a policy that has a text column assigned to it. Only column policies can be used to create ConText indexes or generate linguistic output.

Multiple Policies on a Column

Multiple policies, as long as they are unique for the user, can be assigned to a column. As a result, a column can have more than one index.

When a query is performed, you can specify a policy name to indicate the index that is used to process the query.

This feature is particularly useful if you have English-language documents for which you want to enable both text and theme queries. To enable text and theme queries, you must create both a text indexing policy and a theme indexing policy on the column containing the documents and create a ConText index for each policy.
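The naming and assignment rules described so far can be modeled with a small in-memory registry. The following Python sketch is illustrative only: the PolicyRegistry class and the SCOTT user are hypothetical, the policy and column names are borrowed from the Policy Examples section, and nothing here calls ConText itself.

```python
from collections import defaultdict


class PolicyRegistry:
    """Toy model of the policy naming rules; not a ConText API."""

    def __init__(self):
        # (owner, policy_name) -> column the policy is assigned to
        self._policies = {}
        # column -> list of (owner, policy_name) assigned to it
        self._by_column = defaultdict(list)

    def create_policy(self, owner, name, column):
        key = (owner.upper(), name.upper())
        if key in self._policies:
            # A policy name must be unique for its owner, so one policy
            # can never be attached to a second column.
            raise ValueError(f"policy {key[1]} already exists for {key[0]}")
        self._policies[key] = column
        self._by_column[column].append(key)

    def policies_for(self, column):
        """Return the names of all policies assigned to a column."""
        return [name for _, name in self._by_column[column]]


registry = PolicyRegistry()
registry.create_policy("scott", "i_doc", "DOC_AND_COMMENT.DOC")
registry.create_policy("scott", "i_theme", "DOC_AND_COMMENT.DOC")
print(registry.policies_for("DOC_AND_COMMENT.DOC"))
```

Keying the registry on the owner and policy name together is what allows different users to reuse a name while still letting one column carry several policies, and therefore several indexes.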

Template Policies

A template policy is a policy that does not have a text column. Template policies are stored in the ConText data dictionary and are owned by the user who created them.

A template policy can be used by the policy owner as a source policy when creating new column or template policies. When a template policy is used as a source policy in a new policy, all of the preferences for the template policy are copied to the new policy. Any preference from the template policy can be overridden by explicitly naming a preference (for the same category) during the creation of the new policy.

Text Indexing Policies

A text indexing policy is any policy created with a Lexer preference that uses the BASIC LEXER Tile or one of the Tiles provided for the pictorial languages supported by ConText.

Once a text index is created for the policy, any text requests, including text queries, on the policy will result in the text index being accessed.

See Also:

For an example of creating a text indexing policy, see "Creating a Column Policy" in Chapter 9, "Setting Up and Managing Text".

For more information about text indexes, see "Text Indexes" in Chapter 6, "Text Concepts".

For more information about text queries, see Oracle8 ConText Cartridge Application Developer's Guide.

Theme Indexing Policies

By specifying the THEME LEXER Tile in the Lexer preference used in a column policy, you designate the policy as a theme indexing policy.

In addition, stoplists are not used by the theme lexer, so a NULL Stoplist preference can be specified for the policy.

Once a theme index is created for a theme indexing policy, any text requests, including queries, on the policy will result in the theme index being accessed.

See Also:

For an example of creating a theme indexing policy, see "Creating a Theme Indexing Policy" in Chapter 9, "Setting Up and Managing Text".

For more information about theme indexes, see "Theme Indexes" in Chapter 6, "Text Concepts".

For more information about theme queries, see Oracle8 ConText Cartridge Application Developer's Guide.

Policy Examples

Consider a table with two text columns: one holds Microsoft Word documents and the other holds (plain text) comments for the documents. The table structure is:

Table name	Column Name	Datatype	Description
DOC_AND_COMMENT	TEXTKEY	NUMBER	Primary key column
	DATE	DATE	Publishing date of document
	AUTHOR	VARCHAR2(50)	Name of document author
	COMMENTS	VARCHAR2(2000)	Text column storing comments (ASCII text) for documents
	TEXT	LONG RAW	Text column storing MS Word documents

To create a text index for both the comment and doc columns in doc_and_comment, a policy must be defined for each column. The following example illustrates two policies named i_doc and i_comments that could be created:

Policy Name	Indexing Option	Indexing Option Value
I_DOC	Text Column	DOC_AND_COMMENT.DOC
	Data Store	Direct (text in column)
	Filter	MS Word
	Lexer	General purpose text lexer
	Engine	General purpose indexing engine
	Stoplist	Default English stoplist
	Wordlist	Soundex and stemming
I_COMMENTS	Text Column	DOC_AND_COMMENT.COMMENTS
	Data Store	Direct (text in column)
	Filter	None (ASCII text)
	Lexer	General purpose lexer
	Engine	General purpose indexing engine
	Stoplist	Default English stoplist
	Wordlist	None

To create a theme index for the doc column, a theme indexing policy must be defined. The following example illustrates a policy named i_theme that could be created for the table:

Policy Name	Indexing Option	Indexing Option Value
I_THEME	Text Column	DOC_AND_COMMENT.DOC
	Data Store	Direct (text in column)
	Filter	MS Word
	Lexer	Theme lexer
	Engine	General purpose indexing engine
	Stoplist	Not applicable
	Wordlist	Not applicable

Policy Attributes

To define a policy, a user specifies a name for the policy and a number of optional attributes.

Policy Name

Because a policy is owned by the user who creates it, the policy name must be unique for a user; however, different users can have policies with the same name.

Optional Attributes

The following policy attributes are optional:

Text Column

specifies the column in a table to which a policy is assigned. It is the column used to store text in the table.

Note:

If the policy does not include a text column, the policy is a template policy, which can be used as a source policy in another policy.

Description

specifies a description of the policy.

Textkey

specifies the primary key column or columns (up to sixteen) for the table. This attribute is required if the policy is being assigned to a column.

Line Number

specifies the column storing the unique identifier for the text column in a master-detail table. A master-detail table does not store a document as a single row, but rather breaks the document (identified by the textkey) into sections and stores each section in a separate row in the table. The collection of rows with the same textkey represents the whole document.

Note:

This attribute is used only for policies that include a preference for the MASTER DETAIL Tile.

Source Policy

specifies an existing template policy that you want to use as the basis for a new policy. When you specify a source policy in a policy, all of the preferences for the template (source) policy are copied into the new policy. The preferences from the source policy can be overwritten by explicitly specifying a preference for the category.

Note:

When specifying a source policy in a policy, a user can specify either their template policies or CTXSYS-owned template policies.

Preferences in Policies

To define a policy, the user specifies a preference for each of the six supported categories. ConText does not require the user to specify a preference for the seventh category, Compressor, because data compression is not currently supported.

A preference can be used in more than one policy; however, two preferences from the same category cannot be used in the same policy.

Note:

If you want to use the same preferences for two text columns, you must create two separate policies. The policies will be identical (having all of the same preferences), but they must have unique names and be attached to different columns. This is true whether the columns are in the same table or in different tables.

Preference Defaults

In a policy, if a user does not specify a preference for one of the preference categories, ConText uses the default preference for the category.

The above figure illustrates how the default preferences and user-specified preferences work together to create a complete policy.
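The same layering can also be expressed as a short Python sketch. It is a model of the rules in this section, not of how ConText implements them: the resolve_policy function is hypothetical, and the default preference names are the ones listed for the DEFAULT_POLICY template in this chapter.

```python
# Category -> default preference, as listed for DEFAULT_POLICY.
DEFAULT_PREFERENCES = {
    "Data Store": "DEFAULT_DIRECT_DATASTORE",
    "Filter": "DEFAULT_NULL_FILTER",
    "Lexer": "DEFAULT_LEXER",
    "Engine": "DEFAULT_INDEX",
    "Wordlist": "NO_SOUNDEX",
    "Stoplist": "DEFAULT_STOPLIST",
}


def resolve_policy(explicit, source=None):
    """Return one preference per category for a new policy.

    explicit: category -> preference named by the user
    source:   category -> preference copied from a template policy
    """
    unknown = set(explicit) - set(DEFAULT_PREFERENCES)
    if unknown:
        raise ValueError(f"unsupported categories: {sorted(unknown)}")
    resolved = dict(DEFAULT_PREFERENCES)  # start from the defaults
    resolved.update(source or {})         # copy the source policy, if any
    resolved.update(explicit)             # explicitly named preferences win
    return resolved


theme_policy = resolve_policy({"Lexer": "THEME_LEXER"})
print(theme_policy["Lexer"], theme_policy["Filter"])
```

Because the result is keyed by category, it can never hold two preferences from the same category, which mirrors the restriction stated above.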

Predefined Template Policies

ConText provides the following template policies (listed in alphabetical order):

DEFAULT_POLICY

This template policy uses all of the default preferences. It can be used to create a policy with the following characteristics:

Preferences	Characteristics
DEFAULT_DIRECT_DATASTORE	Text stored in database
DEFAULT_NULL_FILTER	No filter (text stored in plain, ASCII format)
DEFAULT_LEXER	Basic lexer (standard punctuation and continuation characters, no printjoins or skipjoins characters)
DEFAULT_INDEX	Indexing memory = 12582912 bytes, default storage/other clauses for ConText index tables and indexes
NO_SOUNDEX	No Soundex word mappings stored during text indexing
DEFAULT_STOPLIST	Stoplist is active, default list of stop words

Note:

DEFAULT_POLICY is the default for source_policy in both CTX_DDL.CREATE_POLICY and CTX_DDL.CREATE_TEMPLATE_POLICY.

TEMPLATE_AUTOB

This template policy uses the AUTOB predefined preference and all the remaining default preferences. It can be used to create a column policy for a text column that contains documents in mixed formats.

TEMPLATE_BASIC_WEB

This template uses the following predefined preferences and can be used to create a column policy which enables basic section searching for a text column containing HTML documents:

Preferences	Characteristics
DEFAULT_URL	Text stored in external files, URLs to external files stored in text column
BASIC_HTML_FILTER	HTML filter with certain HTML tags specified for keep_tag
BASIC_HTML_LEXER	Basic lexer with characters specified for startjoins and endjoins
DEFAULT_INDEX	Indexing memory = 12582912 bytes, default storage/other clauses for ConText index tables and indexes
BASIC_HTML_WORDLIST	No Soundex word mappings stored during text indexing; HTML section group specified for section_group
DEFAULT_STOPLIST	Stoplist is active, default list of stop words

TEMPLATE_DIRECT

This template policy uses all the default preferences. It can be used to create a policy for indexing basic text stored in a text column.

TEMPLATE_LONGTEXT_STOPLIST_OFF

This template policy uses the NO_STOPLIST predefined preference and all the remaining default preferences. It can be used to create a policy that does not use a stoplist during indexing.

TEMPLATE_LONGTEXT_STOPLIST_ON

This template policy uses the DEFAULT_STOPLIST predefined preference and all the remaining default preferences. It can be used to create a policy that uses the supplied English stoplist during indexing.

TEMPLATE_MD

This template policy uses the MD_TEXT predefined preference and all the remaining default preferences. It can be used to create a policy for indexing text stored in the detail column in a master-detail table.

TEMPLATE_MD_BIN

This template policy uses the MD_BINARY predefined preference and all the remaining default preferences. It can be used to create a policy for indexing text stored in the detail column in a master-detail table.

TEMPLATE_WW6B

This template policy uses the WW6B predefined preference and all the remaining default preferences. It can be used to create a policy for indexing text in Microsoft Word for Windows 6 format.

Preferences for Indexing

This section provides conceptual, as well as reference, information for indexing preferences, which are stored in the ConText data dictionary.

What is an Indexing Preference?

Indexing preferences specify the options that ConText uses to create ConText indexes. Each preference represents one (and only one) indexing option.

A preference consists of a ConText Tile and one or more attributes (and their corresponding values) for the Tile.

In addition, each preference is grouped into one of six types or categories, which determine the indexing operation that the preference controls. While a category is not explicitly assigned to a preference, it is implied through the association of the Tile with the preference.

When creating a policy, six preferences are specified for the policy, one for each of the six categories.

User-defined Preferences

A ConText user with the CTXAPP role can create their own preferences by setting the required attributes for one of the Tiles provided by ConText, then calling CTX_DDL.CREATE_PREFERENCE and specifying the name of the Tile.

Note:

When creating a policy, users can use all preferences that have been defined in the ConText data dictionary, including their own preferences, preferences created by other users, or the predefined preferences provided by ConText.

Predefined Preferences

ConText provides a number of predefined preferences (owned by CTXSYS) for each category. These predefined preferences can be used by any ConText user with the CTXAPP role to create policies without having to first create preferences.

What is a Tile?

Tiles are the objects in the ConText data dictionary that provide ConText servers with information about how text is managed in the system, as well as indexing instructions. Each Tile specifies a distinct indexing option within the ConText framework.

A Tile is the main component of a preference. When you define a preference, you specify a Tile and attributes for the Tile, as well as a value for each attribute.

Tile Attributes

Each Tile may have none, one, or many attributes that are specified to define a preference. The attributes identify which indexing options are active for the Tile in a preference.

Each Tile attribute has a value (either a number or a string) that you assign when you specify attributes in a preference.

Tile Categories

The indexing options that must be specified for ConText are divided into seven functional categories or classes.

Each category contains one or more Tiles for which you specify attributes when creating preferences. The Tiles in the categories essentially provide answers to the questions necessary for ConText to generate an index for a text column:

Where and how is the text stored? (Data Store Tiles)
What format is the text in? (Filter Tiles)
Is text compressed? (Compressor Tiles -- Not currently implemented)
How are tokens in the text identified? (Lexer Tiles)
How is the index generated and where is it stored? (Engine Tiles)
Are any special querying functions enabled? (Wordlist Tiles)
Which words should not have entries in the index? (Stoplist Tiles)

Data Store Predefined Preferences

ConText provides the following predefined Data Store preferences:

DEFAULT_DIRECT_DATASTORE

This preference calls the DIRECT Tile, which is used to indicate that text is stored directly in the text column of a text table.

DEFAULT_OSFILE

This preference calls the OSFILE Tile, which is used to indicate that text is stored as files in a file system.

DEFAULT_OSFILE uses the path attribute and a hard-coded set of dummy directory paths to indicate the directories in which the text files are located.

The hard-coded paths, delimited by colons, are: /oracle/data, /oracle/data2, /oracle/data3.

Note:

If the locations of your files do not match the hard-coded paths, do not use the DEFAULT_OSFILE preference in a policy.

DEFAULT_URL

This preference calls the URL Tile, which is used to indicate that text is stored as URLs.

DEFAULT_URL uses all of the attribute defaults for the URL Tile:

timeout of 30 seconds
up to 8 HTTP threads handled simultaneously
up to 256 HTML documents can be accessed simultaneously
the maximum length of a URL stored in the text column is 256 bytes
the maximum size of an HTML file that the URL data store will access without error is 2 megabytes
no proxy server

MD_BINARY

This preference calls the MASTER DETAIL Tile, which is used to indicate that text is stored in a master-detail table.

MD_BINARY uses the binary attribute and a value of YES to indicate that the text in the table is stored in binary format (newline characters do not indicate end of line).

MD_TEXT

This preference calls the MASTER DETAIL Tile, which is used to indicate that text is stored in a master-detail table.

MD_TEXT uses the binary attribute and a value of NO to indicate that the text in the table is stored in plain text format (newline characters indicate end of line).

Filter Predefined Preferences

ConText provides the following predefined Filter preferences:

AUTOB

This preference calls the BLASTER FILTER Tile which specifies an internal filter used to extract text from formatted documents in a text column.

AUTOB uses the format attribute and a value of 997 to indicate that ConText uses the autorecognize filter to extract text. It can be used to filter text in a column that contains the following document formats:

Document Format	Version
AmiPro for Windows	1, 2, 3
ASCII	N/A
HTML	1, 2, 3
Lotus 123 for DOS	4, 5
Lotus 123 for Windows	2, 3, 4, 5
Microsoft Word for Windows	2, 6.x
Microsoft Word for DOS	5.0, 5.5
Microsoft Word for MAC	3, 4, 5.x
Word Perfect for Windows	5.x, 6.x
WordPerfect for DOS	5.0, 5.1, 6.0
Xerox XIF for UNIX	5, 6

BASIC_HTML_FILTER

This preference is identical to the HTML_FILTER predefined preference, except the keep_tag attribute is set with the following values to support basic section searching in HTML documents:

'P'
'TITLE'
'H1','H2','H3','H4','H5','H6'
'HEAD'
'BODY'

DEFAULT_NULL_FILTER

This preference calls the FILTER NOP Tile which indicates that the text column in a text table contains plain, unformatted (ASCII) text and does not require filtering for indexing and highlighting.

HTML_FILTER

This preference calls the HTML FILTER Tile and can be used to filter documents in a column that contains only HTML-formatted documents.

WW6B

This preference calls the BLASTER FILTER Tile and specifies a value of 11 for the format attribute to indicate ConText uses the Word for Windows 6 filter to extract text. It can be used in a column that contains only Word for Windows 6-formatted documents.

Lexer Predefined Preferences

ConText provides the following predefined Lexer preferences:

BASIC_HTML_LEXER

This preference is identical to DEFAULT_LEXER, except the startjoins and endjoins attributes for the BASIC LEXER Tile are set with '</' and '>' respectively to support basic section searching in HTML documents.

DEFAULT_LEXER

This preference calls the BASIC LEXER Tile, which indicates the lexer settings used to identify word and sentence boundaries for text indexing and text queries.

DEFAULT_LEXER uses the following Tile attributes and values to indicate the lexer settings:

Attribute	Values
punctuations	. ? !
printjoins	NULL (indicates no characters defined as printjoins for the BASIC LEXER)
skipjoins	NULL (indicates no characters defined as skipjoins for the BASIC LEXER)
continuation	- \

KOREAN

This preference calls the KOREAN LEXER Tile and can be used for parsing Korean text. It has no attributes.

THEME_LEXER

This preference calls the THEME LEXER Tile, which indicates the preference can be used in a column policy to create theme indexes for a column.

The THEME_LEXER preference does not set any attributes because the THEME LEXER Tile does not have any attributes.

VGRAM_CHINESE_1 and VGRAM_CHINESE_2

These preferences call the CHINESE V-GRAM LEXER Tile, which indicates the preferences can be used for parsing Chinese text.

The 1 or 2 indicates that the preference uses either method 1 or 2 for identifying tokens in Chinese text (hanzi_indexing attribute).

VGRAM_JAPANESE_1 and VGRAM_JAPANESE_2

These preferences call the JAPANESE V-GRAM LEXER Tile, which indicates the preferences can be used for parsing Japanese text.

The 1 or 2 indicates that the preference uses either method 1 or 2 for identifying tokens in Japanese text (kanji_indexing attribute).

Engine Predefined Preferences

ConText supplies a single predefined Engine preference, DEFAULT_INDEX.

DEFAULT_INDEX

This preference calls the GENERIC ENGINE Tile which is used to specify the amount of memory reserved for indexing.

DEFAULT_INDEX uses the index_memory attribute and specifies the amount of memory allocated for indexing: 12582912 bytes.

Wordlist Predefined Preferences

ConText provides the following predefined Wordlist preferences:

BASIC_HTML_WORDLIST

This preference is identical to the NO_SOUNDEX preference, except the section_group attribute has a value of 'BASIC_HTML_SECTION', which is a predefined section group provided by ConText for basic section searching of HTML text.

NO_SOUNDEX

This preference specifies a value of 0 for the soundex_at_index attribute to indicate that ConText does not generate Soundex word mappings during text indexing.

SOUNDEX

This preference specifies a value of 1 for the soundex_at_index attribute to indicate that ConText generates Soundex word mappings during text indexing.

KOREAN_WORDLIST

This preference specifies a value of 3 for the fuzzy_match attribute to ensure fuzzy matching is not enabled for Korean.

VGRAM_CHINESE_WORDLIST

This preference specifies a value of 4 for the fuzzy_match attribute to ensure fuzzy matching is not enabled for Chinese.

VGRAM_JAPANESE_WORDLIST

This preference specifies a value of 2 for the fuzzy_match attribute to enable fuzzy matching for Japanese.

Stoplist Predefined Preferences

ConText provides the following predefined Stoplist preferences for creating text indexes:

DEFAULT_STOPLIST

This preference defines a list of English terms treated as stop words during indexing.

Note:

The stop words in DEFAULT_STOPLIST are in all-uppercase. As a result, DEFAULT_STOPLIST should be used only when creating case-insensitive text indexes.

If you are creating case-sensitive text indexes and want to use a stoplist during indexing, first create a Stoplist preference that includes all forms (i.e. uppercase, initial uppercase, lowercase) of the words to be processed as stop words.

For more information about case-sensitivity in text indexes, see "What's in a Text Index?" in Chapter 6, "Text Concepts".

NO_STOPLIST

This preference specifies that no list of stop words is used during text indexing. All words that ConText encounters are stored in the text index.

FRENCH_STOPLIST

This preference defines a list of French terms treated as stop words during indexing.

GERMAN_STOPLIST

This preference defines a list of German terms treated as stop words during indexing.

ITALIAN_STOPLIST

This preference defines a list of Italian terms treated as stop words during indexing.

SPANISH_STOPLIST

This preference defines a list of Spanish terms treated as stop words during indexing.

Note:

The stop words in FRENCH_STOPLIST, GERMAN_STOPLIST, ITALIAN_STOPLIST, and SPANISH_STOPLIST are in all-lowercase. As a result, these preferences can be used in case-sensitive and case-insensitive text indexes; however, if they are used in case-sensitive indexes, stop words that appear in all-uppercase or initial uppercase in your text will not be processed as stop words.

To ensure all stop words are processed correctly in a case-sensitive text index, create a Stoplist preference that includes all forms (i.e. uppercase, initial uppercase, lowercase) of the words to be processed as stop words.

For more information about case-sensitivity in text indexes, see "What's in a Text Index?" in Chapter 6, "Text Concepts".
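As a rough illustration of what "all forms" means in the notes above, the following Python sketch expands a word list into its lowercase, initial-uppercase, and uppercase variants. The expand_stop_words function is hypothetical, and loading the resulting words into a Stoplist preference is outside the scope of the sketch.

```python
def expand_stop_words(words):
    """Return each stop word in lowercase, initial-uppercase, and uppercase form."""
    forms = []
    for word in words:
        for variant in (word.lower(), word.capitalize(), word.upper()):
            # One-letter words yield the same initial-uppercase and
            # uppercase form, so skip variants already collected.
            if variant not in forms:
                forms.append(variant)
    return forms


print(expand_stop_words(["the", "AND"]))
# prints ['the', 'The', 'THE', 'and', 'And', 'AND']
```

The input case does not matter, so the same helper works for the all-uppercase English list and for the all-lowercase French, German, Italian, and Spanish lists.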

Data Store Tiles

Data Store Tiles are used to create preferences which specify how text (data) is stored in the database. ConText supports the following methods of storing text in the database:

direct
master/detail
external operating-system files
external Web files

See Also:

For more information about text storage, see "Text Storage" in Chapter 6, "Text Concepts".

List of Data Store Tiles and Attributes

ConText provides the following Data Store Tiles:

Tile	Attributes	Attribute Values
DIRECT	none	N/A
MASTER DETAIL	binary	0 (plain text)
		1 (binary text)
MASTER DETAIL NEW	binary	0 (plain text)
		1 (binary text)
	detail_table	name of the detail table (string)
	detail_key	name of the foreign key column in the detail table (string)
	detail_lineno	name of the line number column in the detail table (string)
	detail_text	name of the text column in the detail table (string)
	detail_text_size	Internal use only
OSFILE	path	path1:path2:...:pathn
URL	timeout	seconds (0 to 3600, default 30)
	maxthreads	number of threads (0 to 1024, default 8)
	maxurls	maximum number of rows in the internal buffer (1 to 4294967295, default 256)
	urlsize	URL length (32 to 65535, default 256)
	maxdocsize	document size (256 to 4294967295, default 2000000)
	http_proxy	host name
	ftp_proxy	host name
	no_proxy	string (up to 16 strings, separated by commas)

DIRECT Tile

The DIRECT Tile is used for text stored directly in the database. It has no attributes.

MASTER DETAIL Tile

The MASTER DETAIL Tile is used for text stored directly in the database in master-detail tables, with the textkey column located in the detail table. The column policy is assigned to this column.

Attributes

MASTER DETAIL has the following attribute(s):

binary

The binary attribute specifies whether the text in a master detail table is in plain text format (0) or binary format (1).

Text in plain text format uses newline characters at the end of each line to indicate the end of the line. In contrast, binary format does not use newline characters to indicate the end of the line.
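To make the effect of the binary attribute concrete, the following Python sketch rebuilds a document from detail rows. It is a simplified model: the assemble_document function and the row layout are hypothetical, and the assumption that a newline is supplied after each row in plain text mode (and never in binary mode) is an interpretation of the description above, not a statement of how ConText behaves internally.

```python
def assemble_document(rows, textkey, binary=0):
    """Rebuild one document from (textkey, line_no, text) detail rows.

    binary=0: each row is treated as one line, so a newline follows it.
    binary=1: rows are concatenated as-is, with no newline added.
    This newline handling is an assumption made for illustration.
    """
    # Keep only the rows for this document, ordered by line number.
    pieces = sorted(
        (line_no, text) for key, line_no, text in rows if key == textkey
    )
    separator = "" if binary else "\n"
    return "".join(text + separator for _, text in pieces)


detail_rows = [
    (1, 2, "second line"),
    (2, 1, "another document"),
    (1, 1, "first line"),
]
print(repr(assemble_document(detail_rows, textkey=1)))
```

The rows for textkey 1 are stored out of order on purpose: the line number column, not the physical row order, determines how the sections are stitched together.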

MASTER DETAIL NEW Tile

The MASTER DETAIL NEW Tile is used for text stored directly in the database in master-detail tables, with the textkey column located in the master table. The column policy is assigned to this column and all detail information is stored in the Data Store preference, rather than the column policy.

Attributes

MASTER DETAIL NEW has the following attribute(s):

binary

The binary attribute specifies whether the text in a master detail table is in plain text format (0) or binary format (1).

detail_table

The detail_table attribute specifies the name of the detail table in the master-detail relationship.

detail_key

The detail_key attribute specifies the name of the foreign key column in the detail table.

detail_lineno

The detail_lineno attribute specifies the name of the column in the detail table that identifies rows in the table.

detail_text

The detail_text attribute specifies the name of the text column in the detail table.

OSFILE Tile

The OSFILE Tile is used for text stored in files accessed through the local file system.

Attributes

OSFILE has the following attribute(s):

path

The path attribute specifies the location of text files that are stored externally in a file system.

Multiple paths can be specified for path, with each path separated by a colon (:). File names are stored in the text column in the text table. If path is not used to specify a path for external files, ConText requires the path to be included in the file names stored in the text column.

Note:

If text is stored in external files rather than in a database, the files must be accessible from the host machine on which the ConText server is running.

This can be accomplished by storing the files in the file system for the host machine or by mounting the file system where the files are stored to the host machine.
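The lookup implied by the path attribute can be sketched in Python as follows. The resolve_text_file function and the report.txt file name are hypothetical, and trying the directories in the order listed is an assumption; the text above does not state a search order.

```python
import os


def resolve_text_file(file_name, path=None):
    """Return the first existing location of file_name, or None.

    path is a colon-separated list of directories, as in the OSFILE
    path attribute. Searching in listed order is an assumption.
    """
    if not path:
        # No path attribute: the stored name must already include the path.
        return file_name if os.path.isfile(file_name) else None
    for directory in path.split(":"):
        candidate = os.path.join(directory, file_name)
        if os.path.isfile(candidate):
            return candidate
    return None


# Directories taken from the DEFAULT_OSFILE preference.
print(resolve_text_file("report.txt", "/oracle/data:/oracle/data2:/oracle/data3"))
```

A result of None corresponds to the situation the note warns about: the file is not reachable from the machine doing the lookup.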

URL Tile

The URL Tile is used for text stored:

in files in the local file system (accessed through the file protocol)
in files on the World Wide Web (accessed through HTTP or FTP)

Attributes

URL has the following attribute(s):

timeout

The timeout attribute specifies the length of time, in seconds, that a network operation such as 'connect' or 'read' waits before timing out and returning a timeout error to the application. The valid range for timeout is 0 to 3600 and the default is 30.

Note:

Since timeout is at the network operation level, the total timeout may be longer than the time specified for timeout.

maxthreads

The maxthreads attribute specifies the maximum number of threads that can be running at the same time. The valid range for maxthreads is 1 to 1024 and the default is 8.

Note:

The upper range of maxthreads corresponds to the number of file descriptors that the operating system can process at one time. If the number of files the operating system can process at one time is less than the value set, an invalid socket error may be returned.

maxurls

The maxurls attribute specifies the maximum number of rows that the internal buffer can hold for HTML documents (rows) retrieved from the text table. The valid range for maxurls is 1 to 4294967295 and the default is 256.

urlsize

The urlsize attribute specifies the maximum length, in bytes, that the URL data store supports for URLs stored in the database. If a URL is over the maximum length, an error is returned. The valid range for urlsize is 32 to 65535 and the default is 256.

Note:

The values specified for maxurls and urlsize, when multiplied, cannot exceed 5000000.

In other words, the maximum size of the memory buffer (maxurls * urlsize) for the URL Tile is approximately 5 Megabytes.

maxdocsize

The maxdocsize attribute specifies the maximum size, in bytes, that the URL data store supports for accessing HTML documents whose URLs are stored in the database. The valid range for maxdocsize is 1 to 4294967295 and the default is 2000000 (2 Mb).

http_proxy

The http_proxy attribute specifies the fully-qualified name of the host machine that serves as the HTTP proxy (gateway) for the machine on which ConText is installed. This attribute must be set if the machine is in an intranet that requires authentication through a proxy server to access Web files located outside the firewall.

ftp_proxy

The ftp_proxy attribute specifies the fully-qualified name of the host machine that serves as the FTP proxy (gateway) for the machine on which ConText is installed. This attribute must be set if the machine is in an intranet that requires authentication through a proxy server to access Web files located outside the firewall.

no_proxy

The no_proxy attribute specifies a string of domains (up to sixteen, separated by commas) which are found in most, if not all, of the machines in your intranet. When one of the domains is encountered in a host name, no request is sent to the machine(s) specified for ftp_proxy and http_proxy. Instead, the request is processed directly by the host machine identified in the URL.

For example, if the string 'us.oracle.com, uk.oracle.com' is entered for no_proxy, any URL requests to machines that contain either of these domains in their host names are not processed by your proxy server(s).
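
The URL Tile attributes described above might be combined as in the following sketch. It assumes that the Tile name passed to CTX_DDL.CREATE_PREFERENCE is 'URL' and that numeric values can be passed as strings, as the FORMAT attribute is in the Filter examples in this chapter. The proxy host name and the preference name are hypothetical:

```sql
begin
  ctx_ddl.set_attribute ('TIMEOUT',    '60');   -- seconds per network operation
  ctx_ddl.set_attribute ('MAXTHREADS', '16');
  -- maxurls * urlsize = 1024 * 512 = 524288, which is within the 5000000 limit
  ctx_ddl.set_attribute ('MAXURLS',    '1024');
  ctx_ddl.set_attribute ('URLSIZE',    '512');
  -- Hypothetical proxy host; substitute the proxy for your intranet
  ctx_ddl.set_attribute ('HTTP_PROXY', 'www-proxy.us.oracle.com');
  -- Hosts in these domains are contacted directly, bypassing the proxy
  ctx_ddl.set_attribute ('NO_PROXY',   'us.oracle.com, uk.oracle.com');
  ctx_ddl.create_preference ('URL_PREF', 'Web documents through a proxy', 'URL');
end;
```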

Data Store Example

The following example creates a preference named doc_pref for the OSFILE Tile:

begin
  ctx_ddl.set_attribute ('PATH', '/private/mydocs');
  ctx_ddl.create_preference ('DOC_PREF', 'Path for my documents', 'OSFILE');
end;

Note:

This example illustrates usage of OSFILE for documents stored in a UNIX-based environment.

The directory path syntax may be different for other environments.

Filter Tiles

Filter Tiles are used to create preferences which determine how text is filtered for indexing and highlighting. Filters allow word processor and formatted documents, as well as ASCII and HTML text documents, to be indexed and highlighted by ConText.

For formatted documents, ConText stores documents in their native format and uses filters to build temporary ASCII versions of the documents. ConText indexes the temporary ASCII text of the formatted document. ConText also uses the ASCII version to highlight query terms.

ConText provides internal filters for processing many of the popular document formats, including Microsoft Word, WordPerfect, and AmiPro.

In addition, ConText allows users to specify external filters for filtering documents in formats not supported by the internal filters provided with ConText.

External filters can also be used to perform operations, such as cleaning up or converting text, before the text is filtered for indexing and highlighting.

See Also:

For examples of creating Filter preferences, see "Creating Filter Preferences" in Chapter 9, "Setting Up and Managing Text".

For more information about internal and external filters, see "Text Filtering" in Chapter 6, "Text Concepts".

List of Filter Tiles and Attributes

ConText provides the following Filter Tiles:

| Tile | Attributes | Attribute Values |
|---|---|---|
| BLASTER FILTER | executable | format id (number), filter executable, sequence (number) |
| | format | 0 or 999 (No filter -- plain/ASCII text) |
| | | 1 or 4 (Word Perfect for Windows 5.x; Word Perfect for DOS 5.0, 5.1) |
| | | 2 (MS Word for DOS 5.0, 5.5) |
| | | 5 (Word Perfect for Windows 6.x; Word Perfect for DOS 6.0) |
| | | 6 (MS Word for Mac 3, 4, 5.x) |
| | | 7 (MS Word for Windows 2) |
| | | 8 (AMIPRO for Windows 1, 2, 3) |
| | | 9 (Lotus 1-2-3 for Windows 2, 3, 4, 5; Lotus 1-2-3 for DOS 4, 5) |
| | | 11 (MS Word for Windows 6.x, 7.0) |
| | | 13 (Xerox XIF for UNIX 5, 6) |
| | | 997 (Autorecognize) |
| FILTER NOP | ** none ** | N/A |
| HTML FILTER | code_conversion | 0 (disabled) |
| | | 1 (enabled) |
| | keep_tag | tag (string), sequence (number) |
| USER FILTER | command | filter executable |

BLASTER FILTER Tile

The BLASTER FILTER Tile is used to specify that the internal filters are used to filter documents. It can also be used to specify that multiple external filters are used to filter documents in a mixed-format column.

Attributes

BLASTER FILTER has the following attribute(s):

format

The format attribute specifies the internal filter used for filtering text stored in a text column.

executable

The executable attribute specifies the external filters that are used to filter text stored in a mixed-format text column. It has three values that must be specified:

format_id (document format for the external filter)
filter_executable (name of executable that performs the filtering for the document format)
sequence_num (identifier for the executable and document format used in the preference)

Note:

format and executable cannot both be set in the same preference.

See Also:

For a list of the format IDs supported by the executable attribute, see "Supported External Filter Formats for Mixed-Format Columns" in Chapter 6, "Text Concepts".

FILTER NOP Tile

The FILTER NOP Tile is used to specify that plain text is stored in the text column and no filtering needs to be performed. It has no attributes.

HTML FILTER Tile

The HTML FILTER Tile is used to specify that the internal HTML filter is used to filter plain text that contains HTML tags.

Attributes

HTML FILTER has the following attribute(s):

code_conversion

The code_conversion attribute specifies whether code conversion is enabled for documents which contain Japanese ASCII text with HTML tags.

Code conversion is required for Japanese HTML documents if the documents use more than one of the three character sets supported for HTML text in Japanese. If code conversion is enabled, all Japanese HTML documents are converted to a single, common character set before indexing.

The default for code_conversion is 0 (disabled).

Note:

For mixed-format columns that use Autorecognize (BLASTER FILTER Tile, format attribute = 997) or use external filters (BLASTER FILTER Tile, executable attribute) for all formats except HTML, code conversion is always enabled.
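
A minimal sketch of enabling code conversion for a column of Japanese HTML documents follows. The preference name is hypothetical, and passing the value as the string '1' is an assumption based on how FORMAT is set in the Filter examples in this chapter:

```sql
begin
  -- 1 = enabled; the default is 0 (disabled)
  ctx_ddl.set_attribute ('CODE_CONVERSION', '1');
  -- JA_HTML_FILTER is an illustrative preference name
  ctx_ddl.create_preference ('JA_HTML_FILTER', 'Japanese HTML, code conversion on', 'HTML FILTER');
end;
```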

keep_tag

The keep_tag attribute takes two values: the HTML tag to retain during indexing and a sequence number that uniquely identifies the tag.

The following rules apply to keep_tag:

the angle brackets '<>' that identify tags in HTML are not required when setting keep_tag
multiple tags can be specified for a Filter preference by calling CTX_DDL.SET_ATTRIBUTE once for each tag, then calling CTX_DDL.CREATE_PREFERENCE
the sequence number specified for each tag must be unique within the preference
if the tag specified for keep_tag contains additional (i.e. meta) information, the additional information is filtered by the HTML filter
For example, keep_tag is set to BODY and the following string occurs in a document:
```
<HTML><BODY BGCOLOR=#ffffff>hello</BODY></HTML>
```
ConText translates the string to:
```
<BODY>hello</BODY>
```
This string is passed to the HTML filter, which ignores the HTML tags, then to the lexer, which indexes the token hello as belonging to the BODY section.

USER FILTER Tile

The USER FILTER Tile is used to specify an external filter for filtering documents in a column.

Attributes

USER FILTER has the following attribute(s):

command

The command attribute specifies the executable for the single external filter used to filter all text stored in a column. If more than one document format is stored in the column, the external filter specified for command must recognize and handle all such formats, otherwise the BLASTER FILTER Tile and the executable attribute should be used instead of the USER FILTER Tile.
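
As a hedged illustration, a preference for a single external filter might be created as follows. The executable name myfilter.sh and the preference name are hypothetical; where the executable must reside and how it is invoked are not covered here:

```sql
begin
  -- Hypothetical executable that handles every format stored in the column
  ctx_ddl.set_attribute ('COMMAND', 'myfilter.sh');
  ctx_ddl.create_preference ('MY_FILTER', 'Site-specific external filter', 'USER FILTER');
end;
```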

Filter Examples

The following section provides two Filter preference examples:

Example 1 (MS Word 6 documents)

The following example creates a preference named word6 for the BLASTER FILTER Tile:

begin
  ctx_ddl.set_attribute ('FORMAT', '11');
  ctx_ddl.create_preference ('WORD6', 'Microsoft Word docs', 'BLASTER FILTER');
end;

Example 2 (HTML documents with document sections enabled)

The following example creates a preference named sect_filt_pref for the HTML FILTER Tile:

begin
   ctx_ddl.set_attribute('KEEP_TAG', 'TITLE', 1);
   ctx_ddl.set_attribute('KEEP_TAG', 'HEAD', 2);
   ctx_ddl.set_attribute('KEEP_TAG', 'BODY', 3);
   ctx_ddl.set_attribute('KEEP_TAG', 'H1', 4);
   ctx_ddl.create_preference('sect_filt_pref','sect search filt','HTML FILTER');
end;

In this example, the <TITLE>, </TITLE>, <HEAD>, </HEAD>, <BODY>, </BODY>, <H1>, and </H1> HTML tags are retained by the HTML filter during filtering, provided the startjoins and endjoins attributes for the BASIC LEXER Tile are set appropriately.

Note:

When using keep_tag to specify tags to be retained, you do not need to specify the angle bracket or forward slash characters in the tag strings.

Lexer Tiles

The Tiles in the Lexer category are used to create preferences which specify the lexer used to perform indexing.

Text Lexers

A text lexer parses text and identifies tokens for indexing. English and other single-byte languages, including most European languages, can use the same lexer because tokens (words) in those languages are delimited by blank spaces and standard punctuation (commas, periods, question marks, etc.).

Japanese, Chinese, and many other Asian languages are pictorial (multi-byte) languages that cannot be tokenized in the same manner as single-byte languages.

Single-Byte Languages

ConText includes a single Lexer Tile, BASIC LEXER Tile, for all of the single-byte languages, such as English (7-bit character set) and the other European languages (8-bit character sets), supported by ConText. The basic lexer also works with languages such as Greek, which have different alphabets, but still utilize blank spaces to delimit words.

Multi-Byte Languages

ConText includes three separate Lexer Tiles for processing Japanese, Chinese, and Korean text.

The CHINESE V-GRAM LEXER Tile and JAPANESE V-GRAM LEXER Tile do not rely on finding token boundaries within text; instead, they use a list of terms to match and index patterns of characters at user-specified, variable points of length.

The Japanese and Chinese lexers also work with languages that use a 7-bit character set, such as English. As a result, ConText supports indexing and querying Japanese and Chinese text that also contains English text.

Note:

Languages that use an 8-bit character set, such as many of the European languages, are not supported by the Japanese and Chinese lexers.

The Korean lexer, KOREAN LEXER Tile, works similarly to the Japanese and Chinese lexers by finding character patterns in the text and matching the patterns to a dictionary of terms. However, due to the significant morphological transformations that Korean verbs undergo, the Korean lexer only indexes nouns and noun phrases.

NLS Compliance

The BASIC LEXER Tile supports all NLS-compliant character sets, including the AL24UTFFSS (UTF-8) character set. UTF-8 is a character set that recognizes the characters from most single-byte and multi-byte character sets.

Users with multi-lingual environments, such as multi-national companies, can specify UTF-8 for a database and use the database to store documents that use any one of the character sets supported by UTF-8. ConText supports indexing all documents stored in a UTF-8 database and queries to the database from clients running any of the UTF-8 supported character sets.

Supported Languages

The BASIC LEXER Tile currently supports the UTF-8 character set only for space-delimited, single-byte languages, which includes English and other Western European languages.

The BASIC LEXER Tile does not support UTF-8 for the multi-byte languages, nor do the Japanese, Chinese, and Korean lexers currently support UTF-8.

Enabling the NLS-compliant Lexer

The BASIC LEXER Tile does not require any setup to enable it to handle UTF-8 or other NLS-compliant character sets; however, the NLS_LANG environment variable must be set to the appropriate language/territory/character set. In addition, the ORA_NLS32 and ORA_NLS environment variables must be set to the directories containing the appropriate NLS data.

Limitations

The lexer has the following limitations when UTF-8 is the character set specified for the database:

base-letter conversion is not supported
characters from 8-bit character sets are not supported in the BASIC LEXER Tile attributes (i.e. printjoins, skipjoins, startjoins, endjoins, punctuations, numjoin, numgroup, continuation)

Composite Word Indexing

For German-language text, the BASIC LEXER Tile provides an attribute for enabling composite word indexing. With composite word indexing, tokens that are compound words (specifically, nouns in German text) are divided into their constituent (root) nouns, including inflected forms of the roots, and the roots are stored in the ConText index along with the entry for the compound word.

For example, if the word Hauptbahnhof is encountered in a document during composite word indexing, the following entries are created in the index: HAUPTBAHNHOF, HAUPT, BAHN, BAHNEN, HOF.

Note:

Because each token that is encountered has to be processed through the ConText decompounding routines, composite indexing may affect indexing performance.

In addition, because composite word indexes may be substantially larger than standard text indexes, composite word indexing may affect query performance.

Supported Character Sets

Composite word indexing supports both single-byte and multi-byte character sets, specifically WE8ISO8859P9 (extended, single-byte) and AL24UTFFSS (multi-byte).

Limitations

Composite indexes have the following limitations:

composite indexing must be enabled for text columns containing only German text. If the column contains text in other languages, composite indexing will fail
composite word indexes do not support exact word searches (i.e. standard text queries). If you want to enable composite and exact word queries for a column, you must create a composite index and a standard index for the column

case-sensitivity is not supported for composite indexes (all tokens are stored in all-uppercase)

Note:

The uppercasing of all tokens in a composite index results in the composite routines not recognizing some compound nouns. As a result, those nouns are not divided into their root nouns and are indexed as regular tokens with a single entry only in the index.

Word Queries

Composite word indexing enables text queries to return all documents that contain either the query term itself or the query term as a root of a compound word; however, queries for phrases that contain one or more compound words return only the documents that contain the exact phrase.

Note:

For more information about composite word queries, see Oracle8 ConText Cartridge Application Developer's Guide.

Theme Lexer

For English-language text, a separate lexer, THEME LEXER Tile, is provided for creating theme indexes. This lexer breaks text into tokens; however, the tokens are not stored in the theme index. The tokens are passed to the ConText linguistic core where they are analyzed within the context of the sentences and paragraphs in which they appeared to determine whether they are content-bearing words. The linguistic core then generates themes, which are stored in the theme index.

The themes generated by ConText are based on, but are not identical to, the content-bearing tokens in the text.

See Also:

For more information about the theme lexer and theme indexing, see "Theme Indexes" in Chapter 6, "Text Concepts".

List of Lexer Tiles and Attributes

ConText provides the following Lexer Tiles for creating preferences for indexing:

| Tile | Attributes | Attribute Values |
|---|---|---|
| BASIC LEXER | base_letter | 0 (disabled) |
| | | 1 (enabled) |
| | continuation | characters (string) |
| | numgroup | characters (string) |
| | numjoin | characters (string) |
| | printjoins | characters (string) |
| | punctuations | characters (string) |
| | skipjoins | characters (string) |
| | startjoins | non-alphanumeric characters that occur at the beginning of a token (string) |
| | endjoins | non-alphanumeric characters that occur at the end of a token (string) |
| | mixed_case | 0 (disabled) |
| | | 1 (enabled) |
| | composite | 0 (no composite word indexing) |
| | | 1 (German composite word indexing) |
| CHINESE V-GRAM LEXER | hanzi_indexing | 1 |
| | | 2 |
| JAPANESE V-GRAM LEXER | kanji_indexing | 1 |
| | | 2 |
| KOREAN LEXER | ** none ** | N/A |
| THEME LEXER | ** none ** | N/A |

BASIC LEXER Tile

The BASIC LEXER Tile is used to identify tokens for creating text indexes for English and all other supported single-byte languages. It is also used to enable base-letter conversion for single-byte languages that have extended character sets.

Note:

Any changes made to tokens before text indexing (e.g. removing of characters, base-letter conversion) are also performed on the query terms in a text query. This ensures that the query terms match the form of the tokens in the text index entries.

Attributes

BASIC LEXER has the following attribute(s):

Note:

The character strings for the BASIC LEXER Tile attributes can contain multiple characters. Each character in the string serves as a punctuation, join, or continuation character.

For example, if the string '*_.-' is specified for the printjoins attribute, each individual character ('*', '_', '.', and '-') in the string is treated by ConText as a join character that is included in the index entry for a token in which the character occurs.

base_letter

base_letter specifies whether characters that have diacritical marks (umlauts, cedillas, acute accents, etc.) are converted to their base form before being stored in the text index. The default is 0 (base-letter conversion disabled).

continuation

continuation specifies the characters that indicate a word continues on the next line and should be indexed as a single token. The most common continuation characters are a hyphen '-' and a backslash '\'.

numgroup

numgroup specifies the characters that, when they appear in a string of digits, indicate that the digits are groupings within a larger single unit.

For example, comma ',' or period '.' may be defined as numgroup characters because they often indicate a grouping of thousands when they appear in a string of digits.

numjoin

numjoin specifies the characters that, when they appear in a string of digits, cause ConText to index the string of digits as a single unit or word.

For example, period '.' or comma ',' may be defined as numjoin characters because they often serve as decimal points when they appear in a string of digits.

Note:

The default values for numjoin and numgroup are determined by the NLS initialization parameters that are specified for the database.

In general, a value does not need to be specified for either numjoin or numgroup when creating a Lexer preference for the BASIC LEXER Tile.

printjoins

printjoins specifies the non-alphanumeric characters that, when they appear anywhere in a word (beginning, middle, or end), are processed by ConText as alphanumeric and included with the token in the text index. This includes printjoins that occur consecutively.

For example, if the hyphen '-' and underscore '_' characters are defined as printjoins, terms such as pseudo-intellectual and _file_ are stored in the text index as pseudo-intellectual and _file_.

Note:

If a printjoins character is also defined as a punctuations character, the character is only processed as an alphanumeric character if the character immediately following it is a standard alphanumeric character or has been defined as a printjoins or skipjoins character.

punctuations

punctuations specifies the non-alphanumeric characters that, when they appear at the end of a word, indicate the end of a sentence or a grouping within a sentence.

Characters that are defined as punctuations are removed from a token before text indexing; however, if a punctuations character is also defined as a printjoins character, the character is only removed if it is the last character in the token.

For example, if the period (.) is defined as both a printjoins and a punctuations character, the following transformations take place during indexing and querying as well:

| Token | Indexed Token |
|---|---|
| .doc | .doc |
| dog.doc | dog.doc |
| dog..doc | dog..doc |
| dog. | dog |
| dog... | dog.. |

skipjoins

skipjoins specifies the non-alphanumeric characters that, when they appear within a word, identify the word as a single token; however, the characters are not stored with the token in the text index.

For example, if the hyphen character '-' is defined as a skipjoins character, the word pseudo-intellectual is stored in the text index as pseudointellectual.

Note:

printjoins and skipjoins are mutually exclusive. The same characters cannot be specified for both attributes.
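
The difference between the two attributes can be sketched with a preference that skips hyphens but keeps underscores. The preference name is hypothetical, and the block assumes the same CTX_DDL calls used in the Lexer examples in this chapter:

```sql
begin
  -- Hyphen joins the word but is dropped from the indexed token
  ctx_ddl.set_attribute ('SKIPJOINS',  '-');
  -- Underscore joins the word and is kept in the indexed token
  ctx_ddl.set_attribute ('PRINTJOINS', '_');
  -- No character appears in both strings, as the note above requires
  ctx_ddl.create_preference ('SKIP_LEX', 'hyphen skipped, underscore kept', 'BASIC LEXER');
end;
```

Under these settings, pseudo-intellectual would be expected to be indexed as pseudointellectual, while _file_ would be indexed unchanged.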

startjoins/endjoins

startjoins specifies the characters that, when encountered as the first character in a token, explicitly identify the start of the token. The character, as well as any other startjoins characters that immediately follow it, is included in the ConText index entry for the token. In addition, the first startjoins character in a string of startjoins characters implicitly ends the previous token.

endjoins specifies the characters that, when encountered as the last character in a token, explicitly identify the end of the token. The character, as well as any other endjoins characters that immediately follow it, is included in the ConText index entry for the token.

The following rules apply to both startjoins and endjoins:

the characters specified for startjoins/endjoins cannot occur in any of the other attributes for BASIC LEXER.
startjoins/endjoins characters can occur only at the beginning/end of tokens

multiple, contiguous startjoins/endjoins characters are allowed at the beginning/end of a token; however, multiple occurrences of the same startjoins/endjoins character at the beginning/end of a token are not supported

Note:

Defining startjoins and endjoins characters is particularly useful for creating document sections that enable section searching in a column.

For examples of creating sections and section groups, see "Managing Document Sections" in Chapter 9, "Setting Up and Managing Text".

For more information about sections, see "Document Sections" in Chapter 6, "Text Concepts".

For more information about section searching, see Oracle8 ConText Cartridge Application Developer's Guide.

mixed_case

The mixed_case attribute specifies whether the lexer converts the tokens in text index entries to all uppercase or stores the tokens exactly as they appear in the text. The default is 0 (tokens converted to all uppercase).

composite

The composite attribute specifies whether composite word indexing is enabled. For the current release, composite indexing is supported for German-language text only. The default is 0 (no composite word indexing).

Note:

In a Lexer preference that uses the BASIC LEXER Tile, the composite and mixed_case attributes cannot both be set. Composite indexes do not support case-sensitivity.

See Also:

For more information, see "Composite Word Indexing" in this chapter.
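
A minimal sketch of a Lexer preference for a column that contains only German text follows. The preference name is hypothetical, and passing the value as the string '1' is an assumption based on how FORMAT is set in the Filter examples in this chapter:

```sql
begin
  -- 1 = German composite word indexing; the default is 0
  ctx_ddl.set_attribute ('COMPOSITE', '1');
  -- MIXED_CASE is deliberately left unset: it cannot be combined with COMPOSITE
  ctx_ddl.create_preference ('GERMAN_LEX', 'German composite word indexing', 'BASIC LEXER');
end;
```

If exact word queries are also needed on the same column, a second policy with a standard (non-composite) Lexer preference would be required, as described in "Composite Word Indexing".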

CHINESE V-GRAM LEXER Tile

The CHINESE V-GRAM LEXER Tile is used for identifying tokens for creating text indexes for Chinese text.

Attributes

CHINESE V-GRAM LEXER has the following attribute(s):

hanzi_indexing

The hanzi_indexing attribute specifies the number of characters used for pattern matching while indexing.

A value of 1 for hanzi_indexing indicates that the Chinese lexer examines each character individually to determine token boundaries.

A value of 2 for hanzi_indexing indicates that the lexer examines characters in pairs to determine token boundaries. Pattern matching using pairs is generally faster than matching individual characters, resulting in faster index creation.

The default is 2.
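
As an illustration only, the following block overrides the default so that the Chinese lexer examines characters individually. The preference name is hypothetical, and the value is passed as a string on the assumption that numeric attributes accept the same form as FORMAT in the Filter examples:

```sql
begin
  -- 1 = examine each character individually; the default of 2 examines pairs
  ctx_ddl.set_attribute ('HANZI_INDEXING', '1');
  ctx_ddl.create_preference ('ZH_LEX', 'Chinese, single-character matching', 'CHINESE V-GRAM LEXER');
end;
```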

JAPANESE V-GRAM LEXER Tile

The JAPANESE V-GRAM LEXER Tile is used for identifying tokens for creating text indexes for Japanese text.

Attributes

JAPANESE V-GRAM LEXER has the following attribute(s):

kanji_indexing

The kanji_indexing attribute specifies the number of characters used for pattern matching while indexing.

A value of 1 for kanji_indexing indicates that the Japanese lexer examines each character individually to determine token boundaries.

A value of 2 for kanji_indexing indicates that the lexer examines pairs of characters to determine token boundaries. Pattern matching using pairs is generally faster than matching individual characters, resulting in faster index creation.

The default is 2.

KOREAN LEXER Tile

The KOREAN LEXER Tile is used for identifying tokens for creating text indexes for Korean text. It has no attributes.

THEME LEXER Tile

The THEME LEXER Tile is used to create theme indexes for English-language text. It has no attributes.
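
Because the Tile has no attributes, a theme Lexer preference can presumably be created without any preceding CTX_DDL.SET_ATTRIBUTE calls. The following sketch assumes this, and the preference name is hypothetical:

```sql
begin
  -- No set_attribute calls: THEME LEXER has no attributes
  ctx_ddl.create_preference ('THEME_LEX', 'English theme indexing', 'THEME LEXER');
end;
```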

Lexer Examples

The following section provides two Lexer preference examples that both use the BASIC LEXER Tile.

Example 1

The following example creates a preference named doc_link:

begin
  ctx_ddl.set_attribute     ('PRINTJOINS', '.-@&$#/');
  ctx_ddl.create_preference ('DOC_LINK', 'numerous joins', 'BASIC LEXER' );
end;

In this example, the '.', '-', '@', '&', '$', '#', and '/' characters are all defined as printjoins characters.

Characters such as the dollar sign '$' and number sign '#' are useful if you want to index tokens that may contain these characters, such as sums of money and numbers.

Example 2 (startjoins and endjoins)

The following example creates a preference named sect_lex_pref:

exec ctx_ddl.set_attribute('startjoins','</');
exec ctx_ddl.set_attribute('endjoins','>');
exec ctx_ddl.set_attribute('printjoins','_@-&$#.');
...
exec ctx_ddl.create_preference('sect_lex_pref','basic lexing + sections','BASIC LEXER');

In this example, the characters '<' and '/' are defined as startjoins characters. The character '>' is defined as an endjoins character.

The open and closed angle brackets '< >' and the forward slash '/' are useful for identifying HTML tags for document sections.

See Also:

For more information about sections, see "Document Sections" in Chapter 6, "Text Concepts".

Engine Tiles

Engine Tiles are used to create preferences which specify how ConText indexes are created by the ConText engine and where in the database the indexes are stored.

The engine is the ConText component that creates a ConText index for a text column. A ConText index is required before text in a column can be queried.

See Also:

For an example of creating an Engine preference, see "Creating an Engine Preference" in Chapter 9, "Setting Up and Managing Text".

List of Engine Tiles and Attributes

ConText provides the following Engine Tiles:

| Tile | Attributes | Attribute Values |
|---|---|---|
| ENGINE NOP (NOT USED) | ** none ** | N/A |
| GENERIC ENGINE | index_memory | memory in bytes (integer) |
| | optimize_default | default ConText index optimization method |
| | i1t_tablespace, i1t_storage, i1t_other_parms | tablespace (string), STORAGE clause (string), and other table creation parameters (string) for token table |
| | i1i_tablespace, i1i_storage, i1i_other_parms | tablespace (string), STORAGE clause (string), and other index creation parameters (string) for index on token table |
| | ktb_tablespace, ktb_storage, ktb_other_parms | tablespace (string), STORAGE clause (string), and other table creation parameters (string) for mapping table |
| | kid_tablespace, kid_storage, kid_other_parms; kik_tablespace, kik_storage, kik_other_parms | tablespace (string), STORAGE clause (string), and other index creation parameters (string) for indexes on mapping table |
| | lst_tablespace, lst_storage, lst_other_parms | tablespace (string), STORAGE clause (string), and other table creation parameters (string) for control table |
| | lix_tablespace, lix_storage, lix_other_parms | tablespace (string), STORAGE clause (string), and other table creation parameters (string) for index on control table |
| | sqr_tablespace, sqr_storage, sqr_other_parms | tablespace (string), STORAGE clause (string), and other table creation parameters (string) for SQE results table |
| | sri_tablespace, sri_storage, sri_other_parms | tablespace (string), STORAGE clause (string), and other table creation parameters (string) for index on SQE results table |

ENGINE NOP Tile

The ENGINE NOP Tile specifies that no engine is used for indexing. This Tile is currently not used and should not be used to create preferences for indexing.

GENERIC ENGINE Tile

The GENERIC ENGINE Tile specifies that the engine provided by ConText is used for indexing. ConText supplies a single engine that creates index entries for ConText indexes, independent of the format, location, language, and character set of the text.

In particular, the GENERIC ENGINE Tile attributes specify the amount of memory allocated for indexing, and the tablespace(s) and creation parameters for the database tables and indexes that constitute a ConText index.

See Also:

For descriptions of the ConText index tables and indexes, see Appendix C, "ConText Index Tables and Indexes".

Attributes

GENERIC ENGINE has the following attribute(s):

index_memory

index_memory specifies the amount of memory, in bytes, allocated for indexing.

Note:

When specifying a value for index_memory in a preference, specify as much real (not virtual) memory as is available on the machine which is running the ConText server that will be creating indexes.

For parallel indexing, the memory specified should be the amount of available memory divided evenly among the number of ConText servers that will perform the indexing in parallel.
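
The division described in this note can be sketched as follows. The figures are hypothetical: 120000000 bytes of available real memory shared by 4 ConText servers gives 30000000 bytes per server. The preference name is also hypothetical:

```sql
declare
  -- Hypothetical figures; substitute the values for your host machine
  available_bytes constant number := 120000000;
  server_count    constant number := 4;
begin
  -- 120000000 / 4 = 30000000 bytes for each server indexing in parallel
  ctx_ddl.set_attribute ('INDEX_MEMORY', trunc(available_bytes / server_count));
  ctx_ddl.create_preference ('PAR_ENGINE', 'memory for 4 parallel servers', 'GENERIC ENGINE');
end;
```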

optimize_default

optimize_default specifies the type of optimization used when CTX_DDL.OPTIMIZE_INDEX is called without an optimization type. If no value is specified for optimize_default, the default is DEFRAGMENT_TO_TWO_TABLE.

xxx_tablespace

i1t_tablespace, ktb_tablespace, and lst_tablespace specify the tablespaces used for the ConText index tables created during indexing.

sqr_tablespace specifies the tablespace used for the stored query expression result (SQR) table that is created, but not populated, during indexing. The SQR table for a policy stores the results of stored query expressions for the policy.

i1i_tablespace, kid_tablespace, kik_tablespace, and lix_tablespace specify the tablespaces used for the Oracle indexes generated for each ConText index table.

sri_tablespace specifies the tablespace used for the Oracle index generated for each SQR table.

Note:

For each xxx_tablespace attribute that is not specified when creating an Engine preference, the text table owner's default tablespace is used for storing the ConText index objects (tables and indexes).

xxx_storage

i1t_storage, ktb_storage, and lst_storage specify the STORAGE clauses used to create the ConText index tables during ConText indexing.

sqr_storage specifies the STORAGE clause used to create the stored query expression result (SQR) table during ConText indexing.

i1i_storage, kid_storage, kik_storage, and lix_storage specify the STORAGE clauses used to create the Oracle indexes for each ConText index table.

sri_storage specifies the STORAGE clause used to create the Oracle index for each SQR table.

See Also:

For more information about the STORAGE clause, see the CREATE TABLE and CREATE INDEX commands in Oracle8 Server SQL Reference.
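The following sketch shows how the xxx_tablespace and xxx_storage attributes might be combined for the SQR table and its Oracle index. The tablespace names SQE_DATA and SQE_INDEX, the extent sizes, and the preference name SQE_ENGINE are hypothetical. Attributes that are not set fall back to the defaults described above.

```sql
begin
  -- SQR table: hypothetical tablespace and extent sizes
  ctx_ddl.set_attribute ('SQR_TABLESPACE', 'SQE_DATA' );
  ctx_ddl.set_attribute ('SQR_STORAGE',    'initial 1M next 1M' );
  -- Oracle index on the SQR table: kept in a separate, hypothetical tablespace
  ctx_ddl.set_attribute ('SRI_TABLESPACE', 'SQE_INDEX' );
  ctx_ddl.set_attribute ('SRI_STORAGE',    'initial 512K next 512K' );
  ctx_ddl.create_preference ('SQE_ENGINE',
                             'Separate tablespaces for SQE results',
                             'GENERIC ENGINE' );
end;
```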

xxx_other_parms

i1t_other_parms, ktb_other_parms, and lst_other_parms specify any additional parameters used to create the ConText index tables during ConText indexing.

sqr_other_parms specifies any additional parameters used to create the stored query expression result (SQR) table during ConText indexing.

i1i_other_parms, kid_other_parms, kik_other_parms, and lix_other_parms specify any additional parameters used to create the Oracle indexes for each ConText index table.

sri_other_parms specifies any additional parameters used to create the Oracle index for each SQR table.

Note:

In particular, the xxx_other_parms attributes are used to specify a value for the PARALLEL clause in the CREATE TABLE|INDEX command. The PARALLEL clause determines the degree of parallelism used by the Oracle parallel query option for operations such as generating Oracle indexes.

For more information about the PARALLEL clause in CREATE TABLE and CREATE INDEX, as well as the other parameters that can be used to create database tables and indexes, see Oracle8 Server SQL Reference.

For more information about the parallel query option in Oracle, see Oracle8 Server Tuning.

See Also:

For more information about SQEs, see Oracle8 ConText Cartridge Application Developer's Guide.

Engine Example

The following example creates a preference named doc_engine for the GENERIC ENGINE Tile:

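begin
  -- Memory, in bytes, allocated for indexing
  ctx_ddl.set_attribute ('INDEX_MEMORY',   30000000 );
  -- Token table: tablespace, STORAGE clause, and additional parameters
  ctx_ddl.set_attribute ('I1T_TABLESPACE', 'DOCUMENTS' );
  ctx_ddl.set_attribute ('I1T_STORAGE',' initial 10M next 2M
                         maxextents 10');
  ctx_ddl.set_attribute ('I1T_OTHER_PARMS',' pctfree 20');
  -- Index on the token table: PARALLEL clause with a degree of 2
  ctx_ddl.set_attribute ('I1I_OTHER_PARMS',' parallel 2');
  -- Create the preference from the attributes set above
  ctx_ddl.create_preference ('DOC_ENGINE', 'Test case',
                             'GENERIC ENGINE' );
end;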

Wordlist Tiles

The Tiles in the Wordlist category are used to create preferences for enabling three of the ConText query expansion methods:

Stemming
Fuzzy Matching
Soundex

See Also:

For more information about expanding queries and the query expansion operators provided by ConText, see Oracle8 ConText Cartridge Application Developer's Guide.

Stemming

Stemming expands a query by deriving variations (verb conjugations and noun, pronoun, and adjective inflections) of the search token(s) in the query.

For example, a stem search on the verb buy expands to include its alternate verb forms, such as buys, buying, and bought, but does not expand to the noun buyer. A search on the noun buyer would expand only to include its plural form buyers.
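A stem search of this kind might look like the following one-step query. This is only a sketch: the table docs, its columns id and text, and the existence of a policy and text index on docs.text are assumptions, and the $ prefix is the stem operator as described in Oracle8 ConText Cartridge Application Developer's Guide.

```sql
-- Hypothetical table DOCS with a text column TEXT that has a ConText index.
-- '$buy' requests stem expansion of buy (buys, buying, bought).
select id
  from docs
 where contains(text, '$buy') > 0;
```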

Since different languages have different stemming rules, stemming is language-dependent and uses term lists that define the relationships between the words in a given language.

ConText provides a stemmer, licensed from Xerox Corporation, that utilizes Xerox Lexical Technology to support inflectional and derivational stemming in English and inflectional stemming in a number of Western European languages.

Fuzzy Matching

Fuzzy matching expands queries by including terms that are spelled similarly to the search token in the query. This type of expansion can be useful in queries for text that contains frequent misspellings or has been scanned using OCR software.

For example, a fuzzy matching query for the term cat expands to include cats, calc, and case.

The number of expansions generated by fuzzy matching depends on the tokens that ConText identified during indexing; results can vary significantly according to the tokens that were identified and indexed by ConText for the column. As such, fuzzy matching depends on how tokens are delimited in a given language.

Note:

Fuzzy matching is designed primarily for English-language documents, but can be used, with varying degrees of success, with many of the Western European languages.

Soundex

During text indexing of a column, Soundex, if enabled, creates a list of all the words that sound alike and assigns one or more IDs to each word to identify the other words in the list that sound like the word.

Note:

Soundex is designed primarily to look for matches in phonetic spellings used in English, but can be used, with varying degrees of success, with many of the other Western European languages.

The Soundex wordlist is stored in the DR_nnnnn_I1W ConText index table, where nnnnn is the identifier of the policy for the text index.

If Soundex is enabled for a text column, users can call Soundex in a query to expand the query. Soundex expands a query by searching the I1W table for terms that sound similar to the specified query term.

For example, a Soundex search on the name Smith would also find the names Smythe and Smit.

Note:

Soundex in ConText uses the same algorithm as the SOUNDEX function in SQL.

For more information about the SOUNDEX function in SQL, see Oracle8 Server SQL Reference.
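Because the algorithm is the same as the SQL SOUNDEX function, the Smith example above can be checked directly in SQL. The following query is a small illustration and does not involve a ConText index. The column aliases are arbitrary.

```sql
-- SOUNDEX is a standard SQL function; DUAL is Oracle's one-row dummy table.
select soundex('Smith')  as smith_code,
       soundex('Smythe') as smythe_code,
       soundex('Smit')   as smit_code
  from dual;
```

All three names return the same code (S530), which is why a Soundex search on Smith also finds Smythe and Smit.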

List of Wordlist Tiles and Attributes

ConText provides the following Wordlist Tiles:

Tile	Attributes	Attribute Values
GENERIC WORD LIST	stclause	STORAGE clause (string) for Soundex wordlist table
	instclause	STORAGE clause (string) for index on Soundex wordlist table
	soundex_at_index	0 (disabled)
		1 (enabled)
	stemmer	1 (English)
		2 (English -- derivational)
		3 (Dutch)
		4 (French)
		5 (German)
		6 (Italian)
		7 (Spanish)
	fuzzy_match	1 (English and other Western European languages)
		2 (Japanese)
		3 (Korean)
		4 (Chinese)
		12 (Soundex emulation)
		13 (Dutch)
		14 (French)
		15 (German)
		16 (Italian)
		17 (Spanish)
		18 (OCR text)
	section_group	name of section group

GENERIC WORD LIST Tile

The GENERIC WORD LIST Tile is used to specify the advanced query options for ConText indexes. ConText provides a single Tile for handling stemming, fuzzy matching, Soundex, and named section searching.

See Also:

For more information about expansion methods in queries, see Oracle8 ConText Cartridge Application Developer's Guide.

Attributes

GENERIC WORD LIST has the following attribute(s):

stclause

The stclause attribute specifies the STORAGE clause used to create the Soundex wordlist table during ConText indexing. The Soundex wordlist table is only created if Soundex is enabled through the soundex_at_index attribute.

instclause

The instclause attribute specifies the STORAGE clause used to create the Oracle index for the Soundex wordlist table.

soundex_at_index

The soundex_at_index attribute specifies whether ConText generates Soundex word mappings and stores them in the Soundex wordlist table during text indexing. If Soundex word mappings are not generated and stored in the wordlist table during indexing, queries that use Soundex are not expanded.

stemmer

The stemmer attribute specifies the stemmer used for word stemming in text queries. For all the supported languages, the stemmers return standard inflected forms of a word, such as the plural form (e.g. department --> departments).

For English, an additional stemmer is provided which returns standard inflected forms and derived forms (e.g. department --> departments, departmentalize).

The default for stemmer is 1 (inflectional English).

fuzzy_match

The fuzzy_match attribute specifies which fuzzy matching routines are used for the column. Fuzzy matching is currently supported for English, Japanese, and, to a lesser extent, the Western European languages.

The default for fuzzy_match is 1.

Note:

The fuzzy_match attribute values for Chinese and Korean are dummy attribute values that prevent the English and Japanese fuzzy matching routines from being used on Chinese and Korean text.
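As an illustration of how the stemmer and fuzzy_match values are combined, the following sketch creates a Wordlist preference for French text, using the values 4 (French) and 14 (French) from the attribute list. The preference name FR_WORDLIST_DEMO is hypothetical, and the Tile name string is written as the Tile is named in this section. Verify the exact spelling accepted by your installation.

```sql
begin
  ctx_ddl.set_attribute ('STEMMER',     '4');   -- French stemmer
  ctx_ddl.set_attribute ('FUZZY_MATCH', '14');  -- French fuzzy matching
  ctx_ddl.create_preference ('FR_WORDLIST_DEMO',
                             'French stemming and fuzzy matching',
                             'GENERIC WORD LIST');
end;
```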

section_group

The section_group attribute specifies the name of the section group to assign to a text column. The following rules apply to section_group:

no default value for section_group
all available section groups in the ConText data dictionary can be specified for section_group; the section group owner does not need to be the same as the policy owner

See Also:

For more information about section groups, see "Document Sections" in Chapter 6, "Text Concepts".

Wordlist Example

The following example creates a preference named soundex_yes for the GENERIC WORD LIST Tile:

begin
  -- 1 enables generation of Soundex word mappings during text indexing
  ctx_ddl.set_attribute('SOUNDEX_AT_INDEX', '1');
  ctx_ddl.create_preference('SOUNDEX_YES',
                            'Will build the soundex mapping during indexing',
                            'GENERIC WORD LIST');
end;

Stoplist Tiles

The Stoplist Tiles are used to create Stoplist preferences. A stoplist is a list of common terms that ConText does not include in the text index for a text column.

Each stoplist can contain a maximum of 4095 words.

See Also:

For an example of creating a Stoplist preference, see "Creating a Stoplist Preference" in Chapter 9, "Setting Up and Managing Text".

List of Stoplist Tiles and Attributes

The Stoplist category contains the following Tiles:

Tile	Attributes	Attribute Values
GENERIC STOP LIST	stop_word	word (string), sequence (number)

GENERIC STOP LIST Tile

The GENERIC STOP LIST Tile specifies the terms that should not be included in the text index.

Attributes

GENERIC STOP LIST has the following attribute(s):

stop_word

The stop_word attribute has two values that must be specified:

the word for which ConText does not create an entry in the text index
the sequence for the word

sequence is a value from 1 to 4095 and is used in a text index to record the stop words that precede and follow an indexed term. ConText records up to eight preceding stop words and eight following stop words for each indexed term. This enables text queries for phrases that contain stop words.

For example, consider the sentence "he is at the top of the class" where at, the, top, and of are stop words. The sequences for each of the stop words are recorded as part of the text index entry for the term class, which allows users to include stop words in a query (e.g. 'top of the class').

Stoplist Example

The following example creates a preference named mini_stoplist for the GENERIC STOP LIST Tile:

begin
  ctx_ddl.set_attribute     ('STOP_WORD', 'a',   1);
  ctx_ddl.set_attribute     ('STOP_WORD', 'A',   2);
  ctx_ddl.set_attribute     ('STOP_WORD', 'the', 3);
  ctx_ddl.set_attribute     ('STOP_WORD', 'The', 4);
  ctx_ddl.set_attribute     ('STOP_WORD', 'and', 5);
  ctx_ddl.set_attribute     ('STOP_WORD', 'And', 6);
  ctx_ddl.create_preference ('MINI_STOPLIST', 'minilist', 'GENERIC STOP LIST' );
end;

Note:

This example illustrates a stoplist for a case-sensitive text index. If the stoplist is for a case-insensitive index, the stoplist requires only one entry for each stop word and the case of the entry has no effect.
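For comparison, a stoplist for a case-insensitive index might be sketched as follows, with a single entry per stop word. The preference name MINI_STOPLIST_CI and its description are hypothetical.

```sql
begin
  -- One entry per stop word; the case of each entry has no effect
  -- when the index is case-insensitive.
  ctx_ddl.set_attribute     ('STOP_WORD', 'a',   1);
  ctx_ddl.set_attribute     ('STOP_WORD', 'the', 2);
  ctx_ddl.set_attribute     ('STOP_WORD', 'and', 3);
  ctx_ddl.create_preference ('MINI_STOPLIST_CI', 'minilist, case-insensitive', 'GENERIC STOP LIST' );
end;
```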
