Oracle8 ConText Cartridge Application Developer's Guide Release 2.3 A58164-01
This chapter explains how to use ConText to create query expressions to find relevant text in documents. The topics covered in this chapter are:
A query expression defines the search criteria for retrieving documents using ConText. A query expression consists of query terms (words and phrases) and other components such as operators and special characters which allow users to specify exactly which documents are retrieved by ConText.
A query expression can also call stored query expressions (SQEs) to return stored query results or call PL/SQL functions to return values used in the query.
When a query is executed using any of the methods supported by ConText, one of the arguments included in the query is a query expression. ConText then returns a list of all the documents that satisfy the search criteria, as well as scores that measure the relevance of the document to the search criteria.
Query terms can consist of words and phrases. Query terms can also contain stopwords.
The words in a query expression are the individual tokens on which the query expression operators perform an action. If multiple words are contained in a query expression, separated only by blank spaces (no operators), the string of words is considered a phrase and the entire string is searched for during a query.
Stopwords are common words, such as and, the, of, and to, that are not considered significant query terms by themselves because they occur so often in text. However, stopwords can provide useful search information when combined with more significant terms.
For example, a query for documents containing the phrase peanut butter and jelly returns different results than a query for documents containing the two separate terms peanut butter and jelly.
When you define a policy for a column, ConText lets you identify a list of stopwords. When stopwords are encountered in the documents in the column, they are not included as indexed terms in the text index; however, they are recorded.
As a result, stopwords cannot be searched for explicitly in text queries, but can be included as part of a phrase in a query expression.
See Also:
For more information about querying with stopwords, see "Querying with Stopwords" in this chapter.
Stoplists can be created in any language supported by ConText. ConText provides a default stoplist in English.
In addition to query terms, a query expression may contain any or all of the following components:
With text queries, you can issue case-sensitive and case-insensitive queries. The ability to query in a case-sensitive way depends on the lexer preference used to index the document set.
By default, ConText uses a lexer preference that is not case-sensitive when indexing documents. Therefore, with a policy containing the default lexer preference, queries are not case-sensitive. When queries are not case-sensitive, a query on United returns the same hits as a query on united.
To issue case-sensitive text queries, you or your ConText administrator must first index your document set using a policy with a case-sensitive lexer preference. Using the same policy, you can issue case-sensitive queries. With case-sensitive queries, a query on United is different from a query on united.
Case-sensitive querying helps to identify words that have different meanings when capitalized. For example, to query on the proper noun Church (as someone's name) without getting the hits for the common noun church, you issue Church as your query. ConText returns all appearances of Church.
When you have case-sensitivity enabled, searches on stopwords are also case-sensitive. Thus when you issue a case-sensitive query on a phrase containing stopwords and non-stopwords, ConText searches for the phrase containing the stopwords with the specified case.
For example, assuming the word on is a stopword and case-sensitivity is enabled, a search on the phrase on the waterfront does not return hits for documents containing the phrase On the waterfront.
German-language text contains composite words. With ConText, you can create a composite index and subsequently issue queries to search for composite words using a sub-composite word as your query term.
For example, when using a German composite index, a query on the term Bahnhof (train station) returns documents that contain Bahnhof or any word containing Bahnhof as a sub-composite, such as Hauptbahnhof, Nordbahnhof, or Ostbahnhof.
However, a query on Bahnhof does not return documents that contain the single words Bahn or Hof.
To query against a composite index with two-step or in-memory queries, you specify the policy associated with the composite index. For one-step queries, you must specify the policy if the text column has more than one index attached to it.
See Also:
For more information about creating a composite index for German, see Oracle8 ConText Cartridge Administrator's Guide.
You can use text highlighting with composite word queries in German. When you do so, ConText highlights the entire composite word, not just the sub-composite you entered as your query.
For example, when you issue Bahnhof as your query, ConText highlights the words Hauptbahnhof, Nordbahnhof, and Ostbahnhof entirely.
See Also:
For more information on highlighting text queries, see Chapter 6, "Document Presentation".
For languages that use an 8-bit character set, such as French and Spanish, ConText gives you the option of converting characters to their base-letter representation before text indexing. This means that words with accents, umlauts, and so on are converted to their base-letter representation before their tokens are placed in the text index.
When you specify a text index that has used base-letter conversion in a query, ConText converts the term in the query expression to match the base-letter representation before the query is processed. In addition, all expansion and stopword checking for the query is performed on the base-letter terms.
Note: The terms in a thesaural query are not converted to base-letter representation before look-up in the thesaurus. The base-letter conversion takes place after the thesaurus look-up and is performed on all the terms returned by the thesaurus.
See Also:
For more information about creating an index that supports base-letter conversion, see Oracle8 ConText Cartridge Administrator's Guide.
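As a rough illustration of base-letter conversion, the following Python sketch strips accents and umlauts by using Unicode decomposition. The function name base_letter is hypothetical, and the conversion rules that ConText itself applies may differ from this approximation.

```python
import unicodedata

def base_letter(term: str) -> str:
    """Approximate a base-letter form: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Tokens are converted before they are indexed ...
indexed_tokens = {base_letter(word) for word in ["été", "niño", "Müller"]}

# ... and the query term is converted the same way before it is looked up.
query_term = base_letter("été")
print(query_term, query_term in indexed_tokens)
```

Because both the indexed tokens and the query term pass through the same conversion, an accented query term and its unaccented form select the same index entry.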
The following example of a one-step query returns all articles that contain the word wine in the TEXTTAB.TEXTCOL column. The query expression consists only of the query term wine, surrounded by single quotes.
SELECT articles FROM texttab WHERE CONTAINS(textcol, 'wine') > 0;
The following example of a one-step query returns all articles that contain the phrase wine and roses in the TEXTTAB.TEXTCOL column. The query expression consists of the query phrase wine and roses, enclosed in braces so that the word and is treated as part of the phrase rather than as the AND operator, and surrounded by single quotes.
SELECT articles FROM texttab WHERE CONTAINS(textcol, '{wine and roses}') > 0;
Logical operators combine the terms in a query expression. All single words and phrases may be combined with logical operators. When query terms are combined, the number of spaces around the logical operator is not significant.
Logical operators link query terms together to produce scores that are based on the relationship of the terms to each other. The logical operators combine the scores of their operands up to a maximum value of 100. Operands can be any query terms, as well as other operators.
Use the AND operator to search for documents that contain at least one occurrence of each of the query terms. For example, to obtain all the documents that contain the terms batman and robin and penguin, issue the following query:
'batman & robin & penguin'
In an AND query, the score returned is the score of the lowest query term. In the example above, if the three individual scores for the terms batman, robin, and penguin are 10, 20, and 30 within a document, the document scores 10.
Use the OR operator to search for documents that contain at least one occurrence of any of the query terms. For example, to obtain the documents that contain the term cats or the term dogs, use one of the following:
'cats | dogs'
'cats OR dogs'
In an OR query, the score returned is the score for the highest query term. In the example above, if the scores for cats and dogs are 30 and 40 within a document, the document scores 40.
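The AND and OR scoring rules described above can be sketched in Python as follows. This is only a model of the documented behavior, not ConText code; the function names are hypothetical, and the sketch assumes that a term that does not occur in a document scores 0.

```python
def and_score(*term_scores):
    """AND: the document receives the score of its lowest scoring term."""
    return min(term_scores)

def or_score(*term_scores):
    """OR: the document receives the score of its highest scoring term."""
    return max(term_scores)

# 'batman & robin & penguin' with term scores 10, 20 and 30
print(and_score(10, 20, 30))

# 'cats | dogs' with term scores 30 and 40
print(or_score(30, 40))
```

Under the stated assumption, an AND expression scores 0 as soon as one term is missing, while an OR expression scores 0 only when every term is missing.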
Use the NOT operator to search for documents that contain one query term and not another.
For example, to obtain the documents that contain the term animals but not dogs, use the following expression:
'animals ~ dogs'
Similarly, to obtain the documents that contain the term transportation but not automobiles or trains, use the following expression:
'transportation not (automobiles or trains)'
Use the equivalence operator to specify an acceptable substitution for a word in a search. For example, if you want all the documents that contain the phrase alsatians are big dogs or labradors are big dogs, you can write:
'labradors=alsatians are big dogs'
ConText processes the above query faster and more efficiently than the same query written with the accumulate operator. For example, you could write the above query less efficiently and less concisely as follows:
'labradors are big dogs, alsatians are big dogs'
The savings you gain in using the equivalence operator over the accumulate operator is most significant when you have more than one equivalence operator in the query expression. For example, the following query
'labradors=alsatians are big canines=dogs'
is a more efficient, more concise form of:
'labradors are big dogs, alsatians are big dogs, alsatians are big canines, labradors are big canines'
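To make the relationship between the two forms concrete, the following Python sketch rewrites a phrase that uses the equivalence operator as the corresponding list of accumulated phrases. It is an illustration only: the function name is hypothetical, and the sketch assumes single-word alternatives written without spaces around the = sign.

```python
from itertools import product

def expand_equivalence(expression: str) -> str:
    """Rewrite 'a=b c d' as the accumulate of every phrase it stands for."""
    # Each word position offers one or more interchangeable alternatives.
    slots = [word.split("=") for word in expression.split()]
    phrases = [" ".join(choice) for choice in product(*slots)]
    return ", ".join(phrases)

print(expand_equivalence("labradors=alsatians are big dogs"))
print(expand_equivalence("labradors=alsatians are big canines=dogs"))
```

Each additional equivalence operator multiplies the number of phrases in the accumulate form, which is why the concise form becomes more attractive as more substitutions are added.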
The equivalence operator has higher precedence than all other operators except the unary operators (fuzzy, soundex, stem, and PL/SQL function calls).
Use the WITHIN operator to restrict a query to pre-defined document sections.
For example, in an HTML document set, you or your ConText administrator can define a section for all headings delimited with <HEAD> and </HEAD> and subsequently issue a query for a term in a heading across all documents.
See Also:
For more information about defining sections, see the Oracle8 ConText Cartridge Administrator's Guide.
The syntax for the WITHIN operator is as follows:
Syntax | Description |
---|---|
term WITHIN section | Searches for term within the pre-defined section. The WITHIN operator has no effect on score. |
Note: The WITHIN operator requires you to know the name of the section you wish to search. A list of defined sections can be obtained using the CTX_ALL_SECTIONS or CTX_USER_SECTIONS views.
To find all the documents that contain the term San Francisco within the pre-defined section Headings, write your query as follows:
'San Francisco within Headings'
To find all the documents that contain the term sailing and contain the term San Francisco within the pre-defined section Headings, write your query as follows:
'(San Francisco within Headings) and sailing'
To find all documents that contain the terms dog and cat within the pre-defined section Headings, write your query as follows:
'dog and cat within Headings'
Note that the above query is logically different from:
'dog within Headings and cat within Headings'
which finds all documents that contain dog and cat where the terms dog and cat are in different Headings sections.
To find all documents in which dog is near cat within the section Headings, write your query as follows:
'dog near cat within Headings'
The WITHIN operator has the following limitations:
Score changing operators behave like logical operators in that they return documents given the terms you specify. However, these operators affect document scores differently and, as such, can be used to change a document's rank in a hitlist with respect to a query term. The following table describes these operators:
Use the accumulate operator to search for documents that contain at least one occurrence of any of the query terms, where the documents that contain the most frequent occurrences of the query terms are given the highest score.
For example, to search for documents that contain either term Brazil or soccer and to have the highest scores attached to the documents that contain the most occurrences of these words, you can issue:
'soccer,Brazil'
Accumulate is similar to OR, in the sense that a document satisfies the query expression if any of the terms occur in the document; however, the scoring is different. OR returns a score based only on the query term that occurs most frequently in a document. Accumulate combines the scores for all the query terms that occur in a document, topping out at 100 when the sum exceeds 100. Thus documents that contain the most query terms are ranked the highest.
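The difference between OR and accumulate scoring can be sketched in Python as follows. The function names and the term scores are hypothetical and serve only to illustrate the rules described above.

```python
def or_score(*term_scores):
    """OR: only the highest scoring term counts."""
    return max(term_scores)

def accumulate_score(*term_scores, cap=100):
    """Accumulate: term scores are added, topping out at 100."""
    return min(cap, sum(term_scores))

# Hypothetical term scores for 'soccer' and 'Brazil' in two documents.
doc_1 = (40, 0)    # mentions soccer only
doc_2 = (30, 25)   # mentions both terms

print(or_score(*doc_1), or_score(*doc_2))
print(accumulate_score(*doc_1), accumulate_score(*doc_2))
```

With OR, the first document ranks higher because of its single strong term; with accumulate, the second document ranks higher because both terms contribute to its score.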
Use the MINUS operator to search for documents that contain a query term when you want the presence of a second query term to cause the document to be ranked lower.
The minus operator is useful for lowering the score of documents that contain "noise". For example, suppose a query on the term cars always returned high scoring documents about Ford cars. You can lower the scoring of the Ford documents by using the expression:
'cars - Ford'
In essence, this expression returns the documents that contain the term cars. However, the score returned for a document is the number of occurrences of cars minus the number of occurrences of Ford. When a returned document does not contain Ford, the occurrence of the term Ford is counted as zero.
Words or phrases that occur close together are considered to be more closely associated than those that are farther apart. The near operator calculates a score based on how close words are to each other rather than on how often the word or phrase appears in the document.
The score for a document is the highest score out of all the query terms that occur in proximity to each other. A score of 100 is returned when the query terms are adjacent. When the terms are not adjacent, ConText returns a score based on the following formula:
score = 100 - (number of words between the two query terms)
When there are more than 100 words separating the terms, ConText scores the document as 1.
For example, if the query expression is ice;cream, the phrase I love ice cream would score 100, while the phrase ice is colder than cream would score 97. If both phrases occurred in a document, ConText retrieves the document and scores it as 100.
You can use the near operator with more than two search terms. For example, you can issue a query such as:
fish;whales;sea
This query asks for all documents that have the three terms fish, whales, and sea close to one another. The score is calculated by the following formula:
score = 100 - size of the smallest block containing all query terms + number of query terms
Thus if a document contained the phrase: Fish, lobsters and whales live in the sea, and this phrase was the smallest block containing all three terms, the document scores (100 - 8 + 3) = 95.
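The near scoring formula can be checked with a short Python sketch that finds the smallest block of words containing all the query terms. This is a model of the formula given above, not the ConText implementation: the function name is hypothetical, only single-word terms are handled, words are split on non-word characters, and any result below 1 is raised to 1 on the assumption that widely separated terms always score 1.

```python
import re

def near_score(text, *terms):
    """Return 100 - (size of smallest block with all terms) + (number of terms)."""
    words = re.findall(r"\w+", text.lower())
    wanted = {term.lower() for term in terms}
    last_seen = {}   # most recent position of each query term
    smallest = None  # size of the smallest block found so far
    for position, word in enumerate(words):
        if word in wanted:
            last_seen[word] = position
            if len(last_seen) == len(wanted):
                block = position - min(last_seen.values()) + 1
                smallest = block if smallest is None else min(smallest, block)
    if smallest is None:
        return None  # at least one term is missing, so the document is not a hit
    return max(1, 100 - smallest + len(wanted))

print(near_score("I love ice cream", "ice", "cream"))
print(near_score("ice is colder than cream", "ice", "cream"))
print(near_score("Fish, lobsters and whales live in the sea", "fish", "whales", "sea"))
```

For two terms the general formula reduces to the two-term formula, because the smallest block is the number of words between the terms plus the two terms themselves.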
You can use the threshold operator with the near operator to restrict your result set. For example, to request all documents in which the terms fish and whales are at most three words apart, you can write:
fish;whales > 97
The weight operator multiplies the score by the given factor, topping out at 100 when the product exceeds 100. For example, the query 'cat, dog*2' sums the score of cat with twice the score of dog, topping out at 100 when the score is greater than 100.
In expressions that contain more than one query term, use the weight operator to adjust the relative scoring of the query terms. You can reduce the score of a query term by using the weight operator with a number less than 1; you can increase the score of a query term by using the weight operator with a number greater than 1 and less than 10.
The weight operator is useful in accumulate, OR, or AND queries when the expression has more than one query term. With no weighting on individual terms, the score cannot tell you which of the query terms occurs the most. If you are interested in documents that contain a particular query term more than another term, the overall ranking tells you nothing about which documents pertain to the term that you are most interested in.
You have a collection of sports articles. You are interested in the articles about soccer, in particular Brazilian soccer. It turns out that a regular query on soccer, Brazil returns many high ranking articles on US soccer. To raise the ranking of the articles on Brazilian soccer, you can issue the following query:
'soccer, Brazil*3'
Table 3-1 illustrates how the weight operator can change the ranking of three hypothetical documents A, B, and C, which all contain information about soccer. The columns in the table show the total score of four different query expressions on the three documents.
Document | soccer | Brazil | soccer,Brazil | soccer,Brazil*3 |
---|---|---|---|---|
A | 20 | 10 | 30 | 50 |
B | 10 | 30 | 40 | 100 |
C | 50 | 10 | 60 | 80 |
The score in the third column containing the query soccer, Brazil is the sum of the scores in the first two columns. The score in the fourth column containing the query soccer,Brazil*3 is the sum of the score of the first column soccer plus three times the score of the second, Brazil.
With the initial query of soccer,Brazil, the documents are ranked in the order C B A. With the query of soccer,Brazil*3, the documents are ranked B C A, which is the preferred ranking.
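The effect of the weight operator on ranking can be reproduced with a short Python sketch that uses the soccer and Brazil term scores from Table 3-1. The function names are hypothetical; the sketch only applies the accumulate and weight rules described in this section.

```python
def weighted_accumulate(term_scores, weights, cap=100):
    """Sum each term score times its weight (default 1), topping out at 100."""
    total = sum(score * weights.get(term, 1) for term, score in term_scores.items())
    return min(cap, total)

# Term scores for documents A, B and C, taken from Table 3-1.
documents = {
    "A": {"soccer": 20, "Brazil": 10},
    "B": {"soccer": 10, "Brazil": 30},
    "C": {"soccer": 50, "Brazil": 10},
}

def ranking(weights):
    """Document names ordered from highest to lowest score."""
    return sorted(documents,
                  key=lambda name: weighted_accumulate(documents[name], weights),
                  reverse=True)

print(ranking({}))             # 'soccer,Brazil'
print(ranking({"Brazil": 3}))  # 'soccer,Brazil*3'
```

The unweighted query ranks the documents C, B, A, while weighting Brazil by 3 moves document B to the top.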
Use the result-set operators to control what documents are returned from a query result set. The operands for these operators are expressions, which can be an individual query term or a logical combination of query terms that use other operators.
Result set operators are typically used to exclude noise from the hitlist (irrelevant documents) and to retrieve documents out of a hitlist more efficiently. There are three result set operators: threshold (>), max (:), and first/next (#).
You can use the threshold operator in two ways: at the expression level and at the query term level.
Use the expression level threshold operator to eliminate documents in the result set that score below a threshold number. For example, to search for documents that contain relational databases and to return only documents that score greater than 75, use the following expression:
'relational databases > 75'
Use the query term threshold operator in a query expression to select a document based on how a term scores in the document. For example, to select documents that have at least a score of 30 for lion and contain tiger, use:
'(lion > 30) and tiger'
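The two uses of the threshold operator can be contrasted with a small Python sketch. The function names are hypothetical, a score of 0 stands for a document that is not returned, and the sketch assumes that the threshold admits only scores strictly greater than the given number.

```python
def threshold(score, limit):
    """Keep the score only if it is above the limit; otherwise drop the hit."""
    return score if score > limit else 0

def and_score(*term_scores):
    """AND: the lowest operand score."""
    return min(term_scores)

def expression_level(databases_score):
    """'relational databases > 75': the threshold applies to the whole expression."""
    return threshold(databases_score, 75)

def term_level(lion_score, tiger_score):
    """'(lion > 30) and tiger': the threshold applies to lion before the AND."""
    return and_score(threshold(lion_score, 30), tiger_score)

print(expression_level(80), expression_level(60))
print(term_level(40, 20), term_level(25, 90))
```

In the term-level form, a document with a high score for tiger is still dropped when its score for lion does not pass the threshold.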
Use the max operator to retrieve a given number of the highest scoring documents. For example, to obtain the twenty highest scoring documents that contain the word dance, you can write:
'dance:20'
The max operator is particularly useful to prevent writing a large number of records to the hitlist table, which could result in performance degradation.
Note: The max operator cannot be used with the CTX_QUERY.COUNT_HITS function or with in-memory queries.
Use the first/next operator to return a specified range of documents from the hitlist.
For example, to return the first 10 documents encountered by ConText that contain the term dog, use the following expression:
'dog#1-10'
You could then return the next 10 documents using the following expression:
'dog#11-20'
The first/next operator can be used to create an application interface in which query results (rows in the hitlist) are returned incrementally. Because the query results are returned incrementally, query response is generally faster. The application can display the hitlists in a more manageable size, and control can be returned to the user faster.
Note: The first/next operator cannot be used with the CTX_QUERY.COUNT_HITS function or with in-memory queries.
You can use the first/next operator to extract chunks of a sorted hitlist returned by the max operator. For example, if you use the max operator to return only the highest scoring 50 documents that contain the term cat, you can extract the first 10 documents from the 50 as follows:
'cat:50#1-10'
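The way max and first/next combine can be modeled in Python with sorting and slicing. The hitlist below is made-up data, and the function names are hypothetical; the sketch illustrates only the selection logic, not how ConText stores or retrieves hits.

```python
def max_hits(hitlist, count):
    """'term:count': the highest scoring documents, best first."""
    return sorted(hitlist, key=lambda hit: hit[1], reverse=True)[:count]

def first_next(hitlist, first, last):
    """'term#first-last': rows first through last (1-based, inclusive)."""
    return hitlist[first - 1:last]

# Hypothetical hitlist of (document id, score) pairs for the term cat.
hitlist = [(doc_id, 1 + (doc_id * 37) % 100) for doc_id in range(1, 121)]

top_50 = max_hits(hitlist, 50)        # 'cat:50'
page_1 = first_next(top_50, 1, 10)    # 'cat:50#1-10'
page_2 = first_next(top_50, 11, 20)   # 'cat:50#11-20'
print(page_1)
```

An application can request successive ranges in this way to present the hitlist one page at a time.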
The expansion operators expand a query expression to include variants of the query term supplied by the user. There are three kinds of expansion operators: stem ($), soundex (!), and fuzzy (?).
The expansion operators are unary operators. They may be used in combination with each other and with any other operators described in this chapter. In addition, searches can be broadened by performing an expansion on an expansion.
The methods used by the expansion operators to perform stemming, fuzzy matching, and soundex matching for a text column are determined by the Wordlist preference in the policy for the column.
See Also:
For more information about setting up preferences and policies, see Oracle8 ConText Cartridge Administrator's Guide.
Use the STEM ($) operator to search for terms that have the same linguistic root as the query term. For example:
Input | Expands To |
---|---|
$scream | scream screaming screamed |
$distinguish | distinguish distinguished distinguishes |
$guitars | guitars guitar |
$commit | commit committed |
$cat | cat cats |
$sing | sang sung sing |
The ConText stemmer, licensed from Xerox Corporation's XSoft Division, supports the following languages: English, French, Spanish, Italian, German, and Dutch.
Note: If STEM returns a stopword, the stopword is not included in the query or highlighted by CTX_QUERY.HIGHLIGHT.
The soundex (!) operator enables searches on words that have similar sounds; that is, words that sound like other words. This function allows comparison of words that are spelled differently, but sound alike in English.
Soundex in ConText uses the same logic as the soundex function in SQL to search for words that have a similar sound. It returns all words in a text column that have the same soundex value.
The following example illustrates the results that could be returned for a one-step query that uses SOUNDEX:
SELECT ID, COMMENT FROM EMP_RESUME WHERE CONTAINS (COMMENT, '!SMYTHE') > 0
ID COMMENT
-- ------------
23 Smith is a hard worker who...
Note: SOUNDEX works best for languages that use a 7-bit character set, such as English. It can be used, with lesser effectiveness, for languages that use an 8-bit character set, such as many Western European languages.
For more information about the SOUNDEX function in SQL, see Oracle8 Server SQL Reference.
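To see why a query on !SMYTHE can return a document that contains Smith, it helps to compute the soundex value of both words. The following Python sketch implements the standard Soundex rules (keep the first letter, encode the remaining consonants as digits, collapse repeated codes, pad to four characters). It assumes that these are the rules applied by the SQL function; the function name is hypothetical.

```python
# Digit codes for consonants; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(word: str) -> str:
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code
    return (result + "000")[:4]

print(soundex("SMYTHE"), soundex("Smith"))
```

Both words produce the same four-character value, so a soundex search on either one matches the other.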
Fuzzy (?) expansions generate words that are spelled similarly. This type of expansion is helpful for finding more accurate results when there are frequent misspellings in the documents in the database.
Unlike the stem expansion, the number of words generated by a fuzzy search depends on what is in the text index; results can vary significantly according to the contents of the database index.
For example:
Input | Expands To |
---|---|
?cat | cat cats calc case |
?feline | feline defined filtering |
?apply | apply apple applied April |
?read | lead real |
Note: Fuzzy works best for languages that use a 7-bit character set, such as English. It can be used, with lesser effectiveness, for languages that use an 8-bit character set, such as many Western European languages. Also, the Japanese lexer provides limited fuzzy matching. In addition, if fuzzy returns a stopword, the stopword is not included in the query or highlighted by CTX_QUERY.HIGHLIGHT.
Penetration allows complex query expansions to be expressed in short concise notation. Penetration is a system of notation for query expressions and does not affect the meaning of the expansion operators or the order in which operations are performed; it is a tool to help you generate non-ambiguous queries using the expansion operators.
Penetration applies the expansion operators to each term within an explicit expression (that is, an expression delimited by parentheses or braces). Any expansion operator outside an expression delimited by parentheses ( ) or braces { } is applied to each word or phrase inside the expression.
For example:
Query Before Penetration | Query After Penetration |
---|---|
?(dog, cat, mouse) | ?dog, ?cat, ?mouse |
?(dog,!(cat & mouse)) | ?dog, (!?cat & !?mouse) |
?((cat=feline) meows) | (?cat=?feline) ?meows |
In the first example, a fuzzy expansion is performed on each term.
In the second example, a fuzzy expansion is performed on each term and a soundex expansion is performed only on the terms cat and mouse because cat and mouse are enclosed in a separate set of parentheses.
In the third example, a fuzzy expansion is performed on each term, including both equivalence terms.
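The rewriting that penetration performs on the three examples can be sketched in Python as a single pass over the tokens of the expression. This is an illustration of the notation only, not the ConText parser: the function names are hypothetical, only parentheses (not braces) are handled, and the expression is assumed to be balanced.

```python
import re

UNARY = {"?", "!", "$"}  # fuzzy, soundex, stem

def strip_outer(tokens):
    """Drop one pair of parentheses if it encloses the whole expression."""
    if len(tokens) >= 2 and tokens[0] == "(":
        depth = 0
        for index, token in enumerate(tokens):
            depth += (token == "(") - (token == ")")
            if depth == 0:
                return tokens[1:-1] if index == len(tokens) - 1 else tokens
    return tokens

def penetrate(expression: str) -> str:
    tokens = re.findall(r"[?!$()]|\w+|[^\s\w?!$()]+", expression)
    output = []
    inherited = [""]  # operators inherited from the enclosing groups
    pending = ""      # operators written directly before a term or group
    for token in tokens:
        if token in UNARY:
            pending += token
        elif token == "(":
            inherited.append(pending + inherited[-1])
            pending = ""
            output.append(token)
        elif token == ")":
            inherited.pop()
            output.append(token)
        elif re.fullmatch(r"\w+", token):
            output.append(pending + inherited[-1] + token)
            pending = ""
        else:
            output.append(token)  # binary operators such as , & =
    return " ".join(strip_outer(output))

print(penetrate("?(dog, cat, mouse)"))
print(penetrate("?(dog,!(cat & mouse))"))
print(penetrate("?((cat=feline) meows)"))
```

Apart from spacing, the three results match the Query After Penetration column of the table above.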
You can use query expression feedback to examine how ConText expands query expressions containing fuzzy, stem and soundex operators.
If you have base-letter conversion specified for a text column and the query expression contains a SOUNDEX or FUZZY operator, ConText operates on the base-letter form of the query.
The STEM operator does not support base-letter conversion.
The thesaurus operators expand a query for a single term (word or phrase) using a thesaurus that defines relationships between the user-specified term and other semantically related terms.
There are ten kinds of thesaurus operators, corresponding to the ten types of relationships that can be defined in an ISO2788 standard thesaurus.
Internally, ConText processes the expansion by bracketing each individual term returned by the expansion and then accumulating the terms together using the ACCUMULATE operator.
For example, if bird, birdy, and avian are all synonyms:
SYN(bird) is expanded to {bird},{avian},{birdy}.
If a term in a thesaural query does not have corresponding entries in the specified thesaurus, no expansion is produced and the term itself is used in the query.
See Also:
For more information about viewing thesaural expansions, see Chapter 5, "Query Expression Feedback". For more information about thesaural relationships and creating thesauri, see Oracle8 ConText Cartridge Administrator's Guide.
The thesaurus operators can be used in conjunction with all the other query expression operators and special characters supported by ConText, with the exception of the near operator.
The maximum length of the expanded query is 32000 characters.
Thesaural operations cannot be nested. For example, the following query is not allowed.
'SYN(BT(bird))'
The thesaurus operators are implemented in ConText as PL/SQL functions, and, as such, have arguments that must be specified with the operator. All of the notational conventions and usage rules for PL/SQL apply to the thesaurus operators.
The thesaurus operators have the following arguments:
term: Specify the operand for the thesaurus operator. You must specify a term when using the NT operator. For preferred term (PT) and top term (TT) queries, term is replaced by the preferred term/top term defined for the term in the specified thesaurus; however, if no PT or TT entries are defined for the term, the term is not replaced and is used in the query.
For all other thesaural queries, term is expanded to include the synonymous, related, broader, or narrower terms defined for the term in the specified thesaurus.
level: Specify the number of levels traversed in the thesaurus hierarchy to return the broader (BT, BTG, BTP) or narrower (NT, NTG, NTP) term for the specified term. For example, a level of 1 in a BT query returns only the broader term, if one exists, for the specified term. A level of 2 returns the broader term for the specified term, as well as the broader term, if one exists, for the broader term.
The level argument is optional and has a default value of one (1). Zero or negative values for the level argument return only the original query term.
thes: Specify the name of the thesaurus used to return the expansions for the specified term. The thes argument is optional and has a default value of DEFAULT. As a result, a thesaurus named DEFAULT must exist in the thesaurus tables before using any of the thesaurus operators.
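The effect of the level argument on a broader term query can be sketched in Python as a walk up a hierarchy. The thesaurus contents and the function name below are hypothetical, and the thesaurus is modeled as a simple dictionary rather than as the ConText thesaurus tables.

```python
# Hypothetical thesaurus: each term maps to its broader terms.
BROADER = {
    "tutorial": ["training"],
    "training": ["education"],
}

def bt_expand(term, level=1, broader=BROADER):
    """Return the accumulate expression for a BT query on term."""
    expansion = [term]
    frontier = [term]
    for _ in range(max(level, 0)):  # zero or negative levels add nothing
        frontier = [b for t in frontier for b in broader.get(t, [])]
        if not frontier:
            break  # top of the hierarchy reached
        expansion.extend(b for b in frontier if b not in expansion)
    return ",".join("{%s}" % t for t in expansion)

print(bt_expand("tutorial"))      # level defaults to 1
print(bt_expand("tutorial", 2))
print(bt_expand("tutorial", 0))
```

A level of 1 adds only the immediate broader term, a level of 2 also adds the broader term of that term, and a level of 0 leaves the original term on its own.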
Use the synonym operator (SYN) to expand a query to include all the terms that have been defined in a thesaurus as synonyms for a specified term.
The following query returns all documents that contain the term tutorial or any of the synonyms defined for tutorial in the DEFAULT thesaurus:
'SYN(tutorial)'
Compound phrases in the expansion of a term in a synonym query are returned as AND conjunctions.
For example, the compound phrase temperature + measurement + instruments is defined in a thesaurus as a synonym for the term thermometer. In a synonym query for thermometer, the query is expanded to:
{thermometer},({temperature}&{measurement}&{instruments})
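The shape of a synonym expansion, including the handling of compound phrases, can be sketched in Python as follows. The function name and the dictionary representation of the thesaurus are hypothetical; each synonym is stored as a list of words so that a compound phrase is simply a list with more than one entry.

```python
# Hypothetical thesaurus: each term maps to its synonyms, one word list per synonym.
SYNONYMS = {
    "bird": [["avian"], ["birdy"]],
    "thermometer": [["temperature", "measurement", "instruments"]],
}

def syn_expand(term, synonyms=SYNONYMS):
    """Return the accumulate expression for a SYN query on term."""
    parts = ["{%s}" % term]
    for words in synonyms.get(term, []):
        braced = "&".join("{%s}" % word for word in words)
        # A compound phrase becomes a parenthesized AND of its words.
        parts.append(braced if len(words) == 1 else "(%s)" % braced)
    return ",".join(parts)

print(syn_expand("bird"))
print(syn_expand("thermometer"))
```

A term that has no entry in the dictionary is returned on its own, which mirrors the rule that a term without thesaurus entries is used unchanged in the query.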
Use the preferred term operator (PT) to replace a term in a query with the preferred term that has been defined in a thesaurus for the term.
For example, the term building has a preferred term of construction in a thesaurus. A PT query for building returns all documents that contain the word construction. Documents that contain the word building are not returned.
Use the related term operator (RT) to expand a query to include all the terms that have been defined in a thesaurus as related terms for a specified term.
For example, the term dinosaur has a related term of paleontology. An RT query for dinosaur returns all documents that contain the word paleontology. Documents that contain the word dinosaur are not returned.
Use the narrower term operators (NT, NTG, NTP, NTI) to expand a query to include all the terms that have been defined in a thesaurus as the narrower or lower level terms for a specified term. They can also expand the query to include all of the narrower terms for each narrower term, and so on down through the thesaurus hierarchy.
The following query returns all documents that contain either the term tutorial or any of the NT terms defined for tutorial in the DEFAULT thesaurus:
'NT(tutorial)'
The following query returns all documents that contain either fairy tale or any of the narrower instance terms for fairy tale as defined in the DEFAULT thesaurus:
'NTI(fairy tale)'
That is, if the terms cinderella and snow white are defined as narrower term instances for fairy tale, ConText returns documents that contain fairy tale, cinderella, or snow white.
Use the broader term operators (BT, BTG, BTP, BTI) to expand a query to include the term that has been defined in a thesaurus as the broader or higher level term for a specified term. They can also expand the query to include the broader term for the broader term and the broader term for that broader term, and so on up through the thesaurus hierarchy.
The following query returns all documents that contain the term tutorial or the BT term defined for tutorial in the DEFAULT thesaurus:
'BT(tutorial)'
If a homograph (a word or phrase with multiple meanings, but the same spelling) appears in two or more nodes in the same hierarchy branch of a thesaurus, a qualifier is required for each occurrence of the term in the branch.
If the qualifier is not specified for a homograph in a broader or narrower term query, the query expands to include all of the broader/narrower terms for the homograph.
For example, if machine is a broader term for crane (building equipment) and bird is a broader term for crane (waterfowl):
BT(crane) expands to {crane},{machine},{bird}
If the qualifier for a homograph is specified in a broader or narrower term query, only the broader/narrower terms for the qualified homograph are returned.
Using the previous example:
BT(crane{(waterfowl)}) expands to {crane},{bird}
Use the TOP TERM operator (TT) to replace a term in a query with the top term that has been defined for the term in the standard hierarchy (BT, NT) in a thesaurus. Top terms in the generic (BTG, NTG) and partitive (BTP, NTP) hierarchies are not returned.
For example, the term tutorial has a top term of learning systems in the standard hierarchy of a thesaurus. A TT query for tutorial returns all documents that contain the phrase learning systems. Documents that contain the word tutorial are not returned.
Thesaural expansions in text queries can differentiate between terms based on case.
For example, a case-sensitive thesaurus named thes1 is created and Mercury is defined as a narrower term for planets, while mercury is defined as a narrower term for metals.
During a query, the following expansions occur:
BT(mercury,1,thes1) expands to {MERCURY}, {METALS}
BT(Mercury,1,thes1) expands to {MERCURY}, {PLANETS}
Because text queries are case-insensitive, case-sensitive thesauri only affect the expansion of a term and not the terms actually used in the query.
For example, BT(Mercury,1,thes1) expands to the terms Mercury and planets, as shown in the expansions above.
However, the query returns all documents in which the two terms occur, regardless of case. In other words, documents that contain mercury, Mercury, planets, Planets, or any other combinations of case for the two terms are all returned by the query.
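The two stages (case-sensitive expansion, then case-insensitive matching) can be sketched in Python as follows. This is an illustrative model only: `THES1`, `bt_expand`, and `matches` are hypothetical names, and documents are reduced to whitespace-separated words.

```python
# Hypothetical case-sensitive thesaurus thes1: narrower term -> broader term.
THES1 = {"Mercury": "planets", "mercury": "metals"}

def bt_expand(term):
    """Case-sensitive lookup; the expansion is reported in uppercase, as in the text."""
    broader = THES1.get(term)
    terms = [term] if broader is None else [term, broader]
    return [t.upper() for t in terms]

def matches(document, expansion):
    """Case-insensitive match: any expanded term appearing in the document is a hit."""
    words = {w.upper() for w in document.split()}
    return any(t in words for t in expansion)

expansion = bt_expand("Mercury")                   # case decides which branch is used
print(expansion)
print(matches("liquid mercury spill", expansion))  # case is ignored when matching
```

The case of the query term selects the thesaurus branch, but the lowercase word mercury in a document still matches the expansion of Mercury.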
When ConText processes a query on a base-letter index and the expression contains a thesaurus operator, ConText looks up the query term in the thesaurus without converting the query to base-letter. The expansions obtained from the thesaurus are converted to base-letter and looked up subsequently within the index according to query rules.
This lookup sequence lets base-letter queries work regardless of whether the thesaurus is in base-letter form. However, if the keys in the thesaurus are in base-letter form, they do not match query terms entered in non-base-letter form. When you have a base-letter thesaurus, you must specify the base-letter form of the term in the query.
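The lookup order can be illustrated with a short Python sketch. It assumes that base-letter conversion amounts to removing diacritics, and it uses hypothetical names (`to_base_letter`, `expand_for_base_letter_index`) and a made-up one-entry thesaurus; it is a model of the sequence described above, not ConText's implementation.

```python
import unicodedata

def to_base_letter(text):
    """Strip diacritics: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def expand_for_base_letter_index(term, thesaurus):
    # Step 1: look up the term exactly as typed (no base-letter conversion).
    expansions = [term] + thesaurus.get(term, [])
    # Step 2: convert the expansions to base-letter before the index lookup.
    return [to_base_letter(t) for t in expansions]

# Hypothetical entries, keyed in accented form and in base-letter form.
# "caf\u00e9" is the word cafe with an acute accent on the final e.
accented_thesaurus = {"caf\u00e9": ["bistrot"]}
base_letter_thesaurus = {"cafe": ["bistrot"]}

print(expand_for_base_letter_index("caf\u00e9", accented_thesaurus))
print(expand_for_base_letter_index("caf\u00e9", base_letter_thesaurus))
```

With the base-letter thesaurus, the accented query term finds no thesaurus entry, so only the term itself reaches the index; entering cafe instead produces the full expansion.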
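Wildcard characters can be used in query expressions to expand word searches into pattern searches. The wildcard characters are the percent sign (%), which matches any string of characters, and the underscore (_), which matches a single character.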
For example, the following abbreviated one-step query finds all terms beginning with the pattern scal in a column named text:
...contains(TEXT, 'scal%') > 0
Note: To expand the wildcard query, ConText uses the word list for the text column and rewrites the query with these terms. When your wildcard query expands to a number of terms greater than the maximum allowed in a query, ConText returns an error.
In addition, if a wildcard expression translates to a stopword, the stopword is not included in the query or highlighted by CTX_QUERY.HIGHLIGHT.
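The expansion step described in the note can be sketched in Python. The sketch assumes that % stands for any string of characters and _ for exactly one character; `expand_wildcard`, `WORD_LIST`, and the `max_terms` parameter are hypothetical, and the real limit and error raised by ConText are not modeled.

```python
import re

def expand_wildcard(pattern, word_list, max_terms):
    """Rewrite a wildcard pattern as the list of matching indexed words."""
    # % matches any run of characters, _ matches exactly one character.
    regex = "".join(".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
                    for ch in pattern)
    terms = [w for w in word_list if re.fullmatch(regex, w)]
    if len(terms) > max_terms:
        raise ValueError("wildcard query expands to too many terms")
    return terms

# Hypothetical word list for the text column.
WORD_LIST = ["scalar", "scale", "scalp", "school", "seal"]

print(expand_wildcard("scal%", WORD_LIST, max_terms=10))
```

A broad pattern such as a single leading letter followed by % expands to many terms, which is why such queries are slow and can exceed the limit.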
The grouping characters control operator precedence by grouping query terms and operators in a query expression. The grouping characters are parentheses ( ) and brackets [ ].
The beginning of a group of terms and operators is indicated by an open character from one of the sets of grouping characters. The ending of a group is indicated by the occurrence of the appropriate close character for the open character that started the group. Between the two characters, other groups may occur.
For example, the open parenthesis indicates the beginning of a group. The first close parenthesis encountered is the end of the group. Any open parentheses encountered before the close parenthesis indicate nested groups.
Brackets perform the same function as the parentheses, but prevent penetration for the expansion operators.
You can store the results of a query expression as a stored query expression (SQE) and then call the SQE later in a query expression to return the stored results. To call a stored query expression, use the SQE operator.
Operator | Syntax | Description |
---|---|---|
SQE | SQE(SQE_name) | Returns the stored result of SQE_name. |
The advantage of calling an SQE in a query expression, rather than specifying query terms, is that the results are typically returned faster, since ConText does not have to query the text table directly.
In addition, SQEs can be used to perform iterative queries, in which an initial query is refined using one or more additional queries.
The process for using stored query expressions is:
Administration of stored query expressions can be performed using the REFRESH_SQE, REMOVE_SQE, and PURGE_SQE procedures in the CTX_QUERY PL/SQL package.
To create a session SQE named PROG_LANG, use CTX_QUERY.STORE_SQE as follows:
exec ctx_query.store_sqe('emp_resumes', 'prog_lang', 'cobol', 'session');
This SQE queries the text column for the EMP_RESUMES policy (in this case, EMP.RESUMES) and returns all documents that contain the term cobol. It stores the results in the SQE table for the policy.
PROG_LANG can then be called within a query expression as follows:
select score, docid from emp where contains(resume, 'sqe(prog_lang)')>0 order by score;
When you initially create an SQE using CTX_QUERY.STORE_SQE, you can specify whether the SQE is for the current session or for all sessions (system SQE).
You can use session SQEs only in the current session. These SQEs are stored only for the duration of the session. When a session is terminated, all session SQEs created during the session are deleted from the SQE tables. If you want to use a session SQE in another session, you must recreate the SQE.
System SQEs can be used in all sessions, including concurrent sessions. When a session is terminated, system SQEs created during the session are not deleted from the SQE tables and can be used in future sessions.
If the text column referenced by a stored query expression has been modified since the stored query expression was created, the stored query expression results may be out-of-date. Before returning the results of a stored query expression in a query expression, ConText verifies that the results are current. If they are not current, ConText automatically evaluates the differences and updates the results.
ConText also verifies that any stored query expressions nested within a stored query expression have up-to-date results.
Result lists in stored query expression tables may get fragmented by consecutive re-evaluations. You can resolve fragmentation by calling CTX_QUERY.REFRESH_SQE.
Iterative queries are queries built on other queries to refine or add to the result set of the original query. Once you define a stored query expression, you can add additional search criteria in two ways:
Sometimes you might want to add a condition to a stored query expression to re-define your search criteria. You can do so by extending the query with additional operators when you call CTX_QUERY.CONTAINS. When you extend stored queries in this way, the response time is usually faster than an equivalent query without the SQE operator.
For example, you find that wildcard queries take a long time to process. You therefore define a wildcard query as a stored query expression, Q1, to return all documents indexed under policy pol that have words beginning with the letter z:
ctx_query.store_sqe('pol', 'Q1', 'z%', 'session');
You then extend the query by adding an OR condition. You ask for all documents indexed under policy pol that contain words beginning with the letter z or contain the word cat:
ctx_query.contains('pol', 'SQE(Q1) | cat', 'ctx_temp');
Internally, ConText must still use the text index to find those documents that might have the word cat but not z%; however, the response time is generally much faster than the following equivalent query:
ctx_query.contains('pol', 'z% | cat', 'ctx_temp');
You can use stored query expressions to define other stored query expressions. This is useful when you want to refine the result set returned from a stored query expression.
For example, you define the stored query expression, Q1 as follows:
ctx_query.store_sqe('pol', 'Q1', 'lions | tigers', 'session');
You then want to reduce this hitlist by adding another condition, so you define Q2 as follows:
ctx_query.store_sqe('pol', 'Q2', 'SQE(Q1) and zoos', 'session');
You then execute Q2 as follows:
ctx_query.contains('pol', 'SQE(Q2)', 'ctx_temp');
This query searches for all documents that contain zoos together with either lions or tigers. It is generally faster than the following equivalent query, in which the parentheses are required because AND has higher precedence than OR:
ctx_query.contains('pol', '(lions | tigers) and zoos', 'ctx_temp');
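The relationship between Q1, Q2, and the equivalent single query can be checked with a small set-based model in Python. The `INDEX` contents and document IDs are invented for illustration; OR is modeled as set union and AND as set intersection, which ignores scoring.

```python
# Hypothetical index: term -> set of document ids containing it.
INDEX = {"lions": {1, 2}, "tigers": {2, 3}, "zoos": {2, 4}}

stored = {}  # models the stored SQE results

# Q1 = 'lions | tigers': OR is modeled as set union.
stored["Q1"] = INDEX["lions"] | INDEX["tigers"]
# Q2 = 'SQE(Q1) and zoos': reuse the stored Q1 result, AND is set intersection.
stored["Q2"] = stored["Q1"] & INDEX["zoos"]

# The same expression written without SQEs, with explicit grouping.
grouped = (INDEX["lions"] | INDEX["tigers"]) & INDEX["zoos"]
# Without grouping, AND binds before OR: lions | (tigers and zoos).
ungrouped = INDEX["lions"] | (INDEX["tigers"] & INDEX["zoos"])

print(stored["Q2"], grouped, ungrouped)
```

In this model, Q2 matches the grouped expression, while the ungrouped form also returns documents that contain lions without zoos.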
Each stored query expression is stored in two tables: a central or system table owned by CTXSYS and a text index table attached to the policy for which the stored query expression was created.
The table owned by CTXSYS is an internal table which stores the stored query expression definitions for all the stored query expressions that have been created for all existing policies. It cannot be accessed directly, but can be viewed through two views, CTX_SQES (users with CTXADMIN role) and CTX_USER_SQES (users with CTXAPP and CTXADMIN roles).
The table used to store the results of a stored query expression for a text column (the SQR table) is one of the tables created automatically when the column is indexed; however, the SQR table is populated only when a stored query expression is created, and it is updated only when a stored query expression is re-evaluated.
The tablespace, storage clause, and other parameters used to create the SQR table are specified by the Engine preference in the policy for the text column of the stored query expression.
Note:
Similar to the other ConText index tables, the SQR table is an internal table that is accessed only by ConText when a stored query expression is processed in a query. For more information about policies, preferences, text indexing, and the structure of the stored query expression tables and views, see Oracle8 ConText Cartridge Administrator's Guide.
You can use all query expression operators in stored query expressions, with the following exceptions:
Stored query expressions also support all of the special characters and other components that can be used in a query expression, including PL/SQL functions and other stored query expressions.
In a query expression, you can call a PL/SQL function that returns a value. The syntax for the PL/SQL operator is as follows:
Calling a PL/SQL function within a query is useful for converting words to alternate forms. For example, you can call a function that takes acronyms and returns the expanded string.
Suppose you, as user ctxuser, create a function named CONVERT that takes an acronym as input and returns the fully expanded version of the acronym. Then, to obtain all documents that contain either IBM or International Business Machines, you issue the following query:
'execute ctxuser.convert(IBM), IBM'
Likewise, you can call a PL/SQL function that translates words. For example, you can call a function french that converts an English word to its French equivalent. You can then search on the French word for cat by issuing the following query:
'@ctxuser.french(cat)'
Operator precedence is the order in which the components of a query expression are evaluated. ConText query operators can be divided into two sets of operators that have their own order of evaluation. These two groups are described below as Group 1 and Group 2.
In all cases, query expressions are evaluated in order from left to right according to the precedence of their operators. Operators with higher precedence are applied first. Operators of equal precedence are applied in order of their appearance in the expression from left to right.
Within query expressions, the Group 1 operators have the following order of evaluation from highest precedence to lowest:
1. EQUIV (=)
2. NEAR (;)
3. Weight (*), Threshold (>)
4. MINUS (-)
5. NOT (~)
6. WITHIN (no symbol)
7. AND (&)
8. OR (|)
9. ACCUM (,)
10. Max (:)
11. First/Next (#)
Within query expressions, the Group 2 operators have the following order of evaluation from highest precedence to lowest:
1. Wildcard (% _)
2. Stem ($)
3. Fuzzy (?)
4. Soundex (!)
Other operators not listed under Group 1 or Group 2 are procedural. These operators have no sense of precedence attached to them. They include the SQE, PL/SQL, and thesaurus operators.
In the first example, because AND has a higher precedence than OR, the query returns all documents that contain w1 and all documents that contain both w2 and w3.
In the second example, the query returns all documents that contain both w1 and w2 and all documents that contain w3.
In the third example, the fuzzy operator is first applied to w1, then the AND operator is applied to arguments w3 and w4, then the OR operator is applied to term w2 and the results of the AND operation, and finally, the score from the fuzzy operation on w1 is added to the score from the OR operation.
The fourth example shows that the equivalence operator has higher precedence than the AND operator.
The fifth example shows that the WITHIN operator has higher precedence than the AND operator, in line with the Group 1 order of evaluation.
Precedence is altered by grouping characters as follows:
Precedence of operators is maintained during evaluation of expressions inside of the parentheses.
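To see how these rules group an expression, the following Python sketch adds explicit parentheses to an expression that uses a few of the Group 1 symbols. It is a minimal precedence-climbing parser written for illustration, not the ConText parser: it handles only the =, &, |, and , symbols plus parentheses, and the function names are hypothetical.

```python
import re

# Binding strength for a few Group 1 symbols; a larger number binds tighter.
PRECEDENCE = {"=": 4, "&": 3, "|": 2, ",": 1}

def tokenize(expr):
    """Split into operator symbols, parentheses, and terms."""
    return re.findall(r"[=&|,()]|[^\s=&|,()]+", expr)

def parse(tokens, min_prec=1):
    """Precedence climbing over left-associative binary operators."""
    token = tokens.pop(0)
    if token == "(":
        left = parse(tokens, 1)
        tokens.pop(0)  # discard the closing parenthesis
    else:
        left = token
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = f"({left} {op} {right})"
    return left

def show_grouping(expr):
    """Return the expression with its evaluation order made explicit."""
    return parse(tokenize(expr))

for expr in ["w1 | w2 & w3", "w1 & w2 | w3", "(w1 | w2) & w3"]:
    print(expr, "->", show_grouping(expr))
```

Parentheses in the input override the default order, while the operators inside them keep their normal precedence.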
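To query on words or symbols that have special meaning in query expressions, such as and, &, or, |, accum, and execute, you must escape them. There are two ways to escape characters in a query expression: enclose the term in braces ({}), or precede the individual character with a backslash (\).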
In the following examples, an escape sequence is necessary because each expression contains a ConText operator or reserved symbol:
'AT\&T'
'{AT&T}'
'high\-voltage'
'{high-voltage}'
The following is a list of ConText reserved words and characters that must be escaped to be searched on:
The open brace { signals the beginning of the escape sequence, and the close brace } indicates the end. Everything between the opening brace and the closing brace is part of the query expression (including any open brace characters). To include the close brace character in a query expression, use }}.
To escape the backslash escape character, use \\.
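An application that builds query expressions from user input might apply these rules with helper functions like the ones below. This is a sketch in Python: the function names are hypothetical, and the `RESERVED` set is an assumption assembled from the operator symbols shown in this chapter plus the grouping and escape characters, so check it against the reserved list for your release.

```python
# Assumed set of reserved symbols, taken from the operator symbols shown in this
# chapter plus the grouping and escape characters; adjust it to your release.
RESERVED = set("=;*>-~&|,:#%_$?!()[]{}\\")

def escape_with_backslash(term):
    """Precede every reserved character with a backslash."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in term)

def escape_with_braces(term):
    """Wrap the whole term in braces, doubling any close brace inside it."""
    return "{" + term.replace("}", "}}") + "}"

print(escape_with_backslash("AT&T"))       # backslash form
print(escape_with_braces("high-voltage"))  # brace form
```

The brace form is usually simpler for whole terms, because only the close brace needs special handling inside it.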
Stopwords are words for which ConText does not create an index entry. They are usually common words that are unlikely to be searched on by themselves.
ConText is shipped with a default list of English stopwords containing common words such as this and that. However, you or your ConText administrator can define your own stopwords.
See Also:
For more information about defining stopwords, see Oracle8 ConText Cartridge Administrator's Guide.
You cannot query on a stopword by itself or a phrase of only stopwords; whenever you attempt to query on a stopword by itself or a stopword-only phrase, the result is always no hits.
For example, you cannot issue a query to retrieve all documents that contain this if this is defined as a stopword, nor can you issue a query on a phrase of stopwords such as the who, if the words the and who are defined as stopwords.
You can query on phrases that contain stopwords as well as non-stopwords, such as this boy talks to that girl, where this and that are the only stopwords. This is possible because ConText records the position of stopwords even though it does not create an index entry for them.
If you have case-sensitivity enabled for text queries and you issue a query on a phrase containing stopwords and non-stopwords, you must specify the correct case for the stopwords. For example, a query on this boy talks to that girl does not return documents that contain the phrase This boy talks to that girl, assuming this is a stopword.
See Also:
For more information about issuing case-sensitive text queries, see "Case-Sensitive Queries" in this chapter.
When you use a stopword or a stopword-only phrase as an operand of a query operator, ConText rewrites the expression to eliminate the stopword or stopword-only phrase and then executes the query.
The following table describes some common stopword transformations. The Stopword Expression column describes the query expression or component of a query expression you enter, while the right-hand column describes the way ConText rewrites the query.
In these examples, a value of no_token for the rewritten expression means no hits are returned for the query.
For example, assuming that the word this is a stopword and that the word dog is a non-stopword, the query dog and this is rewritten to dog, applying the first transformation in the list.
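The rewrite can be modeled in a few lines of Python. The sketch covers only queries whose operands are joined by and, which is the case described in the example; the `STOPWORDS` set and the function names are hypothetical, and the full set of transformations is not reproduced here.

```python
STOPWORDS = {"this", "that", "the", "who"}  # hypothetical stoplist

def is_stopword_only(operand):
    """True for a single stopword or a phrase made up only of stopwords."""
    return all(word in STOPWORDS for word in operand.split())

def rewrite_and_query(query):
    """Drop stopword-only operands from a query whose only operator is 'and'."""
    operands = [part.strip() for part in query.split(" and ")]
    kept = [op for op in operands if not is_stopword_only(op)]
    # Nothing searchable is left: model the no_token result (no hits).
    return " and ".join(kept) if kept else "no_token"

print(rewrite_and_query("dog and this"))  # the stopword operand is removed
print(rewrite_and_query("the who"))       # a stopword-only phrase yields no_token
```

A phrase that mixes stopwords and non-stopwords is left unchanged, since it remains a valid query term.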
See Also:
For a complete list of stopword transformations, see Appendix C, "Stopword Transformations". To learn about how to examine stopword transformations, see Chapter 5, "Query Expression Feedback".
ConText indexes text by identifying tokens (words). For English and most European languages it assumes that blank spaces delimit tokens. At index time, ConText must also know how to interpret punctuation characters and characters that occur within words and numbers. Such special characters must be defined in the BASIC LEXER preference. They are described as follows:
In the BASIC LEXER preference, ConText defines a default set of characters for each group.
The way you query on tokens that contain these characters depends on how ConText indexes the tokens containing these characters. This is because ConText tokenizes words at query time the same way it tokenizes words at index time. To query on words or numbers that contain special characters, you must know how these words are represented in the index.
See Also:
For more information about defining special characters for the BASIC LEXER preference, see Oracle8 ConText Cartridge Administrator's Guide.
Punctuation and continuation characters are not indexed with the words they occur next to or with, and thus are ignored by ConText at query time. The following table shows how ConText strips punctuation characters at query time:
Printjoins and skipjoins are characters such as hyphens that join words together.
When you define a character as a printjoin, such as a hyphen, you specify that the words on either side of the hyphen are to be indexed with the hyphen. For example, sister-in-law is indexed as the token sister-in-law.
When you define a character as a skipjoin, such as a hyphen, you specify that the two words on either side of the hyphen are to be indexed as one token without the hyphen. For example, sister-in-law is indexed as sisterinlaw.
To query on words that contain a join character, you must know if the character is defined as a skipjoin or printjoin in the BASIC LEXER preference.
For example, if the hyphen character is defined as a printjoin, you must write your query with the hyphen, since the indexed token contains the hyphen. Thus, to query on all the documents that contain the term sister-in-law, you must write your query as follows with the hyphen:
'{sister-in-law}'
However, if the hyphen character is defined as a skipjoin, you must write your query without the hyphen. Thus, to query on all documents that contain sister-in-law, you must write your query as:
'sisterinlaw'
This query actually returns all documents that contain either sisterinlaw or sister-in-law, because with the hyphen defined as a skipjoin both forms are indexed as the token sisterinlaw.
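The difference between the two settings can be demonstrated with a simplified tokenizer in Python. This is a model for illustration, not the BASIC LEXER: `tokenize` and its `printjoins` and `skipjoins` parameters are hypothetical, and every character that is neither alphanumeric nor a join character is treated as a token break.

```python
def tokenize(text, printjoins="", skipjoins=""):
    """Split text into tokens, honoring printjoin and skipjoin characters."""
    tokens, current = [], ""
    for ch in text:
        if ch.isalnum() or ch in printjoins:
            current += ch          # kept as part of the token
        elif ch in skipjoins:
            continue               # dropped, but the token keeps going
        else:
            if current:
                tokens.append(current)
            current = ""           # any other character ends the token
    if current:
        tokens.append(current)
    return tokens

print(tokenize("sister-in-law", printjoins="-"))
print(tokenize("sister-in-law", skipjoins="-"))
print(tokenize("sister-in-law"))
```

Because the same tokenization applies at query time, the query term has to be written in the form that the index settings produce.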
Numjoin and numgroup characters are characters that can appear in numbers, such as the decimal point and the comma.
A numjoin is a character that occurs once in a string of digits, such as a decimal point, and gets indexed with the number. (ConText defines the decimal as a default numjoin character for the BASIC LEXER preference.) For example, the number 3.14 is indexed as 3.14. Thus to query on 3.14 with the decimal point defined as a numjoin character, you write:
'3.14'
When you define the numjoin character to be NULL, ConText indexes 3.14 as the two separate numbers 3 and 14.
A numgroup is a character such as a comma that groups digits together in a number. Numgroup characters get indexed with the number. (ConText defines the comma as a default numgroup character for the BASIC LEXER preference.) For example, the number 6,344,555 gets indexed as 6,344,555.
To query on a number that contains numgroup characters, you must write the query with the numgroup character. For example, to query on 6,344,555, you write:
'{6,344,555}'
Note that the comma must be escaped, because an unescaped comma is interpreted as the ACCUM operator.
When you define the numgroup character as NULL, numbers such as 1,000 get indexed as 1 and 000.
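The numjoin and numgroup behavior can be sketched the same way in Python. The function below is a hypothetical model: passing `None` stands in for a NULL setting, and it does not enforce the rule that a numjoin character occurs only once in a number.

```python
import re

def tokenize_number(text, numjoin=".", numgroup=","):
    """Return the tokens for a numeric string; None models a NULL setting."""
    kept = "".join(ch for ch in (numjoin, numgroup) if ch)
    # Digits plus the kept characters stay in one token; anything else splits.
    return re.findall("[0-9" + re.escape(kept) + "]+", text)

print(tokenize_number("3.14"))                   # decimal point kept in the token
print(tokenize_number("3.14", numjoin=None))     # NULL numjoin splits the number
print(tokenize_number("1,000", numgroup=None))   # NULL numgroup splits at the comma
```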
Startjoin and endjoin characters are non-alphanumeric characters that start and end tokens. These characters are indexed with the token they occur with.
You or your ConText administrator typically define startjoin and endjoin characters when you index tagged text such as HTML. This makes it easy to define sections for section searching as well as to query on the tags themselves.
For example, to query on the tag <HEAD> with < defined as a startjoin and > defined as an endjoin, write your query as follows:
'{<HEAD>}'
In the query above, an escape sequence is necessary, since > is the threshold operator.