Oracle8 ConText Cartridge Application Developer's Guide Release 2.3 A58164-01
This chapter describes how to perform theme queries. The following topics are covered:
Theme queries enable you to search for documents by their major concepts. The following sections illustrate the theme indexing and querying processes and how they use the knowledge catalog.
A theme query is usually a word or phrase that captures the concept for which you are searching. To better understand how to select the word or phrase that represents your idea, you must have a sense of how concepts and categories are organized in the knowledge catalog, the information store ConText uses to derive themes during indexing and querying.
The knowledge catalog is a tree-like structure whose branches break down various realms of discourse. The knowledge catalog is divided into the following six main categories as shown in Figure 4-1:
These categories are divided further into more specific categories and concepts. Categories are defined as classifications of related nouns and ideas that can be sub-divided into further categories and concepts.
Child categories are related to parent categories by an "is-associated-with" relationship, loosely defined as such to cover other standard child-parent type relationships such as "is-a-part-of", "belongs-to", or "is-a".
Figure 4-1 illustrates the basic structure of the knowledge catalog, showing a breakdown of an example branch within the top-level category of science and technology. In the example branch, the concept of insects belongs to the category of zoology, which is a part of the more general category of biology, which is part of the even more general category of science and technology.
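The example branch can be pictured as a chain of child-to-parent links. The following Python sketch is only an illustration: the dictionary, the function name branch_of, and the placement of hard sciences between biology and science and technology (taken from the discussion of Figure 4-2 later in this chapter) are assumptions made for the example, not the way ConText stores the knowledge catalog.

```python
# Hypothetical child -> parent links for one branch of the knowledge catalog.
# This is an illustrative model, not ConText's internal representation.
PARENT = {
    "insects": "zoology",
    "zoology": "biology",
    "biology": "hard sciences",
    "hard sciences": "science and technology",
    "science and technology": None,  # top-level category has no parent
}


def branch_of(concept):
    """Return the path from the top-level category down to the concept."""
    path = []
    node = concept
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return list(reversed(path))


print(" > ".join(branch_of("insects")))
```

Walking the links upward from a concept yields every category the concept is associated with, which is the information theme indexing relies on.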
The organization of the knowledge catalog has the following implications for theme indexing and querying:
Concepts are leaf nodes in the knowledge catalog and can be associated with any level of category. Concepts can be either concrete or abstract.
Concrete concepts are ideas founded in the real world, usually described by nouns or noun phrases. Examples of concrete concepts are jazz music and football.
Abstract concepts are ideas such as happiness or success, usually described by abstract nouns. For example, if a news article was about the success of a famous football player and the article used words and phrases that described success, ConText might attach a theme of success to the document. Likewise, a news article describing a Christmas celebration might have a theme of happiness attached to it.
When analyzing documents for theme querying and theme indexing, ConText must convert words and phrases you enter to their normal forms so they can attach into the knowledge hierarchy. To make this conversion, the knowledge catalog keeps the following lists:
Before you can issue a theme query, your set of documents must be indexed by theme. During theme indexing, ConText extracts up to sixteen main concepts or themes of a document. A theme can be a concrete concept, such as insects, or an abstract concept, such as success, that is sufficiently developed in the document.
When indexing a document by theme, ConText attempts to classify document concepts using the knowledge catalog. Some concepts in a document might not have representation in the knowledge catalog; other themes might be inherently ambiguous terms that ConText cannot place in the knowledge catalog. Hence, ConText recognizes the following types of themes:
See Also: For more information about how to create a theme index, see Oracle8 ConText Cartridge Administrator's Guide.
ConText creates theme vectors for every theme that can attach into the knowledge catalog. A theme vector is the branch of the knowledge catalog to which the concept attaches. Every level in the theme vector is weighted equally in the index. Refer to Figure 4-2.
In the example in Figure 4-2, the hypothetical document A entitled "The Reproductive Cycle of Insects" contains information about insects. The document theme vector T1 has five levels corresponding to the branch of the knowledge catalog: science and technology, hard sciences, biology, zoology, and insects. Every level of the branch gets entered as a searchable row in the theme index.
Themes that cannot attach into the knowledge catalog are indexed as a single row. For example, if ConText determined that a document was about a person John Smith, and John Smith was not known in the knowledge catalog, ConText indexes this name as a single row theme vector.
Ambiguous document themes such as the term cricket or the term table also have no representation in the knowledge catalog and hence are indexed as a single row. To query on such document themes, you would rely on other supporting themes such as sports or insects being indexed with an ambiguous theme like cricket.
See Also: For more information about querying ambiguous themes, see "Refining Theme Queries" in this chapter.
The theme weight is a measure of the strength of a theme relative to the other themes in a document. Weights are associated with theme vectors, and thus every level within a theme vector has the same weight. For example in Figure 4-2, every level in theme vector T1 has a weight of 40.
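As a rough picture of what ends up in the theme index, the following Python sketch expands a theme vector into one row per level, each carrying the vector's weight. The row layout, the function name index_rows, document B, and the weight of 25 are hypothetical and chosen only for illustration; the actual index tables are not described in this chapter.

```python
def index_rows(doc_id, theme_vector, weight):
    """Return one (theme, doc_id, weight) row per level of the theme vector.

    Every level receives the same weight, as described for theme vectors.
    """
    return [(level, doc_id, weight) for level in theme_vector]


# Theme vector T1 of document A, with the weight of 40 used in the text.
t1 = ["science and technology", "hard sciences", "biology", "zoology", "insects"]
rows = index_rows("A", t1, 40)

# A theme that cannot attach to the catalog becomes a single-row vector.
# Document B and its weight of 25 are invented for this example.
rows += index_rows("B", ["John Smith"], 25)

for row in rows:
    print(row)
```

Document A contributes five rows, one per level of T1, while the unattached theme contributes exactly one.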
To execute a theme query, you specify a query string, which can be a sentence or a phrase with or without operators. ConText uses the knowledge catalog to normalize the word or phrase you enter into a standard form. It then looks up the normalized theme in the index and returns the documents that were indexed with the given theme. See Figure 4-3. Scores are calculated based on the weights associated with each theme in the index.
In the example above, a theme query on either science and technology, hard sciences, biology, zoology, or insects will retrieve the document indexed in Figure 4-2 entitled, "The Reproductive Cycle of Insects".
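The lookup step can be sketched in Python as follows. This is a simplified model under stated assumptions: the INDEX dictionary and the function theme_query are hypothetical, the query string is assumed to be already normalized, and the score is taken to be the stored weight, whereas the text says only that the score is based on the weight.

```python
T1 = ["science and technology", "hard sciences", "biology", "zoology", "insects"]

# Hypothetical theme index: every level of T1 points back to document A
# with the weight of 40 used in the Figure 4-2 discussion.
INDEX = {theme: [("A", 40)] for theme in T1}


def theme_query(theme):
    """Return (doc_id, score) pairs for an already-normalized theme,
    highest score first. An unknown theme returns no hits."""
    hits = INDEX.get(theme, [])
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


for theme in ("insects", "biology", "botany"):
    print(theme, theme_query(theme))
```

Any of the five levels retrieves document A, while a theme that was never indexed, such as botany in this toy index, returns nothing.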
ConText returns a relevance score for each document it returns in a theme query; the higher the score, the more relevant the returned document. This relevance score is out of 100 and is based on the weight of the indexed theme.
Generally, specifying broader themes or concepts in a theme query will return higher scoring documents.
When using operators in theme queries, the scoring behavior is the same as for regular text queries. For example, the OR operator returns the higher score of its operands, and the AND operator returns the lower score of its operands.
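The OR and AND scoring rules can be modeled in a few lines of Python. The function name combine, the document identifiers, and the scores are hypothetical; the sketch also assumes that a document missing from an operand's result counts as a score of 0 and that only documents scoring above 0 are returned.

```python
# Map each operator name to the function that picks the resulting score.
PICK = {"or": max, "and": min}


def combine(left, right, op):
    """Combine two {doc_id: score} results with OR (higher score)
    or AND (lower score). A missing document counts as score 0."""
    pick = PICK[op]
    docs = set(left) | set(right)
    scores = {doc: pick(left.get(doc, 0), right.get(doc, 0)) for doc in docs}
    # Keep only documents that still score above 0.
    return {doc: score for doc, score in scores.items() if score > 0}


# Invented per-theme results for illustration.
cricket = {"d1": 60, "d2": 30}
sports = {"d1": 20, "d3": 50}

print(sorted(combine(cricket, sports, "and").items()))
print(sorted(combine(cricket, sports, "or").items()))
```

In this model, AND keeps only d1 with the lower of its two scores, and OR keeps all three documents with the higher score available for each.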
With theme queries, the following operators have the same semantics as with regular text queries:
Operator | Symbol |
---|---|
Accumulate | , |
Or | \| |
And | & |
Minus | - |
Not | ~ |
Weight | * |
Threshold | > |
Max | : |
Some valid theme query strings using operators are as follows:
contains(text, 'cricket ~ insects') > 0;
contains(text, 'cricket & sports') > 0;
contains(text, 'music, reggae*5') > 0;
contains(text, 'chemistry > 30') > 0;
contains(text, 'soccer | basketball') > 0;
contains(text, 'computer software - Microsoft') > 0;
contains(text, 'music:20') > 0;
See Also: For more information about how to use operators in theme queries, see "Refining Theme Queries" in this chapter.
For more information about the semantics of query operators, see Chapter 3, "Understanding Query Expressions".
In a theme query, the thesaurus operators (synonym, broader term, narrower term, and so on) work the same way as in a regular text query, provided a thesaurus has been created or loaded.
In theme query expressions, the grouping characters ( ) [ ] have the same semantics as with a regular text query.
In theme query expressions, the wildcard characters % and _ work the same way as in regular text queries.
ConText does not support the following query expression operators with theme queries:
Operator | Symbol |
---|---|
Near | ; |
Fuzzy | ? |
Soundex | ! |
Stem | $ |
When you enter your theme query, ConText normalizes the word or phrase representing your theme into a form that it can compare with document themes in the index. The normal form consists of nouns and noun phrases, such as chemistry or personal computer. It is therefore better to use nouns and noun phrases when constructing theme queries. Avoid using sentences or long phrases.
For example, to search for documents about computer programming, use the noun form computer programming, not programming my computer.
Avoid splitting phrases that describe your idea as a whole. For example, use the phrase physical chemistry, not physical and chemistry.
Unlike regular text queries, theme queries are case-sensitive. For example, doing a query on the common noun turkey, which describes a type of bird, will not produce a hit on the proper noun Turkey, which describes a country.
Depending on how you write your theme query, ConText usually returns documents that are relevant to your query as well as documents that might be irrelevant to your query. Before you issue the query, you do not know what combination of document themes your query will return.
For example, a query on cricket might return documents on sports and insects depending on your document set. The best way to know the possible outcome is to run the query and examine the set of returned documents. Then you run the query again, using logical operators to eliminate unwanted documents.
You can approach the trial and error method in one of two ways: start with a broad theme query and narrow it, or start with a specific theme query and expand it.
Starting with broad theme queries might generate noise or unwanted documents. This is because of the following:
You can use the AND or NOT operator to eliminate unwanted documents. However, use these operators with caution, because in both cases you run the risk of eliminating documents that you might be interested in. For this reason, it is always better to have some noise than none at all.
You can use the AND operator with a qualifying theme to restrict your theme query and hence eliminate noise.
For example, if a theme query on cricket always returned documents about the sport cricket and the insect cricket, and you were interested only in those documents about cricket the sport, you can restrict your query by qualifying cricket with the more general category sports as follows:
'cricket and sports'
The disadvantage of using AND with a restricting theme is that a successful query depends on both themes being developed sufficiently in the document for ConText to index them as such. For example, a hypothetical news article about the personal affairs of a cricket player might not have the theme of sports developed substantially enough for ConText to index it as a theme, and therefore such a document would not be returned in the above query.
You can use the NOT operator to exclude unwanted themes. For example, suppose you have a collection of news articles. You find that a theme query on cricket returns documents about cricket the sport as well as cricket the insect.
In such a scenario, you can use the NOT operator to exclude the unwanted theme. Thus if you are interested only in those documents about the sport cricket, you exclude documents about insects as follows:
'cricket not insects'
One disadvantage of using the NOT operator is that you run the risk of excluding documents that are coincidentally about both the desired theme and the unwanted theme. For example, the above query does not return a hypothetical document about a cricket game that was swarmed by locusts, assuming that the theme of insects is developed sufficiently for ConText to index insects as a document theme.
Another disadvantage of using NOT is that you usually have a better idea of the themes you want, not of the themes you don't want. Predicting unwanted themes depends on knowing your document corpus. For this reason, using NOT is best suited for eliminating irrelevant high-ranking documents you specifically know about.
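The trade-offs of AND and NOT described above can be seen on a toy document set. In the following Python sketch, the document titles, their indexed themes, and the helper with_theme are all invented for illustration; the sketch models only which documents match, not how ConText scores them.

```python
# Hypothetical documents and the themes ConText is assumed to have indexed.
DOC_THEMES = {
    "match report": {"cricket", "sports"},
    "insect field guide": {"cricket", "insects"},
    "player's personal affairs": {"cricket"},  # sports theme not developed
    "match swarmed by locusts": {"cricket", "sports", "insects"},
}


def with_theme(theme):
    """Return the set of documents indexed with the given theme."""
    return {doc for doc, themes in DOC_THEMES.items() if theme in themes}


broad = with_theme("cricket")
cricket_and_sports = with_theme("cricket") & with_theme("sports")
cricket_not_insects = with_theme("cricket") - with_theme("insects")

print("cricket:", sorted(broad))
print("cricket and sports:", sorted(cricket_and_sports))
print("cricket not insects:", sorted(cricket_not_insects))
```

In this toy set, the AND query drops the article about the player's personal affairs, and the NOT query drops the match swarmed by locusts, even though both documents concern cricket the sport.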
Sometimes it is better to start with specific categories and then expand these queries into more general ones, especially when your query covers a topic that is categorized specifically in the world. For example, if you are searching for documents that are about bees, you issue a query on bees, which is a specific category of insects. If you find that the result set is not returning the documents you need, you can expand the query by issuing a theme of insects, which is slightly broader.
After expanding a query, you can use the NOT or AND operators to scale back the query.
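The expand-when-needed approach can be sketched as a small loop in Python. Everything here is hypothetical: the PARENT links, the toy INDEX of document identifiers, and the function expand_until. The sketch assumes that a broader category is indexed for every document that falls under it, as described for theme vectors.

```python
# Hypothetical category links and theme index for illustration only.
PARENT = {"bees": "insects", "insects": "zoology"}
INDEX = {
    "bees": {"d1"},
    "insects": {"d1", "d2", "d3"},
    "zoology": {"d1", "d2", "d3", "d4"},
}


def expand_until(theme, min_hits):
    """Broaden the theme one level at a time until the query returns at
    least min_hits documents or no broader category is known."""
    while len(INDEX.get(theme, set())) < min_hits and theme in PARENT:
        theme = PARENT[theme]
    return theme, INDEX.get(theme, set())


theme, docs = expand_until("bees", 3)
print(theme, sorted(docs))
```

Starting from bees and asking for at least three documents broadens the query once, to insects. The broader result can then be scaled back with AND or NOT.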
To execute a theme query, you specify a query string, which can be a sentence or a phrase with or without operators. ConText interprets your query, creating a normalized form of your query that it can use to match against document themes in the index. ConText returns a list of documents that satisfy the query, based on certain rules, along with a score of how relevant each document is to the query.
You can issue theme queries using either the two-step or one-step method. The way in which ConText matches themes and scores hits is the same for both methods.
To execute a theme query with the CTX_QUERY.CONTAINS procedure, you must specify a policy that has a theme lexer associated with it.
For example, you specify a theme query on computer software as follows:
execute ctx_query.contains('THEME_POL', 'computer software', 'CTX_TEMP');
In the above example, ConText normalizes computer software, and then attempts to match the normal form with document themes in the index.
When a match is found, ConText uses the weight of the matched theme to compute a score that reflects how relevant the match is to the query; the higher the score, the more relevant the hit. ConText returns the matched document as part of the hitlist.
You can execute theme queries in SQL*Plus using the one-step method. To do so, the text column must have a theme policy attached to it. The way in which ConText matches themes and scores hits is the same as in a two-step query.
For example, to execute a theme query on computer software:
SELECT * FROM TEXTAB
WHERE CONTAINS (text, 'computer software') > 0;
For a text column that has more than one policy associated with it, you must specify which policy to use in the CONTAINS clause using the pol_hint parameter. You might create two policies for a column when you want to perform both theme and text queries on the column.
For example, if the column text had a regular text policy and a theme policy THEME_POL associated with it, you issue a theme query as follows:
SELECT ID, SCORE(0) FROM TEXTAB
WHERE CONTAINS (text, 'computer software', 0, 'THEME_POL') > 0;
Since the pol_hint parameter is last, when you need to specify a policy in the CONTAINS function as in this example, you must also specify a placeholder, in this case 0, for the LABEL parameter.