Efthimiadis -- Query Expansion. In: ARIST, v31, pp 121-187, 1996.

Query Expansion
(Appeared in: Williams, Martha E., ed. Annual Review of Information Systems and Technology (ARIST), v31, pp 121-187, 1996.)

Efthimis N. Efthimiadis
Information School
University of Washington
Box 352930, Seattle, WA 98195-2930, USA
e-mail: efthimis@u.washington.edu
phone: (206) 616-6077; fax: (206) 616-3152

Contents

1  Introduction
2  Scope
3  The Curse of Dimensionality
4  Manual Query Expansion
    4.1  Introduction
    4.2  Search Strategy
    4.3  Search Tactics, Heuristics, Moves
    4.4  Models and Applications
    4.5  Studies
5  Some General Consideration in Query Reformulation & Query Expansion
    5.1  Introduction
    5.2  Relevance Feedback
    5.3  Query Terms
    5.4  Online Relevance Judgments
        5.4.1  Sample size of relevant documents for relevance feedback and query expansion
        5.4.2  On which document representation should relevance judgments be based?
        5.4.3  Relevance assessments
    5.5  Ranking Algorithms and Term Selection for Query Expansion
        5.5.1  The WPQ algorithm
        5.5.2  The PORTER algorithm
        5.5.3  The EMIM algorithm
        5.5.4  The ZOOM ranking
        5.5.5  The R-LOHI algorithm
6  Automatic Query Expansion
    6.1  Introduction
    6.2  AQE: Based on Search Results
    6.3  AQE: Collection Dependent Knowledge Structures
        6.3.1  Term clustering
        6.3.2  Term co-occurrence
        6.3.3  Pseudoclassification
        6.3.4  Association thesaurus
        6.3.5  Conflation
    6.4  AQE: Based on Collection Independent Knowledge Structures
7  Interactive Query Expansion
    7.1  Introduction
    7.2  IQE: Based on Collection Dependent Knowledge Structures
    7.3  IQE: Based on Collection Independent Knowledge Structures
    7.4  IQE: Based on Search Results
8  Conclusions and Proposals for Future Research
    8.1  Ranking Algorithms
    8.2  User Studies and Query Expansion
    8.3  User Studies and Relevance Feedback Systems
9  References

1  Introduction

In the traditional (Boolean) online search environment the searcher must break down the request into distinct concepts. Then the searcher must think of how the concepts and the term(s) associated with them correspond to the document representation stored in the database. Once the terms have been chosen, they can be combined to form the query. However, the mere realization that one term per concept might sometimes be inadequate to express a concept accurately and the effort to find terms to complement the initially chosen term is an instantiation of query expansion (QE). This situation requires a change in the searcher's thinking process for choosing terms. It may be necessary to consult a thesaurus, a list of subject headings, a dictionary, or a classification system and its index to aid in the selection of terms. This effort usually requires specialized training or experience on the part of the searchers because the results for the inexperienced and those who do not consult the search aids will most likely be poor.

Another way of getting ideas to choose terms for query expansion is by browsing the dictionary index of the database that is being searched or the cumulative dictionary index of all the databases available in the retrieval system, such as, DIALINDEX in DIALOG; this, of course, requires imagination and experience on the part of the searcher. Choosing terms from unstructured thesauri, dictionary indexes, or alphabetical word lists is a difficult task for the nonexpert searcher because these lists are usually created as the sum of all the words in all databases that are available in a particular vendor retrieval service at that time.

For the sake of simplicity the online search can be reduced to two stages: (1) initial query formulation and (2) query reformulation. At the initial query formulation stage, the user first constructs the search strategy and submits it to the system. At the query reformulation stage, having had some results from the first stage, the user manually or the system automatically or the user with the assistance of the system, or the system with the assistance of user tries to adjust the initial query and improve the final outcome.

Query expansion (or term expansion) is the process of supplementing the original query with additional terms, and it can be considered as a method for improving retrieval performance. The method itself is applicable to any situation irrespective of the retrieval technique(s) used. The initial query (as provided by the user) may be an inadequate or incomplete representation of the user's information need, either in itself or in relation to the representation of ideas in documents.

Query expansion may take place in the initial query formulation or in the query reformulation stage of the online search, or in both. The more general concept of query modification may involve deletion of terms as well. Query expansion, as depicted in Figure 1, can be performed manually, automatically or interactively (also known as semi-automatic, user mediated, and user assisted).

[Query Expansion: Methods and Sources]
Figure 1: Query Expansion: Methods and Sources.

Two key elements need to be considered when applying any form of query expansion are: (1) the source, that will provide the terms for the expansion, and (2) the method that will be used to select terms to be used in the expansion (e.g., ranking algorithm). Figure 1 shows possible sources. One type is based on the search results, and it relates to the relevance feedback process. Documents retrieved in an earlier iteration of the search that have been identified as relevant become the source for the query expansion terms. The other type is some form of a knowledge structure (taking the phrase broadly) that is independent of the search process. Such knowledge structure can either depend on the collection, i.e., be corpus based, or be independent of it. Examples of collection-dependent knowledge structures are:

Examples of collection-independent knowledge structures are:

Tests on the searching function in information retrieval can be divided in two categories (SPARCK JONES, 1981, p. 237). One is concerned with operational systems and concentrates on, search logic and search behavior. The other tests are concerned with experiments on query modification by relevance feedback. The latter tests have been using test collections where the role of the user in decision making in query expansion and query modification either does not exist or at best is simulated. This division of searching tests in information retrieval research is associated with the differences in the retrieval techniques involved, i.e., Boolean vs. weighted retrieval, which also corresponds to commercial and experimental systems. Quite often research on searching reported for one side seems to be irrelevant or not applicable to the other and vice versa. However, there are strong reasons for attempting to bridge the gap in the case of query expansion. This is true for a number of reasons, including the role that knowledge organization and representation, semantics, and human information-seeking behavior have to play during the process. Some of the research questions that emerge when considering query expansion are:

This chapter reviews the ongoing research on query expansion that attempts to provide answers to these questions.

2  Scope

This is the first ARIST chapter that is solely devoted to query expansion. The literature on the subject is quite diverse since query expansion is an integral part of the information retrieval process.

The level of the treatment of the subject is also varied. The aim for this review is to provide the overall framework of query expansion and present in a selective manner representative research under each type of query expansion.

The organization of this review is outlined mainly in Figure 1. The research reviewed is presented according to the three types of query expansion: manual, automatic, and interactive. Each type, as appropriate, includes discussion of the source of terms and the methods for selecting them. The chapter summarizes the literature on these issues with some historical background.

Because the method of term selection relates to retrieval techniques, methods are discussed here only to the extent that they refer specifically to query expansion. Retrieval techniques are discussed in the ARIST reviews by BELKIN & CROFT and KANTOR and search techniques in the ARIST review by BATES (1981). Other sources include FRAKES & BAEZA-YATES, the SIGIR (Special Interest Group for Information Retrieval) proceedings of the Association for Computing Machinery, the TREC (Text Retrieval Conference) proceedings (HARMAN, 1993; 1994; 1995), the journals Information Processing & Management and the Journal of the American Society for Information Science, and the classic monographs by VAN RIJSBERGEN (1979) and SALTON & MCGILL.

3  The Curse of Dimensionality

An important concept in query expansion is the dimensionality and its consequences for retrieval. The "curse of dimensionality", which was adopted in information retrieval from the pattern recognition literature (VAN RIJSBERGEN, 1979, p. 130), refers to the problem of selecting a set of search keys that can be legitimately used to predict relevance.

More specifically this problem refers to the situation in which a decision should be made as to which terms should be used in query expansion. In the information retrieval situation we are cursed with a high dimensionality term space, i.e., with a large number of attributes, which in this case are the terms. In other words, we are faced with the complete set of index terms in the document collection upon which we have to base the decision of selecting terms and which makes such a decision nearly impossible. A natural way of dimensionality reduction in information retrieval is the acceptance that the query terms themselves are good guides for suggesting the terms to be used by the system for predicting relevance.

The association hypothesis (VAN RIJSBERGEN, 1979, p.134), states that:

``If an index term is good at discriminating relevant from nonrelevant documents, then any closely associated index term is also likely to be good at this.''

If the association hypothesis is accepted, then in a manual system dimensionality reduction during query expansion is controlled, as well as achieved, in a number of ways by the searcher, who uses the initial query terms that are then clustered with other closely associated terms. Decisions about the association of terms are entirely at the discretion of the searcher and are usually determined by the quantity and quality of the knowledge structures held by the searchers themselves as well as the availability and use of searching aids, such as thesauri, concept maps, and word lists.

In the case of automatic query expansion(AQE), in a weighted or associated retrieval system, the problem of dimensionality reduction is being taken care of by computing the weighting function only for the terms specified in the query and maybe their close associates rather than for all possible terms. In interactive query expansion (IQE) both the system and the user jointly share the responsibility for achieving dimensionality reduction.

An important issue in any type of query expansion, however, is how one defines which terms are closely associated with the query terms. An overview of the many different approaches for defining these relationships is offered in this review.

4  Manual Query Expansion

4.1  Introduction

This section reviews research on query expansion when it is performed manually. This type of query expansion is mostly associated with Boolean online and CD-ROM searching.

There are many online searching models that have been developed based on the Boolean retrieval model. Search strategies, search tactics, heuristics and moves are reviewed from the point of view of query expansion.

A number of models have also been developed based mainly on the Boolean retrieval model and the modes of interaction between user and retrieval system. There have been attempts to incorporate these models, in whole or in part, in knowledge-based information retrieval (KBIR) systems. Examples of that work are presented as well as of recent studies that contain some examination of query expansion.

4.2  Search Strategy

Search strategy development, i.e., the plan devised to deal with the whole search on a requested topic, is the most intellectually demanding aspect of the online search and it is the central subject of discussion in many textbooks, journal articles, and reviews, for example, BATES (1981), BELKIN & VICKERY, HARTER (1986), HARTLEY ET AL., MEADOW & COCHRANE. This section reviews selectively the research as it pertains to query expansion.

Good search strategy development involves the use of one's knowledge about online searching systems, indexing vocabularies and conventions practiced in text database construction. In other words, it requires a good understanding of the mechanics of the matching paradigm of information retrieval and how this is implemented in the system searched, a full understanding of the information need(s) and an ascertaining of the search objectives, e.g., high recall or high precision. The result of such analysis will eventually determine the subsequent search formulation, which is the statement or set of statements which express the necessary query in a form understandable to the online system.

A significant decision to be made during the search strategy formulation is the correct decomposition of the information need and the identification of the key concepts or facets.1 Then the choice of a particular search strategy would determine the way the concepts should be combined and would also suggest possible alternative actions.

A number of types of search strategies have been identified and named in the literature. Some of the most commonly mentioned and used strategies are the: building block, citation pearl growing, briefsearch, successive fractions, most specific facet first, or lowest postings facet first.

The building blocks strategy (Figure 2), attributed to Charles Bourne by MARKEY & COCHRANE, is that learned and most often used by many searchers. It is a modular approach where:

(a)
The searcher decomposes the topic of the search into many concepts or facets.
(b)
The individual concept is further decomposed to form a group of terms (synonyms, quasi-synonyms, etc.).
(c)
All terms belonging to the same concept are joined by the Boolean OR operator. The results of each subsearch are then combined with the Boolean AND operator.

[Building Block Search Strategy]
Figure 2: The building block search strategy

The advantage of this approach is that it provides a somewhat clearer history of the search logic, i.e., one that is easy to review and understand at a later time. The approach tends to follow and read like the actual query formulation. This feature also provides users with a useful tool for query reformulation as the search progresses further. The idea/process of OR-ing the terms that are conceptually related to the initial (first) term, which was used to identify a facet, is in effect an instantiation of query expansion.

The Citation Pearl Growing strategy operates in a completely different manner. The searcher starts with a very direct search on the most specific term for each of the concept groups in the search request in order to find at least one citation. That is instead of OR-ing all the terms in each facet, as in the building block above, the searcher selects the most specific/representative term of that facet. S/he then combines the single term facet representatives into a Boolean expression of the form

(term_facet)A AND (term_facet)B AND (term_facet)C AND ¼AND (term_facet)N
This single Boolean expression is known as the Briefsearch and forms the basis of the citation pearl growing strategy which builds the entire search on top of it. The searcher then calls up one or more of these citations for online review, noting index (descriptors) terms and free text terms found in some of the relevant citations. These new terms are incorporated into subsequent query reformulations to retrieve additional citations. After adding these terms to the query and searching, one can again review more retrieved citations and continue this process in successive iterations until no additional terms that seem appropriate for inclusion are found.

Citation pearl growing is characterized by its spiral-like effect which is a very dynamic and empirical manner of strategy development. The revision of retrieved citations online can be very helpful in the identification of search terms for addition or deletion at later stages in the query reformulation process. The payoff from this approach can be incorporated into any of the other approaches. The citation pearl growing approach can be viewed in principle as the ``manual equivalent'' of the relevance feedback search, which is discussed later.

Search strategy formulation is a highly unstructured problem and it requires a broad range of knowledge (e.g., knowledge of the user's specific problem domain, knowledge of the document retrieval system and its constituent parts, and other common-sense knowledge as well). Therefore, although it is being studied systematically it is still not a well understood process. Consequently, the search formulation process is difficult to automate.

4.3  Search Tactics, Heuristics, Moves

Online searching, as discussed so far, involves more than just a straightforward Boolean combination of terms. It is increasingly complicated, dynamic and its success varies considerably depending upon the abilities of the individual searcher. Not only does one have to learn how to use the existing systems, their query languages and the available knowledge structures, but also to develop ways of information seeking which are adjusted/modified or dictated by the `electronic form' of these knowledge structures, i.e., the form these structures are found in the electronic environment.

The difference lies in the way a knowledge structure is transferred from one form to the other. Users have to develop a model of information seeking which matches that form. Once they have developed a model then they go ahead and (try to) search with it. The model is usually ``dictated'' (developed) by the form the structure takes and whether it is explicit or implicit. For example, the move from a paper-based to an electronic-based environment does not imply that what worked on paper will work in the electronic environment, and vice versa.

Although there may be more access points in an electronic environment access to information for the end users remains a problem because all too often the access points are ``hidden'' from them, i.e., are not available at the user interface. So, unless the searcher is well trained on the particular system and very familiar with the specific database structure, searching will always be done at a superficial level.

Therefore, the present day searcher has to behave like a detective in deciphering a crime, coming up with all sorts of tricks and being able to make associations and to take the search results beyond the obvious.

Over the years, a number of general rules of thumb have been developed to the aid of the (manual and online) information searcher. Some of them are empirically drawn rules (e.g. FIDEL, 1985; HARTER & PETERS) and some theoretically, which focus on the psychological properties involved in searching (e.g. BATES, 1979a;b). These have become known as information search tactics, idea tactics, heuristics and moves.

At this point it is necessary to redraw attention to the distinction between the search strategy and the tactics, heuristics, moves. Following Bates' distinction the former is the plan for the entire search whereas the latter are more limited, short-term plans. So, that a number of tactics could be employed to make up the entire search strategy as seen in Figure 3.

[The relation of tactics to search strategy]
Figure 3: The relation of tactics to search strategy [from EFTHIMIADIS 1992a].

There is an array of tactics, heuristics and moves available to the searcher to choose from depending on which stage of the search s/he is in at any one time. Tactics can be divided into groups according to the function they perform or according to the stages of the search at any one time. There are no restrictions on their usage, i.e. these can be used as often as thought necessary by the searcher, or as little, or not at all.

BATES (1979b) described 17 idea tactics, i.e. tactics to help generate new ideas or solutions to problems in information seeking. The focus of these tactics is psychological and they are intended to help improve one's thinking and creative processes in searching (Here searching is taken in its broadest sense. The tactics are also applicable to other environments, such as problem-solving). The idea tactics were based and evolved from the more domain specific information search tactics (BATES, 1979a). The 29 information search tactics are grouped into four categories of which the category on term tactics, i.e., tactics to aid in the selection and revision of specific terms within the search formulation, describes methods for performing query expansion. In a more recent paper BATES (1987) demonstrated how these tactics could be implemented online.

FIDEL (1985) suggested moves or changes in the search process to help the searcher improve the search. Her moves are complimentary to Bates' tactics and strategies (FIDEL, 1986, p.72).

Moves deal with search situations where the retrieved set (a) is too large, (b) too small, or (c) off target. These are divided according to the searching style employed by the searchers, i.e. operationalist and conceptualist. Operational moves rely on the trade-offs between free-text and controlled vocabulary searching and use the search system capabilities to modify the retrieved sets without changing the specific meaning of the query concepts. Conceptual moves, on the other hand, modify retrieved sets by changing the meaning of the query concepts and are often supported by subject-related analyses of requests and by the structure of index languages. There are 18 operational moves and 12 conceptual moves. The conceptual moves refer to ways of expanding the search.

HARTER & PETERS also suggested that general search heuristics should be adapted to achieve better search results. According to their definition ``...heuristics are general rules of thought or action, mental operations, tactics, behaviors, or attitudes that tend to produce useful results in certain problem-solving situations...'' [p.407]. They presented six classes of search heuristics relating to: Philosophical attitudes and overall approach; Language of problem description; Record and file structure; Concept formulation and reformulation; Recall and precision; Cost/efficiency. HARTER & PETERS further provided several general guidelines for each of these classes which users could follow in their search.

CHEN & DHAR observed searchers and based on that they proposed a semantic network representation to capture the subject area knowledge and developed a process model for query reformulation and query expansion. There are two approaches in their process model, semantic browsing and retrieval by instantiation. In the former, searchers obtain their search terms based on one of five semantic relations: synonym, broader, narrower, adjacent, or disjointed term. In the latter, searchers may obtain new terms or other cues for searching based on citation information from the title, author, publisher and the index terms fields.

In conclusion, a manually performed query expansion involves the various search tactics (or moves) suggested by the work of BATES (1979a;b; 1987), FIDEL (1985; 1986), and HARTER & PETERS. The selection of the appropriate tactics is critical to the ultimate search result because the tactics employed will directly affect the progress of a search. Some of the search tactics will have been pre-determined in accordance with the search strategy. Others must be chosen and performed by the searchers at their own discretion while online. Finally, others will have to be modified, abandoned or changed as a result of the feedback received through the interaction. The successive examination and interpretation of intermediate results alters the searcher's knowledge structure, i.e. searchers may obtain a new perception or understanding of the search topic once they have started the online interaction.

4.4  Models and Applications

It has long been recognized that in the query formulation, query reformulation, and query expansion stages the knowledge structure of the searcher and/or intermediary mechanism plays a significant role for the success of the online search. A Boolean system has nothing that could be described as a dynamic cognitive structure (INGWERSEN, 1984), therefore any evolution in the interaction has to involve changes in the cognitive structures of the human beings. This is fine in the case of a professional human intermediary because the model is determined by education, training and the development of skills and experiences, both social and individual. The problem arises in the case where either the intermediary mechanism is a machine, or there is none, i.e., the end user interacts directly with the database. In most present machine-based intermediary mechanisms, this model would be static or passive depending on the mechanisms' level of sophistication (EFTHIMIADIS, 1990; VICKERY & VICKERY, 1992).

An example of a relatively sophisticated approach to this modelling appears in the work of Belkin and his colleagues, who start by investigating the functions carried out by intermediaries (BROOKS). They propose a model in which different expert systems, each dealing with one aspect of the intermediary's knowledge structure, would be all integrated as a distributed expert system and which would simulate the behavior of the human intermediary (BELKIN, SEEGER & WERSIG, 1983). A number of Distributed Expert Based Information System (DEBIS) prototypes have been developed based on the MONSTRAT model (BELKIN ET AL., 1987). These have implemented only some of the expert functions proposed, e.g., I3R (CROFT & THOMPSON, 1987), CODER (FOX, 1987) and MARIAN (FOX ET AL., 1993). An extension of the MONSTRAT model is the MEDIATOR model (INGWERSEN, 1992).

The dynamic nature of the online interaction makes it exceedingly difficult to formalize it in an algorithmic process. Hence the successful implementation of search tactics depends largely on the experience and the correct judgment of the searcher. Therefore, there have been a number of attempts to incorporate the search strategy techniques discussed as well as the tactics, heuristics, moves in knowledge based `online searching aids', as for example in IR-NLI (BRAJNIK ET AL., 1986; 1988), PLEXUS and TOMESEARCHER (VICKERY ET AL., 1987; VICKERY, 1988), EP-X (SMITH ET AL., 1989) and MICROARRAS (GAUCH & SMITH, 1989; 1993).

4.5  Studies

This section reviews some of the recent studies that inter alia examined user selection of terms with a tilt to query expansion.

PALMQUIST & BALAKRISHNAN used word association tests to elicit query expansion terms and enhance users queries. The users' original vocabulary and the word associations added to enhance the original vocabulary were compared with the titles and abstracts retrieved. The expanded vocabulary (word associations) was found to be significantly present in the abstracts and titles. The extra terms were found not to be discriminatory between relevant and nonrelevant documents.

SARACEVIC & KANTOR (p.201, 1988b) among their many findings reported that in general none of the effects of tactics and efficiency measures on relevance odds were large. Tactics and efficiency were measured as the number of commands, command cycles, search terms, preparation time, and connect time used. More search cycles, which allow for feedback, tend to produce better searches, while more search terms, more preparation time and more overall search time resulted in worse searches.

In a study of the searching behavior of 47 professional intermediary online searchers FIDEL (1991a,b,c) developed a model of the selection routine, i.e., a formal decision tree that represents the intuitive rules searchers use when they select search terms. The routine delineates the terminological conditions which lead to the selection of each option. As such it can be incorporated into the knowledge base of intermediary expert systems as discussed in FIDEL & EFTHIMIADIS.

In a study of how humanities scholars operate as end users of online databases, the terms used in the searches were analyzed and vocabulary categories identified (BATES, WILDE & SIEGFRIED; SIEGFRIED, BATES & WILDE). A typical search tended to be simple, using one-word search terms and little or no Boolean logic. Design recommendations for thesaurus development and the design of retrieval systems for the humanities are made (BATES, 1994).

In a study of undergraduate end users searching CD-ROM databases EFTHIMIADIS (1994a) reports that most users relied mainly on their class notes as the main/only source of terms for the query formulation and query expansion of their searches. Overall, there was very little use of controlled vocabulary or of browsing of the database dictionary. Similar results are reported in a study of mediated searches by SPINK where the most effective sources of terms were the users written statements, the user supplied terms during the interview and terms selected from specific records fields. ALLEN studies the ways in which users learn vocabulary during a search and use that vocabulary to alter and expand their searches.

KRISTENSEN & JÄRVELIN demonstrated the positive effect on recall of a manually-constructed searching thesaurus in a limited domain (economics & environment). KRISTENSEN used an expanded version of the searching thesaurus and by adding loosely-defined synonyms, related terms, and narrower terms achieved large improvements in recall at the expense of a small drop in precision.

5  Some General Consideration in Query Reformulation & Query Expansion

5.1  Introduction

In automatic query expansion the system itself is responsible in expanding the initial and/or subsequent queries based on some method. In any type of query expansion, but especially in the cases of automatic and interactive query expansion in partial match retrieval systems, one needs to consider the following issues: relevance feedback; initial query terms and query expansion terms; relevance judgments, e.g., document representation to base the relevance judgments, methods for relevance assessments, sample size of documents for estimating weights, ranking algorithms and term selection for query expansion. These issues are presented in this section while automatic query expansion and interactive query expansion are reviewed in subsequent sections.

5.2  Relevance Feedback

Apart from the manual/intellectual query reformulation where the task falls to the searcher it is possible for the system to take over this task entirely requiring only some yes-no answer from the user. This automatic query reformulation process is called relevance feedback (SALTON & MCGILL). Its aim is to improve the retrieved set by removing unwanted documents and adding more wanted documents without the user consciously constructing new search strategies, and by using relevance or nonrelevance information obtained from the user.

The typical automatic relevance feedback operation involves an initial search with a user-supplied query and an initial retrieval of certain documents. Then, from a display (usually of titles or abstracts of the retrieved documents) the searcher identifies/chooses some relevant documents. Those documents are used to modify the query by reweighting the existing query terms and/or by adding terms that appear useful and by deleting terms that do not. This process creates a new query which resembles the relevant documents more than the original query does.

If we consider the information involved in the user-system interaction in simple relevance feedback, it is like this:

  1. user supplies system initial query;
  2. system supplies user document description (usually as a ranked set of documents);
  3. user supplies system relevance judgments.

In its simplest form relevance feedback could be based on one document only. For example, after displaying a single document, the system could invite the user to see more documents like the one on display. Here the system, e.g., in an OPAC, could use the classification scheme and present the user with books of the same class-mark as the first viewed.

Another simple form is when judgment is based on a set as a whole (PIETILAINEN). Here, a set derived from a previous search becomes the seed for the new query formulation. The method uses `searchonyms' (ATTAR & FRAENKEL, 1977; 1981), i.e., terms which might be regarded as synonyms for the purposes of a particular enquiry and derived from terms contained in the seed set.

Automatic relevance feedback, generally, can be implemented in various ways depending on the retrieval technique used, e.g., vector space, probabilistic, etc., and also on the methods used to select terms for the feedback query. We can distinguish four term selection methods for query reformulation and query expansion.

In all cases, after the initial query formulation, the only form of feedback to the user is documents, and from the user is choices of documents. The query reformulation and query expansion is entrusted entirely to the retrieval system. Finally, query expansion can be performed with or without term reweighting. Query expansion without term reweighting, may involve the addition of terms from a knowledge structure, such as thesauri, term classifications, etc. Most research on relevance feedback and query expansion has been done using both query expansion and term reweighting.

5.3  Query Terms

In IR experiments terms, both initial query terms and query expansion terms, have been combined in many different ways. Most experiments have used all the terms of the initial query and have added a number of query expansion terms. The number of terms added for query expansion has been varied in the experiments depending mostly on the objectives of the experiment. In some IR experiments immediately after the initial search the query terms were renegotiated and only a portion of them, e.g., one third, was included in the expanded query (LESK, 1969; MINKER, WILSON & ZIMMERMAN; EKMEKCIOGLU, ROBERTSON & WILLETT). In some, there is no attempt to control the number of query terms, in other experiments there is a moderate number of query expansion terms added, e.g., 20 terms (HARMAN, 1988; 1992), and in others the number of terms added can be in the hundreds as implemented in massive query expansion, e.g., c150 terms (HAINES & CROFT) to c300 and c500 terms (BUCKLEY ET AL., 1994; 1995a;b).

5.4  Online Relevance Judgments

A search in a system with relevance feedback, with or without automatic or interactive query expansion, involves online relevance judgments. Once the initial query terms are selected, a search takes place, the results displayed to the users and online relevance judgments are obtained. Three questions are associated with the relevance judgments. The first relates to the question of which parts of the record relevance judgments should be based on, the second is concerned with the online relevance assessment of the documents, and the third considers the size of the sample of relevant documents to be used for relevance feedback and query expansion.

5.4.1  Sample size of relevant documents for relevance feedback and query expansion

The first criterion associated with the online relevance judgments is the size of the sample of relevant documents on which the system should base its estimation.

The retrieval system with the help of the user (through the online relevance judgments) is trying to model the probability distributions of the relevant and the nonrelevant documents, so that the probability of relevance of the documents can be estimated.

At least one relevant document must be identified by the user in order to be able to continue with the search. The sample size of just one document, however, is the absolute bare minimum required for a relevance feedback or query expansion search. A variety of sample sizes has been used in IR experiments.

In many relevance feedback experiments the sample is defined at a cutoff level of the 10 or 20 top-ranked documents. This is a commonly used method for getting a sample. These documents are then examined for relevance and those found relevant become the sample for the feedback iteration and the query expansion search.

The sample size that is of interest here is that of the set of documents judged relevant. The former, i.e., the cutoff level, is of concern only in terms of how many documents should be retrieved by the initial search or of how many documents should the user have to see while online. For example, HARPER & VAN RIJSBERGEN, HARPER, VAN RIJSBERGEN ET AL., SMEATON & VAN RIJSBERGEN (1981), SPARCK JONES (1979a;b; 1980) and more recently HARMAN (1988; 1992) have used in relevance feedback and query expansion experiments the 10 or 20 top-ranking documents.

A conclusion of these experiments was that a small sample of relevant documents could be adequate as the basis of the reweighting of terms. SPARCK JONES (1979b, p.143) used a sample of 3-4 documents and in another experiment 1-3 documents (SPARCK JONES, 1980, pp.328-329) and HARPER (p.6-3) suggested that at least one document is needed. This suggestion does not exclude the use of a larger sample. On the contrary, it is believed that the larger the sample of relevant documents the better the estimation should be. However, the problem of selecting an optimal sample size is still very much an open IR research issue. Some suggestions however exist and for some reason all tend to converge to recommending a sample size of 5 relevant documents, for example, HARPER (p.6-7). It is also worth mentioning here that MARTIN (p.73) and WHITE (p.34) in discussing ZOOM both propose without providing further justification, that ZOOM will be more effective if its frequency analysis is based on 5 relevant documents. The OKAPI online catalog requires at least 3 relevant documents in order to perform query expansion (WALKER & DEVERE) and EFTHIMIADIS (1992b) used 5 relevant documents.

A question then becomes how many documents should the user see before the desired number of X (e.g., 5) relevant documents is reached. It is believed that there should be some flexibility on the number of documents seen rather than strictly follow the 10 or 20 cutoff level of earlier IR experiments, especially when using an interactive system.

In the absence of user assigned relevance judgments an alternative method has been used. In this the IR researcher declares that the top X documents will all be treated as relevant. These documents then become the sample upon which estimates are based for term (re)weighting and/or terms get extracted for the expansion. This technique is well known and has been used by Salton and Sparck Jones in early experiments and more recently in TREC, e.g., BUCKLEY ET AL. (1994; 1995); EFTHIMIADIS & BIRON; EVANS & LEFFERTS.

5.4.2  On which document representation should relevance judgments be based?

This question is very important especially in the case when users are to provide relevance judgments. IR research has shown that relevance judgments are influenced by form, i.e. by the different document representations viewed, for example, title, citation, abstract and/or full text. SARACEVIC (1975, p.340) summarized the conclusions of research on the comparative effects on relevance judgments of different document representations in the sixties and seventies as follows:

Based on these experimental findings it is advisable that users should judge relevance of documents by looking at the most complete representation available, e.g., in a bibliographic database on both titles and abstracts.

5.4.3  Relevance assessments

In most relevance feedback systems, after seeing a document or parts of it, the user is prompted to assess the document's relevance. This judgment is based on a binary value of relevance, so that the outcome can only be either ``yes, it is relevant'' or ``no, it is not relevant''.

Relevance is a continuous variable and it has been established in major studies of relevance judgments, that it is an over-simplification to collapse a variety of degrees of relevance into yes/no decisions (CUADRA & KATTER; REES & SCHULTZ). For a recent discussion of relevance see SCHAMBER. However, the adoption of the binary definition of relevance is a required simplification of this complex notion which facilitates the calculation of relevance weights.

During the online relevance assessments of documents by users we are faced with two issues. One is that most present systems require relevance judgments on a binary scale. The other is that the user must respond on the question of the relevance of a document by judging each document for its relevance, i.e., the extent to which the subject matter of the document is about the query, rather that its usefulness. This treatment of relevance is necessary for relevance feedback systems to be able to provide an estimate of the probability of relevance that will be effective. In other words a document should be judged as relevant to a query only on the merits of the information it conveys and not by comparing it with what we already know or with what we have learnt from the previous documents we have seen during the search, or with how useful we perceive it to be.

The distinction of topical relevance, as defined here, and usefulness is that a document which is relevant but not useful to a user, because for example, the user knows about it, or the document is outdated, or it is written in a foreign language, is very important for relevance feedback systems. The importance of this issue lies in the fact that the online relevance judgments obtained are used by the weighting scheme for the estimation of the relevance weights. Although, one can argue that feedback systems would try to predict usefulness, including factors such as date and language, for most it seems appropriate to ask the users for topical relevance judgments as opposed to usefulness.

In helping users understand the difference and make the relevance judgments in the above context, one, for example, can make use of the question: ``regardless of whether you have seen this document before, is this the sort of thing you are looking for?  [y, n]'' (EFTHIMIADIS, 1992b), which is a modified question of the one used in the OKAPI online catalog (WALKER & DEVERE, p.26) and has been attributed to Robertson. A ``YES'' or ``NO'' answer can be given as a response to the question, which corresponds to the binary relevance. If the document is partially relevant or if the searchers for some reason are in doubt they should answer ``YES''. An additional reason behind this suggestion is that it seems better that the system retrieves marginal documents rather than excludes them. The user then, while online or offline, has the opportunity to decide again about the relevance and usefulness of a document.

5.5  Ranking Algorithms and Term Selection for Query Expansion

Query expansion requires a term selection stage where the system either chooses the terms by itself based on some criterion or it presents the query expansion terms to the user for selection. In both automatic or interactive approaches the ranked order of terms is of primary importance. The order should preferably be one in which the terms that are most likely to be useful are close to the top of the list. In addition, heuristic decisions can also be applied during this stage, for example, poor terms are excluded from the term list instead of being given low weights, etc.

What is known from IR research with respect to the relationship that holds between term frequency and term value and the effect on retrieval (SPARCK JONES, 1971; SALTON, 1975; VAN RIJSBERGEN, 1979): is that very frequent terms are not very useful; middle frequency terms are quite useful; infrequent terms are likely to be useful but not as much as the middle frequency terms; very infrequent terms are useful terms in the sense that when present they are good indicators of relevance. From this knowledge it can, therefore, be hypothesized that a good term ranking algorithm would bring the middle frequency terms near the top of the list.

This section discusses term weighting and ranking as it relates to term selection for query expansion. There is a plethora of ranking algorithms reported in the literature (SAGER & LOCKMANN; MCGILL ET AL; RO). These algorithms attempt to quantify the value or usefulness of a query term in retrieval. Formulae may estimate the term value based on some qualitative or quantitative criteria. The qualitative arguments are concerned with the ``value'' of the particular term in retrieval whereas the quantitative argument may involve some specific criterion such as a proof of performance. The relevance weighting theory is an example of the latter.

ROBERTSON (1986) suggested a modification to the F4point-5 formula (ROBERTSON & SPARCK JONES, 1976), referred to as F4MODIFIED, which takes into consideration the addition of new terms to the original query. The modified formula is:

wt = log (r+c)(N-n-R+r+1-c)
(n-r+c)(R-r+1-c)
(1)
where c = n/N; N is the total number of documents in the collection; R is the sample of relevant documents as defined by the user's feedback; n is the number of documents indexed by term t; r is the number of relevant documents (from the sample R) assigned to term t.

It was further suggested that the F4MODIFIED could be used in two ways (ROBERTSON, 1986, p.86). In automatic query expansion every term from the relevant document would be weighted using c = n/N and added to the search. In interactive query expansion the terms would be weighted in the same fashion and those selected by the user would then be weighted with the F4point-5 formula.

Apart from the theoretical argument for F4MODIFIED there were not any empirical data available. EFTHIMIADIS (1992b) tested the ranking behavior of F4point-5 and F4MODIFIED and found that their ranking is almost identical bringing at the top of the lists terms with low collection frequency (n) as well as low frequency in the relevant document set (r).

This type of ranking, is explained by the nature of the relevance weighting theory which assigns a higher degree of importance to the low frequency terms and therefore brings them at the top of the ranked list.

5.5.1  The WPQ algorithm

The results obtained from the empirical investigation of the F4point-5 and F4MODIFIED (EFTHIMIADIS, 1992b) meant that some other algorithm based on the relevance weighting theory might offer a better alternative. Consequently, a new algorithm was developed (ROBERTSON, 1990).

The independence assumption of the relevance weighting theory is that terms are distributed independently of each other in relevant documents and also that terms are distributed independently of each other in nonrelevant documents. In the query expansion stage of search an additional assumption should be made which considers statistical independence between the query expansion term and the terms in the entire previous search formulation.

The distribution of the relevant items for the initial formulation, for example, is further divided into two distributions: one which describes the relevant items that contain the term and one that describes the relevant item that do not contain the term. These two new distributions are identical according to the independence assumption. In other words, the presence or absence of the query expansion term does not affect the initial distribution. When the entire collection is considered, these assumptions predict a positive association between an initial query formulation and a good new term for query expansion.

The inclusion of term t in the search formulation with weight wt will increase the effectiveness of retrieval by

WPQt = wt(pt - qt)
where, wt is a weighting function, which in this case is the F4point-5; pt is the probability of term t occurring in a relevant document; and qt is the probability of a term t occurring in a nonrelevant document.

This means that irrespective of the weighting function (wt) used the rule for deciding the inclusion of a term in a query expansion search should be based on the ranking of WPQt instead of wt. Substituting the weighting function and the probability of relevance in WPQt with r, R, n, N we get:

WPQt = log (r + .5)(N - n - R + r + .5)
(n - r + .5)(R -r + .5)
·( r
R
- n - r
N - R
)
(2)

The relevance weighting theory attempts, via the Probability Ranking Principle (ROBERTSON, 1977), to optimize the entire length of the search curve from high-precision to high-recall. This is expressed by the wt component of the above formula which assigns greater importance to the infrequent terms. However, a model that determines which term(s) to add for query expansion will have to lead to a preference somewhere between the very infrequent terms that lead to high precision and the frequent terms that lead to high recall.

In the above formula this is achieved by pt - qt. This component, like the PORTER formula (equation 3), is influenced by the frequency of occurrence of a term in the relevant document set as well as the term's frequency in the collection. Therefore, the multiplication of the two components results in the effect which seems to be required by the model for term selection in query expansion.

5.5.2  The PORTER algorithm

PORTER & GALPIN in the MUSCAT online catalog used the following ranking formula for the ranking of terms for query expansion:
Porter = r
R
- n
N
(3)
where r, R, n, N are defined as in the F4point-5 weight. Looking at the formula it can be established that the weight is influenced by a term's occurrence (r) in the relevant document set (R) as well as the term's frequency (n) in the collection (N).

Because this formula is used to rank terms for query expansion the r/R portion never becomes 0, (because in order to use the formula there must always be at least one document containing the term that has been judged relevant) and it has a maximum value of 1 (this happens whenever r = R). In other words, this portion of the weight can take values that fall within the range: 0 < r/R £ 1 The n/N fraction however is influenced by a term's frequency (n). So that, the higher the term frequency (n) the higher the result of the fraction.

The PORTER formula seems to place more emphasis on terms that occur frequently in the relevant document set.

5.5.3  The EMIM algorithm

EMIM (expected mutual information measure) is a term weighting model incorporating relevance information in which it is assumed that index terms may not be distributed independently of each other (VAN RIJSBERGEN, 1977; HARPER & VAN RIJSBERGEN; VAN RIJSBERGEN, HARPER & PORTER)
EMIM = Eiq =
å
ti,wq 
DiqP(ti,wq) log P(ti,wq)
P(ti)P(wq)
(4)
or more generally
Giq =
å
ti,wq 
DiqDiq Piq
where, ti indicates the presence (1) or absence (0) of a term; wq indicates that a document is relevant (1) or nonrelevant (0); Diq indicates the value of a term as a relevance discriminator, and it is 1 if ti = wq or -1 if ti ¹ wq; Diq is the ``degree of involvement''; and Piq is the ``probabilistic contribution'' given by the log expression.

EMIM reduces to the F4point-5 weight when the ``degree of involvement'', i.e. the joint probabilities, are all unity. The EMIM weight is claimed to make use of the statistical dependence between index terms over the entire collection.

5.5.4  The ZOOM ranking

ZOOM is a frequency analysis tool available in the ESA/IRS online vendor. ZOOM was first discussed in the literature by MARTIN and an analysis of it from a cognitive viewpoint was given by INGWERSEN (1984).

ZOOM provides the automatic frequency analysis of phrases, single words, codes, or a combination of these contained in a selected set of references. Once a set of records is generated in a file the searcher may ZOOM the set. The ZOOM command can now analyze up to 20,000 records. However, the basic default analysis sample is the controlled (CT) and uncontrolled terms (UT) fields of the 50 records across the last selected set. The fields of the record that can be used to ZOOM vary from database to database. The searcher may request any of these fields, either on their own or in any combination.

ZOOM processes the records in the set and the phrases and/or single words of the analysis are displayed in columns. All terms are ranked in descending order of their frequency of occurrence in the sample set. Within ties, i.e. whenever there are more than one term with the same frequency of occurrence, terms are ranked in alphabetical order. It is important to note that every occurrence of a term is counted. In addition, the analysis takes into account all terms including stopwords. However, all punctuation is removed in the display of the ZOOM text analysis.

5.5.5  The R-LOHI algorithm

The R-LOHI algorithm has been proposed by EFTHIMIADIS (1993) as the result of the observation of the ranking behavior of the EMIM, F4MODIFIED, PORTER, WPQ and ZOOM algorithms that are used for the ranking terms for query expansion.

The R-LOHI ranking algorithm ranks terms according to r, i.e., their frequency of occurrence in the relevant document set in descending order, and resolves ties according to their term frequency, n, from low-to-high frequency.

It was hypothesized that the R-LOHI algorithm would have a similar ranking to PORTER and a performance approaching that of WPQ and EMIM. The evaluation of the performance of the R-LOHI algorithm against that of the other algorithms are reported in EFTHIMIADIS (1993; 1995) and EFTHIMIADIS & BIRON.

6  Automatic Query Expansion

6.1  Introduction

There is a considerable amount of laboratory experimental work on systems that incorporate some sort of automatic query expansion. However in many cases it is difficult to decide how the query expansion process itself is taking place as the expansion process is hidden within the overall process of information retrieval. Such systems often employ weighted or associative retrieval techniques. An example of this can be seen in the SMART system with the use of normalized vectors of information retrieval (ROCCHIO; SALTON, 1971).

Given therefore that there is, firstly, a large number of experimental systems that incorporate query expansion and modification techniques and, secondly, that it is often difficult to separate the expansion process from the overall retrieval process, I intend in this section to look only at a selection of the ideas in the area of automatic query expansion. Another related area for query expansion is the work done on natural language processing and syntactic analysis. This, although interesting, is a very large area and will not be discussed further in this review. Also excluded are from the review are relevance feedback and any retrieval techniques, including genetic algorithms, neural networks, etc., if these do not expand the query.

6.2  AQE: Based on Search Results

In the SMART system query expansion, as introduced by Rocchio, was/is done with the use of normalized vectors (ROCCHIO; SALTON, 1971). Queries and documents are represented by weighted vectors and for retrieval the query vectors are matched against document vectors. The system operates a relevance feedback mechanism and the original query is modified with each iteration of the loop by adding to the query term vectors for all retrieved relevant documents by a given cutoff point and subtracting term vectors for all retrieved nonrelevant documents by that cutoff point. This effectively added many terms, some with negative weights. It also increased the weights of some query terms and decreased the weights of others. Rocchio subsequently adjusted the method by allowing terms to be included in the expanded query if they either were in the initial query or occurred in at least half the relevant documents as well as in more relevant than nonrelevant documents. This adjustment produced positive results. IDE corroborated Rocchio's results. She also tested three strategies (basic Rocchio, Rocchio plus feedback from top ranked documents, and limited negative feedback from top ranked nonrelevant documents [dec-hi]) and found no significant differences between them. Some queries did better with positive feedback and some did better with negative. SALTON & BUCKLEY evaluated twelve vector space and probabilistic feedback approaches across six test collections. The tests also involved two levels of query expansion. Concentrating in the vector space approaches the methods used were basic Rocchio, Ide, and Ide dec-hi. For all six test collections the results seem to favor the Ide dec-hi method, although the other methods followed closely behind. When limiting the effects of negative feedback the Rocchio method gave best results and this finding corroborated earlier work. Of the two types of document weighting used the most effective involved a combination of the idf and within-document frequency of terms.

A procedure for the automatic adjustment of queries based on terms in the descriptor field (DE) of a database, i.e., by utilizing the manual indexing, has been described by VERNIMB. The procedure starts with the user having identified at least two relevant documents although as Vernimb suggests, searching effectiveness is improved the higher the number of known relevant documents. The system then examines the descriptors of these documents, which are 5-6 per document in this particular database. Then, it arranges them according to their frequency of occurrence in these documents, with the most frequently occurring descriptors first, and so on. Descriptors with the same frequency of occurrence, i.e., ties, are ordered in increasing order of their collection frequency (n). The one with the smallest collection frequency is placed first and that with the highest last (VERNIMB, p.340).

Partial queries are then generated for each one of the relevant documents. These queries consist of the descriptors (DE) that index the document and which are combined with the AND logical operator. For example, if there are seven DE in the first document then the query is: DE1 + DE2 + ¼ + DE7 . These partial queries are subsequently `loosened' by stepwise omission of descriptors eliminating the lowest ranked descriptor first until at least 10 new documents are found. The partial queries are then combined in an OR statement to form the `total query.' The next step involves a procedure for establishing the most useful partial queries based on relevance feedback of documents retrieved. A new query is formed and new documents are retrieved which after relevance judgments are made may be used for re-initiating the whole process. The procedure just described is wholly automatic and hidden from the user who only sees the retrieved documents.

DILLON & DESPER and DILLON, ULMSCHNEIDER & DESPER have described an algorithm for automatically incorporating search terms into a query using a form of relevance weighting known as prevalence weighting. Positive and negative prevalence is computed based on the occurrence of terms in relevant and nonrelevant documents that are retrieved from the initial search query. A number of threshold values for the prevalence weights exist so that groups of terms are assigned to a particular category depending on their prevalence weights. A new Boolean query is constructed by OR-ing together groups of terms according to their position in the prevalence category. Terms in the highest prevalence category are added (ORed) as single terms. Terms from the second highest category are ANDed as pairs of terms and so on. Finally bad (negative weight) terms are employed and these are NOTed. Any document containing one of these terms is not retrieved. Thus a query would be phrased as (DILLON & DESPER, p.202):
a series of positive expressions (Pi) which are ORed:

(P1 OR P2 OR ¼ OR Pn)
a series of negative expressions (Nj) which are ANDed:
(N1 AND N2 AND ¼ AND Nn)
and the final query is the intersection of the above expressions:
(P1 OR P2 OR ¼ ) AND (N1 AND N2¼)
This process results in new queries being formed based on the relevance judgments of retrieved documents and according to the calculated prevalence weights. The terms of the original user query are not included in subsequent queries. The terms found in the retrieved document set are ordered by these weights and used in the construction of new Boolean queries as described above. So this process uses the terms in the initially retrieved set as a starting point, abandoning whatever information was present in the user query, and resulting in a complete query reformulation rather than query expansion.

CITE (Current Information Transfer in English), was created as an interface system to Medline (DOSZKOCS & RAPP). It allows for a natural language input of the query which is then translated by the system. CITE makes use of weighting and ranking, using inverse document frequency and it has a relevance feedback facility. Its query expansion capability takes the form of modifying the search query using the MeSH headings of the retrieved document set. The MeSH headings of all the selected documents are incorporated into the search query and combinatorial search is performed to retrieve similar documents. Doszkocs and Rapp also mentioned other methods of query expansion that may be implemented in the system, such as using both the additional free text terms from the title, abstract and MeSH headings; or using those free text terms and MeSH headings that appear in relevant documents exclusively, i.e., do not appear in nonrelevant documents. These and other possibilities of automatic query expansion as well as a semi-automatic query expansion facility were implemented in subsequent versions of CITE (DOSZKOCS, 1983) and are discussed in the next section.

The OKAPI experimental online catalog has been influenced in some ways by the interactional aspects of CITE (WALKER & DEVERE). At the query input stage, OKAPI uses a dictionary table of substitution terms which allows common synonyms to be searched automatically, i.e even if the user has not entered them. For example if `Britain' is entered then OKAPI automatically includes Great Britain, GB, UK, United Kingdom, etc. (WALKER & JONES). OKAPI expands a query by selecting the `best' terms from a list containing the original query terms together with terms extracted from all the records which the user has judged relevant. Terms are weighted using a scheme based on the F4 formula and which gives a higher weight to terms that occur in more of the relevant document and a lower weight to those that do not. The list of terms is then sorted by descending term weight. Term selection for query expansion starts at the top of the list providing the following conditions are satisfied: the user has not already seen all the records indexed by this term; the term is not in the list of `dubious' words, i.e., a list of common words which cannot be stopped but which are not very useful in retrieval either, such as `analysis', `introductory', etc.; the term weight is positive; and fewer than 16 terms have been selected.

The main reason behind the choice of implementing only an automatic query expansion module was a concern that catalog users would not want a high degree of involvement and would probably not understand what the system was doing (WALKER, 1989; 1990). Recent versions of OKAPI (at City University and at UCLA) include (different) client graphical user interfaces, X-OKAPI and GIRAFFE respectively, with interactive query expansion modules. HANCOCK-BEAULIEU & WALKER report on the evaluation of the automatic query expansion facility in the OKAPI online catalog in an operational library setting. Data was collected using a combination of transaction logs, search replays, questionnaires and interviews. The results show that automatic query expansion was beneficial in a substantial number of searches. User intentions, the effectiveness of the ``best match'' search and user interaction were identified as the main factors that affected the usage of the query expansion facility which is an option available to users.

HARMAN (1992) using the Cranfield 1400 test collection demonstrated the importance of query expansion in addition to term reweighting. The addition of 20 well-selected terms and the execution of multiple feedback iterations improved retrieval performance. KWOK extended the neural network used in his probabilistic component approach to include query modification and query expansion. Tests with the CACM and CISI collections demonstrated that automatic query expansion using NN and learning from relevant documents was most effective with the addition of approximately 30 terms. Learning from irrelevant documents did not lead to better retrieval performance. BUCKLEY ET AL (1993; 1994; 1995) experimented with adding the 300-500 top ranked words and 50 phrases to the query. Massive query expansion was particularly effective in runs where relevance information is available (routing runs). The performance of five ranking algorithms for query expansion (WPQ, EMIM, PORTER, R-LOHI, R-HILO) was compared in the context of the TREC-2 environment and R-LOHI improved retrieval performance on both routing and ad hoc searches (EFTHIMIADIS & BIRON). Best performance was achieved by adding 10 terms from 10 documents. For the automatic query expansion runs of OKAPI 20-40 terms were added from the top 20-30 documents (ROBERTSON ET AL., 1994; 1995). THOMPSON ET AL. used a query expansion approach that combined the three methods reported in HAINES & CROFT.

6.3  AQE: Collection Dependent Knowledge Structures

6.3.1  Term clustering

During the late 60s and early 70s, query expansion was investigated a great deal. At that time the emphasis was concentrated in clustering single index terms together (using different techniques) before the user submitted a query by constructing a similarity matrix of terms. From this matrix groups of similar terms were identified based on some definitions of what cluster types were being used. Queries were then expanded by simply adding clusters of related terms. Typical examples of this type of research is the work of SPARCK JONES (1971) and of MINKER, WILSON & ZIMMERMAN, which appear to have rather conflicting views.

The most extensive work in the area of term clustering was done by SPARCK JONES (1971; 1973b;c; SPARCK JONES & BARBER; SPARCK JONES & BATES; SPARCK JONES & JACKSON). She formed groups of related word stems based on co-occurrence in documents. Her experiments explored a number of different clustering strategies and she found the effect of all these strategies relatively similar. For as long as the strategies were restricted to forming relatively small clusters their performance was best.

SPARCK JONES (1971, p.208) reached the overall conclusion that the use of an automatically obtained term classification does enable better retrieval performance compared to the performance obtained with the unclassified terms alone. She suggested that this occurred when frequent terms were being left un-clustered, infrequent terms were grouped into clusters and strongly connected terms were clustered by applying a matrix threshold SPARCK JONES (1971, p.195). This result was challenged by MINKER, WILSON & ZIMMERMAN who evaluated the effect of adding cluster-related terms to queries. They concluded that the expansion of queries by cluster-related terms was at best marginally useful and generally produced worse results than un-expanded queries. However, SALTON (1973) and SPARCK JONES (1973a) questioned Minker's experimental design. Results, similar to Minker's, obtained by other researchers (LESK, 1969) have for the most part also been marginal. LESK (1969) concluded that manually constructed thesauri were superior to automatically constructed ones. Recent work by CROUCH provide some encouraging preliminary results for the latter.

Sparck Jones' subsequent research provided rather pessimistic conclusions because retrieval performance was significantly improved by term clustering only on one small collection (SPARCK JONES, 1973b;c; SPARCK JONES & BATES).

On the whole, the explanations given for the poor results from term clustering focused on the properties of the collections that were used to derive the clustering. In LESK's (1969) case it was the small size of the collection used, while Sparck Jones in a detailed analysis of her experiments concluded that the fault lay in insufficient differences in the vocabulary between relevant and nonrelevant documents (SPARCK JONES, 1973b;c). She also stressed the need to use strongly connected non-frequent terms (SPARCK JONES & BARBER).

An additional explanation given by LESK (1969) is that the clusters captured only terms whose meaning was purely local (to the small collection) and did not reflect their general meaning in technical text. SPARCK JONES (1971) also argued that clustering of words limits ambiguity instead of emphasizing it. According to her a match on two words from the same class is very suggestive of which word sense is present. However, since matching on two words in the same cluster is much less common than matching on a single word, it seems unlikely that this is an effective disambiguation strategy. Since phrases (bound descriptors) are much less ambiguous than single words, they might be more effective. Based on the complementarity of the term clustering and syntactic phrase formation methods LEWIS & CROFT combined the two approaches. Small improvements in retrieval effectiveness were shown using the CACM collection.

6.3.2  Term co-occurrence

Most if not all of the work on automatic query expansion so far, with the exception of SPARCK JONES (1971), has not been fully evaluated with regard to performance measurement. At best they have investigated a specific method of automatic query expansion as implemented in a particular system. Research on query expansion which tried to systematically evaluate the performance of query expansion using term co-occurrence data is the work of van Rijsbergen and co-workers. In a series of papers they advocated the use of query expansion techniques based on a maximum spanning tree (MST) (VAN RIJSBERGEN, 1977; VAN RIJSBERGEN & HARPER; VAN RIJSBERGEN, HARPER & PORTER; SMEATON & VAN RIJSBERGEN, 1981; 1983).

Experiments on three different test collections demonstrated that relevance feedback searches using expanded queries were superior to simple co-ordination level match searches for which these was no expansion and no relevance feedback information available (VAN RIJSBERGEN, HARPER & PORTER). But, it was not clear whether this improved performance was due to the feedback or to the expansion. SMEATON (1982) and SMEATON & VAN RIJSBERGEN (1983) have investigated the effects of automatic query expansion using three different sources of new terms: (a) Maximum Spanning Tree (MST). A MST is a term-term dependence structure similar to a thesaurus but derived from statistical associations between terms. It takes the form of a tree structure where every term in the collection points to or is connected to at least one other term. (b) Nearest Neighbors (NN). A nearest neighbor to a given term is that term which is the most strongly related statistically to the given term. (c) Using index terms from the relevant documents found so far.

The three strategies were tested on the Vaswani test collection. The F4 and EMIM weighting formulae were used and query terms were weighted according to one of the formulae. Relevance judgments were simulated and automatic query expansion was performed using MST and non-MST methods. A test was also performed where no query expansion was performed and where randomly selected terms were incorporated in the strategy. It was found that the above three query expansion strategies all had a detrimental effect on overall retrieval effectiveness (measured by averaging the precision and recall figures computed for each query). The effect in order of increasing degradation (i.e., from best to worst performances) of retrieval was: (a) no modification, (b) randomly selected terms added to original query, (c) terms derived from known relevant documents, (d) terms derived from the MST, (e) terms as nearest neighbors.

It was also found that the more extra terms that were added the greater the degradation in retrieval effectiveness. SMEATON & VAN RIJSBERGEN (1983) have put forward several reasons for this detrimental effect on overall retrieval effectiveness. One reason given is that the probabilities were estimated from little relevance information and so could not be estimated accurately and that the situation was compounded by the addition of extra search terms which resulted in a decline in retrieval performance. They also stated that in theory the modifications that should have yielded the best retrieval (i.e., MST and NN), in fact yielded the greatest degradation. This is because the effects that poor probability estimations have on retrieval are `clustered' into an area around the query. MST and NN terms will be highly related to the original query terms and will co-occur with the original search terms more often and therefore have a significant influence on the top ranked documents. Another reason given is that the query modification strategies may have been implemented too early in the overall retrieval strategy.

A weighting function which utilizes the principle of fuzzy sets has been also suggested by SMEATON (1984). This is achieved by assigning a secondary weight as well as a relevance weight to each of the search terms which indicates the terms relative degree of importance to the user's search. The `degree of importance weights' are combined with the relevance weights. This effectively makes a search term set a partial membership or fuzzy set. New sets of search terms are generated from relevant documents using a matching function whereby all terms from relevant documents are added automatically. In order not to swamp the original query each term is weighted according to how important it is to the overall retrieval. This is achieved by measuring the similarity between the original query and each relevant document and grading the contribution of the new terms obtained from a particular relevant document in accordance with how `similar' that document is to the query. The retrieval strategy therefore uses the user provided relevance feedback to compute a fuzzy set of search terms composed of the initial query terms and all the terms from all known relevant documents. SMEATON (1984) concludes that this strategy has not yielded any significant improvements in retrieval effectiveness over a retrieval strategy that uses conventional weighting of the initial query terms.

Therefore, to date the overall conclusion of the above mentioned research has been that query expansion based on term co-occurrence data is unlikely to bring significant improvements in the performance of document retrieval systems. There is recent evidence, however, which has identified a substantial limitation on the use of term co-occurrence data for automatic query expansion (PEAT & WILLETT). It has been pointed out that the above limitation arises from the characteristics of the coefficients that are used to measure the similarity between a pair of terms. Peat and Willett have concluded that:

Their findings provide a rationale for the lack of success attained in the studies which used term co-occurrence data for query expansion. These also explain the findings by SPARCK JONES (1971) that the best results were obtained if only the infrequent terms were clustered and if the more frequent terms were left un-clustered, and by SMEATON & VAN RIJSBERGEN (1983) that query expansion by the addition of randomly selected terms gave better results to any of their methods based on co-occurrence data. Apparently, this last point by itself (i.e., that randomly selected terms gave better results) could have been taken as an indication that something might have been wrong.

ELKALIFA suggests that the failure of these studies is mainly due to the heterogeneity of the collections used rather than to the inefficiency of the techniques themselves. ELKALIFA used a combination of cluster and term association techniques for query expansion. Cluster analysis techniques were used to subdivide the document collection into small more homogeneous collections. Term association techniques were then applied to determine which terms could be used to expand the original request. This technique was applied on a very small test collection of 150 documents and reported positive results. Three search strategies were formulated for each request consisting of: (a) the initial query terms; (b) terms extracted from the entire collection; (c) a combination of terms extracted from specific clusters and the initial query terms.

6.3.3  Pseudoclassification

Of the different types of knowledge structures used in information retrieval, the automatic generation of thesauri and pseudoclassifications is intuitively appealing with continuing research interest. Brief definitions of these knowledge structures are given here in order to facilitate the subsequent discussion.

A thesaurus is generated manually by subject experts and is used for indexing and retrieval, e.g., INSPEC, MeSH. A searching thesaurus (PITERNICK) is not used during the indexing of the database but only during searching. In contrast to conventional thesauri, which lead the searcher to the preferred search term, the searching thesaurus does not seek to standardize the term choices by the searcher, instead it provides alternatives to the terms that the searcher has in mind, such as synonyms, quasi synonyms, antonyms, and related terms.

In an automatically constructed thesaurus (association thesaurus) the associations between the terms can be obtained automatically by using the occurrence characteristics of the terms across the documents in the collection.

Pseudoclassification, is a user-dependent term classification which takes into account user relevance judgments of certain documents with respect to certain queries in order to build term classes designed to retrieve the relevant documents and to reject the nonrelevant documents (JACKSON; SALTON, 1980).

In determining the term-by-term relationships for the pseudoclassification a similarity measure, a set of queries, the user's relevance judgments, and a cutoff threshold value are needed. Then, whenever a nonrelevant document is retrieved to a query because the similarity value is high for the document to be retrieved, the term-document correlation values of that term-document pair are lowered so that the likelihood of this error happening again is reduced. This is achieved by arranging the terms in the query and the document so that enough term class mismatches would occur to exclude the retrieval of the document. When a relevant document is not retrieved because the similarity value between the query and the document is not high enough for it to be retrieved, then the term-document correlation is increased so that this document will be retrieved next time (JACKSON). This is achieved by forming suitable term classes so that the similarity between the document and the query is strengthened by matching these term classes.

YU (1974; 1975) developed a heuristic algorithm for constructing term classes that was more efficient than Jackson's exhaustive search algorithm. YU & RAGHAVAN and RAGHAVAN & YU proposed a method of thesaurus construction (pseudothesaurus) based on relevance judgments that instead of constructing large term classes identified pairs of related terms. This method was shown to be faster than constructing term classes but as effective in improving retrieval performance. Their ad hoc method however involves many parameters that need to be determined empirically. To alleviate these problems RAGHAVAN & JUNG adopted a machine learning approach for the construction of the pseudothesaurus.

WONG, CAI & YAO proposed an adaptive bilinear model, three layer feed-forward neural network, for the computation of the term associations. WONG & YAO (1990; 1993) proposed a probabilistic approach for utilizing the relevance information to estimate the term relationships.

6.3.4  Association thesaurus

A query expansion model using a similarity thesaurus is presented by QIU & FREI (1993). A similarity thesaurus is a matrix that consists of term-term similarities. It is based on how the terms in the collection are ``indexed'' by indexing features such as documents, i.e., for each term there is document vector space. The domain knowledge contained in a similarity thesaurus is then used to find additional search terms that are most similar to the entire query, rather than to select terms that are similar to a single term in the query. The query expansion terms are weighted according to the similarity between these terms and the query. An extended version of the model, descriptions of how to construct and update a similarity thesaurus, and results of experiments on small test collections are presented in QIU & FREI (1994).

PHRASEFINDER is an association thesaurus used with the INQUERY system (JING & CROFT). Query terms are searched in the PHRASEFINDER database and the retrieved phrases are then used to expand the initial query. Results of its usage demonstrated consistent retrieval performance improvements through automatic query expansion (JING & CROFT; BROGLIO ET AL.).

Experiments in query reduction had detrimental effects on retrieval performance whereas query expansion using an association thesaurus improved retrieval performance (LU & KEEFER).

MATCHPLUS uses a vector representation for words such that the query expansion is a direct, natural result of the representation (GALLANT ET AL.). These vectors are learned from free-text examples of the corpus. As a consequence, the vector set for the stems represent a domain specific lexicon whose contexts have driven the inter-vector relationships. MATCHPLUS' approach is similar to the Latent Semantic Indexing (LSI) approach which can be viewed as a kind of query expansion (DEERWESTER ET AT.). In the reduced-dimension LSI space words are correlated, i.e., words that are used in similar documents will be near each other in the space.

First-order thesauri are discovered in CLARIT through the indexing of a representative sample of documents of a subject domain (EVANS & LEFFERTS). For example, in the TREC environment, for each routing topic the document collection used to establish a thesaurus consists of the training set of relevant documents for that topic. The process clusters phrases through a combined syntactic and statistical analysis of the text. In STRZALKOWSKI and STRZALKOWSKI & CARBALLO term similarities are determined by the information contribution function which uses term co-occurrence in a syntactic context, with poor results so far.

6.3.5  Conflation

Conflation (i.e., stemming and string similarity measures) relate morphologically similar indexing and search terms. In addition, stemming is used in IR for the reduction of the size of the index files (FRAKES, LENNON ET AL.).

WALKER & JONES used stemming, automatic spelling correction and cross-reference tables to improve subject retrieval in the OKAPI online catalog. Stemming was based on the weak and strong variants of the Porter stemmer (PORTER, 1980). The weak stemming was recommended as it significantly increased recall in their experiments. KROVETZ describes a new morphological processing (stemming) approach as an alternative to PORTER's (1980) stemmer.

HARMAN (1991) used the Porter, Lovins and S removal stemmers on the Cranfield 1400, Medlars and CACM in the IRX system and found that none of them significant improved retrieval performance. Other sources for query expansion terms that have been tried without success include string similarity measures and Soundex codes (HENDRY, WILLETT & WOOD). EKMEKCIOGLU, ROBERTSON & WILLETT report on an evaluation of three methods for the expansion of natural language queries in ranked-output retrieval systems. The methods are based on term co-occurrence data, on Soundex codes, and on a string similarity measure. Searches for 110 queries in a database of 26,280 titles and abstracts suggest that there is no significant difference in retrieval effectiveness between any of these methods and unexpanded searches.

POPOVIC & WILLETT devised a stemming algorithm for Slovene text that worked very well compared with the Porter algorithm used on English translations of the documents and queries. Successful stemming and string similarity work on other languages and on historical texts in English has also been reported in ROGERS & WILLETT, ROBERTSON & WILLETT (1992). Expanded queries, by means of historical variants as identified by the digram-matching method, retrieved more relevant documents than the unexpanded modern language queries (FRITH, ROBERTSON & WILLETT).

6.4  AQE: Based on Collection Independent Knowledge Structures

FOX (1981) assembled eleven groups of lexical relations from the linguistics literature and then grouped them into five categories. For each query term a list of lexically related terms was manually determined and was used to expand the query vectors in SMART. The results show that the best performance came from when combining all categories, but antonyms, together. WANG, VANDENDORPE & EVENS developed a relational thesaurus, i.e., a thesaurus based on lexical-semantic relations and tested it by running the same experiment as described in FOX (1981). Their results showed that the relational thesaurus improved retrieval performance.

JONES applied the relational database model to thesaurus representation for the use in a retrieval system. She provides an outline with suggestions for using the stored information when matching one or more thesaurus terms with a user's query. The results by JONES ET AL. (1995) on incorporating intelligent algorithms were not encouraging. Query expansion using the thesaurus in a ranked output system reordered the output but it did not improve it.

PAICE introduced a thesaural model of information retrieval based on spreading activation, but there is no empirical evidence to support the model.

MEGHINI ET AL. describe a multimedia information retrieval model (MIRTL) based on a terminological logic. The model, inter alia, specifies denotational semantics for query answering. Queries are answered by relying on a lattice of definitions and connotations which is the logical equivalent of a thesaurus. This may be seen as the formal equivalent of expanding a query by taking into account dictionary definitions.

JÄRVELIN ET AL. present a deductive data model for thesauri. EXPANSIONTOOL is an aid for AQE based on the deductive thesaurus data model. By naming relevant concepts and by adjusting a set of parameters, the searcher can generate queries which vary in recall and precision, target database, query language and/or matching principles. The system uses a more elaborate building-blocks approach for concept expansion. Once the system has expanded all the terms it then uses a query language grammar to formulate the appropriate search statement to the target query language selected. EXPANSIONTOOL currently supports the INQUERY, ISO, TOPIC and TRIP query languages.

CROFT & DAS describe an experiment in which the query was enhanced with information obtained from users during the presearch interview. Different retrieval strategies were tried out and the top 10 documents from each strategy were merged into a single set. The results demonstrated significant improvements in retrieval performance.

The automatic query expansion in the OKAPI online catalog in addition to using terms from the relevant retrieved records it also uses the classification codes that are assigned to the record (HANCOCK-BEAULIEU & WALKER).

In SPECIALIST the natural language query is parsed in order to determine simple noun phrases (ARONSON, RINDFLESCH & BROWNE). Then each phrase is mapped to concepts in the UMLS Metathesaurus by generating variants of the words in the noun phrase. A degree of similarity is computed between the noun phrase and the generated variant terms from the UMLS Metathesaurus. The similarity, from strongest to weakest mapping, is defined as exact, simple, complex and partial match. In an experiment using the UMLS test collection and comparing retrieval effectiveness of three types of queries (unexpanded query, expanded query with UMLS Metathesaurus terms, and UMLS Metathesaurus terms only) they report that the expanded query increased average precision and performed better than the unexpanded query. The query using the Metathesaurus terms only had the worst performance. Performance improvement was minimal, a (4%) increase of average precision was reported. The authors point to future methodological modifications that promise better retrieval performance by exploiting the UMLS semantic network. COACH, a UMLS browser for expanding GRATEFUL-MED MEDLINE searches, when invoked it locates a term in the Metathesaurus and identifies all synonyms and related terms of the term and continues the process recursively until twenty MeSH terms are found (JACHNA, POWSNER & MILLER). COACH uses a modified idf weighting to rank terms.

VOORHEES & HOU and VOORHEES (1994a;b) used WordNet (MILLER), a general purpose thesaurus, as a source of related concepts for query expansion. Terms were either automatically selected or were hand picked. Some queries were improved, others were degraded. BOYD, DRISCOLL & SYU used the public domain 1911 version of ROGET's thesaurus to expand queries. Two approaches were implemented, one using a connectionist model and the other the vector space model, but had poor results. In TREC-3, DRISCOLL, THEIS & BILLINGS use a knowledge-based approach based on the ER database model. A TREC topic is expanded into several lists of information. Each list is either all the synonyms (and word forms) for a word appearing in the topic, or all the currently known values for an item mentioned in the topic. For example, for each required item of information (such the name of a company, or the type of a cancer), a list of possible values is made for that item of information. The set of lists for a topic are used to search for relevant information. The search involves scanning documents and count hits in the lists. Various combinations of the lists and various weighting of the lists are used in determining the relevancy value for a document.

7  Interactive Query Expansion

7.1  Introduction

This section considers methods where the user is offered search terms as part of the reformulation process. In interactive query expansion as opposed to automatic query expansion there are two parties responsible for determining and selecting terms for the expansion. One is the retrieval system itself which, as in the case for automatic query expansion, is designed to select terms from a number of predetermined fields of the bibliographic record and then weigh and rank those terms. The other is the user, who is presented with the ranked list of terms and has to decide which are the terms to be added to the search.

In a weighted search with interactive query expansion there is therefore a joint responsibility between system and user in selecting terms for the expansion. On one hand, the system suggests terms and presents them to the user, and on the other hand it is the users who select terms according to their preferences. So, it is the user who takes the final decision over the usefulness of a term. Therefore, the user choices of terms, rather than the system chosen terms alone, reflect the relative importance and the usefulness of terms from the user's perspective, thus increasing user satisfaction. The reasons of the success or failure of the search becomes therefore even more difficult to pin-point because in interactive query expansion we are increasing the uncontrollable variables.

The source of the terms for the selection stage, as in the case of automatic expansion, could be based either on the search results or on some knowledge structure which could subsequently be either dependent on the collection or independent of it. Attempts to incorporate a semi-automatic user-assisted (interactive) term selection stage are much fewer than their automatic query expansion counterparts. The discussion below examines some selected attempts of interactive query expansion.

7.2  IQE: Based on Collection Dependent Knowledge Structures

Examples of interactive query expansion based on the collection are the EXPAND or ROOT commands available in online vendors. These provide a form of feedback from a knowledge structure of the database which is the dictionary file. Users are given an alphabetical listing of descriptors and free-text terms to choose from and expand or modify their query formulation.

The INSTRUCT term clustering technique (WADE & WILLETT) identifies keyword stems which are most similar to the query term stem. The morphological expansion (FREUND & WILLETT; HENDRY, WILLETT & WOOD) calculates a measure of string similarity (measured in trigrams) between a selected query stem and each of the stems in the dictionary file of the database. Then, in both cases, the system displays the twenty most similar stems to the user who chooses the ones to be added in the query.

EUREKA (BURKET, EMRATH & KUCH) is an experimental full text retrieval system which uses a user specific thesaurus (there is not a system-wide one). Each user can create and maintain a personal thesaurus which is used by EUREKA at search time to find synonyms for the query terms. As additional user aids EUREKA can present on demand either a histogram of term frequencies based on the retrieved documents, or word-lists of terms that are used in many documents or have high average frequencies. From these lists the user selects terms to refine the retrieved set.

Feedback based on the result of the search and with a dynamic form of search modification is given by Thomas (ODDY) which treats the document collection as a thesaurus and lets the user browse through it. The user starts with a simple natural language expression of interest, and the program responds by presenting a document representative (i.e., title, author(s), index terms) and asks for the user's judgments. At this point the user can make negative or positive judgments, on the document as a whole, and/or on any index term(s) or author(s) listed. Then Thomas iterates the search while keeping track of its actions by creating a model of the search in the form of a network.

In I3R (THOMPSON & CROFT) browsing begins after the user has picked an item as the starting point. The system provides assistance to the user by suggesting paths that should lead to relevant information. Recommendation is given to guide and not to restrict the user options. This is reflected by the displays on the neighborhood and context maps, which consist of nodes representing concepts, documents and journal issues connected by links. In the center of the map is the document the user selected as relevant. The surrounding nodes are documents that the system has decided are likely to be interesting. When the user selects a node the document that it represents appears on the screen and the user can mark it as relevant or ignore it. After a number of documents have been judged relevant I3R will re-iterate the search.

7.3  IQE: Based on Collection Independent Knowledge Structures

Query expansion examples based on a knowledge structure which is independent of the collection are found in AID, CITE and in expert systems which use thesauri in one way or another.

The Associative Interactive Dictionary (AID) is a prototype system developed for the Medline and Toxline databases at the National Library of Medicine (DOSZKOCS, 1978). AID automatically generates and displays related terms, synonyms, broader and narrower terms and other semantic associations for a given search term. These are controlled vocabulary associations from MeSH in Medline or free text word associations in Toxline. The terms are derived from titles, abstracts and/or controlled indexing fields from retrieved documents. These terms are displayed in ranked order according to a `relatedness' value (R) which is calculated using a modified chi-square value. The retrieved set is defined as the set of documents retrieved by a given search term or Boolean query. If the observed document frequency of a given term in the retrieved set is less than its statistically expected frequency then the term is assumed to have no associative value. The rationale behind the AID system is based upon the assumption that terms showing a considerably higher than expected frequency of occurrence in a retrieved set are assumed to be semantically related to the terms that retrieved that set. Thus AID is said to be a tool that retrieves semantically related terms.

AID operates by storing a subset of the inverted files for the two databases in its in-core hash table. The hash table terms represent all the inverted files' index terms with a frequency of four or more postings. Searches are carried out in the usual Boolean fashion and AID can be implemented at any time through the `EXECUTE' command.

A typical search on AID requires that the searcher is not only familiar with the terminology of document representation of bibliographic systems (in order to select which field the system should base the analysis on) but has a knowledge of how statistical correlation works since one is invited by AID to decide on a correlation threshold (a number between .0 and .999). Another requirement is that one must also specify the number of records to be used in the associative analysis. Nevertheless, searchers are presented with a list of terms which are ranked according to their relatedness to the initial query term together with an indication on the number of documents each of them would retrieve. The searcher then could select or reject any of the associated terms for query expansion.

A number of the research ideas in AID were subsequently implemented by Doszkocs in the various versions of CITE (DOSZKOCS, 1983). An interactive session with CITE might be instigated by the searcher who enters an enquiry statement in natural language. CITE parses the input, identifies spelling mistakes, requests their clarification and then suggests to the searcher a set of potentially applicable single words and MeSH headings that match the parsed input and for which CITE proposes to search. These terms are ranked by some weighting formula (which has not been disclosed) and presented to the searcher for selection. Relevance feedback information is also utilized in order to locate other items. CITE performs a frequent analysis on the MeSH headings and presents them to the user who can choose any of these and/or add additional keywords to the search.

CITE automatically performs stemming on words entered by the user, then selects medical subject heading (MeSH) terms and terms from the dictionary file by matching the user's words against them. Then it ranks the selected terms and displays them to the user for approval and feedback before commencing the search (DOSZKOCS, 1983).

All expert systems which have pre-search aid modules, such as CANSEARCH, CONIT, TOMESEARCHER, etc., use a knowledge structure independent of the collection and help in the suggestion of terms. Some of these systems maintain a thesaurus as part of their knowledge base, while all provide some means for the identification of terms.

EXPERT (YIP) helps the user in building a query formulation by asking the user to split the topic into concepts and then to suggest terms to express each one of them. The system builds the query by developing these concepts and then it prints them out in a columnar manner with each column containing the terms that form one concept. Thus it has no thesaurus (or any form of semantic knowledge structure); its knowledge structure could be described as purely syntactic, relating to the structure of typical simple Boolean queries and utilizing some form of the building block search strategy.

CANSEARCH (POLLITT, 1981; 1987) uses a subset of the MeSH thesaurus, together with its own knowledge of how cancer queries tend to be structured, to help the user identify terms and incorporate them in the query. TOMESEARCHER (VICKERY), processes its natural language query input and uses the INSPEC thesaurus to supplement the terms and expand the query. A system specifically designed to assist users in the selection of query terms has been developed by SHOVAL (1985; 1986). Its knowledge base consists of a thesaurus enhanced by additional information about terms, which is organized as a semantic network.

The user inputs search terms and the system searches its knowledge base, selects terms that seem to be relevant, evaluates them and presents them to the user who chooses which are useful for the query.

The MICROARRAS full text retrieval engine (SMITH, WEISS & FERGUSON) retrieves text passages and then informs the expert system of the results that satisfy the request. The expert system evaluates them and decides whether and how to reformulate the query (GAUCH & SMITH, 1989; 1991). Query reformulation is done by using a thesaurus (i.e., adding terms from the thesaurus to the search terms), by adjusting contextual constraints on the Boolean operators (i.e., alternating the operators from the most to the least restrictive, as for example in the sequence `ADJ, WITH, SAME, AND, OR'), and by replacing Boolean operators. These techniques can be employed individually or in any combination.

In MenUSE (POLLITT, 1988) the emphasis is on the structure of the MeSH thesaurus. The user identifies concepts by starting from the top of the hierarchy and going down the tree till the appropriate entry is recognized and selected. The same process is repeated by the user for each of the concepts. Then the system combines them and presents a review to the user. Another system, which also takes advantage of the MeSH thesaurus, uses a graphical interface to achieve similar goals (MCMATH, TAMARU & RADA). The hierarchical structure of MeSH is graphically displayed and the user can move about the hierarchy with the help of the mouse. Queries can be formulated by selecting terms from the hierarchy and placing them in the query window. Multiple windows allow the user to simultaneously see the thesaurus, a histogram of document rankings and the document descriptions themselves.

In HIBROWSE (POLLITT, ELLIS & SMITH) different views of the same database are presented all at the same time on the screen. For example, MeSH headings appear in the center window along with indications of how many documents are indexed under each heading and how many terms are in the hierarchy below the displayed term. Additional windows display, for example, titles, author names, year of publication, document type, languages, etc. When the searcher selects a subject heading then all the windows get updated automatically to display the corresponding information, such as the titles of the papers indexed by the selected subject heading, etc. Searchers can then browse, select and combine information from any window so that they define and expand a query without the need of mastering a command language.

Searching in the MEDINDEX prototype is performed through the indexing of the query, which shares the document indexing knowledge-base (HUMPHREY, 1992; 1994). Many of the rules are shared between document indexing and query indexing in filling indexing frames. Frame-filling is considered a form of query expansion by prompting the searcher for aspects of a topic. For example, in filling a Drug Therapy frame, the system prompts for PROBLEM (disease treated), AGENT (drug used), and so forth. If the searcher filling this frame enters Estrogens or a specific estrogen term as the AGENT, the system reminds the searcher of the more specific term Estrogen Replacement Therapy. MEDINDEX therefore uses knowledge of specificity in the domain, and can initiate very situation-specific assistance, compared to searchers having to initiate look-ups in the MeSH trees assuming the searchers realize they need to take this initiative.

After the frames are filled, the indexing terms generated are displayed in an interface. Three techniques for query expansion are then provided (HUMPHREY, 1994; 1995): (a) Explosion of terms (i.e., hierarchical query expansion) is the default. (b) The interface includes ëxpand to parent", displaying the ancestors of a MeSH term for searchers to select any of them to be combined in union with the original term. Parent terms selected are not exploded by default. They are incorporated into the search strategy, which is then passed to the retrieval system. (c) The "SQL Query" version of the MEDINDEX prototype has an explosion interface which permits searchers to mark exceptions to explosion and then it produces source code for query expansion with exceptions. The code explodes a MeSH term, removes the searcher identified exceptions from the list, and then considers the result to be the explosion.

The design issues of integrating a dynamic lexicon with the AI-STARS full-text retrieval system are discussed by ANNICK. Like I3R (CROFT & THOMPSON), AI-STARS allow users to directly browse a lexicon during query (re)formulation and select terms for the expansion.

Searchers used a newspaper database as a further source of words and phrases in addition to their personal knowledge of the topic they were searching. For each search they analyzed the TREC topic, excluded any terms that seemed unhelpful and expanded the term list by adding synonyms, broader terms, narrower terms or other useful terms (COOPER, CHEN & GEY).

7.4  IQE: Based on Search Results

An alternative method for assistance in the search process is to present the user with information based on the results of the search. User-system interaction is on various levels of sophistication. For instance, the system presents to the user a list of terms based on their occurrences in an identified set of documents. Then the user feeds back his or her choice of terms. The document set on which this analysis is based may either be simply a set retrieved in the usual fashion (and chosen or accepted by the user as a suitable set for this purpose), or it may consist of documents individually selected as relevant by the user.

The ZOOM (MARTIN; INGWERSEN, 1984) feature on ESA/IRS is a helpful tool for online searching along these lines. It performs term frequency analysis on a number of records from the retrieved set(s). The user is then presented with screen-displays which contain terms in a frequency ranked order. The searcher selects terms which then can use to expand the query. ESA/IRS offers also QUESTQUORUM (D'ELIA & MARCHETTI) as a simple interface to its command driven system for inexperienced users which can also do a semi-automatic query expansion based on terms selected by the user from a ZOOM-like display. Recently, ESA/IRS introduced BRAQUE, a prototype interface that utilizes ZOOM, EXPAND, and thesaurus browsing (HYPERLINE) for term selection during query formulation and query expansion (BELKIN, MARCHETTI & COOL). Commands similar to ZOOM are now also available in other vendors, e.g. GET in ORBIT, EXTRACT in DIMDI, MEMSORT in QUESTEL, RANK in DIALOG.

There are also some other systems that have utilized the term frequency analysis function either in a ZOOM-like way or in some other form. IT (Information Transfer system) (WILLIAMS, 1984) uses the command EXPLORE to retrieve, organize and present index terms to the user, for addition to the search. Analogous is the command TERMS in the MUSCAT online catalog (PORTER & GALPIN). It presents terms and UDC numbers which are extracted from documents inspected and judged relevant and which were not included in the original query. With each term presented the user is asked whether or not to add it to the query. CITE (DOSZKOCS, 1983) utilizes the user feedback also in a similar manner. It automatically performs term frequency analysis on the records marked as relevant, and then it presents the terms in ranked order to the user for selection.

USERLINK, OASIS and IT are front-ends for both the unskilled user and the skilled intermediary (WILLIAMS, 1983; 1984; 1985; GOLDSMITH & WILLIAMS). The aim is to improve the recall of a search by helping the searcher select index terms from retrieved documents. Query expansion is achieved through the EXPLORE feature as follows.

The searcher carries out a search and identifies a set of relevant documents. IT then retrieves the titles and descriptors which are contained in the final results set. The retrieved items are then processed to provide two lists of the single words and phrases that are contained in the retrieved documents. The lists are ranked in order of decreasing frequency. The word list and phrase list are then displayed to the user. A simple dialog is provided which enables the user to select terms and to add them to the search.

The words and phrases are displayed in order of decreasing frequency of their occurrence in the retrieved set. The system can handle fairly large numbers of references. Williams suggests that up to 50 references should normally be retrieved and processed to keep the online time to a reasonable value. ZOOM and Williams' system are thus fairly similar in their approach, extracting terms from retrieved records which are subsequently displayed in an organized fashion using frequency information.

PORTER & GALPIN discuss MUSCAT, an OPAC at the Scott Polar Research Institute, which allows relevance feedback and query expansion in its advanced use of the system. In the advanced module of the Muscat OPAC a query can be in probabilistic or in Boolean terms. In the probabilistic query extra terms may be added in any of the following ways: (a) by browsing through the index and adding in selected terms; (b) by using the command `ADD PENGUIN' which explicitly adds the term penguin in the search; (c) by using the command `TERMS' which is a relevance feedback method of query expansion.

The `TERMS' command allows users to view terms found in the relevant documents. These are comprised of stemmed single words and Universal Decimal Classification (UDC) numbers. The terms and UDC numbers are weighted and ranked in decreasing order.

Another investigation of query expansion is the IRX experiment (HARMAN, 1988). Part of that experiment looked at some ways of extracting short lists of terms for query expansion gathered from relevance feedback, nearest neighbors, and term variants of the original query terms. IRX used the Cranfield test collection for this bench experiment and user interaction was simulated. The goal of the part of the experiment that looked at relevance feedback was to identify 20 terms that could be included in the `feedback terms' window of the interface.

ARAYA evaluated different strategies to improve query formulation using the SMART retrieval system and the CACM test collection. Query expansion was based on adding phrases that were generated, using syntactic or statistical methods, from the query terms or from the relevant documents. The statistical method generated phrases by combining every term in, for example, the query in every possible way. It was reported that user intervention did not perform much better than the automatic approach. However, the potential of interactive query expansion for improving performance was realized.

EFTHIMIADIS (1992b) investigated interactive query expansion using CIRT, a relevance feedback system (ROBERTSON ET AL., 1986), and searched INSPEC on DATASTAR with real users and their requests. Interactive query expansion was investigated from two different perspectives, (a) studying end users during the process of query expansion, especially the term selection process, and (b) studying the behavior of ranking algorithms for query expansion. The overall search results provided some evidence for the effectiveness of interactive query expansion. The user-based aspect of the research investigated the processes of interactive query expansion, term selection for query expansion by users, and the user perception, understanding, and assignment of term relationships from a knowledge structure, such as the INSPEC thesaurus (EFTHIMIADIS, 1994b; 1996) The algorithmic aspect of the research was concerned with the evaluation of the performance of ranking algorithms for query expansion. A new methodological approach for the evaluation of ranking algorithms was introduced where the yardstick for the evaluation of the ranking algorithms was the user choices of terms (EFTHIMIADIS 1993). The comparative evaluation of eight ranking algorithms for the ranking of the terms for query expansion is given in EFTHIMIADIS (1995).

8  Conclusions and Proposals for Future Research

Query expansion is an essential element in the retrieval process. The research reviewed not only attests to that but also signals that more research is needed. There are many approaches to query expansion and research work needs to be continued to:

As experimentation is moving away from the small test collections to the TIPSTER/TREC collection and to operational environments, a new wave of research is under way that will test the scalability of earlier approaches and try out new ones. Most research on query expansion has concentrated in developing a method and then testing it in terms of overall retrieval performance. While, this is an appropriate methodology there is lack of microevaluation of the tests performed. Such individual scrutiny of results, though admittedly a costly and onerous task, would provide qualitative clues for further explanation of the behavior of the method of query expansion that is studied. Thus far research on query expansion has not yet identified optimal levels for neither automatic query expansion nor interactive query expansion.

The conclusions from this review of query expansion are suggestive of many directions for future research. The main research directions can be divided into two streams: (a) research on methods for the automatic identification, collection and association of terms; term extraction and use in retrieval; ranking algorithms and of other methods for term inclusion in the search; and (b) user studies on query expansion with emphasis on how searchers select terms for query expansion; how do they perceive term relations, and how these affect retrieval performance. These are highlighted below.

8.1  Ranking Algorithms

There is a renewed interest in research on ranking algorithms for query expansion. By taking into consideration