Another way of getting ideas to choose terms for query expansion is by browsing the dictionary index of the database that is being searched or the cumulative dictionary index of all the databases available in the retrieval system, such as, DIALINDEX in DIALOG; this, of course, requires imagination and experience on the part of the searcher. Choosing terms from unstructured thesauri, dictionary indexes, or alphabetical word lists is a difficult task for the nonexpert searcher because these lists are usually created as the sum of all the words in all databases that are available in a particular vendor retrieval service at that time.
For the sake of simplicity the online search can be reduced to two stages: (1) initial query formulation and (2) query reformulation. At the initial query formulation stage, the user first constructs the search strategy and submits it to the system. At the query reformulation stage, having had some results from the first stage, the user manually or the system automatically or the user with the assistance of the system, or the system with the assistance of user tries to adjust the initial query and improve the final outcome.
Query expansion (or term expansion) is the process of supplementing the original query with additional terms, and it can be considered as a method for improving retrieval performance. The method itself is applicable to any situation irrespective of the retrieval technique(s) used. The initial query (as provided by the user) may be an inadequate or incomplete representation of the user's information need, either in itself or in relation to the representation of ideas in documents.
Query expansion may take place in the initial query formulation or in the query reformulation stage of the online search, or in both. The more general concept of query modification may involve deletion of terms as well. Query expansion, as depicted in Figure 1, can be performed manually, automatically or interactively (also known as semi-automatic, user mediated, and user assisted).
Two key elements need to be considered when applying any form of query expansion are: (1) the source, that will provide the terms for the expansion, and (2) the method that will be used to select terms to be used in the expansion (e.g., ranking algorithm). Figure 1 shows possible sources. One type is based on the search results, and it relates to the relevance feedback process. Documents retrieved in an earlier iteration of the search that have been identified as relevant become the source for the query expansion terms. The other type is some form of a knowledge structure (taking the phrase broadly) that is independent of the search process. Such knowledge structure can either depend on the collection, i.e., be corpus based, or be independent of it. Examples of collection-dependent knowledge structures are:
Examples of collection-independent knowledge structures are:
Tests on the searching function in information retrieval can be divided in two categories (SPARCK JONES, 1981, p. 237). One is concerned with operational systems and concentrates on, search logic and search behavior. The other tests are concerned with experiments on query modification by relevance feedback. The latter tests have been using test collections where the role of the user in decision making in query expansion and query modification either does not exist or at best is simulated. This division of searching tests in information retrieval research is associated with the differences in the retrieval techniques involved, i.e., Boolean vs. weighted retrieval, which also corresponds to commercial and experimental systems. Quite often research on searching reported for one side seems to be irrelevant or not applicable to the other and vice versa. However, there are strong reasons for attempting to bridge the gap in the case of query expansion. This is true for a number of reasons, including the role that knowledge organization and representation, semantics, and human information-seeking behavior have to play during the process. Some of the research questions that emerge when considering query expansion are:
This chapter reviews the ongoing research on query expansion that attempts to provide answers to these questions.
The level of the treatment of the subject is also varied. The aim for this review is to provide the overall framework of query expansion and present in a selective manner representative research under each type of query expansion.
The organization of this review is outlined mainly in Figure 1. The research reviewed is presented according to the three types of query expansion: manual, automatic, and interactive. Each type, as appropriate, includes discussion of the source of terms and the methods for selecting them. The chapter summarizes the literature on these issues with some historical background.
Because the method of term selection relates to retrieval techniques, methods are discussed here only to the extent that they refer specifically to query expansion. Retrieval techniques are discussed in the ARIST reviews by BELKIN & CROFT and KANTOR and search techniques in the ARIST review by BATES (1981). Other sources include FRAKES & BAEZA-YATES, the SIGIR (Special Interest Group for Information Retrieval) proceedings of the Association for Computing Machinery, the TREC (Text Retrieval Conference) proceedings (HARMAN, 1993; 1994; 1995), the journals Information Processing & Management and the Journal of the American Society for Information Science, and the classic monographs by VAN RIJSBERGEN (1979) and SALTON & MCGILL.
More specifically this problem refers to the situation in which a decision should be made as to which terms should be used in query expansion. In the information retrieval situation we are cursed with a high dimensionality term space, i.e., with a large number of attributes, which in this case are the terms. In other words, we are faced with the complete set of index terms in the document collection upon which we have to base the decision of selecting terms and which makes such a decision nearly impossible. A natural way of dimensionality reduction in information retrieval is the acceptance that the query terms themselves are good guides for suggesting the terms to be used by the system for predicting relevance.
The association hypothesis (VAN RIJSBERGEN, 1979, p.134), states that:
``If an index term is good at discriminating relevant from nonrelevant documents, then any closely associated index term is also likely to be good at this.''
If the association hypothesis is accepted, then in a manual system dimensionality reduction during query expansion is controlled, as well as achieved, in a number of ways by the searcher, who uses the initial query terms that are then clustered with other closely associated terms. Decisions about the association of terms are entirely at the discretion of the searcher and are usually determined by the quantity and quality of the knowledge structures held by the searchers themselves as well as the availability and use of searching aids, such as thesauri, concept maps, and word lists.
In the case of automatic query expansion(AQE), in a weighted or associated retrieval system, the problem of dimensionality reduction is being taken care of by computing the weighting function only for the terms specified in the query and maybe their close associates rather than for all possible terms. In interactive query expansion (IQE) both the system and the user jointly share the responsibility for achieving dimensionality reduction.
An important issue in any type of query expansion, however, is
how one defines which terms are closely associated
with the query terms.
An overview of the many different approaches for defining
these relationships is offered in this review.
There are many online searching models that have been developed based on the Boolean retrieval model. Search strategies, search tactics, heuristics and moves are reviewed from the point of view of query expansion.
A number of models have also been developed based mainly on the Boolean retrieval model and the modes of interaction between user and retrieval system. There have been attempts to incorporate these models, in whole or in part, in knowledge-based information retrieval (KBIR) systems. Examples of that work are presented as well as of recent studies that contain some examination of query expansion.
Good search strategy development involves the use of one's knowledge about online searching systems, indexing vocabularies and conventions practiced in text database construction. In other words, it requires a good understanding of the mechanics of the matching paradigm of information retrieval and how this is implemented in the system searched, a full understanding of the information need(s) and an ascertaining of the search objectives, e.g., high recall or high precision. The result of such analysis will eventually determine the subsequent search formulation, which is the statement or set of statements which express the necessary query in a form understandable to the online system.
A significant decision to be made during the search strategy formulation is the correct decomposition of the information need and the identification of the key concepts or facets.1 Then the choice of a particular search strategy would determine the way the concepts should be combined and would also suggest possible alternative actions.
A number of types of search strategies have been identified and named in the literature. Some of the most commonly mentioned and used strategies are the: building block, citation pearl growing, briefsearch, successive fractions, most specific facet first, or lowest postings facet first.
The building blocks strategy (Figure 2), attributed to Charles Bourne by MARKEY & COCHRANE, is that learned and most often used by many searchers. It is a modular approach where:
The advantage of this approach is that it provides a somewhat clearer history of the search logic, i.e., one that is easy to review and understand at a later time. The approach tends to follow and read like the actual query formulation. This feature also provides users with a useful tool for query reformulation as the search progresses further. The idea/process of OR-ing the terms that are conceptually related to the initial (first) term, which was used to identify a facet, is in effect an instantiation of query expansion.
The Citation Pearl Growing strategy operates in a completely different manner. The searcher starts with a very direct search on the most specific term for each of the concept groups in the search request in order to find at least one citation. That is instead of OR-ing all the terms in each facet, as in the building block above, the searcher selects the most specific/representative term of that facet. S/he then combines the single term facet representatives into a Boolean expression of the form
|
Citation pearl growing is characterized by its spiral-like effect which is a very dynamic and empirical manner of strategy development. The revision of retrieved citations online can be very helpful in the identification of search terms for addition or deletion at later stages in the query reformulation process. The payoff from this approach can be incorporated into any of the other approaches. The citation pearl growing approach can be viewed in principle as the ``manual equivalent'' of the relevance feedback search, which is discussed later.
Search strategy formulation is a highly unstructured problem and it requires a broad range of knowledge (e.g., knowledge of the user's specific problem domain, knowledge of the document retrieval system and its constituent parts, and other common-sense knowledge as well). Therefore, although it is being studied systematically it is still not a well understood process. Consequently, the search formulation process is difficult to automate.
The difference lies in the way a knowledge structure is transferred from one form to the other. Users have to develop a model of information seeking which matches that form. Once they have developed a model then they go ahead and (try to) search with it. The model is usually ``dictated'' (developed) by the form the structure takes and whether it is explicit or implicit. For example, the move from a paper-based to an electronic-based environment does not imply that what worked on paper will work in the electronic environment, and vice versa.
Although there may be more access points in an electronic environment access to information for the end users remains a problem because all too often the access points are ``hidden'' from them, i.e., are not available at the user interface. So, unless the searcher is well trained on the particular system and very familiar with the specific database structure, searching will always be done at a superficial level.
Therefore, the present day searcher has to behave like a detective in deciphering a crime, coming up with all sorts of tricks and being able to make associations and to take the search results beyond the obvious.
Over the years, a number of general rules of thumb have been developed to the aid of the (manual and online) information searcher. Some of them are empirically drawn rules (e.g. FIDEL, 1985; HARTER & PETERS) and some theoretically, which focus on the psychological properties involved in searching (e.g. BATES, 1979a;b). These have become known as information search tactics, idea tactics, heuristics and moves.
At this point it is necessary to redraw attention to the
distinction between the search strategy and the
tactics, heuristics, moves. Following Bates' distinction
the former is the plan for the entire search whereas the latter
are more limited, short-term plans. So, that a number of tactics
could be employed to make up the entire search strategy as seen
in Figure 3.
![[The relation of tactics to search strategy]](fig3-tactics-ss.gif)
There is an array of tactics, heuristics and moves available to the searcher to choose from depending on which stage of the search s/he is in at any one time. Tactics can be divided into groups according to the function they perform or according to the stages of the search at any one time. There are no restrictions on their usage, i.e. these can be used as often as thought necessary by the searcher, or as little, or not at all.
BATES (1979b) described 17 idea tactics, i.e. tactics to help generate new ideas or solutions to problems in information seeking. The focus of these tactics is psychological and they are intended to help improve one's thinking and creative processes in searching (Here searching is taken in its broadest sense. The tactics are also applicable to other environments, such as problem-solving). The idea tactics were based and evolved from the more domain specific information search tactics (BATES, 1979a). The 29 information search tactics are grouped into four categories of which the category on term tactics, i.e., tactics to aid in the selection and revision of specific terms within the search formulation, describes methods for performing query expansion. In a more recent paper BATES (1987) demonstrated how these tactics could be implemented online.
FIDEL (1985) suggested moves or changes in the search process to help the searcher improve the search. Her moves are complimentary to Bates' tactics and strategies (FIDEL, 1986, p.72).
Moves deal with search situations where the retrieved set (a) is too large, (b) too small, or (c) off target. These are divided according to the searching style employed by the searchers, i.e. operationalist and conceptualist. Operational moves rely on the trade-offs between free-text and controlled vocabulary searching and use the search system capabilities to modify the retrieved sets without changing the specific meaning of the query concepts. Conceptual moves, on the other hand, modify retrieved sets by changing the meaning of the query concepts and are often supported by subject-related analyses of requests and by the structure of index languages. There are 18 operational moves and 12 conceptual moves. The conceptual moves refer to ways of expanding the search.
HARTER & PETERS also suggested that general search heuristics should be adapted to achieve better search results. According to their definition ``...heuristics are general rules of thought or action, mental operations, tactics, behaviors, or attitudes that tend to produce useful results in certain problem-solving situations...'' [p.407]. They presented six classes of search heuristics relating to: Philosophical attitudes and overall approach; Language of problem description; Record and file structure; Concept formulation and reformulation; Recall and precision; Cost/efficiency. HARTER & PETERS further provided several general guidelines for each of these classes which users could follow in their search.
CHEN & DHAR observed searchers and based on that they proposed a semantic network representation to capture the subject area knowledge and developed a process model for query reformulation and query expansion. There are two approaches in their process model, semantic browsing and retrieval by instantiation. In the former, searchers obtain their search terms based on one of five semantic relations: synonym, broader, narrower, adjacent, or disjointed term. In the latter, searchers may obtain new terms or other cues for searching based on citation information from the title, author, publisher and the index terms fields.
In conclusion, a manually performed query expansion involves the various search tactics (or moves) suggested by the work of BATES (1979a;b; 1987), FIDEL (1985; 1986), and HARTER & PETERS. The selection of the appropriate tactics is critical to the ultimate search result because the tactics employed will directly affect the progress of a search. Some of the search tactics will have been pre-determined in accordance with the search strategy. Others must be chosen and performed by the searchers at their own discretion while online. Finally, others will have to be modified, abandoned or changed as a result of the feedback received through the interaction. The successive examination and interpretation of intermediate results alters the searcher's knowledge structure, i.e. searchers may obtain a new perception or understanding of the search topic once they have started the online interaction.
An example of a relatively sophisticated approach to this modelling appears in the work of Belkin and his colleagues, who start by investigating the functions carried out by intermediaries (BROOKS). They propose a model in which different expert systems, each dealing with one aspect of the intermediary's knowledge structure, would be all integrated as a distributed expert system and which would simulate the behavior of the human intermediary (BELKIN, SEEGER & WERSIG, 1983). A number of Distributed Expert Based Information System (DEBIS) prototypes have been developed based on the MONSTRAT model (BELKIN ET AL., 1987). These have implemented only some of the expert functions proposed, e.g., I3R (CROFT & THOMPSON, 1987), CODER (FOX, 1987) and MARIAN (FOX ET AL., 1993). An extension of the MONSTRAT model is the MEDIATOR model (INGWERSEN, 1992).
The dynamic nature of the online interaction makes it exceedingly difficult to formalize it in an algorithmic process. Hence the successful implementation of search tactics depends largely on the experience and the correct judgment of the searcher. Therefore, there have been a number of attempts to incorporate the search strategy techniques discussed as well as the tactics, heuristics, moves in knowledge based `online searching aids', as for example in IR-NLI (BRAJNIK ET AL., 1986; 1988), PLEXUS and TOMESEARCHER (VICKERY ET AL., 1987; VICKERY, 1988), EP-X (SMITH ET AL., 1989) and MICROARRAS (GAUCH & SMITH, 1989; 1993).
PALMQUIST & BALAKRISHNAN used word association tests to elicit query expansion terms and enhance users queries. The users' original vocabulary and the word associations added to enhance the original vocabulary were compared with the titles and abstracts retrieved. The expanded vocabulary (word associations) was found to be significantly present in the abstracts and titles. The extra terms were found not to be discriminatory between relevant and nonrelevant documents.
SARACEVIC & KANTOR (p.201, 1988b) among their many findings reported that in general none of the effects of tactics and efficiency measures on relevance odds were large. Tactics and efficiency were measured as the number of commands, command cycles, search terms, preparation time, and connect time used. More search cycles, which allow for feedback, tend to produce better searches, while more search terms, more preparation time and more overall search time resulted in worse searches.
In a study of the searching behavior of 47 professional intermediary online searchers FIDEL (1991a,b,c) developed a model of the selection routine, i.e., a formal decision tree that represents the intuitive rules searchers use when they select search terms. The routine delineates the terminological conditions which lead to the selection of each option. As such it can be incorporated into the knowledge base of intermediary expert systems as discussed in FIDEL & EFTHIMIADIS.
In a study of how humanities scholars operate as end users of online databases, the terms used in the searches were analyzed and vocabulary categories identified (BATES, WILDE & SIEGFRIED; SIEGFRIED, BATES & WILDE). A typical search tended to be simple, using one-word search terms and little or no Boolean logic. Design recommendations for thesaurus development and the design of retrieval systems for the humanities are made (BATES, 1994).
In a study of undergraduate end users searching CD-ROM databases EFTHIMIADIS (1994a) reports that most users relied mainly on their class notes as the main/only source of terms for the query formulation and query expansion of their searches. Overall, there was very little use of controlled vocabulary or of browsing of the database dictionary. Similar results are reported in a study of mediated searches by SPINK where the most effective sources of terms were the users written statements, the user supplied terms during the interview and terms selected from specific records fields. ALLEN studies the ways in which users learn vocabulary during a search and use that vocabulary to alter and expand their searches.
KRISTENSEN & JÄRVELIN demonstrated the positive effect on recall of a manually-constructed searching thesaurus in a limited domain (economics & environment). KRISTENSEN used an expanded version of the searching thesaurus and by adding loosely-defined synonyms, related terms, and narrower terms achieved large improvements in recall at the expense of a small drop in precision.
The typical automatic relevance feedback operation involves an initial search with a user-supplied query and an initial retrieval of certain documents. Then, from a display (usually of titles or abstracts of the retrieved documents) the searcher identifies/chooses some relevant documents. Those documents are used to modify the query by reweighting the existing query terms and/or by adding terms that appear useful and by deleting terms that do not. This process creates a new query which resembles the relevant documents more than the original query does.
If we consider the information involved in the user-system interaction in simple relevance feedback, it is like this:
In its simplest form relevance feedback could be based on one document only. For example, after displaying a single document, the system could invite the user to see more documents like the one on display. Here the system, e.g., in an OPAC, could use the classification scheme and present the user with books of the same class-mark as the first viewed.
Another simple form is when judgment is based on a set as a whole (PIETILAINEN). Here, a set derived from a previous search becomes the seed for the new query formulation. The method uses `searchonyms' (ATTAR & FRAENKEL, 1977; 1981), i.e., terms which might be regarded as synonyms for the purposes of a particular enquiry and derived from terms contained in the seed set.
Automatic relevance feedback, generally, can be implemented in various ways depending on the retrieval technique used, e.g., vector space, probabilistic, etc., and also on the methods used to select terms for the feedback query. We can distinguish four term selection methods for query reformulation and query expansion.
In all cases, after the initial query formulation, the only form of feedback to the user is documents, and from the user is choices of documents. The query reformulation and query expansion is entrusted entirely to the retrieval system. Finally, query expansion can be performed with or without term reweighting. Query expansion without term reweighting, may involve the addition of terms from a knowledge structure, such as thesauri, term classifications, etc. Most research on relevance feedback and query expansion has been done using both query expansion and term reweighting.
The retrieval system with the help of the user (through the online relevance judgments) is trying to model the probability distributions of the relevant and the nonrelevant documents, so that the probability of relevance of the documents can be estimated.
At least one relevant document must be identified by the user in order to be able to continue with the search. The sample size of just one document, however, is the absolute bare minimum required for a relevance feedback or query expansion search. A variety of sample sizes has been used in IR experiments.
In many relevance feedback experiments the sample is defined at a cutoff level of the 10 or 20 top-ranked documents. This is a commonly used method for getting a sample. These documents are then examined for relevance and those found relevant become the sample for the feedback iteration and the query expansion search.
The sample size that is of interest here is that of the set of documents judged relevant. The former, i.e., the cutoff level, is of concern only in terms of how many documents should be retrieved by the initial search or of how many documents should the user have to see while online. For example, HARPER & VAN RIJSBERGEN, HARPER, VAN RIJSBERGEN ET AL., SMEATON & VAN RIJSBERGEN (1981), SPARCK JONES (1979a;b; 1980) and more recently HARMAN (1988; 1992) have used in relevance feedback and query expansion experiments the 10 or 20 top-ranking documents.
A conclusion of these experiments was that a small sample of relevant documents could be adequate as the basis of the reweighting of terms. SPARCK JONES (1979b, p.143) used a sample of 3-4 documents and in another experiment 1-3 documents (SPARCK JONES, 1980, pp.328-329) and HARPER (p.6-3) suggested that at least one document is needed. This suggestion does not exclude the use of a larger sample. On the contrary, it is believed that the larger the sample of relevant documents the better the estimation should be. However, the problem of selecting an optimal sample size is still very much an open IR research issue. Some suggestions however exist and for some reason all tend to converge to recommending a sample size of 5 relevant documents, for example, HARPER (p.6-7). It is also worth mentioning here that MARTIN (p.73) and WHITE (p.34) in discussing ZOOM both propose without providing further justification, that ZOOM will be more effective if its frequency analysis is based on 5 relevant documents. The OKAPI online catalog requires at least 3 relevant documents in order to perform query expansion (WALKER & DEVERE) and EFTHIMIADIS (1992b) used 5 relevant documents.
A question then becomes how many documents should the user see before the desired number of X (e.g., 5) relevant documents is reached. It is believed that there should be some flexibility on the number of documents seen rather than strictly follow the 10 or 20 cutoff level of earlier IR experiments, especially when using an interactive system.
In the absence of user assigned relevance judgments an alternative method has been used. In this the IR researcher declares that the top X documents will all be treated as relevant. These documents then become the sample upon which estimates are based for term (re)weighting and/or terms get extracted for the expansion. This technique is well known and has been used by Salton and Sparck Jones in early experiments and more recently in TREC, e.g., BUCKLEY ET AL. (1994; 1995); EFTHIMIADIS & BIRON; EVANS & LEFFERTS.
Based on these experimental findings it is advisable that users should judge relevance of documents by looking at the most complete representation available, e.g., in a bibliographic database on both titles and abstracts.
Relevance is a continuous variable and it has been established in major studies of relevance judgments, that it is an over-simplification to collapse a variety of degrees of relevance into yes/no decisions (CUADRA & KATTER; REES & SCHULTZ). For a recent discussion of relevance see SCHAMBER. However, the adoption of the binary definition of relevance is a required simplification of this complex notion which facilitates the calculation of relevance weights.
During the online relevance assessments of documents by users we are faced with two issues. One is that most present systems require relevance judgments on a binary scale. The other is that the user must respond on the question of the relevance of a document by judging each document for its relevance, i.e., the extent to which the subject matter of the document is about the query, rather that its usefulness. This treatment of relevance is necessary for relevance feedback systems to be able to provide an estimate of the probability of relevance that will be effective. In other words a document should be judged as relevant to a query only on the merits of the information it conveys and not by comparing it with what we already know or with what we have learnt from the previous documents we have seen during the search, or with how useful we perceive it to be.
The distinction of topical relevance, as defined here, and usefulness is that a document which is relevant but not useful to a user, because for example, the user knows about it, or the document is outdated, or it is written in a foreign language, is very important for relevance feedback systems. The importance of this issue lies in the fact that the online relevance judgments obtained are used by the weighting scheme for the estimation of the relevance weights. Although, one can argue that feedback systems would try to predict usefulness, including factors such as date and language, for most it seems appropriate to ask the users for topical relevance judgments as opposed to usefulness.
In helping users understand the difference and make the relevance judgments in the above context, one, for example, can make use of the question: ``regardless of whether you have seen this document before, is this the sort of thing you are looking for? [y, n]'' (EFTHIMIADIS, 1992b), which is a modified question of the one used in the OKAPI online catalog (WALKER & DEVERE, p.26) and has been attributed to Robertson. A ``YES'' or ``NO'' answer can be given as a response to the question, which corresponds to the binary relevance. If the document is partially relevant or if the searchers for some reason are in doubt they should answer ``YES''. An additional reason behind this suggestion is that it seems better that the system retrieves marginal documents rather than excludes them. The user then, while online or offline, has the opportunity to decide again about the relevance and usefulness of a document.
What is known from IR research with respect to the relationship that holds between term frequency and term value and the effect on retrieval (SPARCK JONES, 1971; SALTON, 1975; VAN RIJSBERGEN, 1979): is that very frequent terms are not very useful; middle frequency terms are quite useful; infrequent terms are likely to be useful but not as much as the middle frequency terms; very infrequent terms are useful terms in the sense that when present they are good indicators of relevance. From this knowledge it can, therefore, be hypothesized that a good term ranking algorithm would bring the middle frequency terms near the top of the list.
This section discusses term weighting and ranking as it relates to term selection for query expansion. There is a plethora of ranking algorithms reported in the literature (SAGER & LOCKMANN; MCGILL ET AL; RO). These algorithms attempt to quantify the value or usefulness of a query term in retrieval. Formulae may estimate the term value based on some qualitative or quantitative criteria. The qualitative arguments are concerned with the ``value'' of the particular term in retrieval whereas the quantitative argument may involve some specific criterion such as a proof of performance. The relevance weighting theory is an example of the latter.
ROBERTSON (1986) suggested a modification to the F4point-5 formula (ROBERTSON & SPARCK JONES, 1976), referred to as F4MODIFIED, which takes into consideration the addition of new terms to the original query. The modified formula is:
| (1) |
It was further suggested that the F4MODIFIED could be used in two ways (ROBERTSON, 1986, p.86). In automatic query expansion every term from the relevant document would be weighted using c = n/N and added to the search. In interactive query expansion the terms would be weighted in the same fashion and those selected by the user would then be weighted with the F4point-5 formula.
Apart from the theoretical argument for F4MODIFIED there were not any empirical data available. EFTHIMIADIS (1992b) tested the ranking behavior of F4point-5 and F4MODIFIED and found that their ranking is almost identical bringing at the top of the lists terms with low collection frequency (n) as well as low frequency in the relevant document set (r).
This type of ranking, is explained by the nature of the relevance weighting theory which assigns a higher degree of importance to the low frequency terms and therefore brings them at the top of the ranked list.
The independence assumption of the relevance weighting theory is that terms are distributed independently of each other in relevant documents and also that terms are distributed independently of each other in nonrelevant documents. In the query expansion stage of search an additional assumption should be made which considers statistical independence between the query expansion term and the terms in the entire previous search formulation.
The distribution of the relevant items for the initial formulation, for example, is further divided into two distributions: one which describes the relevant items that contain the term and one that describes the relevant item that do not contain the term. These two new distributions are identical according to the independence assumption. In other words, the presence or absence of the query expansion term does not affect the initial distribution. When the entire collection is considered, these assumptions predict a positive association between an initial query formulation and a good new term for query expansion.
The inclusion of term t in the search formulation with weight wt will increase the effectiveness of retrieval by
|
This means that irrespective of the weighting function (wt) used the rule for deciding the inclusion of a term in a query expansion search should be based on the ranking of WPQt instead of wt. Substituting the weighting function and the probability of relevance in WPQt with r, R, n, N we get:
| (2) |
The relevance weighting theory attempts, via the Probability Ranking Principle (ROBERTSON, 1977), to optimize the entire length of the search curve from high-precision to high-recall. This is expressed by the wt component of the above formula which assigns greater importance to the infrequent terms. However, a model that determines which term(s) to add for query expansion will have to lead to a preference somewhere between the very infrequent terms that lead to high precision and the frequent terms that lead to high recall.
In the above formula this is achieved by pt - qt. This component, like the PORTER formula (equation 3), is influenced by the frequency of occurrence of a term in the relevant document set as well as the term's frequency in the collection. Therefore, the multiplication of the two components results in the effect which seems to be required by the model for term selection in query expansion.
| (3) |
Because this formula is used to rank terms for query expansion the r/R portion never becomes 0, (because in order to use the formula there must always be at least one document containing the term that has been judged relevant) and it has a maximum value of 1 (this happens whenever r = R). In other words, this portion of the weight can take values that fall within the range: 0 < r/R £ 1 The n/N fraction however is influenced by a term's frequency (n). So that, the higher the term frequency (n) the higher the result of the fraction.
The PORTER formula seems to place more emphasis on terms that occur frequently in the relevant document set.
| (4) |
|
EMIM reduces to the F4point-5 weight when the ``degree of involvement'', i.e. the joint probabilities, are all unity. The EMIM weight is claimed to make use of the statistical dependence between index terms over the entire collection.
ZOOM provides the automatic frequency analysis of phrases, single words, codes, or a combination of these contained in a selected set of references. Once a set of records is generated in a file the searcher may ZOOM the set. The ZOOM command can now analyze up to 20,000 records. However, the basic default analysis sample is the controlled (CT) and uncontrolled terms (UT) fields of the 50 records across the last selected set. The fields of the record that can be used to ZOOM vary from database to database. The searcher may request any of these fields, either on their own or in any combination.
ZOOM processes the records in the set and the phrases and/or single words of the analysis are displayed in columns. All terms are ranked in descending order of their frequency of occurrence in the sample set. Within ties, i.e. whenever there are more than one term with the same frequency of occurrence, terms are ranked in alphabetical order. It is important to note that every occurrence of a term is counted. In addition, the analysis takes into account all terms including stopwords. However, all punctuation is removed in the display of the ZOOM text analysis.
The R-LOHI ranking algorithm ranks terms according to r, i.e., their frequency of occurrence in the relevant document set in descending order, and resolves ties according to their term frequency, n, from low-to-high frequency.
It was hypothesized that the R-LOHI algorithm would have a similar ranking to PORTER and a performance approaching that of WPQ and EMIM. The evaluation of the performance of the R-LOHI algorithm against that of the other algorithms are reported in EFTHIMIADIS (1993; 1995) and EFTHIMIADIS & BIRON.
Given therefore that there is, firstly, a large number of experimental systems that incorporate query expansion and modification techniques and, secondly, that it is often difficult to separate the expansion process from the overall retrieval process, I intend in this section to look only at a selection of the ideas in the area of automatic query expansion. Another related area for query expansion is the work done on natural language processing and syntactic analysis. This, although interesting, is a very large area and will not be discussed further in this review. Also excluded are from the review are relevance feedback and any retrieval techniques, including genetic algorithms, neural networks, etc., if these do not expand the query.
A procedure for the automatic adjustment of queries based on terms in the descriptor field (DE) of a database, i.e., by utilizing the manual indexing, has been described by VERNIMB. The procedure starts with the user having identified at least two relevant documents although as Vernimb suggests, searching effectiveness is improved the higher the number of known relevant documents. The system then examines the descriptors of these documents, which are 5-6 per document in this particular database. Then, it arranges them according to their frequency of occurrence in these documents, with the most frequently occurring descriptors first, and so on. Descriptors with the same frequency of occurrence, i.e., ties, are ordered in increasing order of their collection frequency (n). The one with the smallest collection frequency is placed first and that with the highest last (VERNIMB, p.340).
Partial queries are then generated for each one of the relevant documents. These queries consist of the descriptors (DE) that index the document and which are combined with the AND logical operator. For example, if there are seven DE in the first document then the query is: DE1 + DE2 + ¼ + DE7 . These partial queries are subsequently `loosened' by stepwise omission of descriptors eliminating the lowest ranked descriptor first until at least 10 new documents are found. The partial queries are then combined in an OR statement to form the `total query.' The next step involves a procedure for establishing the most useful partial queries based on relevance feedback of documents retrieved. A new query is formed and new documents are retrieved which after relevance judgments are made may be used for re-initiating the whole process. The procedure just described is wholly automatic and hidden from the user who only sees the retrieved documents.
DILLON & DESPER and DILLON, ULMSCHNEIDER & DESPER have
described an algorithm for automatically incorporating search
terms into a query using a form of relevance weighting known
as prevalence weighting. Positive and negative prevalence is
computed based on the occurrence of terms in relevant and
nonrelevant documents that are retrieved from the initial search
query. A number of threshold values for the prevalence weights
exist so that groups of terms are assigned to a particular
category depending on their prevalence weights. A new Boolean
query is constructed by OR-ing together groups of terms
according to their position in the prevalence category. Terms
in the highest prevalence category are added (ORed) as single
terms. Terms from the second highest category are ANDed as
pairs of terms and so on. Finally bad (negative weight) terms
are employed and these are NOTed. Any document containing one
of these terms is not retrieved. Thus a query would be phrased
as (DILLON & DESPER, p.202):
a series of positive expressions (Pi) which are ORed:
|
|
|
CITE (Current Information Transfer in English), was created as an interface system to Medline (DOSZKOCS & RAPP). It allows for a natural language input of the query which is then translated by the system. CITE makes use of weighting and ranking, using inverse document frequency and it has a relevance feedback facility. Its query expansion capability takes the form of modifying the search query using the MeSH headings of the retrieved document set. The MeSH headings of all the selected documents are incorporated into the search query and combinatorial search is performed to retrieve similar documents. Doszkocs and Rapp also mentioned other methods of query expansion that may be implemented in the system, such as using both the additional free text terms from the title, abstract and MeSH headings; or using those free text terms and MeSH headings that appear in relevant documents exclusively, i.e., do not appear in nonrelevant documents. These and other possibilities of automatic query expansion as well as a semi-automatic query expansion facility were implemented in subsequent versions of CITE (DOSZKOCS, 1983) and are discussed in the next section.
The OKAPI experimental online catalog has been influenced in some ways by the interactional aspects of CITE (WALKER & DEVERE). At the query input stage, OKAPI uses a dictionary table of substitution terms which allows common synonyms to be searched automatically, i.e even if the user has not entered them. For example if `Britain' is entered then OKAPI automatically includes Great Britain, GB, UK, United Kingdom, etc. (WALKER & JONES). OKAPI expands a query by selecting the `best' terms from a list containing the original query terms together with terms extracted from all the records which the user has judged relevant. Terms are weighted using a scheme based on the F4 formula and which gives a higher weight to terms that occur in more of the relevant document and a lower weight to those that do not. The list of terms is then sorted by descending term weight. Term selection for query expansion starts at the top of the list providing the following conditions are satisfied: the user has not already seen all the records indexed by this term; the term is not in the list of `dubious' words, i.e., a list of common words which cannot be stopped but which are not very useful in retrieval either, such as `analysis', `introductory', etc.; the term weight is positive; and fewer than 16 terms have been selected.
The main reason behind the choice of implementing only an automatic query expansion module was a concern that catalog users would not want a high degree of involvement and would probably not understand what the system was doing (WALKER, 1989; 1990). Recent versions of OKAPI (at City University and at UCLA) include (different) client graphical user interfaces, X-OKAPI and GIRAFFE respectively, with interactive query expansion modules. HANCOCK-BEAULIEU & WALKER report on the evaluation of the automatic query expansion facility in the OKAPI online catalog in an operational library setting. Data was collected using a combination of transaction logs, search replays, questionnaires and interviews. The results show that automatic query expansion was beneficial in a substantial number of searches. User intentions, the effectiveness of the ``best match'' search and user interaction were identified as the main factors that affected the usage of the query expansion facility which is an option available to users.
HARMAN (1992) using the Cranfield 1400 test collection demonstrated the importance of query expansion in addition to term reweighting. The addition of 20 well-selected terms and the execution of multiple feedback iterations improved retrieval performance. KWOK extended the neural network used in his probabilistic component approach to include query modification and query expansion. Tests with the CACM and CISI collections demonstrated that automatic query expansion using NN and learning from relevant documents was most effective with the addition of approximately 30 terms. Learning from irrelevant documents did not lead to better retrieval performance. BUCKLEY ET AL (1993; 1994; 1995) experimented with adding the 300-500 top ranked words and 50 phrases to the query. Massive query expansion was particularly effective in runs where relevance information is available (routing runs). The performance of five ranking algorithms for query expansion (WPQ, EMIM, PORTER, R-LOHI, R-HILO) was compared in the context of the TREC-2 environment and R-LOHI improved retrieval performance on both routing and ad hoc searches (EFTHIMIADIS & BIRON). Best performance was achieved by adding 10 terms from 10 documents. For the automatic query expansion runs of OKAPI 20-40 terms were added from the top 20-30 documents (ROBERTSON ET AL., 1994; 1995). THOMPSON ET AL. used a query expansion approach that combined the three methods reported in HAINES & CROFT.
The most extensive work in the area of term clustering was done by SPARCK JONES (1971; 1973b;c; SPARCK JONES & BARBER; SPARCK JONES & BATES; SPARCK JONES & JACKSON). She formed groups of related word stems based on co-occurrence in documents. Her experiments explored a number of different clustering strategies and she found the effect of all these strategies relatively similar. For as long as the strategies were restricted to forming relatively small clusters their performance was best.
SPARCK JONES (1971, p.208) reached the overall conclusion that the use of an automatically obtained term classification does enable better retrieval performance compared to the performance obtained with the unclassified terms alone. She suggested that this occurred when frequent terms were being left un-clustered, infrequent terms were grouped into clusters and strongly connected terms were clustered by applying a matrix threshold SPARCK JONES (1971, p.195). This result was challenged by MINKER, WILSON & ZIMMERMAN who evaluated the effect of adding cluster-related terms to queries. They concluded that the expansion of queries by cluster-related terms was at best marginally useful and generally produced worse results than un-expanded queries. However, SALTON (1973) and SPARCK JONES (1973a) questioned Minker's experimental design. Results, similar to Minker's, obtained by other researchers (LESK, 1969) have for the most part also been marginal. LESK (1969) concluded that manually constructed thesauri were superior to automatically constructed ones. Recent work by CROUCH provide some encouraging preliminary results for the latter.
Sparck Jones' subsequent research provided rather pessimistic conclusions because retrieval performance was significantly improved by term clustering only on one small collection (SPARCK JONES, 1973b;c; SPARCK JONES & BATES).
On the whole, the explanations given for the poor results from term clustering focused on the properties of the collections that were used to derive the clustering. In LESK's (1969) case it was the small size of the collection used, while Sparck Jones in a detailed analysis of her experiments concluded that the fault lay in insufficient differences in the vocabulary between relevant and nonrelevant documents (SPARCK JONES, 1973b;c). She also stressed the need to use strongly connected non-frequent terms (SPARCK JONES & BARBER).
An additional explanation given by LESK (1969) is that the clusters captured only terms whose meaning was purely local (to the small collection) and did not reflect their general meaning in technical text. SPARCK JONES (1971) also argued that clustering of words limits ambiguity instead of emphasizing it. According to her a match on two words from the same class is very suggestive of which word sense is present. However, since matching on two words in the same cluster is much less common than matching on a single word, it seems unlikely that this is an effective disambiguation strategy. Since phrases (bound descriptors) are much less ambiguous than single words, they might be more effective. Based on the complementarity of the term clustering and syntactic phrase formation methods LEWIS & CROFT combined the two approaches. Small improvements in retrieval effectiveness were shown using the CACM collection.
Experiments on three different test collections demonstrated that relevance feedback searches using expanded queries were superior to simple co-ordination level match searches for which these was no expansion and no relevance feedback information available (VAN RIJSBERGEN, HARPER & PORTER). But, it was not clear whether this improved performance was due to the feedback or to the expansion. SMEATON (1982) and SMEATON & VAN RIJSBERGEN (1983) have investigated the effects of automatic query expansion using three different sources of new terms: (a) Maximum Spanning Tree (MST). A MST is a term-term dependence structure similar to a thesaurus but derived from statistical associations between terms. It takes the form of a tree structure where every term in the collection points to or is connected to at least one other term. (b) Nearest Neighbors (NN). A nearest neighbor to a given term is that term which is the most strongly related statistically to the given term. (c) Using index terms from the relevant documents found so far.
The three strategies were tested on the Vaswani test collection. The F4 and EMIM weighting formulae were used and query terms were weighted according to one of the formulae. Relevance judgments were simulated and automatic query expansion was performed using MST and non-MST methods. A test was also performed where no query expansion was performed and where randomly selected terms were incorporated in the strategy. It was found that the above three query expansion strategies all had a detrimental effect on overall retrieval effectiveness (measured by averaging the precision and recall figures computed for each query). The effect in order of increasing degradation (i.e., from best to worst performances) of retrieval was: (a) no modification, (b) randomly selected terms added to original query, (c) terms derived from known relevant documents, (d) terms derived from the MST, (e) terms as nearest neighbors.
It was also found that the more extra terms that were added the greater the degradation in retrieval effectiveness. SMEATON & VAN RIJSBERGEN (1983) have put forward several reasons for this detrimental effect on overall retrieval effectiveness. One reason given is that the probabilities were estimated from little relevance information and so could not be estimated accurately and that the situation was compounded by the addition of extra search terms which resulted in a decline in retrieval performance. They also stated that in theory the modifications that should have yielded the best retrieval (i.e., MST and NN), in fact yielded the greatest degradation. This is because the effects that poor probability estimations have on retrieval are `clustered' into an area around the query. MST and NN terms will be highly related to the original query terms and will co-occur with the original search terms more often and therefore have a significant influence on the top ranked documents. Another reason given is that the query modification strategies may have been implemented too early in the overall retrieval strategy.
A weighting function which utilizes the principle of fuzzy sets has been also suggested by SMEATON (1984). This is achieved by assigning a secondary weight as well as a relevance weight to each of the search terms which indicates the terms relative degree of importance to the user's search. The `degree of importance weights' are combined with the relevance weights. This effectively makes a search term set a partial membership or fuzzy set. New sets of search terms are generated from relevant documents using a matching function whereby all terms from relevant documents are added automatically. In order not to swamp the original query each term is weighted according to how important it is to the overall retrieval. This is achieved by measuring the similarity between the original query and each relevant document and grading the contribution of the new terms obtained from a particular relevant document in accordance with how `similar' that document is to the query. The retrieval strategy therefore uses the user provided relevance feedback to compute a fuzzy set of search terms composed of the initial query terms and all the terms from all known relevant documents. SMEATON (1984) concludes that this strategy has not yielded any significant improvements in retrieval effectiveness over a retrieval strategy that uses conventional weighting of the initial query terms.
Therefore, to date the overall conclusion of the above mentioned research has been that query expansion based on term co-occurrence data is unlikely to bring significant improvements in the performance of document retrieval systems. There is recent evidence, however, which has identified a substantial limitation on the use of term co-occurrence data for automatic query expansion (PEAT & WILLETT). It has been pointed out that the above limitation arises from the characteristics of the coefficients that are used to measure the similarity between a pair of terms. Peat and Willett have concluded that:
Their findings provide a rationale for the lack of success attained in the studies which used term co-occurrence data for query expansion. These also explain the findings by SPARCK JONES (1971) that the best results were obtained if only the infrequent terms were clustered and if the more frequent terms were left un-clustered, and by SMEATON & VAN RIJSBERGEN (1983) that query expansion by the addition of randomly selected terms gave better results to any of their methods based on co-occurrence data. Apparently, this last point by itself (i.e., that randomly selected terms gave better results) could have been taken as an indication that something might have been wrong.
ELKALIFA suggests that the failure of these studies is mainly due to the heterogeneity of the collections used rather than to the inefficiency of the techniques themselves. ELKALIFA used a combination of cluster and term association techniques for query expansion. Cluster analysis techniques were used to subdivide the document collection into small more homogeneous collections. Term association techniques were then applied to determine which terms could be used to expand the original request. This technique was applied on a very small test collection of 150 documents and reported positive results. Three search strategies were formulated for each request consisting of: (a) the initial query terms; (b) terms extracted from the entire collection; (c) a combination of terms extracted from specific clusters and the initial query terms.
A thesaurus is generated manually by subject experts and is used for indexing and retrieval, e.g., INSPEC, MeSH. A searching thesaurus (PITERNICK) is not used during the indexing of the database but only during searching. In contrast to conventional thesauri, which lead the searcher to the preferred search term, the searching thesaurus does not seek to standardize the term choices by the searcher, instead it provides alternatives to the terms that the searcher has in mind, such as synonyms, quasi synonyms, antonyms, and related terms.
In an automatically constructed thesaurus (association thesaurus) the associations between the terms can be obtained automatically by using the occurrence characteristics of the terms across the documents in the collection.
Pseudoclassification, is a user-dependent term classification which takes into account user relevance judgments of certain documents with respect to certain queries in order to build term classes designed to retrieve the relevant documents and to reject the nonrelevant documents (JACKSON; SALTON, 1980).
In determining the term-by-term relationships for the pseudoclassification a similarity measure, a set of queries, the user's relevance judgments, and a cutoff threshold value are needed. Then, whenever a nonrelevant document is retrieved to a query because the similarity value is high for the document to be retrieved, the term-document correlation values of that term-document pair are lowered so that the likelihood of this error happening again is reduced. This is achieved by arranging the terms in the query and the document so that enough term class mismatches would occur to exclude the retrieval of the document. When a relevant document is not retrieved because the similarity value between the query and the document is not high enough for it to be retrieved, then the term-document correlation is increased so that this document will be retrieved next time (JACKSON). This is achieved by forming suitable term classes so that the similarity between the document and the query is strengthened by matching these term classes.
YU (1974; 1975) developed a heuristic algorithm for constructing term classes that was more efficient than Jackson's exhaustive search algorithm. YU & RAGHAVAN and RAGHAVAN & YU proposed a method of thesaurus construction (pseudothesaurus) based on relevance judgments that instead of constructing large term classes identified pairs of related terms. This method was shown to be faster than constructing term classes but as effective in improving retrieval performance. Their ad hoc method however involves many parameters that need to be determined empirically. To alleviate these problems RAGHAVAN & JUNG adopted a machine learning approach for the construction of the pseudothesaurus.
WONG, CAI & YAO proposed an adaptive bilinear model, three layer feed-forward neural network, for the computation of the term associations. WONG & YAO (1990; 1993) proposed a probabilistic approach for utilizing the relevance information to estimate the term relationships.
PHRASEFINDER is an association thesaurus used with the INQUERY system (JING & CROFT). Query terms are searched in the PHRASEFINDER database and the retrieved phrases are then used to expand the initial query. Results of its usage demonstrated consistent retrieval performance improvements through automatic query expansion (JING & CROFT; BROGLIO ET AL.).
Experiments in query reduction had detrimental effects on retrieval performance whereas query expansion using an association thesaurus improved retrieval performance (LU & KEEFER).
MATCHPLUS uses a vector representation for words such that the query expansion is a direct, natural result of the representation (GALLANT ET AL.). These vectors are learned from free-text examples of the corpus. As a consequence, the vector set for the stems represent a domain specific lexicon whose contexts have driven the inter-vector relationships. MATCHPLUS' approach is similar to the Latent Semantic Indexing (LSI) approach which can be viewed as a kind of query expansion (DEERWESTER ET AT.). In the reduced-dimension LSI space words are correlated, i.e., words that are used in similar documents will be near each other in the space.
First-order thesauri are discovered in CLARIT through the indexing of a representative sample of documents of a subject domain (EVANS & LEFFERTS). For example, in the TREC environment, for each routing topic the document collection used to establish a thesaurus consists of the training set of relevant documents for that topic. The process clusters phrases through a combined syntactic and statistical analysis of the text. In STRZALKOWSKI and STRZALKOWSKI & CARBALLO term similarities are determined by the information contribution function which uses term co-occurrence in a syntactic context, with poor results so far.
WALKER & JONES used stemming, automatic spelling correction and cross-reference tables to improve subject retrieval in the OKAPI online catalog. Stemming was based on the weak and strong variants of the Porter stemmer (PORTER, 1980). The weak stemming was recommended as it significantly increased recall in their experiments. KROVETZ describes a new morphological processing (stemming) approach as an alternative to PORTER's (1980) stemmer.
HARMAN (1991) used the Porter, Lovins and S removal stemmers on the Cranfield 1400, Medlars and CACM in the IRX system and found that none of them significant improved retrieval performance. Other sources for query expansion terms that have been tried without success include string similarity measures and Soundex codes (HENDRY, WILLETT & WOOD). EKMEKCIOGLU, ROBERTSON & WILLETT report on an evaluation of three methods for the expansion of natural language queries in ranked-output retrieval systems. The methods are based on term co-occurrence data, on Soundex codes, and on a string similarity measure. Searches for 110 queries in a database of 26,280 titles and abstracts suggest that there is no significant difference in retrieval effectiveness between any of these methods and unexpanded searches.
POPOVIC & WILLETT devised a stemming algorithm for Slovene text that worked very well compared with the Porter algorithm used on English translations of the documents and queries. Successful stemming and string similarity work on other languages and on historical texts in English has also been reported in ROGERS & WILLETT, ROBERTSON & WILLETT (1992). Expanded queries, by means of historical variants as identified by the digram-matching method, retrieved more relevant documents than the unexpanded modern language queries (FRITH, ROBERTSON & WILLETT).
JONES applied the relational database model to thesaurus representation for the use in a retrieval system. She provides an outline with suggestions for using the stored information when matching one or more thesaurus terms with a user's query. The results by JONES ET AL. (1995) on incorporating intelligent algorithms were not encouraging. Query expansion using the thesaurus in a ranked output system reordered the output but it did not improve it.
PAICE introduced a thesaural model of information retrieval based on spreading activation, but there is no empirical evidence to support the model.
MEGHINI ET AL. describe a multimedia information retrieval model (MIRTL) based on a terminological logic. The model, inter alia, specifies denotational semantics for query answering. Queries are answered by relying on a lattice of definitions and connotations which is the logical equivalent of a thesaurus. This may be seen as the formal equivalent of expanding a query by taking into account dictionary definitions.
JÄRVELIN ET AL. present a deductive data model for thesauri. EXPANSIONTOOL is an aid for AQE based on the deductive thesaurus data model. By naming relevant concepts and by adjusting a set of parameters, the searcher can generate queries which vary in recall and precision, target database, query language and/or matching principles. The system uses a more elaborate building-blocks approach for concept expansion. Once the system has expanded all the terms it then uses a query language grammar to formulate the appropriate search statement to the target query language selected. EXPANSIONTOOL currently supports the INQUERY, ISO, TOPIC and TRIP query languages.
CROFT & DAS describe an experiment in which the query was enhanced with information obtained from users during the presearch interview. Different retrieval strategies were tried out and the top 10 documents from each strategy were merged into a single set. The results demonstrated significant improvements in retrieval performance.
The automatic query expansion in the OKAPI online catalog in addition to using terms from the relevant retrieved records it also uses the classification codes that are assigned to the record (HANCOCK-BEAULIEU & WALKER).
In SPECIALIST the natural language query is parsed in order to determine simple noun phrases (ARONSON, RINDFLESCH & BROWNE). Then each phrase is mapped to concepts in the UMLS Metathesaurus by generating variants of the words in the noun phrase. A degree of similarity is computed between the noun phrase and the generated variant terms from the UMLS Metathesaurus. The similarity, from strongest to weakest mapping, is defined as exact, simple, complex and partial match. In an experiment using the UMLS test collection and comparing retrieval effectiveness of three types of queries (unexpanded query, expanded query with UMLS Metathesaurus terms, and UMLS Metathesaurus terms only) they report that the expanded query increased average precision and performed better than the unexpanded query. The query using the Metathesaurus terms only had the worst performance. Performance improvement was minimal, a (4%) increase of average precision was reported. The authors point to future methodological modifications that promise better retrieval performance by exploiting the UMLS semantic network. COACH, a UMLS browser for expanding GRATEFUL-MED MEDLINE searches, when invoked it locates a term in the Metathesaurus and identifies all synonyms and related terms of the term and continues the process recursively until twenty MeSH terms are found (JACHNA, POWSNER & MILLER). COACH uses a modified idf weighting to rank terms.
VOORHEES & HOU and VOORHEES (1994a;b) used WordNet (MILLER), a general purpose thesaurus, as a source of related concepts for query expansion. Terms were either automatically selected or were hand picked. Some queries were improved, others were degraded. BOYD, DRISCOLL & SYU used the public domain 1911 version of ROGET's thesaurus to expand queries. Two approaches were implemented, one using a connectionist model and the other the vector space model, but had poor results. In TREC-3, DRISCOLL, THEIS & BILLINGS use a knowledge-based approach based on the ER database model. A TREC topic is expanded into several lists of information. Each list is either all the synonyms (and word forms) for a word appearing in the topic, or all the currently known values for an item mentioned in the topic. For example, for each required item of information (such the name of a company, or the type of a cancer), a list of possible values is made for that item of information. The set of lists for a topic are used to search for relevant information. The search involves scanning documents and count hits in the lists. Various combinations of the lists and various weighting of the lists are used in determining the relevancy value for a document.
In a weighted search with interactive query expansion there is therefore a joint responsibility between system and user in selecting terms for the expansion. On one hand, the system suggests terms and presents them to the user, and on the other hand it is the users who select terms according to their preferences. So, it is the user who takes the final decision over the usefulness of a term. Therefore, the user choices of terms, rather than the system chosen terms alone, reflect the relative importance and the usefulness of terms from the user's perspective, thus increasing user satisfaction. The reasons of the success or failure of the search becomes therefore even more difficult to pin-point because in interactive query expansion we are increasing the uncontrollable variables.
The source of the terms for the selection stage, as in the case of automatic expansion, could be based either on the search results or on some knowledge structure which could subsequently be either dependent on the collection or independent of it. Attempts to incorporate a semi-automatic user-assisted (interactive) term selection stage are much fewer than their automatic query expansion counterparts. The discussion below examines some selected attempts of interactive query expansion.
The INSTRUCT term clustering technique (WADE & WILLETT) identifies keyword stems which are most similar to the query term stem. The morphological expansion (FREUND & WILLETT; HENDRY, WILLETT & WOOD) calculates a measure of string similarity (measured in trigrams) between a selected query stem and each of the stems in the dictionary file of the database. Then, in both cases, the system displays the twenty most similar stems to the user who chooses the ones to be added in the query.
EUREKA (BURKET, EMRATH & KUCH) is an experimental full text retrieval system which uses a user specific thesaurus (there is not a system-wide one). Each user can create and maintain a personal thesaurus which is used by EUREKA at search time to find synonyms for the query terms. As additional user aids EUREKA can present on demand either a histogram of term frequencies based on the retrieved documents, or word-lists of terms that are used in many documents or have high average frequencies. From these lists the user selects terms to refine the retrieved set.
Feedback based on the result of the search and with a dynamic form of search modification is given by Thomas (ODDY) which treats the document collection as a thesaurus and lets the user browse through it. The user starts with a simple natural language expression of interest, and the program responds by presenting a document representative (i.e., title, author(s), index terms) and asks for the user's judgments. At this point the user can make negative or positive judgments, on the document as a whole, and/or on any index term(s) or author(s) listed. Then Thomas iterates the search while keeping track of its actions by creating a model of the search in the form of a network.
In I3R (THOMPSON & CROFT) browsing begins after the user has picked an item as the starting point. The system provides assistance to the user by suggesting paths that should lead to relevant information. Recommendation is given to guide and not to restrict the user options. This is reflected by the displays on the neighborhood and context maps, which consist of nodes representing concepts, documents and journal issues connected by links. In the center of the map is the document the user selected as relevant. The surrounding nodes are documents that the system has decided are likely to be interesting. When the user selects a node the document that it represents appears on the screen and the user can mark it as relevant or ignore it. After a number of documents have been judged relevant I3R will re-iterate the search.
The Associative Interactive Dictionary (AID) is a prototype system developed for the Medline and Toxline databases at the National Library of Medicine (DOSZKOCS, 1978). AID automatically generates and displays related terms, synonyms, broader and narrower terms and other semantic associations for a given search term. These are controlled vocabulary associations from MeSH in Medline or free text word associations in Toxline. The terms are derived from titles, abstracts and/or controlled indexing fields from retrieved documents. These terms are displayed in ranked order according to a `relatedness' value (R) which is calculated using a modified chi-square value. The retrieved set is defined as the set of documents retrieved by a given search term or Boolean query. If the observed document frequency of a given term in the retrieved set is less than its statistically expected frequency then the term is assumed to have no associative value. The rationale behind the AID system is based upon the assumption that terms showing a considerably higher than expected frequency of occurrence in a retrieved set are assumed to be semantically related to the terms that retrieved that set. Thus AID is said to be a tool that retrieves semantically related terms.
AID operates by storing a subset of the inverted files for the two databases in its in-core hash table. The hash table terms represent all the inverted files' index terms with a frequency of four or more postings. Searches are carried out in the usual Boolean fashion and AID can be implemented at any time through the `EXECUTE' command.
A typical search on AID requires that the searcher is not only familiar with the terminology of document representation of bibliographic systems (in order to select which field the system should base the analysis on) but has a knowledge of how statistical correlation works since one is invited by AID to decide on a correlation threshold (a number between .0 and .999). Another requirement is that one must also specify the number of records to be used in the associative analysis. Nevertheless, searchers are presented with a list of terms which are ranked according to their relatedness to the initial query term together with an indication on the number of documents each of them would retrieve. The searcher then could select or reject any of the associated terms for query expansion.
A number of the research ideas in AID were subsequently implemented by Doszkocs in the various versions of CITE (DOSZKOCS, 1983). An interactive session with CITE might be instigated by the searcher who enters an enquiry statement in natural language. CITE parses the input, identifies spelling mistakes, requests their clarification and then suggests to the searcher a set of potentially applicable single words and MeSH headings that match the parsed input and for which CITE proposes to search. These terms are ranked by some weighting formula (which has not been disclosed) and presented to the searcher for selection. Relevance feedback information is also utilized in order to locate other items. CITE performs a frequent analysis on the MeSH headings and presents them to the user who can choose any of these and/or add additional keywords to the search.
CITE automatically performs stemming on words entered by the user, then selects medical subject heading (MeSH) terms and terms from the dictionary file by matching the user's words against them. Then it ranks the selected terms and displays them to the user for approval and feedback before commencing the search (DOSZKOCS, 1983).
All expert systems which have pre-search aid modules, such as CANSEARCH, CONIT, TOMESEARCHER, etc., use a knowledge structure independent of the collection and help in the suggestion of terms. Some of these systems maintain a thesaurus as part of their knowledge base, while all provide some means for the identification of terms.
EXPERT (YIP) helps the user in building a query formulation by asking the user to split the topic into concepts and then to suggest terms to express each one of them. The system builds the query by developing these concepts and then it prints them out in a columnar manner with each column containing the terms that form one concept. Thus it has no thesaurus (or any form of semantic knowledge structure); its knowledge structure could be described as purely syntactic, relating to the structure of typical simple Boolean queries and utilizing some form of the building block search strategy.
CANSEARCH (POLLITT, 1981; 1987) uses a subset of the MeSH thesaurus, together with its own knowledge of how cancer queries tend to be structured, to help the user identify terms and incorporate them in the query. TOMESEARCHER (VICKERY), processes its natural language query input and uses the INSPEC thesaurus to supplement the terms and expand the query. A system specifically designed to assist users in the selection of query terms has been developed by SHOVAL (1985; 1986). Its knowledge base consists of a thesaurus enhanced by additional information about terms, which is organized as a semantic network.
The user inputs search terms and the system searches its knowledge base, selects terms that seem to be relevant, evaluates them and presents them to the user who chooses which are useful for the query.
The MICROARRAS full text retrieval engine (SMITH, WEISS & FERGUSON) retrieves text passages and then informs the expert system of the results that satisfy the request. The expert system evaluates them and decides whether and how to reformulate the query (GAUCH & SMITH, 1989; 1991). Query reformulation is done by using a thesaurus (i.e., adding terms from the thesaurus to the search terms), by adjusting contextual constraints on the Boolean operators (i.e., alternating the operators from the most to the least restrictive, as for example in the sequence `ADJ, WITH, SAME, AND, OR'), and by replacing Boolean operators. These techniques can be employed individually or in any combination.
In MenUSE (POLLITT, 1988) the emphasis is on the structure of the MeSH thesaurus. The user identifies concepts by starting from the top of the hierarchy and going down the tree till the appropriate entry is recognized and selected. The same process is repeated by the user for each of the concepts. Then the system combines them and presents a review to the user. Another system, which also takes advantage of the MeSH thesaurus, uses a graphical interface to achieve similar goals (MCMATH, TAMARU & RADA). The hierarchical structure of MeSH is graphically displayed and the user can move about the hierarchy with the help of the mouse. Queries can be formulated by selecting terms from the hierarchy and placing them in the query window. Multiple windows allow the user to simultaneously see the thesaurus, a histogram of document rankings and the document descriptions themselves.
In HIBROWSE (POLLITT, ELLIS & SMITH) different views of the same database are presented all at the same time on the screen. For example, MeSH headings appear in the center window along with indications of how many documents are indexed under each heading and how many terms are in the hierarchy below the displayed term. Additional windows display, for example, titles, author names, year of publication, document type, languages, etc. When the searcher selects a subject heading then all the windows get updated automatically to display the corresponding information, such as the titles of the papers indexed by the selected subject heading, etc. Searchers can then browse, select and combine information from any window so that they define and expand a query without the need of mastering a command language.
Searching in the MEDINDEX prototype is performed through the indexing of the query, which shares the document indexing knowledge-base (HUMPHREY, 1992; 1994). Many of the rules are shared between document indexing and query indexing in filling indexing frames. Frame-filling is considered a form of query expansion by prompting the searcher for aspects of a topic. For example, in filling a Drug Therapy frame, the system prompts for PROBLEM (disease treated), AGENT (drug used), and so forth. If the searcher filling this frame enters Estrogens or a specific estrogen term as the AGENT, the system reminds the searcher of the more specific term Estrogen Replacement Therapy. MEDINDEX therefore uses knowledge of specificity in the domain, and can initiate very situation-specific assistance, compared to searchers having to initiate look-ups in the MeSH trees assuming the searchers realize they need to take this initiative.
After the frames are filled, the indexing terms generated are displayed in an interface. Three techniques for query expansion are then provided (HUMPHREY, 1994; 1995): (a) Explosion of terms (i.e., hierarchical query expansion) is the default. (b) The interface includes ëxpand to parent", displaying the ancestors of a MeSH term for searchers to select any of them to be combined in union with the original term. Parent terms selected are not exploded by default. They are incorporated into the search strategy, which is then passed to the retrieval system. (c) The "SQL Query" version of the MEDINDEX prototype has an explosion interface which permits searchers to mark exceptions to explosion and then it produces source code for query expansion with exceptions. The code explodes a MeSH term, removes the searcher identified exceptions from the list, and then considers the result to be the explosion.
The design issues of integrating a dynamic lexicon with the AI-STARS full-text retrieval system are discussed by ANNICK. Like I3R (CROFT & THOMPSON), AI-STARS allow users to directly browse a lexicon during query (re)formulation and select terms for the expansion.
Searchers used a newspaper database as a further source of words and phrases in addition to their personal knowledge of the topic they were searching. For each search they analyzed the TREC topic, excluded any terms that seemed unhelpful and expanded the term list by adding synonyms, broader terms, narrower terms or other useful terms (COOPER, CHEN & GEY).
The ZOOM (MARTIN; INGWERSEN, 1984) feature on ESA/IRS is a helpful tool for online searching along these lines. It performs term frequency analysis on a number of records from the retrieved set(s). The user is then presented with screen-displays which contain terms in a frequency ranked order. The searcher selects terms which then can use to expand the query. ESA/IRS offers also QUESTQUORUM (D'ELIA & MARCHETTI) as a simple interface to its command driven system for inexperienced users which can also do a semi-automatic query expansion based on terms selected by the user from a ZOOM-like display. Recently, ESA/IRS introduced BRAQUE, a prototype interface that utilizes ZOOM, EXPAND, and thesaurus browsing (HYPERLINE) for term selection during query formulation and query expansion (BELKIN, MARCHETTI & COOL). Commands similar to ZOOM are now also available in other vendors, e.g. GET in ORBIT, EXTRACT in DIMDI, MEMSORT in QUESTEL, RANK in DIALOG.
There are also some other systems that have utilized the term frequency analysis function either in a ZOOM-like way or in some other form. IT (Information Transfer system) (WILLIAMS, 1984) uses the command EXPLORE to retrieve, organize and present index terms to the user, for addition to the search. Analogous is the command TERMS in the MUSCAT online catalog (PORTER & GALPIN). It presents terms and UDC numbers which are extracted from documents inspected and judged relevant and which were not included in the original query. With each term presented the user is asked whether or not to add it to the query. CITE (DOSZKOCS, 1983) utilizes the user feedback also in a similar manner. It automatically performs term frequency analysis on the records marked as relevant, and then it presents the terms in ranked order to the user for selection.
USERLINK, OASIS and IT are front-ends for both the unskilled user and the skilled intermediary (WILLIAMS, 1983; 1984; 1985; GOLDSMITH & WILLIAMS). The aim is to improve the recall of a search by helping the searcher select index terms from retrieved documents. Query expansion is achieved through the EXPLORE feature as follows.
The searcher carries out a search and identifies a set of relevant documents. IT then retrieves the titles and descriptors which are contained in the final results set. The retrieved items are then processed to provide two lists of the single words and phrases that are contained in the retrieved documents. The lists are ranked in order of decreasing frequency. The word list and phrase list are then displayed to the user. A simple dialog is provided which enables the user to select terms and to add them to the search.
The words and phrases are displayed in order of decreasing frequency of their occurrence in the retrieved set. The system can handle fairly large numbers of references. Williams suggests that up to 50 references should normally be retrieved and processed to keep the online time to a reasonable value. ZOOM and Williams' system are thus fairly similar in their approach, extracting terms from retrieved records which are subsequently displayed in an organized fashion using frequency information.
PORTER & GALPIN discuss MUSCAT, an OPAC at the Scott Polar Research Institute, which allows relevance feedback and query expansion in its advanced use of the system. In the advanced module of the Muscat OPAC a query can be in probabilistic or in Boolean terms. In the probabilistic query extra terms may be added in any of the following ways: (a) by browsing through the index and adding in selected terms; (b) by using the command `ADD PENGUIN' which explicitly adds the term penguin in the search; (c) by using the command `TERMS' which is a relevance feedback method of query expansion.
The `TERMS' command allows users to view terms found in the relevant documents. These are comprised of stemmed single words and Universal Decimal Classification (UDC) numbers. The terms and UDC numbers are weighted and ranked in decreasing order.
Another investigation of query expansion is the IRX experiment (HARMAN, 1988). Part of that experiment looked at some ways of extracting short lists of terms for query expansion gathered from relevance feedback, nearest neighbors, and term variants of the original query terms. IRX used the Cranfield test collection for this bench experiment and user interaction was simulated. The goal of the part of the experiment that looked at relevance feedback was to identify 20 terms that could be included in the `feedback terms' window of the interface.
ARAYA evaluated different strategies to improve query formulation using the SMART retrieval system and the CACM test collection. Query expansion was based on adding phrases that were generated, using syntactic or statistical methods, from the query terms or from the relevant documents. The statistical method generated phrases by combining every term in, for example, the query in every possible way. It was reported that user intervention did not perform much better than the automatic approach. However, the potential of interactive query expansion for improving performance was realized.
EFTHIMIADIS (1992b) investigated interactive query expansion using CIRT, a relevance feedback system (ROBERTSON ET AL., 1986), and searched INSPEC on DATASTAR with real users and their requests. Interactive query expansion was investigated from two different perspectives, (a) studying end users during the process of query expansion, especially the term selection process, and (b) studying the behavior of ranking algorithms for query expansion. The overall search results provided some evidence for the effectiveness of interactive query expansion. The user-based aspect of the research investigated the processes of interactive query expansion, term selection for query expansion by users, and the user perception, understanding, and assignment of term relationships from a knowledge structure, such as the INSPEC thesaurus (EFTHIMIADIS, 1994b; 1996) The algorithmic aspect of the research was concerned with the evaluation of the performance of ranking algorithms for query expansion. A new methodological approach for the evaluation of ranking algorithms was introduced where the yardstick for the evaluation of the ranking algorithms was the user choices of terms (EFTHIMIADIS 1993). The comparative evaluation of eight ranking algorithms for the ranking of the terms for query expansion is given in EFTHIMIADIS (1995).
As experimentation is moving away from the small test collections to the TIPSTER/TREC collection and to operational environments, a new wave of research is under way that will test the scalability of earlier approaches and try out new ones. Most research on query expansion has concentrated in developing a method and then testing it in terms of overall retrieval performance. While, this is an appropriate methodology there is lack of microevaluation of the tests performed. Such individual scrutiny of results, though admittedly a costly and onerous task, would provide qualitative clues for further explanation of the behavior of the method of query expansion that is studied. Thus far research on query expansion has not yet identified optimal levels for neither automatic query expansion nor interactive query expansion.
The conclusions from this review of query expansion are suggestive of many directions for future research. The main research directions can be divided into two streams: (a) research on methods for the automatic identification, collection and association of terms; term extraction and use in retrieval; ranking algorithms and of other methods for term inclusion in the search; and (b) user studies on query expansion with emphasis on how searchers select terms for query expansion; how do they perceive term relations, and how these affect retrieval performance. These are highlighted below.