The results discuss issues that relate to query expansion, retrieval effectiveness, the correspondence of the online-to-offline relevance judgments, and the selection of terms for query expansion by users (interactive query expansion).
The main conclusions drawn from the results of the study are that: (1) One third of the terms presented to users in a list of candidate terms for query expansion were identified by the users as potentially useful for query expansion; (2) These terms were mainly judged as either variant expressions (synonyms) or alternative (related) terms to the initial query terms. However, a substantial portion of the selected terms were identified as representing new ideas. (3) The relationships identified between the 5 best terms selected by the users for query expansion and the initial query terms were that: (a) 34% of the query expansion terms have no relationship or other type of correspondence with a query term; (b) the remaining 66% of the query expansion terms have a relationship to the query terms. These relationships were: narrower term (46%), broader term (3%), related term (17%). (4) The results provide evidence for the effectiveness of interactive query expansion. The initial search produced on average 3 highly relevant documents; the query expansion search produced on average 9 further highly relevant documents.
The conclusions highlight the need for more research on: interactive query expansion, the comparative evaluation of automatic vs. interactive query expansion, the study of weighted web-based or web-accessible retrieval systems in operational environments, and for user studies in searching ranked retrieval systems in general.
The research presented here is an investigation of interactive query expansion, its process and its effectiveness using an operational system with real users and their requests. The INSPEC database on Data-Star was searched online with a system that supports weighting, ranking and relevance feedback [Robertson & Thompson 1990]. In investigating interactive query expansion (IQE) in an operational system the emphasis is on the users, their requests, and their interaction with the system. The most obvious characteristic of the user interaction is the selection of terms at a particular stage in the process.
Data were collected from 25 searches. The data collection mechanisms included questionnaires, transaction logs, and relevance evaluations. The variables examined were divided into seven categories which included: retrieval effectiveness, user effort, subjective user reactions, user characteristics, request characteristics, search process characteristics and term selection characteristics.
Studies of operational systems produce a wide range of results and this study is no exception. The most important results with respect to query expansion are reported here from the term selection characteristics and retrieval effectiveness. The results of the study are presented in two sections, that is, main results and other findings according to the treatment of query expansion. The main results relate directly to query expansion, e.g., what type of term relationships users identify between the initial query terms and the query expansion terms, to retrieval effectiveness and to the correspondence of the online-to-offline relevance judgments. The other findings present results from the analysis of the questionnaires and the searches that do not relate directly to query expansion and include user characteristics, subjective user reactions and search process characteristics.
When considering query expansion the research questions that emerge include:
All these questions led to the decision to investigate interactive query expansion and its effectiveness within a dynamic user centered environment.
The methods and procedures used in the study are presented and discussed here. These include the study participants, variables examined, data collection instruments (questionnaires and transaction logs), data collection procedures, methodological issues relating to relevance assessments for the online and offline evaluation of the search results, and to user selection of terms for query expansion.
The study participants were all engineering, computer and information science students and faculty. A total of 25 searches were performed. All searches were conducted in the INSPEC database. The time required to complete a search from the pre-search interview to the final evaluation of the printouts of the results offline was at least four hours. Free searches were offered to attract users.
A complete record of the search includes four questionnaires, search logs and the evaluated printouts of the results. The printouts of the search results were evaluated on site immediately after the search, thereby reducing the possibility of users not returning them if the printouts were taken away for evaluation. At the end of the session users were given a photocopy of the printouts they had evaluated.
The variables examined are divided into seven categories and are evaluated by the questionnaires, logs and relevance evaluations. The categories include: Retrieval effectiveness, user effort, subjective user reactions, user characteristics, request characteristics, search process characteristics, term selection characteristics. All variables are listed below.
Because the experiment was conducted in an operational environment, and because each search was conducted only once, it was not possible to establish a recall base. Therefore the number of relevant documents retrieved was used instead of recall.
Additional relevance data were gathered during the evaluation of offline prints. This information was concerned with whether the user had seen the documents prior to the evaluation and could be used for determining novelty ratios. The questions covered both the issue of subject (topical) relevance and that of the usefulness of each document to the user. In addition to precision the following relationships were also examined: (a) relationship (proportion) of online relevance judgments to total number of documents seen online; (b) relationship (proportion) of total number of documents seen online to total retrieved; (c) relationship of online relevance judgments to offline relevance judgments. How do online relevance judgments correlate with overall results? Is there consistency between online and offline relevance judgments?
User effort could also be measured by how many documents one has to see in order to arrive at 5 relevant documents online. This aspect of the variable is closely related to retrieval effectiveness and therefore the two are discussed in the results in parallel rather than separately. There are two stages in looking at documents, one is online and the other is offline. Traditionally we calculate precision based on the results of the offline evaluation. However, the relevance judgments online provide us with information that relates to precision. For example, the aim during the searches was to base the analysis on a sample of at least 5 relevant documents. Therefore, one could relate the user's effort in identifying online relevant documents to precision and retrieval effectiveness.
The hypothesis is that the longer it takes (in terms of documents examined online) to arrive at the 5 relevant documents the lower the precision of the final retrieved set will be. To put it another way, the question could be reformulated as how the proportion of the relevant documents seen online to the total number of documents seen online affects the final result, i.e. could the precision achieved online be an indication of the precision to be expected from the final results? Could such a relationship be established and, if so, what does it mean?
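The two quantities that this hypothesis relates can be sketched as follows. This is a minimal illustration rather than the study's analysis code: the function names and the sample session are hypothetical, and online judgments are assumed to be recorded as booleans in the order the documents were viewed.

```python
from typing import Optional, Sequence


def docs_seen_until(judgments: Sequence[bool], target: int = 5) -> Optional[int]:
    """Number of documents viewed online before `target` relevant ones are found.

    Returns None when the target is never reached.
    """
    found = 0
    for position, is_relevant in enumerate(judgments, start=1):
        if is_relevant:
            found += 1
            if found == target:
                return position
    return None


def online_precision(judgments: Sequence[bool]) -> float:
    """Proportion of documents judged relevant among those seen online."""
    if not judgments:
        raise ValueError("no documents were seen online")
    return sum(judgments) / len(judgments)


# Hypothetical session: True = judged relevant online, in viewing order.
session = [True, False, True, True, False, False, True, False, True, False]
effort = docs_seen_until(session)      # 9 documents viewed to reach 5 relevant
precision = online_precision(session)  # 5 / 10 = 0.5
```

Under the hypothesis, searches with a larger `effort` value would be expected to show lower precision in the final retrieved set.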
Categorical questions on education/academic status, occupation/subject studied, etc. give data on individual differences that have been shown to be related to information retrieval system performance [Borgman 1986].
The questionnaires on the outcome of the search would elicit scaled data on the effectiveness and efficiency of the search. These would allow me to make estimates of the degree of success or failure of the interaction, from the point of view of the user and thus to relate searching behavior, problem context and types of uses to various measures of success.
This questionnaire elicited information about the query expansion terms that were selected by the user. The questions concentrated on how the user perceived the relationship between the query terms and the term that s/he selected for query expansion. Users were asked to identify whether the query expansion terms were chosen because they were thought of as being synonyms of query terms, related terms, the best alternative to express the subject that they could find in the list, or representing new ideas.
There are two ways to look at ranked lists of terms for term selection for query expansion. In one we can assume that the ranking algorithm is not in question, and in the other that the ranking algorithm is in question. In this paper the discussion of the results assumes that the ranking algorithm is not in question. That is, we accept that the terms are ranked by the system according to their usefulness for each particular query, with the best terms at the top of the list and the bad terms at the bottom. Efthimiadis presents a user-centered framework for evaluating ranking algorithms when they are in question.
The data collection instruments used in the study are divided into three categories which include questionnaires, transaction logs and printouts of the search results. The instruments are summarized below and discussed by category.
The pre-search questionnaire dealt mostly with user and request characteristics. It inquired about the users' background and the context of the information request. It asked for combined user and search information such as status (student, faculty, researcher, etc.); what they hoped to use the search results for (coursework, research project, teaching); whether they had done any online searches before, either on their own or with an intermediary. There were also questions regarding request characteristics. Some questions elicited from the users information on how much they thought they knew about the subject (1=nothing ... 5=a lot) as well as which stage of the project they were at (1=beginning to think ... 5=end of project). Other questions asked whether the user wanted a broad or narrow search, and whether they viewed the nature of their search request as precise, general or vague. This last question was trying to ascertain how the request and the terms used in the search related to the subject domain of the query.
The query expansion questionnaire asked questions relating to the choice of terms for query expansion. For example, if users had selected some terms from the ranked list they were asked whether these terms were thought of as: variant expressions or synonyms, or alternative (related) terms to the original query terms; or whether these were chosen because the user could not find a better term(s) to express the subject of the enquiry; or whether the chosen terms represented new ideas that were not part of the original request. For each query expansion term chosen users had to identify if the term corresponds to an existing query term and whether it was a broader, narrower or related term.
The remaining two questionnaires (end-of-search and final) were completed after the search. The end-of-search questionnaire was completed immediately after the search while the user was still sitting at the terminal and was concerned with the search process. The final questionnaire was completed after the evaluation of the printouts of the search results.
These two questionnaires were concerned primarily with the users' overall satisfaction with the search, their impression of the ease or difficulty of the search and a consideration of the results based upon what they had seen during the search. Users were also asked how close the search was to the original enquiry and whether they had retrieved the number of references they had anticipated. The questions that were asked after they had evaluated the offline prints were concerned with their satisfaction with the references they evaluated and whether in retrospect there were any other concepts that they would like to search for.
Methodological issues relating to query term selection, online relevance judgments, sample size of relevant documents for relevance feedback, and selection of terms for query expansion are discussed below.
Since the input of query terms could vary considerably in its form it was decided to try the following two approaches for the selection of terms for the initial query and for query expansion. For the first 12 searches initial query terms were entered in the form presumed most suitable to the search, that is, as phrases, single terms, truncated terms, synonyms ORed together, etc. Similarly, query expansion terms were selected from the ranked lists and were input to the system in the form thought to be best, i.e. single terms or phrases or parts of phrases. For the remaining 13 searches query terms were searched as single terms only, i.e. phrases were split. Query expansion terms were taken as they appeared in the ranked list. The reason for using single terms was to simulate a basic system that uses some kind of pseudo-natural language parsing of the input query. Such a system usually splits the input into single words, although one could have a form of input for ``phrases,'' like the use of quotes by web search engines.
Relevance is a continuous variable and it has been established in studies of relevance judgments that it is an over-simplification to collapse a variety of degrees of relevance into yes/no decisions [Cuadra & Katter 1967, Rees & Schultz 1967, Eisenberg & Hu 1987, Froelich 1994, Harter 1992]. However, the adoption of the binary definition of relevance is a required simplification of this complex notion because it facilitates the calculation of relevance weights.
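To illustrate why binary judgments are convenient for weighting, the sketch below computes one widely used relevance weight, the Robertson/Sparck Jones formula with the 0.5 correction. The text does not state which formula the system in this study used, so the choice of formula, the function name and the sample counts are all assumptions made for illustration.

```python
import math


def rsj_weight(r: int, R: int, n: int, N: int) -> float:
    """Robertson/Sparck Jones relevance weight with 0.5 smoothing.

    r: relevant documents containing the term
    R: documents judged relevant (binary yes/no judgments)
    n: documents in the collection containing the term
    N: documents in the collection
    """
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (n - r + 0.5) * (R - r + 0.5)
    return math.log(numerator / denominator)


# Hypothetical counts: 5 documents judged relevant in a 10,000-document file.
strong = rsj_weight(r=4, R=5, n=50, N=10_000)  # term in 4 of the 5 relevant documents
weak = rsj_weight(r=0, R=5, n=50, N=10_000)    # term in none of them
```

The weight needs only the counts `r` and `R`, which a yes/no judgment per document supplies directly; a graded scale would first have to be collapsed into these counts.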
During the online relevance assessments of documents users are faced with two issues. One is that the retrieval system requires relevance judgments on a binary scale. The other is that the user must respond to the question of the relevance of a document by judging each document for its topical relevance, i.e. the extent to which the subject matter of the document is about the query, rather than its usefulness.
The distinction between topical relevance, as defined here, and usefulness is very important for relevance feedback systems. A document may be relevant but not useful to a user because, for example, the user already knows about it, or the document is outdated, or it is written in a foreign language. The distinction matters because relevance information is used by the weighting scheme for the estimation of the relevance weights. Although one can argue that feedback systems would try to predict usefulness, including factors such as date and language, for most systems it seems appropriate to ask the users for topical relevance judgments as opposed to usefulness.
In helping users understand the difference and make the relevance judgments in the above context the following question was used: ``regardless of whether you have seen this document before, is this the sort of thing you are looking for? [y, n]''. This is a modified version of the question used in the Okapi retrieval system [Walker & De Vere 1990]. ``More like this'' (Excite) and ``find similar pages'' (Infoseek) are simplified versions of the question used by web search engines. However, the resulting relevance feedback search is different from the one described here.
A variety of sample sizes has been used in IR experiments for relevance feedback or query expansion searches. A commonly used method for getting a sample in automatic relevance feedback and query expansion is to simulate relevance feedback by assuming that the top N-ranked documents are relevant and use them as sources of terms for query expansion. This technique is used in TREC, for example, [Efthimiadis & Biron 1994, Buckley et al. 1995, Robertson et al. 1995, Buckley, Singhal & Mitra 1996, Allan et al. 1998], and has its roots in early automatic relevance feedback experiments where the sample was defined at a cutoff level of the 10 or 20 top-ranked documents [Harper & van Rijsbergen 1978, Harper 1980, van Rijsbergen, Harper & Porter 1981, Smeaton & van Rijsbergen 1981, Sparck Jones 1979a, Sparck Jones 1979b, Sparck Jones 1980].
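The simulated feedback step described above can be sketched as follows. This is a minimal sketch under stated assumptions: documents are reduced to sets of index terms, the function name and sample data are hypothetical, and candidate terms are ordered simply by how many sample documents contain them, which is a placeholder and not the weighting used in any of the cited systems.

```python
from collections import Counter
from typing import List, Sequence, Set


def candidate_terms(ranked_docs: Sequence[Set[str]],
                    query_terms: Set[str],
                    top_n: int = 10) -> List[str]:
    """Treat the top_n ranked documents as relevant and collect candidate terms.

    Terms already in the query are dropped; the rest are ordered by the
    number of sample documents that contain them.
    """
    sample = ranked_docs[:top_n]
    doc_freq = Counter(term for doc in sample for term in doc)
    candidates = [t for t in doc_freq if t not in query_terms]
    # Sort by descending frequency, then alphabetically for a stable order.
    return sorted(candidates, key=lambda t: (-doc_freq[t], t))


# Hypothetical ranked output, best document first.
ranked = [
    {"query", "expansion", "feedback"},
    {"query", "feedback", "ranking"},
    {"expansion", "thesaurus"},
    {"image", "compression"},
]
terms = candidate_terms(ranked, query_terms={"query", "expansion"}, top_n=3)
# terms == ["feedback", "ranking", "thesaurus"]
```

With `top_n=3` the fourth document never contributes terms, which mirrors the cutoff idea: everything above the cutoff is assumed relevant and everything below it is ignored.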
A conclusion of these earlier experiments was that within the documents in the cutoff level a small sample of (1-4) relevant documents could be adequate as the basis of the reweighting of terms [Sparck Jones 1979b]. Others have suggested a sample size of 5 relevant documents, for example, Harper, Martin and White. This does not exclude the use of a larger sample. On the contrary, it is believed that the larger the sample of relevant documents the better the estimation should be. However, the problem of selecting an optimal sample size is still very much an open IR research issue.
Consequently, for the data collection of this research it was decided to aim for at least 5 relevant documents within the first 20 top ranked documents. Getting a sample set of 5 relevant documents was not always possible, but was quite consistent throughout. There are 17 searches that have a sample of 5 or more documents judged relevant online. In six searches there are 4 documents judged relevant and in two searches 2 documents.
Users were asked to be thorough and were told to take as long as they needed to finish this task, i.e. no time constraints were imposed on them. It could be argued that the ranking and its presentation order had an effect on the users in choosing query expansion terms. Users are more likely to choose terms from the top of the ranked list that is presented to them rather than from the bottom. In other words, users are most likely to stop at some point while going down the list rather than to maintain their concentration till the end of it. There might be a number of factors that make someone decide to quit, e.g. they were satisfied with the terms they had found on the list so far, they were discouraged by the length of the list and chose terms from the top, etc.
However, it was thought that term selection bias would be reduced if the user had a clear visual idea of the length of the list as opposed to having to go through an unspecified number of screens. The evaluation of the ranked list and the selection of the query terms was followed by the completion of the questionnaire on term selection.
After having selected the query expansion terms the search was continued by including the new terms. In order to be consistent with searching and try to minimize the variability between searches it was decided to have only one search iteration with the initial query terms and one search with the query expansion terms.
After the query expansion search was completed the retrieved documents were printed and given to users for relevance assessment.
The results reported are based on 25 searches. The analysis is directed at the query expansion aspects of the searches and retrieval effectiveness. Other aspects of the search including an analysis of the data collected by the pre-search and post-search questionnaires are discussed in the section `other findings'.
The discussion throughout is mostly qualitative in nature. The data have also been subjected to appropriate statistical analysis using various tests. Because of the small sample size there are occasions where the results are presented only with the intention to demonstrate trends and to facilitate the discussion and there are no claims of any statistical significance.
| User | Terms in list | Terms chosen | % |
| 101 | 62 | 10 | 16 |
| 102 | 38 | 9 | 24 |
| 103 | 137 | 11 | 8 |
| 105 | 77 | 14 | 18 |
| 108 | 33 | 8 | 24 |
| 110 | 61 | 10 | 16 |
| 111 | 48 | 31 | 65 |
| 112 | 93 | 6 | 6 |
| 113 | 65 | 24 | 37 |
| 114 | 62 | 15 | 24 |
| 115 | 42 | 9 | 21 |
| 116 | 77 | 3 | 4 |
| 117 | 64 | 25 | 39 |
| 118 | 55 | 32 | 58 |
| 119 | 117 | 34 | 29 |
| 120 | 44 | 9 | 20 |
| 121 | 61 | 17 | 28 |
| 122 | 60 | 13 | 22 |
| 123 | 61 | 12 | 20 |
| 124 | 80 | 27 | 34 |
| 125 | 41 | 21 | 51 |
| 126 | 39 | 7 | 18 |
| 127 | 62 | 11 | 18 |
| 128 | 113 | 98 | 87 |
| 129 | 34 | 8 | 24 |
| | N | MEAN | STDEV | SEMEAN | MIN | MAX |
| % | 25 | 28.44 | 19.19 | 3.84 | 4.00 | 87.00 |
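The summary statistics of Table 1 can be reproduced from the percentage column with a few lines of standard-library Python. The variable names are illustrative; the values are the rounded percentages as printed in the table, in row order.

```python
import math
import statistics

# Percentage of listed terms chosen in each of the 25 searches (Table 1, last column).
pct_chosen = [16, 24, 8, 18, 24, 16, 65, 6, 37, 24, 21, 4, 39,
              58, 29, 20, 28, 22, 20, 34, 51, 18, 18, 87, 24]

n = len(pct_chosen)
mean = statistics.mean(pct_chosen)
stdev = statistics.stdev(pct_chosen)  # sample standard deviation (n - 1 denominator)
sem = stdev / math.sqrt(n)            # standard error of the mean

print(n, round(mean, 2), round(stdev, 2), round(sem, 2), min(pct_chosen), max(pct_chosen))
# prints: 25 28.44 19.19 3.84 4 87
```

The match with the printed STDEV indicates that the sample (n - 1) form of the standard deviation was used.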
The questionnaire on query expansion asked the users two
questions, each corresponding to some aspect of
the term selection process they had followed.
At first users identified all the terms in the ranked list that
they thought of as being useful for the purpose of the search, i.e.
they selected terms that would be acceptable to use in the
search.
The results of this part of the selection process are given in
Table 1 which presents the total number of
terms in each list and the number of terms selected by the
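users. The users were then asked for the
reason(s) that made them choose those terms, that is, how they thought
the selected terms related to the search and to the original
query terms. They were asked to consider such relationship(s)
for all terms
collectively rather than for each term individually. Users were
given four options to choose from: variant expressions or synonyms;
alternative (related) terms to the original query terms; terms chosen
because no better term could be found to express the subject of the
enquiry; and terms representing new ideas that were not part of the
original request.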
Users could select as many options as they thought appropriate.
Figure 1 summarizes the results of the user
perceived association between the terms identified from the ranked
list to the query. The percentages given in the figure do not amount
to a total of 100% because users were asked to select as many of the
four options as applied to the terms. Users thought of the terms they
selected as being alternative (related) terms to the query terms
88% of the time and as variant expressions or synonyms 64% of the time.
These two categories account for the majority of the responses. A
very small percentage (4%) chose the terms because they could not
find a better term from those on the list to express the subject of
the query. A rather interesting result comes from terms that do not
relate directly to the original query terms and which represent new
ideas. These accounted for 44% of the responses.
The latter result demonstrates the unpredictability that is involved in
subject searching and the difficulties imposed on information
retrieval. Additional information was not collected for this
category. In retrospect some questions could have been
included to elicit information about the terms that represent new
ideas. Further research is therefore needed in this area, more
specifically on the relationships of the `new ideas' to the
original query. What was the reason for choosing these terms? Was
the user aware of these new concepts/ideas at the beginning of the
search? If yes, why were these not expressed at the pre-search
interview? Was the reason for the exclusion related to the interview, e.g.
a communication failure, or did the user choose to exclude them at the
interview stage because s/he thought of them as peripheral? If users
had not thought of these concepts at an earlier stage, did they know
about them before? Did they recognize and choose these concepts as the
result of a learning process during the search? On the whole, what
were the reason(s) and stimuli that made them choose these new terms?
Answers to these questions will contribute to the
understanding of users' searching behavior.
The second question asked the users to express the type of term relationships they thought existed between the five best terms they selected and the original query terms. The 5 best terms were selected from all those terms they had identified as useful. The questions on term relationship concentrated on whether there is a correspondence between each of the 5 new terms and the query terms. If the user identified that there was a correspondence between them the type of correspondence was noted. The term relationships that users were asked to select from are among the standard types of relationships found in thesauri, i.e. broader term and narrower term for hierarchical relationships and related term for affinitive/associative relationships.
The relationships of the 5 best terms selected for QE to the query terms are shown in Figure 2. For 34% of the chosen terms there is no relationship or other type of correspondence to a query term. For the remaining 66% of the terms relationships exist as follows: A narrower term relationship between a selected term and a query term accounts for 46% of the responses. A broader term relationship accounts for 3%. An associative relationship (i.e. related term) holds for 17% of the terms.
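A quick tally, using only the percentages quoted above, shows how the categories add up and what proportion of the terms that have a relationship are hierarchical (narrower or broader). The variable names are illustrative.

```python
# Shares (%) of the 5 best expansion terms by relationship to a query term.
shares = {"narrower": 46, "broader": 3, "related": 17, "none": 34}

with_relationship = shares["narrower"] + shares["broader"] + shares["related"]  # 66
hierarchical = shares["narrower"] + shares["broader"]                           # 49

# Hierarchical share among the terms that have any relationship at all.
hierarchical_share = hierarchical / with_relationship  # 49 / 66, about 0.74
```

The four categories sum to 100%, and roughly three quarters of the identified relationships are hierarchical, almost all of them narrower terms.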
From these results it can be established that approximately 75% of the term associations fall within a hierarchical relationship. Users have overwhelmingly selected narrower terms as the terms for query expansion. This finding is very important and emphasizes the possible advantages that may be involved if an online thesaurus is used. Such a thesaurus could assist users in looking up terms, establishing their relationships and deciding on term inclusion, as well as in exploiting hierarchical relationships. A possible way to include terms could be in clusters or in hierarchies, for example in a fashion similar to that of the `explode' command in Medline, where an `exploded' term retrieves itself as well as all the terms in the hierarchy beneath it. However, all these should be user-controlled or `machine-aided' operations rather than entirely automated.
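An `explode'-style operation over a thesaurus hierarchy can be sketched as below. This is only an illustration of the idea, not Medline's implementation: the thesaurus fragment, the `NARROWER` mapping and the `explode` function are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical thesaurus fragment: each term maps to its narrower terms.
NARROWER: Dict[str, List[str]] = {
    "information retrieval": ["query expansion", "relevance feedback"],
    "query expansion": ["interactive query expansion", "automatic query expansion"],
}


def explode(term: str, narrower: Dict[str, List[str]]) -> Set[str]:
    """Return the term itself plus every term beneath it in the hierarchy."""
    result: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in result:
            continue  # guard against cycles or repeated entries
        result.add(current)
        stack.extend(narrower.get(current, []))
    return result


expanded = explode("query expansion", NARROWER)
# {"query expansion", "interactive query expansion", "automatic query expansion"}
```

In a user-controlled setting the exploded set would be shown to the searcher for confirmation rather than added to the query automatically.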
Retrieval effectiveness is based on relevance judgments for all documents retrieved and which were printed in full format. The results from the relevance assessments are discussed below, first for the offline printouts and then for the correspondence of the online to the offline relevance assessments.
The number of documents evaluated ranged from 16 to 66 with an average of 30 documents per search. Relevance was evaluated first on a three-point scale (`Yes', `Partially' or `No') and then one of four choices was selected that reflected the usefulness of the citation to the user (I have seen the document itself before, and it was useful; I have seen the document itself, but it was not useful; I have not seen the document represented by this reference, but I would like to see it; I have not seen the document represented by this reference and I would not like to see it). Of all the documents retrieved, users had already seen 8%, wanted to see 51% and did not want to see 41%. A summary of the results is given in Figure 3. The usefulness responses vary with relevance as expected: the `want-to-see' responses decline from the relevant to the non-relevant end of the scale (30%, 18%, 3%), whereas the `do-not-want-to-see' responses increase (3%, 9%, 29%).
There is therefore a strong sense of distinction between relevance and usefulness, i.e. relevance and `want-to-see'. There are a few not-relevant documents indicated as `want-to-see' (3%) and a few relevant indicated `do-not-want-to-see' (3%). These results raise the question of what the reasoning behind these decisions is. Data to answer this type of question were not collected. The latter case is perhaps more obvious because one can think of good reasons why one might not want to see a relevant document, whereas for the former it is only the users who could identify what triggered their interest in that particular non-relevant document. The examination of the questionnaires and the citations identified as relevant and `do-not-want-to-see' reveals that the citations were either old or written in a language other than English.
The average percentage of the relevant, partially relevant and not-relevant documents over the 25 searches is 39%, 29%, and 32% respectively. Retrieval effectiveness was calculated for both relevance1, i.e. only relevant documents, and relevance2, i.e. relevant and partially relevant documents. The overall precision (averaged over the 25 searches) for relevance1 is 39% (for an average of 12 relevance1 documents retrieved) and for relevance2 is 68% (for an average of 21 relevance2 documents retrieved).
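The two precision figures differ only in whether partially relevant documents count as hits. A minimal sketch, assuming each offline judgment is stored as one of the strings 'yes', 'partially' or 'no' (the function name and the sample data are hypothetical):

```python
from typing import Sequence, Tuple


def precision_pair(judgments: Sequence[str]) -> Tuple[float, float]:
    """Return (precision for relevance1, precision for relevance2).

    relevance1 counts only 'yes'; relevance2 counts 'yes' and 'partially'.
    """
    total = len(judgments)
    if total == 0:
        raise ValueError("no documents were evaluated")
    relevant = sum(1 for j in judgments if j == "yes")
    partial = sum(1 for j in judgments if j == "partially")
    return relevant / total, (relevant + partial) / total


# Hypothetical offline evaluation of 10 printed citations.
sample = ["yes"] * 4 + ["partially"] * 3 + ["no"] * 3
prec1, prec2 = precision_pair(sample)  # 0.4 and 0.7
```

The overall figures reported above are averages of such per-search ratios over the 25 searches.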
During a search users judged the relevance of a number of documents online. The positive judgments were used for relevance feedback searches to retrieve additional documents. When the search was completed the user was presented with all the citations that were retrieved and was asked to evaluate them. Among the citations were also the ones that were judged relevant online. Here we examine how these same documents were assessed for relevance offline. Were these judged relevant? Is there consistency between the users' online and offline relevance evaluations? How do the relevance judgments break down? The two sets of relevance judgments are expected to correlate; the question is therefore not whether they correlate but how strong the correlation is. These issues relate to the variables of retrieval effectiveness and user effort discussed earlier.
| search | seen online | relevant online | precision online (2:1) | rel1 (offline assessment of column 2) | rel2 (offline assessment of column 2) | prec1, relevant online (4:2) | prec2, relevant online (5:2) | Prec1, seen online (4:1) | Prec2, seen online (5:1) |
| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) |
| 101 | 5 | 5 | 1.00 | 4 | 5 | 0.80 | 1.00 | 0.80 | 1.00 |
| 102 | 4 | 4 | 1.00 | 4 | 4 | 1.00 | 1.00 | 1.00 | 1.00 |
| 103 | 10 | 6 | 0.60 | 3 | 6 | 0.50 | 1.00 | 0.30 | 0.60 |
| 105 | 8 | 2 | 0.25 | 0 | 1 | 0 | 0.50 | 0 | 0.12 |
| 108 | 15 | 4 | 0.26 | 2 | 2 | 0.50 | 0.50 | 0.13 | 0.13 |
| 110 | 11 | 8 | 0.72 | 6 | 8 | 0.75 | 1.00 | 0.55 | 0.73 |
| 111 | 5 | 4 | 0.80 | 3 | 4 | 0.75 | 1.00 | 0.60 | 0.80 |
| 112 | 11 | 5 | 0.45 | 2 | 5 | 0.40 | 1.00 | 0.18 | 0.45 |
| 113 | 12 | 5 | 0.41 | 3 | 5 | 0.60 | 1.00 | 0.25 | 0.42 |
| 114 | 13 | 6 | 0.46 | 5 | 6 | 0.83 | 1.00 | 0.39 | 0.46 |
| 115 | 12 | 4 | 0.33 | 3 | 4 | 0.75 | 1.00 | 0.25 | 0.33 |
| 116 | 15 | 4 | 0.26 | 4 | 4 | 1.00 | 1.00 | 0.26 | 0.26 |
| 117 | 11 | 5 | 0.45 | 2 | 3 | 0.40 | 0.60 | 0.18 | 0.27 |
| 118 | 10 | 7 | 0.70 | 4 | 6 | 0.57 | 0.86 | 0.40 | 0.60 |
| 119 | 9 | 5 | 0.55 | 0 | 2 | 0 | 0.40 | 0 | 0.22 |
| 120 | 10 | 8 | 0.80 | 0 | 8 | 0.88 | 1.00 | 0 | 0.80 |
| 121 | 14 | 5 | 0.35 | 3 | 5 | 0.60 | 1.00 | 0.21 | 0.36 |
| 122 | 14 | 8 | 0.57 | 5 | 8 | 0.63 | 1.00 | 0.36 | 0.58 |
| 123 | 4 | 4 | 1.00 | 4 | 4 | 1.00 | 1.00 | 1.00 | 1.00 |
| 124 | 10 | 6 | 0.60 | 2 | 4 | 0.33 | 0.66 | 0.20 | 0.40 |
| 125 | 12 | 5 | 0.41 | 0 | 5 | 0 | 1.00 | 0 | 0.42 |
| 126 | 19 | 5 | 0.26 | 3 | 4 | 0.60 | 0.80 | 0.16 | 0.21 |
| 127 | 9 | 7 | 0.77 | 4 | 7 | 0.57 | 1.00 | 0.44 | 0.78 |
| 128 | 14 | 12 | 0.85 | 12 | 12 | 1.00 | 1.00 | 0.86 | 0.86 |
| 129 | 16 | 2 | 0.12 | 1 | 2 | 0.50 | 1.00 | 0.06 | 0.12 |
| Average Prec | | | 56% | | | 60% | 89% | 34% | 52% |
Table 2 presents the results of the comparison of how the same documents were judged for relevance first online and then offline. The table gives the total number of documents viewed online (column 1), the number of documents that were judged relevant online (column 2) and the corresponding precision ratio (column 3). The offline relevance assessments of the documents judged relevant online (column 2) are given for relevance1 (column 4) and relevance2 (column 5). The precision ratios for offline assessments over documents judged relevant online are given in columns 6-7, and the precision ratios for offline assessments over all documents seen online are given in columns 8-9.
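The derived columns of Table 2 follow directly from the four counts in columns 1, 2, 4 and 5. The sketch below recomputes them for one row; the function name and dictionary keys are illustrative, and the row used is search 118 as printed in the table.

```python
from typing import Dict


def table2_ratios(seen: int, relevant_online: int, rel1: int, rel2: int) -> Dict[str, float]:
    """Derive columns 3 and 6-9 of Table 2 from the counts in columns 1, 2, 4 and 5."""
    return {
        "precision_online": relevant_online / seen,  # column 3 = (2:1)
        "prec1": rel1 / relevant_online,             # column 6 = (4:2)
        "prec2": rel2 / relevant_online,             # column 7 = (5:2)
        "Prec1": rel1 / seen,                        # column 8 = (4:1)
        "Prec2": rel2 / seen,                        # column 9 = (5:1)
    }


# Search 118: 10 documents seen online, 7 judged relevant online, of which
# 4 were relevant and 6 relevant or partially relevant offline.
row_118 = {k: round(v, 2) for k, v in table2_ratios(10, 7, 4, 6).items()}
# {'precision_online': 0.7, 'prec1': 0.57, 'prec2': 0.86, 'Prec1': 0.4, 'Prec2': 0.6}
```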
Among the studies on relevance assessments the study by Resnick and Savage assessed the consistency of human judgments of relevance. They studied the ability of humans to judge consistently the relevance of documents to their general interest from different document representations, i.e. citations, abstracts, keywords and full text. They concluded that their subjects were able to make such judgments consistently.
Looking at the results of the online to offline correspondence of relevance judgments, i.e. columns (2) & (4) and (2) & (5) in Table 2, it can be established that the assessments are also quite consistent. Pearson's product-moment correlation is high between online assessments and relevance1 (r = 0.853) and even higher between online assessments and relevance2 (r = 0.932).
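For readers who wish to repeat this kind of comparison, the sketch below computes Pearson's product-moment correlation in plain Python. It is an illustration only: the function name is an assumption, and the sample lists hold columns 2, 4 and 5 for searches 115 to 129 alone, so the coefficients it yields are not expected to equal the values reported for all 25 searches.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return sxy / math.sqrt(sxx * syy)


# Columns 2, 4 and 5 of Table 2 for searches 115-129 only (a partial sample).
relevant_online = [4, 4, 5, 7, 5, 8, 5, 8, 4, 6, 5, 5, 7, 12, 2]
rel1_offline = [3, 4, 2, 4, 0, 0, 3, 5, 4, 2, 0, 3, 4, 12, 1]
rel2_offline = [4, 4, 3, 6, 2, 8, 5, 8, 4, 4, 5, 4, 7, 12, 2]

r_rel1 = pearson_r(relevant_online, rel1_offline)
r_rel2 = pearson_r(relevant_online, rel2_offline)
print(f"online vs relevance1: r = {r_rel1:.3f}")
print(f"online vs relevance2: r = {r_rel2:.3f}")
```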
The average precision of the positive online assessments (column 2 in Table 2) against the corresponding offline relevance assessments of relevance1 (column 4) is 59% (column 6) and of relevance2 (column 5) is 89% (column 7). The average precision achieved from the online relevance judgments (column 3) is 56%.
It would be of interest, from a relevance point of view, to analyze the documents that were seen online but were rejected as non-relevant. That is, the negative relevance judgments online could be printed and re-assessed for relevance by the users offline. This kind of analysis could not be done because documents judged non-relevant online were not saved.
Following the previous section on the correspondence of the online and offline relevance judgments, the next question to be addressed relates to the retrieval performance of the initial search as compared to the query expansion search. It is necessary to reiterate the procedure followed. Documents seen in the initial search and judged relevant were included in the offline prints; those judged non-relevant online were excluded. All documents retrieved in the query expansion search were included. However, for the purpose of the present analysis, we should consider the documents judged non-relevant online to have been retrieved.
First, the online relevance assessments are matched to the offline assessments for relevance1 and relevance2. Then precision is calculated for the offline assessments against the total number of documents seen online, as a measure of precision of the initial search. Secondly, the positive online relevance assessments are excluded from the final sets of assessments of relevance1 and relevance2. The result of this operation is two sets which contain the number of relevant documents that were retrieved from the query expansion search. The two sets are used for the calculation of the precision achieved in the query expansion searches. These figures are then compared to the precision of the initial search as described above in order to establish the levels of performance of the different stages of the search and see the effect of query expansion.
| search | relevant online | rel1 offline (of docs in column 1) | rel2 offline (of docs in column 1) | Rel1 total offline | Rel2 total offline | QErel1 (4-2) | QErel2 (5-3) | total printed offline | total QEretrieved (8-1) | QEprec1 (6:9) | QEprec2 (7:9) |
| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) | (10) | (11) |
| 101 | 5 | 4 | 5 | 23 | 44 | 19 | 39 | 46 | 41 | 0.46 | 0.95 |
| 102 | 4 | 4 | 4 | 14 | 20 | 10 | 16 | 25 | 21 | 0.48 | 0.76 |
| 103 | 6 | 3 | 6 | 19 | 26 | 16 | 20 | 26 | 20 | 0.80 | 1.00 |
| 105 | 2 | 0 | 1 | 14 | 31 | 14 | 30 | 66 | 64 | 0.22 | 0.47 |
| 108 | 4 | 2 | 2 | 10 | 17 | 8 | 15 | 57 | 53 | 0.15 | 0.28 |
| 110 | 8 | 6 | 8 | 18 | 24 | 12 | 16 | 53 | 41 | 0.29 | 0.39 |
| 111 | 4 | 3 | 4 | 6 | 17 | 3 | 13 | 18 | 14 | 0.21 | 0.93 |
| 112 | 5 | 2 | 5 | 5 | 9 | 3 | 4 | 48 | 43 | 0.07 | 0.09 |
| 113 | 5 | 3 | 5 | 5 | 9 | 2 | 4 | 17 | 12 | 0.17 | 0.33 |
| 114 | 6 | 5 | 6 | 8 | 15 | 3 | 9 | 17 | 11 | 0.27 | 0.82 |
| 115 | 4 | 3 | 4 | 14 | 24 | 11 | 20 | 25 | 21 | 0.52 | 0.95 |
| 116 | 4 | 4 | 4 | 18 | 37 | 14 | 33 | 37 | 33 | 0.42 | 1.00 |
| 117 | 5 | 2 | 3 | 2 | 9 | 0 | 6 | 18 | 13 | 0.00 | 0.46 |
| 118 | 7 | 4 | 6 | 10 | 13 | 6 | 7 | 21 | 14 | 0.43 | 0.50 |
| 119 | 5 | 0 | 2 | 20 | 44 | 20 | 42 | 48 | 43 | 0.47 | 0.98 |
| 120 | 8 | 0 | 8 | 0 | 8 | 0 | 0 | 21 | 13 | 0.00 | 0.00 |
| 121 | 5 | 3 | 5 | 12 | 18 | 9 | 13 | 21 | 16 | 0.56 | 0.81 |
| 122 | 8 | 5 | 8 | 18 | 32 | 13 | 24 | 35 | 27 | 0.48 | 0.89 |
| 123 | 4 | 4 | 4 | 17 | 27 | 13 | 23 | 28 | 24 | 0.54 | 0.96 |
| 124 | 6 | 2 | 4 | 10 | 17 | 8 | 13 | 20 | 14 | 0.57 | 0.93 |
| 125 | 5 | 0 | 5 | 7 | 14 | 7 | 9 | 16 | 11 | 0.64 | 0.82 |
| 126 | 5 | 3 | 4 | 3 | 4 | 0 | 0 | 18 | 13 | 0.00 | 0.00 |
| 127 | 7 | 4 | 7 | 7 | 12 | 3 | 5 | 25 | 18 | 0.17 | 0.28 |
| 128 | 12 | 12 | 12 | 29 | 29 | 17 | 17 | 32 | 20 | 0.85 | 0.85 |
| 129 | 2 | 1 | 2 | 8 | 15 | 7 | 13 | 21 | 19 | 0.37 | 0.68 |
| Average Precision | | | | | | | | | | 37% | 65% |
The number of documents assessed online for each of the 25 searches is given in the first column of Table 2 under the heading `seen online'. The number of citations assessed ranged from 4 to 19 with an average of 11 citations per search. The number of documents judged relevant online (column 2 of Table 2) ranged from 2 to 12 with an average of 5 citations per search. The precision achieved online, averaged over the 25 searches, is 56% (for an average of 5 relevant documents retrieved).
The precision of the positive online assessments as identified offline for relevance1 (column 4) and for relevance2 (column 5) against the number of documents seen online (column 1) is given in Table 2. The average precision, over the 25 searches, for relevance1 (column 8) is 34% (for an average of 3 relevance1 documents retrieved) and for relevance2 (column 9) is 52% (for an average of 5 relevance2 documents retrieved).
Table 3 presents the data of the query expansion part of the results. In order to facilitate the discussion some data presented earlier are repeated in this table. Column 1 gives the total number of documents judged relevant online. Column 2 and column 3 present the results of the offline relevance assessments of the documents in column 1 for relevance1 and relevance2, i.e. the number of those documents identified as relevant online and which were also judged to be relevant or partially relevant offline. Column 4 and column 5 give the total number of offline relevance judgments for relevance1 and relevance2 respectively. Column 6 gives the number of relevance1 documents retrieved by the query expansion search and is derived by subtracting column 2 from column 4. Similarly, column 7 presents the number of relevance2 documents retrieved by the query expansion search and is derived by subtracting column 3 from column 5. The total number of documents printed offline in each search is given in column 8 whereas column 9 presents the total number of documents attributed to the query expansion searches. The precision of the query expansion part of the searches is given in column 10 (QEprec1) and column 11 (QEprec2) respectively. QEprec1 is the ratio of QErel1 (column 6) over QEretrieved (column 9) and QEprec2 the ratio of QErel2 (column 7) over QEretrieved.
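The derivation of columns 6, 7 and 9 to 11 can be followed with a small worked example. This is a minimal sketch in Python under the column definitions given above; the function and key names are illustrative assumptions, and the input values are those of search 101 in Table 3.

```python
def qe_precision(relevant_online, rel1_online, rel2_online,
                 rel1_total, rel2_total, printed_offline):
    """Derive the query expansion figures of Table 3 from its raw counts.

    Arguments correspond to columns 1, 2, 3, 4, 5 and 8 respectively.
    """
    qe_rel1 = rel1_total - rel1_online                 # column 6 = (4) - (2)
    qe_rel2 = rel2_total - rel2_online                 # column 7 = (5) - (3)
    qe_retrieved = printed_offline - relevant_online   # column 9 = (8) - (1)
    # Guard against a search in which query expansion retrieved nothing.
    qe_prec1 = qe_rel1 / qe_retrieved if qe_retrieved else 0.0  # column 10
    qe_prec2 = qe_rel2 / qe_retrieved if qe_retrieved else 0.0  # column 11
    return {
        "QErel1": qe_rel1,
        "QErel2": qe_rel2,
        "QEretrieved": qe_retrieved,
        "QEprec1": qe_prec1,
        "QEprec2": qe_prec2,
    }


# Search 101: columns (1)-(5) are 5, 4, 5, 23, 44 and column (8) is 46.
search_101 = qe_precision(5, 4, 5, 23, 44, 46)
print(search_101)
```

For search 101 this gives QErel1 = 19, QErel2 = 39 and QEretrieved = 41, with precision values that round to the tabulated 0.46 and 0.95.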
The average precision for QErel1, over the 25 searches, is 37% (for an average of 9 relevance1 documents retrieved), and for QErel2 is 65% (for an average of 16 relevance2 documents retrieved). These results demonstrate that the query expansion search has been effective and that precision has been increased slightly for a substantial increase in the documents retrieved.
The level of precision for relevance1 is very similar for the initial and query expansion search, at 34% and 37% respectively. However, there is a threefold increase in the number of relevant documents retrieved by the query expansion search (i.e., from 3 to 9 documents). Precision for relevance2 increased from 52% in the initial search to 65% in the query expansion search. In addition, there is also a threefold increase in the number of relevant documents retrieved (i.e., from 5 to 16).
These results demonstrate that there is a consistency between the online and offline relevance judgments of documents by users. The consistency in judgments is also seen through the figures of the average precision achieved online and offline.
A reason for differences between online and offline assessments and between relevance1 and relevance2 is the combination of internal information (e.g. knowledge of the subject, how they intend to use the information, etc.) that the users brought in at the onset of the search and the learning that took place during the search and during the evaluation of the printouts of the search results. In other words, there are a few references that users see online which they judge for relevance with their state of knowledge at that moment. However, with each new piece of information they are exposed to, their knowledge changes or is restructured. The magnitude of the change depends on the amount of previous knowledge they had and can be coupled to the anticipated use of the stimulus. As restructuring takes place, the relevance assessment of an item may vary, mainly because of the potentially different usefulness perceived for it. The twelve options given to the users for assessing relevance capture changes and provide a detailed breakdown not found in other scales. However, changes due to learning are difficult to measure because they require in-depth interviews and are case specific. To answer those questions more studies like the present one are needed.
The final question is whether these measures of performance are meaningful to the users of an interactive retrieval system. As Cleverdon has commented, precision and recall may serve well for testing effectiveness of system components in a controlled environment but they do not necessarily serve well in information retrieval situations with real users. Su reported that user satisfaction with completeness of search results or value of search results as a whole appears to be the best single measure of successful IR performance. Some of the variables addressed in the questionnaires dealt with these issues and are presented in the following section.
Intended use of information. The intentions of the users regarding usage of the information that the search would provide are: PhD dissertation research (56%) and research projects (20%); that is, research related use accounts for 76% of the intended use. The remaining 24% is for course related use, that is, final year (capstone) projects. This pattern corresponds exactly to that from the results of the user status.
User's assessment of the nature of the enquiry. An accurate or precise assessment of the enquiry was indicated by 64% of the users. The remaining 36% indicated the nature as general and no one thought of it as vague.
Work done on the problem. The users' progress on the project for which they wanted to find information was measured on a scale from 1 to 5, where 1 represents beginning to think about the project and 5 is the end of the project. The results show a slightly skewed distribution, with 56% indicating that they were at the early stages of their projects, 20% half way through, another 20% approaching the end of the project and 4% at the very last stage of it.
Clarity of the problem. Data that express the users' own assessment of their level of knowledge about the subject of the project or reason that made them request the search were collected. Users could assess their knowledge on a 5-point scale where 1 represents `know-nothing' on the subject and 5 `know a lot'. The responses tend to concentrate in the center with 44% indicating the mid-scale value 3, 28% the value 2, 20% the value 4 and 8% were very confident about their knowledge. The responses that represent the first half of the scale amount to 72%. This correlates well with the 76% that represents the first half of the scale for the work done on the problem.
Type of search required. A broad search, i.e. one that aims to retrieve all the references including peripheral material, was chosen by 68% of the users, whereas 32% chose a narrow search, i.e. only very specific references.
Users' assessment of the results. The users' overall impression of the quality of the results was elicited. Users were asked to comment judging from what was seen during the search, i.e. before having a chance to look at the printouts. The responses were measured on a 5-point scale: excellent, good, satisfactory, poor, bad. The majority of the responses (92%) were on the positive end of the scale: 20% `satisfactory', 68% `good' and 4% `excellent'. Negative feeling about the results was very low (8% for `poor', and no responses for `bad'). The total of the most positive responses, i.e. excellent and good, amounts to 72%. This corresponds closely to the 71% overall precision level (calculated for relevance2). Judging from the correspondence of the online to offline relevance judgments and of the pre- and post-evaluation of the results it seems that the user responses are rather consistent throughout.
Match of search to enquiry. The user's feeling about the closeness of the online search to their original or intended enquiry was examined. The responses were measured on a 3-point scale. The closeness was 36% `exact' and 64% `fairly close' and no one felt that the search `considerably altered' their enquiry.
Expected references. Users were asked whether they felt that the number of references retrieved was as they expected it to be. This question was given before the evaluation of the search results. The responses were measured on a 3-point scale. Almost half of the users (48%) thought that the number of references retrieved was `more than expected', 32% said it was `less than expected', and 20% found it `as expected'. The comparison of the questions on `user status' with the `intended use of information', the `clarity of the problem' (i.e. user's knowledge on the subject) and the `expected references' shows that those users who felt they knew a lot about the subject could provide better estimates of how many references they were expecting to get from the search. On the other hand, users who did not know the subject matter well (most of those who were at the beginning stages) underestimated or overestimated the number of references.
User's satisfaction with the results. After users had completed the relevance assessments of the printouts of the search results they were given the last questionnaire that contained two questions that elicited information about their satisfaction with the references and whether, in retrospect, they thought of any other areas or concepts that they would like to search on.
An overwhelming 96% responded that they were satisfied with the references. Some 68% said that there was nothing else in relation to the original/searched query that they would like to search on. The 32% that expressed an interest in another search is further divided into those who wanted to repeat the search in another database (16% of the total), to search additional (usually more specific) concepts (12%) and to do the same search all over again (4%). Those who suggested searching another database felt that the subject matter of their queries was multidisciplinary and that INSPEC's coverage addressed only one aspect of it. Such subject matters included industrial chemistry (searches 101, 126) and medical informatics (search 108).
The results demonstrate that single posted terms should be eliminated from the list of the candidate terms for query expansion presented to users. In 42% of the searches users chose a term with collection frequency of one. By eliminating them users will be prevented from selecting terms that will retrieve only one document, which they have already seen. Single posted terms will contribute to poor retrieval results and cause disappointment and frustration to users.
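The recommendation amounts to a simple filter on the candidate list before it is shown to the user. The following is a minimal sketch in Python and not a description of the system used in the study; the function name, the `min_postings` parameter and the sample terms with their posting counts are hypothetical.

```python
def drop_single_posted(candidates, min_postings=2):
    """Keep only candidate terms posted to at least `min_postings` documents.

    `candidates` is a ranked list of (term, collection_frequency) pairs;
    the ranking order of the surviving terms is preserved.
    """
    return [(term, postings) for term, postings in candidates
            if postings >= min_postings]


# Hypothetical ranked candidate list with collection frequencies.
ranked_candidates = [
    ("relevance feedback", 412),
    ("term weighting", 87),
    ("quasi-synonym lists", 1),   # single posted: retrieves only a seen document
    ("search strategies", 230),
    ("okapi", 1),                 # single posted
]

shown_to_user = drop_single_posted(ranked_candidates)
for term, postings in shown_to_user:
    print(f"{term} ({postings} postings)")
```

A threshold parameter rather than a fixed test for a frequency of one leaves room for experimenting with stricter cut-offs, although the results above support only the removal of single posted terms.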
A pattern that emerges from the user selection of terms for query expansion suggests that about one third of the terms from a list of candidate terms are potentially useful. This finding has design implications for a module for the selection of candidate terms for query expansion and their presentation to the user. Such a facility can be applied to both types of query expansion, interactive or automatic.
The relationships between the terms chosen from the ranked list and the initial query terms were identified mostly as alternatives (related terms) or variant expressions (synonymous terms). A substantial portion of the selected terms was identified as representing new ideas. The analysis of the relationship of the 5 best terms chosen by the users to the query terms reveals that the hierarchical relationship predominates. Users mainly selected query expansion terms that were narrower terms to the corresponding query terms.
The average precision was 39% for relevance1 and 68% for relevance2. When these figures are matched to the responses concerning the users' satisfaction with the results, it is concluded that users were satisfied with this outcome. The correspondence of the online to the offline relevance judgments demonstrates an overall consistency in user judgments.
The results provide evidence for the effectiveness of interactive query expansion. The initial search produced on average 3 highly relevant documents at a precision of 34%; the query expansion search produced on average 9 further highly relevant documents at slightly higher precision.
Many questions emerge for future research and some are presented below. For example, of particular interest are the query expansion terms that were identified as representing new ideas. What was the reason that users chose these terms? Were users aware of these new concepts/ideas at the beginning of the search? Did users recognize and choose these concepts as the result of a learning process during the search? On the whole, what were the reason(s) and stimuli that made them choose these new terms? Answers to these questions will contribute to the understanding of the users' searching behavior.
The finding that most of the query expansion terms were identified as being hierarchically related to the query terms has user interface design implications for assisting users in their search. For example, in query expansion a thesaurus could be used for displaying the relationships of the selected terms to other terms. This can be done, for example, by displaying the hierarchical tree to which the term belongs (as in the INSPEC or MeSH tree displays) or by presenting broader, narrower or related terms under such headings on the screen for the user to browse and choose from. How could these relationships be expressed if a similarity thesaurus is used? Visualization techniques can also be employed to relate the terms to the thesaurus and the database and facilitate navigation and term selection.
Another set of important research questions is: Are the users able to recognize the good terms during the search and especially at the onset of the search? At what stage should interactive query expansion be implemented? Could interactive query expansion be useful at the query input stage, or would it be better to provide it after the initial search, or should it be an option available on request at any stage of the search? If interactive query expansion is implemented at the query input stage then it presupposes some other source for drawing the expansion terms than the relevant document set. What should that source be: a similarity thesaurus or some other knowledge structure? Furthermore, how does automatic query expansion compare to interactive query expansion during the query input stage? Which is preferable for that stage of the search?
Finally, this study points to the need for more research in the various aspects of interactive query expansion. Particular attention is drawn to the process of term selection by users because of its importance for understanding the users' searching behavior and its design implications for the user interface.
1 The terms interactive query expansion and semi-automatic query expansion are used interchangeably in the text.
2 N.B. All percentages have been rounded to the nearest integer.