Notes on Trec Eval
TRECEVAL is a program to evaluate TREC results using the standard NIST evaluation procedures.
Compilation should be reasonably system independent. In most cases, typing "make" in the unpacked directory should work. If the binary should be installed elsewhere, change "Makefile" accordingly.
The format for the command line is: trec_eval [-q] [-a] trec_rel_file trec_top_file
Here trec_eval is the name of the executable, -q is a parameter specifying detail for all queries, -a is a parameter specifying summary output only (-a and -q are mutually exclusive), trec_rel_file is the qrels file, and trec_top_file is the results file. (Remember that you may have to specify the path to the trec_rel_file (qrels) and the trec_top_file (results) on the command line when running trec_eval.)
The results file has the format: query_id, iter, docno, rank, sim, run_id, delimited by spaces. query_id is the query number (e.g. 136.6 or 1894, depending on the evaluation year). iter is a constant, 0, which is required but ignored by trec_eval. docno is the document number, a string value like FR940104-0-00001 (found between <DOCNO> tags in the documents). rank is an integer from 0 to 1000, which is required but ignored by the program. sim is the similarity, a float value. run_id is a string which is printed with the output. An example of a line from the results file:
351 0 FR940104-0-00001 1 42.38 run-name
Input is assumed to be sorted numerically by qid. Sim is assumed to be higher for the docs to be retrieved first.
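As an illustration of the results-file format described above, here is a minimal Python sketch that splits one line into its six fields. The names ResultLine and parse_result_line are hypothetical helpers introduced for this example only; they are not part of trec_eval.

```python
from typing import NamedTuple


class ResultLine(NamedTuple):
    """One line of a results file (hypothetical container for this sketch)."""
    qid: str
    iteration: str  # the "iter" field: required, but ignored by trec_eval
    docno: str
    rank: int       # required, but ignored by trec_eval
    sim: float
    run_id: str


def parse_result_line(line: str) -> ResultLine:
    """Split a whitespace-delimited results line into typed fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    qid, iteration, docno, rank, sim, run_id = fields
    return ResultLine(qid, iteration, docno, int(rank), float(sim), run_id)


example = parse_result_line("351 0 FR940104-0-00001 1 42.38 run-name")
```

The query id is kept as a string so that values such as 136.6 survive unchanged.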
Relevance for each docno to qid is determined from a qrels file, which consists of text tuples of the form qid iter docno rel, giving TREC document numbers (a string) and their relevance to query qid (an integer). Tuples are assumed to be sorted numerically by qid. The text tuples with relevance judgements are converted to TR_VEC form and then submitted to the SMART evaluation routines.
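The qrels tuples can be read in a similar way. The sketch below is an assumption-laden illustration, not the TR_VEC conversion itself: parse_qrels is a hypothetical helper, and the second sample document number is made up for the example.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map qid -> {docno: rel} from 'qid iter docno rel' tuples."""
    qrels = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        qid, _iteration, docno, rel = line.split()
        qrels[qid][docno] = int(rel)
    return dict(qrels)


# Sample tuples; FR940104-0-00002 is a made-up docno for illustration.
sample_qrels = parse_qrels([
    "351 0 FR940104-0-00001 1",
    "351 0 FR940104-0-00002 0",
])
```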
The procedure is to read all the docs retrieved for a query and all the relevant docs for that query, sort and rank the retrieved docs by sim/docno, and look up each docno in the relevant docs to determine relevance. The qid, did, rank, sim, and rel fields of TR_VEC are filled in; the action and iter fields are set to 0.
Queries for which there are no relevant docs are ignored (the retrieved docs are NOT written out).
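The ranking step for a single query might be sketched in Python as follows. This is only an illustration: rank_and_judge is a hypothetical name, and the tie-breaking direction (equal sim values ordered by descending docno) is an assumption, since the notes say only that docs are sorted by sim/docno.

```python
def rank_and_judge(retrieved, relevant):
    """Rank one query's retrieved docs and attach relevance.

    retrieved: list of (docno, sim) pairs for the query
    relevant:  set of docnos judged relevant for the query
    Returns a list of (rank, docno, sim, rel) tuples, best first.
    """
    # Higher sim first; ties broken by docno (direction is an assumption).
    ordered = sorted(retrieved, key=lambda pair: (pair[1], pair[0]), reverse=True)
    return [
        (rank, docno, sim, int(docno in relevant))
        for rank, (docno, sim) in enumerate(ordered, start=1)
    ]


# Made-up docnos and scores for illustration.
ranked = rank_and_judge([("d1", 0.5), ("d2", 0.9), ("d3", 0.5)], {"d3"})
```

Note that the rank field supplied in the input file plays no part here; the ranking is recomputed from sim.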
-q: In addition to summary evaluation, give evaluation for each query
interpolation, measures under consideration for future TRECs.
EXPLANATION OF OFFICIAL VALUES PRINTED
1. Total number of documents over all queries
Rel_ret: (relevant and retrieved)
These should be self-explanatory. All values are totals over all queries being evaluated.
2. Interpolated Recall - Precision Averages:
See any standard IR text for more details of recall-precision evaluation. Measures precision (percent of retrieved docs that are relevant) at various recall levels (after a certain percentage of all the relevant docs for that query have been retrieved). "Interpolated" means that, for example, precision at recall 0.10 (i.e., after 10% of rel docs for a query have been retrieved) is taken to be the MAXIMUM of precision at all recall points >= 0.10. Values are averaged over all queries (for each of the 11 recall levels).
These values are used for Recall-Precision graphs.
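A minimal Python sketch of the interpolation rule for one query is given below. It assumes the 11 recall levels are 0.0, 0.1, ..., 1.0 and that the ranked list is supplied as 0/1 relevance flags; the function name interpolated_precision is hypothetical.

```python
def interpolated_precision(rels, num_rel, levels=None):
    """Interpolated precision at each recall level for one query.

    rels:    0/1 relevance flags of the retrieved docs, in rank order
    num_rel: total number of relevant docs for the query (must be > 0)
    """
    if levels is None:
        levels = [i / 10 for i in range(11)]  # assumed: 0.0, 0.1, ..., 1.0
    points = []  # (recall, precision) after each relevant doc retrieved
    hits = 0
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            points.append((hits / num_rel, hits / k))
    # Precision at a level is the maximum precision at any recall >= level,
    # or 0.0 if that recall level is never reached.
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]


# Two relevant docs in total, retrieved at ranks 1 and 3.
curve = interpolated_precision([1, 0, 1, 0], num_rel=2)
```

Averaging these per-query lists level by level gives the values used for the recall-precision graph.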
3. Average precision (non-interpolated) over all rel docs
The precision is calculated after each relevant doc is retrieved. If a relevant doc is not retrieved, the precision is 0.0. All precision values are then averaged together to get a single number for the performance of a query. Conceptually this is the area underneath the recall-precision graph for the query.
The values are then averaged over all queries.
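For one query, the calculation can be sketched in Python as below, again assuming the ranked list is given as 0/1 relevance flags. The function name average_precision is hypothetical.

```python
def average_precision(rels, num_rel):
    """Non-interpolated average precision for one query.

    rels:    0/1 relevance flags of the retrieved docs, in rank order
    num_rel: total number of relevant docs for the query (must be > 0)
    """
    hits = 0
    total = 0.0
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision after this relevant doc
    # Relevant docs that were never retrieved contribute 0.0 each,
    # which is why the sum is divided by num_rel rather than by hits.
    return total / num_rel


# Three relevant docs in total; two retrieved, at ranks 1 and 3.
ap = average_precision([1, 0, 1, 0], num_rel=3)
```

Here the result is (1/1 + 2/3 + 0) / 3, roughly 0.556.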
4. Precision at X docs:
at 5 docs
at 10 docs
at 1000 docs
The precision (percent of retrieved docs that are relevant) after X documents (whether relevant or nonrelevant) have been retrieved.
Values averaged over all queries. If X docs were not retrieved for a query, then all missing docs are assumed to be non-relevant.
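A short Python sketch of precision after X documents for one query, with the ranked list given as 0/1 relevance flags (the name precision_at is hypothetical):

```python
def precision_at(rels, x):
    """Precision after x docs; docs missing from a short list count as non-relevant."""
    # Dividing by x (not by the number actually retrieved) is what treats
    # the missing docs as non-relevant.
    return sum(rels[:x]) / x


p_at_2 = precision_at([1, 0, 1, 0], 2)
p_at_5_short_list = precision_at([1, 0, 1], 5)  # only 3 docs were retrieved
```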
5. R-Precision (precision after R (= num_rel for a query) docs retrieved):
New measure, intended mainly to be used for routing environments. Measures precision (or recall, they're the same) after R docs have been retrieved, where R is the total number of relevant docs for a query. Thus if a query has 40 relevant docs, then precision is measured after 40 docs, while if it has 600 relevant docs, precision is measured after 600 docs. This avoids some of the averaging problems of the "precision at X docs" values in (4) above. If R is greater than the number of docs retrieved for a query, then the nonretrieved docs are all assumed to be nonrelevant.
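R-Precision for one query can be sketched the same way; r_precision is a hypothetical name and the input is again a list of 0/1 relevance flags in rank order.

```python
def r_precision(rels, num_rel):
    """Precision after R = num_rel docs have been retrieved (num_rel > 0)."""
    # The cutoff and the denominator are both R, so precision and recall
    # coincide at this point. Docs beyond the end of a short list are
    # counted as nonrelevant.
    return sum(rels[:num_rel]) / num_rel


rp = r_precision([1, 0, 1, 1, 0], num_rel=4)
rp_short_list = r_precision([1, 1], num_rel=4)  # fewer than R docs retrieved
```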