Notes on Trec Eval
trec_eval is a program to evaluate TREC results using the standard NIST evaluation procedures.
INSTALLATION:
Compilation should be reasonably system independent. In most cases, typing "make" in the unpacked directory should work. If the binary should be installed elsewhere, change "Makefile" accordingly.
The format for the command line is:

    trec_eval [-q] [-a] trec_rel_file trec_top_file
where trec_eval is the executable name for the code, -q requests per-query detail in addition to the summary evaluation, -a requests all evaluation measures rather than just the official ones, trec_rel_file is the qrels file, and trec_top_file is the results file. (Remember, you may have to specify the paths to the trec_rel_file (qrels) and the trec_top_file (results) on the command line when running trec_eval.)
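For example, assuming a qrels file named qrels.test and a results file named myrun.results (both hypothetical file names) sit in the current directory, a per-query evaluation would be requested with:

    trec_eval -q qrels.test myrun.results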
The results file has the format: query_id, iter, docno, rank, sim, run_id, delimited by spaces. query_id is the query number (e.g. 136.6 or 1894, depending on the evaluation year). The iter constant, 0, is required but ignored by trec_eval. Document numbers are string values like FR940104-0-00001 (found between <DOCNO> tags in the documents). Similarity (sim) is a float value. Rank is an integer from 0 to 1000, which is required but ignored by the program. run_id is a string which gets printed out with the output. An example line from the results file:

    351 0 FR940104-0-00001 1 42.38 run-name
Input is assumed to be sorted numerically by qid. Sim is assumed to be higher for the docs to be retrieved first.
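As an illustration only (not code from trec_eval), the following Python sketch reads a results file in the format above into per-query lists and orders each list by descending sim; the secondary tie-break on docno is an assumption made here to keep the ordering deterministic.

    from collections import defaultdict

    def read_results(path):
        """Read a TREC results file: qid iter docno rank sim run_id."""
        runs = defaultdict(list)              # qid -> [(docno, sim), ...]
        with open(path) as f:
            for line in f:
                qid, _iter, docno, _rank, sim, _run_id = line.split()
                runs[qid].append((docno, float(sim)))
        for qid in runs:
            # Higher sim first; docno only makes ties deterministic here.
            runs[qid].sort(key=lambda pair: (-pair[1], pair[0]))
        return runs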
Relevance for each docno to qid is determined from a qrels file, which consists of text tuples of the form qid iter docno rel, giving TREC document numbers (a string) and their relevance to query qid (an integer). Tuples are assumed to be sorted numerically by qid. The text tuples with relevance judgements are converted to TR_VEC form and then submitted to the SMART evaluation routines.
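A matching sketch for the qrels file (again illustrative Python, not the actual TR_VEC conversion) builds a lookup from (qid, docno) to the relevance judgement:

    def read_qrels(path):
        """Read a qrels file: qid iter docno rel."""
        rels = {}
        with open(path) as f:
            for line in f:
                qid, _iter, docno, rel = line.split()
                rels[(qid, docno)] = int(rel)
        return rels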
The procedure is to read all the docs retrieved for a query and all the relevant docs for that query, sort and rank the retrieved docs by sim/docno, and look up each docno in the relevant docs to determine its relevance. The qid, did, rank, sim, rel fields of TR_VEC are filled in; the action and iter fields are set to 0. Queries for which there are no relevant docs are ignored (the retrieved docs are NOT written out).
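Putting the two readers together, a rough Python rendering of this procedure (a sketch, not trec_eval's TR_VEC code) ranks the retrieved docs for one query, attaches a relevance flag, and skips queries that have no relevant docs:

    def judged_ranking(runs, rels, qid):
        """Return (qid, docno, rank, sim, rel) tuples for one query."""
        num_rel = sum(1 for (q, _), r in rels.items() if q == qid and r > 0)
        if num_rel == 0:
            return []          # queries with no relevant docs are ignored
        ranking = []
        for rank, (docno, sim) in enumerate(runs.get(qid, []), start=1):
            rel = rels.get((qid, docno), 0)
            ranking.append((qid, docno, rank, sim, rel))
        return ranking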
-q: In addition to summary evaluation, give evaluation for each query.
-a: Print all evaluation measures, including measures that use interpolation and measures under consideration for future TRECs.
EXPLANATION OF OFFICIAL VALUES PRINTED
1. Total number of documents over all queries
Retrieved:
Relevant:
Rel_ret: (relevant and retrieved)
These should be self-explanatory. All values are totals over all queries being evaluated.
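In terms of the hypothetical sketches above, the three totals for a single query could be computed as follows; trec_eval then sums them over all evaluated queries:

    def totals(ranking, rels, qid):
        """Retrieved, relevant, and relevant-retrieved counts for one query."""
        retrieved = len(ranking)
        relevant = sum(1 for (q, _), r in rels.items() if q == qid and r > 0)
        rel_ret = sum(1 for (_, _, _, _, rel) in ranking if rel > 0)
        return retrieved, relevant, rel_ret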
2. Interpolated Recall - Precision Averages:
at 0.00
at 0.10
...
at 1.00
See any standard IR text for more details of recall-precision evaluation. This measures precision (percent of retrieved docs that are relevant) at various recall levels (i.e., after a certain percentage of all the relevant docs for that query have been retrieved). "Interpolated" means that, for example, precision at recall 0.10 (i.e., after 10% of the relevant docs for a query have been retrieved) is taken to be the MAXIMUM of precision at all recall points >= 0.10. Values are averaged over all queries (for each of the 11 recall levels).
These values are used for Recall-Precision graphs.
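Continuing the illustrative Python sketches, the 11-point interpolated values for one query can be computed from a judged ranking and the number of relevant docs for that query (num_rel > 0); trec_eval then averages the per-query values:

    def interpolated_precision(ranking, num_rel):
        """Precision at recall levels 0.0, 0.1, ..., 1.0 for one query."""
        points, rel_so_far = [], 0
        for i, (_, _, _, _, rel) in enumerate(ranking, start=1):
            if rel > 0:
                rel_so_far += 1
            points.append((rel_so_far / num_rel, rel_so_far / i))  # (recall, precision)
        levels = [i / 10.0 for i in range(11)]
        # Interpolated precision at level r = max precision at any recall >= r.
        return [max((p for r, p in points if r >= level), default=0.0)
                for level in levels]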
3. Average precision (non-interpolated) over all rel docs
The precision is calculated after each relevant doc is retrieved. If a relevant doc is not retrieved, its precision is 0.0. All precision values are then averaged together to get a single number for the performance of a query. Conceptually, this is the area underneath the recall-precision graph for the query. The values are then averaged over all queries.
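A corresponding sketch for non-interpolated average precision over one query; relevant docs that are never retrieved contribute 0.0 because the sum is still divided by num_rel:

    def average_precision(ranking, num_rel):
        """Mean of the precision values at each retrieved relevant doc."""
        rel_so_far, prec_sum = 0, 0.0
        for i, (_, _, _, _, rel) in enumerate(ranking, start=1):
            if rel > 0:
                rel_so_far += 1
                prec_sum += rel_so_far / i   # precision at this relevant doc
        return prec_sum / num_rel if num_rel else 0.0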
4. Precision:
at 5 docs
at 10 docs
...
at 1000 docs
The precision (percent of retrieved docs that are relevant) after X documents (whether relevant or nonrelevant) have been retrieved. Values are averaged over all queries. If X docs were not retrieved for a query, then all missing docs are assumed to be non-relevant.
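In the same illustrative style, precision at a document cutoff for one query; keeping the divisor at X means any missing (unretrieved) positions are effectively counted as non-relevant:

    def precision_at(ranking, x):
        """Precision after x docs; short rankings are padded implicitly."""
        rel_ret = sum(1 for (_, _, _, _, rel) in ranking[:x] if rel > 0)
        return rel_ret / float(x)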
5. R-Precision (precision after R (= num_rel for a query) docs retrieved):
A new measure, intended mainly to be used for routing environments. It measures precision (or recall; they're the same at rank R) after R docs have been retrieved, where R is the total number of relevant docs for a query. Thus if a query has 40 relevant docs, precision is measured after 40 docs, while if it has 600 relevant docs, precision is measured after 600 docs. This avoids some of the averaging problems of the "precision at X docs" values in (4) above. If R is greater than the number of docs retrieved for a query, then the nonretrieved docs are all assumed to be nonrelevant.
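Finally, a sketch of R-Precision for one query, with the same convention that positions beyond the end of the retrieved list count as nonrelevant:

    def r_precision(ranking, num_rel):
        """Precision after R = num_rel docs for one query."""
        if num_rel == 0:
            return 0.0
        rel_ret = sum(1 for (_, _, _, _, rel) in ranking[:num_rel] if rel > 0)
        return rel_ret / float(num_rel)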