TY - GEN
T1 - Variation of relevance assessments for medical image retrieval
AU - Müller, Henning
AU - Clough, Paul
AU - Hersh, Bill
AU - Geissbühler, Antoine
PY - 2007
Y1 - 2007
AB - Evaluation is crucial for the success of most research domains, and image retrieval is no exception. Recently, several benchmarks, such as TRECVID, ImageCLEF, and ImagEval, have been developed to provide frameworks for evaluating visual information retrieval research. An important part of evaluation is the creation of a ground truth or gold standard to evaluate systems against. Much experience has been gained in creating ground truths for textual information retrieval, but for image retrieval these issues require further research. This article presents the process of generating relevance judgements for the medical image retrieval task of ImageCLEF. Many of the problems encountered can be generalised to other image retrieval tasks as well, so the outcome is not limited to the medical domain. A subset of the images assessed for relevance was judged by two assessors, and these double judgements are analysed with respect to their consistency and potential problems. Our goal is to obtain more information on the ambiguity of the topics developed and, more generally, to keep the variation amongst relevance assessors low. This might partially reduce the subjectivity of system-oriented evaluation, although the evaluation shows that differences in relevance judgements have only a limited influence on comparative system ranking. Several outcomes are presented with the goal of creating less ambiguous topics for future evaluation campaigns.
UR - http://www.scopus.com/inward/record.url?scp=38049125886&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=38049125886&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-71545-0_18
DO - 10.1007/978-3-540-71545-0_18
M3 - Conference contribution
AN - SCOPUS:38049125886
SN - 9783540715443
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 232
EP - 246
BT - Adaptive Multimedia Retrieval
PB - Springer-Verlag
T2 - 4th International Workshop on Adaptive Multimedia Retrieval, AMR 2006
Y2 - 27 July 2006 through 28 July 2006
ER -