Establishing a Gold Standard for Test Sets. Variation in Interpretive Agreement of Expert Mammographers

Tracy Onega, Melissa L. Anderson, Diana L. Miglioretti, Diana S M Buist, Berta Geller, Andy Bogart, Robert A. Smith, Edward A. Sickles, Barbara Monsees, Lawrence Bassett, Patricia (Patty) Carney, Karla Kerlikowske, Bonnie C. Yankaskas

Research output: Contribution to journal › Article

8 Citations (Scopus)

Abstract

Rationale and Objectives: Test sets for assessing and improving radiologic image interpretation have been used for decades and typically evaluate performance relative to gold standard interpretations by experts. To assess test sets for screening mammography, a gold standard for whether a woman should be recalled for additional workup is needed, given that interval cancers may be occult on mammography and some findings ultimately determined to be benign require additional imaging to determine if biopsy is warranted. Using experts to set a gold standard assumes little variation occurs in their interpretations, but this has not been explicitly studied in mammography.

Materials and Methods: Using digitized films from 314 screening mammography exams (n = 143 cancer cases) performed in the Breast Cancer Surveillance Consortium, we evaluated interpretive agreement among three expert radiologists who independently assessed whether each examination should be recalled and, for recalled images, the lesion location, finding type (mass, calcification, asymmetric density, or architectural distortion), and interpretive difficulty.

Results: Agreement among the three expert pairs for recall/no recall was higher for cancer cases (mean 74.3 ± 6.5) than for noncancers (mean 62.6 ± 7.1). Complete agreement on recall, lesion location, finding type, and difficulty ranged from 36.4% to 42.0% for cancer cases and from 43.9% to 65.6% for noncancer cases. Two of three experts agreed on recall and lesion location for 95.1% of cancer cases and 91.8% of noncancer cases, but all three experts agreed on only 55.2% of cancer cases and 42.1% of noncancer cases.

Conclusion: Variability in expert interpretation is notable. A minimum of three independent experts combined with a consensus should be used for establishing any gold standard interpretation for test sets, especially for noncancer cases.
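The agreement measures reported in the Results (pairwise agreement, two-of-three agreement, and agreement among all three experts) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the reader names, the label strings, and the five toy exams are invented and are not study data, and the abstract does not describe the software or exact procedure the authors used.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, stdev

# Hypothetical reads for five exams. Each label combines the recall decision
# with a coarse lesion location. All values are invented for illustration.
reads = {
    "expert_a": ["recall:left", "recall:right", "no_recall", "recall:left", "no_recall"],
    "expert_b": ["recall:left", "recall:right", "no_recall", "recall:right", "recall:left"],
    "expert_c": ["recall:left", "no_recall", "no_recall", "no_recall", "recall:left"],
}


def pairwise_agreement(x, y):
    """Percentage of exams on which two readers gave the same label."""
    return 100.0 * sum(a == b for a, b in zip(x, y)) / len(x)


def consensus(labels):
    """Label chosen by at least two of three readers, or None if all differ."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= 2 else None


# Agreement for each of the three expert pairs, then mean and sample SD.
pair_pct = [
    pairwise_agreement(reads[r1], reads[r2])
    for r1, r2 in combinations(reads, 2)
]
mean_pct = mean(pair_pct)
sd_pct = stdev(pair_pct)

# Per-exam view: one tuple of three labels per exam.
per_exam = list(zip(*reads.values()))
n_exams = len(per_exam)
all_three_pct = 100.0 * sum(len(set(labels)) == 1 for labels in per_exam) / n_exams
two_of_three_pct = 100.0 * sum(consensus(labels) is not None for labels in per_exam) / n_exams
gold_standard = [consensus(labels) for labels in per_exam]

print(f"pairwise agreement: {mean_pct:.1f} ± {sd_pct:.1f}")
print(f"two of three agree: {two_of_three_pct:.1f}%")
print(f"all three agree:    {all_three_pct:.1f}%")
print(gold_standard)
```

In this sketch an exam with no majority gets `None` as its consensus label, which marks it as one that would need adjudication, for example through a consensus discussion among the readers. With a recall-only (binary) label, at least two of three readers always agree, so the two-of-three measure is only informative when the label includes more than the recall decision, as it does here.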

Original language: English (US)
Pages (from-to): 731-739
Number of pages: 9
Journal: Academic Radiology
Volume: 20
Issue number: 6
DOIs: https://doi.org/10.1016/j.acra.2013.01.012
State: Published - Jun 2013

Fingerprint

Mammography
Neoplasms
Breast Neoplasms
Biopsy

Keywords

  • Expert opinion
  • Gold standard
  • Mammography
  • Test set
  • Variation

ASJC Scopus subject areas

  • Radiology, Nuclear Medicine and Imaging

Cite this

Onega, T., Anderson, M. L., Miglioretti, D. L., Buist, D. S. M., Geller, B., Bogart, A., ... Yankaskas, B. C. (2013). Establishing a Gold Standard for Test Sets. Variation in Interpretive Agreement of Expert Mammographers. Academic Radiology, 20(6), 731-739. https://doi.org/10.1016/j.acra.2013.01.012

@article{39552763af444d578d310494e009b0d7,
title = "Establishing a Gold Standard for Test Sets. Variation in Interpretive Agreement of Expert Mammographers",
keywords = "Expert opinion, Gold standard, Mammography, Test set, Variation",
author = "Tracy Onega and Anderson, {Melissa L.} and Miglioretti, {Diana L.} and Buist, {Diana S M} and Berta Geller and Andy Bogart and Smith, {Robert A.} and Sickles, {Edward A.} and Barbara Monsees and Lawrence Bassett and Carney, {Patricia (Patty)} and Karla Kerlikowske and Yankaskas, {Bonnie C.}",
year = "2013",
month = "6",
doi = "10.1016/j.acra.2013.01.012",
language = "English (US)",
volume = "20",
pages = "731--739",
journal = "Academic Radiology",
issn = "1076-6332",
publisher = "Elsevier USA",
number = "6",

}

TY - JOUR

T1 - Establishing a Gold Standard for Test Sets. Variation in Interpretive Agreement of Expert Mammographers

AU - Onega, Tracy

AU - Anderson, Melissa L.

AU - Miglioretti, Diana L.

AU - Buist, Diana S M

AU - Geller, Berta

AU - Bogart, Andy

AU - Smith, Robert A.

AU - Sickles, Edward A.

AU - Monsees, Barbara

AU - Bassett, Lawrence

AU - Carney, Patricia (Patty)

AU - Kerlikowske, Karla

AU - Yankaskas, Bonnie C.

PY - 2013/6

Y1 - 2013/6

KW - Expert opinion

KW - Gold standard

KW - Mammography

KW - Test set

KW - Variation

UR - http://www.scopus.com/inward/record.url?scp=84877340589&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84877340589&partnerID=8YFLogxK

U2 - 10.1016/j.acra.2013.01.012

DO - 10.1016/j.acra.2013.01.012

M3 - Article

C2 - 23664400

AN - SCOPUS:84877340589

VL - 20

SP - 731

EP - 739

JO - Academic Radiology

JF - Academic Radiology

SN - 1076-6332

IS - 6

ER -