TY - JOUR
T1 - Improving the repeatability of deep learning models with Monte Carlo dropout
AU - Lemay, Andreanne
AU - Hoebel, Katharina
AU - Bridge, Christopher P.
AU - Befano, Brian
AU - De Sanjosé, Silvia
AU - Egemen, Didem
AU - Rodriguez, Ana Cecilia
AU - Schiffman, Mark
AU - Campbell, John Peter
AU - Kalpathy-Cramer, Jayashree
N1 - Funding Information:
We’d like to thank our funding partners for making this research possible. Funded by the National Institutes of Health (Bethesda, MD) [R01 HD107493], by an investigator-initiated grant from Genentech (San Francisco, CA) [R21 EY031883], and by unrestricted departmental funding and a Career Development Award (JPC) from Research to Prevent Blindness (New York, NY) [P30 EY10572]. A.L. has a scholarship from Mitacs [IT24359], NSERC, and the “Fondation et Alumni de Polytechnique Montréal”. B.B. has a fellowship from NCI/NIH [T32CA09168]. GPU computing resources were provided by the MGH and BWH Center for Clinical Data Science.
Publisher Copyright:
© 2022, The Author(s).
PY - 2022/12
Y1 - 2022/12
N2 - The integration of artificial intelligence into clinical workflows requires reliable and robust models. Repeatability is a key attribute of model robustness. Ideally, repeatable models output predictions without variation during independent tests carried out under similar conditions. However, slight variations, though not ideal, may be unavoidable and acceptable in practice. During model development and evaluation, much attention is given to classification performance, while model repeatability is rarely assessed, leading to the development of models that are unusable in clinical practice. In this work, we evaluate the repeatability of four model types (binary classification, multi-class classification, ordinal classification, and regression) on images acquired from the same patient during the same visit. We study each model’s performance on four medical image classification tasks from public and private datasets: knee osteoarthritis, cervical cancer screening, breast density estimation, and retinopathy of prematurity. Repeatability is measured and compared on ResNet and DenseNet architectures. Moreover, we assess the impact of sampling Monte Carlo dropout predictions at test time on classification performance and repeatability. Leveraging Monte Carlo predictions significantly increases repeatability, in particular at the class boundaries, for all tasks on the binary, multi-class, and ordinal models, leading to an average reduction of the 95% limits of agreement by 16 percentage points and of the class disagreement rate by 7 percentage points. The classification accuracy improves in most settings along with the repeatability. Our results suggest that beyond about 20 Monte Carlo iterations, there is no further gain in repeatability. In addition to the higher test-retest agreement, Monte Carlo predictions are better calibrated, which leads to output probabilities that more accurately reflect the true likelihood of being correctly classified.
AB - The integration of artificial intelligence into clinical workflows requires reliable and robust models. Repeatability is a key attribute of model robustness. Ideally, repeatable models output predictions without variation during independent tests carried out under similar conditions. However, slight variations, though not ideal, may be unavoidable and acceptable in practice. During model development and evaluation, much attention is given to classification performance, while model repeatability is rarely assessed, leading to the development of models that are unusable in clinical practice. In this work, we evaluate the repeatability of four model types (binary classification, multi-class classification, ordinal classification, and regression) on images acquired from the same patient during the same visit. We study each model’s performance on four medical image classification tasks from public and private datasets: knee osteoarthritis, cervical cancer screening, breast density estimation, and retinopathy of prematurity. Repeatability is measured and compared on ResNet and DenseNet architectures. Moreover, we assess the impact of sampling Monte Carlo dropout predictions at test time on classification performance and repeatability. Leveraging Monte Carlo predictions significantly increases repeatability, in particular at the class boundaries, for all tasks on the binary, multi-class, and ordinal models, leading to an average reduction of the 95% limits of agreement by 16 percentage points and of the class disagreement rate by 7 percentage points. The classification accuracy improves in most settings along with the repeatability. Our results suggest that beyond about 20 Monte Carlo iterations, there is no further gain in repeatability. In addition to the higher test-retest agreement, Monte Carlo predictions are better calibrated, which leads to output probabilities that more accurately reflect the true likelihood of being correctly classified.
UR - http://www.scopus.com/inward/record.url?scp=85142085581&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85142085581&partnerID=8YFLogxK
U2 - 10.1038/s41746-022-00709-3
DO - 10.1038/s41746-022-00709-3
M3 - Article
AN - SCOPUS:85142085581
SN - 2398-6352
VL - 5
JO - npj Digital Medicine
JF - npj Digital Medicine
IS - 1
M1 - 174
ER -