TY - JOUR
T1 - Improving the repeatability of deep learning models with Monte Carlo dropout
AU - Lemay, Andreanne
AU - Hoebel, Katharina
AU - Bridge, Christopher P.
AU - Befano, Brian
AU - De Sanjosé, Silvia
AU - Egemen, Didem
AU - Rodriguez, Ana Cecilia
AU - Schiffman, Mark
AU - Campbell, John Peter
AU - Kalpathy-Cramer, Jayashree
N1 - Funding Information:
We’d like to thank our funding partners for making this research possible. Funded by the National Institutes of Health (Bethesda, MD) [R01 HD107493], by an investigator-initiated grant from Genentech (San Francisco, CA) [R21 EY031883], and by unrestricted departmental funding and a Career Development Award (JPC) from Research to Prevent Blindness (New York, NY) [P30 EY10572]. A.L. has a scholarship from Mitacs [IT24359], NSERC, and the “Fondation et Alumni de Polytechnique Montréal”. B.B. has a fellowship from NCI/NIH [T32CA09168]. GPU computing resources were provided by the MGH and BWH Center for Clinical Data Science.
Publisher Copyright:
© 2022, The Author(s).
PY - 2022/12
Y1 - 2022/12
N2 - The integration of artificial intelligence into clinical workflows requires reliable and robust models. Repeatability is a key attribute of model robustness. Ideally, repeatable models output predictions without variation during independent tests carried out under similar conditions. However, slight variations, though not ideal, may be unavoidable and acceptable in practice. During model development and evaluation, much attention is given to classification performance, while model repeatability is rarely assessed, leading to the development of models that are unusable in clinical practice. In this work, we evaluate the repeatability of four model types (binary classification, multi-class classification, ordinal classification, and regression) on images acquired from the same patient during the same visit. We study each model’s performance on four medical image classification tasks from public and private datasets: knee osteoarthritis, cervical cancer screening, breast density estimation, and retinopathy of prematurity. Repeatability is measured and compared on ResNet and DenseNet architectures. Moreover, we assess the impact of sampling Monte Carlo dropout predictions at test time on classification performance and repeatability. Leveraging Monte Carlo predictions significantly increases repeatability, in particular at the class boundaries, for all tasks on the binary, multi-class, and ordinal models, leading to an average reduction of the 95% limits of agreement by 16 percentage points and of the class disagreement rate by 7 percentage points. The classification accuracy improves in most settings along with the repeatability. Our results suggest that beyond about 20 Monte Carlo iterations, there is no further gain in repeatability. In addition to the higher test-retest agreement, Monte Carlo predictions are better calibrated, which leads to output probabilities that more accurately reflect the true likelihood of being correctly classified.
AB - The integration of artificial intelligence into clinical workflows requires reliable and robust models. Repeatability is a key attribute of model robustness. Ideally, repeatable models output predictions without variation during independent tests carried out under similar conditions. However, slight variations, though not ideal, may be unavoidable and acceptable in practice. During model development and evaluation, much attention is given to classification performance, while model repeatability is rarely assessed, leading to the development of models that are unusable in clinical practice. In this work, we evaluate the repeatability of four model types (binary classification, multi-class classification, ordinal classification, and regression) on images acquired from the same patient during the same visit. We study each model’s performance on four medical image classification tasks from public and private datasets: knee osteoarthritis, cervical cancer screening, breast density estimation, and retinopathy of prematurity. Repeatability is measured and compared on ResNet and DenseNet architectures. Moreover, we assess the impact of sampling Monte Carlo dropout predictions at test time on classification performance and repeatability. Leveraging Monte Carlo predictions significantly increases repeatability, in particular at the class boundaries, for all tasks on the binary, multi-class, and ordinal models, leading to an average reduction of the 95% limits of agreement by 16 percentage points and of the class disagreement rate by 7 percentage points. The classification accuracy improves in most settings along with the repeatability. Our results suggest that beyond about 20 Monte Carlo iterations, there is no further gain in repeatability. In addition to the higher test-retest agreement, Monte Carlo predictions are better calibrated, which leads to output probabilities that more accurately reflect the true likelihood of being correctly classified.
UR - http://www.scopus.com/inward/record.url?scp=85142085581&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85142085581&partnerID=8YFLogxK
U2 - 10.1038/s41746-022-00709-3
DO - 10.1038/s41746-022-00709-3
M3 - Article
AN - SCOPUS:85142085581
SN - 2398-6352
VL - 5
JO - npj Digital Medicine
JF - npj Digital Medicine
IS - 1
M1 - 174
ER -