Simulation studies as designed experiments: The comparison of penalized regression models in the "large p, small n" setting

Elias Chaibub Neto; J. Christopher Bare; Adam A. Margolin

doi:10.1371/journal.pone.0107957

Research output: Contribution to journal › Article › peer-review

12 Scopus citations

Abstract

New algorithms are continuously proposed in computational biology. Performance evaluation of novel methods is important in practice. Nonetheless, the field experiences a lack of rigorous methodology aimed to systematically and objectively evaluate competing approaches. Simulation studies are frequently used to show that a particular method outperforms another. Often times, however, simulation studies are not well designed, and it is hard to characterize the particular conditions under which different methods perform better. In this paper we propose the adoption of well established techniques in the design of computer and physical experiments for developing effective simulation studies. By following best practices in planning of experiments we are better able to understand the strengths and weaknesses of competing algorithms leading to more informed decisions about which method to use for a particular task. We illustrate the application of our proposed simulation framework with a detailed comparison of the ridge-regression, lasso and elastic-net algorithms in a large scale study investigating the effects on predictive performance of sample size, number of features, true model sparsity, signal-to-noise ratio, and feature correlation, in situations where the number of covariates is usually much larger than sample size. Analysis of data sets containing tens of thousands of features but only a few hundred samples is nowadays routine in computational biology, where "omics" features such as gene expression, copy number variation and sequence data are frequently used in the predictive modeling of complex phenotypes such as anticancer drug response. The penalized regression approaches investigated in this study are popular choices in this setting and our simulations corroborate well established results concerning the conditions under which each one of these methods is expected to perform best while providing several novel insights.

Original language	English (US)
Article number	e107957
Journal	PLOS ONE
Volume	9
Issue number	10
DOIs	https://doi.org/10.1371/journal.pone.0107957
State	Published - Oct 7 2014
Externally published	Yes

ASJC Scopus subject areas

General

Cite this

@article{8b5872f9f7cf491f8613a28dcd8720bd,

title = "Simulation studies as designed experiments: The comparison of penalized regression models in the {"}large p, small n{"} setting",

abstract = "New algorithms are continuously proposed in computational biology. Performance evaluation of novel methods is important in practice. Nonetheless, the field experiences a lack of rigorous methodology aimed to systematically and objectively evaluate competing approaches. Simulation studies are frequently used to show that a particular method outperforms another. Often times, however, simulation studies are not well designed, and it is hard to characterize the particular conditions under which different methods perform better. In this paper we propose the adoption of well established techniques in the design of computer and physical experiments for developing effective simulation studies. By following best practices in planning of experiments we are better able to understand the strengths and weaknesses of competing algorithms leading to more informed decisions about which method to use for a particular task. We illustrate the application of our proposed simulation framework with a detailed comparison of the ridge-regression, lasso and elastic-net algorithms in a large scale study investigating the effects on predictive performance of sample size, number of features, true model sparsity, signal-to-noise ratio, and feature correlation, in situations where the number of covariates is usually much larger than sample size. Analysis of data sets containing tens of thousands of features but only a few hundred samples is nowadays routine in computational biology, where {"}omics{"} features such as gene expression, copy number variation and sequence data are frequently used in the predictive modeling of complex phenotypes such as anticancer drug response. The penalized regression approaches investigated in this study are popular choices in this setting and our simulations corroborate well established results concerning the conditions under which each one of these methods is expected to perform best while providing several novel insights.",

author = "{Chaibub Neto}, Elias and Bare, {J. Christopher} and Margolin, {Adam A.}",

note = "Publisher Copyright: {\textcopyright} 2014 Chaibub Neto et al.",

year = "2014",

month = oct,

day = "7",

doi = "10.1371/journal.pone.0107957",

language = "English (US)",

volume = "9",

journal = "PLOS ONE",

issn = "1932-6203",

publisher = "Public Library of Science",

number = "10",

}

TY - JOUR

T1 - Simulation studies as designed experiments

T2 - The comparison of penalized regression models in the "large p, small n" setting

AU - Chaibub Neto, Elias

AU - Bare, J. Christopher

AU - Margolin, Adam A.

PY - 2014/10/7

Y1 - 2014/10/7

AB - New algorithms are continuously proposed in computational biology. Performance evaluation of novel methods is important in practice. Nonetheless, the field experiences a lack of rigorous methodology aimed to systematically and objectively evaluate competing approaches. Simulation studies are frequently used to show that a particular method outperforms another. Often times, however, simulation studies are not well designed, and it is hard to characterize the particular conditions under which different methods perform better. In this paper we propose the adoption of well established techniques in the design of computer and physical experiments for developing effective simulation studies. By following best practices in planning of experiments we are better able to understand the strengths and weaknesses of competing algorithms leading to more informed decisions about which method to use for a particular task. We illustrate the application of our proposed simulation framework with a detailed comparison of the ridge-regression, lasso and elastic-net algorithms in a large scale study investigating the effects on predictive performance of sample size, number of features, true model sparsity, signal-to-noise ratio, and feature correlation, in situations where the number of covariates is usually much larger than sample size. Analysis of data sets containing tens of thousands of features but only a few hundred samples is nowadays routine in computational biology, where "omics" features such as gene expression, copy number variation and sequence data are frequently used in the predictive modeling of complex phenotypes such as anticancer drug response. The penalized regression approaches investigated in this study are popular choices in this setting and our simulations corroborate well established results concerning the conditions under which each one of these methods is expected to perform best while providing several novel insights.

UR - http://www.scopus.com/inward/record.url?scp=84907841894&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84907841894&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0107957

DO - 10.1371/journal.pone.0107957

M3 - Article

C2 - 25289666

AN - SCOPUS:84907841894

SN - 1932-6203

VL - 9

JO - PLOS ONE

JF - PLOS ONE

IS - 10

M1 - e107957

ER -
