Machine learning integration for predicting the effect of single amino acid substitutions on protein stability

Ayşegül Özen; Mehmet Gönen; Ethem Alpaydın; Türkan Haliloğlu

doi:10.1186/1472-6807-9-66

Research output: Contribution to journal › Article › peer-review

22 Scopus citations

Abstract

Background. Computational prediction of protein stability change due to single-site amino acid substitutions is of interest in protein design and analysis. We consider the following four ways to improve the performance of the currently available predictors: (1) We include additional sequence- and structure-based features, namely, the amino acid substitution likelihoods, the equilibrium fluctuations of the alpha- and beta-carbon atoms, and the packing density. (2) By implementing different machine learning integration approaches, we combine information from different features or representations. (3) We compare classification vs. regression methods to predict the sign vs. the output of stability change. (4) We allow a reject option for doubtful cases where the risk of misclassification is high.

Results. We investigate three different approaches: early, intermediate and late integration, which respectively combine features, kernels over feature subsets, and decisions. We perform simulations on two data sets: (1) S1615, which has been used in previous studies, and (2) S2783, an updated version (as of July 2, 2009) also extracted from ProTherm. For the S1615 data set, our highest accuracy using both sequence and structure information is 0.842 on cross-validation and 0.904 on testing using early integration. Newly added features, namely, local compositional packing and the mobility extent of the mutated residues, improve accuracy significantly with intermediate integration. For the S2783 data set, we also train regression methods to estimate not only the sign but also the amount of stability change, and we apply risk-based classification to reject when the learner has low confidence and the loss of misclassification is high. The highest accuracy is 0.835 on cross-validation and 0.832 on testing using only sequence information. The percentage of false positives can be decreased to less than 0.005 by rejecting 10 per cent using late integration.

Conclusion. We find that in both early and late integration, combining inputs or decisions is useful in increasing accuracy. Intermediate integration allows assessing the contributions of individual features by looking at the assigned weights. The overall accuracy of regression is not better than that of classification, but regression yields fewer false positives, especially when combined with the reject option. The server for stability prediction with the three integration approaches and the data sets are available at http://www.prc.boun.edu.tr/appserv/prc/mlsta.
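The three integration schemes and the reject option named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the linear kernel, the averaging rule for decisions, the 0/1 loss, and the `reject_cost` parameter are all assumptions chosen to keep the example small, and the feature values are made up.

```python
from typing import List, Optional, Sequence

Vector = Sequence[float]


def early_integration(views: Sequence[Vector]) -> List[float]:
    """Early integration: concatenate per-view features into one input vector."""
    return [x for view in views for x in view]


def linear_kernel(a: Vector, b: Vector) -> float:
    """Plain dot product; stands in for whatever kernel a real model would use."""
    return sum(x * y for x, y in zip(a, b))


def intermediate_integration(
    views_a: Sequence[Vector], views_b: Sequence[Vector], weights: Sequence[float]
) -> float:
    """Intermediate integration: weighted sum of one kernel per feature subset.

    The weights indicate how much each feature subset contributes.
    """
    return sum(
        w * linear_kernel(a, b) for w, a, b in zip(weights, views_a, views_b)
    )


def late_integration(posteriors: Sequence[float]) -> float:
    """Late integration: combine decisions of separately trained learners.

    Here the decisions are posterior probabilities and the rule is a mean
    (an assumption; other combination rules are possible).
    """
    return sum(posteriors) / len(posteriors)


def classify_with_reject(p_positive: float, reject_cost: float = 0.2) -> Optional[int]:
    """Risk-based decision under an assumed 0/1 loss with a fixed reject cost.

    Returns +1 or -1, or None when rejecting is the lowest-risk action.
    """
    risk_pos = 1.0 - p_positive  # expected loss if we predict +1
    risk_neg = p_positive        # expected loss if we predict -1
    if reject_cost < min(risk_pos, risk_neg):
        return None
    return 1 if risk_pos <= risk_neg else -1


if __name__ == "__main__":
    # Hypothetical two-view representation of two mutations:
    # first view = sequence features, second view = structure features.
    mutation_a = [[1.0, 0.0], [0.5]]
    mutation_b = [[0.5, 2.0], [2.0]]

    print(early_integration(mutation_a))
    print(intermediate_integration(mutation_a, mutation_b, weights=[0.75, 0.25]))

    combined = late_integration([0.75, 0.25, 0.5])
    print(combined, classify_with_reject(combined))
```

In this sketch a lower `reject_cost` makes the classifier abstain more often, which is the trade-off the abstract describes between the rejection rate and the share of false positives.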

Original language	English (US)
Article number	66
Journal	BMC Structural Biology
Volume	9
DOIs	https://doi.org/10.1186/1472-6807-9-66
State	Published - 2009
Externally published	Yes

ASJC Scopus subject areas

Structural Biology

Cite this

@article{a9cc6800e9254995a3944388c1568c45,

title = "Machine learning integration for predicting the effect of single amino acid substitutions on protein stability",

author = "Ay{\c s}eg{\"u}l {\"O}zen and Mehmet G{\"o}nen and Ethem Alpayd{\i}n and T{\"u}rkan Halilo{\u g}lu",

note = "Funding Information: This work was supported by the Turkish Academy of Sciences in the framework of the Young Scientist Award Program (EA-T{\"U}BA-GEBİP/2001-1-1 and TH-T{\"U}BA-GEBİP/2001-1-1), Boğazi{\c c}i University Scientific Research Projects (BAP 04A502, 06A508, and 07HA101), the Turkish State Planning Organization (DPT 03K120250), and the Turkish Scientific Technical Research Council (T{\"U}BİTAK EEEAG 107E222). T. Haliloğlu acknowledges Betil Fund. A. {\"O}zen acknowledges T{\"U}BİTAK-BİDEB SSA-2 Project Fellowship. The work of M. G{\"o}nen was supported by the PhD scholarship (2211) from T{\"U}BİTAK.",

year = "2009",

doi = "10.1186/1472-6807-9-66",

language = "English (US)",

volume = "9",

journal = "BMC Structural Biology",

issn = "1472-6807",

publisher = "BioMed Central",

}

TY - JOUR

T1 - Machine learning integration for predicting the effect of single amino acid substitutions on protein stability

AU - Özen, Ayşegül

AU - Gönen, Mehmet

AU - Alpaydın, Ethem

AU - Haliloğlu, Türkan

N1 - Funding Information: This work was supported by the Turkish Academy of Sciences in the framework of the Young Scientist Award Program (EA-TÜBA-GEBİP/2001-1-1 and TH-TÜBA-GEBİP/2001-1-1), Boğaziçi University Scientific Research Projects (BAP 04A502, 06A508, and 07HA101), the Turkish State Planning Organization (DPT 03K120250), and the Turkish Scientific Technical Research Council (TÜBİTAK EEEAG 107E222). T. Haliloğlu acknowledges Betil Fund. A. Özen acknowledges TÜBİTAK-BİDEB SSA-2 Project Fellowship. The work of M. Gönen was supported by the PhD scholarship (2211) from TÜBİTAK.

PY - 2009

Y1 - 2009

UR - http://www.scopus.com/inward/record.url?scp=72449189217&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=72449189217&partnerID=8YFLogxK

U2 - 10.1186/1472-6807-9-66

DO - 10.1186/1472-6807-9-66

M3 - Article

C2 - 19840377

AN - SCOPUS:72449189217

SN - 1472-6807

VL - 9

JO - BMC Structural Biology

JF - BMC Structural Biology

M1 - 66

ER -
