Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions

Steven H. Wu; Rachel S. Schwartz; David J. Winter; Donald F. Conrad; Reed A. Cartwright

doi:10.1093/bioinformatics/btx133

Research output: Contribution to journal › Article › peer-review

10 Scopus citations

Abstract

Motivation: Accurate identification of genotypes is an essential part of the analysis of genomic data, including in identification of sequence polymorphisms, linking mutations with disease and determining mutation rates. Biological and technical processes that adversely affect genotyping include copy-number-variation, paralogous sequences, library preparation, sequencing error and reference-mapping biases, among others.

Results: We modeled the read depth for all data as a mixture of Dirichlet-multinomial distributions, resulting in significant improvements over previously used models. In most cases the best model was comprised of two distributions. The major-component distribution is similar to a binomial distribution with low error and low reference bias. The minor-component distribution is overdispersed with higher error and reference bias. We also found that sites fitting the minor component are enriched for copy number variants and low complexity regions, which can produce erroneous genotype calls. By removing sites that do not fit the major component, we can improve the accuracy of genotype calls.
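The filtering idea in the abstract (score each site under a two-component Dirichlet-multinomial mixture and keep sites that fit the major component) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the (reference, alternate, error) count layout, the mixture weights and the concentration parameters below are all hypothetical values chosen for illustration, not estimates from the paper.

```python
from math import exp, lgamma, log


def dm_log_pmf(counts, alpha):
    """Log PMF of a Dirichlet-multinomial for one site's read counts."""
    n = sum(counts)
    a0 = sum(alpha)
    # Multinomial coefficient: n! / prod(x_k!)
    out = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Gamma(a0) / Gamma(n + a0)
    out += lgamma(a0) - lgamma(n + a0)
    # prod Gamma(x_k + alpha_k) / Gamma(alpha_k)
    out += sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))
    return out


def responsibilities(counts, weights, alphas):
    """Posterior probability that a site belongs to each mixture component."""
    logs = [log(w) + dm_log_pmf(counts, a) for w, a in zip(weights, alphas)]
    m = max(logs)  # subtract the maximum for numerical stability
    unnorm = [exp(v - m) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical parameters for a heterozygous site.
# Counts are assumed to be ordered (reference, alternate, error).
weights = [0.9, 0.1]
alphas = [
    [500.0, 490.0, 10.0],  # major: large concentration, close to binomial
    [6.0, 3.5, 0.5],       # minor: small concentration, overdispersed, reference-biased
]

balanced = responsibilities([15, 14, 1], weights, alphas)
skewed = responsibilities([27, 2, 1], weights, alphas)

# Keep a site only if the major component (index 0) is the more likely one.
keep_balanced = balanced[0] > 0.5
keep_skewed = skewed[0] > 0.5
```

With these made-up parameters, the balanced site is assigned mostly to the major component and would be kept, while the strongly reference-skewed site is assigned to the minor component and would be filtered out. In practice the weights and concentration parameters would have to be estimated from data, which this sketch does not attempt.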

Original language	English (US)
Pages (from-to)	2322-2329
Number of pages	8
Journal	Bioinformatics
Volume	33
Issue number	15
DOIs	https://doi.org/10.1093/bioinformatics/btx133
State	Published - Aug 1 2017
Externally published	Yes

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

Cite this

@article{455f07edc2b04e62b5c36bc1b1fd27ae,
  title = "Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions",
  abstract = "Motivation: Accurate identification of genotypes is an essential part of the analysis of genomic data, including in identification of sequence polymorphisms, linking mutations with disease and determining mutation rates. Biological and technical processes that adversely affect genotyping include copy-number-variation, paralogous sequences, library preparation, sequencing error and reference-mapping biases, among others. Results: We modeled the read depth for all data as a mixture of Dirichlet-multinomial distributions, resulting in significant improvements over previously used models. In most cases the best model was comprised of two distributions. The major-component distribution is similar to a binomial distribution with low error and low reference bias. The minor-component distribution is overdispersed with higher error and reference bias. We also found that sites fitting the minor component are enriched for copy number variants and low complexity regions, which can produce erroneous genotype calls. By removing sites that do not fit the major component, we can improve the accuracy of genotype calls.",
  author = "Wu, {Steven H.} and Schwartz, {Rachel S.} and Winter, {David J.} and Conrad, {Donald F.} and Cartwright, {Reed A.}",
  year = "2017",
  month = aug,
  day = "1",
  doi = "10.1093/bioinformatics/btx133",
  language = "English (US)",
  volume = "33",
  pages = "2322--2329",
  journal = "Bioinformatics",
  issn = "1367-4803",
  publisher = "Oxford University Press",
  number = "15",
}

TY  - JOUR
T1  - Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions
AU  - Wu, Steven H.
AU  - Schwartz, Rachel S.
AU  - Winter, David J.
AU  - Conrad, Donald F.
AU  - Cartwright, Reed A.
PY  - 2017/8/1
Y1  - 2017/8/1
N2  - Motivation: Accurate identification of genotypes is an essential part of the analysis of genomic data, including in identification of sequence polymorphisms, linking mutations with disease and determining mutation rates. Biological and technical processes that adversely affect genotyping include copy-number-variation, paralogous sequences, library preparation, sequencing error and reference-mapping biases, among others. Results: We modeled the read depth for all data as a mixture of Dirichlet-multinomial distributions, resulting in significant improvements over previously used models. In most cases the best model was comprised of two distributions. The major-component distribution is similar to a binomial distribution with low error and low reference bias. The minor-component distribution is overdispersed with higher error and reference bias. We also found that sites fitting the minor component are enriched for copy number variants and low complexity regions, which can produce erroneous genotype calls. By removing sites that do not fit the major component, we can improve the accuracy of genotype calls.
AB  - Motivation: Accurate identification of genotypes is an essential part of the analysis of genomic data, including in identification of sequence polymorphisms, linking mutations with disease and determining mutation rates. Biological and technical processes that adversely affect genotyping include copy-number-variation, paralogous sequences, library preparation, sequencing error and reference-mapping biases, among others. Results: We modeled the read depth for all data as a mixture of Dirichlet-multinomial distributions, resulting in significant improvements over previously used models. In most cases the best model was comprised of two distributions. The major-component distribution is similar to a binomial distribution with low error and low reference bias. The minor-component distribution is overdispersed with higher error and reference bias. We also found that sites fitting the minor component are enriched for copy number variants and low complexity regions, which can produce erroneous genotype calls. By removing sites that do not fit the major component, we can improve the accuracy of genotype calls.
UR  - http://www.scopus.com/inward/record.url?scp=85026346784&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85026346784&partnerID=8YFLogxK
U2  - 10.1093/bioinformatics/btx133
DO  - 10.1093/bioinformatics/btx133
M3  - Article
C2  - 28334373
AN  - SCOPUS:85026346784
SN  - 1367-4803
VL  - 33
SP  - 2322
EP  - 2329
JO  - Bioinformatics
JF  - Bioinformatics
IS  - 15
ER  -
