Bias from removing read duplication in ultra-deep sequencing experiments

Wanding Zhou; Tenghui Chen; Hao Zhao; Agda Karina Eterovic; Funda Meric-Bernstam; Gordon B. Mills; Ken Chen

doi:10.1093/bioinformatics/btt771

Research output: Contribution to journal › Article › peer-review

28 Scopus citations

Abstract

Motivation: Identifying subclonal mutations and their implications requires accurate estimation of mutant allele fractions from possibly duplicated sequencing reads. Removing duplicate reads assumes that polymerase chain reaction amplification during library construction is the primary source of duplication. The alternative, sampling coincidence from DNA fragmentation, has not been systematically investigated.

Results: With sufficiently high sequencing depth, sampling-induced read duplication is non-negligible, and removing duplicate reads can overcorrect read counts, causing systemic biases in variant allele fraction and copy number variation estimates. Overcorrection is minimal when duplicate reads are identified with their mate reads taken into account, inserts vary in length and samples are sequenced in separate batches. We investigate sampling-induced read duplication in deep sequencing data with 500× to 2000× duplicates-removed sequence coverage. We provide a quantitative solution to overcorrection and guidance for effective designs of deep sequencing platforms that facilitate accurate estimation of variant allele fraction and copy number variation.

Original language	English (US)
Pages (from-to)	1073-1080
Number of pages	8
Journal	Bioinformatics
Volume	30
Issue number	8
DOIs	https://doi.org/10.1093/bioinformatics/btt771
State	Published - 2014
Externally published	Yes

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

Cite this

@article{56575d67b22e40eaa3931fbd3446a6e7,

title = "Bias from removing read duplication in ultra-deep sequencing experiments",

abstract = "Motivation: Identifying subclonal mutations and their implications requires accurate estimation of mutant allele fractions from possibly duplicated sequencing reads. Removing duplicate reads assumes that polymerase chain reaction amplification during library construction is the primary source of duplication. The alternative, sampling coincidence from DNA fragmentation, has not been systematically investigated. Results: With sufficiently high sequencing depth, sampling-induced read duplication is non-negligible, and removing duplicate reads can overcorrect read counts, causing systemic biases in variant allele fraction and copy number variation estimates. Overcorrection is minimal when duplicate reads are identified with their mate reads taken into account, inserts vary in length and samples are sequenced in separate batches. We investigate sampling-induced read duplication in deep sequencing data with 500$\times$ to 2000$\times$ duplicates-removed sequence coverage. We provide a quantitative solution to overcorrection and guidance for effective designs of deep sequencing platforms that facilitate accurate estimation of variant allele fraction and copy number variation.",

author = "Wanding Zhou and Tenghui Chen and Hao Zhao and Eterovic, {Agda Karina} and Funda Meric-Bernstam and Mills, {Gordon B.} and Ken Chen",

note = "Funding Information: Funding: The National Cancer Institute [grant R01-CA172652-01 to KC and grant P30 CA016672]; The MD Anderson Odyssey recruitment fellowship (to W.Z.); and the MD Anderson Cancer Center Sheikh Khalifa Ben Zayed Al Nahyan Institute of Personalized Cancer Therapy.",

year = "2014",

doi = "10.1093/bioinformatics/btt771",

language = "English (US)",

volume = "30",

pages = "1073--1080",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "8",

}

TY - JOUR

T1 - Bias from removing read duplication in ultra-deep sequencing experiments

AU - Zhou, Wanding

AU - Chen, Tenghui

AU - Zhao, Hao

AU - Eterovic, Agda Karina

AU - Meric-Bernstam, Funda

AU - Mills, Gordon B.

AU - Chen, Ken

N1 - Funding Information: Funding: The National Cancer Institute [grant R01-CA172652-01 to KC and grant P30 CA016672]; The MD Anderson Odyssey recruitment fellowship (to W.Z.); and the MD Anderson Cancer Center Sheikh Khalifa Ben Zayed Al Nahyan Institute of Personalized Cancer Therapy.

PY - 2014

Y1 - 2014

N2 - Motivation: Identifying subclonal mutations and their implications requires accurate estimation of mutant allele fractions from possibly duplicated sequencing reads. Removing duplicate reads assumes that polymerase chain reaction amplification during library construction is the primary source of duplication. The alternative, sampling coincidence from DNA fragmentation, has not been systematically investigated. Results: With sufficiently high sequencing depth, sampling-induced read duplication is non-negligible, and removing duplicate reads can overcorrect read counts, causing systemic biases in variant allele fraction and copy number variation estimates. Overcorrection is minimal when duplicate reads are identified with their mate reads taken into account, inserts vary in length and samples are sequenced in separate batches. We investigate sampling-induced read duplication in deep sequencing data with 500× to 2000× duplicates-removed sequence coverage. We provide a quantitative solution to overcorrection and guidance for effective designs of deep sequencing platforms that facilitate accurate estimation of variant allele fraction and copy number variation.

AB - Motivation: Identifying subclonal mutations and their implications requires accurate estimation of mutant allele fractions from possibly duplicated sequencing reads. Removing duplicate reads assumes that polymerase chain reaction amplification during library construction is the primary source of duplication. The alternative, sampling coincidence from DNA fragmentation, has not been systematically investigated. Results: With sufficiently high sequencing depth, sampling-induced read duplication is non-negligible, and removing duplicate reads can overcorrect read counts, causing systemic biases in variant allele fraction and copy number variation estimates. Overcorrection is minimal when duplicate reads are identified with their mate reads taken into account, inserts vary in length and samples are sequenced in separate batches. We investigate sampling-induced read duplication in deep sequencing data with 500× to 2000× duplicates-removed sequence coverage. We provide a quantitative solution to overcorrection and guidance for effective designs of deep sequencing platforms that facilitate accurate estimation of variant allele fraction and copy number variation.

UR - http://www.scopus.com/inward/record.url?scp=84898881845&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84898881845&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btt771

DO - 10.1093/bioinformatics/btt771

M3 - Article

C2 - 24389657

AN - SCOPUS:84898881845

SN - 1367-4803

VL - 30

SP - 1073

EP - 1080

JO - Bioinformatics

JF - Bioinformatics

IS - 8

ER -
