Germline contamination and leakage in whole genome somatic single nucleotide variant detection

Dorota H. Sendorek, Cristian Caloian, Kyle Ellrott, J. Christopher Bare, Takafumi N. Yamaguchi, Adam D. Ewing, Kathleen E. Houlahan, Thea C. Norman, Adam Margolin, Joshua M. Stuart, Paul C. Boutros

Research output: Contribution to journal (Article)

1 Citation (Scopus)

Abstract

Background: The clinical sequencing of cancer genomes to personalize therapy is becoming routine across the world. However, concerns over patient re-identification from these data lead to questions about how tightly access should be controlled. It is not thought to be possible to re-identify patients from somatic variant data. However, somatic variant detection pipelines can mistakenly identify germline variants as somatic ones, a process called "germline leakage". The rate of germline leakage across different somatic variant detection pipelines is not well understood, and it is uncertain whether somatic variant calls should be considered re-identifiable. To fill this gap, we quantified germline leakage across 259 sets of whole-genome somatic single nucleotide variant (SNV) predictions made by 21 teams as part of the ICGC-TCGA DREAM Somatic Mutation Calling Challenge.

Results: The median somatic SNV prediction set contained 4325 somatic SNVs and leaked one germline polymorphism. The level of germline leakage was inversely correlated with somatic SNV prediction accuracy and positively correlated with the amount of infiltrating normal cells. The specific germline variants leaked differed by tumour and algorithm. To aid in quantitation and correction of leakage, we created a tool, called GermlineFilter, for use in public-facing somatic SNV databases.

Conclusions: The potential for patient re-identification from leaked germline variants in somatic SNV predictions has led to divergent open data access policies, based on different assessments of the risks. Indeed, a single, well-publicized re-identification event could reshape public perceptions of the value of genomic data sharing. We find that modern somatic SNV prediction pipelines have low germline-leakage rates, which can be further reduced, especially for cloud-sharing, using pre-filtering software.
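The idea of counting and removing leaked germline variants can be illustrated with a minimal sketch. It assumes each variant is identified by chromosome, position, reference allele, and alternate allele, and that a list of the patient's known germline variants is available. The `Variant` type, the `split_leaked` helper, and the toy data are all hypothetical; the abstract does not describe how GermlineFilter works internally, so this is not its implementation.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Variant(NamedTuple):
    """A single nucleotide variant (hypothetical minimal representation)."""
    chrom: str
    pos: int  # 1-based genomic position
    ref: str  # reference allele
    alt: str  # alternate allele


def split_leaked(
    somatic_calls: Iterable[Variant],
    germline_variants: Iterable[Variant],
) -> Tuple[List[Variant], List[Variant]]:
    """Separate predicted somatic SNVs into (kept, leaked).

    A call counts as "leaked" when it exactly matches a known germline
    variant of the same patient. This exact-match rule is an assumption
    made for illustration only.
    """
    germline = set(germline_variants)
    kept: List[Variant] = []
    leaked: List[Variant] = []
    for call in somatic_calls:
        (leaked if call in germline else kept).append(call)
    return kept, leaked


# Toy data, invented for this example.
germline = [Variant("1", 1000, "A", "G"), Variant("2", 5000, "C", "T")]
calls = [
    Variant("1", 1000, "A", "G"),  # matches a germline variant: leaked
    Variant("3", 42, "G", "A"),
    Variant("7", 900, "T", "C"),
]

kept, leaked = split_leaked(calls, germline)
print(f"leaked germline variants: {len(leaked)} of {len(calls)} calls")
```

Using a set for the germline variants keeps each lookup constant-time on average, which matters when a prediction set holds thousands of calls.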

Original language: English (US)
Article number: 28
Journal: BMC Bioinformatics
Volume: 19
Issue number: 1
DOI: https://doi.org/10.1186/s12859-018-2046-0
State: Published - Jan 31 2018

Keywords

  • Cancer genomics
  • Germline contamination
  • Germline leakage
  • Mutation calling
  • Next-generation sequencing
  • Patient identifiability
  • Single nucleotide variant
  • SNV

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

Sendorek, D. H., Caloian, C., Ellrott, K., Bare, J. C., Yamaguchi, T. N., Ewing, A. D., ... Boutros, P. C. (2018). Germline contamination and leakage in whole genome somatic single nucleotide variant detection. BMC Bioinformatics, 19(1), [28]. https://doi.org/10.1186/s12859-018-2046-0

@article{fdc53000f2574771b4ed9b9353fb4f99,
title = "Germline contamination and leakage in whole genome somatic single nucleotide variant detection",
keywords = "Cancer genomics, Germline contamination, Germline leakage, Mutation calling, Next-generation sequencing, Patient identifiability, Single nucleotide variant, SNV",
author = "Sendorek, {Dorota H.} and Cristian Caloian and Kyle Ellrott and Bare, {J. Christopher} and Yamaguchi, {Takafumi N.} and Ewing, {Adam D.} and Houlahan, {Kathleen E.} and Norman, {Thea C.} and Adam Margolin and Stuart, {Joshua M.} and Boutros, {Paul C.}",
year = "2018",
month = "1",
day = "31",
doi = "10.1186/s12859-018-2046-0",
language = "English (US)",
volume = "19",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - Germline contamination and leakage in whole genome somatic single nucleotide variant detection

AU - Sendorek, Dorota H.

AU - Caloian, Cristian

AU - Ellrott, Kyle

AU - Bare, J. Christopher

AU - Yamaguchi, Takafumi N.

AU - Ewing, Adam D.

AU - Houlahan, Kathleen E.

AU - Norman, Thea C.

AU - Margolin, Adam

AU - Stuart, Joshua M.

AU - Boutros, Paul C.

PY - 2018/1/31

Y1 - 2018/1/31

KW - Cancer genomics

KW - Germline contamination

KW - Germline leakage

KW - Mutation calling

KW - Next-generation sequencing

KW - Patient identifiability

KW - Single nucleotide variant

KW - SNV

UR - http://www.scopus.com/inward/record.url?scp=85041452329&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85041452329&partnerID=8YFLogxK

U2 - 10.1186/s12859-018-2046-0

DO - 10.1186/s12859-018-2046-0

M3 - Article

C2 - 29385983

AN - SCOPUS:85041452329

VL - 19

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - 1

M1 - 28

ER -