Controlling for contaminants in low-biomass 16S rRNA gene sequencing experiments

Lisa Karstens, Mark Asquith, Sean Davin, Damien Fair, William (Tom) Gregory, Alan J. Wolfe, Jonathan Braun, Shannon McWeeney

Research output: Contribution to journal > Article

Abstract

Microbial communities are commonly studied using culture-independent methods, such as 16S rRNA gene sequencing. However, one challenge in accurately characterizing microbial communities is exogenous bacterial DNA contamination, particularly in low-microbial-biomass niches. Computational approaches to identify contaminant sequences have been proposed, but their performance has not been independently evaluated. To identify the impact of decreasing microbial biomass on polymicrobial 16S rRNA gene sequencing experiments, we created a mock microbial community dilution series. We evaluated four computational approaches to identify and remove contaminants, as follows: (i) filtering sequences present in a negative control, (ii) filtering sequences based on relative abundance, (iii) identifying sequences that have an inverse correlation with DNA concentration implemented in Decontam, and (iv) predicting the sequence proportion arising from defined contaminant sources implemented in SourceTracker. As expected, the proportion of contaminant bacterial DNA increased with decreasing starting microbial biomass, with 80.1% of the most diluted sample arising from contaminant sequences. Inclusion of contaminant sequences led to overinflated diversity estimates and distorted microbiome composition. All methods for contaminant identification successfully identified some contaminant sequences, which varied depending on the method parameters used and contaminant prevalence. Notably, removing sequences present in a negative control erroneously removed 20% of expected sequences. SourceTracker successfully removed over 98% of contaminants when the experimental environments were well defined. However, SourceTracker misclassified expected sequences and performed poorly when the experimental environment was unknown, failing to remove 97% of contaminants. In contrast, the Decontam frequency method did not remove expected sequences and successfully removed 70 to 90% of the contaminants. 

IMPORTANCE

The relative scarcity of microbes in low-microbial-biomass environments makes accurate determination of community composition challenging. Identifying and controlling for contaminant bacterial DNA are critical steps in understanding microbial communities from these low-biomass environments. Our study introduces the use of a mock community dilution series as a positive control and evaluates four computational strategies that can identify contaminants in 16S rRNA gene sequencing experiments in order to remove them from downstream analyses. The appropriate computational approach for removing contaminant sequences from an experiment depends on prior knowledge about the microbial environment under investigation and can be evaluated with a dilution series of a mock microbial community.
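
The first three filtering strategies named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the count table, DNA concentrations, sequence names, thresholds, and function names are invented toy values, not data or code from the study. In particular, the correlation-based function is only a simplified stand-in for the idea behind approach (iii) and is not the Decontam implementation; approach (iv), SourceTracker, is not sketched.

```python
import math

# Hypothetical toy data: a mock-community dilution series (1,000 reads per sample).
EXPECTED = {"mock_A", "mock_B"}  # assumed members of the mock community
counts = {
    "dil_0": {"mock_A": 600, "mock_B": 395, "contam_X": 4, "contam_Y": 1},
    "dil_1": {"mock_A": 580, "mock_B": 370, "contam_X": 40, "contam_Y": 10},
    "dil_2": {"mock_A": 450, "mock_B": 250, "contam_X": 240, "contam_Y": 60},
    "dil_3": {"mock_A": 150, "mock_B": 100, "contam_X": 600, "contam_Y": 150},
}
dna_conc = {"dil_0": 1000.0, "dil_1": 100.0, "dil_2": 10.0, "dil_3": 1.0}  # arbitrary units
negative_control = {"mock_A": 3, "mock_B": 0, "contam_X": 700, "contam_Y": 200}


def contaminant_fraction(sample, expected):
    """Fraction of reads that do not belong to the expected mock community."""
    total = sum(sample.values())
    return sum(n for seq, n in sample.items() if seq not in expected) / total


def flag_by_negative_control(neg):
    """Approach (i): flag every sequence observed in the negative control."""
    return {seq for seq, n in neg.items() if n > 0}


def flag_by_abundance(sample, threshold):
    """Approach (ii): flag sequences present below a relative-abundance threshold."""
    total = sum(sample.values())
    return {seq for seq, n in sample.items() if 0 < n / total < threshold}


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def flag_by_inverse_correlation(table, conc, cutoff=-0.8, min_samples=3):
    """Approach (iii), simplified: flag sequences whose log relative abundance
    is strongly negatively correlated with log DNA concentration."""
    flagged = set()
    for seq in {s for row in table.values() for s in row}:
        xs, ys = [], []
        for name, row in table.items():
            n = row.get(seq, 0)
            if n > 0:  # use only samples in which the sequence was observed
                xs.append(math.log10(conc[name]))
                ys.append(math.log10(n / sum(row.values())))
        if len(xs) >= min_samples and pearson(xs, ys) <= cutoff:
            flagged.add(seq)
    return flagged


if __name__ == "__main__":
    for name, sample in counts.items():
        print(name, contaminant_fraction(sample, EXPECTED))
    print(sorted(flag_by_negative_control(negative_control)))
    print(sorted(flag_by_abundance(counts["dil_0"], 0.01)))
    print(sorted(flag_by_inverse_correlation(counts, dna_conc)))
```

In this toy table the negative control contains a few reads of mock_A, so the negative-control filter also flags an expected sequence, and the abundance filter flags nothing in the most diluted sample because the contaminants dominate it. These behaviours follow from the invented numbers alone and should not be read as a reproduction of the reported results.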

Original language: English (US)
Article number: e00290-19
Journal: mSystems
Volume: 4
Issue number: 4
DOI: 10.1128/mSystems.00290-19
State: Published - Jul 1 2019

Keywords

  • 16S rRNA gene sequencing
  • Contamination
  • Decontam
  • Low microbial biomass
  • Microbiome
  • SourceTracker

ASJC Scopus subject areas

  • Microbiology
  • Physiology
  • Biochemistry
  • Ecology, Evolution, Behavior and Systematics
  • Modeling and Simulation
  • Molecular Biology
  • Genetics
  • Computer Science Applications

Cite this

Controlling for contaminants in low-biomass 16S rRNA gene sequencing experiments. / Karstens, Lisa; Asquith, Mark; Davin, Sean; Fair, Damien; Gregory, William (Tom); Wolfe, Alan J.; Braun, Jonathan; McWeeney, Shannon.

In: mSystems, Vol. 4, No. 4, e00290-19, 01.07.2019.

@article{414d730db5d149c6b2961fd058f7d7df,
title = "Controlling for contaminants in low-biomass 16S rRNA gene sequencing experiments",
keywords = "16S rRNA gene sequencing, Contamination, Decontam, Low microbial biomass, Microbiome, SourceTracker",
author = "Lisa Karstens and Mark Asquith and Sean Davin and Damien Fair and Gregory, {William (Tom)} and Wolfe, {Alan J.} and Jonathan Braun and Shannon McWeeney",
year = "2019",
month = "7",
day = "1",
doi = "10.1128/mSystems.00290-19",
language = "English (US)",
volume = "4",
journal = "mSystems",
issn = "2379-5077",
publisher = "American Society for Microbiology",
number = "4",

}

TY - JOUR

T1 - Controlling for contaminants in low-biomass 16S rRNA gene sequencing experiments

AU - Karstens, Lisa

AU - Asquith, Mark

AU - Davin, Sean

AU - Fair, Damien

AU - Gregory, William (Tom)

AU - Wolfe, Alan J.

AU - Braun, Jonathan

AU - McWeeney, Shannon

PY - 2019/7/1

Y1 - 2019/7/1

KW - 16S rRNA gene sequencing

KW - Contamination

KW - Decontam

KW - Low microbial biomass

KW - Microbiome

KW - SourceTracker

UR - http://www.scopus.com/inward/record.url?scp=85067394004&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85067394004&partnerID=8YFLogxK

U2 - 10.1128/mSystems.00290-19

DO - 10.1128/mSystems.00290-19

M3 - Article

VL - 4

JO - mSystems

JF - mSystems

SN - 2379-5077

IS - 4

M1 - e00290-19

ER -