Whole-genome characterization in pedigreed non-human primates using genotyping-by-sequencing (GBS) and imputation

Benjamin N. Bimber, Michael J. Raboin, John Letaw, Kimberly A. Nevonen, Jennifer E. Spindel, Susan R. McCouch, Rita Cervera-Juanes, Eliot Spindel, Lucia Carbone, Betsy Ferguson, Amanda Vinson

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

Background: Rhesus macaques are widely used in biomedical research, but the application of genomic information in this species to better understand human disease is still in its infancy. Whole-genome sequence (WGS) data in large pedigreed macaque colonies could provide substantial experimental power for genetic discovery, but collecting WGS data in large cohorts remains a formidable expense. Here, we describe a cost-effective approach that selects the most informative macaques in a pedigree for 30X WGS, followed by low-cost genotyping-by-sequencing (GBS) at 30X on the remaining macaques to generate sparse genotype data at high accuracy. Dense variants from the selected macaques with WGS data are then imputed into macaques having only sparse GBS data, resulting in dense genome-wide genotypes throughout the pedigree.

Results: We developed GBS for the macaque genome using digestion with PstI, followed by sequencing of size-selected fragments at 30X coverage. From GBS sequence data collected on all individuals in a 16-member pedigree, we characterized high-confidence genotypes at 22,455 single nucleotide variant (SNV) sites that were suitable for guiding imputation of dense sequence data from WGS. To characterize dense markers for imputation, we performed WGS at 30X coverage on nine of the 16 individuals, yielding 10,193,425 high-confidence SNVs. To validate the use of GBS data for facilitating imputation, we initially focused on chromosome 19 as a test case, using an optimized panel of 833 sparse, evenly spaced markers from GBS and 5,010 dense markers from WGS. Using the method of "Genotype Imputation Given Inheritance" (GIGI), we evaluated the effects on imputation accuracy of three strategies for selecting individuals for WGS: (1) using "GIGI-Pick" to select the most informative individuals, (2) using the most recent generation, or (3) using founders only. We also evaluated the effects on imputation accuracy of using from 1 to 9 WGS individuals for imputation. We found that the GIGI-Pick algorithm for selection of WGS individuals outperformed common heuristic approaches, and that genotype numbers and accuracy improved very little when using more than 5 WGS individuals for imputation. Informed by our findings, we used 4 macaques with WGS data to impute variants at up to 7,655,491 sites spanning all 20 autosomes in the 12 remaining macaques, based on their GBS genotypes at only 17,158 loci. Using a strict confidence threshold, we imputed an average of 3,680,238 variants per individual at >99% accuracy, or an average of 4,458,883 variants per individual at a more relaxed threshold, yielding >97% accuracy.

Conclusions: We conclude that an optimal tradeoff between genotype accuracy, number of imputed genotypes, and overall cost exists at the ratio of one individual selected for WGS using the GIGI-Pick algorithm per 3-5 relatives selected for GBS. This approach makes feasible the collection of accurate, dense genome-wide sequence data in large pedigreed macaque cohorts without the need for more expensive WGS data on all individuals.
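
The headline figures in the abstract can be related to one another with a few lines of arithmetic. The following is a minimal sketch, not part of the authors' pipeline: the constant and function names are illustrative assumptions, and every value is copied from the abstract. Because the abstract reports imputation at "up to" 7,655,491 sites, the fractions computed here are only approximate.

```python
# Figures quoted in the abstract; all names are illustrative, not from the paper.
PEDIGREE_SIZE = 16                 # individuals in the pedigree
WGS_INDIVIDUALS = 4                # individuals used as the dense WGS source
GBS_LOCI = 17_158                  # sparse GBS loci used to guide imputation
MAX_IMPUTED_SITES = 7_655_491      # upper bound on sites imputed genome-wide
MEAN_IMPUTED_STRICT = 3_680_238    # mean variants per individual, >99% accuracy
MEAN_IMPUTED_RELAXED = 4_458_883   # mean variants per individual, >97% accuracy


def summarize():
    """Derive simple ratios from the figures reported in the abstract."""
    gbs_only = PEDIGREE_SIZE - WGS_INDIVIDUALS
    return {
        # Number of individuals with GBS data only
        "gbs_only_individuals": gbs_only,
        # GBS-only relatives per WGS individual
        "gbs_per_wgs": gbs_only / WGS_INDIVIDUALS,
        # Dense sites targeted per sparse GBS locus
        "dense_sites_per_gbs_locus": MAX_IMPUTED_SITES / GBS_LOCI,
        # Share of the maximum site count imputed at each threshold
        "strict_fraction": MEAN_IMPUTED_STRICT / MAX_IMPUTED_SITES,
        "relaxed_fraction": MEAN_IMPUTED_RELAXED / MAX_IMPUTED_SITES,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.3f}")
```

Under these figures the pedigree was imputed at one WGS individual per three GBS-only relatives, the lower end of the 3-5 range recommended in the conclusions. Each sparse GBS locus anchors roughly 446 dense sites, and the strict and relaxed thresholds recover roughly 48% and 58% of the maximum site count per individual.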

Original language: English (US)
Article number: 676
Journal: BMC Genomics
Volume: 17
Issue number: 1
DOI: 10.1186/s12864-016-2966-x
State: Published - 2016

Fingerprint

Primates
Genome
Macaca
Genotype
Pedigree
Costs and Cost Analysis
Chromosomes, Human, Pair 19
Macaca mulatta
Biomedical Research
Digestion

Keywords

  • Genotyping-by-sequencing
  • Imputation
  • Macaque
  • Pedigree
  • Whole-genome sequencing

ASJC Scopus subject areas

  • Biotechnology
  • Genetics

Cite this

Whole-genome characterization in pedigreed non-human primates using genotyping-by-sequencing (GBS) and imputation. / Bimber, Benjamin N.; Raboin, Michael J.; Letaw, John; Nevonen, Kimberly A.; Spindel, Jennifer E.; McCouch, Susan R.; Cervera-Juanes, Rita; Spindel, Eliot; Carbone, Lucia; Ferguson, Betsy; Vinson, Amanda.

In: BMC Genomics, Vol. 17, No. 1, 676, 2016.

Bimber, Benjamin N. ; Raboin, Michael J. ; Letaw, John ; Nevonen, Kimberly A. ; Spindel, Jennifer E. ; McCouch, Susan R. ; Cervera-Juanes, Rita ; Spindel, Eliot ; Carbone, Lucia ; Ferguson, Betsy ; Vinson, Amanda. / Whole-genome characterization in pedigreed non-human primates using genotyping-by-sequencing (GBS) and imputation. In: BMC Genomics. 2016 ; Vol. 17, No. 1.
@article{a15dcd78fd44489e85ea8f9bb3a5e7e1,
title = "Whole-genome characterization in pedigreed non-human primates using genotyping-by-sequencing (GBS) and imputation",
keywords = "Genotyping-by-sequencing, Imputation, Macaque, Pedigree, Whole-genome sequencing",
author = "Bimber, {Benjamin N.} and Raboin, {Michael J.} and John Letaw and Nevonen, {Kimberly A.} and Spindel, {Jennifer E.} and McCouch, {Susan R.} and Rita Cervera-Juanes and Eliot Spindel and Lucia Carbone and Betsy Ferguson and Amanda Vinson",
year = "2016",
doi = "10.1186/s12864-016-2966-x",
language = "English (US)",
volume = "17",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - Whole-genome characterization in pedigreed non-human primates using genotyping-by-sequencing (GBS) and imputation

AU - Bimber, Benjamin N.

AU - Raboin, Michael J.

AU - Letaw, John

AU - Nevonen, Kimberly A.

AU - Spindel, Jennifer E.

AU - McCouch, Susan R.

AU - Cervera-Juanes, Rita

AU - Spindel, Eliot

AU - Carbone, Lucia

AU - Ferguson, Betsy

AU - Vinson, Amanda

PY - 2016

Y1 - 2016

KW - Genotyping-by-sequencing

KW - Imputation

KW - Macaque

KW - Pedigree

KW - Whole-genome sequencing

UR - http://www.scopus.com/inward/record.url?scp=85007552090&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85007552090&partnerID=8YFLogxK

U2 - 10.1186/s12864-016-2966-x

DO - 10.1186/s12864-016-2966-x

M3 - Article

C2 - 27558348

AN - SCOPUS:85007552090

VL - 17

JO - BMC Genomics

JF - BMC Genomics

SN - 1471-2164

IS - 1

M1 - 676

ER -