Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive

Abhinav Nellore, Andrew E. Jaffe, Jean Philippe Fortin, José Alquicira-Hernández, Leonardo Collado-Torres, Siruo Wang, Robert A. Phillips, Nishika Karbhari, Kasper D. Hansen, Ben Langmead, Jeffrey T. Leek

Research output: Contribution to journal, Article

32 Citations (Scopus)

Abstract

Background: Gene annotations, such as those in GENCODE, are derived primarily from alignments of spliced cDNA sequences and protein sequences. The impact of RNA-seq data on annotation has been confined to major projects like ENCODE and Illumina Body Map 2.0.

Results: We aligned 21,504 Illumina-sequenced human RNA-seq samples from the Sequence Read Archive (SRA) to the human genome and compared detected exon-exon junctions with junctions in several recent gene annotations. We found 56,861 junctions (18.6%) in at least 1000 samples that were not annotated, and their expression associated with tissue type. Junctions well expressed in individual samples tended to be annotated. Newer samples contributed few novel well-supported junctions, with the vast majority of detected junctions present in samples before 2013. We compiled junction data into a resource called intropolis available at http://intropolis.rail.bio. We used this resource to search for a recently validated isoform of the ALK gene and characterized the potential functional implications of unannotated junctions with publicly available TRAP-seq data.

Conclusions: Considering only the variation contained in annotation may suffice if an investigator is interested only in well-expressed transcript isoforms. However, genes that are not generally well expressed and nonetheless present in a small but significant number of samples in the SRA are likelier to be incompletely annotated. The rate at which evidence for novel junctions has been added to the SRA has tapered dramatically, even to the point of an asymptote. Now is perhaps an appropriate time to update incomplete annotations to include splicing present in the now-stable snapshot provided by the SRA.

Original language: English (US)
Article number: 266
Journal: Genome Biology
Volume: 17
Issue number: 1
DOIs: https://doi.org/10.1186/s13059-016-1118-6
State: Published - Dec 30 2016
Externally published: Yes

Keywords

  • Intron
  • RNA-seq
  • Splicing

ASJC Scopus subject areas

  • Ecology, Evolution, Behavior and Systematics
  • Genetics
  • Cell Biology

Cite this

Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. / Nellore, Abhinav; Jaffe, Andrew E.; Fortin, Jean Philippe; Alquicira-Hernández, José; Collado-Torres, Leonardo; Wang, Siruo; Phillips, Robert A.; Karbhari, Nishika; Hansen, Kasper D.; Langmead, Ben; Leek, Jeffrey T.

In: Genome Biology, Vol. 17, No. 1, 266, 30.12.2016.

Nellore, A, Jaffe, AE, Fortin, JP, Alquicira-Hernández, J, Collado-Torres, L, Wang, S, Phillips, RA, Karbhari, N, Hansen, KD, Langmead, B & Leek, JT 2016, 'Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive', Genome Biology, vol. 17, no. 1, 266. https://doi.org/10.1186/s13059-016-1118-6
@article{2bd5ec5989754bf096935023911edd0a,
title = "Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive",
keywords = "Intron, RNA-seq, Splicing",
author = "Abhinav Nellore and Jaffe, {Andrew E.} and Fortin, {Jean Philippe} and Jos{\'e} Alquicira-Hern{\'a}ndez and Leonardo Collado-Torres and Siruo Wang and Phillips, {Robert A.} and Nishika Karbhari and Hansen, {Kasper D.} and Ben Langmead and Leek, {Jeffrey T.}",
year = "2016",
month = "12",
day = "30",
doi = "10.1186/s13059-016-1118-6",
language = "English (US)",
volume = "17",
journal = "Genome Biology",
issn = "1474-7596",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive

AU - Nellore, Abhinav

AU - Jaffe, Andrew E.

AU - Fortin, Jean Philippe

AU - Alquicira-Hernández, José

AU - Collado-Torres, Leonardo

AU - Wang, Siruo

AU - Phillips, Robert A.

AU - Karbhari, Nishika

AU - Hansen, Kasper D.

AU - Langmead, Ben

AU - Leek, Jeffrey T.

PY - 2016/12/30

Y1 - 2016/12/30

KW - Intron

KW - RNA-seq

KW - Splicing

UR - http://www.scopus.com/inward/record.url?scp=85007507393&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85007507393&partnerID=8YFLogxK

U2 - 10.1186/s13059-016-1118-6

DO - 10.1186/s13059-016-1118-6

M3 - Article

C2 - 28038678

AN - SCOPUS:85007507393

VL - 17

JO - Genome Biology

JF - Genome Biology

SN - 1474-7596

IS - 1

M1 - 266

ER -