Exposing the cancer genome atlas as a SPARQL endpoint

Helena F. Deus, Diogo F. Veiga, Pablo R. Freire, John N. Weinstein, Gordon Mills, Jonas S. Almeida

Research output: Contribution to journal › Article

25 Citations (Scopus)

Abstract

The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to characterize several types of cancer. Datasets from biomedical domains such as TCGA present a particularly challenging task for those interested in dynamically aggregating their results because the data sources are typically both heterogeneous and distributed. The Linked Data best practices offer a solution to integrate and discover data with those characteristics, namely through exposure of data as Web services supporting SPARQL, the Resource Description Framework query language. Most SPARQL endpoints, however, cannot easily be queried by data experts. Furthermore, exposing experimental data as SPARQL endpoints remains a challenging task because, in most cases, data must first be converted to Resource Description Framework triples. In line with those requirements, we have developed an infrastructure to expose clinical, demographic and molecular data elements generated by TCGA as a SPARQL endpoint by assigning elements to entities of the Simple Sloppy Semantic Database (S3DB) management model. All components of the infrastructure are available as independent Representational State Transfer (REST) Web services to encourage reusability, and a simple interface was developed to automatically assemble SPARQL queries by navigating a representation of the TCGA domain. A key feature of the proposed solution that greatly facilitates assembly of SPARQL queries is the distinction between the TCGA domain descriptors and data elements. Furthermore, the use of the S3DB management model as a mediator enables queries to both public and protected data without the need for prior submission to a single data source.
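The two steps the abstract mentions, converting data to RDF triples and assembling a SPARQL query from domain descriptors, can be illustrated with a minimal Python sketch. It is not the S3DB infrastructure described in the paper. The namespace, endpoint URL, field names, patient identifier and values below are all made-up placeholders, and the helper functions are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical namespace and endpoint; neither comes from the paper.
EX = "http://example.org/tcga/"
ENDPOINT = "http://example.org/sparql"


def record_to_triples(subject_id, record):
    """Turn one flat record (a dict) into N-Triples lines, one per field."""
    subject = f"<{EX}patient/{subject_id}>"
    lines = []
    for field, value in record.items():
        predicate = f"<{EX}{field}>"
        # Escape backslashes and double quotes inside the literal.
        text = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} {predicate} "{text}" .')
    return lines


def build_select(fields, limit=10):
    """Assemble a SELECT query that retrieves the chosen descriptor fields."""
    variables = " ".join(f"?{f}" for f in fields)
    patterns = "\n".join(f"  ?patient <{EX}{f}> ?{f} ." for f in fields)
    return f"SELECT ?patient {variables}\nWHERE {{\n{patterns}\n}}\nLIMIT {limit}"


def query_url(query):
    """Encode the query as a SPARQL Protocol GET request URL."""
    return ENDPOINT + "?" + urlencode({"query": query})


# Illustrative (invented) record and descriptor selection.
triples = record_to_triples("P001", {"gender": "FEMALE", "tumor_site": "ovary"})
query = build_select(["gender", "tumor_site"])
url = query_url(query)

if __name__ == "__main__":
    print("\n".join(triples))
    print(query)
    print(url)
```

Keeping the descriptor names (the `fields` list) separate from the data values mirrors, in a very loose way, the distinction between domain descriptors and data elements that the abstract credits with simplifying query assembly.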

Original language: English (US)
Pages (from-to): 998-1008
Number of pages: 11
Journal: Journal of Biomedical Informatics
Volume: 43
Issue number: 6
DOI: https://doi.org/10.1016/j.jbi.2010.09.004
State: Published - Dec 1 2010
Externally published: Yes

Fingerprint

Atlases
Genes
Genome
Neoplasms
Information Storage and Retrieval
Web services
Query languages
Reusability
Practice Guidelines
Semantics
Language
Demography
Databases

Keywords

  • Data integration
  • Linked Data
  • RDF
  • SPARQL
  • TCGA

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics

Cite this

Deus, H. F., Veiga, D. F., Freire, P. R., Weinstein, J. N., Mills, G., & Almeida, J. S. (2010). Exposing the cancer genome atlas as a SPARQL endpoint. Journal of Biomedical Informatics, 43(6), 998-1008. https://doi.org/10.1016/j.jbi.2010.09.004

Exposing the cancer genome atlas as a SPARQL endpoint. / Deus, Helena F.; Veiga, Diogo F.; Freire, Pablo R.; Weinstein, John N.; Mills, Gordon; Almeida, Jonas S.

In: Journal of Biomedical Informatics, Vol. 43, No. 6, 01.12.2010, p. 998-1008.

Deus, HF, Veiga, DF, Freire, PR, Weinstein, JN, Mills, G & Almeida, JS 2010, 'Exposing the cancer genome atlas as a SPARQL endpoint', Journal of Biomedical Informatics, vol. 43, no. 6, pp. 998-1008. https://doi.org/10.1016/j.jbi.2010.09.004
Deus, Helena F. ; Veiga, Diogo F. ; Freire, Pablo R. ; Weinstein, John N. ; Mills, Gordon ; Almeida, Jonas S. / Exposing the cancer genome atlas as a SPARQL endpoint. In: Journal of Biomedical Informatics. 2010 ; Vol. 43, No. 6. pp. 998-1008.
@article{d97e17ae93c64e59a3947349250ebb1b,
title = "Exposing the cancer genome atlas as a SPARQL endpoint",
abstract = "The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to characterize several types of cancer. Datasets from biomedical domains such as TCGA present a particularly challenging task for those interested in dynamically aggregating its results because the data sources are typically both heterogeneous and distributed. The Linked Data best practices offer a solution to integrate and discover data with those characteristics, namely through exposure of data as Web services supporting SPARQL, the Resource Description Framework query language. Most SPARQL endpoints, however, cannot easily be queried by data experts. Furthermore, exposing experimental data as SPARQL endpoints remains a challenging task because, in most cases, data must first be converted to Resource Description Framework triples. In line with those requirements, we have developed an infrastructure to expose clinical, demographic and molecular data elements generated by TCGA as a SPARQL endpoint by assigning elements to entities of the Simple Sloppy Semantic Database (S3DB) management model. All components of the infrastructure are available as independent Representational State Transfer (REST) Web services to encourage reusability, and a simple interface was developed to automatically assemble SPARQL queries by navigating a representation of the TCGA domain. A key feature of the proposed solution that greatly facilitates assembly of SPARQL queries is the distinction between the TCGA domain descriptors and data elements. Furthermore, the use of the S3DB management model as a mediator enables queries to both public and protected data without the need for prior submission to a single data source.",
keywords = "Data integration, Linked Data, RDF, SPARQL, TCGA",
author = "Deus, {Helena F.} and Veiga, {Diogo F.} and Freire, {Pablo R.} and Weinstein, {John N.} and Mills, Gordon and Almeida, {Jonas S.}",
year = "2010",
month = "12",
day = "1",
doi = "10.1016/j.jbi.2010.09.004",
language = "English (US)",
volume = "43",
pages = "998--1008",
journal = "Journal of Biomedical Informatics",
issn = "1532-0464",
publisher = "Academic Press Inc.",
number = "6",

}

TY - JOUR
T1 - Exposing the cancer genome atlas as a SPARQL endpoint
AU - Deus, Helena F.
AU - Veiga, Diogo F.
AU - Freire, Pablo R.
AU - Weinstein, John N.
AU - Mills, Gordon
AU - Almeida, Jonas S.
PY - 2010/12/1
Y1 - 2010/12/1

N2 - The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to characterize several types of cancer. Datasets from biomedical domains such as TCGA present a particularly challenging task for those interested in dynamically aggregating its results because the data sources are typically both heterogeneous and distributed. The Linked Data best practices offer a solution to integrate and discover data with those characteristics, namely through exposure of data as Web services supporting SPARQL, the Resource Description Framework query language. Most SPARQL endpoints, however, cannot easily be queried by data experts. Furthermore, exposing experimental data as SPARQL endpoints remains a challenging task because, in most cases, data must first be converted to Resource Description Framework triples. In line with those requirements, we have developed an infrastructure to expose clinical, demographic and molecular data elements generated by TCGA as a SPARQL endpoint by assigning elements to entities of the Simple Sloppy Semantic Database (S3DB) management model. All components of the infrastructure are available as independent Representational State Transfer (REST) Web services to encourage reusability, and a simple interface was developed to automatically assemble SPARQL queries by navigating a representation of the TCGA domain. A key feature of the proposed solution that greatly facilitates assembly of SPARQL queries is the distinction between the TCGA domain descriptors and data elements. Furthermore, the use of the S3DB management model as a mediator enables queries to both public and protected data without the need for prior submission to a single data source.

KW - Data integration
KW - Linked Data
KW - RDF
KW - SPARQL
KW - TCGA
UR - http://www.scopus.com/inward/record.url?scp=78649318168&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78649318168&partnerID=8YFLogxK
U2 - 10.1016/j.jbi.2010.09.004
DO - 10.1016/j.jbi.2010.09.004
M3 - Article
C2 - 20851208
AN - SCOPUS:78649318168
VL - 43
SP - 998
EP - 1008
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
SN - 1532-0464
IS - 6
ER -