Gold-standard ontology-based anatomical annotation in the CRAFT Corpus

Michael Bada, Nicole Vasilevsky, William A. Baumgartner, Melissa Haendel, Lawrence E. Hunter

Research output: Contribution to journalArticle

Abstract

Gold-standard annotated corpora have become important resources for the training and testing of natural-language-processing (NLP) systems designed to support biocuration efforts, and ontologies are increasingly used to facilitate curational consistency and semantic integration across disparate resources. Bringing together the respective power of these, the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of full-length, open-access biomedical journal articles with extensive manually created syntactic, formatting and semantic markup, was previously created and released. This initial public release has already been used in multiple projects to drive development of systems focused on a variety of biocuration, search, visualization, and semantic and syntactic NLP tasks. Building on its demonstrated utility, we have expanded the CRAFT Corpus with a large set of manually created semantic annotations relying on Uberon, an ontology representing anatomical entities and life-cycle stages of multicellular organisms across species as well as types of multicellular organisms defined in terms of life-cycle stage and sexual characteristics. This newly created set of annotations, which has been added for v2.1 of the corpus, is by far the largest publicly available collection of gold-standard anatomical markup and is the first large-scale effort at manual markup of biomedical text relying on the entirety of an anatomical terminology, as opposed to annotation with a small number of high-level anatomical categories, as performed in previous corpora. In addition to presenting and discussing this newly available resource, we apply it to provide a performance baseline for the automatic annotation of anatomical concepts in biomedical text using a prominent concept recognition system. The full corpus, released with a CC BY 3.0 license, may be downloaded from http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml. Database URL: http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

Original languageEnglish (US)
JournalDatabase : the journal of biological databases and curation
Volume2017
DOIs
StatePublished - Jan 1 2017

Fingerprint

Semantics
Gold
gold
Ontology
Natural Language Processing
Syntactics
Life Cycle Stages
Life cycle
Natural language processing systems
developmental stages
organisms
terminology
Licensure
Terminology
Visualization
Databases
Testing
Processing
testing

ASJC Scopus subject areas

  • Information Systems
  • Biochemistry, Genetics and Molecular Biology(all)
  • Agricultural and Biological Sciences(all)

Cite this

Gold-standard ontology-based anatomical annotation in the CRAFT Corpus. / Bada, Michael; Vasilevsky, Nicole; Baumgartner, William A.; Haendel, Melissa; Hunter, Lawrence E.

In: Database : the journal of biological databases and curation, Vol. 2017, 01.01.2017.

Research output: Contribution to journalArticle

Bada, Michael ; Vasilevsky, Nicole ; Baumgartner, William A. ; Haendel, Melissa ; Hunter, Lawrence E. / Gold-standard ontology-based anatomical annotation in the CRAFT Corpus. In: Database : the journal of biological databases and curation. 2017 ; Vol. 2017.
@article{a0373e33ba1643578d5f40cbda31724f,
title = "Gold-standard ontology-based anatomical annotation in the CRAFT Corpus",
abstract = "Gold-standard annotated corpora have become important resources for the training and testing of natural-language-processing (NLP) systems designed to support biocuration efforts, and ontologies are increasingly used to facilitate curational consistency and semantic integration across disparate resources. Bringing together the respective power of these, the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of full-length, open-access biomedical journal articles with extensive manually created syntactic, formatting and semantic markup, was previously created and released. This initial public release has already been used in multiple projects to drive development of systems focused on a variety of biocuration, search, visualization, and semantic and syntactic NLP tasks. Building on its demonstrated utility, we have expanded the CRAFT Corpus with a large set of manually created semantic annotations relying on Uberon, an ontology representing anatomical entities and life-cycle stages of multicellular organisms across species as well as types of multicellular organisms defined in terms of life-cycle stage and sexual characteristics. This newly created set of annotations, which has been added for v2.1 of the corpus, is by far the largest publicly available collection of gold-standard anatomical markup and is the first large-scale effort at manual markup of biomedical text relying on the entirety of an anatomical terminology, as opposed to annotation with a small number of high-level anatomical categories, as performed in previous corpora. In addition to presenting and discussing this newly available resource, we apply it to provide a performance baseline for the automatic annotation of anatomical concepts in biomedical text using a prominent concept recognition system. The full corpus, released with a CC BY 3.0 license, may be downloaded from http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml. Database URL: http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.",
author = "Michael Bada and Nicole Vasilevsky and Baumgartner, {William A.} and Melissa Haendel and Hunter, {Lawrence E.}",
year = "2017",
month = "1",
day = "1",
doi = "10.1093/database/bax087",
language = "English (US)",
volume = "2017",
journal = "Database : the journal of biological databases and curation",
issn = "1758-0463",
publisher = "Oxford University Press",

}

TY - JOUR

T1 - Gold-standard ontology-based anatomical annotation in the CRAFT Corpus

AU - Bada, Michael

AU - Vasilevsky, Nicole

AU - Baumgartner, William A.

AU - Haendel, Melissa

AU - Hunter, Lawrence E.

PY - 2017/1/1

Y1 - 2017/1/1

N2 - Gold-standard annotated corpora have become important resources for the training and testing of natural-language-processing (NLP) systems designed to support biocuration efforts, and ontologies are increasingly used to facilitate curational consistency and semantic integration across disparate resources. Bringing together the respective power of these, the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of full-length, open-access biomedical journal articles with extensive manually created syntactic, formatting and semantic markup, was previously created and released. This initial public release has already been used in multiple projects to drive development of systems focused on a variety of biocuration, search, visualization, and semantic and syntactic NLP tasks. Building on its demonstrated utility, we have expanded the CRAFT Corpus with a large set of manually created semantic annotations relying on Uberon, an ontology representing anatomical entities and life-cycle stages of multicellular organisms across species as well as types of multicellular organisms defined in terms of life-cycle stage and sexual characteristics. This newly created set of annotations, which has been added for v2.1 of the corpus, is by far the largest publicly available collection of gold-standard anatomical markup and is the first large-scale effort at manual markup of biomedical text relying on the entirety of an anatomical terminology, as opposed to annotation with a small number of high-level anatomical categories, as performed in previous corpora. In addition to presenting and discussing this newly available resource, we apply it to provide a performance baseline for the automatic annotation of anatomical concepts in biomedical text using a prominent concept recognition system. The full corpus, released with a CC BY 3.0 license, may be downloaded from http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml. Database URL: http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

AB - Gold-standard annotated corpora have become important resources for the training and testing of natural-language-processing (NLP) systems designed to support biocuration efforts, and ontologies are increasingly used to facilitate curational consistency and semantic integration across disparate resources. Bringing together the respective power of these, the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of full-length, open-access biomedical journal articles with extensive manually created syntactic, formatting and semantic markup, was previously created and released. This initial public release has already been used in multiple projects to drive development of systems focused on a variety of biocuration, search, visualization, and semantic and syntactic NLP tasks. Building on its demonstrated utility, we have expanded the CRAFT Corpus with a large set of manually created semantic annotations relying on Uberon, an ontology representing anatomical entities and life-cycle stages of multicellular organisms across species as well as types of multicellular organisms defined in terms of life-cycle stage and sexual characteristics. This newly created set of annotations, which has been added for v2.1 of the corpus, is by far the largest publicly available collection of gold-standard anatomical markup and is the first large-scale effort at manual markup of biomedical text relying on the entirety of an anatomical terminology, as opposed to annotation with a small number of high-level anatomical categories, as performed in previous corpora. In addition to presenting and discussing this newly available resource, we apply it to provide a performance baseline for the automatic annotation of anatomical concepts in biomedical text using a prominent concept recognition system. The full corpus, released with a CC BY 3.0 license, may be downloaded from http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml. Database URL: http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

UR - http://www.scopus.com/inward/record.url?scp=85075114633&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85075114633&partnerID=8YFLogxK

U2 - 10.1093/database/bax087

DO - 10.1093/database/bax087

M3 - Article

C2 - 31725864

AN - SCOPUS:85075114633

VL - 2017

JO - Database : the journal of biological databases and curation

JF - Database : the journal of biological databases and curation

SN - 1758-0463

ER -