Overview of BioCreative II gene normalization

Alexander A. Morgan, Zhiyong Lu, Xinglong Wang, Aaron Cohen, Juliane Fluck, Patrick Ruch, Anna Divoli, Katrin Fundel, Robert Leaman, Jörg Hakenberg, Chengjie Sun, Heng Hui Liu, Rafael Torres, Michael Krauthammer, William W. Lau, Hongfang Liu, Chun Nan Hsu, Martijn Schuemie, K. Bretonnel Cohen, Lynette Hirschman

Research output: Contribution to journal (article)

188 Citations (Scopus)

Abstract

Background: The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%.

Results: Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (balanced precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers obtained improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers.

Conclusion: Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to human experts, at over 90% agreement. These results show promise as tools to link the literature with biological databases.
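The evaluation measures and the simple voting idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that system output and gold standard are both sets of (document, gene identifier) pairs, which is one natural reading of the task but is not spelled out in the abstract. The function names, document IDs, and identifiers below are hypothetical toy values, not task data, and the sketch covers only threshold voting, not the classifier-based combination or the 10-fold cross-validation the abstract also reports.

```python
from collections import Counter


def precision_recall_f(predicted, gold):
    """Set-based precision, recall and balanced F-measure.

    Both arguments are sets of (doc_id, gene_id) pairs (an assumption
    made for this sketch).
    """
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def vote(system_outputs, min_votes):
    """Keep every pair returned by at least min_votes systems."""
    counts = Counter(pair for output in system_outputs for pair in set(output))
    return {pair for pair, n in counts.items() if n >= min_votes}


# Toy data: hypothetical documents and identifiers.
gold = {("doc1", 101), ("doc1", 202), ("doc2", 303), ("doc2", 404)}
systems = [
    {("doc1", 101), ("doc1", 202), ("doc2", 999)},
    {("doc1", 101), ("doc2", 303), ("doc2", 888)},
    {("doc1", 101), ("doc1", 202), ("doc2", 303), ("doc2", 404)},
]

# Majority vote: pairs proposed by at least two of the three systems.
majority = vote(systems, min_votes=2)
# Pooling all responses (min_votes=1) mirrors the 'maximum recall' setting.
pooled = vote(systems, min_votes=1)

print("majority:", precision_recall_f(majority, gold))
print("pooled:  ", precision_recall_f(pooled, gold))

# Recall of the pooled system as reported in the abstract: 763 of 785.
reported_recall = 763 / 785
print("reported pooled recall:", round(reported_recall, 2))
```

In this toy example the majority vote trades recall for precision, while pooling every response recovers all gold pairs at the cost of more false positives, the same pattern as the reported recall of 0.97 with precision 0.23.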

Original language: English (US)
Article number: S3
Journal: Genome Biology
Volume: 9
Issue number: SUPPL. 2
DOI: https://doi.org/10.1186/gb-2008-9-s2-s3
State: Published - Sep 1 2008

ASJC Scopus subject areas

  • Genetics
  • Cell Biology
  • Ecology, Evolution, Behavior and Systematics

Cite this

Morgan, A. A., Lu, Z., Wang, X., Cohen, A., Fluck, J., Ruch, P., ... Hirschman, L. (2008). Overview of BioCreative II gene normalization. Genome Biology, 9(SUPPL. 2), [S3]. https://doi.org/10.1186/gb-2008-9-s2-s3

@article{b70a4daad35f48dda6290d93ba917546,
title = "Overview of BioCreative II gene normalization",
abstract = "Background: The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90{\%}. Results: Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (balanced precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers obtained improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers. Conclusion: Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to human experts, at over 90{\%} agreement. These results show promise as tools to link the literature with biological databases.",
author = "Morgan, {Alexander A.} and Zhiyong Lu and Xinglong Wang and Aaron Cohen and Juliane Fluck and Patrick Ruch and Anna Divoli and Katrin Fundel and Robert Leaman and J{\"o}rg Hakenberg and Chengjie Sun and Liu, {Heng Hui} and Rafael Torres and Michael Krauthammer and Lau, {William W.} and Hongfang Liu and Hsu, {Chun Nan} and Martijn Schuemie and Cohen, {K. Bretonnel} and Lynette Hirschman",
year = "2008",
month = "9",
day = "1",
doi = "10.1186/gb-2008-9-s2-s3",
language = "English (US)",
volume = "9",
journal = "Genome Biology",
issn = "1474-7596",
publisher = "BioMed Central",
number = "SUPPL. 2",

}

TY - JOUR
T1 - Overview of BioCreative II gene normalization
AU - Morgan, Alexander A.
AU - Lu, Zhiyong
AU - Wang, Xinglong
AU - Cohen, Aaron
AU - Fluck, Juliane
AU - Ruch, Patrick
AU - Divoli, Anna
AU - Fundel, Katrin
AU - Leaman, Robert
AU - Hakenberg, Jörg
AU - Sun, Chengjie
AU - Liu, Heng Hui
AU - Torres, Rafael
AU - Krauthammer, Michael
AU - Lau, William W.
AU - Liu, Hongfang
AU - Hsu, Chun Nan
AU - Schuemie, Martijn
AU - Cohen, K. Bretonnel
AU - Hirschman, Lynette
PY - 2008/9/1
Y1 - 2008/9/1
N2 - Background: The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%. Results: Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (balanced precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers obtained improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers. Conclusion: Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to human experts, at over 90% agreement. These results show promise as tools to link the literature with biological databases.

UR - http://www.scopus.com/inward/record.url?scp=41349103793&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=41349103793&partnerID=8YFLogxK
U2 - 10.1186/gb-2008-9-s2-s3
DO - 10.1186/gb-2008-9-s2-s3
M3 - Article
C2 - 18834494
AN - SCOPUS:41349103793
VL - 9
JO - Genome Biology
JF - Genome Biology
SN - 1474-7596
IS - SUPPL. 2
M1 - S3
ER -