Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts

Aaron Cohen, William (Bill) Hersh, C. Dubay, K. Spackman

Research output: Contribution to journal › Article

40 Citations (Scopus)

Abstract

Background: Text-mining can assist biomedical researchers in reducing information overload by extracting useful knowledge from large collections of text. We developed a novel text-mining method based on analyzing the network structure created by symbol co-occurrences as a way to extend the capabilities of knowledge extraction. The method was applied to the task of automatic gene and protein name synonym extraction.

Results: Performance was measured on a test set consisting of about 50,000 abstracts from one year of MEDLINE. Synonyms retrieved from curated genomics databases were used as a gold standard. The system obtained a maximum F-score of 22.21% (23.18% precision and 21.36% recall), with high efficiency in the use of seed pairs.

Conclusion: The method performs comparably with other studied methods, does not rely on sophisticated named-entity recognition, and requires little initial seed knowledge.
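As a rough illustration of the ideas named in the abstract, the sketch below builds a symbol co-occurrence network from tokenized abstracts, scores a candidate pair by how much their network neighborhoods overlap, and evaluates a set of predicted synonym pairs against a gold standard with precision, recall, and F-score. It is not the authors' algorithm: the abstract does not describe their scoring function or their use of seed pairs, so the Jaccard neighborhood overlap, the function names, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_network(documents):
    """Map each symbol to the set of symbols it shares a document with.

    `documents` is an iterable of token lists (one list per abstract).
    """
    neighbors = defaultdict(set)
    for tokens in documents:
        # Each unordered pair of distinct symbols in a document is an edge.
        for a, b in combinations(sorted(set(tokens)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def shared_neighbor_score(neighbors, a, b):
    """Jaccard overlap of the neighborhoods of `a` and `b`.

    Assumption for this sketch: symbols that co-occur with the same
    other symbols are more likely to be synonyms.
    """
    na = neighbors[a] - {b}
    nb = neighbors[b] - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


def precision_recall_f1(predicted, gold):
    """Precision, recall, and balanced F-score of predicted pairs vs. gold pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy corpus: each inner list stands in for one tokenized abstract.
toy_abstracts = [
    ["p53", "apoptosis", "mdm2"],
    ["TP53", "apoptosis", "mdm2"],
    ["BRCA1", "repair"],
]

network = build_cooccurrence_network(toy_abstracts)
print(shared_neighbor_score(network, "p53", "TP53"))
print(shared_neighbor_score(network, "p53", "BRCA1"))
```

In this toy corpus "p53" and "TP53" never appear in the same abstract, yet they share all of their neighbors, which is the kind of structural signal a co-occurrence network can expose without a named-entity recognizer. The balanced F-score computed by `precision_recall_f1` is the harmonic mean of precision and recall; the abstract does not state which F-score variant the reported figures use.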

Original language: English (US)
Article number: 103
Journal: BMC Bioinformatics
Volume: 6
DOI: 10.1186/1471-2105-6-103
State: Published, Apr 22 2005

Fingerprint

Network Structure
MEDLINE
Names
Genes
Gene
Proteins
Protein
Data Mining
Text Mining
Seed
Named Entity Recognition
Knowledge Extraction
Overload
Test Set
Genomics
Gold
High Efficiency
Seeds
Research Personnel
Databases

ASJC Scopus subject areas

  • Medicine(all)
  • Structural Biology
  • Applied Mathematics

Cite this

Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts. / Cohen, Aaron; Hersh, William (Bill); Dubay, C.; Spackman, K.

In: BMC Bioinformatics, Vol. 6, 103, 22.04.2005.

@article{14389a0af7394d10a498778959214688,
title = "Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts",
abstract = "Background: Text-mining can assist biomedical researchers in reducing information overload by extracting useful knowledge from large collections of text. We developed a novel text-mining method based on analyzing the network structure created by symbol co-occurrences as a way to extend the capabilities of knowledge extraction. The method was applied to the task of automatic gene and protein name synonym extraction. Results: Performance was measured on a test set consisting of about 50,000 abstracts from one year of MEDLINE. Synonyms retrieved from curated genomics databases were used as a gold standard. The system obtained a maximum F-score of 22.21{\%} (23.18{\%} precision and 21.36{\%} recall), with high efficiency in the use of seed pairs. Conclusion: The method performs comparably with other studied methods, does not rely on sophisticated named-entity recognition, and requires little initial seed knowledge.",
author = "Aaron Cohen and Hersh, {William (Bill)} and C. Dubay and K. Spackman",
year = "2005",
month = "4",
day = "22",
doi = "10.1186/1471-2105-6-103",
language = "English (US)",
volume = "6",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts

AU - Cohen, Aaron

AU - Hersh, William (Bill)

AU - Dubay, C.

AU - Spackman, K.

PY - 2005/4/22

Y1 - 2005/4/22

AB - Background: Text-mining can assist biomedical researchers in reducing information overload by extracting useful knowledge from large collections of text. We developed a novel text-mining method based on analyzing the network structure created by symbol co-occurrences as a way to extend the capabilities of knowledge extraction. The method was applied to the task of automatic gene and protein name synonym extraction. Results: Performance was measured on a test set consisting of about 50,000 abstracts from one year of MEDLINE. Synonyms retrieved from curated genomics databases were used as a gold standard. The system obtained a maximum F-score of 22.21% (23.18% precision and 21.36% recall), with high efficiency in the use of seed pairs. Conclusion: The method performs comparably with other studied methods, does not rely on sophisticated named-entity recognition, and requires little initial seed knowledge.

UR - http://www.scopus.com/inward/record.url?scp=25444525410&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=25444525410&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-6-103

DO - 10.1186/1471-2105-6-103

M3 - Article

VL - 6

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

M1 - 103

ER -