Assessing the Feasibility of Large-Scale Natural Language Processing in a Corpus of Ordinary Medical Records: A Lexical Analysis

William R. Hersh; Emily M. Campbell; Susan E. Malveau

Medical Informatics and Clinical Epidemiology

Research output: Contribution to journal › Article › peer-review

16 Scopus citations

Abstract

Objective: Identify the lexical content of a large corpus of ordinary medical records to assess the feasibility of large-scale natural language processing. Methods: A corpus of 560 megabytes of medical record text from an academic medical center was broken into individual words and compared with the words in six medical vocabularies, a common word list, and a database of patient names. Unrecognized words were assessed for algorithmic and contextual approaches to identifying more words, while the remainder were analyzed for spelling correctness. Results: About 60% of the words occurred in the medical vocabularies, common word list, or names database. Of the remainder, one-third were recognizable by other means. Of the remaining unrecognizable words, over three-fourths represented correctly spelled real words and the rest were misspellings. Conclusions: Large-scale generalized natural language processing methods for the medical record will require expansion of existing vocabularies, spelling error correction, and other algorithmic approaches to map words into those from clinical vocabularies.
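The word-matching step described under Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenization rule, the function names, and the tiny lexicons below are all assumptions made for illustration, and the abstract does not say whether the reported percentages count distinct words or word occurrences (this sketch counts distinct words).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (an assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_coverage(text, lexicons):
    """Return (fraction of distinct words found in any lexicon, Counter of unrecognized words)."""
    counts = Counter(tokenize(text))
    # Merge every word list into one lookup set.
    known = set().union(*lexicons.values())
    unrecognized = Counter({w: n for w, n in counts.items() if w not in known})
    recognized_types = len(counts) - len(unrecognized)
    coverage = recognized_types / len(counts) if counts else 0.0
    return coverage, unrecognized


# Hypothetical toy lexicons standing in for the medical vocabularies,
# the common word list, and the patient names database.
lexicons = {
    "medical": {"hypertension", "aspirin", "pain"},
    "common": {"the", "patient", "has", "and", "takes", "for"},
    "names": {"smith"},
}

note = "The patient Smith has hypertension and takes asprin for pain"
coverage, unrecognized = lexical_coverage(note, lexicons)
print(coverage)            # 0.9
print(dict(unrecognized))  # {'asprin': 1}
```

In this toy note the only unmatched word is the misspelling "asprin", which is the kind of leftover the study went on to examine for spelling correctness.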

Original language	English (US)
Pages (from-to)	580-584
Number of pages	5
Journal	Journal of the American Medical Informatics Association
Volume	4
Issue number	SUPPL.
State	Published - 1997

ASJC Scopus subject areas

Health Informatics

Cite this

@article{e63f5967e3e04188ba9340bb7d340213,
  title = "Assessing the Feasibility of Large-Scale Natural Language Processing in a Corpus of Ordinary Medical Records: A Lexical Analysis",
  abstract = "Objective: Identify the lexical content of a large corpus of ordinary medical records to assess the feasibility of large-scale natural language processing. Methods: A corpus of 560 megabytes of medical record text from an academic medical center was broken into individual words and compared with the words in six medical vocabularies, a common word list, and a database of patient names. Unrecognized words were assessed for algorithmic and contextual approaches to identifying more words, while the remainder were analyzed for spelling correctness. Results: About 60% of the words occurred in the medical vocabularies, common word list, or names database. Of the remainder, one-third were recognizable by other means. Of the remaining unrecognizable words, over three-fourths represented correctly spelled real words and the rest were misspellings. Conclusions: Large-scale generalized natural language processing methods for the medical record will require expansion of existing vocabularies, spelling error correction, and other algorithmic approaches to map words into those from clinical vocabularies.",
  author = "Hersh, {William R.} and Campbell, {Emily M.} and Malveau, {Susan E.}",
  year = "1997",
  language = "English (US)",
  volume = "4",
  pages = "580--584",
  journal = "Journal of the American Medical Informatics Association",
  issn = "1067-5027",
  publisher = "Oxford University Press",
  number = "SUPPL.",
}

TY  - JOUR
T1  - Assessing the Feasibility of Large-Scale Natural Language Processing in a Corpus of Ordinary Medical Records
T2  - A Lexical Analysis
AU  - Hersh, William R.
AU  - Campbell, Emily M.
AU  - Malveau, Susan E.
PY  - 1997
Y1  - 1997
N2  - Objective: Identify the lexical content of a large corpus of ordinary medical records to assess the feasibility of large-scale natural language processing. Methods: A corpus of 560 megabytes of medical record text from an academic medical center was broken into individual words and compared with the words in six medical vocabularies, a common word list, and a database of patient names. Unrecognized words were assessed for algorithmic and contextual approaches to identifying more words, while the remainder were analyzed for spelling correctness. Results: About 60% of the words occurred in the medical vocabularies, common word list, or names database. Of the remainder, one-third were recognizable by other means. Of the remaining unrecognizable words, over three-fourths represented correctly spelled real words and the rest were misspellings. Conclusions: Large-scale generalized natural language processing methods for the medical record will require expansion of existing vocabularies, spelling error correction, and other algorithmic approaches to map words into those from clinical vocabularies.
AB  - Objective: Identify the lexical content of a large corpus of ordinary medical records to assess the feasibility of large-scale natural language processing. Methods: A corpus of 560 megabytes of medical record text from an academic medical center was broken into individual words and compared with the words in six medical vocabularies, a common word list, and a database of patient names. Unrecognized words were assessed for algorithmic and contextual approaches to identifying more words, while the remainder were analyzed for spelling correctness. Results: About 60% of the words occurred in the medical vocabularies, common word list, or names database. Of the remainder, one-third were recognizable by other means. Of the remaining unrecognizable words, over three-fourths represented correctly spelled real words and the rest were misspellings. Conclusions: Large-scale generalized natural language processing methods for the medical record will require expansion of existing vocabularies, spelling error correction, and other algorithmic approaches to map words into those from clinical vocabularies.
UR  - http://www.scopus.com/inward/record.url?scp=0347210453&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=0347210453&partnerID=8YFLogxK
M3  - Article
C2  - 9357692
AN  - SCOPUS:0347210453
SN  - 1067-5027
VL  - 4
SP  - 580
EP  - 584
JO  - Journal of the American Medical Informatics Association
JF  - Journal of the American Medical Informatics Association
IS  - SUPPL.
ER  -
