Assessing the difficulty and time cost of de-identification in clinical narratives

David A. Dorr; W. F. Phillips; S. Phansalkar; S. A. Sims; J. F. Hurdle

doi:10.1055/s-0038-1634080

Medical Informatics and Clinical Epidemiology

Research output: Contribution to journal › Review article › peer-review

52 Scopus citations

Abstract

Objective: To characterize the difficulty confronting investigators in removing protected health information (PHI) from cross-discipline, free-text clinical notes, an important challenge to clinical informatics research as recalibrated by the introduction of the US Health Insurance Portability and Accountability Act (HIPAA) and similar regulations.

Methods: Randomized selection of clinical narratives from complete admissions written by diverse providers, reviewed using a two-tiered rater system and simple automated regular expression tools. For manual review, two independent reviewers used simple search and replace algorithms and visual scanning to find PHI as defined by HIPAA, followed by an independent second review to detect any missed PHI. Simple automated review was also performed for the "easy" PHI that are number- or date-based.

Results: From 262 notes, 2074 PHI, or 7.9 ± 6.1 per note, were found. The average recall (or sensitivity) was 95.9% while precision was 99.6% for single reviewers. Agreement between individual reviewers was strong (ICC = 0.99), although some asymmetry in errors was seen between reviewers (p = 0.001). The automated technique had better recall (98.5%) but worse precision (88.4%) for its subset of identifiers. Manually de-identifying a note took 87.3 ± 61 seconds on average.

Conclusions: Manual de-identification of free-text notes is tedious and time-consuming, but even simple PHI is difficult to automatically identify with the exactitude required under HIPAA.
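The abstract's "simple automated regular expression tools" for number- or date-based PHI, and the recall and precision figures it reports, can be illustrated with a minimal Python sketch. The study's actual expressions are not given in this record, so the patterns, labels, and function names below (`PATTERNS`, `find_phi`, `scrub`, `precision_recall`) are hypothetical assumptions chosen for illustration only.

```python
import re

# Hypothetical patterns for "easy" number- or date-based identifiers.
# Illustrative assumptions only; not the expressions used in the study.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_phi(text):
    """Return sorted (start, end, label) spans for every pattern match."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def scrub(text):
    """Replace each matched span with a bracketed label such as [DATE]."""
    # Work from the end of the text so earlier offsets stay valid.
    # Overlapping matches are not handled in this sketch.
    for start, end, label in reversed(find_phi(text)):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


def precision_recall(predicted, gold):
    """Precision = TP / predicted items; recall = TP / gold items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

On a made-up note such as "Admitted 03/14/2005. Callback 801-555-0142. Scores 10/12/15.", the date pattern also flags the non-PHI string "10/12/15". Every true identifier is found, but one of the three flagged items is a false positive, which mirrors in miniature the high-recall, lower-precision behavior the abstract reports for the automated technique.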

Original language	English (US)
Pages (from-to)	246-252
Number of pages	7
Journal	Methods of Information in Medicine
Volume	45
Issue number	3
DOIs	https://doi.org/10.1055/s-0038-1634080
State	Published - 2006

Keywords

Computerized medical records systems
De-identification
Health Insurance Portability and Accountability Act
Medical informatics computing
natural language processing

ASJC Scopus subject areas

Health Informatics
Advanced and Specialized Nursing
Health Information Management

Cite this

@article{6355723d942949998609bb62ab34016e,

title = "Assessing the difficulty and time cost of de-identification in clinical narratives",

abstract = "Objective: To characterize the difficulty confronting investigators in removing protected health information (PHI) from cross-discipline, free-text clinical notes, an important challenge to clinical informatics research as recalibrated by the introduction of the US Health Insurance Portability and Accountability Act (HIPAA) and similar regulations. Methods: Randomized selection of clinical narratives from complete admissions written by diverse providers, reviewed using a two-tiered rater system and simple automated regular expression tools. For manual review, two independent reviewers used simple search and replace algorithms and visual scanning to find PHI as defined by HIPAA, followed by an independent second review to detect any missed PHI. Simple automated review was also performed for the {"}easy{"} PHI that are number- or date-based. Results: From 262 notes, 2074 PHI, or 7.9 ± 6.1 per note, were found. The average recall (or sensitivity) was 95.9% while precision was 99.6% for single reviewers. Agreement between individual reviewers was strong (ICC = 0.99), although some asymmetry in errors was seen between reviewers (p = 0.001). The automated technique had better recall (98.5%) but worse precision (88.4%) for its subset of identifiers. Manually de-identifying a note took 87.3 ± 61 seconds on average. Conclusions: Manual de-identification of free-text notes is tedious and time-consuming, but even simple PHI is difficult to automatically identify with the exactitude required under HIPAA.",

keywords = "Computerized medical records systems, De-identification, Health Insurance Portability and Accountability Act, Medical informatics computing, natural language processing",

author = "Dorr, {David A.} and Phillips, {W. F.} and Phansalkar, S. and Sims, {S. A.} and Hurdle, {J. F.}",

year = "2006",

doi = "10.1055/s-0038-1634080",

language = "English (US)",

volume = "45",

pages = "246--252",

journal = "Methods of Information in Medicine",

issn = "0026-1270",

publisher = "Schattauer GmbH",

number = "3",

}

TY - JOUR

T1 - Assessing the difficulty and time cost of de-identification in clinical narratives

AU - Dorr, David A.

AU - Phillips, W. F.

AU - Phansalkar, S.

AU - Sims, S. A.

AU - Hurdle, J. F.

PY - 2006

Y1 - 2006

AB - Objective: To characterize the difficulty confronting investigators in removing protected health information (PHI) from cross-discipline, free-text clinical notes, an important challenge to clinical informatics research as recalibrated by the introduction of the US Health Insurance Portability and Accountability Act (HIPAA) and similar regulations. Methods: Randomized selection of clinical narratives from complete admissions written by diverse providers, reviewed using a two-tiered rater system and simple automated regular expression tools. For manual review, two independent reviewers used simple search and replace algorithms and visual scanning to find PHI as defined by HIPAA, followed by an independent second review to detect any missed PHI. Simple automated review was also performed for the "easy" PHI that are number- or date-based. Results: From 262 notes, 2074 PHI, or 7.9 ± 6.1 per note, were found. The average recall (or sensitivity) was 95.9% while precision was 99.6% for single reviewers. Agreement between individual reviewers was strong (ICC = 0.99), although some asymmetry in errors was seen between reviewers (p = 0.001). The automated technique had better recall (98.5%) but worse precision (88.4%) for its subset of identifiers. Manually de-identifying a note took 87.3 ± 61 seconds on average. Conclusions: Manual de-identification of free-text notes is tedious and time-consuming, but even simple PHI is difficult to automatically identify with the exactitude required under HIPAA.

KW - Computerized medical records systems

KW - De-identification

KW - Health Insurance Portability and Accountability Act

KW - Medical informatics computing

KW - natural language processing

UR - http://www.scopus.com/inward/record.url?scp=33745562886&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33745562886&partnerID=8YFLogxK

U2 - 10.1055/s-0038-1634080

DO - 10.1055/s-0038-1634080

M3 - Review article

C2 - 16685332

AN - SCOPUS:33745562886

SN - 0026-1270

VL - 45

SP - 246

EP - 252

JO - Methods of Information in Medicine

JF - Methods of Information in Medicine

IS - 3

ER -
