A probabilistic automated tagger to identify human-related publications

Aaron Cohen, Zackary O. Dunivin, Neil R. Smalheiser

Research output: Contribution to journal › Article

Abstract

The Medical Subject Heading 'Humans' is manually curated and indicates human-related studies within MEDLINE. However, newly published MEDLINE articles may take months to be indexed, and non-MEDLINE articles lack consistent, transparent indexing of this feature. Therefore, for up-to-date and broad literature searches, there is a need for an independent automated system to identify whether a given publication is human-related, particularly when it lacks Medical Subject Headings. One million MEDLINE records published in 1987-2014 were randomly selected. Text-based features from the title, abstract, author name and journal fields were extracted. A linear support vector machine was trained to estimate the probability that a given article should be indexed as Humans and was evaluated on records from 2015 to 2016. Overall accuracy was high: area under the receiver operating characteristic curve = 0.976, F1 = 95% relative to MeSH indexing. Manual review of cases of extreme disagreement with MEDLINE showed 73.5% agreement with the automated prediction. We have tagged all articles indexed in PubMed with predictive scores and have made the information publicly available at http://arrowsmith.psych.uic.edu/evidence-based-medicine/index.html. We have also made available a web-based interface to allow users to obtain predictive scores for non-MEDLINE articles. This will assist in the triage of clinical evidence for writing systematic reviews.
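
The classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy records, the binary bag-of-words features, the hyperparameters, and the fixed (unfitted) sigmoid that maps the SVM decision value to a score between 0 and 1 are all assumptions made for illustration. The published system was trained on one million MEDLINE records and drew features from the title, abstract, author name and journal fields.

```python
import math

# Hypothetical toy records: (text, label), label 1 = indexed as Humans.
TOY_RECORDS = [
    ("Patients randomized clinical trial", 1),
    ("Women cohort study patients", 1),
    ("Mice knockout model", 0),
    ("Rat hippocampus neurons mice", 0),
]


def featurize(text):
    """Binary bag-of-words: the set of lowercase whitespace tokens."""
    return set(text.lower().split())


def train_linear_svm(records, epochs=50, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss (linear SVM)."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for text, label in records:
            y = 1.0 if label == 1 else -1.0
            x = featurize(text)
            margin = y * (sum(w.get(t, 0.0) for t in x) + b)
            # L2 regularization shrinks every weight a little.
            for t in w:
                w[t] *= 1.0 - lr * lam
            # The hinge loss is active only inside the margin.
            if margin < 1.0:
                for t in x:
                    w[t] = w.get(t, 0.0) + lr * y
                b += lr * y
    return w, b


def decision_value(model, text):
    """Signed distance-like score: positive leans toward Humans."""
    w, b = model
    return sum(w.get(t, 0.0) for t in featurize(text)) + b


def human_score(model, text, slope=1.0):
    """Map the decision value into (0, 1) with a fixed sigmoid.

    Assumption: a real system would calibrate this mapping on held-out
    data; slope=1.0 here is an arbitrary illustrative choice.
    """
    return 1.0 / (1.0 + math.exp(-slope * decision_value(model, text)))


def auc(scores, labels):
    """Area under the ROC curve as the fraction of correctly ordered
    (positive, negative) pairs, counting ties as one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


model = train_linear_svm(TOY_RECORDS)
train_scores = [human_score(model, text) for text, _ in TOY_RECORDS]
train_labels = [label for _, label in TOY_RECORDS]
```

Calling `human_score(model, "patients in a clinical trial")` returns a higher value than `human_score(model, "knockout mice neurons")`, and `auc(train_scores, train_labels)` summarizes how well the scores rank the toy records. Reporting a continuous score rather than a hard yes/no label mirrors the abstract's description of predictive scores that users can threshold for their own triage needs.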

Original language: English (US)
Journal: Database
Volume: 2018
Issue number: 2018
DOI: 10.1093/database/bay079
State: Published - Jan 1 2018

Fingerprint

MEDLINE
Medical Subject Headings
Medicine
Support vector machines
Triage
Evidence-Based Medicine
systematic review
PubMed
Names
Publications
prediction

ASJC Scopus subject areas

  • Information Systems
  • Biochemistry, Genetics and Molecular Biology(all)
  • Agricultural and Biological Sciences(all)

Cite this

A probabilistic automated tagger to identify human-related publications. / Cohen, Aaron; Dunivin, Zackary O.; Smalheiser, Neil R.

In: Database, Vol. 2018, No. 2018, 01.01.2018.

Cohen, Aaron ; Dunivin, Zackary O. ; Smalheiser, Neil R. / A probabilistic automated tagger to identify human-related publications. In: Database. 2018 ; Vol. 2018, No. 2018.

@article{b5c399a362c1469187d33113399fcd23,
title = "A probabilistic automated tagger to identify human-related publications",
abstract = "The Medical Subject Heading 'Humans' is manually curated and indicates human-related studies within MEDLINE. However, newly published MEDLINE articles may take months to be indexed and non-MEDLINE articles lack consistent, transparent indexing of this feature. Therefore, for up to date and broad literature searches, there is a need for an independent automated system to identify whether a given publication is human-related, particularly when they lack Medical Subject Headings. One million MEDLINE records published in 1987-2014 were randomly selected. Text-based features from the title, abstract, author name and journal fields were extracted. A linear support vector machine was trained to estimate the probability that a given article should be indexed as Humans and was evaluated on records from 2015 to 2016. Overall accuracy was high: area under the receiver operating curve = 0.976, F1 = 95{\%} relative to MeSH indexing. Manual review of cases of extreme disagreement with MEDLINE showed 73.5{\%} agreement with the automated prediction. We have tagged all articles indexed in PubMed with predictive scores and have made the information publicly available at http://arrowsmith.psych.uic.edu/evidence-based-medicine/index.html. We have also made available a web-based interface to allow users to obtain predictive scores for non-MEDLINE articles. This will assist in the triage of clinical evidence for writing systematic reviews.",
author = "Cohen, Aaron and Dunivin, {Zackary O.} and Smalheiser, {Neil R.}",
year = "2018",
month = "1",
day = "1",
doi = "10.1093/database/bay079",
language = "English (US)",
volume = "2018",
journal = "Database : the journal of biological databases and curation",
issn = "1758-0463",
publisher = "Oxford University Press",
number = "2018",
}

TY - JOUR

T1 - A probabilistic automated tagger to identify human-related publications

AU - Cohen, Aaron

AU - Dunivin, Zackary O.

AU - Smalheiser, Neil R.

PY - 2018/1/1

Y1 - 2018/1/1

AB - The Medical Subject Heading 'Humans' is manually curated and indicates human-related studies within MEDLINE. However, newly published MEDLINE articles may take months to be indexed and non-MEDLINE articles lack consistent, transparent indexing of this feature. Therefore, for up to date and broad literature searches, there is a need for an independent automated system to identify whether a given publication is human-related, particularly when they lack Medical Subject Headings. One million MEDLINE records published in 1987-2014 were randomly selected. Text-based features from the title, abstract, author name and journal fields were extracted. A linear support vector machine was trained to estimate the probability that a given article should be indexed as Humans and was evaluated on records from 2015 to 2016. Overall accuracy was high: area under the receiver operating curve = 0.976, F1 = 95% relative to MeSH indexing. Manual review of cases of extreme disagreement with MEDLINE showed 73.5% agreement with the automated prediction. We have tagged all articles indexed in PubMed with predictive scores and have made the information publicly available at http://arrowsmith.psych.uic.edu/evidence-based-medicine/index.html. We have also made available a web-based interface to allow users to obtain predictive scores for non-MEDLINE articles. This will assist in the triage of clinical evidence for writing systematic reviews.

UR - http://www.scopus.com/inward/record.url?scp=85057276211&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85057276211&partnerID=8YFLogxK

U2 - 10.1093/database/bay079

DO - 10.1093/database/bay079

M3 - Article

VL - 2018

JO - Database : the journal of biological databases and curation

JF - Database : the journal of biological databases and curation

SN - 1758-0463

IS - 2018

ER -