TY - JOUR
T1 - A probabilistic automated tagger to identify human-related publications
AU - Cohen, Aaron M.
AU - Dunivin, Zackary O.
AU - Smalheiser, Neil R.
N1 - Publisher Copyright: © The Author(s) 2018. Published by Oxford University Press.
PY - 2018/1/1
Y1 - 2018/1/1
N2 - The Medical Subject Heading 'Humans' is manually curated and indicates human-related studies within MEDLINE. However, newly published MEDLINE articles may take months to be indexed, and non-MEDLINE articles lack consistent, transparent indexing of this feature. Therefore, for up-to-date and broad literature searches, there is a need for an independent automated system to identify whether a given publication is human-related, particularly when it lacks Medical Subject Headings. One million MEDLINE records published from 1987 to 2014 were randomly selected. Text-based features from the title, abstract, author name and journal fields were extracted. A linear support vector machine was trained to estimate the probability that a given article should be indexed as Humans and was evaluated on records from 2015 to 2016. Overall accuracy was high: area under the receiver operating characteristic curve = 0.976, F1 = 95% relative to MeSH indexing. Manual review of cases of extreme disagreement with MEDLINE showed 73.5% agreement with the automated prediction. We have tagged all articles indexed in PubMed with predictive scores and have made the information publicly available at http://arrowsmith.psych.uic.edu/evidence-based-medicine/index.html. We have also made available a web-based interface that allows users to obtain predictive scores for non-MEDLINE articles. These resources will assist in the triage of clinical evidence for writing systematic reviews.
UR - http://www.scopus.com/inward/record.url?scp=85057276211&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85057276211&partnerID=8YFLogxK
U2 - 10.1093/database/bay079
DO - 10.1093/database/bay079
M3 - Article
C2 - 30184195
AN - SCOPUS:85057276211
SN - 1758-0463
VL - 2018
JO - Database : the journal of biological databases and curation
JF - Database : the journal of biological databases and curation
IS - 2018
ER -