Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach

Byron C. Wallace, Anna Noel-Storr, Iain J. Marshall, Aaron Cohen, Neil R. Smalheiser, James Thomas

Research output: Contribution to journal (Article)

30 Citations (Scopus)

Abstract

Objectives: Identifying all published reports of randomized controlled trials (RCTs) is an important aim, but it requires extensive manual effort to separate RCTs from non-RCTs, even using current machine learning (ML) approaches. We aimed to make this process more efficient via a hybrid approach using both crowdsourcing and ML. Methods: We trained a classifier to discriminate between citations that describe RCTs and those that do not. We then adopted a simple strategy of automatically excluding citations deemed very unlikely to be RCTs by the classifier and deferring to crowdworkers otherwise. Results: Combining ML and crowdsourcing provides a highly sensitive RCT identification strategy (our estimates suggest 95%-99% recall) with substantially less effort (we observed a reduction of around 60%-80%) than relying on manual screening alone. Conclusions: Hybrid crowd-ML strategies warrant further exploration for biomedical curation/annotation tasks.
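
The triage rule described under Methods (exclude citations the classifier considers very unlikely to be RCTs, send everything else to crowdworkers) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the names Citation, triage, workload_reduction and toy_score are hypothetical, the 0.1 threshold is an arbitrary placeholder, and the keyword scorer merely stands in for a trained classifier.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Citation:
    """Hypothetical container for one citation (identifier plus title/abstract text)."""
    pmid: str
    text: str


def triage(
    citations: Iterable[Citation],
    score: Callable[[Citation], float],
    threshold: float = 0.1,  # assumed cut-off, not a value taken from the paper
) -> Tuple[List[Citation], List[Citation]]:
    """Split citations into (auto-excluded, deferred to crowdworkers).

    A citation is excluded automatically only when its estimated
    probability of being an RCT falls below the threshold.
    """
    excluded: List[Citation] = []
    deferred: List[Citation] = []
    for citation in citations:
        if score(citation) < threshold:
            excluded.append(citation)
        else:
            deferred.append(citation)
    return excluded, deferred


def workload_reduction(excluded: List[Citation], deferred: List[Citation]) -> float:
    """Fraction of citations that never reach a human screener."""
    total = len(excluded) + len(deferred)
    return len(excluded) / total if total else 0.0


def toy_score(citation: Citation) -> float:
    """Stand-in scorer: a keyword check, NOT a trained classifier."""
    text = citation.text.lower()
    return 0.9 if "randomized" in text or "randomised" in text else 0.05


if __name__ == "__main__":
    sample = [
        Citation("1", "A randomized controlled trial of drug X"),
        Citation("2", "Case report of condition Y"),
    ]
    out, crowd = triage(sample, toy_score)
    print(len(out), "excluded;", len(crowd), "sent to the crowd")
```

Under this kind of rule, lowering the threshold would be expected to protect recall at the cost of sending more citations to the crowd, which is the trade-off behind the recall and effort figures reported in the abstract.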

Original language: English (US)
Pages (from-to): 1165-1168
Number of pages: 4
Journal: Journal of the American Medical Informatics Association
Volume: 24
Issue number: 6
DOI: 10.1093/jamia/ocx053
State: Published - Nov 1 2017

Fingerprint

Crowdsourcing
Randomized Controlled Trials
Machine Learning

Keywords

  • Crowdsourcing
  • Evidence-based medicine
  • Human computation
  • Machine learning
  • Natural language processing

ASJC Scopus subject areas

  • Health Informatics

Cite this

Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach. / Wallace, Byron C.; Noel-Storr, Anna; Marshall, Iain J.; Cohen, Aaron; Smalheiser, Neil R.; Thomas, James.

In: Journal of the American Medical Informatics Association, Vol. 24, No. 6, 01.11.2017, p. 1165-1168.

Wallace, Byron C. ; Noel-Storr, Anna ; Marshall, Iain J. ; Cohen, Aaron ; Smalheiser, Neil R. ; Thomas, James. / Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach. In: Journal of the American Medical Informatics Association. 2017 ; Vol. 24, No. 6. pp. 1165-1168.
@article{91e5c4eb452f4463ba682e8704c33cea,
title = "Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach",
abstract = "Objectives: Identifying all published reports of randomized controlled trials (RCTs) is an important aim, but it requires extensive manual effort to separate RCTs from non-RCTs, even using current machine learning (ML) approaches. We aimed to make this process more efficient via a hybrid approach using both crowdsourcing and ML. Methods: We trained a classifier to discriminate between citations that describe RCTs and those that do not. We then adopted a simple strategy of automatically excluding citations deemed very unlikely to be RCTs by the classifier and deferring to crowdworkers otherwise. Results: Combining ML and crowdsourcing provides a highly sensitive RCT identification strategy (our estimates suggest 95{\%}-99{\%} recall) with substantially less effort (we observed a reduction of around 60{\%}-80{\%}) than relying on manual screening alone. Conclusions: Hybrid crowd-ML strategies warrant further exploration for biomedical curation/annotation tasks.",
keywords = "Crowdsourcing, Evidence-based medicine, Human computation, Machine learning, Natural language processing",
author = "Wallace, {Byron C.} and Anna Noel-Storr and Marshall, {Iain J.} and Aaron Cohen and Smalheiser, {Neil R.} and James Thomas",
year = "2017",
month = "11",
day = "1",
doi = "10.1093/jamia/ocx053",
language = "English (US)",
volume = "24",
pages = "1165--1168",
journal = "Journal of the American Medical Informatics Association",
issn = "1067-5027",
publisher = "Oxford University Press",
number = "6",

}

TY - JOUR

T1 - Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach

AU - Wallace, Byron C.

AU - Noel-Storr, Anna

AU - Marshall, Iain J.

AU - Cohen, Aaron

AU - Smalheiser, Neil R.

AU - Thomas, James

PY - 2017/11/1

Y1 - 2017/11/1

N2 - Objectives: Identifying all published reports of randomized controlled trials (RCTs) is an important aim, but it requires extensive manual effort to separate RCTs from non-RCTs, even using current machine learning (ML) approaches. We aimed to make this process more efficient via a hybrid approach using both crowdsourcing and ML. Methods: We trained a classifier to discriminate between citations that describe RCTs and those that do not. We then adopted a simple strategy of automatically excluding citations deemed very unlikely to be RCTs by the classifier and deferring to crowdworkers otherwise. Results: Combining ML and crowdsourcing provides a highly sensitive RCT identification strategy (our estimates suggest 95%-99% recall) with substantially less effort (we observed a reduction of around 60%-80%) than relying on manual screening alone. Conclusions: Hybrid crowd-ML strategies warrant further exploration for biomedical curation/annotation tasks.

AB - Objectives: Identifying all published reports of randomized controlled trials (RCTs) is an important aim, but it requires extensive manual effort to separate RCTs from non-RCTs, even using current machine learning (ML) approaches. We aimed to make this process more efficient via a hybrid approach using both crowdsourcing and ML. Methods: We trained a classifier to discriminate between citations that describe RCTs and those that do not. We then adopted a simple strategy of automatically excluding citations deemed very unlikely to be RCTs by the classifier and deferring to crowdworkers otherwise. Results: Combining ML and crowdsourcing provides a highly sensitive RCT identification strategy (our estimates suggest 95%-99% recall) with substantially less effort (we observed a reduction of around 60%-80%) than relying on manual screening alone. Conclusions: Hybrid crowd-ML strategies warrant further exploration for biomedical curation/annotation tasks.

KW - Crowdsourcing

KW - Evidence-based medicine

KW - Human computation

KW - Machine learning

KW - Natural language processing

UR - http://www.scopus.com/inward/record.url?scp=85028670339&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85028670339&partnerID=8YFLogxK

U2 - 10.1093/jamia/ocx053

DO - 10.1093/jamia/ocx053

M3 - Article

VL - 24

SP - 1165

EP - 1168

JO - Journal of the American Medical Informatics Association

JF - Journal of the American Medical Informatics Association

SN - 1067-5027

IS - 6

ER -