Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach

Byron C. Wallace, Anna Noel-Storr, Iain J. Marshall, Aaron M. Cohen, Neil R. Smalheiser, James Thomas

Research output: Contribution to journal › Article

38 Scopus citations


Objectives: Identifying all published reports of randomized controlled trials (RCTs) is an important aim, but it requires extensive manual effort to separate RCTs from non-RCTs, even using current machine learning (ML) approaches. We aimed to make this process more efficient via a hybrid approach using both crowdsourcing and ML. Methods: We trained a classifier to discriminate between citations that describe RCTs and those that do not. We then adopted a simple strategy of automatically excluding citations deemed very unlikely to be RCTs by the classifier and deferring to crowdworkers otherwise. Results: Combining ML and crowdsourcing provides a highly sensitive RCT identification strategy (our estimates suggest 95%-99% recall) with substantially less effort (we observed a reduction of around 60%-80%) than relying on manual screening alone. Conclusions: Hybrid crowd-ML strategies warrant further exploration for biomedical curation/annotation tasks.
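The triage rule described under Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy scores and labels, and the threshold value are all hypothetical, and it assumes crowdworkers label every deferred citation correctly, which real crowds will not always do.

```python
from typing import List, Sequence, Tuple


def triage(scores: Sequence[float], threshold: float) -> List[bool]:
    """Return True for citations deferred to the crowd, False for auto-excluded ones.

    `scores` are hypothetical classifier probabilities that a citation is an RCT.
    """
    return [s >= threshold for s in scores]


def evaluate(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Estimate recall and manual-effort reduction for a given threshold.

    Assumption: every deferred citation is labelled correctly by the crowd,
    so the only RCTs lost are those the classifier excludes automatically.
    """
    deferred = triage(scores, threshold)
    n_rct = sum(labels)
    kept_rct = sum(1 for d, y in zip(deferred, labels) if d and y)
    recall = kept_rct / n_rct if n_rct else 1.0
    effort_reduction = 1 - sum(deferred) / len(scores)
    return recall, effort_reduction


# Toy data (made up for illustration): 1 marks a true RCT, 0 a non-RCT.
scores = [0.02, 0.91, 0.10, 0.40, 0.01, 0.75, 0.03, 0.05, 0.60, 0.04]
labels = [0, 1, 0, 1, 0, 1, 0, 0, 0, 0]

recall, reduction = evaluate(scores, labels, threshold=0.08)
print(f"recall={recall:.2f}, effort reduction={reduction:.2f}")
```

A low threshold keeps recall high at the cost of sending more citations to the crowd; raising it saves more effort but risks excluding true RCTs, which is the trade-off behind the recall and effort figures reported under Results.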

Original language: English (US)
Pages (from-to): 1165-1168
Number of pages: 4
Journal: Journal of the American Medical Informatics Association
Issue number: 6
State: Published - Nov 1 2017



  • Crowdsourcing
  • Evidence-based medicine
  • Human computation
  • Machine learning
  • Natural language processing

ASJC Scopus subject areas

  • Health Informatics
