Virk: An active learning-based system for bootstrapping knowledge base development in the neurosciences

Kyle H. Ambert; Aaron M. Cohen; Gully A.P.C. Burns; Eilis Boudreau; Kemal Sonmez

doi:10.3389/fninf.2013.00038

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

The frequency and volume of newly published scientific literature are quickly making manual maintenance of publicly available databases of primary data unrealistic and costly. Although machine learning (ML) can be useful for developing automated approaches to identifying scientific publications containing relevant information for a database, developing such tools necessitates manually annotating an unrealistic number of documents. One approach to this problem, active learning (AL), builds classification models by iteratively identifying documents that provide the most information to a classifier. Although this approach has been shown to be effective for related problems, in the context of scientific database curation, it falls short. We present Virk, an AL system that, while being trained, simultaneously learns a classification model and identifies documents having information of interest for a knowledge base. Our approach uses a support vector machine (SVM) classifier with input features derived from neuroscience-related publications from the primary literature. Using our approach, we were able to increase the size of the Neuron Registry, a knowledge base of neuron-related information, by 90% in 3 months. Using standard biocuration methods, it would have taken between 1 and 2 years to make the same number of contributions to the Neuron Registry. Here, we describe the system pipeline in detail and evaluate its performance against other approaches to sampling in AL.
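The iterative loop the abstract attributes to active learning in general can be illustrated with a minimal sketch. This is generic uncertainty sampling, not Virk's own sampling strategy, which the abstract does not specify. A tiny logistic-regression model stands in for the SVM, and the names `train_linear`, `margin`, `active_learning`, `pool`, and `oracle` are all hypothetical.

```python
import math

def train_linear(X, y, epochs=50, lr=0.1):
    """Fit a small logistic-regression model by SGD.
    Assumption: this is a stand-in for the SVM named in the abstract."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of the log loss with respect to z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def margin(model, x):
    """Absolute distance-like score from the decision boundary (smaller = less certain)."""
    w, b = model
    return abs(sum(wj * xj for wj, xj in zip(w, x)) + b)

def active_learning(pool, oracle, seed_idx, rounds):
    """Generic uncertainty sampling: retrain, then ask the oracle (the human
    annotator) to label the unlabeled document the model is least sure about."""
    labeled = {i: oracle(pool[i]) for i in seed_idx}
    for _ in range(rounds):
        idx = sorted(labeled)
        model = train_linear([pool[i] for i in idx], [labeled[i] for i in idx])
        candidates = [i for i in range(len(pool)) if i not in labeled]
        if not candidates:
            break
        query = min(candidates, key=lambda i: margin(model, pool[i]))
        labeled[query] = oracle(pool[query])
    return labeled
```

In a curation setting, `pool` would hold feature vectors for candidate publications and `oracle` would be the curator's relevance judgment; each round costs one manual annotation.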

Original language	English (US)
Article number	38
Journal	Frontiers in Neuroinformatics
Volume	7
Issue number	DEC
DOIs	https://doi.org/10.3389/fninf.2013.00038
State	Published - Dec 25 2013

Keywords

Active learning
Biocuration
Community-curated database
Machine learning
Neuroinformatics
Text-mining

ASJC Scopus subject areas

Neuroscience (miscellaneous)
Biomedical Engineering
Computer Science Applications

Access to Document

10.3389/fninf.2013.00038

Cite this

@article{b529dfac91ef485d8ee0c71dd3e8f577,
  title = "Virk: An active learning-based system for bootstrapping knowledge base development in the neurosciences",
  abstract = "The frequency and volume of newly published scientific literature are quickly making manual maintenance of publicly available databases of primary data unrealistic and costly. Although machine learning (ML) can be useful for developing automated approaches to identifying scientific publications containing relevant information for a database, developing such tools necessitates manually annotating an unrealistic number of documents. One approach to this problem, active learning (AL), builds classification models by iteratively identifying documents that provide the most information to a classifier. Although this approach has been shown to be effective for related problems, in the context of scientific database curation, it falls short. We present Virk, an AL system that, while being trained, simultaneously learns a classification model and identifies documents having information of interest for a knowledge base. Our approach uses a support vector machine (SVM) classifier with input features derived from neuroscience-related publications from the primary literature. Using our approach, we were able to increase the size of the Neuron Registry, a knowledge base of neuron-related information, by 90% in 3 months. Using standard biocuration methods, it would have taken between 1 and 2 years to make the same number of contributions to the Neuron Registry. Here, we describe the system pipeline in detail and evaluate its performance against other approaches to sampling in AL.",
  keywords = "Active learning, Biocuration, Community-curated database, Machine learning, Neuroinformatics, Text-mining",
  author = "Ambert, {Kyle H.} and Cohen, {Aaron M.} and Burns, {Gully A.P.C.} and Eilis Boudreau and Kemal Sonmez",
  year = "2013",
  month = dec,
  day = "25",
  doi = "10.3389/fninf.2013.00038",
  language = "English (US)",
  volume = "7",
  journal = "Frontiers in Neuroinformatics",
  issn = "1662-5196",
  publisher = "Frontiers Research Foundation",
  number = "DEC",
}

TY - JOUR
T1 - Virk
T2 - An active learning-based system for bootstrapping knowledge base development in the neurosciences
AU - Ambert, Kyle H.
AU - Cohen, Aaron M.
AU - Burns, Gully A.P.C.
AU - Boudreau, Eilis
AU - Sonmez, Kemal
PY - 2013/12/25
Y1 - 2013/12/25
N2 - The frequency and volume of newly published scientific literature are quickly making manual maintenance of publicly available databases of primary data unrealistic and costly. Although machine learning (ML) can be useful for developing automated approaches to identifying scientific publications containing relevant information for a database, developing such tools necessitates manually annotating an unrealistic number of documents. One approach to this problem, active learning (AL), builds classification models by iteratively identifying documents that provide the most information to a classifier. Although this approach has been shown to be effective for related problems, in the context of scientific database curation, it falls short. We present Virk, an AL system that, while being trained, simultaneously learns a classification model and identifies documents having information of interest for a knowledge base. Our approach uses a support vector machine (SVM) classifier with input features derived from neuroscience-related publications from the primary literature. Using our approach, we were able to increase the size of the Neuron Registry, a knowledge base of neuron-related information, by 90% in 3 months. Using standard biocuration methods, it would have taken between 1 and 2 years to make the same number of contributions to the Neuron Registry. Here, we describe the system pipeline in detail and evaluate its performance against other approaches to sampling in AL.
AB - The frequency and volume of newly published scientific literature are quickly making manual maintenance of publicly available databases of primary data unrealistic and costly. Although machine learning (ML) can be useful for developing automated approaches to identifying scientific publications containing relevant information for a database, developing such tools necessitates manually annotating an unrealistic number of documents. One approach to this problem, active learning (AL), builds classification models by iteratively identifying documents that provide the most information to a classifier. Although this approach has been shown to be effective for related problems, in the context of scientific database curation, it falls short. We present Virk, an AL system that, while being trained, simultaneously learns a classification model and identifies documents having information of interest for a knowledge base. Our approach uses a support vector machine (SVM) classifier with input features derived from neuroscience-related publications from the primary literature. Using our approach, we were able to increase the size of the Neuron Registry, a knowledge base of neuron-related information, by 90% in 3 months. Using standard biocuration methods, it would have taken between 1 and 2 years to make the same number of contributions to the Neuron Registry. Here, we describe the system pipeline in detail and evaluate its performance against other approaches to sampling in AL.
KW - Active learning
KW - Biocuration
KW - Community-curated database
KW - Machine learning
KW - Neuroinformatics
KW - Text-mining
UR - http://www.scopus.com/inward/record.url?scp=84891541647&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84891541647&partnerID=8YFLogxK
U2 - 10.3389/fninf.2013.00038
DO - 10.3389/fninf.2013.00038
M3 - Article
AN - SCOPUS:84891541647
SN - 1662-5196
VL - 7
JO - Frontiers in Neuroinformatics
JF - Frontiers in Neuroinformatics
IS - DEC
M1 - 38
ER -
