TY - JOUR
T1 - Design of a generic, open platform for machine learning-assisted indexing and clustering of articles in PubMed, a biomedical bibliographic database
AU - Smalheiser, Neil R.
AU - Cohen, Aaron M.
N1 - Funding Information:
Our studies are supported by NIH grants R01LM10817 and P01AG03934. We thank Sophia Ananiadou for discussions about ways to share NLP tools and their products with end-users.
Publisher Copyright:
© 2018 Neil R. Smalheiser, Aaron M. Cohen, published by Sciendo
PY - 2018/6/1
Y1 - 2018/6/1
N2 - Many investigators have carried out text mining of the biomedical literature for a variety of purposes, ranging from the assignment of indexing terms to the disambiguation of author names. A common approach is to define positive and negative training examples, extract features from article metadata, and use machine learning algorithms. At present, each research group tackles each problem from scratch, in isolation of other projects, which causes redundancy and a great waste of effort. Here, we propose and describe the design of a generic platform for biomedical text mining, which can serve as a shared resource for machine learning projects and as a public repository for their outputs. We initially focus on a specific goal, namely, classifying articles according to publication type and emphasize how feature sets can be made more powerful and robust through the use of multiple, heterogeneous similarity measures as input to machine learning models. We then discuss how the generic platform can be extended to include a wide variety of other machine learning-based goals and projects and can be used as a public platform for disseminating the results of natural language processing (NLP) tools to end-users as well.
KW - Text mining
KW - community platforms
KW - data sharing
KW - machine learning
KW - open science
KW - semantic similarity
KW - vector representation
UR - http://www.scopus.com/inward/record.url?scp=85060540654&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85060540654&partnerID=8YFLogxK
U2 - 10.2478/dim-2018-0004
DO - 10.2478/dim-2018-0004
M3 - Article
AN - SCOPUS:85060540654
SN - 2543-9251
VL - 2
SP - 27
EP - 36
JO - Data and Information Management
JF - Data and Information Management
IS - 1
ER -