Rule-based deduplication of article records from bibliographic databases

Yu Jiang, Can Lin, Weiyi Meng, Clement Yu, Aaron Cohen, Neil R. Smalheiser

Research output: Contribution to journal (Article)

12 Citations (Scopus)

Abstract

We recently designed and deployed a metasearch engine, Metta, that sends queries and retrieves search results from five leading biomedical databases: PubMed, EMBASE, CINAHL, PsycINFO and the Cochrane Central Register of Controlled Trials. Because many articles are indexed in more than one of these databases, it is desirable to deduplicate the retrieved article records. This is not a trivial problem because data fields contain a lot of missing and erroneous entries, and because certain types of information are recorded differently (and inconsistently) in the different databases. The present report describes our rule-based method for deduplicating article records across databases and includes an open-source script module that can be deployed freely. Metta was designed to satisfy the particular needs of people who are writing systematic reviews in evidence-based medicine. These users want the highest possible recall in retrieval, so it is important to err on the side of not deduplicating any records that refer to distinct articles, and it is important to perform deduplication online in real time. Our deduplication module is designed with these constraints in mind. Articles that share the same publication year are compared sequentially on parameters including PubMed ID number, digital object identifier, journal name, article title and author list, using text approximation techniques. In a review of Metta searches carried out by public users, we found that the deduplication module was more effective at identifying duplicates than EndNote without making any erroneous assignments.
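The sequential comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' published module. The abstract does not give field names, thresholds, or the rule that differing identifiers mark two records as distinct, so those are all assumptions made here for illustration. `difflib.SequenceMatcher` stands in for the unspecified text approximation techniques, and the sample records are invented.

```python
import re
from difflib import SequenceMatcher

# Hypothetical similarity cut-offs; the abstract does not state the real ones.
TITLE_THRESHOLD = 0.95
JOURNAL_THRESHOLD = 0.90


def normalize(text):
    """Lowercase and reduce punctuation/whitespace runs to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", (text or "").lower()).strip()


def similar(a, b, threshold):
    """Approximate string match; a missing value never counts as a match."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold


def first_author_key(record):
    """First token of the first author, assuming 'Surname, Given' ordering."""
    authors = record.get("authors") or []
    tokens = normalize(authors[0]).split() if authors else []
    return tokens[0] if tokens else ""


def is_duplicate(r1, r2):
    """Apply the rules in sequence; stop at the first rule that can decide."""
    if r1.get("year") != r2.get("year"):
        return False
    # Identifier rules (assumed): if both records carry the same kind of ID,
    # that ID alone decides the outcome.
    for key in ("pmid", "doi"):
        v1, v2 = r1.get(key), r2.get(key)
        if v1 and v2:
            return str(v1).strip().lower() == str(v2).strip().lower()
    # Text rules: every field must agree, so doubtful pairs are kept apart.
    a1, a2 = first_author_key(r1), first_author_key(r2)
    return (
        similar(r1.get("journal"), r2.get("journal"), JOURNAL_THRESHOLD)
        and similar(r1.get("title"), r2.get("title"), TITLE_THRESHOLD)
        and a1 != ""
        and a1 == a2
    )


def deduplicate(records):
    """Keep the first record of each duplicate group, comparing within a year."""
    kept_by_year = {}
    unique = []
    for record in records:
        bucket = kept_by_year.setdefault(record.get("year"), [])
        if not any(is_duplicate(record, kept) for kept in bucket):
            bucket.append(record)
            unique.append(record)
    return unique


# Invented sample records, one per source database.
records = [
    {"source": "PubMed", "pmid": "100", "doi": "10.1000/demo.1", "year": 2014,
     "journal": "Database", "title": "Example trial of drug A in adults",
     "authors": ["Doe, Jane"]},
    {"source": "EMBASE", "doi": "10.1000/DEMO.1", "year": 2014,
     "journal": "Database", "title": "Example trial of drug A in adults.",
     "authors": ["Doe J."]},
    {"source": "CINAHL", "year": 2014, "journal": "Database",
     "title": "Example trial of drug-A in adults", "authors": ["Doe, J"]},
    {"source": "PsycINFO", "year": 2014, "journal": "Database",
     "title": "A different article entirely", "authors": ["Doe, Jane"]},
]

unique = deduplicate(records)
print([r["source"] for r in unique])
```

Requiring every text field to agree before merging reflects the recall-first constraint in the abstract: when the evidence is incomplete, the sketch leaves both records in the result rather than risk merging distinct articles.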

Original language: English (US)
Article number: bat086
Journal: Database
Volume: 2014
DOI: 10.1093/database/bat086
State: Published - 2014

Fingerprint

Bibliographic Databases
Databases
PubMed
Evidence-Based Medicine
systematic review
engines
Medicine
Names
Publications
methodology

ASJC Scopus subject areas

  • Agricultural and Biological Sciences (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Information Systems
  • Medicine (all)

Cite this

Rule-based deduplication of article records from bibliographic databases. / Jiang, Yu; Lin, Can; Meng, Weiyi; Yu, Clement; Cohen, Aaron; Smalheiser, Neil R.

In: Database, Vol. 2014, bat086, 2014.

@article{6b32419dd22448c5894b9ab6ac365940,
title = "Rule-based deduplication of article records from bibliographic databases",
author = "Jiang, Yu and Lin, Can and Meng, Weiyi and Yu, Clement and Cohen, Aaron and Smalheiser, {Neil R.}",
year = "2014",
doi = "10.1093/database/bat086",
language = "English (US)",
volume = "2014",
journal = "Database : the journal of biological databases and curation",
issn = "1758-0463",
publisher = "Oxford University Press",

}

TY  - JOUR
T1  - Rule-based deduplication of article records from bibliographic databases
AU  - Jiang, Yu
AU  - Lin, Can
AU  - Meng, Weiyi
AU  - Yu, Clement
AU  - Cohen, Aaron
AU  - Smalheiser, Neil R.
PY  - 2014
Y1  - 2014
N2  - We recently designed and deployed a metasearch engine, Metta, that sends queries and retrieves search results from five leading biomedical databases: PubMed, EMBASE, CINAHL, PsycINFO and the Cochrane Central Register of Controlled Trials. Because many articles are indexed in more than one of these databases, it is desirable to deduplicate the retrieved article records. This is not a trivial problem because data fields contain a lot of missing and erroneous entries, and because certain types of information are recorded differently (and inconsistently) in the different databases. The present report describes our rule-based method for deduplicating article records across databases and includes an open-source script module that can be deployed freely. Metta was designed to satisfy the particular needs of people who are writing systematic reviews in evidence-based medicine. These users want the highest possible recall in retrieval, so it is important to err on the side of not deduplicating any records that refer to distinct articles, and it is important to perform deduplication online in real time. Our deduplication module is designed with these constraints in mind. Articles that share the same publication year are compared sequentially on parameters including PubMed ID number, digital object identifier, journal name, article title and author list, using text approximation techniques. In a review of Metta searches carried out by public users, we found that the deduplication module was more effective at identifying duplicates than EndNote without making any erroneous assignments.
UR  - http://www.scopus.com/inward/record.url?scp=84892687588&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84892687588&partnerID=8YFLogxK
U2  - 10.1093/database/bat086
DO  - 10.1093/database/bat086
M3  - Article
VL  - 2014
JO  - Database : the journal of biological databases and curation
JF  - Database : the journal of biological databases and curation
SN  - 1758-0463
M1  - bat086
ER  -