ABSTRACT
The WHO Collaborating Centre for International Drug Monitoring in Uppsala, Sweden, maintains and analyses the world's largest database of reports on suspected adverse drug reaction incidents that occur after drugs are introduced on the market. As in other post-marketing drug safety data sets, duplicate records are an important data quality problem, and detecting duplicates in the WHO drug safety database remains a formidable challenge, especially since the reports are anonymised before they are submitted to the database. However, to our knowledge, no work has been published on methods for duplicate detection in post-marketing drug safety data. In this paper, we propose a method for probabilistic duplicate detection based on the hit-miss model for statistical record linkage described by Copas & Hilton. We present two new generalisations of the standard hit-miss model: a hit-miss mixture model for errors in numerical record fields and a new method to handle correlated record fields. We demonstrate the effectiveness of the hit-miss model for duplicate detection in the WHO drug safety database, both at identifying the most likely duplicate for a given record (94.7% accuracy) and at discriminating duplicates from random matches (63% recall with 71% precision). The proposed method allows for more efficient data cleaning in post-marketing drug safety data sets, and perhaps in other applications throughout the KDD community.
- A. Bate, M. Lindquist, I. R. Edwards, S. Olsson, R. Orre, A. Lansner, and R. M. De Freitas. A Bayesian neural network method for adverse drug reaction signal generation. European Journal of Clinical Pharmacology, 54:315--321, 1998.
- T. Belin and D. Rubin. A method for calibrating false-match rates in record linkage. Journal of the American Statistical Association, 90:694--707, 1995.
- M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 39--48. ACM Press, 2003.
- M. Bilenko and R. J. Mooney. On evaluation and training-set construction for duplicate detection. In Proceedings of the KDD-2003 workshop on data cleaning, record linkage and object consolidation, pages 7--12, 2003.
- E. A. Bortnichak, R. P. Wise, M. E. Salive, and H. H. Tilson. Proactive safety surveillance. Pharmacoepidemiology and Drug Safety, 10:191--196, 2001.
- A. D. Brinker and J. Beitz. Spontaneous reports of thrombocytopenia in association with quinine: clinical attributes and timing related to regulatory action. American Journal of Hematology, 70:313--317, 2002.
- J. Copas and F. Hilton. Record linkage: statistical models for matching computer records. Journal of the Royal Statistical Society: Series A, 153(3):287--320, 1990.
- I. R. Edwards. Adverse drug reactions: finding the needle in the haystack. British Medical Journal, 315(7107):500, 1997.
- I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183--1210, 1969.
- M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large databases. In SIGMOD '95: Proceedings of the 1995 ACM SIGMOD international conference on Management of data, pages 127--138. ACM Press, 1995.
- M. Lindquist. Data quality management in pharmacovigilance. Drug Safety, 27(12):857--870, 2004.
- A. E. Monge and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Research Issues on Data Mining and Knowledge Discovery, 1997.
- H. B. Newcombe. Record linkage: the design of efficient systems for linking records into individual family histories. American Journal of Human Genetics, 19:335--359, 1967.
- J. N. Nkanza and W. Walop. Vaccine associated adverse event surveillance (VAEES) and quality assurance. Drug Safety, 27:951--952, 2004.
- R. Orre, A. Lansner, A. Bate, and M. Lindquist. Bayesian neural networks with confidence estimations applied to data mining. Computational Statistics & Data Analysis, 34:473--493, 2000.
- M. D. Rawlins. Spontaneous reporting of adverse drug reactions. II: Uses. British Journal of Clinical Pharmacology, 26(1):7--11, 1988.
- S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 269--278. ACM Press, 2002.
Index Terms
- A hit-miss model for duplicate detection in the WHO drug safety database