short-paper

Explainable Data Imputation using Constraints

Authors:
Sandeep Hans, IBM Research, India (ORCID: 0000-0003-4986-0688)
Diptikalyan Saha, IBM Research, India (ORCID: 0000-0002-1583-5479)
Aniya Aggarwal, IBM Research, India (ORCID: 0000-0001-8883-0030)

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD), January 2023, Pages 128–132
https://doi.org/10.1145/3570991.3571009

Published: 04 January 2023

ABSTRACT

Data values in a dataset can be missing or anomalous due to mishandling or human error. Analysing data with missing values can introduce bias and distort inferences. Several analysis methods, such as principal component analysis and singular value decomposition, require complete data. Many existing approaches impute only numeric data; some ignore the dependency of attributes on other attributes, while others require human intervention and domain knowledge. We present a new algorithm for data imputation that handles values of different data types together with their association constraints in the data, a combination that no existing system currently handles. We report experimental results, using several metrics, that compare our algorithm with state-of-the-art imputation techniques. Our algorithm not only imputes the missing values but also generates explanations describing the significance of the attributes used for every imputation.
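
To make the idea of imputing from an associated attribute and explaining the result concrete, the following is a minimal illustrative sketch in Python. It is not the algorithm from the paper, whose details are not given in the abstract. It assumes a single hypothetical association between two categorical attributes (`city` and `country` are invented for the example), fills a missing value with the most frequent co-occurring value, and records which attribute was used and how strongly the data supports the choice. The function name `impute_with_explanations` is likewise an assumption made for illustration.

```python
from collections import Counter, defaultdict


def impute_with_explanations(rows, target, by):
    """Fill missing `target` values (None) with the most frequent value seen
    among complete rows that share the same `by` value.

    Hypothetical helper for illustration only; not the paper's algorithm.
    Returns (imputed_rows, explanations) and leaves `rows` unmodified.
    """
    # Count the target values observed for each value of the associated attribute.
    counts = defaultdict(Counter)
    for row in rows:
        if row[target] is not None and row[by] is not None:
            counts[row[by]][row[target]] += 1

    imputed, explanations = [], []
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy so the input stays untouched
        observed = counts.get(row[by])
        if row[target] is None and observed:
            value, support = observed.most_common(1)[0]
            total = sum(observed.values())
            new_row[target] = value
            # The explanation names the attribute used and its support in the data.
            explanations.append(
                f"row {i}: {target}={value!r} because {by}={row[by]!r} "
                f"co-occurs with it in {support}/{total} complete rows"
            )
        imputed.append(new_row)
    return imputed, explanations


# Invented toy data: the last row is missing its country.
rows = [
    {"city": "Delhi", "country": "India"},
    {"city": "Delhi", "country": "India"},
    {"city": "Paris", "country": "France"},
    {"city": "Delhi", "country": None},
]
imputed, explanations = impute_with_explanations(rows, target="country", by="city")
```

In this sketch the explanation is just a support ratio for one attribute; a system like the one the abstract describes would presumably weigh several attributes and data types, which this example does not attempt.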

Published in

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
January 2023
357 pages
ISBN: 9781450397971
DOI: 10.1145/3570991

Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- short-paper
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 197 of 680 submissions, 29%