DOI: 10.1145/3570991.3571009
short-paper

Explainable Data Imputation using Constraints

Published: 04 January 2023

ABSTRACT

Values in a dataset can be missing or anomalous because of mishandling or human error. Analysing data with missing values can introduce bias and distort inferences. Several analysis methods, such as principal component analysis and singular value decomposition, require complete data. Many approaches impute numeric data, and some do not consider the dependency of attributes on other attributes, while others require human intervention and domain knowledge. We present a new algorithm for data imputation based on values of different data types and the association constraints among them, which no current system handles. We report experimental results that compare our algorithm with state-of-the-art imputation techniques using several metrics. Our algorithm not only imputes the missing values but also generates explanations describing the significance of the attributes used for each imputation.
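The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of association-based imputation with explanations, not the authors' method. It assumes categorical attributes, uses a simple "most frequent co-occurring value" rule as a stand-in for an association constraint, and the function name `impute_with_explanation` and the record fields are hypothetical.

```python
from collections import Counter, defaultdict


def impute_with_explanation(rows, target):
    """Fill missing `target` values (None) from the single other attribute
    that best predicts `target` on the complete rows.

    Returns (filled_rows, explanations). This is an illustrative stand-in,
    not the algorithm from the paper.
    """
    complete = [r for r in rows if r[target] is not None]
    candidates = [k for k in rows[0] if k != target]

    best_attr, best_score, best_map = None, -1.0, {}
    for attr in candidates:
        # Count which target values co-occur with each value of `attr`.
        groups = defaultdict(Counter)
        for r in complete:
            if r[attr] is not None:
                groups[r[attr]][r[target]] += 1
        if not groups:
            continue
        # Score: fraction of complete rows matched by the rule
        # "attr value -> most frequent target value".
        # Caveat: an ID-like attribute with unique values scores 1.0 trivially.
        hits = sum(c.most_common(1)[0][1] for c in groups.values())
        total = sum(sum(c.values()) for c in groups.values())
        score = hits / total
        if score > best_score:
            best_attr, best_score = attr, score
            best_map = {v: c.most_common(1)[0][0] for v, c in groups.items()}

    filled, explanations = [], []
    for i, r in enumerate(rows):
        r = dict(r)  # copy so the input rows are left unchanged
        if r[target] is None and best_attr is not None and r[best_attr] in best_map:
            r[target] = best_map[r[best_attr]]
            explanations.append({
                "row": i,
                "attribute": target,
                "value": r[target],
                "based_on": best_attr,
                "support": best_score,
            })
        filled.append(r)
    return filled, explanations


# Hypothetical toy data: `city` is missing in the last row.
demo = [
    {"zip": "10001", "city": "New York", "plan": "gold"},
    {"zip": "10001", "city": "New York", "plan": "basic"},
    {"zip": "94105", "city": "San Francisco", "plan": "gold"},
    {"zip": "94105", "city": "San Francisco", "plan": "basic"},
    {"zip": "94105", "city": None, "plan": "gold"},
]
demo_filled, demo_explanations = impute_with_explanation(demo, "city")
for e in demo_explanations:
    print(e)
```

In this toy data `zip` determines `city` on every complete row while `plan` does not, so the sketch imputes from `zip` and records that choice, together with its support, as the explanation.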

  • Published in

    ACM Other conferences
    CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
    January 2023
    357 pages
    ISBN:9781450397971
    DOI:10.1145/3570991

    Copyright © 2023 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • short-paper
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 197 of 680 submissions, 29%