ABSTRACT
Data values in a dataset can be missing or anomalous due to mishandling or human error. Analysing data with missing values can introduce bias and distort inferences. Several analysis methods, such as principal component analysis or singular value decomposition, require complete data. Many approaches impute numeric data; some do not consider the dependency of attributes on other attributes, while others require human intervention and domain knowledge. We present a new algorithm for data imputation based on values of different data types and their association constraints in the data, which no current system handles. We report experimental results comparing our algorithm with state-of-the-art imputation techniques across different metrics. Our algorithm not only imputes the missing values but also generates explanations describing the significance of the attributes used for each imputation.
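To make the idea of imputing from attribute associations concrete, here is a minimal sketch. It is not the algorithm presented in this work; it assumes a single hypothetical association in which one attribute (`determinant`) tends to determine another (`target`), fills each missing target with the value most often seen alongside the same determinant, and records a plain-text explanation. The function name `impute_by_association` and the toy records are illustrative assumptions.

```python
from collections import Counter, defaultdict


def impute_by_association(rows, target, determinant):
    """Fill missing `target` values from the value that most often
    co-occurs with the same `determinant` value in complete rows.

    Hypothetical helper for illustration only; returns new rows plus
    one explanation string per imputed cell.
    """
    # Count target values observed for each determinant value.
    counts = defaultdict(Counter)
    for row in rows:
        if row[target] is not None and row[determinant] is not None:
            counts[row[determinant]][row[target]] += 1

    imputed, explanations = [], []
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy so the input rows stay unchanged
        seen = counts.get(row[determinant])
        if row[target] is None and seen:
            value, support = seen.most_common(1)[0]
            total = sum(seen.values())
            new_row[target] = value
            explanations.append(
                f"row {i}: {target}={value!r} because "
                f"{determinant}={row[determinant]!r} co-occurs with it "
                f"in {support}/{total} complete rows"
            )
        imputed.append(new_row)
    return imputed, explanations


# Toy data (made up): the last record is missing its country.
rows = [
    {"city": "Torino", "country": "Italy"},
    {"city": "Torino", "country": "Italy"},
    {"city": "Melbourne", "country": "Australia"},
    {"city": "Torino", "country": None},
]
imputed, explanations = impute_by_association(rows, "country", "city")
for line in explanations:
    print(line)
```

The support ratio in each explanation is a crude stand-in for the attribute significance mentioned above; a real system would need to discover which associations hold rather than be told them.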