ABSTRACT
Data preprocessing (transformation) plays an important role in data mining and machine learning. In this study, we investigate the effect of four different preprocessing methods on fault-proneness prediction, using nine datasets from the NASA Metrics Data Program (MDP) and ten classification algorithms. Our experiments indicate that log transformation rarely improves classification performance, whereas discretization affects the performance of many algorithms. Different transformations benefit different algorithms: the random forest algorithm, for example, performs better with the original and log-transformed datasets, while boosting and Naive Bayes perform significantly better with discretized data. We conclude that no general benefit can be expected from data transformations. Instead, selected transformation techniques are recommended to boost the performance of specific classification algorithms.
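The two transformations contrasted above can be illustrated with a minimal sketch. The snippet below is not from the paper: the metric values and the equal-width binning are illustrative stand-ins (the study's discretization is supervised and entropy-based, in the style of Fayyad and Irani).

```python
import math

def log_transform(values):
    """Apply ln(x + 1) so that zero-valued metrics remain defined."""
    return [math.log(v + 1.0) for v in values]

def discretize(values, n_bins=3):
    """Equal-width discretization into integer bin labels 0..n_bins-1.
    A simple unsupervised stand-in for the supervised, entropy-based
    discretization used in the paper's experiments."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # guard against all-equal values
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Toy static-code metric, e.g. lines of code per module (hypothetical data)
loc = [10, 55, 120, 7, 300]
print(log_transform(loc))   # continuous, compressed scale
print(discretize(loc))      # coarse ordinal bins
```

A classifier such as random forest would be trained on the raw or log-transformed columns, while boosting or Naive Bayes would receive the discretized columns.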