DOI: 10.1145/1390817.1390822
Research article

Can data transformation help in the detection of fault-prone modules?

Published: 20 July 2008

ABSTRACT

Data preprocessing (transformation) plays an important role in data mining and machine learning. In this study, we investigate the effect of four different preprocessing methods on fault-proneness prediction, using nine datasets from the NASA Metrics Data Program (MDP) and ten classification algorithms. Our experiments indicate that log transformation rarely improves classification performance, whereas discretization affects the performance of many algorithms. The impact of a given transformation also differs across algorithms: the random forest algorithm, for example, performs better with the original and log-transformed datasets, while boosting and Naive Bayes perform significantly better with discretized data. We conclude that no general benefit can be expected from data transformations. Instead, selected transformation techniques are recommended to boost the performance of specific classification algorithms.
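To make the compared setups concrete, here is a minimal sketch, assuming a Python/scikit-learn environment rather than the study's actual toolchain, that pairs three preprocessing options (identity, a log transform, and unsupervised equal-frequency binning as a stand-in for the study's discretization method) with two of the classifiers named above. The synthetic data, binning strategy, and parameters are illustrative assumptions, not the paper's settings.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer

# Hypothetical stand-in for an MDP dataset: non-negative "static code metric"
# features and a binary fault-prone label (synthetic, for illustration only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = np.abs(X)  # code metrics such as LOC or complexity are non-negative

preprocessors = {
    "original":    FunctionTransformer(),                  # identity (no transform)
    "log":         FunctionTransformer(np.log1p),          # log(x + 1) transform
    "discretized": KBinsDiscretizer(n_bins=10, encode="ordinal",
                                    strategy="quantile"),  # equal-frequency bins
}
classifiers = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "NaiveBayes":   GaussianNB(),
}

# Cross-validated AUC for every (preprocessing, classifier) pair.
for p_name, prep in preprocessors.items():
    for c_name, clf in classifiers.items():
        pipe = make_pipeline(prep, clf)
        auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{c_name:12s} + {p_name:11s}: mean AUC = {auc:.3f}")

On the paper's datasets one would expect, per the abstract, discretization to matter more for Naive Bayes and boosting than for random forest; results on synthetic data like the above carry no such weight and only demonstrate the pipeline structure.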


Published in:
DEFECTS '08: Proceedings of the 2008 workshop on Defects in large software systems
July 2008, 48 pages
ISBN: 9781605580517
DOI: 10.1145/1390817
Copyright © 2008 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States
