ABSTRACT
Data preprocessing (transformation) plays an important role in data mining and machine learning. In this study, we investigate the effect of four different preprocessing methods on fault-proneness prediction, using nine datasets from the NASA Metrics Data Program (MDP) and ten classification algorithms. Our experiments indicate that log transformation rarely improves classification performance, whereas discretization affects the performance of many algorithms. Different transformations benefit different algorithms: the random forest algorithm, for example, performs better with the original and log-transformed datasets, while boosting and Naive Bayes perform significantly better with discretized data. We conclude that no general benefit can be expected from data transformations. Instead, selected transformation techniques are recommended to boost the performance of specific classification algorithms.
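The two transformations contrasted above can be illustrated with a minimal sketch. The snippet below is not from the paper: the metric values and the equal-width binning are illustrative stand-ins (the study's discretization is supervised and entropy-based, in the style of Fayyad and Irani).

```python
import math

def log_transform(values):
    """Apply ln(x + 1) so that zero-valued metrics remain defined."""
    return [math.log(v + 1.0) for v in values]

def discretize(values, n_bins=3):
    """Equal-width discretization into integer bin labels 0..n_bins-1.
    A simple unsupervised stand-in for the supervised, entropy-based
    discretization used in the paper's experiments."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # guard against all-equal values
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Toy static-code metric, e.g. lines of code per module (hypothetical data)
loc = [10, 55, 120, 7, 300]
print(log_transform(loc))   # continuous, compressed scale
print(discretize(loc))      # coarse ordinal bins
```

A classifier such as random forest would be trained on the raw or log-transformed columns, while boosting or Naive Bayes would receive the discretized columns.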