An application of Bayesian network for predicting object-oriented software maintainability

doi:10.1016/j.infsof.2005.03.002

Information and Software Technology

Volume 48, Issue 1, January 2006, Pages 59-67

https://doi.org/10.1016/j.infsof.2005.03.002

Abstract

As the number of object-oriented software systems increases, it becomes more important for organizations to maintain those systems effectively. However, currently only a small number of maintainability prediction models are available for object-oriented systems. This paper presents a Bayesian network maintainability prediction model for an object-oriented software system. The model is constructed using object-oriented metric data in Li and Henry's datasets, which were collected from two different object-oriented systems. Prediction accuracy of the model is evaluated and compared with commonly used regression-based models. The results suggest that the Bayesian network model can predict maintainability more accurately than the regression-based models for one system, and almost as accurately as the best regression-based model for the other system.

Introduction

Many object-oriented (OO) software systems are arguably in use today. The growing popularity of OO programming languages such as Java, together with the increasing number of software development tools supporting the Unified Modelling Language (UML), arguably encourages the development of still more OO systems, now and in the future. Hence it is important that those systems are maintained effectively and efficiently. A software maintainability prediction model enables organizations to predict the maintainability of a software system and assists them in managing maintenance resources. In addition, if an accurate maintainability prediction model is available for a software system, a defensive design can be adopted, which would minimize, or at least reduce, the future maintenance effort for the system. Maintainability of a software system can be measured in different ways. In this paper, maintainability is measured as the number of changes made to the code during a maintenance period. Alternatively, maintainability may be measured as the effort required to make those changes. When maintainability is measured as effort, the predictive model is called a maintenance effort prediction model. Unfortunately, the number of software maintainability prediction models in the literature, including maintenance effort prediction models, is currently very small.

Programming an OO software system is different from programming a non-OO system due to the concepts that are specific to the OO paradigm, for example, objects, inheritance and encapsulation. This difference limits the applicability of well-known non-OO software effort prediction models, such as COCOMO [3], to OO software effort prediction, as well as non-OO software metrics, such as Function Points [1], to measuring the characteristics of OO software systems [23]. Hence a number of new software metrics were proposed specifically for OO systems. Some of those OO metrics were used to predict maintainability of OO systems. Examples of the OO metrics are Chidamber and Kemerer (C&K) metrics and Li and Henry (L&H) metrics [10], [25]. It was shown that the L&H metrics had a correlation with the number of changes made to the code of the OO software system [25]. It was also shown that multiple linear regression models consisting of the C&K, L&H and other OO metrics were able to predict software maintenance effort for some OO systems [17].

This paper constructs an OO software maintainability prediction model using a technique known as a Bayesian network [14], [20], [22]. This technique allows a user to construct a predictive model based on Bayesian probability theory [12]. Applications of Bayesian networks to software engineering are currently limited to a small number of studies of development effort prediction [2], [11], [31], [34] and defect prediction [16], [28]. However, a Bayesian network can also be a promising new technique for OO software maintainability prediction, because it can represent uncertainty explicitly using probabilities, combine existing human expert knowledge with empirical data, and be updated when new information becomes available. Hence this paper investigates what prediction accuracy a Bayesian network OO software maintainability prediction model can achieve. The term prediction accuracy in this paper means how well a predictive model constructed using known data can predict the outcomes of unknown data. The Bayesian network model's prediction accuracy is evaluated using accuracy measures commonly found in the software effort prediction literature [15], [24]: absolute residuals, the magnitude of relative error (MRE) and pred measures. The Bayesian network model's prediction accuracy is then compared with that of regression-based models, namely a regression tree [4] model and two different types of multiple linear regression models.

The remainder of this paper is structured as follows. Section 2 describes the OO software datasets and the sampling method used. Section 3 describes the Bayesian network OO software maintainability prediction model. This is followed by Section 4, which describes the regression tree model and the multiple linear regression models. Section 5 describes the prediction accuracy measures used. Section 6 evaluates the Bayesian network model's prediction accuracy using those accuracy measures and compares it with that of the regression tree model and the multiple linear regression models. Finally, Section 7 presents conclusions and discusses directions for future studies.

Section snippets

Characteristics of datasets

This paper uses OO software datasets published by Li and Henry [25]. The datasets consist of five C&K metrics: DIT, NOC, RFC, LCOM and WMC, and four L&H metrics: MPC, DAC, NOM and SIZE2, as well as SIZE1, which is a traditional lines of code size metric. Those metric data were collected from a total of 110 classes in two OO software systems: User Interface Management System (UIMS) and Quality Evaluation System (QUES). The code was written in Classical-Ada™. The UIMS and QUES datasets contain 39

Bayesian network

A Bayesian network (also known as Bayes net, causal probabilistic network, Bayesian belief network, or simply belief network) is a directed acyclic graph (DAG) whose nodes represent events in a domain [22]. These events are connected with directed links, which represent an association or a causal relationship between them. When a link represents an association, the direction is defined according to the order of time in which the events happen, that is, the link starts from the preceding event.
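The definition above can be illustrated with a minimal Python sketch. It is not the model built in the paper: the three event names, the link structure and every probability below are invented for illustration. The sketch stores one conditional probability table per node of a small DAG and answers queries by summing the joint distribution over all assignments that agree with the evidence.

```python
from itertools import product

# Hypothetical DAG (not from the paper): LargeClass -> ManyChanges <- HighCoupling
parents = {
    "LargeClass": [],
    "HighCoupling": [],
    "ManyChanges": ["LargeClass", "HighCoupling"],
}

# P(node = True | parent values). All numbers are invented for illustration.
cpt = {
    "LargeClass": {(): 0.3},
    "HighCoupling": {(): 0.4},
    "ManyChanges": {
        (True, True): 0.9,
        (True, False): 0.6,
        (False, True): 0.5,
        (False, False): 0.1,
    },
}


def joint(assignment):
    """Probability of one full assignment: the product of each node's table entry."""
    p = 1.0
    for node, pars in parents.items():
        p_true = cpt[node][tuple(assignment[q] for q in pars)]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p


def query(target, evidence):
    """P(target = True | evidence), by enumerating all consistent assignments."""
    nodes = list(parents)
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(nodes)):
        assignment = dict(zip(nodes, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        p = joint(assignment)
        denominator += p
        if assignment[target]:
            numerator += p
    return numerator / denominator


prior = query("ManyChanges", {})                          # no evidence observed
posterior = query("HighCoupling", {"ManyChanges": True})  # belief updated by evidence
print(round(prior, 3), round(posterior, 3))
```

Enumeration is used only because the network is tiny; it grows exponentially with the number of nodes, so practical tools rely on more efficient inference algorithms.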

Regression tree model

A regression tree is a tree-structured regression technique that recursively partitions the data space of a given dataset with a number of regression surfaces, on each of which a constant estimate of the response variable is given according to a chosen regression method [4]. Fig. 2 shows an example regression tree. In Fig. 2, four sequential binary splits partition all cases in the dataset into five terminal nodes T₁,…,T₅, which are shown as five squares. Each terminal node consists of only the
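The recursive partitioning described in this section can be sketched in a few lines of Python. This is a simplified illustration, not the procedure or tool used in the paper: it assumes least squares as the regression method (each terminal node predicts the mean response of its cases), the stopping rule `min_rows` is a hypothetical parameter, and the data are invented.

```python
def sse(ys):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(rows):
    """Return (score, feature index, threshold) of the binary split with lowest total SSE."""
    best = None
    for j in range(len(rows[0][0])):
        for threshold in sorted({x[j] for x, _ in rows}):
            left = [y for x, y in rows if x[j] <= threshold]
            right = [y for x, y in rows if x[j] > threshold]
            if not left or not right:
                continue  # a split must send cases to both sides
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return best


def grow(rows, min_rows=4):
    """Recursively partition the cases; a terminal node stores a constant estimate."""
    ys = [y for _, y in rows]
    split = best_split(rows) if len(rows) >= min_rows else None
    if split is None or split[0] >= sse(ys):
        return {"leaf": sum(ys) / len(ys)}
    _, j, t = split
    return {
        "feature": j,
        "threshold": t,
        "left": grow([r for r in rows if r[0][j] <= t], min_rows),
        "right": grow([r for r in rows if r[0][j] > t], min_rows),
    }


def predict(tree, x):
    """Follow the binary splits down to a terminal node and return its estimate."""
    while "leaf" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


# Invented cases: ((metric 1, metric 2), number of changes)
data = [((1, 10), 2), ((2, 12), 3), ((3, 40), 20),
        ((4, 45), 22), ((5, 50), 24), ((1, 11), 1)]
tree = grow(data)
print(predict(tree, (1, 10)), predict(tree, (4, 45)))
```

With these six cases a single split separates the low-change classes from the high-change ones, and each terminal node returns the mean of the cases it contains.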

Prediction accuracy measures

This paper evaluates and compares the OO software maintainability prediction models quantitatively, using the following prediction accuracy measures: absolute residual (Ab.Res.), the magnitude of relative error (MRE) and pred measures.

The Ab.Res. is the absolute value of residual given by: $\text{Ab.Res.} = |\,\text{actual value} - \text{predicted value}\,|$

In this paper, the sum of the absolute residuals (Sum Ab.Res.), the median of the absolute residuals (Med.Ab.Res.) and the standard deviation of the absolute residuals (SD
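The measures named in this section can be sketched in Python as follows. The snippet above breaks off before MRE and pred are defined, so the sketch assumes the definitions common in the software effort prediction literature: MRE is the absolute residual divided by the actual value, MMRE is the mean MRE, and pred(q) is the proportion of cases whose MRE is at most q. The sample standard deviation is also an assumption, and the data are invented.

```python
from statistics import mean, median, stdev


def abs_residuals(actual, predicted):
    """Ab.Res. per case: |actual value - predicted value|."""
    return [abs(a - p) for a, p in zip(actual, predicted)]


def mre(actual, predicted):
    """Magnitude of relative error per case; assumes every actual value is non-zero."""
    return [abs(a - p) / a for a, p in zip(actual, predicted)]


def pred(actual, predicted, q):
    """Proportion of cases whose MRE is at most q (assumed definition)."""
    errors = mre(actual, predicted)
    return sum(e <= q for e in errors) / len(errors)


# Invented example: actual and predicted numbers of changes for four classes.
actual = [10, 100, 40, 8]
predicted = [12, 72, 42, 8]

res = abs_residuals(actual, predicted)
summary = {
    "Sum Ab.Res.": sum(res),
    "Med.Ab.Res.": median(res),
    "SD Ab.Res.": stdev(res),  # sample standard deviation (an assumption)
    "MMRE": mean(mre(actual, predicted)),
    "pred(0.25)": pred(actual, predicted, 0.25),
    "pred(0.30)": pred(actual, predicted, 0.30),
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Lower residual and MMRE values indicate better accuracy, whereas higher pred values do, which is why the measures are usually reported together.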

Results from UIMS dataset

Table 4 shows the values of the prediction accuracy measures achieved by each of the maintainability prediction models for the UIMS dataset. The values in this table are the mean of the values obtained from the 10 different test subsets, which were created using the sampling method described in Section 2.

Table 4 shows that the Bayesian network model achieved an MMRE value of 0.972, a pred(0.25) value of 0.446 and a pred(0.30) value of 0.469. Although these values do not satisfy the

Conclusions

A Bayesian network OO software maintainability prediction model is constructed using the OO software metric data in Li and Henry's datasets. The prediction accuracy of the model is evaluated and compared with that of the regression tree model and the multiple linear regression models using the following prediction accuracy measures: the absolute residuals, MRE and pred measures. The results show that the Bayesian network model can predict maintainability of the OO software systems. For the UIMS dataset, the

Acknowledgements

The authors would like to acknowledge many valuable suggestions made by J. Harraway, Department of Mathematics and Statistics, University of Otago, New Zealand, with regard to the multiple linear regression models presented in this paper.

References (34)

J. Kaczmarek et al., Size and effort estimation for applications written in Java, Information and Software Technology (2004)

W. Li et al., Object-oriented metrics that predict maintainability, Journal of Systems and Software (1993)

A. De Lucia et al., Assessing effort estimation models for corrective maintenance through empirical studies, Information and Software Technology (2005)

S.G. MacDonell, Establishing relationships between specification size and software process effort in CASE environment, Information and Software Technology (1997)

I. Stamelos et al., On the use of Bayesian belief networks for the prediction of software productivity, Information and Software Technology (2003)

E. Stensrud, Alternative approaches to effort prediction of ERP projects, Information and Software Technology (2001)

A.J. Albrecht et al., Software function, source lines of code, and development effort prediction: a software science validation, IEEE Transactions on Software Engineering (1983)

J. Baik et al., Disaggregating and calibrating the CASE tool variable in COCOMO II, IEEE Transactions on Software Engineering (2002)

B.W. Boehm, Software Engineering Economics (1981)

L. Breiman et al., Classification and Regression Trees (1993)

L.C. Briand, K.E. Emam, D. Surmann, I. Wieczorek, K.D. Maxwell, An assessment and comparison of common software cost...

L.C. Briand, T. Langley, I. Wieczorek, A replicated assessment and comparison of common software cost estimation...

L.C. Briand, J. Wüst, The impact of design properties on development cost in object-oriented systems, in: Proceedings...

L.C. Briand et al., Modeling development effort in object-oriented systems using design properties, IEEE Transactions on Software Engineering (2001)

W. Buntine, A guide to the literature on learning probabilistic networks from data, IEEE Transactions on Knowledge and Data Engineering (1996)

S.R. Chidamber et al., A metrics suite for object-oriented design, IEEE Transactions on Software Engineering (1994)

S. Chulani et al., Bayesian analysis of empirical software engineering cost models, IEEE Transactions on Software Engineering (1999)
