Applications of clustering techniques to software partitioning, recovery and restructuring

doi:10.1016/S0164-1212(03)00234-6

Journal of Systems and Software

Volume 73, Issue 2, October 2004, Pages 227-244

Abstract

The artifacts constituting a software system are sometimes unnecessarily coupled with one another or may drift over time. As a result, support for software partitioning, recovery, and restructuring is often necessary. This paper presents studies on applying the numerical taxonomy clustering technique to software applications. The objective is to facilitate these activities and to improve design, evaluation and evolution. Numerical taxonomy is mathematically simple, yet it is a useful mechanism for component clustering and software partitioning. The technique can be applied at various levels of abstraction or to different software life-cycle phases. We have applied the technique to: (1) software partitioning at the software architecture design phase; (2) grouping of components based on the source code to recover the software architecture in the reverse engineering process; (3) restructuring of software to support evolution in the maintenance stage; and (4) improving cohesion and reducing coupling for source code. In this paper, we provide an introduction to numerical taxonomy, discuss our experiences in applying the approach to various areas, and relate the technique to the context of similar work.

Introduction

Non-functional quality attributes such as maintainability and reliability are essential factors in controlling software life-cycle costs. It has been widely acknowledged that maintaining existing software accounts for as much as 60–80% of a software system’s total cost. Cohesion and coupling are two properties that have great impact on some critical software quality attributes, including maintainability. Therefore, management of cohesion and coupling is of critical importance for system design and cost reduction.

Cohesion refers to a component’s internal strength, that is, the strength that holds the internal elements of a component together to perform a certain function. The term component is used generically in this paper: it could be a high-level architecture component, a module consisting of procedures, a procedure, a class, or even a variable. While cohesion is an intra-component property, coupling measures the interdependence among components. A desirable system partitioning should achieve high cohesion and low coupling, so that all the elements in one component are closely related for the realization of a certain feature, and changes made to that component have as little impact as possible on other components. Alexander (1964) also postulated that the major design principle common to all engineering disciplines is the relative isolation of one component from other components.

Software engineering is a relatively new area compared to well-established disciplines such as mechanical engineering and manufacturing. Software partitioning is usually conducted in an ad hoc manner and is primarily based on the designer’s experience. However, software systems may be ill-designed, or may drift or erode over time due to changes in requirements and technology (Perry and Wolf, 1992). In other words, software is not static: it evolves as requirements change. The resulting system could be highly coupled, which in turn creates problems for downstream software phases or evolution. Thus, effective partitioning or re-partitioning is needed. Effective partitioning or clustering is also a paramount goal in other disciplines. Clustering techniques have been used successfully in many areas to assist grouping of similar components and support partitioning of a system. In this research, clustering and partitioning are viewed as two sides of the same coin. Partitioning is a top–down approach that decomposes a system into smaller subsystems. Clustering, on the other hand, is a bottom–up method: similar components are grouped together to form clusters or subsystems. Those clusters or subsystems are the partitions that constitute a system.

In fact, partitioning or clustering analysis has been of long-standing interest and is a fundamental method in science and engineering. The technique facilitates better understanding of observations and the subsequent construction of complex knowledge structures from features and component clusters. For instance, it has been used to classify botanical species and mechanical parts. The key concept of clustering is to group similar things together to form a set of clusters, such that intra-cluster similarity (cohesion) is high and inter-cluster similarity (coupling) is low. The objective, high cohesion and low coupling, is similar in software design.
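To make the clustering objective concrete, the following minimal Python sketch compares two candidate groupings of four components by their mean intra-cluster and mean inter-cluster similarity. The component names, the pairwise similarity values, and the helper names `mean_similarity` and `intra_inter` are illustrative assumptions, not data or notation from the paper.

```python
from itertools import combinations

# Hypothetical pairwise similarities between four components
# (values invented purely for illustration).
similarity = {
    frozenset({"A", "B"}): 0.9,
    frozenset({"A", "C"}): 0.2,
    frozenset({"A", "D"}): 0.1,
    frozenset({"B", "C"}): 0.3,
    frozenset({"B", "D"}): 0.2,
    frozenset({"C", "D"}): 0.8,
}


def mean_similarity(pairs):
    """Average similarity over a list of component pairs (0.0 if empty)."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(similarity[frozenset(p)] for p in pairs) / len(pairs)


def intra_inter(clusters):
    """Return (mean intra-cluster similarity, mean inter-cluster similarity)."""
    # Pairs whose members share a cluster.
    intra = [pair for c in clusters for pair in combinations(sorted(c), 2)]
    # Pairs whose members sit in different clusters.
    inter = [(x, y)
             for c1, c2 in combinations(clusters, 2)
             for x in c1 for y in c2]
    return mean_similarity(intra), mean_similarity(inter)


good = intra_inter([{"A", "B"}, {"C", "D"}])
poor = intra_inter([{"A", "C"}, {"B", "D"}])
print("good grouping (intra, inter):", good)
print("poor grouping (intra, inter):", poor)
```

Under these assumed values, the grouping {A, B}, {C, D} has high intra-cluster and low inter-cluster similarity, while the grouping {A, C}, {B, D} reverses the relation.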

Various clustering techniques have also been studied in software engineering. In this paper, we borrow some clustering ideas from established disciplines and tailor them to software partitioning, recovery, and restructuring. The clustering techniques adopted in this paper are based on numerical taxonomy, or agglomerative hierarchical approaches. Numerical taxonomy uses numerical methods to classify components. There are several reasons for adopting it. The first is its conceptual and mathematical simplicity, as will be demonstrated in Section 2. Although its concept is simple, no scientific study has shown that numerical taxonomy is inferior to other, more complex multivariate methods (Romesburg, 1990). Another reason is that existing clustering techniques used in software engineering are often limited to the reverse engineering process, based on source code. The approach presented in this paper can easily be applied at various levels of abstraction and be used in round-trip engineering (i.e., both forward and reverse engineering processes). Furthermore, the technique can add value by facilitating software (design or code) restructuring, rather than design recovery alone. Lastly, computation is fast, which is an important factor if the technique is applied interactively or incrementally.
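The agglomerative hierarchical idea can be sketched as follows: start with every component in its own cluster and repeatedly merge the two most similar clusters until one remains. This is a minimal illustration and not the procedure used in the paper; the module names, the similarity values, the function name `agglomerate`, and the choice of single linkage (cluster similarity taken as the maximum over member pairs) are assumptions made for brevity.

```python
def agglomerate(items, sim):
    """Bottom-up clustering; returns the merge history as
    (left members, right members, similarity at merge) tuples."""
    clusters = [frozenset([i]) for i in items]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage (assumed): similarity of two clusters is the
                # highest similarity between any pair of their members.
                s = max(sim[frozenset((a, b))]
                        for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        history.append((sorted(clusters[i]), sorted(clusters[j]), s))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history


# Hypothetical modules and similarities, invented for illustration.
modules = ["lexer", "parser", "db", "cache"]
sim = {
    frozenset(("lexer", "parser")): 0.9,
    frozenset(("db", "cache")): 0.7,
    frozenset(("parser", "db")): 0.3,
    frozenset(("parser", "cache")): 0.2,
    frozenset(("lexer", "db")): 0.1,
    frozenset(("lexer", "cache")): 0.1,
}

history = agglomerate(modules, sim)
for left, right, s in history:
    print(f"merge {left} + {right} at similarity {s}")
```

The merge history is the information a dendrogram displays; cutting it at a chosen similarity level yields a flat partition. The pairwise search at every step keeps the sketch short at the cost of efficiency.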

The objective of this paper is to examine existing numerical clustering techniques used in other well-established disciplines, tailor those techniques for various software applications, and present empirical studies of the techniques in software engineering. The approach has been applied to several projects at Nortel Networks and some of the results are presented in this paper. The rest of the paper is organized as follows: Section 2 presents an overview of the clustering technique and discusses the method adopted for this research and the rationale behind it. Section 3 demonstrates several practical applications of the clustering technique to software partitioning, recovery, and restructuring. Section 4 discusses some lessons learned from applying the approach to various projects. Section 5 highlights some related work in software engineering. Finally, Section 6 presents the summary and discusses future directions.

Section snippets

Clustering

This section first describes the general concept behind the numerical taxonomy clustering technique. Following that, we will discuss the method adopted in this research.

Applications of clustering to software

This section presents applications of the Sorenson method and demonstrates different ways to define and obtain the contents of the input matrix. By defining the matrix differently, we show the various uses of the clustering method. Specifically, examples include software partitioning, recovery, restructuring, and decoupling. A generic example is given first as an illustration of the technique. Some measures have also been adopted which can be used to quantitatively evaluate clustering
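As an illustration of an input matrix and a similarity coefficient computed from it, the sketch below assumes that the Sorenson method refers to the Sørensen–Dice coefficient, 2a / (2a + b + c), where a counts features present in both components and b and c count features present in only one of them. The simple matching coefficient, (a + d) / (a + b + c + d), is included for contrast; d counts features absent from both. The component names, the binary matrix, and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical component-by-feature matrix: 1 means the component
# has (or uses) the feature, 0 means it does not.
matrix = {
    "C1": [1, 1, 0, 0, 0],
    "C2": [1, 0, 1, 0, 0],
    "C3": [0, 0, 0, 1, 1],
}


def match_counts(x, y):
    """Return (a, b, c, d) for two equal-length binary feature vectors."""
    a = sum(1 for p, q in zip(x, y) if p and q)          # present in both
    b = sum(1 for p, q in zip(x, y) if p and not q)      # only in x
    c = sum(1 for p, q in zip(x, y) if not p and q)      # only in y
    d = sum(1 for p, q in zip(x, y) if not p and not q)  # absent from both
    return a, b, c, d


def sorensen_dice(x, y):
    """Sørensen–Dice similarity; ignores joint absences (d)."""
    a, b, c, _ = match_counts(x, y)
    return 2 * a / (2 * a + b + c) if (a + b + c) else 0.0


def simple_matching(x, y):
    """Simple matching similarity; counts joint absences as agreement."""
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)


pairwise = {
    (p, q): sorensen_dice(matrix[p], matrix[q])
    for p, q in combinations(matrix, 2)
}
print(pairwise)
print(simple_matching(matrix["C1"], matrix["C2"]))
```

For C1 and C2 the counts are a = 1, b = 1, c = 1, d = 2, giving 0.5 for Sørensen–Dice and 0.6 for simple matching, which agrees with the values quoted in the citing excerpt later on this page. The difference comes entirely from d: a coefficient that ignores joint absences does not treat two components as similar merely because neither uses a feature.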

Lessons learned

As discussed earlier in this paper, the approach has been applied to various software applications. Specifically, we have applied the approach to software architecture partitioning at the design stage, design recovery in the reverse engineering process, software restructuring to support evolution, and source code decoupling. The approach provides a useful method for revealing the degree of similarity or coupling among components.

When applied to partitioning, as discussed in Section 3.1, there

Related work

Applications of the clustering concept specific to software partitioning have been studied. Andreu and Madnick (1977) applied the partitioning concept to a database management system in order to minimize coupling. The requirements and their interdependencies were first identified and converted to a graph problem. Various alternatives for partitioning were examined and a quantitative metric was calculated for each alternative. The alternative with the lowest value of coupling was chosen
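The general pattern of scoring partitioning alternatives on an interdependency graph and keeping the one with the lowest coupling can be sketched as below. This is not the metric of Andreu and Madnick (1977), which is not given in this excerpt; as a stand-in, coupling is assumed to be the total weight of edges that cross subsystem boundaries. The requirement names, edge weights, and alternative labels are invented for illustration.

```python
# Hypothetical requirement interdependency graph: edge -> strength.
edges = {
    ("r1", "r2"): 3,
    ("r2", "r3"): 1,
    ("r3", "r4"): 4,
    ("r1", "r4"): 1,
    ("r2", "r4"): 2,
}


def coupling(partition):
    """Stand-in coupling metric (assumed): total weight of edges whose
    endpoints fall in different subsystems of the partition."""
    group_of = {node: idx
                for idx, group in enumerate(partition)
                for node in group}
    return sum(w for (u, v), w in edges.items()
               if group_of[u] != group_of[v])


# Three hypothetical ways to split the four requirements into two subsystems.
alternatives = {
    "alt_a": [{"r1", "r2"}, {"r3", "r4"}],
    "alt_b": [{"r1", "r3"}, {"r2", "r4"}],
    "alt_c": [{"r1", "r4"}, {"r2", "r3"}],
}

scores = {name: coupling(p) for name, p in alternatives.items()}
best = min(scores, key=scores.get)
print(scores, "-> chosen:", best)
```

With these assumed weights, alt_a keeps the two strongest dependencies inside subsystems and therefore has the lowest score. Enumerating alternatives by hand only works for small graphs, which is one motivation for the clustering approach.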

Summary and future work

This paper presented a clustering method and demonstrated how it can be applied to software partitioning, recovery, restructuring, and decoupling. The key value of this approach is that it can support a rapid and effective evaluation of a system based on the relationships between components and features, or component interdependencies at various levels of abstraction. System partitioning is usually performed by designers based on their experiences. The proposed method can help designers quickly

Acknowledgements

Much of the research effort presented in this paper was carried out while the authors were affiliated with Nortel Networks. We are grateful to Kalai Kalaichelvan and Rama Munikoti for their support. We would also like to express our gratitude to the reviewers and editors for their time and helpful comments.

References (35)

Dhama, H., 1995. Quantitative models of cohesion and coupling in software. J. Syst. Software.
Lakhotia, A., 1997. A unified framework for expressing software subsystem classification techniques. J. Syst. Software.
Alexander, C., 1964. Notes on the Synthesis of Form.
Anderberg, M.R., 1973. Cluster Analysis for Applications.
Andreu, R.C., Madnick, S.E., 1977. A systematic approach to the design of complex systems: application to DBMS design...
Anquetil, N., Lethbridge, T., 1998. Extracting concepts from file names: a new file clustering criterion. In:...
Anquetil, N., Lethbridge, T., 2000. Experiments with clustering as a software remodularization method. In: Proceedings...
Arnold, R.S., 1993. Software Reengineering.
Bieman, J.M., et al., 1998. Measuring design-level cohesion. IEEE Trans. Software Eng.
Bieman, J.M., et al., 1994. Measuring functional cohesion. IEEE Trans. Software Eng.
Briand, L., Devanbu, P., Melo, W., 1997. An investigation into coupling measures in C++. In: Proceedings of...
Card, D.N., et al., 1990. Measuring Software Design Quality.
Davey, J., Burd, E., 2000. Evaluating the suitability of data clustering for software remodularization. In: Proceedings...
Dromey, R.G., 1996. Cornering the Chimera. IEEE Software.
Everitt, B., 1980. Cluster Analysis.
Heyliger, G., 1994. Coupling. In: Marciniak, J. (Ed.), Encyclopedia of Software...
Hutchens, D., et al., 1985. System structure analysis: clustering with data bindings. IEEE Trans. Software Eng.

Cited by (66)

Automatic clustering constraints derivation from object-oriented software using weighted complex network with graph theory analysis
2017, Journal of Systems and Software
Constrained clustering or semi-supervised clustering has received a lot of attention due to its flexibility of incorporating minimal supervision of domain experts or side information to help improve clustering results of classic unsupervised clustering techniques. In the domain of software remodularisation, classic unsupervised software clustering techniques have proven to be useful to aid in recovering a high-level abstraction of the software design of poorly documented or designed software systems. However, there is a lack of work that integrates constrained clustering for the same purpose to help improve the modularity of software systems. Nevertheless, due to time and budget constraints, it is laborious and unrealistic for domain experts who have prior knowledge about the software to review each and every software artifact and provide supervision on an on-demand basis. We aim to fill this research gap by proposing an automated approach to derive clustering constraints from the implicit structure of software system based on graph theory analysis of the analysed software. Evaluations conducted on 40 open-source object-oriented software systems show that the proposed approach can serve as an alternative solution to derive clustering constraints in situations where domain experts are non-existent, thus helping to improve the overall accuracy of clustering results.
A search engine for finding and reusing architecturally significant code
2017, Journal of Systems and Software
Citation Excerpt:
In addition, some public repositories (e.g., GitHub and SourceForge) support the search of code snippets and/or software projects. Several researchers have developed techniques for identifying architectural information in the source code (Tsantalis et al., 2006; Velasco-Elizondo et al., 2016; Lung et al., 2004; Cai et al., 2013; Jansen et al., 2008; Huang et al., 2006), however, to the best of our knowledge, no one has attempted to build a search engine to help developers find and reuse such architectural information. Our work is the first attempt to develop a tactic search engine that can help developers find and reuse tactical code snippets.
Architectural tactics are the building blocks of software architecture. They describe solutions for addressing specific quality concerns, and are prevalent across many software systems. Once a decision is made to utilize a tactic, the developer must generate a concrete plan for writing code and implementing the tactic. Unfortunately, this is a non-trivial task even for experienced developers. Often, developers resort to using search engines, crowd-sourcing websites, or discussion forums to find sample code snippets to implement a tactic. A fundamental problem of finding implementations for architectural tactics/patterns is the mismatch between the high-level intent reflected in the descriptions of these patterns and their low-level implementation details. To reduce this mismatch, we created a novel Tactic Search Engine called ArchEngine (ARCHitecture search ENGINE). ArchEngine can replace this manual internet-based search process and help developers find and reuse tactical code from a wide range of open source systems. ArchEngine helps developers find implementation examples of an architectural tactic for a given technical context. It uses information retrieval and program analysis techniques to retrieve applications that implement these design concepts. Furthermore, it lists and ranks the code snippets where the patterns/tactics are located. Our case study with 21 graduate students (with experience level of junior software developers) shows that ArchEngine is more effective than other search engines (e.g., Krugle and Koders) in helping programmers to quickly find implementations of architectural tactics/patterns.
Efficient software clustering technique using an adaptive and preventive dendrogram cutting approach
2013, Information and Software Technology
Citation Excerpt:
An example of a class diagram is shown in Fig. 2, while the features matrix is presented in Table 2. The Sorensen–Dice coefficient is used in some research on software clustering [23–25]. Using Fig. 2 and Table 2 as an example, if Simple Matching is used, the coefficient between Class 1 and Class 5 (a = 1, b = 1, c = 1, d = 2) will be 0.6 and will be 0.5 if Sorensen–Dice is used.
Software clustering is a key technique that is used in reverse engineering to recover a high-level abstraction of the software in the case of limited resources. Very limited research has explicitly discussed the problem of finding the optimum set of clusters in the design and how to penalize for the formation of singleton clusters during clustering.
This paper attempts to enhance the existing agglomerative clustering algorithms by introducing a complementary mechanism. To solve the architecture recovery problem, the proposed approach focuses on minimizing redundant effort and penalizing for the formation of singleton clusters during clustering while maintaining the integrity of the results.
An automated solution for cutting a dendrogram that is based on least-squares regression is presented in order to find the best cut level. A dendrogram is a tree diagram that shows the taxonomic relationships of clusters of software entities. Moreover, a factor to penalize clusters that will form singletons is introduced in this paper. Simulations were performed on two open-source projects. The proposed approach was compared against the exhaustive and highest gap dendrogram cutting methods, as well as two well-known cluster validity indices, namely, Dunn’s index and the Davies-Bouldin index.
When comparing our clustering results against the original package diagram, our approach achieved an average accuracy rate of 90.07% from two simulations after the utility classes were removed. The utility classes in the source code affect the accuracy of the software clustering, owing to their omnipresent behavior. The proposed approach also successfully penalized the formation of singleton clusters during clustering.
The evaluation indicates that the proposed approach can enhance the quality of the clustering results by guiding software maintainers through the cutting point selection process. The proposed approach can be used as a complementary mechanism to improve the effectiveness of existing clustering algorithms.
Incorporating Use History in Information System Remodularization
2024, IEEE Transactions on Engineering Management
Software Architecture in Practice: Challenges and Opportunities
2023, ESEC/FSE 2023 - Proceedings of the 31st ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering
Software Architecture in Practice: Challenges and Opportunities
2023, arXiv
