Applications of clustering techniques to software partitioning, recovery and restructuring

https://doi.org/10.1016/S0164-1212(03)00234-6

Abstract

The artifacts constituting a software system are sometimes unnecessarily coupled with one another or may drift over time. As a result, support for software partitioning, recovery, and restructuring is often necessary. This paper presents studies on applying the numerical taxonomy clustering technique to software applications. The objective is to facilitate these activities and to improve design, evaluation, and evolution. Numerical taxonomy is mathematically simple, yet it is a useful mechanism for component clustering and software partitioning. The technique can be applied at various levels of abstraction and in different software life-cycle phases. We have applied the technique to: (1) software partitioning at the software architecture design phase; (2) grouping of components based on the source code to recover the software architecture in the reverse engineering process; (3) restructuring of software to support evolution in the maintenance stage; and (4) improving cohesion and reducing coupling in source code. In this paper, we provide an introduction to numerical taxonomy, discuss our experiences in applying the approach to various areas, and relate the technique to similar work.

Introduction

Non-functional quality attributes such as maintainability and reliability are essential factors in controlling software life-cycle costs. It has been widely acknowledged that maintaining existing software accounts for as much as 60–80% of a software system’s total cost. Cohesion and coupling are two properties that have a great impact on several critical software quality attributes, including maintainability. Therefore, managing cohesion and coupling is of critical importance for system design and cost reduction.

Cohesion refers to a component’s internal strength, that is, the strength that holds the internal elements of a component together to perform a certain functionality. The term component is used generically in this paper: it could be a high-level architecture component, a module consisting of procedures, a procedure, a class, or even a variable. While cohesion is an intra-component property, coupling measures the interdependence among components. A desirable system partitioning should achieve high cohesion and low coupling, so that all the elements in one component are closely related for the realization of a certain feature, and changes made to that component have as little impact as possible on other components. Alexander (1964) also postulated that the major design principle common to all engineering disciplines is the relative isolation of one component from other components.

Software engineering is a relatively new area compared to well-established disciplines such as mechanical engineering and manufacturing. Software partitioning is usually conducted in an ad hoc manner and is primarily based on the designer’s experience. However, software systems may be ill-designed, or they may drift or erode over time due to changes in requirements and technology (Perry and Wolf, 1992). The resulting system could be highly coupled, which in turn creates problems for downstream software phases or evolution. Thus, effective partitioning or re-partitioning is needed. Effective partitioning or clustering is also a paramount goal in other disciplines, where clustering techniques have been used successfully to group similar components and to support partitioning of a system. In this research, clustering and partitioning are viewed as two sides of the same coin. Partitioning resembles a top–down approach that decomposes a system into smaller subsystems. Clustering, on the other hand, is a bottom–up method: similar components are grouped together to form clusters or subsystems. Those clusters or subsystems are the partitions that constitute the system.

In fact, partitioning or clustering analysis has been of long-standing interest and is a fundamental method used in science and engineering. The technique can facilitate better understanding of observations and the subsequent construction of complex knowledge structures from features and component clusters. For instance, the technique has been used to classify botanical species and mechanical parts. The key concept of clustering is to group similar things together to form a set of clusters, such that intra-cluster similarity (cohesion) is high and inter-cluster similarity (coupling) is low. This objective, high cohesion and low coupling, is the same one pursued in software design.

Various clustering techniques have also been studied in software engineering. In this paper, we borrow some clustering ideas from established disciplines and tailor them to software partitioning, recovery, and restructuring. The clustering techniques adopted in this paper are based on numerical taxonomy, or agglomerative hierarchical approaches. Numerical taxonomy uses numerical methods to classify components. There are several reasons for adopting it. The first is its conceptual and mathematical simplicity, as will be demonstrated in Section 2. Although its concept is simple, no scientific study has shown that numerical taxonomy is inferior to other, more complex multivariate methods (Romesburg, 1990). Another reason is that existing clustering techniques used in software engineering are often limited to the reverse engineering process and based on source code. The approach presented in this paper can also easily be applied to various levels of abstraction and be used in round-trip engineering (i.e., both forward engineering and reverse engineering processes). Furthermore, the technique can add value by facilitating software (design or code) restructuring, rather than simply design recovery. Lastly, the computation is fast, which is an important factor if the technique is applied interactively or incrementally.

The objective of this paper is to examine existing numerical clustering techniques used in other well-established disciplines, tailor those techniques for various software applications, and present empirical studies of the techniques in software engineering. The approach has been applied to several projects at Nortel Networks, and some of the results are presented in this paper. The rest of the paper is organized as follows: Section 2 presents an overview of the clustering technique and discusses the method adopted for this research and the rationale behind it. Section 3 demonstrates several practical applications of the clustering technique to software partitioning, recovery, and restructuring. Section 4 discusses some lessons learned from applying the approach to various projects. Section 5 highlights some related work in software engineering. Finally, Section 6 presents the summary and discusses future directions.

Section snippets

Clustering

This section first describes the general concept behind the numerical taxonomy clustering technique. Following that, we will discuss the method adopted in this research.
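As a rough illustration of the general concept, the sketch below performs agglomerative hierarchical clustering on a precomputed similarity table: every component starts as its own cluster, and the two most similar clusters are merged repeatedly until one remains. The component names, the similarity values, and the choice of complete linkage are all assumptions made for this example; the snippet above does not say which linkage rule the paper adopts.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

# Pairwise similarities keyed by an unordered pair of component names.
Similarity = Dict[FrozenSet[str], float]


def cluster_similarity(c1: FrozenSet[str], c2: FrozenSet[str], sim: Similarity) -> float:
    """Complete linkage (assumed): two clusters are as similar as their least similar pair."""
    return min(sim[frozenset((x, y))] for x in c1 for y in c2)


def agglomerate(items: List[str], sim: Similarity) -> List[Tuple[Set[str], Set[str], float]]:
    """Merge the most similar clusters until one remains; return the merge history."""
    clusters = [frozenset([item]) for item in items]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the highest similarity.
        left, right = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(pair[0], pair[1], sim),
        )
        level = cluster_similarity(left, right, sim)
        clusters = [c for c in clusters if c not in (left, right)] + [left | right]
        merges.append((set(left), set(right), level))
    return merges


# Hypothetical similarity values for four components.
similarity = {
    frozenset(("A", "B")): 0.9,
    frozenset(("C", "D")): 0.8,
    frozenset(("A", "C")): 0.3,
    frozenset(("A", "D")): 0.2,
    frozenset(("B", "C")): 0.4,
    frozenset(("B", "D")): 0.1,
}

merges = agglomerate(["A", "B", "C", "D"], similarity)
for left, right, level in merges:
    print(sorted(left), "+", sorted(right), "at similarity", level)
```

The merge history is what a dendrogram would draw: A and B join first, then C and D, and the two groups join last at a much lower similarity, which suggests a natural cut into two subsystems.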

Applications of clustering to software

This section presents applications of the Sorenson method and demonstrates different ways to define and obtain the contents of the input matrix. By defining the matrix differently, we show the various uses of the clustering method. Specifically, examples include software partitioning, recovery, restructuring, and decoupling. A generic example is given first as an illustration of the technique. Some measures have also been adopted which can be used to quantitatively evaluate clustering
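To make the idea of an input matrix concrete, here is a minimal sketch that assumes the "Sorenson method" refers to the coefficient commonly written Sørensen–Dice, 2a / (2a + b + c), computed over a binary component-by-feature matrix. Here a counts features present in both components, b and c count features present in only one of them, and d counts features absent from both. The simple matching coefficient, (a + d) / (a + b + c + d), is included only for comparison. The matrix contents and component names are hypothetical.

```python
from typing import Sequence, Tuple


def match_counts(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) for two binary feature rows of equal length."""
    if len(x) != len(y):
        raise ValueError("rows must have the same number of features")
    a = sum(1 for p, q in zip(x, y) if p and q)        # present in both
    b = sum(1 for p, q in zip(x, y) if p and not q)    # present only in x
    c = sum(1 for p, q in zip(x, y) if not p and q)    # present only in y
    d = len(x) - a - b - c                             # absent from both
    return a, b, c, d


def sorensen(x: Sequence[int], y: Sequence[int]) -> float:
    """Sorensen-Dice coefficient: ignores joint absences (d)."""
    a, b, c, _ = match_counts(x, y)
    denom = 2 * a + b + c
    return 2 * a / denom if denom else 0.0


def simple_matching(x: Sequence[int], y: Sequence[int]) -> float:
    """Simple matching coefficient: counts joint absences as agreement."""
    a, b, c, d = match_counts(x, y)
    total = a + b + c + d
    return (a + d) / total if total else 0.0


# Hypothetical input matrix: rows are components, columns are features.
matrix = {
    "comp_A": [1, 1, 0, 0, 0],
    "comp_B": [1, 0, 1, 0, 0],
    "comp_C": [0, 0, 0, 1, 1],
}

print(match_counts(matrix["comp_A"], matrix["comp_B"]))     # (1, 1, 1, 2)
print(sorensen(matrix["comp_A"], matrix["comp_B"]))         # 0.5
print(simple_matching(matrix["comp_A"], matrix["comp_B"]))  # 0.6
```

Because the Sorensen coefficient drops d, two components do not look similar merely because both lack many features. Changing what the rows and columns stand for (requirements, files, procedures, variables) changes the use of the same computation.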

Lessons learned

As discussed earlier in this paper, the approach has been applied to various software applications. Specifically, we have applied the approach to software architecture partitioning at the design stage, design recovery in the reverse engineering process, software restructuring to support evolution, and source code decoupling. The approach provides a useful method for revealing the degree of similarity or coupling among components.

When applied to partitioning, as discussed in Section 3.1, there

Related work

Applications of the clustering concept specific to software partitioning have been studied. Andreu and Madnick (1977) applied the partitioning concept to a database management system in order to minimize coupling. The requirements and their interdependencies were first identified and then converted to a graph problem. Various alternatives for partitioning were examined and a quantitative metric was calculated for each alternative. The alternative with the lowest value of coupling was chosen
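The general idea of scoring partitioning alternatives on a dependency graph can be sketched as follows. This is not the metric used by Andreu and Madnick, which this excerpt does not specify; a plain count of edges that cross partition boundaries stands in for it as an assumption. The requirement names, edges, and alternatives are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def coupling(edges: List[Edge], partition: List[Set[str]]) -> int:
    """Count interdependencies whose endpoints fall in different groups (assumed metric)."""
    owner = {node: idx for idx, group in enumerate(partition) for node in group}
    return sum(1 for u, v in edges if owner[u] != owner[v])


# Hypothetical requirement interdependencies, modeled as undirected edges.
edges = [
    ("r1", "r2"), ("r2", "r3"), ("r1", "r3"),
    ("r3", "r4"),
    ("r4", "r5"), ("r4", "r6"), ("r5", "r6"),
]

# Hypothetical partitioning alternatives over the same six requirements.
alternatives: Dict[str, List[Set[str]]] = {
    "alt_1": [{"r1", "r2", "r3"}, {"r4", "r5", "r6"}],
    "alt_2": [{"r1", "r2"}, {"r3", "r4"}, {"r5", "r6"}],
    "alt_3": [{"r1", "r4"}, {"r2", "r5"}, {"r3", "r6"}],
}

scores = {name: coupling(edges, groups) for name, groups in alternatives.items()}
best = min(scores, key=scores.get)
print(scores)
print("lowest coupling:", best)
```

With these inputs, alt_1 cuts only the r3–r4 dependency, so it has the lowest coupling of the three.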

Summary and future work

This paper presented a clustering method and demonstrated how it can be applied to software partitioning, recovery, restructuring, and decoupling. The key value of this approach is that it supports a rapid and effective evaluation of a system based on the relationships between components and features, or on component interdependencies at various levels of abstraction. System partitioning is usually performed by designers based on their experience. The proposed method can help designers quickly

Acknowledgements

Much of the research presented in this paper was carried out while the authors were affiliated with Nortel Networks. We are grateful to Kalai Kalaichelvan and Rama Munikoti for their support. We would also like to express our gratitude to the reviewers and editors for their time and helpful comments.

References (35)

  • Dhama, H., 1995. Quantitative models of cohesion and coupling in software. J. Syst. Software.
  • Lakhotia, A., 1997. A unified framework for expressing software subsystem classification techniques. J. Syst. Software.
  • Alexander, C., 1964. Notes on the Synthesis of Form.
  • Anderberg, M.R., 1973. Cluster Analysis for Applications.
  • Andreu, R.C., Madnick, S.E., 1977. A systematic approach to the design of complex systems: application to DBMS design...
  • Anquetil, N., Lethbridge, T., 1998. Extracting concepts from file names: a new file clustering criterion. In:...
  • Anquetil, N., Lethbridge, T., 2000. Experiments with clustering as a software remodularization method. In: Proceedings...
  • Arnold, R.S., 1993. Software Reengineering.
  • Bieman, J.M., et al., 1998. Measuring design-level cohesion. IEEE Trans. Software Eng.
  • Bieman, J.M., et al., 1994. Measuring functional cohesion. IEEE Trans. Software Eng.
  • Briand, L., Devanbu, P., Melo, W., 1997. An investigation into coupling measures in C++. In: Proceedings of...
  • Card, D.N., et al., 1990. Measuring Software Design Quality.
  • Davey, J., Burd, E., 2000. Evaluating the suitability of data clustering for software remodularization. In: Proceedings...
  • Dromey, R.G., 1996. Cornering the Chimera. IEEE Software.
  • Everitt, B., 1980. Cluster Analysis.
  • Heyliger, G., 1994. Coupling. In: Marciniak, J. (Ed.), Encyclopedia of Software...
  • Hutchens, D., et al., 1985. System structure analysis: clustering with data bindings. IEEE Trans. Software Eng.

Cited by (66)

    • A search engine for finding and reusing architecturally significant code

      2017, Journal of Systems and Software
      Citation Excerpt :

      In addition, some public repositories (e.g., GitHub and SourceForge) support the search of code snippets and/or software projects. Several researchers have developed techniques for identifying architectural information in the source code (Tsantalis et al., 2006; Velasco-Elizondo et al., 2016; Lung et al., 2004; Cai et al., 2013; Jansen et al., 2008; Huang et al., 2006), however, to the best of our knowledge, no one has attempt to build a search engine to help developers find and reuse such architectural information. Our work is the first attempt to develop a tactic search engine that can assist developers find and reuse tactical code snippets.

    • Efficient software clustering technique using an adaptive and preventive dendrogram cutting approach

      2013, Information and Software Technology
      Citation Excerpt :

      An example of a class diagram is shown in Fig. 2, while the features matrix is presented in Table 2. The Sorensen–Dice coefficient is used in some research on software clustering [23–25]. Using Fig. 2 and Table 2 as an example, if Simple Matching is used, the coefficient between Class 1 and Class 5 (a = 1, b = 1, c = 1, d = 2) will be 0.6 and will be 0.5 if Sorensen–Dice is used.

    • Incorporating Use History in Information System Remodularization

      2024, IEEE Transactions on Engineering Management
    • Software Architecture in Practice: Challenges and Opportunities

      2023, ESEC/FSE 2023 - Proceedings of the 31st ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering