Systematic evolution of model-based spreadsheet applications

https://doi.org/10.1016/j.jvlc.2011.11.009Get rights and content

Abstract

Using spreadsheets is the preferred method to calculate, display or store anything that fits into a table-like structure. They are often used by end users to create applications, although they have one critical drawback—spreadsheets are very error-prone. Recent research has developed methods to reduce this error-proneness by introducing a new way of object-oriented modeling of spreadsheets before using them. These spreadsheet models, termed ClassSheets, are used to generate concrete spreadsheets on the instance level. By this approach sources of errors are reduced and spreadsheet applications become easier to understand.

As usual for almost every other application, requirements on spreadsheets change due to the changing environment. Thus, the problem of evolution of spreadsheets arises. The update and evolution of spreadsheets is the uttermost source of errors that may have severe impact.

In this paper, we will introduce a model-based approach to spreadsheet evolution by propagating updates on spreadsheet models (i.e. ClassSheets) to spreadsheets. To this end, update commands for the ClassSheet layer are automatically transformed to those for the spreadsheet layer. We describe spreadsheet model update propagation using a formal framework and present an integrated tool suite that allows the easy creation and safe update of spreadsheet models. The presented approach greatly contributes to the problem of software evolution and maintenance for spreadsheets and thus avoids many errors that might have severe impacts.

Highlights

► ClassSheets are formally defined using Graph Structures. ► Evolution of ClassSheets represented in these Graph Structures is formally defined. ► Propagation of ClassSheet updates to Spreadsheet Instances is formally described. ► Full Tool Support is given that enables full model-based round-trip engineering of spreadsheets.

Introduction

Spreadsheets are one of the most popular and important programming languages used in business applications today. Estimates say that “each year tens of millions of managers and professionals around the world create hundreds of millions of spreadsheets” [1]. Reasons for this wide-spread use of spreadsheets are the ease of creating highly sophisticated spreadsheet applications using the simple and intuitive two-dimensional tabular layout and of course the fast and broad availability of spreadsheet applications. Due to their availability to non-experts, spreadsheets belong to the category of end-user development environments. However, the simplicity of spreadsheets is misleading. Although, the spreadsheet user group usually is able to develop complex spreadsheets, the users often do not have the knowledge to prevent errors leading to error-prone and unstructured spreadsheets [2]. Studies estimate rates of at least 80 percent of erroneous spreadsheets [3], [4]. How costly these errors can be is evident in recent news stories. For example, in 2006, the Office of Government Commerce Buying Solutions erred in informing 29 suppliers that they had been successful in the public sector tendering process. In a subsequent letter, they stated: “Unfortunately, [we are] not, as we had hoped, in a position to accept your tender at this time. This is because an error in the original evaluation spreadsheet has been identified, necessitating rescoring of all tenders for this project…this error has now been corrected and this has caused a small number of changes to the original award decision” [5].

Reasons for these failures are manifold, the most important reason being the missing business model. A business model of spreadsheets specifies which business entities are represented by the specific spreadsheet. For instance, a spreadsheet for budget calculation may comprise entities like category or year (cf. Fig. 1, categories in rows 4 and 5 and years in columns C up to H). A spreadsheet's business model is not given explicitly to the spreadsheet application, but usually is kept in the developer's idea and captured implicitly in the spreadsheet's layout and data. Taking into account the complexity of today's spreadsheets, the gap between the implicit business model of a spreadsheet and the resulting implementation (i.e. the spreadsheet itself) is too large. Fig. 2 shows the current approach of spreadsheet development. The user has an idea of the spreadsheet's business model in mind and develops the spreadsheet accordingly. However, spreadsheet development is very low-level and current spreadsheet applications do not allow to implement all elements of the business model. The semantic gap between the implicit business model and the spreadsheet leads to misleading error reports (e.g. the spreadsheet application complains about a formula that differs from surrounding formulas) or even missed errors. Spreadsheet applications like Microsoft Excel fail to sufficiently mitigate this situation. The reason for this failure is that the business model is not described explicitly and thus cannot be used to automatically validate the current state of a spreadsheet (i.e. the inserted data, formulas, and references). Bridging the described gap between the business model (captured in the developer's idea) and the IT implementation is recently known under Business/IT Alignment. The need for Business/IT alignment was emphasized by the Sarbanes-Oxley Compliance (see [6]) leading to plenty of tools that are concerned with increasing the quality of use and development of spreadsheets. However, these approaches rather provide process guidance and support security aspects but do not tackle the problem at its root, the gap between business and IT. But how can the business model help to prevent errors?

Usually, errors are produced while changing a spreadsheet. We distinguish two different kinds of spreadsheet changes, namely instance evolution and model evolution. Both kinds of changes are part of spreadsheet evolution. Instance evolution describes changes that concern a spreadsheet's data but not its representation and interrelations. For instance, inserting a new category in the budget spreadsheet mentioned above is part of instance evolution. In turn, model evolution describes changes that concern the relation between data (e.g. formulas) or the insertion and deletion of data types. For instance, the insertion of a new column that holds a new type of data (e.g. an exchange rate for given costs) is a change at the underlying business model. Those changes must be applied to all inserted data, e.g. the exchange rate must be inserted for all categories and years. See Fig. 3 for the distinction of instance evolution and model evolution. For reasons of understandability, we chose UML class diagrams [7] to represent the business model. Of course every other representation is applicable, e.g. the Entity Relationship Model [8]. Both instance and model evolutions are sources of errors. The most frequent error in instance evolution is a missed or wrong update of a formula, such as the missed update of the total formulas in Fig. 3. While Microsoft Excel supports updating formulas that range over connected regions (e.g. A17:A42) it lacks support for automatically updating formulas that range over unconnected regions, e.g. every second cell in a column (A1;A3;A5). Thus, insertion of data in complex large spreadsheets that contain many formulas over regions that are not connected is very prone to errors. The problem of model evolution is that a change at the implicit business model cause necessary changes on each inserted data record. For instance, adding the mentioned exchange rate to the budget sheet (cf. Fig. 3 on the left) requires the insertion of multiple columns and the change of multiple formulas as shown in Fig. 4. For complex large spreadsheets, this task is not only burdensome but also very error-prone.

To solve the problem of instance evolution, an approach proposed by Engels and Erwig [9] introduces ClassSheets using proved and well-tried techniques from modern Software Engineering like object-orientation and model-driven development to yield Business/IT alignment. ClassSheets are explicit specifications of business models which can be translated to spreadsheets automatically. These explicit specifications support a better understanding of the underlying business model, yield less erroneous spreadsheets due to the distinction of business model and implementation, and insure the correctness of spreadsheets regarding their business model (i.e. ClassSheet) with fully automated tool support. But why are object-orientation and model-driven development a good choice for bridging the gap between business and IT?

Object-orientation, on the one hand, improves the design and development of software by providing a presentation of the domain that imitates the real world in terms of interacting objects and their properties. Programming languages evolved from assembler to high-level languages supporting object orientation and thereby enabled efficient development of complex applications with high quality. Today's requirements put into spreadsheets require a similar evolution in the spreadsheet paradigm. On the other hand, model-driven development evolved into an established development paradigm bridging the gap between business and IT by using models. Abstraction provides an easy understanding leading to less redundant models that are focused on the problem space. Working with a clear comprehensible concept of the business domain speeds up the design and development of software, which is further supported by using partial automation, e.g. generation of software artifacts.

A major benefit of using ClassSheets is that they can be handled by machines and therefore provide spreadsheet applications (e.g. Microsoft Excel) with more information about the spreadsheet in use. Thus, the creation and use of spreadsheets is less prone to errors since the spreadsheet application constrains possible spreadsheet updates according to the underlying ClassSheet. Fig. 5 illustrates the current approach of developing spreadsheets on the left and the ClassSheets approach on the right side, where the generated spreadsheet is typed over the corresponding ClassSheet. In contrast to the current approach, the user does not develop low-levelspreadsheets, but creates a ClassSheet (i.e. business model) that reflects allnecessary business entities. This information is provided to the spreadsheetapplication to generate spreadsheets and avoid the mentioned errors, such asmissed formula updates.

Let us investigate ClassSheets by example. Fig. 6 shows the ClassSheet for the budget calculation spreadsheet. ClassSheets look like spreadsheets with some extra functionality, e.g. qualified references and specification of repetitions. Bold bolders indicate class boundaries, e.g. the whole ClassSheet consists of one embracing class (A1:F5) containing several other classes. Attribute definitions like year=2010 assign a name and a default value to a cell. Each attribute belongs to one class and may be accessed by its quantified name, e.g. Year.year. The budget ClassSheet defines a (blue) association class (cells C3:E4), associating the two (red and orange) bracketing classes Category (rows 3 and 4) and Year (columns C, D, and E). The association class contains entries for quantity, cost, and total per year and category. The two aggregated total values for a summation over years and categories are defined in the corresponding bracketing classes. The overall total is defined in an enclosing (black) bracket class Budget. The budget ClassSheet allows the expression of an arbitrary number of categories and years, which is denoted using the three dots in the axis description. The dots refer to all preceding rows/cols that are not divided by bars, i.e. for every year three columns are repeated (C, D, and E) while for every category, only row 4 is repeated. In Section 2.1, we discuss ClassSheets in more detail. Fig. 7 shows an instance of the budget ClassSheet indicating its instance evolution, i.e. new data is inserted while the structure of composed classes remains. In the example the user decides to insert a new year (i.e. an instance of the Year class) which is supported by the spreadsheet application by inserting the three corresponding columns and the initial values. The example also demonstrates the automatic update of all involved formulas (marked yellow in the last column). Moving the user's decision (i.e. inserting a bunch of new columns) to a higher level (i.e. inserting a new instance of the class Year) does not only simplify working with spreadsheets, but also provide the spreadsheet application with a better understanding of the user's intention and thus yields a lower error rate due to the enabled automation. Therefore, ClassSheets solve the described problem of inserting errors in the process of instance evolution.

Fig. 8 shows the life cycle of a spreadsheet using ClassSheets. First, a ClassSheet is created using a supporting tool. The use of ClassSheets at this stage simplifies the design of spreadsheets since well-known techniques from object-orientation can be used. For example, cells can be grouped to classes and formulas can refer to specific cells by using meaningful names. The ClassSheet approach comes with a rule system to identify correct ClassSheets, such that illegal ClassSheets cannot be created [9]. The second step in the life cycle is the generation of an empty ClassSheet instance (i.e. a spreadsheet) and its evolution. The ClassSheet tool suite comes with an Microsoft Excel plug-in that support instance evolution regarding the corresponding ClassSheet. The third step comprises the change of a ClassSheet in the presence of one or more corresponding instances. Those model changes must be replayed on all instances. Currently, there is no support for systematic model evolution in the spreadsheet paradigm.

This paper presents a model evolution method that is based on ClassSheets and therefore is coined ClassSheet evolution. ClassSheet evolution is a special case of model evolution and comparable to schema evolution known from databases. However, the difficulty of ClassSheet Evolution is that the spreadsheet's spatial layout has to be considered. This is usually not the case for database schema evolution or other types of model evolution.

The core problem of model evolution is the need to replay the changes on existing instances of the model. For instance, changing a XML schema definition requires the change of all XML files that are described by the schema accordingly. The same applies to the paradigm of databases, where a database schema update must be reflected on every database build upon this schema. Many solutions for this problem had been proposed, including model-to-model transformations and graph grammar-based solutions [10], [11], [12]. However, for the spreadsheet paradigm, this problem is yet unsolved. Fig. 9 describes the problem for ClassSheets in more detail. Whenever the user updates an existing ClassSheet, the problem of how to propagate this update to existing ClassSheet instances (i.e. spreadsheets) occurs. Manually updating the ClassSheet instances is definitely not advisable since this task is very prone to errors. Thus, an approach to automatically propagate ClassSheet updates to spreadsheets updates has to be developed.

The rest of this paper is structured as follows. In Section 2, ClassSheets will be discussed in more detail, providing the reader with a detailed example on ClassSheet evolution in Section 2.2. In Section 3 we describe the implementation of the ClassSheet propagation approach. Therefore, we formally describe the used data structures and propagation rules. After discussing related work in Section 4, we conclude this paper in Section 5 giving a short summary and pointing at some interesting future work.

Section snippets

Basics

ClassSheets are two-dimensional grids that consist of classes, attributes, and labels. Attributes may have a concrete value or describe a formula (cf. Fig. 6). The underlying business model of a ClassSheet (i.e. its conceptual information) can be expressed using UML class diagrams or similar representations. Fig. 10 shows a UML class diagram that expresses the business model of the budget ClassSheet. We use this representation to show the parallelism of ClassSheets and object-oriented design.

Formalization and implementation

There are at least two approaches for propagating ClassSheet changes to spreadsheets. The first one identifies changes on ClassSheets and reconstructs these changes on the spreadsheets. This approach is difficult to realize since there are countless changes that could be applied to a ClassSheet and especially for a whole sequence of changes it is difficult to reconstruct the atomic updates.1

Related work

The problem of erroneous spreadsheets has been addressed from many perspectives. Adopting software engineering techniques, such as testing [17] and debugging [18], for spreadsheets has been successful in supporting end users in the effective localization of errors. But even though these approaches cost less time and are an improvement over the more traditional auditing and code inspection approaches [19], [20], [21], a challenge is still how to motivate the users to spend time and effort on the

Conclusion and future work

ClassSheets present an effective way of designing spreadsheet models using several popular paradigms from actual computer science research. ClassSheets define object-oriented models that are used as templates and allow the use of an MDE style development process for spreadsheets. To deal with the problem of maintainability, we propose a formally defined graph data structure for ClassSheets to enable the easy definition of propagation rules that allow the semantics-preserving propagation of

References (50)

  • J. Sajaniemi

    Modeling spreadsheet audit: a rigorous approach to automatic visualization

    Journal of Visual Languages and Computing

    (2000)
  • R. Abraham et al.

    UCheck: a spreadsheet unit checker for end users

    Journal of Visual Languages and Computing

    (2007)
  • R.R. Panko, Spreadsheet errors: what we know, what we think we can do, in: Symposium of the European Spreadsheet Risks...
  • B. Boehm et al.

    Software defect reduction top 10 list

    Computer

    (2001)
  • R.R. Panko, The spreadsheet research website,...
  • L. Pryor, Spreadsheet research website,...
  • C. Weekly, Celebrations halted,...
  • L.P. Johnson, M.A. Sides, The Sarbanes-Oxley Act and Fiduciary Duties, SSRN eLibrary,...
  • MOF, Unified Modeling Language (UML),...
  • P. Chen

    The entity-relationship model—toward a unified view of data

    ACM Transactions on Database Systems

    (1976)
  • G. Engels, M. Erwig, ClassSheets: automatic generation of spreadsheet applications from object-oriented specifications,...
  • J. Bézivin, G. Dupé, F. Jouault, G. Pitette, J.E. Rougui, First experiments with the atl model transformation language:...
  • MOF, Query/Views/Transformations (QVT),...
  • C. Amelunxen, F. Klar, A. Königs, T. Rötschke, A. Schürr, Metamodel-based tool integration with moflon, in: ICSE ’08:...
  • M. Erwig et al.

    Gencel—a program generator for correct spreadsheets

    Journal of Functional Programming

    (2006)
  • G.E.J.-C. Bals, F. Christ, M. Erwig, Classsheets—model based, object oriented design of spreadsheet applications,...
  • J. Bals, F. Christ, ClassSheets—modellbasierter, werkzeuggestuetzter Entwurf von Spreadsheet-Anwendungen, Diploma...
  • M. Luckey, Automatic Propagation of Model Updates in the Spreadsheet Paradigm, Bachelor Thesis, University of...
  • G. Rothermel et al.

    A methodology for testing spreadsheets

    ACM Transactions on Software Engineering and Methodology

    (2001)
  • R. Abraham, M. Erwig, GoalDebug: a spreadsheet debugger for end users, in: 29th IEEE International Conference on...
  • R.R. Panko

    Applying code inspection to spreadsheet testing

    Journal of Management Information Systems

    (1999)
  • R. Mittermeir, M. Clermont, Finding high-level structures in spreadsheet programs, in: 9th Working Conference on...
  • M. Erwig, M.M. Burnett, Adding apples and oranges, in: 4th International Symposium on Practical Aspects of Declarative...
  • Y. Ahmad, T. Antoniu, S. Goldwater, S. Krishnamurthi, A type system for statically detecting spreadsheet errors, in:...
  • T. Antoniu, P.A. Steckler, S. Krishnamurthi, E. Neuwirth, M. Felleisen, Validating the unit correctness of spreadsheet...
  • Cited by (0)

    This paper has been recommended for acceptance by Shi Kho Chang.

    View full text