A novel three-level architecture for large data warehouses

https://doi.org/10.1016/S1383-7621(02)00056-5

Abstract

Classical architectures proposed so far for data warehouses show some drawbacks when adopted to work over large numbers of heterogeneous operational sources. In this paper we propose a variant of the three-level architecture for data warehouses that overcomes these drawbacks. However, in the application context under consideration, having a suitable architecture may not be enough for design purposes. Indeed, data warehouse design in very large operational environments can be quite a hard problem to attack with traditional manual methodologies. Therefore, we also illustrate automatic techniques capable of producing a data warehouse design conforming to the proposed architecture with limited human intervention.

Introduction

Data warehousing is recognized as a key technology for the exploitation of the massive amounts of data nowadays available electronically in many organizations. Recent analyses confirm the role of data warehousing as one of the most pervasive technologies in modern database applications. This increasing success, fostered also by the availability of efficient hardware/software support technologies, is due to the fact that data warehousing allows organizational data to be viewed at the right level of abstraction and aggregation, thereby providing convenient support to decision making within complex organizations.

Intuitively, a data warehouse is a repository of data and related information whose purpose is to allow the extraction, the reconciliation and the re-organization of data stored in traditional “operational” databases, and it usually requires the re-materialization of data stored in operational databases [3], [19]. Data warehouses are query-oriented systems (as opposed to operational, transaction-oriented databases), where updating only takes place as a massive, relatively rare and off-line process by which operational data are migrated into the data warehouse.

A data warehouse is obtained by defining its architecture and mappings by which source data are extracted, filtered, integrated and stored into the defined structure. Populating the data warehouse by new source data takes place according to a pace decided as part of the design process [19].

In the literature, various conceptual architectures have been proposed for data warehouses. They can be classified into three groups, depending on the number of levels they are characterized by.

In a one-level architecture [8] each piece of data is stored once and only once: a “middleware” level operates as an interface between users and operational databases. Therefore, there is no materialized data warehouse; rather, it is “simulated” using views (a virtual data warehouse). This kind of architecture allows for a rather quick development of a data warehouse at reduced cost. However, it requires planning activities (such as data source identification, data transformation and so on) to be carried out for each query,1 it does not support the historicization of data and, last but not least, it yields fairly unpredictable access times for the end user.

On the contrary, two-level architectures [5], [8], [12], [19] are based on the idea of separating source data from derived ones. In such architectures, the “first” level contains source data whereas derived data are stored in the “second” level. Derived data can be either a simple copy of source data or obtained from them by some abstraction/aggregation process. Two-level architectures are convenient especially when operational sources are basically homogeneous, but they have an important disadvantage in that they usually imply significant data duplication. Indeed, since each decision support application has its own derived data, information common to several decision support applications cannot be stored as a single shared copy. Therefore, one copy of such common derived data exists for each decision support application exploiting them.

Three-level architectures [7] are obtained by considering that the derivation of data for decision support is performed in two phases: (i) the reconciliation of operational data belonging to different sources; (ii) the derivation of decision support data from reconciled ones. In this sense, three-level conceptual architectures comprise operational, reconciled and derived data. The classical three-level architecture of a data warehouse is represented in Fig. 1.

In this model, the first level stores operational data, the second one stores reconciled data, and the third one stores decision support data [7]. Therefore, a three-level data warehouse can be represented by a 4-tuple DW = 〈OS, BDW, BIWS, M〉, where (i) OS indicates the Operational Systems that manage data, schemes and applications relative to operational data; (ii) BDW denotes the Business Data Warehouse, which is in charge of data, schemes and applications relative to reconciled data; (iii) BIWS represents the set of Business Information Warehouses (sometimes referred to as data marts), which manage data, schemes and applications for decision support; and (iv) M indicates the metadata, providing information about sources, schemes and interactions among the three levels of the data warehouse.
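As an illustration only, the 4-tuple above can be modeled as a simple data structure; the class and field names below, and the sample contents, are our own placeholders rather than the authors' formalism:

```python
from dataclasses import dataclass, field

@dataclass
class Scheme:
    """A database scheme, identified here only by name (attributes omitted)."""
    name: str

@dataclass
class DataWarehouse:
    """Sketch of the 4-tuple DW = <OS, BDW, BIWS, M> described in the text."""
    operational_systems: list[Scheme]       # OS: operational data, schemes, applications
    business_data_warehouse: list[Scheme]   # BDW: the reconciled data level
    business_info_warehouses: list[Scheme]  # BIWS: data marts for decision support
    metadata: dict[str, str] = field(default_factory=dict)  # M: cross-level metadata

# Hypothetical instance with two operational sources feeding one reconciled
# scheme, from which one data mart is derived.
dw = DataWarehouse(
    operational_systems=[Scheme("orders_db"), Scheme("hr_db")],
    business_data_warehouse=[Scheme("reconciled_global")],
    business_info_warehouses=[Scheme("sales_mart")],
    metadata={"reconciled_global": "integrated from orders_db, hr_db"},
)
```

The metadata component M is what ties the three levels together; here it is reduced to a string map purely for illustration.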

The reconciled data level represents a common view of data available in the whole organization. Reconciling activities eliminate inconsistencies and irregularities of operational data and integrate them in a rational way. Once reconciled data have been constructed, decision support data can be derived from them rather than from operational data.

The presence of a reconciled data level has several important advantages, summarized below:

  • the cost paid for obtaining derived data is smaller in three-level architectures than in other models, because reconciliation operations are performed once and for all at the beginning and need not be repeated for each derived data set;

  • the presence of reconciled data can help in the re-engineering of operational applications: for instance, some information, such as historic data, may have been stored in operational databases simply because the organization had no other place to store it; a reconciled data level makes it possible to transfer such data from the operational level to the reconciled one, thus simplifying the operational systems;

  • some decision support information may exist that cannot be extracted in a simple manner from operational data each time it is needed (for example, because the extraction is expensive or because transactions in the operational systems must be suspended to carry it out); in a two-level architecture this information must be stored with each derived data set that uses it, thereby inducing a significant amount of data replication; in a three-level architecture this information is stored once and for all in the reconciled data level.


Apparently, the main drawback of three-level architectures is, again, data replication at the reconciled level. However, introducing the third level does not actually increase the needed storage resources significantly, since (i) the reconciled data level stores only one copy of each piece of information, and (ii) the additional space needed to store data at the reconciled level is balanced by the space saved in storing common and difficult-to-derive data once and for all in the reconciled level (instead of duplicating them within the derived data workspace associated with each application using them).

However, in spite of their advantages, classical three-level data warehouse architectures show important limitations within application environments (like those typical of Central Public Administrations) comprising many complex, heterogeneous operational source systems. In such cases, the flat global scheme used within the reconciled level, obtained by integrating the schemes of the operational databases, presumably encodes an enormous number and variety of objects, and becomes far too complex to be effectively used. As a result, the derivation of decision support information from this level may become too complex a task. The application context just described is not merely of speculative relevance, since this is precisely the situation one encounters nowadays in many organizations. We point out that the problem here is due to data source complexity and heterogeneity and not to the data warehouse architecture itself: analogous difficulties arise also with one- and two-level architectures.

A first purpose of this paper is to provide an alternative three-level data warehouse architecture overcoming the above-mentioned limitation. In our approach, we do not directly integrate operational schemes to construct a global flat scheme. Rather, we first collect subsets of operational schemes into homogeneous clusters and then integrate schemes on a cluster-by-cluster basis. Each integrated scheme thus obtained is then abstracted to construct a global scheme representing the cluster. This process is iterated over the set of cluster schemes so obtained, until only one scheme is left. If this single scheme is by itself too complex, further abstraction steps can be carried out. In this way a hierarchical structure is obtained, called a data repository [13]. A data repository is thus a sort of complex database catalogue organizing the description of the involved data and of all scheme information at various abstraction levels. It encodes an appropriately structured view of the available data for the purpose of deriving decision support data. By adopting such a data repository as the global scheme of the reconciled level, we obtain the DW architecture represented in Fig. 2.
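The iterative cluster-integrate-abstract process can be sketched as follows. The `cluster`, `integrate` and `abstract` operators below are crude stand-ins for the paper's actual techniques (which rely on scheme homogeneity and interscheme properties); only the iteration structure is meant to match the text:

```python
# Sketch of the repository-building loop: schemes are clustered, each cluster
# is integrated into one scheme, that scheme is abstracted into a C-scheme,
# and the process repeats until a single scheme is left.

def cluster(schemes, size=3):
    """Group schemes into fixed-size chunks (a stand-in for homogeneity-based clustering)."""
    return [schemes[i:i + size] for i in range(0, len(schemes), size)]

def integrate(group):
    """Merge a cluster's schemes into one global scheme (names joined for illustration)."""
    return "+".join(group)

def abstract(scheme):
    """Abstract an integrated scheme into the cluster's representative scheme."""
    return f"abs({scheme})"

def build_repository(schemes):
    """Iterate cluster -> integrate -> abstract until one scheme remains;
    return the full hierarchy of levels (level 0 = source schemes)."""
    levels = [list(schemes)]
    while len(levels[-1]) > 1:
        levels.append([abstract(integrate(g)) for g in cluster(levels[-1])])
    return levels

levels = build_repository(["s1", "s2", "s3", "s4", "s5"])
# The last level contains the single root scheme of the data repository.
```

The list of levels returned by `build_repository` corresponds to the hierarchical structure of the data repository: lower levels keep the detail of the source schemes, higher levels give progressively more abstract views.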

In classical data warehouse architectures, metadata, which are fundamental both for the efficient derivation of data and for their appropriate exploitation in decision making, are derived and updated separately from operational, reconciled and derived data; therefore, they are often not well related to the corresponding schemes (e.g., they can be out of date). A data repository not only serves the purpose of collecting database schemes but can also store the metadata related to them. In fact, in our approach, metadata are obtained during the derivation of reconciled data; each time the data repository (and consequently the reconciled data level) is updated, the corresponding metadata are updated as well. An important advantage of our architecture over classical ones is thus its capability to maintain consistency among reconciled data, metadata and derived data.

When the number of involved data sources is large, manual methodologies for data warehouse design are difficult to apply. This is due to the presence of presumably hundreds of objects (entities, relationships, attributes), belonging to the involved database schemes, that must be analyzed for data warehouse design. Our approach to the construction of a data warehouse tackles this difficulty by being semi-automatic and requiring only a limited contribution from human experts.

In order to validate our techniques, we are conducting a case study on the database schemes of Italian Central Public Administration whose details are illustrated in Section 6.

Summarizing, the main contributions of this paper are:

  • a new three-level data warehouse architecture which avoids some of the problems characterizing traditional three-level architectures arising when the number and the dimension of involved sources are large;

  • methods that allow a simple maintenance of consistency among reconciled data, metadata and derived data;

  • a semi-automatic technique for supporting the design and the construction of a data warehouse.

The plan of the paper is as follows. In the next section we illustrate the new three-level architecture we propose for data warehouses and highlight the differences between it and traditional three-level architectures. Section 3 provides a general description of our techniques for the semi-automatic synthesis of data warehouses. Section 4 illustrates in detail the construction of the BDW, whereas the construction of BIWs is the subject of Section 5. Our case study is overviewed in Section 6 whereas, in Section 7, we draw our conclusions. The issue of consistency maintenance pointed out above is dealt with throughout the sections.

Section snippets

General characteristics

In our approach, the BDW can be thought of as a triplet BDW = 〈Rep, SoM, SoV〉. The repository BDW.Rep consists of a hierarchy of clusters. Each cluster represents a group of homogeneous schemes and is, in its turn, represented by a scheme (hereafter called a C-scheme2). The cluster hierarchy consists of several levels, so that each cluster belongs to one level; clusters of level n, n>0, are obtained by grouping C-schemes
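For illustration, the cluster hierarchy of BDW.Rep might be represented as follows. The class and field names, and the sample clusters, are our own assumptions; the roles of the SoM and SoV components are not modeled here:

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    """A cluster in the repository hierarchy: a group of homogeneous schemes,
    itself represented by a single C-scheme."""
    level: int
    c_scheme: str
    members: list  # level-0 clusters group source schemes;
                   # clusters of level n>0 group C-schemes of level n-1

# Hypothetical hierarchy: two level-0 clusters over source schemes, and one
# level-1 cluster grouping their C-schemes.
c0a = Cluster(level=0, c_scheme="C_personnel", members=["hr_db", "payroll_db"])
c0b = Cluster(level=0, c_scheme="C_finance", members=["budget_db", "tax_db"])
root = Cluster(level=1, c_scheme="C_root", members=[c0a.c_scheme, c0b.c_scheme])
```

The invariant suggested by the text is that a level-n cluster's members are always C-schemes produced at level n-1 (or source schemes, at level 0).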

Automatic synthesis of data warehouses

In this section we present a technique for obtaining, in a fairly automatic manner, a data warehouse designed over a set of source operational databases. The proposed technique takes as input a list of database schemes S (here and in the following we assume that schemes are represented in the E/R model) and a dictionary LSPD of lexical synonymy properties, and returns a data warehouse DW as output.

The LSPD stores lexical synonymies indicating that two objects have different names but the
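A dictionary of lexical synonymy properties of this kind could be sketched as a symmetric lookup table; the class, its methods and the plausibility coefficient below are assumptions made for illustration, not the authors' actual LSPD structure:

```python
# Minimal sketch of a lexical synonymy property dictionary (LSPD), assuming
# each entry records that two differently named scheme objects are synonyms,
# with an associated plausibility coefficient in [0, 1].

class LSPD:
    def __init__(self):
        # frozenset makes the lookup symmetric: (a, b) and (b, a) are the same key.
        self._synonyms = {}

    def add(self, name1, name2, plausibility=1.0):
        """Record that name1 and name2 denote synonymous objects."""
        self._synonyms[frozenset((name1, name2))] = plausibility

    def plausibility(self, name1, name2):
        """Return the recorded synonymy plausibility, or 0.0 if none is known."""
        if name1 == name2:
            return 1.0
        return self._synonyms.get(frozenset((name1, name2)), 0.0)

lspd = LSPD()
lspd.add("employee", "worker", 0.9)
```

Such a table lets the integration step treat `employee` and `worker` as candidates for merging even though their names differ lexically.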

Deriving the business data warehouse

The function δ takes as input a list S of database schemes and a dictionary IPD of interscheme properties. It constructs the BDW and enriches the IPD by adding interscheme properties derived during the construction of the BDW; finally, it yields as output a pair formed by the BDW and the modified IPD.

More particularly, the function groups schemes into clusters; for each cluster, the involved schemes are integrated to obtain a global scheme; this is eventually abstracted to produce the cluster

Deriving business information warehouses

The function ϱ takes as input a pair 〈BDW, IPD〉, constructs a set of BIWs and returns them. Actually, our approach derives the BIW schemes and the procedures needed to construct them. Other components typically present in BIWs (e.g., the reporting components) can be easily constructed from their schemes and procedures. BIWs are obtained from the involved database schemes by applying the technique of [5]; this technique requires the definition of the set of facts of interest, which forms the basis for the

A case study

We have tested our technique on a case study, which has also been exploited for tuning constants and thresholds. It concerns the Italian Central Government Office (ICGO) database schemes produced by the “Information Systems Authority for Public Administration” (AIPA). A recent study developed at AIPA led to the identification, classification and representation of about 300 databases owned by various Italian Ministries. The databases classified by AIPA are based on a variety of data models

Conclusions

In this paper, a new three-level architecture for data warehouses has been proposed. It retains the advantages of classical three-level architectures while avoiding the problems stemming from the necessity of managing, in several interesting application contexts, a very large global scheme. We have also illustrated a technique for constructing a data warehouse according to the proposed architecture. In particular, the approach constructs the Business Data Warehouse first; then, Business

Acknowledgements

The authors gratefully acknowledge the Italian Information System Authority for Public Administration (AIPA) for kindly providing ICGO schemes and technical support with them.
References (19)

  • L. Palopoli et al., Intensional and extensional integration and abstraction of heterogeneous databases, Data Knowl. Eng. (2000)
  • C. Batini et al., A methodology for data schema integration in the entity relationship model, IEEE Trans. Software Eng. (1984)
  • S. Castano et al., Semantic dictionary design for database interoperability
  • S. Chaudhuri et al., An overview of data warehousing and OLAP technology, ACM SIGMOD Record (1997)
  • Z. Galil, Efficient algorithms for finding maximum matching in graphs, ACM Comput. Surv. (1986)
  • M. Golfarelli, D. Maio, S. Rizzi, Conceptual design of data warehouses from E/R schemes, in: Proceedings of the Hawaii...
  • W. Gotthard et al., System-guided view integration for object-oriented databases, IEEE Trans. Knowl. Data Eng. (1992)
  • IBM, Information Warehouse Architecture I, IBM Corporation,...
  • W.H. Inmon, What is a data warehouse? Prism Tech. Topic (1997)


Luigi Palopoli received the Laurea Degree in Computer Science from the University of Pisa in 1986. From 1988 to 1991 he was a member of the research division at CRAI, a research consortium in computer science. From July 1991 to October 1998 he was assistant professor of Computer Engineering at University of Calabria. From October 1998 to October 2000 he was an Associate Professor at the same University. From October 2000 he is a Full Professor at the Mediterranean University of Reggio Calabria. His research interests include Knowledge Representation, Databases and Cooperative Information Systems, Bioinformatics, Artificial Intelligence.

Luigi Pontieri received the Laurea Degree in Computer Engineering from the University of Calabria in July 1996. From April 1997 to January 1998 he was a member of the Knowledge Engineering group at DEIS, University of Calabria. From January 1998 to January 2001 he was a Ph.D. Student at the University of Calabria. He received the Ph.D. in System Engineering and Computer Science from the University of Calabria in April 2001. Since January 2001 he has been a senior researcher at the Institute for Systems Analysis and Information Technology of the National Research Council of Italy. His research interests include Information Integration, Data Warehousing, Data Mining and Data Compression.

Giorgio Terracina received the Laurea Degree in Computer Engineering from the University of Calabria in April 1999. From May 1999 to March 2000 he was a member of the Knowledge Engineering group at DEIS. From March 2000 he is a Ph.D. Student at the Mediterranean University of Reggio Calabria. His research interests include Knowledge extraction and representation, scheme integration and abstraction, Cooperative Information Systems, Data Warehouses and semi-structured data, Bioinformatics.

Domenico Ursino received the Laurea Degree in Computer Engineering from the University of Calabria in July 1995. From September 1995 to January 1997 he was a member of the Knowledge Engineering group at DEIS. From January 1997 to January 2000 he was a Ph.D. Student at University of Calabria. From January 2000 to October 2000 he was a member of the Knowledge Engineering group at DEIS and of the database group at ISI-CNR. From October 2000 he is an Assistant Professor at the Mediterranean University of Reggio Calabria. His research interests include knowledge extraction and representation, Information Source Integration and Abstraction, Cooperative Information Systems, Data Warehouses, Intelligent Agents.

A preliminary version of the material presented in this paper appears under the title “Semi-automatic construction of a Data Warehouse from numerous large databases” in the Proceedings of the “Conference on Re-Technologies for Information Systems (RETIS'00)”, Zurich (Switzerland), 2000.
