Materialized view selection under the maintenance time constraint

doi:10.1016/S0169-023X(01)00007-6

Data & Knowledge Engineering

Volume 37, Issue 2, May 2001, Pages 203-216

https://doi.org/10.1016/S0169-023X(01)00007-6

Abstract

A data warehouse is a data repository which collects and maintains a large amount of data from multiple distributed, autonomous and possibly heterogeneous data sources. Often the data is stored in the form of materialized views in order to provide fast access to the integrated data. One of the most important decisions in designing a data warehouse is the selection of views for materialization. The objective is to select an appropriate set of views that minimizes the total query response time with the constraint that the total maintenance time for these materialized views is within a given bound. This view selection problem is totally different from the view selection problem under the disk space constraint. In this paper the view selection problem under the maintenance time constraint is investigated. Two efficient, heuristic algorithms for the problem are proposed. The key to devising the proposed algorithms is to define good heuristic functions and to reduce the problem to some well-solved optimization problems. As a result, an approximate solution of the known optimization problem will give a feasible solution of the original problem.

Introduction

A data warehouse is a repository of integrated information available for query and analysis. One of the major motivations for constructing a data warehouse is that queries can be answered using the information stored there, without needing to be translated and shipped to remote sources for execution. Also, the warehouse data is always available to users, even when a remote source is inaccessible during its local maintenance period. Often the data in the data warehouse is stored in the form of materialized views in order to accelerate on-line analytical processing (OLAP).

The selection of views for materialization is one of the most important issues in the design of data warehouses. Over the past several years, substantial effort has been devoted to selecting a set of views to materialize under a disk space constraint, with different optimization objectives [1], [3], [4], [5], [7], [11]. For example, Harinarayan et al. [7] provide an efficient, polynomial greedy algorithm that delivers a nearly optimal solution when selecting views to materialize in a data cube so as to minimize the total query response time. Gupta et al. [5] extend the results to the selection of views and indices in data cubes. Baralis et al. [1], Gupta [4], Ross et al. [9] and Yang et al. [11] present algorithms for the selection of materialized views that take into account the view maintenance cost and aim to minimize both the total query response time and the view maintenance overhead under the given disk space constraint. Theodoratos and Sellis [10] reformulate the view selection problem as a state space optimization problem by providing various optimization frameworks and heuristics.

Clearly, the query performance of a data warehouse improves tremendously as more and more views are materialized. With the price per unit of disk volume constantly dropping, the disk storage constraint is no longer the limiting factor in materialized view selection. The view maintenance time window, however, is limited (data warehouse maintenance is usually carried out at night or over weekends), and it becomes the bottleneck when too many materialized views need to be maintained. Materializing more views implies that a larger maintenance time window is needed. Therefore, the view maintenance window is a critical parameter in the design of a data warehouse: it constrains the number of views that can be materialized, and thereby ultimately determines the scalability and the functionality of the data warehouse.

Recently, Gupta and Mumick [6] and Kotidis and Roussopoulos [8] were the first to explore the view selection problem under a given maintenance time constraint. Kotidis and Roussopoulos [8] considered the data cube case where a horizontal fragment of a cubeid of the data cube is stored as a materialized view. Here a cubeid is an aggregate table derived from the fact table. Thus, a cubeid can be partitioned into several fragments and the fragments are stored as materialized views. Answering a user's aggregate query therefore often requires several materialized views. To this end, they suggest a heuristic approach for selecting fragments to materialize. Gupta and Mumick [6] introduce a theoretical framework that models the view selection problem under the maintenance time constraint as an OR-graph for the case of data cubes, and as an AND–OR graph for a more general setting, and present heuristic algorithms for the problem. Their algorithm for the OR-graph, however, is no more efficient than the naive algorithm described below. For the OR-graph, they adopt a greedy strategy; that is, the solution is built incrementally. Initially, the set of materialized views is MV=∅. Then, the algorithm repeats the following step until the total maintenance time for the views in MV would exceed the given maintenance time limit. At each step all possible inverted trees in the OR-graph are considered and one of the inverted trees T is chosen such that: (i) T has the largest benefit; and (ii) the maintenance time for the views in MV∪V(T) is within the given maintenance time limit, where V(T) is the view set in T. Let m be the number of vertices in the OR-graph. Then the time complexity of their algorithm is $\Omega\left(\sum_{k=1}^{m-1}\binom{m}{k}\right)=\Omega(2^{m})$, because there are $\Omega(2^{m})$ possible inverted trees in the graph. A simple naive algorithm for finding the set of materialized views works as follows.
For each k, 1⩽k<m, choose k views from the m views as a set of potential materialized views. There are $\binom{m}{k}$ such sets for a given k. If a set of k views satisfies that (i) all the queries can be answered using the views in the set, and (ii) the total maintenance time for the views in the set is within the given maintenance time limit, then that view set becomes a candidate feasible solution of the problem. Finally, the candidate leading to the minimum query response time is chosen from among these feasible solutions. Thus, the time complexity of the naive algorithm is $O\left(\sum_{k=1}^{m-1}\binom{m}{k}\right)=O(2^{m})$, which is no worse than that of the algorithm in [6].
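The naive enumeration described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `naive_view_selection`, the inputs `maint_time` and `query_costs`, and the assumption that maintenance times simply add up are hypothetical simplifications and not the paper's cost model. The loop visits $\sum_{k=1}^{m-1}\binom{m}{k}=2^{m}-2$ candidate sets.

```python
from itertools import combinations


def naive_view_selection(views, maint_time, query_costs, time_limit):
    """Exhaustively search all view subsets of size 1..m-1.

    views       : list of view names (hypothetical)
    maint_time  : dict view -> maintenance time (assumed additive here)
    query_costs : dict query -> {view: cost of answering the query from it}
    time_limit  : bound on the total maintenance time
    Returns (best_set, best_cost), or (None, inf) if nothing is feasible.
    """
    best_set, best_cost = None, float("inf")
    m = len(views)
    for k in range(1, m):                      # 1 <= k < m
        for subset in combinations(views, k):  # C(m, k) candidate sets
            chosen = set(subset)
            # (ii) total maintenance time must stay within the limit
            if sum(maint_time[v] for v in chosen) > time_limit:
                continue
            total, feasible = 0, True
            for costs in query_costs.values():
                usable = [c for v, c in costs.items() if v in chosen]
                if not usable:                 # (i) every query must be answerable
                    feasible = False
                    break
                total += min(usable)           # cheapest materialized source
            if feasible and total < best_cost:
                best_set, best_cost = chosen, total
    return best_set, best_cost
```

With three toy views where `a` is expensive to maintain but answers both queries cheaply, a tight limit forces the pair `{b, c}`, while a slightly looser limit admits `{a}` alone.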

Although there are some similarities between the view selection problems under the disk space constraint and under the maintenance time constraint, they are, in fact, significantly different, as explained below. For the space constraint version, the optimization objective is the total amount of space used for the view maintenance, while the total disk space occupied by a set of views always increases when more views are materialized. This is the so-called monotonicity of the optimization function. However, under the maintenance time constraint, the optimization objective is the total amount of maintenance time spent for a set of materialized views, and this optimization objective may not have the monotonicity property. In some cases, it is possible that the maintenance time for a set of views will decrease when more views are materialized. This non-monotonic nature of the maintenance time increases the complexity of designing efficient algorithms for the view selection problem under the maintenance time constraint. Motivated by the above discussion, in this paper we develop efficient algorithms for the view selection problem under the maintenance time constraint.
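The non-monotonicity can be illustrated with a small numeric sketch. The cost model below is a hypothetical assumption made for illustration only: each materialized view is refreshed from its cheapest available source, which is either the base data or another materialized view. The names `total_maintenance_time`, `base_cost` and `helper_cost` are not from the paper.

```python
def total_maintenance_time(materialized, base_cost, helper_cost):
    """Total maintenance time when each view uses its cheapest source.

    base_cost   : dict view -> time to refresh the view from the base data
    helper_cost : dict (view, helper) -> time to refresh `view` from `helper`,
                  usable only when `helper` is itself materialized
    """
    total = 0
    for v in materialized:
        options = [base_cost[v]]
        options += [t for (view, helper), t in helper_cost.items()
                    if view == v and helper in materialized]
        total += min(options)
    return total


# Toy numbers (assumed): v1 and v2 are costly to refresh from the base data,
# but cheap to refresh from an auxiliary view h.
base_cost = {"v1": 10, "v2": 10, "h": 5}
helper_cost = {("v1", "h"): 2, ("v2", "h"): 2}

without_h = total_maintenance_time({"v1", "v2"}, base_cost, helper_cost)
with_h = total_maintenance_time({"v1", "v2", "h"}, base_cost, helper_cost)
print(without_h, with_h)  # 20 9
```

Under these assumed numbers, materializing the extra view `h` lowers the total maintenance time from 20 to 9, so a selection procedure cannot assume that adding a view only ever consumes more of the maintenance budget.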

In this paper two efficient heuristic algorithms for the view selection problem under the maintenance time constraint are proposed. The first, the two-phase algorithm, optimizes the total query response time in its first phase, and chooses views for materialization under the given maintenance time constraint in its second phase. Several possible improvements to this algorithm are also discussed. The second, the integrated algorithm, takes into account both the maintenance time and the query response time simultaneously. The solution it delivers is therefore better than that delivered by the two-phase algorithm or its variants. The integrated algorithm, however, takes longer to obtain a solution.
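The control flow of a two-phase approach can be sketched as follows. This sketch is not the authors' algorithm: the second phase here is an assumed knapsack-style greedy rule (benefit per unit of maintenance time), and the names `two_phase_select`, `phase1_views`, `benefit` and `maint_time`, as well as the additive and strictly positive maintenance times, are hypothetical.

```python
def two_phase_select(phase1_views, benefit, maint_time, time_limit):
    """Pick views from a phase-one result so the maintenance bound holds.

    phase1_views : views chosen in phase one (query time only), assumed given
    benefit      : dict view -> assumed reduction in total query response time
    maint_time   : dict view -> assumed positive, additive maintenance time
    time_limit   : bound on the total maintenance time
    """
    # If the phase-one set already satisfies the bound, it is the answer.
    if sum(maint_time[v] for v in phase1_views) <= time_limit:
        return set(phase1_views)

    # Phase two (assumed heuristic): scan views in decreasing order of
    # benefit per unit maintenance time, keeping each view that still fits.
    ranked = sorted(phase1_views,
                    key=lambda v: benefit[v] / maint_time[v],
                    reverse=True)
    chosen, used = set(), 0
    for v in ranked:
        if used + maint_time[v] <= time_limit:
            chosen.add(v)
            used += maint_time[v]
    return chosen
```

The ratio rule is only one possible choice for the second phase; any selection rule that respects the bound would fit the same skeleton.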

The key to devising the proposed algorithms is to define good heuristic functions and to reduce the problem to some well-solved optimization problems. Accordingly, any approximate solution of the known optimization problem will give a feasible solution of the problem under discussion.

Although both proposed algorithms assume a static setting, in which the sets of views and queries are fixed during the selection of views for materialization, they can easily be adapted to a dynamic setting with minor modifications. In a dynamic environment, the membership of a view in the set of materialized views can change over time, as can the membership of a query in the query set.

The rest of the paper is organized as follows. Section 2 defines the problem precisely. Section 3 presents the two-phase algorithm and its variant. Section 4 proposes the integrated algorithm using different heuristics. Section 5 concludes the paper.

Section snippets

Preliminaries

Following [6], in this paper we assume that the views in the data warehouse form a directed graph G(V,E) which is an OR-graph DAG, where V is the set of views and E is the set of directed edges. A directed edge 〈u,v〉∈E from u to v implies that vertex (view) u can be derived from vertex (view) v. Many practical applications can be modeled as an OR-graph including the data cube. For example, Fig. 1 shows such an OR-graph. View x can be derived from u, while view y can be derived from either u or v
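The two derivation relationships stated above can be encoded as a small adjacency structure. This is a minimal sketch, not a reproduction of Fig. 1: only the relationships for x and y come from the text, while the empty source lists for u and v and the names `derivable_from` and `computable` are assumptions made for illustration.

```python
# Edge <u, v> means "u can be derived from v". For each view we store the
# views it can be derived from; in an OR-graph any single source suffices.
derivable_from = {
    "x": ["u"],        # x can be derived from u
    "y": ["u", "v"],   # y can be derived from either u or v
    "u": [],           # assumption: no sources listed in this toy graph
    "v": [],           # assumption: no sources listed in this toy graph
}


def computable(view, materialized, graph):
    """True if `view` is materialized or derivable from a computable view.

    The graph is assumed to be acyclic, so the recursion terminates.
    """
    if view in materialized:
        return True
    return any(computable(src, materialized, graph) for src in graph[view])
```

For instance, with only v materialized, y is computable but x is not, because x has u as its only listed source.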

A two-phase view selection algorithm

In this section we propose a heuristic algorithm for finding a feasible solution for the problem. The proposed algorithm consists of two phases. In the first phase it finds a set S (⊂V) of views such that the total query response time is minimized. If the maintenance time for the views in S is also within the given maintenance time bound, then S is the solution of the problem and the algorithm terminates. Otherwise, in the second phase, a subset S^′ (S^′⊂S) of S will be chosen such that the total

An integrated algorithm

Although the two-phase algorithm and its variants in the preceding section give feasible solutions for the problem, they do not take into account both of the optimization objectives (the maintenance time and the query response time) simultaneously. Here an algorithm, called the integrated algorithm, is presented, which takes these two optimization objectives into consideration at the same time.

The proposed approach is still a greedy approach. Initially, assume all the other views are virtual

Conclusions

The view selection problem under the maintenance time constraint is a fundamental problem in the design of data warehouses. In this paper, several heuristic algorithms for the problem have been proposed. The novelty of the proposed algorithms lies in finding good heuristic functions and reducing the problem to some well-established optimization problems. In turn, an approximate solution for the well-known optimization problem will give a feasible solution for the problem. It must be mentioned that

Acknowledgements

We would like to thank Jeffrey X. Yu for his helpful comments and suggestions about the proposed algorithms. The research by Weifa Liang is partially supported by a research grant from The Australian Research Council under a small grant scheme (Grant No: F00025).

Weifa Liang received his Ph.D. degree in computer science from The Australian National University in 1998. He received his M.E. degree in Computer Science from University of Science and Technology of China in 1989 and his B.S. degree in Computer Science from Wuhan University, China in 1984. He is currently a Lecturer in the Department of Computer Science at The Australian National University. His research interests include parallel processing, parallel and distributed algorithms, data

References (12)

E. Baralis, S. Paraboschi, E. Teniente. Materialized view selection in a multidimensional database, in: Proceedings of...
T.H. Cormen et al. Introduction to Algorithms, 1994.
A. Gupta et al. Maintenance of materialized views: problems, techniques, and applications, IEEE Data Eng. Bull., 1995.
H. Gupta. Selection of views to materialize in a data warehouse, in: Proceedings of the Sixth ICDT, 1997, pp....
H. Gupta, V. Harinarayan, A. Rajaraman, J.D. Ullman. Index selection for OLAP, in: Proceedings of the International...
H. Gupta, I.S. Mumick. Selection of views to materialize under a maintenance cost constraint, in: Proceedings of the...

There are more references available in the full text version of this article.

Cited by (46)

A maintenance centric approach to the view selection problem
2013, Information Systems
Citation Excerpt:
Finally, we wrap up with our concluding remarks and plans for future research in Section 7. There are three main VSP formulation categories presented in literature, each with two sub-formulations: (1) unbounded (UVSP) [13,18–30], (2) space-constrained (SCVSP) [10,13,15,19,22,31–39], and (3) update-time-constrained (MCVSP) [10,16,34,40–45]. The description of each will be discussed in this section with a more formal mathematical definition presented in Section 3.
The View Selection Problem is an optimization problem designed to enhance query performance through the pre-computation and storage of select views given resource constraints. Assuring the materialized views can be updated within a reasonable time frame has become a chief concern for recent models. However, these methods are crafted simply to fit a solution within a feasible range and not to minimize the resource intensive maintenance process. In this paper, we submit two novel advances in terms of model formulation and solution generation to reduce maintenance costs. Our proposed model, the Minimum-Maintenance View Selection Problem, combines previous techniques to minimize and constrain update costs. Furthermore, we define a series of maintenance time reducing principles in solution generation embodied in a constructor heuristic. The model and constructor heuristic are evaluated using an existing clinical data warehouse and state-of-the-art heuristics. Our analysis shows our model produces the lowest-cost solution relative to extant models. Also, they indicate algorithms seeded with our constructor heuristic to be superior solutions to all other methods tested.
Simultaneous determination of view selection and update policy with stochastic query and response time constraints
2008, Information Sciences
Citation Excerpt:
Lastly, the query arrivals can be assumed as with constant rates, or as stochastic processes with variations which can be modeled by some probability distributions such as Poisson distribution. Most published results resolve the issues on systems without query-triggered updates and with fixed query arrival rates [5–11,15,21,22,24,31–35]. Only limited research has considered the query-triggered enabling systems with stochastic process [27].
Data warehouses are built to reply query searches efficiently from integrated data of various systems. To improve the performance of the system, the issue of materializing views within data warehouses must be explored. This involves to pre-compute a set of selected views which are fact and dimension tables, under given resource and quality constraints. The quality constraints include query processing time, data maintenance time and the freshness of data when queries are placed. Then there is the policy of updating, which treats the time issue of data reloading in data warehouses. A model is proposed to determine the view selection and update policy when the arrival of queries follows Poisson processes with the constraints of system response time, storage space and query dependent currency of data (on systems capable of periodic and query-triggered updates). To the best of the researchers’ knowledge, no other research has considered all these factors in their models. A two-phase greedy algorithm was developed to determine the optimal update policy for the view selection problem. Numerous experiments were performed to explore the sensitivity of the proposed model under various constraints and system parameter settings. The results show that the model has reasonable responses to the tunings and that the proposed algorithm can rapidly find acceptable solutions.
Efficient approaches for materialized views selection in a data warehouse
2007, Information Sciences
Citation Excerpt:
Ross et al. [18] considered the use of additional views to reduce maintenance cost. Liang et al. [12] proposed a two-stage algorithm that optimizes the total query response time in the first stage, and chooses views for materialization under the given maintenance time constraint in the second stage. Unfortunately, if storage space is inadequate for containing all the views that are dependent on other views, then the greedy algorithm has the worst performance.
View materialization is an effective method to increase query efficiency in a data warehouse and improve OLAP query performance. However, one encounters the problem of space insufficiency if all possible views are materialized in advance. Reducing query time by means of selecting a proper set of materialized views with a lower cost is crucial for efficient data warehousing. In addition, the costs of data warehouse creation, query, and maintenance have to be taken into account while views are materialized. In this paper, we propose efficient algorithms to select a proper set of materialized views, constrained by storage and cost considerations, to help speed up the entire data warehousing process. We derive a cost model for data warehouse query and maintenance as well as efficient view selection algorithms that effectively exploit the gain and loss metrics. The main contribution of our paper is to speed up the selection process of materialized views. Concurrently, this will greatly reduce the overall cost of data warehouse query and maintenance.
Construction and distribution of materialized views in Non-binary data space
2021, Innovations in Systems and Software Engineering
Construction of Materialized Views in Non-Binary Data Space
2021, Lecture Notes in Networks and Systems
Taster: Self-tuning, elastic and online approximate query processing
2019, Proceedings - International Conference on Data Engineering

Hui Wang is currently a Master (by research) student in computer science at School of Computer Science and Electrical Engineering in The University of Queensland. She received her B.S. degree in applied mathematics from Anhui University, China in 1984. Before coming to Australia, as a software engineer, she had been working in an institution in China for a decade to conduct research and development of application software in the simulation of VLSI circuits. Her current research interests include design and analysis of data warehousing, the consistency control of views in data warehousing, relational database applications, and Web development applications.

Maria E. Orlowska is currently the Professor in Information Systems at The University of Queensland in Australia. Since 1992 she has also acted as Distributed Unit Leader in the Cooperative Research Centre for Distributed Systems Technology (DSTC). She graduated with a Ph.D. (Computer Science) in June 1980 from the Institute of Applied Mathematics, Technical University of Warsaw. She is a trustee of the VLDB Endowment, and is a regular contributor to many other international conferences. She has published over 130 papers in international journals and conference proceedings. Her research expertise lies in the areas of: the theory of relational databases, distributed databases, various aspects of information systems design methodologies (including distributed systems), enhancement of semantic data modelling techniques by rigorous factors, transaction processing in distributed systems, concurrency control, distributed and federated database systems, and workflows technology.
