Materialized view selection under the maintenance time constraint
Introduction
A data warehouse is a repository of integrated information available for query and analysis. One of the major motivations for constructing data warehouses is for queries that can be answered using the information stored there without a need to be translated and shipped to remote sources for execution. Also, the warehouse data is always available for users, even when the remote sources are not accessible at the times of the local source's maintenance period. Often the data in the data warehouse is stored in a form of materialized views in order to accelerate on-line analytical processing (OLAP).
The selection of views for materialization is one of the most important issues in the design of data warehouses. Given the disk space constraint, substantial effort in selecting a set of views to materialize with different optimization objectives has been taken over the past several years [1], [3], [4], [5], [7], [11]. For example, Harinarayan et al. [7] provide an efficient, polynomial greedy algorithm to select views for materialization in order to minimize the total query response time for a data cube that delivers a nearly optimal solution. Gupta et al. [5] extend the results to the selection of views and indices in data cubes. Baralis et al. [1], Gupta [4], Ross et al. [9] and Yang et al. [11] present algorithms for the selection of materialized views by taking into account the view maintenance cost and aim to minimize both the total query response time and the view maintenance overhead under the given disk space constraint. Theodoratos and Sellis [10] reformulate the view selection problem as a state space optimization problem by providing various optimization frameworks and heuristics.
It is obvious that, in a data warehouse, its query performance will be improved tremendously as more and more views are materialized. With the ratio $$/disk volume constantly dropping, the disk storage constraint is no longer the limiting factor of the materialized view selection. The view maintenance time window is limited (usually the data warehouse maintenance is carried out at night or over weekends) due to too many materialized views in the data warehouse which need to be maintained. More view materialization implies that a larger time maintenance window is needed. Therefore, the view maintenance window is a critical parameter in the design of a data warehouse, which constrains the number of views to be materialized, and thereby determines the scalability and the functionality of the data warehouse in the end.
Recently Gupta and Mumick [6], and Kotidis and Roussopoulos [8] first explored the view selection problem under a given maintenance time constraint. Kotidis and Roussopoulos [8] considered the data cube case where a horizontal fragment of a cubeid of the data cube is stored as a materialized view. Here a cubeid is an aggregate table derived from the fact table. Thus, a cubeid can be partitioned into several fragments and the fragments are stored as materialized views. To answer a user's aggregate query, it is obvious that several materialized views are often required. To this end, they suggest a heuristic approach for selecting fragments to materialize. Gupta and Mumick [6] introduce a theoretical framework by modeling the view selection problem under the maintenance time constraint as an OR-graph for the case of data cubes, and an AND–OR graph for a more general setting, with presented heuristic algorithms for the problem. Their algorithm for the OR-graph, however, is less efficient than a naive algorithm which is shown as follows. For the OR-graph, they adopted a greedy strategy. That is, the solution is built incrementally. Initially, the set of materialized views MV=∅. Then, the algorithm repeats the following step until the total maintenance time for the views in MV is beyond the given maintenance time limit. At each step all possible inverted trees in the OR-graph are considered and one of the inverted trees T is chosen such that: (i) T has the largest benefit; and (ii) the maintenance time for views in MV∪V(T) is within the given maintenance time limit, where V(T) is the view set in T. Let m be the number of vertices in the OR-graph, then the time complexity of their algorithm isbecause there are possible inverted trees in the graph. There is a simple naive algorithm for finding the set of materialized views as follows. For each different k, 1⩽k<m, choose k views as a set of potential materialized views from the m views. There are such sets for a given k. Once a set of k views satisfies that (i) all the queries can be answered using the views in the set, and (ii) the total maintenance time for the views in the set is within the given maintenance time limit, that view set will become a candidate of the feasible solution of the problem. Finally, one feasible solution leading to the minimum query response time is chosen from these possible feasible solutions. Thus, the time complexity of the naive algorithm iswhich is no worse than the one in [6].
Although there are some similarities between the view selection problems under the disk space constraint and under the maintenance time constraint, they are, in fact, significantly different, which is explained as follows. For the space constraint version, the optimization objective is the total amount of space used for the view maintenance, while the total disk space occupied by a set of views always increases when more views are materialized. This is the so-called monotonicity of the optimization function. However, under the maintenance time constraint, the optimization objective is the total amount of maintenance time spent for a set of materialized views, while this optimization objective may not have the monotonicity property. In some cases, it is possible that the maintenance time for a set of views will decrease when more views are materialized. This non-monotonic nature of the maintenance time increases the complexity of designing efficient algorithms for the view selection problem under the maintenance time constraint. It is based on the above discussions. In this paper we dedicate ourselves to develop efficient algorithms for the view selection problem under the maintenance time constraint.
In this paper two efficient heuristic algorithms for the view selection problem under the maintenance time constraint are proposed. The two-phase algorithm consists of two phases. It optimizes the total query response time in the first phase, and chooses views for materialization under the given maintenance time constraint in the second phase. Several possible improvements to this algorithm are also discussed. The integrated algorithm takes into account both the maintenance time and the query response time simultaneously. So, the solution it delivers is better than that delivered by the two-phase algorithm as well as its variant of the two-phase algorithm. The integrated algorithm, however, takes a longer time to obtain a solution.
The key to devising the proposed algorithms is to define good heuristic functions and to reduce the problem to some well-solved optimization problems. Accordingly, any approximate solution of the known optimization problem will give a feasible solution of the problem under discussion.
Although both proposed algorithms are based on an assumption of the static setting in which the number of views and queries are stabilized and fixed up during the selection of views for materialization, these algorithms can easily be adopted in a dynamic setting through minor modifications. In the dynamic environment, whether or not a view is in the set of materialized views can be changed dynamically, and so too can a query in the query set.
The rest of the paper is organized as follows. Section 2 defines the problem precisely. Section 3 presents the two-phase algorithm and its variant. Section 4 proposes the integrated algorithm using different heuristics. Section 5 concludes the paper.
Section snippets
Preliminaries
Following [6], in this paper we assume that the views in the data warehouse form a directed graph G(V,E) which is an OR-graph DAG, where V is the set of views and E is the set of directed edges. A directed edge 〈u,v〉∈E from u to v implies that vertex (view) u can be derived from vertex (view) v. Many practical applications can be modeled as an OR-graph including the data cube. For example, Fig. 1 shows such an OR-graph. View x can be derived from u, while view y can be derived from either u or v
A two-phase view selection algorithm
In this section we propose a heuristic algorithm for finding a feasible solution for the problem. The proposed algorithm consists of two phases. In the first phase it finds a set S (⊂V) of views such that the total query response time is minimized. If the maintenance time for the views in S is also within the given maintenance time bound, then S is the solution of the problem, and it has been done. Otherwise, in the second phase, a subset S′ (S′⊂S) of S will be chosen such that the total
An integrated algorithm
Although the two-phase algorithm and its variants in the preceding section give feasible solutions for the problem, they do not take into account both of the optimization objectives (the maintenance time and the query response time) simultaneously. Here an algorithm called integrated algorithm is presented, which will take these two optimization objectives into consideration at the same time.
The proposed approach is still a greedy approach. Initially, assume all the other views are virtual
Conclusions
The view selection problem under the maintenance time constraint is a fundamental problem in the design of data warehousing. In this paper, several heuristic algorithms for the problem have been proposed. The novelty of the proposed algorithms is finding good heuristic functions and reducing the problem to some well-established optimization problems. In turn, an approximate solution for the well known optimization problem will give a feasible solution for the problem. It must be mentioned that
Acknowledgements
We would like to thank Jeffrey X. Yu for his helpful comments and suggestions about the proposed algorithms. The research by Weifa Liang is partially supported by a research grant from The Australian Research Council under a small grant schema (Grant No: F00025).
Weifa Liang received his Ph.D. degree in computer science from The Australian National University in 1998. He received his M.E. degree in Computer Science from University of Science and Technology of China in 1989 and his B.S. degree in Computer Science from Wuhan University, China in 1984. He is currently a Lecturer in the Department of Computer Science at The Australian National University. His research interests include parallel processing, parallel and distributed algorithms, data
References (12)
- E. Baralis, S. Paraboschi, E. Teniente. Materialized view selection in a multidimensional databases, in: Proceedings of...
- et al.
Introduction to Algorithms
(1994) - et al.
Maintenance of materialized views: problems, techniques, and applications
IEEE Data Eng. Bull.
(1995) - H. Gupta. Selection of views to materialize in a data warehouse, in: Proceedings of the Sixth ICDT, 1997, pp....
- H. Gupta, V. Harinarayan, A. Rajaraman, J.D. Ullman. Index selection for OLAP, in: Proceedings of the International...
- H. Gupta, I.S. Mumick. Selection of views to materialize under a maintenance cost constraint, in: Proceedings of the...
Cited by (46)
A maintenance centric approach to the view selection problem
2013, Information SystemsCitation Excerpt :Finally, we wrap up with our concluding remarks and plans for future research in Section 7. There are three main VSP formulation categories presented in literature, each with two sub-formulations: (1) unbounded (UVSP) [13,18–30], (2) space-constrained (SCVSP) [10,13,15,19,22,31–39], and (3) update-time-constrained (MCVSP) [10,16,34,40–45]. The description of each will be discussed in this section with a more formal mathematical definition presented in Section 3.
Simultaneous determination of view selection and update policy with stochastic query and response time constraints
2008, Information SciencesCitation Excerpt :Lastly, the query arrivals can be assumed as with constant rates, or as stochastic processes with variations which can be modeled by some probability distributions such as Poisson distribution. Most published results resolve the issues on systems without query-triggered updates and with fixed query arrival rates [5–11,15,21,22,24,31–35]. Only limited research has considered the query-triggered enabling systems with stochastic process [27].
Efficient approaches for materialized views selection in a data warehouse
2007, Information SciencesCitation Excerpt :Ross et al. [18] considered the use of additional views to reduce maintenance cost. Liang et al. [12] proposed a two-stage algorithm that optimizes the total query response time in the first stage, and chooses views for materialization under the given maintenance time constraint in the second stage. Unfortunately, if storage space is inadequate for containing all the views that are dependent on other views, then the greedy algorithm has the worst performance.
Construction and distribution of materialized views in Non-binary data space
2021, Innovations in Systems and Software EngineeringConstruction of Materialized Views in Non-Binary Data Space
2021, Lecture Notes in Networks and SystemsTaster: Self-tuning, elastic and online approximate query processing
2019, Proceedings - International Conference on Data Engineering
Weifa Liang received his Ph.D. degree in computer science from The Australian National University in 1998. He received his M.E. degree in Computer Science from University of Science and Technology of China in 1989 and his B.S. degree in Computer Science from Wuhan University, China in 1984. He is currently a Lecturer in the Department of Computer Science at The Australian National University. His research interests include parallel processing, parallel and distributed algorithms, data warehousing and OLAP, query optimization, routing protocol design and graph theory.
Hui Wang is currently a Master (by research) student in computer science at School of Computer Science and Electrical Engineering in The University of Queensland. She received her B.S. degree in applied mathematics from Anhui University, China in 1984. Before coming to Australia, as a software engineer, she had been working in an institution in China for a decade to conduct research and development of application software in the simulation of VLSI circuits. Her current research interests include design and analysis of data warehousing, the consistency control of views in data warehousing, relational database applications, and Web development applications.
Maria E. Orlowska is currently the Professor in Information Systems at The University of Queensland in Australia. Since 1992 she has also acted as Distributed Unit Leader in the Cooperative Research Centre for Distributed Systems Technology (DSTC). She graduated with a Ph.D. (Computer Science) in June 1980 from the Institute of Applied Mathematics, Technical University of Warsaw. She is a trustee of the VLDB Endowment, and is a regular contributor to many other international conferences. She has published over 130 papers in international journals and conference proceedings. Her research expertise lies in the areas of: the theory of relational databases, distributed databases, various aspects of information systems design methodologies (including distributed systems), enhancement of semantic data modelling techniques by rigorous factors, transaction processing in distributed systems, concurrency control, distributed and federated database systems, and workflows technology.