Extracting k most important groups from data efficiently

https://doi.org/10.1016/j.datak.2008.04.001

Abstract

We study an important data analysis operator, which extracts the k most important groups from data (i.e., the k groups with the highest aggregate values). In a data warehousing context, an example of the above query is “find the 10 combinations of product-type and month with the largest sum of sales”. The problem is challenging as the potential number of groups can be much larger than the memory capacity. We propose on-demand methods for efficient top-k groups processing, under limited memory size. In particular, we design top-k groups retrieval techniques for three representative scenarios as follows. For the scenario with data physically ordered by measure, we propose the write-optimized multi-pass sorted access algorithm (WMSA), that exploits available memory for efficient top-k groups computation. Regarding the scenario with unordered data, we develop the recursive hash algorithm (RHA), which applies hashing with early aggregation, coupled with branch-and-bound techniques and derivation heuristics for tight score bounds of hash partitions. Next, we design the clustered groups algorithm (CGA), which accelerates top-k groups processing for the case where data is clustered by a subset of group-by attributes. Extensive experiments with real and synthetic datasets demonstrate the applicability and efficiency of the proposed algorithms.

Introduction

Aggregate queries summarize information in a database, by dividing tuples into groups, where some target attributes agree on their values, and applying an aggregate function (e.g., COUNT, SUM, MAX, etc.) to each group. As an example, consider a data warehouse [1] that stores detailed information about the transactions of a company in a huge fact table [15] with schema Sales(TID, ProdID, StoreID, Time, Quantity, Price). Assume the following online analytical processing (OLAP) query:

  SELECT ProdID, StoreID, SUM(Quantity)
  FROM Sales
  GROUP BY ProdID, StoreID

In practice, the number of product/store combinations can be large and the results could overwhelm the user. Moreover, the aim of such an analysis is typically to identify the most important groups, rather than to view the distribution of aggregates over all product/store combinations. We could express ‘importance’ by a HAVING clause that selects only those groups whose aggregate value exceeds a threshold t. This class of aggregate queries was identified in [7] and named iceberg queries.

Nevertheless, from the user’s perspective, it is hard to specify an appropriate value for t, so the iceberg query is far from user-friendly. When t is too large, no results may be returned at all; for small values of t, too many groups are retrieved. A better way to capture group interestingness is to rank the groups by their aggregate values and select the top k groups in this order. This way, the user controls the result size naturally. As an example, a top-k groups query can be expressed in SQL by adding the following lines to the statement above:

  ORDER BY SUM(Quantity) DESC
  STOP AFTER k
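STOP AFTER k is not standard SQL; most engines express the same cutoff with LIMIT (or FETCH FIRST). As a runnable illustration of the query, here is a minimal sketch using SQLite with a toy Sales table and made-up values:

```python
import sqlite3

# Toy in-memory Sales table with made-up values (hypothetical example).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (TID INT, ProdID INT, StoreID INT, Quantity INT)")
rows = [(1, 1, 1, 10), (2, 1, 1, 5), (3, 2, 1, 30), (4, 2, 2, 1), (5, 3, 2, 8)]
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", rows)

# Top-2 (ProdID, StoreID) groups by SUM(Quantity); LIMIT plays the role of STOP AFTER.
top2 = conn.execute(
    "SELECT ProdID, StoreID, SUM(Quantity) AS total "
    "FROM Sales GROUP BY ProdID, StoreID "
    "ORDER BY total DESC LIMIT 2"
).fetchall()
print(top2)  # [(2, 1, 30), (1, 1, 15)]
```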

Apart from finding heavy groups in data warehouses, the top-k groups query also finds application in other data mining tasks. For example, the problem of extracting top-k frequent patterns [11] (a variant of frequent itemset mining [2]) can be viewed as a top-k groups query: all combinations of items are candidate groups and the objective is to find those with the largest COUNT in a transactional database. Another application (from information retrieval) is to retrieve the Web documents with the largest number of incoming (or outgoing) links. Finally, as demonstrated in our experiments, top-k queries can be used to identify pairs of network ports with a high volume of information flow, from traffic traces of TCP packets.

The evaluation of top-k groups queries could be facilitated by exploiting materialized views [12] over the base data. However, the selection of attributes in such queries could be ad hoc. Pre-computation, materialization, and maintenance of group-bys for all possible combinations of attributes incur prohibitively high cost, especially when the data are updated frequently. Therefore, in this paper, we study on-demand processing of top-k groups queries on a very large base table. Specifically, we focus on the class of distributive aggregate functions (e.g., COUNT, SUM, MAX, MIN), and ignore holistic aggregate functions (e.g., MEDIAN).

A straightforward solution is to keep a counter for each group in memory and update the corresponding count while scanning the base table. This method requires only one pass over the data. However, it may be infeasible to keep a counter for each group in memory even though today’s machines have large memory sizes. For example, it is not uncommon to have attributes with domain size in the order of 1000. For a query with four group-by attributes on a petabyte warehouse, the number of required counters is 1000^4 = 10^12, which translates to 4 terabytes (assuming 4 bytes per counter).
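The straightforward method can be sketched as follows; this is our minimal illustration with a SUM aggregate (the function name and toy data are ours, not from the paper):

```python
from collections import defaultdict

def topk_groups_one_pass(tuples, k):
    # One counter per distinct group; feasible only while all counters
    # fit in memory.
    counters = defaultdict(int)
    for group_key, measure in tuples:
        counters[group_key] += measure  # SUM aggregate
    # Rank groups by aggregate value and keep the k best.
    return sorted(counters.items(), key=lambda kv: -kv[1])[:k]

data = [(("p1", "s1"), 10), (("p2", "s1"), 30), (("p1", "s1"), 5), (("p3", "s2"), 8)]
print(topk_groups_one_pass(data, 2))  # [(('p2', 's1'), 30), (('p1', 's1'), 15)]
```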

The traditional method (by an RDBMS) for evaluating iceberg and top-k groups queries (using limited memory) is to compute the aggregate scores for all groups and select the qualifying ones according to the threshold t, or the ranking parameter k. This method (implemented by hashing or sorting with early aggregation) can be quite expensive, since the group-by operation may apply multiple passes over the data to compute the aggregates for all groups, while most of them are expected to be eventually pruned. Previous work on iceberg queries [7], [17] employed sampling- and/or hash-based techniques to eliminate groups having small aggregates early and minimize the number of passes over the base data. The extension of these methods for top-k groups queries is not straightforward, since they rely on a fixed threshold t, which cannot be determined a priori.
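The hash-based strategy mentioned above can be sketched as a two-pass, grace-hash-style aggregation. This is a simplified illustration of ours: real systems write partitions to disk and recurse when a partition’s groups still overflow memory.

```python
from collections import defaultdict

def hash_aggregate(tuples, num_partitions=4):
    # Pass 1: scatter tuples into partitions by hash of the group key
    # (partition files on disk in a real system). A group's tuples all
    # land in the same partition.
    partitions = [[] for _ in range(num_partitions)]
    for g, v in tuples:
        partitions[hash(g) % num_partitions].append((g, v))
    # Pass 2: aggregate each partition independently; memory only holds
    # one partition's counters at a time. Assumes each partition's
    # distinct groups fit (real systems recurse otherwise).
    result = {}
    for part in partitions:
        counters = defaultdict(int)
        for g, v in part:
            counters[g] += v
        result.update(counters)
    return result

data = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
print(sorted(hash_aggregate(data).items()))  # [('a', 4), ('b', 7), ('c', 4)]
```

Note that this computes aggregates for all groups before any ranking, which is exactly the inefficiency the proposed algorithms attack: most of these groups are eventually pruned.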

Top-k groups retrieval has been studied in [20], albeit in a specialized context. First, the group-by attributes are regarded as ad hoc ranges in continuous or spatial domains. Second, aggregate extensions of multidimensional indexes [16] were presumed on all relevant attributes of the query. Li et al. [18] study top-k groups queries in RDBMS in the presence of a striding index [13] for the group-by attributes. The existence of multidimensional indexes for all group-by attributes is a rather strong assumption, since the number of attributes in a relation (or combination of relations) could be arbitrary and the query may involve any subset of them. There are also several theoretical studies on one-pass approximate top-k groups retrieval from data streams under limited memory (e.g., [5], [22]). Nevertheless, such techniques are not directly applicable to situations where exact retrieval of groups and their aggregates is essential.

The goal of this paper is to provide solutions for on-demand and exact top-k groups extraction, under bounded memory size. The key contribution of this paper is a set of algorithms to efficiently handle top-k groups queries in the realistic scenario where neither specialized multidimensional indexes are available, nor approximation of results is acceptable. Specifically, we investigate three representative scenarios and develop comprehensive techniques for them:

  • For the case where tuples are physically ordered by measure, we propose the write-optimized multi-pass sorted access algorithm (WMSA), that exploits available memory to compute top-k groups efficiently.

  • Under the scenario of unordered data, we study the recursive hashing algorithm (RHA), coupled with branch-and-bound techniques and derivation heuristics for tight aggregation bounds of partitions.

  • For the special case where the tuples to be aggregated are clustered according to a subset of group-by attributes, we develop the clustered groups algorithm (CGA), which accelerates top-k groups processing.

Our algorithms are cross-compared with traditional RDBMS approaches for on-demand top-k groups retrieval, using real and synthetic data.

The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 discusses the different problem settings for a top-k groups query. Section 4 presents top-k groups processing algorithms for the case where the input is physically ordered according to the aggregated measure. In Section 5, we propose techniques for the case of unordered input. Section 6 discusses the case where the input is clustered according to the values of some group-by attributes. The experimental evaluation of each technique is presented in its respective section. Finally, Section 7 concludes the paper.

Section snippets

Related work

Our work is related to top-k aggregates retrieval, iceberg queries, and top-k algorithms for middleware. In this section, we review past research on these problems and discuss their relationship to our work.

Problem settings

In this section, we define the problem of top-k groups queries for various cases and motivate the design of the proposed algorithms. Then, we discuss the setting used for empirically evaluating our algorithms.

Algorithms for inputs ordered by measure

In this section, we propose techniques for top-k groups queries, applicable when the tuples are physically ordered based on the aggregated value v. In Section 4.1 we propose an algorithm that applies on a descending order of the tuples by v, assuming that the memory can fit a counter per group. The algorithm is extended in Sections 4.2 (multi-pass sorted access algorithm) and 4.3 (write-optimized MSA) for cases of limited memory.

These algorithms enable early termination as long as the lower bound
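The early-termination idea for measure-ordered input can be sketched as follows. This is our simplified single-pass illustration, assuming positive measures and enough memory for a counter per seen group; the paper’s MSA/WMSA variants use tighter bounds and work under limited memory.

```python
def topk_groups_sorted_desc(stream, k):
    # stream: (group, measure) pairs sorted by measure descending, measures > 0.
    # partial[g] is a lower bound on g's final SUM; any group (seen or not)
    # can still gain at most remaining * v, where v is the current measure
    # (no later measure exceeds it). Once the k-th best lower bound beats
    # the best rival's upper bound, the top-k group set is final.
    partial = {}
    n = len(stream)
    for i, (g, v) in enumerate(stream):
        partial[g] = partial.get(g, 0) + v
        remaining = n - i - 1
        if len(partial) >= k:
            ranked = sorted(partial.values(), reverse=True)
            kth = ranked[k - 1]
            best_rival = ranked[k] if len(ranked) > k else 0  # unseen groups start at 0
            if kth >= best_rival + remaining * v:
                break  # the set of top-k groups is fixed; stop scanning early
    # Return group keys only: exact sums may require finishing the scan
    # for tuples of these k groups.
    return [g for g, _ in sorted(partial.items(), key=lambda kv: -kv[1])[:k]]

stream = [("a", 50), ("a", 40), ("b", 30), ("c", 20), ("b", 10), ("d", 5), ("d", 1)]
print(topk_groups_sorted_desc(stream, 2))  # ['a', 'b']
```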

Algorithms for inputs with random order

In this section, we study the evaluation of top-k queries for generic inputs T, where the tuples appear in random order. We assume that memory is bounded such that we cannot accommodate a counter for each group. Thus, multiple passes over the data might be required to derive the exact set of top-k groups. We present (i) an adaptation of an effective iceberg query algorithm [7] for our problem and (ii) an extension of the hash-based aggregation algorithm [9] that minimizes the number of accesses
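The partition-and-prune idea behind hash-based processing can be sketched as follows. This is our single-level simplification with a coarse upper bound; the paper’s RHA recurses on oversized partitions and derives tighter bounds via its heuristics.

```python
from collections import defaultdict
import heapq

def topk_groups_partition_prune(tuples, k, num_partitions=8):
    # Scatter tuples into partitions by hash of the group key, so every
    # group lies entirely inside one partition.
    parts = defaultdict(list)
    for g, v in tuples:
        parts[hash(g) % num_partitions].append((g, v))
    # Upper bound per partition: total measure mass inside it; with
    # positive measures no single group can aggregate to more.
    order = sorted(parts.values(), key=lambda p: -sum(v for _, v in p))
    topk = []  # min-heap of (score, group)
    for part in order:
        bound = sum(v for _, v in part)
        if len(topk) == k and bound <= topk[0][0]:
            break  # branch-and-bound pruning: no later partition can help
        agg = defaultdict(int)
        for g, v in part:
            agg[g] += v
        for g, s in agg.items():
            if len(topk) < k:
                heapq.heappush(topk, (s, g))
            elif s > topk[0][0]:
                heapq.heapreplace(topk, (s, g))
    return sorted(topk, reverse=True)

data = [("a", 5), ("b", 3), ("a", 2), ("c", 9), ("d", 1), ("c", 1)]
print(topk_groups_partition_prune(data, 2))  # [(10, 'c'), (7, 'a')]
```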

Data clustered by a subset of group-by attributes

In this section, we show how the case of data ordered or clustered based on some group-by attributes can be reduced to the random-order case, where the same algorithms discussed in the previous section can be used as modules for query evaluation. At the end, we present an experimental study for the proposed method.
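The reduction can be sketched as follows (our illustration; the extractor functions and toy schema are assumptions). Because the clustering attributes are a subset of the group-by attributes, no group spans two clusters, so each cluster is aggregated with a small, discardable hash table while a global size-k heap keeps the best groups seen so far.

```python
from collections import defaultdict
from itertools import groupby
import heapq

def topk_groups_clustered(tuples, k, cluster_key, group_key):
    # tuples must arrive clustered by cluster_key (a subset of the
    # group-by attributes), so consecutive runs cover whole clusters.
    topk = []  # min-heap of (score, group)
    for _, cluster in groupby(tuples, key=cluster_key):
        agg = defaultdict(int)
        for t in cluster:
            agg[group_key(t)] += t[-1]  # last field holds the measure
        for g, s in agg.items():
            if len(topk) < k:
                heapq.heappush(topk, (s, g))
            elif s > topk[0][0]:
                heapq.heapreplace(topk, (s, g))
        # agg is discarded here: memory need is per-cluster, not global.
    return sorted(topk, reverse=True)

# Tuples (ProdID, StoreID, Quantity) clustered by ProdID (toy values).
sales = [("p1", "s1", 5), ("p1", "s2", 3), ("p1", "s1", 2),
         ("p2", "s1", 9), ("p3", "s1", 1)]
print(topk_groups_clustered(sales, 2, lambda t: t[0], lambda t: (t[0], t[1])))
# [(9, ('p2', 's1')), (7, ('p1', 's1'))]
```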

Conclusions

We studied an important data analysis operator that retrieves the k groups with the highest aggregate values. In data warehousing, this operator can be used directly by data analysts to identify important groups. It can also be used in frequent itemsets mining applications, as well as data mining tasks in information retrieval. The main challenge of this problem is to find the top-k groups in the case where the distinct number of groups exceeds the number of counters that can fit in memory. In

Man Lung Yiu received the Bachelor Degree in Computer Engineering and the Ph.D. degree in Computer Science from the University of Hong Kong in 2002 and 2006, respectively. He is currently an assistant professor at Department of Computer Science, Aalborg University. His research interests include databases and data mining, especially advanced query processing and mining techniques for complex types of data.

References (23)

  • P. Flajolet et al., Probabilistic counting algorithms for database applications, J. Comput. Syst. Sci. (1985)
  • Data Warehousing and OLAP: A Research-Oriented Bibliography....
  • R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: Proceedings of the VLDB,...
  • K.S. Beyer, R. Ramakrishnan, Bottom-Up Computation of Sparse and Iceberg CUBEs, in: Proceedings of the ACM SIGMOD,...
  • N. Bruno, L. Gravano, A. Marian, Evaluating top-k queries over web-accessible databases, in: Proceedings of ICDE,...
  • G. Cormode, S. Muthukrishnan, What’s hot and what’s not: tracking most frequent items dynamically, in: Proceedings of...
  • R. Fagin, A. Lotem, M. Naor, Optimal aggregation algorithms for middleware, in: Proceedings of ACM PODS,...
  • M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, J.D. Ullman, Computing iceberg queries efficiently, in:...
  • G. Graefe, Query evaluation techniques for large databases, ACM Comput. Surv. (1993)
  • A. Guttman, R-trees: a dynamic index structure for spatial searching, in: Proceedings of ACM SIGMOD,...
  • J. Han, J. Wang, Y. Lu, P. Tzvetkov, Mining top-k frequent closed patterns without minimum support, in: Proceedings of...

Nikos Mamoulis received the diploma in computer engineering and informatics in 1995 from the University of Patras, Greece, and the Ph.D. degree in computer science in 2000 from the Hong Kong University of Science and Technology. Since September 2001, he has been a faculty member of the Department of Computer Science at the University of Hong Kong, currently an associate professor. In the past, he has worked as a postdoctoral researcher at the Centrum voor Wiskunde en Informatica (CWI), the Netherlands. His research interests include complex data management, data mining, advanced indexing and query processing, and constraint satisfaction problems. He has published more than 75 articles in reputable international conferences and journals and served in the program committees of numerous database and data mining conferences.

Vagelis Hristidis received his B.S. in Electrical and Computer Engineering at the National Technical University of Athens in 1999. He received his M.Sc. and Ph.D. degrees in Computer Science in 2000 and 2004, respectively, at the Computer Science and Engineering Department of the University of California, San Diego (UCSD). Since 2004 he has been an Assistant Professor at the School of Computing and Information Sciences at Florida International University. His areas of expertise are Databases and Information Retrieval.
