
Quantiles over data streams: experimental comparisons, new analyses, and further improvements

  • Regular Paper
  • Published in The VLDB Journal

Abstract

A fundamental problem in data management and analysis is to generate descriptions of the distribution of data. It is most common to give such descriptions in terms of the cumulative distribution, which is characterized by the quantiles of the data. The design and engineering of efficient methods to find these quantiles have attracted much study, especially in the case where the data are given incrementally, and we must compute the quantiles in an online, streaming fashion. While such algorithms have proved to be extremely useful in practice, there has been limited formal comparison of the competing methods, and no comprehensive study of their performance. In this paper, we remedy this deficit by providing a taxonomy of different methods and describing efficient implementations. In doing so, we propose new variants that have not been studied before, yet which outperform existing methods. To illustrate this, we provide detailed experimental comparisons demonstrating the trade-offs between space, time, and accuracy for quantile computation.




Notes

  1. Note that floating-point numbers in standard representations (e.g., IEEE 754) can be mapped to integers in a fixed universe in an order-preserving fashion.

  2. http://www.minorplanetcenter.net/iau/ecs/mpcat-obs/mpcat-obs.html.

  3. Right ascension is an astronomical term used to locate a point (a minor planet in this case) in the equatorial coordinate system.

  4. http://www.ncfloodmaps.com.

  5. The error can in principle be affected by the larger number of duplicates in a smaller universe, but we found this effect negligible in practice.
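As an aside on Note 1: the standard order-preserving map from IEEE 754 doubles to unsigned 64-bit integers can be sketched as follows (our own illustrative code, not part of the paper; the function name is ours). Non-negative doubles get their sign bit set, and negative doubles have all their bits flipped, so that unsigned integer order matches numeric order:

```python
import struct

def float_to_ordered_int(x: float) -> int:
    """Map an IEEE 754 double to a 64-bit unsigned integer such that
    integer order matches the numeric order of the floats."""
    # Reinterpret the double's bit pattern as an unsigned 64-bit integer.
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    if bits >> 63:
        # Negative floats: flip all bits, so more-negative values map lower.
        return (1 << 64) - 1 - bits
    # Non-negative floats: set the sign bit, so they sort above all negatives.
    return bits | (1 << 63)
```

Note that -0.0 and +0.0 map to adjacent but distinct integers, which is harmless for quantile summaries over distinct values.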

References

  1. Agarwal, P.K., Cormode, G., Huang, Z., Phillips, J.M., Wei, Z., Yi, K.: Mergeable summaries. ACM Trans. Database Syst. 38, 26 (2013)


  2. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999). doi:10.1006/jcss.1997.1545

  3. Arasu, A., Manku, G.: Approximate counts and quantiles over sliding windows. In: Proceedings of the ACM Symposium on Principles of Database Systems (2004)

  4. Blum, M., Floyd, R.W., Pratt, V., Rivest, R.L., Tarjan, R.E.: Time bounds for selection. J. Comput. Syst. Sci. 7, 448–461 (1973)


  5. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Proceedings of the International Colloquium on Automata, Languages, and Programming (2002)

  6. Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. In: Proceedings of the International Conference on Very Large Data Bases (2008)

  7. Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1), 58–75 (2005)


  8. Cormode, G., Korn, F., Muthukrishnan, S., Johnson, T., Spatscheck, O., Srivastava, D.: Holistic UDAFs at streaming speeds. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 35–46 (2004)

  9. Cormode, G., Garofalakis, M., Muthukrishnan, S., Rastogi, R.: Holistic aggregates in a networked world: distributed tracking of approximate quantiles. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2005)

  10. Cormode, G., Korn, F., Muthukrishnan, S., Srivastava, D.: Space- and time-efficient deterministic algorithms for biased quantiles over data streams. In: Proceedings of the ACM Symposium on Principles of Database Systems (2006)

  11. Felber, D., Ostrovsky, R.: A randomized online quantile summary in \(O(\frac{1}{\varepsilon }\log \frac{1}{\varepsilon })\) words (2015). CoRR abs/1503.01156, http://arxiv.org/abs/1503.01156

  12. Ganguly, S., Majumder, A.: CR-precis: A deterministic summary structure for update data streams. In: ESCAPE (2007)

  13. Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.J.: How to summarize the universe: dynamic maintenance of quantiles. In: Proceedings of the International Conference on Very Large Data Bases (2002)

  14. Govindaraju, N.K., Raghuvanshi, N., Manocha, D.: Fast and approximate stream mining of quantiles and frequencies using graphics processors. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2005)

  15. Greenwald, M., Khanna, S.: Space-efficient online computation of quantile summaries. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2001)

  16. Greenwald, M., Khanna, S.: Power conserving computation of order-statistics over sensor networks. In: Proceedings of the ACM Symposium on Principles of Database Systems (2004)

  17. Huang, Z., Wang, L., Yi, K., Liu, Y.: Sampling based algorithms for quantile computation in sensor networks. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2011)

  18. Hung, R.Y.S., Ting, H.F.: An \(\varOmega (\frac{1}{\varepsilon }\log \frac{1}{\varepsilon })\) space lower bound for finding \(\varepsilon \)-approximate quantiles in a data stream. In: FAW (2010)

  19. Lagrange, J.L.: Mécanique analytique, vol. 1. Mallet-Bachelier, Paris (1853)


  20. Li, C., Hay, M., Rastogi, V., Miklau, G., McGregor, A.: Optimizing histogram queries under differential privacy (2009). CoRR abs/0912.4742, http://arxiv.org/abs/0912.4742

  21. Manku, G.S., Rajagopalan, S., Lindsay, B.G.: Approximate medians and other quantiles in one pass and with limited memory. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (1998)

  22. Manku, G.S., Rajagopalan, S., Lindsay, B.G.: Random sampling techniques for space efficient online computation of order statistics of large datasets. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (1999)

  23. Munro, J.I., Paterson, M.S.: Selection and sorting with limited storage. Theor. Comput. Sci. 12, 315–323 (1980)


  24. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with sawzall. Dyn. Grids Worldw. Comput. 13(4), 277–298 (2005)


  25. Rao, C.R.: Linear Statistical Inference and its Applications, vol. 22. Wiley, New Jersey (2009)


  26. Shrivastava, N., Buragohain, C., Agrawal, D., Suri, S.: Medians and beyond: new aggregation techniques for sensor networks. In: Proceedings of the ACM SenSys (2004)

  27. Suri, S., Toth, C., Zhou, Y.: Range counting over multidimensional data streams. Discrete Comput. Geom. 36, 633–655 (2006)


  28. Vapnik, V.N., Chervonenkis, A.Y.: On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264–280 (1971)


  29. Wang, L., Luo, G., Yi, K., Cormode, G.: Quantiles over data streams: an experimental study. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2013)

  30. Yi, K., Zhang, Q.: Optimal tracking of distributed heavy hitters and quantiles. Algorithmica 65(1), 206–223 (2013)



Acknowledgments

The first three authors are supported by HKRGC Grants GRF-621413, GRF-16211614, and GRF-16200415, and by a Microsoft Grant MRA14EG05. The work of GC is supported in part by European Research Council Grant ERC-2014-CoG 647557, the Yahoo Faculty Research and Engagement Program and a Royal Society Wolfson Research Merit Award.

Author information

Correspondence to Ge Luo.

Additional lemmas and proofs

1.1 Size of the truncated tree \(\hat{T}\)

Lemma 1

The truncated tree \(\hat{T}\) has size \(O({1 \over \varepsilon } \log u)\) in expectation.

Proof

Recall that only nodes whose estimated frequency is above \(\eta \varepsilon n\) are added to the truncated tree \(\hat{T}\), where \(\eta \) is a constant. We classify these nodes into heavy nodes and non-heavy nodes. A node is heavy if its true frequency is above \({1\over 2}\eta \varepsilon n\), and non-heavy otherwise.

We first observe that, on any level of the dyadic structure, there are at most \({n \over {1\over 2} \eta \varepsilon n} = O(1/\varepsilon )\) heavy nodes, treating \(\eta \) as a (small) constant. So even if they are all added to \(\hat{T}\), there are only \(O({1\over \varepsilon } \log u)\) of them in total.

For a non-heavy node v, the Count-Sketch may overestimate its frequency to above \(\eta \varepsilon n\) with a constant probability, say 1/4. Note that \(\varTheta (u)\) non-heavy nodes can be overestimated in expectation, but we will argue below that only \(O({1\over \varepsilon }\log u)\) of them are added during the top-down construction of \(\hat{T}\).

Note that for a non-heavy node to be added, its parent must be a heavy node or another non-heavy node that has been overestimated. Thus, all the non-heavy nodes in \(\hat{T}\) make up a number of subtrees, where the root of each subtree must be a child of a heavy node. Let \({\mathbf {t}}\) be any such non-heavy subtree, and we will bound \({\mathrm {E}}[|{\mathbf {t}}|]\). Let r be the root of \({\mathbf {t}}\). For any node v below r, let \(I_v\) be an indicator variable where \(I_v=1\) if \(v\in {\mathbf {t}}\) and \(I_v=0\) otherwise. Let d(v) be the depth of v in \({\mathbf {t}}\), and we define \(d(r) = 0\). For v to be added to \({\mathbf {t}}\), all of its d(v) ancestors must have been added, so \({\mathrm {E}}[I_v] \le (1/4)^{d(v)+1}\) since each level uses an independent Count-Sketch. We then have

$$\begin{aligned} \begin{aligned} {\mathrm {E}}[|{\mathbf {t}}|]&=\sum _{v \text { below } r}{\mathrm {E}}[I_v] \\&=\sum _{v \text { below } r } (1/4)^{d(v)+1} \\&\le \sum _{d = 0}^{\log u} 2^{d}(1/4)^{d+1}\le 1. \end{aligned} \end{aligned}$$

Finally, we observe that at most two such \({\mathbf {t}}\)’s can be attached to a heavy node, and there are only \(O({1\over \varepsilon }\log u)\) heavy nodes, so we conclude that there are \(O({1\over \varepsilon }\log u)\) non-heavy nodes in \(\hat{T}\) in expectation. \(\square \)
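As a sanity check on this bound (our own simulation, not part of the paper), a candidate non-heavy subtree can be modeled as a branching process on a binary tree in which every node, the root included, is kept with probability 1/4 only if its parent was kept:

```python
import random

def sample_subtree_size(max_depth: int, p: float = 0.25,
                        rng=random) -> int:
    """Size of one candidate non-heavy subtree t: the root is added only
    if it is overestimated (probability p), and each deeper node is added
    only if its parent was added and it is itself overestimated."""
    if rng.random() >= p:            # the root of t is not overestimated
        return 0
    size = 1
    alive = 1                        # nodes of t alive at the current depth
    for _ in range(max_depth):
        # Each alive node has two children; each survives with probability p.
        alive = sum(1 for _ in range(2 * alive) if rng.random() < p)
        if alive == 0:
            break
        size += alive
    return size

rng = random.Random(42)
trials = 200_000
avg = sum(sample_subtree_size(20, rng=rng) for _ in range(trials)) / trials
# Theory: E[|t|] = sum_{d >= 0} 2^d (1/4)^(d+1), which is about 1/2 <= 1.
```

With overestimation probability 1/4, the expected subtree size works out to roughly 1/2, comfortably below the bound of 1 used in the proof.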

1.2 Constraints on the \(x^*_i\)’s

Lemma 2

Let \(x^*\) be the BLUE of (4). Let \(\lambda _v\) be any solution to (5). For any leaf w, let \(Z_w= \lambda _w \sum _{z\in \text {anc}(w)\setminus r}y_z/\sigma ^2_z\), and for any internal node v, \(Z_v= \sum _{w\prec v}\lambda _w Z_w\). For any node v except the root r, let \(F_v = \sum _{w\in \text {anc}(v)\setminus \{r\}}x^*_w/\sigma ^2_w\). Let \(\varDelta = (Z_r-y_r\pi _{r\text {'s left child}})/\lambda _r\). Then, we have

$$\begin{aligned} \left\{ \begin{aligned}&x^*_r=y_r, \\&x^*_v = \left( Z_v - \lambda _v F_{\text {parent}(v)} - \lambda _v \varDelta \right) /\pi _v,\quad {\mathrm{for\;all\;nodes}}\;v\ne r. \end{aligned} \right. \end{aligned}$$

Proof

We will follow the method of Lagrange multiplier [19] to find the BLUE of (4). Since only the subtree root is known exactly, we introduce a single Lagrange multiplier \(\eta \). We set \(\sigma ^2_r=1/\eta \) instead of 0, and will later take the limit as \(\eta \) goes to \(\infty \). Denote by \({\text {diag}}(1/\sigma _v)\) the diagonal matrix with \(1/\sigma _v\) at entry (vv). We further define \(Z={\text {diag}}(1/\sigma _v){y}\) and \(U={\text {diag}}(1/\sigma _v)A\). Then, the Lagrange function can be rewritten as \((Z-Ux^*)^T(Z-Ux^*)\). By differentiation, we can derive (i) \(y_r=x_r^*\); and (ii)

$$\begin{aligned} U^TUx^* = U^TZ. \end{aligned}$$
(7)

This is sufficient to define a solution: we can solve for \(x^*\) by premultiplying by \((U^TU)^{-1}\), at the cost of computing a matrix inverse. In the following, we derive the equations stated in the lemma from (7), which lead to a much more efficient algorithm for computing \(x^*\).

Let \(\text {anc}(u,v)=\text {anc}(u)\cap \text {anc}(v)\). Let u be any node of \(T_r\). We also use \([\tau ]\) to denote the \(\tau \) leaves of \(T_r\). Then, by a simple calculation, we can see that \((U^TU)_{u,w}=\sum _{v\in \text {anc}(u,w)}\sigma _v^{-2}\), and \((U^TZ)_u=\sum _{v\in \text {anc}(u)}y_v/\sigma ^2_v\).

First, we take the weighted sum of corresponding rows on the LHS of (7) to obtain \(\sum _{u\prec v}\lambda _u(U^TU)_ux^*=\)

$$\begin{aligned} \begin{aligned}&\sum _{u\prec v}\sum _{z\in [\tau ]}\left( \sum _{ w\in \text {anc}(u,z) \setminus \text {anc}(v)}\frac{\lambda _ux^*_z}{\sigma ^2_w} + \sum _{w\in \text {anc}(u,z)\cap \text {anc}(v)}\frac{\lambda _ux^*_z}{\sigma ^2_w}\right) \\&\quad = \sum _{u\prec v}\sum _{z\prec v}\sum _{w\in \text {anc}(u,z)\setminus \text {anc}(v)}\frac{\lambda _ux^*_z}{\sigma ^2_w} + \sum _{w\in \text {anc}(v)}\sum _{u\prec v}\sum _{z\prec w}\frac{\lambda _ux^*_z}{\sigma ^2_w} \\&\quad = \sum _{u\prec v}\sum _{w\in \text {anc}(u)\setminus \text {anc}(v)}\frac{\lambda _u}{\sigma ^2_w}\sum _{z\prec w}x^*_z + \sum _{w\in \text {anc}(v)}\sum _{u\prec v}\frac{x^*_w}{\sigma ^2_w}\lambda _u \\&\quad = \sum _{u\prec v}\sum _{w\in \text {anc}(u)\setminus \text {anc}(v)}\lambda _ux^*_w/\sigma ^2_w + \lambda _v\sum _{w\in \text {anc}(v)}x^*_w/\sigma ^2_w. \\ \end{aligned} \end{aligned}$$
(8)

Note that in the last line of (8), the second component can be written as

$$\begin{aligned} \lambda _v\sum _{w\in \text {anc}(v)}\frac{x^*_w}{\sigma ^2_w} = \lambda _v\left( F_{\text {parent}(v)} + \frac{x^*_v}{\sigma ^2_v}+\frac{x^*_r}{\sigma ^2_r}\right) . \end{aligned}$$

We can also derive that the first component is

$$\begin{aligned} \sum _{u\prec v}\sum _{w\in \text {anc}(u)\setminus \text {anc}(v)}\frac{\lambda _ux^*_w}{\sigma ^2_w} = \left( \pi _v-\frac{\lambda _v}{\sigma ^2_v}\right) x^*_v . \end{aligned}$$

To see this, let us assume that this holds for any descendant of v. Then, we can derive \(\sum _{u\prec v}\sum _{w\in \text {anc}(u)\setminus \text {anc}(v)}\frac{\lambda _ux^*_w}{\sigma ^2_w}\)

$$\begin{aligned} \begin{aligned}&=\sum _{\{s\text { is a child of }v\}}\sum _{u\prec s}\left( \frac{\lambda _ux^*_s}{\sigma ^2_s} + \sum _{w\in \text {anc}(u)\setminus \text {anc}(s)}\frac{\lambda _ux^*_w}{\sigma ^2_w} \right) \\&= \sum _{\{s\text { is a child of }v\}} \left( \frac{\lambda _sx^*_s}{\sigma ^2_s} +\sum _{u\prec s}\sum _{w\in \text {anc}(u)\setminus \text {anc}(s)}\frac{\lambda _ux^*_w}{\sigma ^2_w} \right) \\&= \sum _{\{s\text { is a child of }v\}}\left( \frac{\lambda _s x^*_s}{\sigma ^2_s}+ \left( \pi _s-\frac{\lambda _s}{\sigma ^2_s}\right) x^*_s \right) \\&= \sum _{\{s\text { is a child of }v\}} \pi _sx^*_s =\pi _s x^*_v = \left( \pi _v-\frac{\lambda _v}{\sigma ^2_v}\right) x^*_v. \end{aligned} \end{aligned}$$

Combining the above two results, we have

$$\begin{aligned} \sum _{u\prec v}\lambda _u\left( U^TU\right) _ux^*=\pi _vx^*_v + \lambda _v\left( \frac{x^*_r}{\sigma ^2_r} +F_{\text {parent}(v)}\right) . \end{aligned}$$
(9)

Secondly, we take the weighted sum of corresponding rows on the RHS of (7) to obtain

$$\begin{aligned} \begin{aligned} \sum _{u\prec v}\lambda _u\left( U^TZ\right) _u&= \sum _{u\prec v}\lambda _u\sum _{w\in \text {anc}(u)}\frac{y_w}{\sigma ^2_w} \\&=\sum _{u\prec v}\lambda _u\left( Z_u +\frac{y_r}{\sigma ^2_r}\right) \\&=Z_v+\frac{\lambda _vy_r}{\sigma ^2_r}. \end{aligned} \end{aligned}$$
(10)

Finally, by combining (9) and (10), we derive that \(\forall v\),

$$\begin{aligned} \pi _vx^*_v = Z_v-\lambda _vF_{\text {parent}(v)} -\lambda _v\left( x^*_r-y_r\right) \eta . \end{aligned}$$
(11)

Substituting v by r in (11), we can derive

$$\begin{aligned} x^*_r = \frac{\left( \frac{Z_r}{\eta }+y_r\lambda _r\right) }{ \left( \frac{\pi _r}{\eta }+\lambda _r\right) }. \end{aligned}$$

We already have \(x^*_r=y_r\). Then, we can conclude that either \(\eta =+\infty \) or \(y_r\pi _r=Z_r\). Since \(y_r\) is given and is independent of \(\pi _r\) and \(Z_r\), in general \(y_r\pi _r\ne Z_r\), which implies \(\eta =+\infty \). To handle this infinity, we first express

$$\begin{aligned} \begin{aligned} \varDelta (\eta ) = \left( x^*_r-y_r\right) \eta = \left( Z_r - y_r\pi _s\right) /\left( \pi _r/\eta +\lambda _r\right) , \end{aligned} \end{aligned}$$

where s is a child of root r. Now we take limit of \(\varDelta (\eta )\) and derive that

$$\begin{aligned} \varDelta = \lim _{\eta \rightarrow +\infty }\varDelta (\eta ) = \left( Z_r-y_r\pi _s\right) /\lambda _r. \end{aligned}$$

Finally by taking limit on (11) for any \(v\ne r\), we derive \(x^*_v=(Z_v-\lambda _v F_{\text {parent}(v)}-\lambda _v\varDelta )/\pi _v \), and the lemma is proved. \(\square \)
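For intuition, the normal equations (7) with the root constraint can also be solved directly on a toy instance by writing down the KKT system of the Lagrangian; this is the inefficient matrix route that the recurrences in Lemma 2 avoid. The following numpy sketch is our own illustration (the tree shape, \(\sigma _v\) values, and variable names are assumptions for the example); with noise-free observations the BLUE recovers the true leaf counts exactly:

```python
import numpy as np

# Full binary tree with 4 leaves; node order:
# [root, left internal, right internal, leaf0, leaf1, leaf2, leaf3].
# Row v of A sums the leaves below node v.
A = np.array([
    [1, 1, 1, 1],   # root covers all leaves
    [1, 1, 0, 0],   # left internal node
    [0, 0, 1, 1],   # right internal node
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
], dtype=float)

x_true = np.array([3.0, 1.0, 4.0, 2.0])                 # true leaf counts
y = A @ x_true                                          # noise-free observations
sigma = np.array([1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0])  # per-node std devs

# Weighted least squares with the root observation enforced exactly:
#   minimize   sum_v (y_v - (A x)_v)^2 / sigma_v^2
#   subject to (A x)_root = y_root.
W2 = np.diag(1.0 / sigma**2)
G = A.T @ W2 @ A
a_r = A[0]                                   # root row: the hard constraint
KKT = np.block([[2 * G, a_r[:, None]],
                [a_r[None, :], np.zeros((1, 1))]])
rhs = np.concatenate([2 * A.T @ W2 @ y, [y[0]]])
x_star = np.linalg.solve(KKT, rhs)[:4]       # BLUE of the leaf counts
```

Here the leaf rows of A make \(U^TU\) nonsingular, so the KKT system has a unique solution; the point of the lemma is that the same \(x^*\) can be computed without forming or solving any matrix system.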

1.3 Covariance analysis

Lemma 3

Suppose we build a Count-Sketch with \(d=1\) row and w columns on a vector x. For any two elements \(u\ne v\), let \(y_u\) and \(y_v\) be the estimators for \(x_u\) and \(x_v\). Then, \({\mathrm {Cov}}(y_u,y_v) = x_u x_v/w\).

Proof

Recall that in the Count-Sketch, for any element v, the estimator is computed as \(y_v = g(v)\sum _{z}g(z)x_zI_v(z)\), where \(I_v(z)=1\) if \(h(v)=h(z)\) and 0 otherwise. So we have

$$\begin{aligned} \begin{aligned} {\mathrm {Cov}}(y_u,y_v)&={\mathrm {E}}(y_u y_v)-{\mathrm {E}}(y_u){\mathrm {E}}(y_v)\\&={\mathrm {E}}\left[ \left( g(u)\sum _{h(i)=h(u)}g(i)x_i\right) \left( g(v)\sum _{h(j)=h(v)}g(j)x_j\right) \right] \\&\quad -x_u x_v\\&=\sum _{i,j}{\mathrm {E}}[g(u)g(v)g(i)g(j)]x_ix_j {\mathrm {E}}[I_u(i)I_v(j)]-x_u x_v \end{aligned} \end{aligned}$$

Let \(f(i,j)={\mathrm {E}}[g(u)g(v)g(i)g(j)]x_ix_j{\mathrm {E}}[I_u(i)I_v(j)]\). We find that \(f(u,v)=x_ux_v\) and \(f(v,u)= x_u x_v/w\) (note that the last factor in \(f(i,j)\) is not symmetric in i and j). If \(\{i,j\}\ne \{u,v\}\), then \(f(i,j)=0\), since \(g(\cdot )\) is a 4-wise independent hash function. Therefore, we derive that \({\mathrm {Cov}}(y_u,y_v)=x_u x_v/w\). \(\square \)

On the other hand, prior analysis of the Count-Sketch shows that \({\mathrm {Var}}(y_v) = {1\over w} \sum _{i} x_i^2\). Thus, the covariance is usually an order of magnitude smaller than the variance.
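Lemma 3 is also easy to check empirically. The following Monte Carlo sketch is our own illustration (it substitutes fully random hash functions, which are in particular 2-wise and 4-wise independent, for the hash families a practical Count-Sketch would use):

```python
import random

def count_sketch_estimates(x, w, rng):
    """One Count-Sketch with d=1 row and w columns over vector x.
    Returns the estimator y_v for every element v."""
    n = len(x)
    h = [rng.randrange(w) for _ in range(n)]      # bucket hash
    g = [rng.choice((-1, 1)) for _ in range(n)]   # sign hash
    buckets = [0.0] * w
    for i in range(n):
        buckets[h[i]] += g[i] * x[i]
    return [g[v] * buckets[h[v]] for v in range(n)]

rng = random.Random(7)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
w, u, v, trials = 4, 2, 4, 200_000                # x_u = 3, x_v = 5
s_u = s_v = s_uv = 0.0
for _ in range(trials):
    y = count_sketch_estimates(x, w, rng)
    s_u += y[u]; s_v += y[v]; s_uv += y[u] * y[v]
cov = s_uv / trials - (s_u / trials) * (s_v / trials)
# Lemma 3 predicts Cov(y_u, y_v) = x_u * x_v / w = 3 * 5 / 4 = 3.75.
```

With \(x_u=3\), \(x_v=5\), and \(w=4\), the empirical covariance concentrates around \(x_ux_v/w=3.75\), while the variance of each estimator is \({1\over w}\sum _i x_i^2 = 51\), illustrating the gap noted above.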


About this article


Cite this article

Luo, G., Wang, L., Yi, K. et al. Quantiles over data streams: experimental comparisons, new analyses, and further improvements. The VLDB Journal 25, 449–472 (2016). https://doi.org/10.1007/s00778-016-0424-7
