Abstract
Linear sketching algorithms are widely used for processing large-scale distributed and streaming datasets. Their popularity is largely due to the fact that linear sketches compose naturally in the distributed model and can be updated efficiently in the streaming model. The errors of linear sketches are typically expressed in terms of the sum of the coordinates of the input vector excluding the largest ones, that is, the mass on the tail of the vector. These algorithms therefore perform well only when the mass on the tail is small. This is not always the case: in many real-world datasets the coordinates of the input vector have a bias, which generates a large mass on the tail.
In this paper we propose linear sketches that are bias-aware. We rigorously prove that they achieve strictly better error guarantees than the corresponding existing sketches, and demonstrate their practicality and superiority via an extensive experimental evaluation on both real and synthetic datasets.
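The two notions the abstract relies on, linearity and tail mass, can be illustrated with a small sketch. The code below is a minimal Count-Min-style example written for illustration only; it is not the bias-aware construction proposed in the paper. The class name `CountMin`, the helpers `make_hashes` and `tail_mass`, and the assumption that the bias (here 100) is known in advance are all hypothetical choices made for this example.

```python
import random


def make_hashes(depth, width, seed=0):
    """Return `depth` hash functions mapping an int key to a bucket in [0, width)."""
    rng = random.Random(seed)
    prime = 2_147_483_647  # the Mersenne prime 2^31 - 1
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(depth)]
    return [lambda key, a=a, b=b: ((a * key + b) % prime) % width for a, b in params]


class CountMin:
    """Illustrative Count-Min-style linear sketch (not the paper's method)."""

    def __init__(self, depth, width, seed=0):
        self.hashes = make_hashes(depth, width, seed)
        self.table = [[0] * width for _ in range(depth)]

    def update(self, key, delta=1):
        # Streaming update: each row adds delta to exactly one counter.
        for row, h in zip(self.table, self.hashes):
            row[h(key)] += delta

    def merge(self, other):
        # Distributed composition: counters add entrywise.
        # Valid only when both sketches share the same hash functions (same seed).
        for mine, theirs in zip(self.table, other.table):
            for j, value in enumerate(theirs):
                mine[j] += value

    def query(self, key):
        # With nonnegative updates, every row overestimates, so take the minimum.
        return min(row[h(key)] for row, h in zip(self.table, self.hashes))


def tail_mass(x, k):
    """Sum of |x_i| after dropping the k largest-magnitude coordinates."""
    return sum(sorted((abs(v) for v in x), reverse=True)[k:])


# Linearity: sketching two streams separately and merging gives the same
# table as sketching their concatenation.
stream_a = [1, 2, 2, 7]
stream_b = [2, 7, 7, 9]
a = CountMin(4, 16, seed=1)
b = CountMin(4, 16, seed=1)
whole = CountMin(4, 16, seed=1)
for key in stream_a:
    a.update(key)
    whole.update(key)
for key in stream_b:
    b.update(key)
    whole.update(key)
a.merge(b)
assert a.table == whole.table

# Bias: every coordinate sits near 100, so the tail carries almost all the mass.
x = [100] * 1000
x[3] += 50
x[42] += 80
biased_tail = tail_mass(x, 2)                    # 99800
# Assumption: the bias is known here; subtracting it empties the tail.
shifted_tail = tail_mass([v - 100 for v in x], 2)  # 0
print(biased_tail, shifted_tail)
```

In this toy vector, the tail mass after removing the two largest coordinates is 99800, while the same vector with the bias subtracted has tail mass 0. An error bound stated in terms of tail mass is correspondingly much tighter on the shifted vector, which is the motivation the abstract gives for making sketches bias-aware.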