research-article

Bias-aware sketches

Published: 01 May 2017

Abstract

Linear sketching algorithms are widely used for processing large-scale distributed and streaming datasets. Their popularity is largely due to the fact that linear sketches compose naturally in the distributed model and can be updated efficiently in the streaming model. The errors of linear sketches are typically expressed in terms of the sum of the coordinates of the input vector excluding the largest ones, that is, the mass on the tail of the vector. These algorithms therefore perform well only when the mass on the tail is small. This is not always the case, however: in many real-world datasets the coordinates of the input vector have a bias, which generates a large mass on the tail.
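As a rough illustration of the composability described above, the following minimal sketch implements a Count-Sketch-style linear sketch in plain Python and checks that merging two sketches gives the same counters as sketching the combined stream. The class name `CountSketch`, its parameters, and the use of explicit lookup tables in place of hash functions are assumptions made for this example; it is not the construction analyzed in the paper.

```python
import random


class CountSketch:
    """Minimal Count-Sketch over vectors indexed 0..n-1 (illustrative only)."""

    def __init__(self, n, width, depth, seed=0):
        rng = random.Random(seed)  # same seed => same bucket/sign assignments
        self.width, self.depth = width, depth
        # Explicit lookup tables stand in for hash functions (assumption for brevity).
        self.bucket = [[rng.randrange(width) for _ in range(n)] for _ in range(depth)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def update(self, i, delta=1):
        """Streaming update x[i] += delta; touches one counter per row."""
        for r in range(self.depth):
            self.table[r][self.bucket[r][i]] += self.sign[r][i] * delta

    def merge(self, other):
        """Add another sketch built with the same seed and dimensions."""
        for r in range(self.depth):
            for c in range(self.width):
                self.table[r][c] += other.table[r][c]

    def query(self, i):
        """Median of the per-row estimates of x[i]."""
        ests = sorted(self.sign[r][i] * self.table[r][self.bucket[r][i]]
                      for r in range(self.depth))
        return ests[len(ests) // 2]


# Two sites sketch their local streams; a coordinator merges the sketches.
n, width, depth = 1000, 64, 5
site_a = CountSketch(n, width, depth, seed=42)
site_b = CountSketch(n, width, depth, seed=42)
combined = CountSketch(n, width, depth, seed=42)  # sketch of the union, for comparison

stream_a = [3, 3, 7, 3, 11]
stream_b = [7, 3, 500, 7]
for i in stream_a:
    site_a.update(i)
    combined.update(i)
for i in stream_b:
    site_b.update(i)
    combined.update(i)

site_a.merge(site_b)
# Linearity: the merged sketch has exactly the counters of the union's sketch.
print(site_a.table == combined.table)
```

Because every update is an addition into fixed counters, the sketch is a linear function of the input vector, which is what makes both the per-item streaming update and the counter-wise merge valid.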

In this paper we propose linear sketches that are bias-aware. We rigorously prove that they achieve strictly better error guarantees than the corresponding existing sketches, and demonstrate their practicality and superiority via an extensive experimental evaluation on both real and synthetic datasets.
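To make the effect of a bias on the tail mass concrete, the toy computation below compares the tail mass of a biased vector with that of the same vector after a common offset is subtracted. The helper `tail_l1`, the synthetic vector, and the use of the median as a stand-in bias estimate are all assumptions chosen for illustration; the abstract does not describe how the proposed sketches estimate or remove the bias.

```python
import statistics


def tail_l1(x, k):
    """Sum of |x_i| over all coordinates except the k largest in magnitude."""
    mags = sorted((abs(v) for v in x), reverse=True)
    return sum(mags[k:])


# Synthetic vector (assumption): most coordinates sit near a bias of 100,
# and two coordinates carry large values on top of that bias.
x = [100 + (i % 3) - 1 for i in range(1000)]
x[3] += 500
x[7] += 300

k = 2
beta = statistics.median(x)        # stand-in bias estimate, not the paper's estimator
debiased = [v - beta for v in x]

print("tail mass with bias:   ", tail_l1(x, k))
print("tail mass after shift: ", tail_l1(debiased, k))
```

In this toy setting almost all of the original tail mass comes from the shared offset rather than from variation between coordinates, so an error bound stated in terms of the tail of the shifted vector is far smaller than one stated in terms of the tail of the raw vector.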



  • Published in

    Proceedings of the VLDB Endowment  Volume 10, Issue 9
    May 2017
    73 pages

    Publisher

    VLDB Endowment

