
Bias-aware sketches

Authors: Jiecao Chen (Indiana University), Qin Zhang (Indiana University)

Proceedings of the VLDB Endowment, Volume 10, Issue 9, pp. 961–972. https://doi.org/10.14778/3099622.3099627

Published: 01 May 2017

Abstract

Linear sketching algorithms are widely used for processing large-scale distributed and streaming datasets. They are popular largely because linear sketches compose naturally in the distributed model and can be updated efficiently in the streaming model. The error of a linear sketch is typically expressed in terms of the sum of the coordinates of the input vector excluding the largest ones, that is, the mass on the tail of the vector. These algorithms therefore perform well only when the mass on the tail is small, which is not always the case: in many real-world datasets the coordinates of the input vector have a bias, which produces a large mass on the tail.
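To make the composition and update properties concrete, here is a minimal sketch of a Count-Sketch-style linear sketch. It is an illustration only, not the construction from the paper: the class name `CountSketch`, its parameters (`depth`, `width`, `seed`), and the use of explicit random tables in place of hash functions are all assumptions chosen for brevity.

```python
import random

class CountSketch:
    """Toy Count-Sketch: a linear map from a length-n vector to a depth x width table.
    Hypothetical illustration; names and parameters are not from the paper."""

    def __init__(self, n, depth=5, width=64, seed=0):
        rng = random.Random(seed)  # same seed => same random tables => mergeable
        self.depth, self.width = depth, width
        # Explicit random tables stand in for hash functions (fine for small n).
        self.bucket = [[rng.randrange(width) for _ in range(n)] for _ in range(depth)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def update(self, i, delta):
        """Streaming update: coordinate i changes by delta."""
        for r in range(self.depth):
            self.table[r][self.bucket[r][i]] += self.sign[r][i] * delta

    def merge(self, other):
        """Distributed composition: sketches built with the same seed add entrywise."""
        for r in range(self.depth):
            for c in range(self.width):
                self.table[r][c] += other.table[r][c]

    def query(self, i):
        """Point estimate of coordinate i: median of the per-row estimates."""
        ests = sorted(self.sign[r][i] * self.table[r][self.bucket[r][i]]
                      for r in range(self.depth))
        return ests[len(ests) // 2]

# Two sites sketch disjoint parts of a stream; a third sketch sees everything.
site_a, site_b, central = (CountSketch(n=100, seed=42) for _ in range(3))
for i, d in [(3, 10), (7, 2)]:
    site_a.update(i, d)
    central.update(i, d)
for i, d in [(3, -4), (50, 1)]:
    site_b.update(i, d)
    central.update(i, d)

site_a.merge(site_b)
# Linearity: the merged sketch equals the sketch of the combined stream.
assert site_a.table == central.table
```

Because every update is an addition into the table, merging two sketches built with the same random choices is just entrywise addition, which is the property the abstract refers to.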

In this paper, we propose linear sketches that are bias-aware. We rigorously prove that they achieve strictly better error guarantees than the corresponding existing sketches, and demonstrate their practicality and superiority via an extensive experimental evaluation on both real and synthetic datasets.
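The effect of a bias on the tail mass can be seen with a small calculation. The snippet below is a hedged illustration, not the paper's algorithm: the data are hypothetical, the helper `tail_mass` is an assumed name, and the bias is estimated directly from the full vector with a median, which a real sketch could not do without storing the vector.

```python
from statistics import median_low

def tail_mass(x, k):
    """Sum of |x_i| over all coordinates except the k largest in magnitude."""
    mags = sorted((abs(v) for v in x), reverse=True)
    return sum(mags[k:])

# Hypothetical data: 1000 coordinates share a bias of 100; 5 of them carry an extra 1000.
n, k, bias = 1000, 5, 100
x = [bias] * n
for i in range(k):
    x[i * 200] += 1000

b_hat = median_low(x)               # assumed bias estimate; not taken from the paper
debiased = [v - b_hat for v in x]

print(tail_mass(x, k))              # 99500: the bias dominates the tail
print(tail_mass(debiased, k))       # 0: the tail vanishes once the bias is removed
```

An error bound proportional to the first quantity is far weaker than one proportional to the second, which is the gap the abstract describes.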


Published in

Proceedings of the VLDB Endowment, Volume 10, Issue 9
May 2017
73 pages
ISSN: 2150-8097
Publisher: VLDB Endowment
