The average-case complexity of counting distinct elements

Author:
David P. Woodruff

IBM Almaden, San Jose, CA


ICDT '09: Proceedings of the 12th International Conference on Database Theory, March 2009, Pages 284–295. https://doi.org/10.1145/1514894.1514928

Published: 23 March 2009

ABSTRACT

We continue the study of approximating the number of distinct elements in a data stream of length n to within a (1 ± ε) factor. It is known that if the stream may consist of arbitrary data arriving in an arbitrary order, then any 1-pass algorithm requires Ω(1/ε²) bits of space to perform this task. To try to bypass this lower bound, the problem was recently studied in a model in which the stream may consist of arbitrary data, but it arrives to the algorithm in a random order. However, even in this model an Ω(1/ε²) lower bound was established. This is because the adversary can still choose the data arbitrarily. This leaves open the possibility that the problem is only hard under a pathological choice of data, which would be of little practical relevance.

We study the average-case complexity of this problem under certain distributions. Namely, we study the case when each successive stream item is drawn independently and uniformly at random from an unknown subset of d items for an unknown value of d. This captures the notion of random uncorrelated data. For a wide range of values of d and n, we design a 1-pass algorithm that bypasses the Ω(1/ε²) lower bound that holds in the adversarial and random-order models, thereby showing that this model admits more space-efficient algorithms. Moreover, the update time of our algorithm is optimal. Despite these positive results, for a certain range of values of d and n we show that estimating the number of distinct elements requires Ω(1/ε²) bits of space even in this model. Our lower bound subsumes previous bounds, showing that even for natural choices of data the problem is hard.
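To make the stream model concrete, the following is a minimal sketch, not the algorithm from the paper. It simulates a stream whose items are drawn independently and uniformly from a hidden set of d items, then estimates d in one pass with the standard k-minimum-values (KMV) technique, which is swapped in here purely as a familiar baseline. KMV keeps roughly k ≈ 1/ε² hash values, so it sits in the space regime that the Ω(1/ε²) bound concerns. The names `random_stream`, `hash01`, and `kmv_estimate`, the `item-i` labels, and the use of SHA-256 as a stand-in for a random hash function are all assumptions made for illustration.

```python
import hashlib
import heapq
import random


def random_stream(d, n, seed=0):
    """Yield n items drawn independently and uniformly from a hidden set of d items."""
    rng = random.Random(seed)
    universe = [f"item-{i}" for i in range(d)]  # hypothetical item labels
    for _ in range(n):
        yield rng.choice(universe)


def hash01(item):
    """Map an item to a pseudo-random point in [0, 1) (SHA-256 is an assumption)."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(stream, k):
    """One-pass KMV estimate of the number of distinct items, using O(k) memory."""
    heap = []        # negated hashes, so heap[0] is the largest of the k smallest
    kept = set()     # the hash values currently held, to skip repeated items
    for item in stream:
        h = hash01(item)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes were seen, so the count is exact.
        return float(len(heap))
    # The k-th smallest of d uniform points lies near k/d, giving (k - 1) / h_k.
    return (k - 1) / (-heap[0])
```

With k = 256 the relative standard error of this baseline is roughly 1/√k, about 6%, so an estimate for d = 1000 should typically land within a few tens of items of the true value. The paper's point is that for random data of this kind, and for a wide range of d and n, less space than such a generic sketch suffices.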

Index Terms

1. Theory of computation
  1. Design and analysis of algorithms

Published in
ICDT '09: Proceedings of the 12th International Conference on Database Theory
March 2009
334 pages
ISBN: 9781605584232
DOI: 10.1145/1514894
Editor: Ronald Fagin, IBM Research
Copyright © 2009 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data streams
distinct elements
Qualifiers
- research-article