ABSTRACT
We continue the study of approximating the number of distinct elements in a data stream of length n to within a (1 ± ε) factor. It is known that if the stream may consist of arbitrary data arriving in an arbitrary order, then any 1-pass algorithm requires Ω(1/ε²) bits of space to perform this task. To try to bypass this lower bound, the problem was recently studied in a model in which the stream may consist of arbitrary data, but the data arrives at the algorithm in a random order. However, even in this model an Ω(1/ε²) lower bound was established, because the adversary can still choose the data arbitrarily. This leaves open the possibility that the problem is only hard under a pathological choice of data, which would be of little practical relevance.
We study the average-case complexity of this problem under certain distributions. Namely, we study the case when each successive stream item is drawn independently and uniformly at random from an unknown subset of d items for an unknown value of d. This captures the notion of random uncorrelated data. For a wide range of values of d and n, we design a 1-pass algorithm that bypasses the Ω(1/ε²) lower bound that holds in the adversarial and random-order models, thereby showing that this model admits more space-efficient algorithms. Moreover, the update time of our algorithm is optimal. Despite these positive results, for a certain range of values of d and n we show that estimating the number of distinct elements requires Ω(1/ε²) bits of space even in this model. Our lower bound subsumes previous bounds, showing that even for natural choices of data the problem is hard.
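The random-data model described above can be illustrated with a small simulation. The following Python sketch is not the paper's algorithm: it counts distinct elements exactly (using memory proportional to the number of distinct items) and only shows how a stream in this model is generated and what a (1 ± ε) guarantee means. The function names, the integer encoding of the hidden subset, and the fixed seed are assumptions made for illustration. The closed form d(1 − (1 − 1/d)^n) is the standard expectation of the number of distinct values among n independent uniform draws from d items.

```python
import random


def random_stream(d, n, seed=0):
    """Yield n items drawn independently and uniformly from a hidden set of d items.

    The hidden set is modelled as the integers 0..d-1 (an illustrative assumption).
    """
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.randrange(d)


def expected_distinct(d, n):
    """Expected number of distinct items among n uniform draws from d items."""
    return d * (1.0 - (1.0 - 1.0 / d) ** n)


def exact_distinct(stream):
    """Exact distinct count; a baseline only, since it stores every distinct item."""
    return len(set(stream))


def within_factor(estimate, truth, eps):
    """Check the (1 +/- eps) approximation guarantee."""
    return (1 - eps) * truth <= estimate <= (1 + eps) * truth


if __name__ == "__main__":
    d, n = 1000, 500
    observed = exact_distinct(random_stream(d, n, seed=1))
    print("observed distinct:", observed)
    print("expected distinct:", round(expected_distinct(d, n), 1))
```

When n is much smaller than d, almost every draw is new and the distinct count is close to n; when n is much larger than d, the count saturates near d. The interesting regime for estimation lies between these extremes.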