Copyright © 2007 Elsevier B.V. All rights reserved.
Sampling streaming data with replacement
Available online 15 March 2007.
Abstract
Simple random sampling is a widely accepted basis for estimation from a population. When data arrive as a stream, the total population size grows continuously and only one pass through the data is possible. Reservoir sampling is a method for maintaining a fixed-size random sample from streaming data. Reservoir sampling without replacement has been studied extensively, and several algorithms with sub-linear time complexity exist. Although reservoir sampling with replacement has been mentioned by some authors, it has been studied very little and only linear-time algorithms exist. A with-replacement reservoir sampling algorithm of sub-linear time complexity is introduced. A thorough complexity analysis of several approaches to the with-replacement reservoir sampling problem is also provided.
Keywords: Data stream mining; Random sampling with replacement; Reservoir sampling
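To make the problem stated in the abstract concrete, the following is a minimal sketch of one simple linear-time way to sample a stream with replacement: keep k independent size-one reservoirs, and let the t-th arriving element overwrite each slot with probability 1/t. This is an illustration under stated assumptions, not the sub-linear algorithm the article introduces, and it is not taken from the article's text. The function name, parameter names, and the use of Python's `random` module are all hypothetical choices made for this example.

```python
import random


def reservoir_sample_with_replacement(stream, k, rng=None):
    """Return k items drawn uniformly, with replacement, from a one-pass stream.

    Illustrative baseline only (hypothetical helper, not the article's method):
    each of the k slots behaves as an independent size-one reservoir.
    """
    rng = rng or random.Random()
    sample = []
    for t, item in enumerate(stream, start=1):
        if t == 1:
            # The first element fills every slot.
            sample = [item] * k
            continue
        # Each slot is overwritten independently with probability 1/t,
        # so after t elements every slot holds a uniform draw from them.
        for i in range(k):
            if rng.random() < 1.0 / t:
                sample[i] = item
    return sample
```

For example, `reservoir_sample_with_replacement(range(1000), 5)` returns a list of five values from 0 to 999, possibly with repeats. The inner loop touches all k slots for every arriving element, which is the kind of linear per-element cost that skipping-based approaches aim to avoid.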
Article Outline
- 1. Introduction
- 2. Notation and definitions
- 3. Reservoir sampling without replacement (RSXR)
- 4. Reservoir sampling with replacement (RSWR)
- 4.1. Two implementations of RSWR: RSWR-naive and RSWR-batch
- 4.2. Formal proofs for RSWR-naive and RSWR-batch
- 5. Faster sampling by skipping elements
- 6. Performance evaluation
- 6.1. Expected CPU runtime
- 6.2. Empirical study
- 7. Conclusions
- Acknowledgements
- References





