ABSTRACT
Traditional duplicate elimination techniques are not applicable to many data stream applications. Precisely eliminating duplicates in an unbounded data stream is not feasible in many streaming scenarios. Therefore, we aim to eliminate duplicates approximately in streaming environments given limited space. Based on a well-known bitmap sketch, we introduce a data structure, the Stable Bloom Filter (SBF), and a novel and simple algorithm. The basic idea is as follows: since there is no way to store the whole history of the stream, SBF continuously evicts stale information so that it has room for more recent elements. After deriving some properties of SBF analytically, we show that a tight upper bound on the false positive rate is guaranteed. In our empirical study, we compare SBF to alternative methods. The results show that our method is superior in terms of both accuracy and time efficiency when a fixed small space and an acceptable false positive rate are given.
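The eviction idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out the algorithm, so everything here is an assumption made for illustration: the class name `StableBloomFilter`, the method `seen_then_add`, the parameters `num_cells`, `num_hashes`, `max_value`, and `num_decrements`, the use of salted SHA-256 as the hash family, and the choice to decrement independently chosen random cells. It should be read as one plausible instance of "evict stale information to make room for recent elements", not as the authors' implementation.

```python
import hashlib
import random


class StableBloomFilter:
    """Illustrative sketch only; names and defaults are hypothetical."""

    def __init__(self, num_cells=1024, num_hashes=3, max_value=3,
                 num_decrements=10, seed=0):
        self.cells = [0] * num_cells          # small counters instead of single bits
        self.num_hashes = num_hashes          # cells probed per element
        self.max_value = max_value            # value a cell is reset to on insert
        self.num_decrements = num_decrements  # cells aged per arriving element
        self.rng = random.Random(seed)        # seeded for reproducible runs

    def _positions(self, item):
        # Assumed hash family: SHA-256 salted with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.cells)

    def seen_then_add(self, item):
        """Report whether item looks like a duplicate, then record it."""
        positions = list(self._positions(item))
        # Reported as a duplicate only if every probed cell is non-zero.
        duplicate = all(self.cells[p] > 0 for p in positions)
        # Evict stale information: age randomly chosen cells by one step.
        for _ in range(self.num_decrements):
            j = self.rng.randrange(len(self.cells))
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Refresh the cells of the current element to the maximum value.
        for p in positions:
            self.cells[p] = self.max_value
        return duplicate


if __name__ == "__main__":
    sbf = StableBloomFilter()
    print(sbf.seen_then_add("click-42"))  # False: the filter starts empty
    print(sbf.seen_then_add("click-42"))  # True: its cells were just set
```

Because cells are decremented on every arrival, an element that has not been seen for a long time gradually fades and may later be reported as new (a false negative), while the fraction of non-zero cells stays bounded; that trade-off is what keeps the false positive rate from growing without limit in this kind of structure.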
Index Terms
- Approximately detecting duplicates for streaming data using stable bloom filters