Article

Approximately detecting duplicates for streaming data using stable bloom filters

Authors:
Fan Deng, University of Alberta, Edmonton, Alberta, Canada
Davood Rafiei, University of Alberta, Edmonton, Alberta, Canada

SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, June 2006, Pages 25–36
https://doi.org/10.1145/1142473.1142477

Published: 27 June 2006

ABSTRACT

Traditional duplicate elimination techniques are not applicable to many data stream applications. In general, precisely eliminating duplicates in an unbounded data stream is not feasible in many streaming scenarios. Therefore, we target approximately eliminating duplicates in streaming environments given limited space. Based on a well-known bitmap sketch, we introduce a data structure, the Stable Bloom Filter (SBF), and a novel and simple algorithm. The basic idea is as follows: since there is no way to store the whole history of the stream, SBF continuously evicts stale information so that it has room for more recent elements. After deriving some properties of SBF analytically, we show that a tight upper bound on the false positive rate is guaranteed. In our empirical study, we compare SBF to alternative methods. The results show that our method is superior in terms of both accuracy and time efficiency when a fixed small space and an acceptable false positive rate are given.
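
The eviction idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name, the parameter names (`num_cells`, `num_hashes`, `num_decrements`, `max_value`), the default values, and the use of salted SHA-256 for hashing are all assumptions made for illustration. The sketch keeps an array of small counters, decays a few randomly chosen cells on every arrival, and then sets the arriving element's cells to the maximum value.

```python
import hashlib
import random


class StableBloomFilterSketch:
    """Illustrative sketch only: small counters with random decay."""

    def __init__(self, num_cells=1024, num_hashes=3, num_decrements=10,
                 max_value=3, seed=0):
        # All parameter names and defaults are hypothetical, untuned choices.
        self.cells = [0] * num_cells
        self.num_hashes = num_hashes
        self.num_decrements = num_decrements
        self.max_value = max_value
        self.rng = random.Random(seed)

    def _positions(self, item):
        # Derive num_hashes cell indices from salted SHA-256 digests
        # (an assumed hashing scheme, chosen for simplicity).
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % len(self.cells))
        return positions

    def seen_then_add(self, item):
        """Return True if item looks like a duplicate, then record it."""
        positions = self._positions(item)
        # Report a duplicate only if every probed cell is non-zero.
        duplicate = all(self.cells[pos] > 0 for pos in positions)
        # Evict stale information: decay a few randomly chosen cells.
        for _ in range(self.num_decrements):
            j = self.rng.randrange(len(self.cells))
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Refresh this item's cells so recent elements stay visible.
        for pos in positions:
            self.cells[pos] = self.max_value
        return duplicate


if __name__ == "__main__":
    sbf = StableBloomFilterSketch()
    for url in ["a.example/1", "a.example/2", "a.example/1"]:
        print(url, sbf.seen_then_add(url))
```

Because decay can clear an element's cells before it reappears, a structure of this kind can miss true duplicates as well as report false ones. The parameter values shown here are arbitrary and would need tuning for a given space budget and target false positive rate.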


Index Terms

Approximately detecting duplicates for streaming data using stable bloom filters
1. Information systems
  1. Data management systems


Published in
SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data
June 2006
830 pages
ISBN: 1595934340
DOI: 10.1145/1142473
General Chairs: Clement Yu (University of Illinois at Chicago), Peter Scheuermann (Northwestern University)
Program Chair: Surajit Chaudhuri (Microsoft Research)
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
algorithms
approximation
data stream
duplicate detection
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
