DOI: 10.1145/1142473.1142477
Approximately detecting duplicates for streaming data using stable bloom filters

Published: 27 June 2006

ABSTRACT

Traditional duplicate elimination techniques are not applicable to many data stream applications. In general, precisely eliminating duplicates in an unbounded data stream is not feasible in many streaming scenarios. We therefore aim to eliminate duplicates approximately in streaming environments, given limited space. Based on a well-known bitmap sketch, we introduce a data structure, the Stable Bloom Filter (SBF), and a novel and simple algorithm. The basic idea is as follows: since there is no way to store the whole history of the stream, SBF continuously evicts stale information so that it has room for more recent elements. After deriving some properties of SBF analytically, we show that a tight upper bound on the false positive rate is guaranteed. In our empirical study, we compare SBF to alternative methods. The results show that our method is superior in terms of both accuracy and time efficiency when a fixed small space and an acceptable false positive rate are given.
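
The eviction idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the update rule, so everything concrete here is an assumption made for illustration: the class name `StableBloomFilter`, the parameters `m` (number of cells), `k` (hash functions), `p` (cells decremented per insert) and `max_value` (cell ceiling), the salted BLAKE2b hashing, and the choice to decrement a contiguous run of cells starting at a random position. It shows one plausible reading of "continuously evicts the stale information", not the authors' reference implementation.

```python
import hashlib
import random


class StableBloomFilter:
    """Illustrative sketch: an array of small counters that decay over time."""

    def __init__(self, m=1 << 16, k=3, p=10, max_value=3, seed=0):
        self.m = m                  # number of cells (assumed parameter)
        self.k = k                  # number of hash functions (assumed)
        self.p = p                  # cells decremented per insert (assumed)
        self.max_value = max_value  # value a cell is reset to on insert
        self.cells = bytearray(m)   # all cells start at zero
        self.rng = random.Random(seed)

    def _indexes(self, item):
        # Derive k cell indexes from salted BLAKE2b digests.
        # The hash family is an illustrative choice, not taken from the paper.
        data = str(item).encode("utf-8")
        indexes = []
        for i in range(self.k):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            indexes.append(int.from_bytes(digest, "little") % self.m)
        return indexes

    def seen_then_add(self, item):
        """Report whether item looks like a duplicate, then record it."""
        indexes = self._indexes(item)
        # An item is reported as a duplicate only if all its cells are nonzero.
        duplicate = all(self.cells[i] > 0 for i in indexes)

        # Evict stale information: decrement p cells from a random start.
        start = self.rng.randrange(self.m)
        for offset in range(self.p):
            j = (start + offset) % self.m
            if self.cells[j] > 0:
                self.cells[j] -= 1

        # Record the new item by resetting its cells to the ceiling.
        for i in indexes:
            self.cells[i] = self.max_value
        return duplicate
```

Calling `seen_then_add` twice in a row with the same item returns `False` and then `True`. Because cells decay, an item that has not appeared for a long time can later be reported as new again, which is the price paid for keeping the structure at a fixed size on an unbounded stream.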

    • Published in

      SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data
      June 2006
      830 pages
      ISBN:1595934340
      DOI:10.1145/1142473

      Copyright © 2006 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate: 785 of 4,003 submissions, 20%
