
Very sparse random projections

Published: 20 August 2006

ABSTRACT

There has been considerable interest in random projections, an approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space. Let A ∈ R^{n×D} be our n points in D dimensions. The method multiplies A by a random matrix R ∈ R^{D×k}, reducing the D dimensions down to just k to speed up the computation. R typically consists of entries drawn from the standard normal N(0,1). It is well known that random projections preserve pairwise distances (in expectation). Achlioptas proposed sparse random projections by replacing the N(0,1) entries in R with entries in {-1, 0, 1} with probabilities {1/6, 2/3, 1/6}, achieving a threefold speedup in processing time. We recommend using R with entries in {-1, 0, 1} with probabilities {1/(2√D), 1 − 1/√D, 1/(2√D)}, achieving a significant √D-fold speedup with little loss in accuracy.
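The scheme described in the abstract is straightforward to sketch in code. Below is a minimal NumPy illustration (function and variable names are my own, not from the paper): with s = √D, entries of R take the values ±√s with probability 1/(2s) each and 0 otherwise, so every entry has mean 0 and variance 1, and projected distances (after scaling by 1/√k) estimate the original distances without bias.

```python
import numpy as np

def very_sparse_projection(D, k, rng=None):
    """Draw R in R^{D x k}: entries are +sqrt(s), 0, -sqrt(s) with
    probabilities 1/(2s), 1 - 1/s, 1/(2s), where s = sqrt(D).
    Each entry has mean 0 and unit variance."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.sqrt(D)
    signs = rng.choice([1.0, 0.0, -1.0], size=(D, k),
                       p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return np.sqrt(s) * signs

def project(A, R):
    """Map the rows of A from D down to k = R.shape[1] dimensions;
    the 1/sqrt(k) factor makes projected distances unbiased."""
    k = R.shape[1]
    return (A @ R) / np.sqrt(k)
```

Since only about a 1/√D fraction of the entries of R are nonzero, the product A @ R needs roughly nk√D operations instead of nkD; a practical implementation would store R in a sparse format (e.g. scipy.sparse) to realize that speedup.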

References

  1. Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671--687, 2003.
  2. Dimitris Achlioptas, Frank McSherry, and Bernhard Schölkopf. Sampling techniques for kernel methods. In Proc. of NIPS, pages 335--342, Vancouver, BC, Canada, 2001.
  3. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proc. of STOC, pages 20--29, Philadelphia, PA, 1996.
  4. Rosa Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. In Proc. of FOCS (also to appear in Machine Learning), pages 616--623, New York, NY, 1999.
  5. Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: Applications to image and text data. In Proc. of KDD, pages 245--250, San Francisco, CA, 2001.
  6. Jeremy Buhler and Martin Tompa. Finding motifs using random projections. Journal of Computational Biology, 9(2):225--242, 2002.
  7. Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. of STOC, pages 380--388, Montreal, Quebec, Canada, 2002.
  8. G. P. Chistyakov and F. Götze. Limit distributions of studentized means. The Annals of Probability, 32(1A):28--77, 2004.
  9. Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60--65, 2003.
  10. Susan T. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments and Computers, 23(2):229--236, 1991.
  11. Richard Durrett. Probability: Theory and Examples. Duxbury Press, Belmont, CA, second edition, 1995.
  12. William Feller. An Introduction to Probability Theory and Its Applications (Volume II). John Wiley & Sons, New York, NY, second edition, 1971.
  13. Xiaoli Zhang Fern and Carla E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proc. of ICML, pages 186--193, Washington, DC, 2003.
  14. Dmitriy Fradkin and David Madigan. Experiments with random projections for machine learning. In Proc. of KDD, pages 517--522, Washington, DC, 2003.
  15. P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory A, 44(3):355--362, 1987.
  16. Navin Goel, George Bebis, and Ara Nefian. Face recognition experiments with random projection. In Proc. of SPIE, pages 426--437, Bellingham, WA, 2005.
  17. Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115--1145, 1995.
  18. F. Götze. On the rate of convergence in the multivariate CLT. The Annals of Probability, 19(2):724--739, 1991.
  19. Piotr Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. of FOCS, pages 189--197, Redondo Beach, CA, 2000.
  20. Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. of STOC, pages 604--613, Dallas, TX, 1998.
  21. W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189--206, 1984.
  22. Samuel Kaski. Dimensionality reduction by random mapping: Fast similarity computation for clustering. In Proc. of IJCNN, pages 413--418, Piscataway, NJ, 1998.
  23. Man Lan, Chew Lim Tan, Hwee-Boon Low, and Sam Yuan Sung. A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In Proc. of WWW, pages 1032--1033, Chiba, Japan, 2005.
  24. Erich L. Lehmann and George Casella. Theory of Point Estimation. Springer, New York, NY, second edition, 1998.
  25. Will E. Leland, Murad S. Taqqu, Walter Willinger, and Daniel V. Wilson. On the self-similar nature of Ethernet traffic. IEEE/ACM Transactions on Networking, 2(1):1--15, 1994.
  26. Edda Leopold and Jörg Kindermann. Text categorization with support vector machines: How to represent texts in input space? Machine Learning, 46(1-3):423--444, 2002.
  27. Henry C. M. Leung, Francis Y. L. Chin, S. M. Yiu, Roni Rosenfeld, and W. W. Tsang. Finding motifs with insufficient number of strong binding sites. Journal of Computational Biology, 12(6):686--701, 2005.
  28. Ping Li, Trevor J. Hastie, and Kenneth W. Church. Improving random projections using marginal information. In Proc. of COLT, Pittsburgh, PA, 2006.
  29. Jessica Lin and Dimitrios Gunopulos. Dimensionality reduction by random projection and latent semantic indexing. In Proc. of SDM, San Francisco, CA, 2003.
  30. Bing Liu, Yiming Ma, and Philip S. Yu. Discovering unexpected information from your competitors' web sites. In Proc. of KDD, pages 144--153, San Francisco, CA, 2001.
  31. Kun Liu, Hillol Kargupta, and Jessica Ryan. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering, 18(1):92--106, 2006.
  32. B. F. Logan, C. L. Mallows, S. O. Rice, and L. A. Shepp. Limit distributions of self-normalized sums. The Annals of Probability, 1(5):788--809, 1973.
  33. Chris D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, 1999.
  34. M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323--351, 2005.
  35. Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. In Proc. of PODS, pages 159--168, Seattle, WA, 1998.
  36. Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. Randomized algorithms and NLP: Using locality sensitive hash function for high speed noun clustering. In Proc. of ACL, pages 622--629, Ann Arbor, MI, 2005.
  37. Jason D. Rennie, Lawrence Shih, Jaime Teevan, and David R. Karger. Tackling the poor assumptions of naive Bayes text classifiers. In Proc. of ICML, pages 616--623, Washington, DC, 2003.
  38. Ozgur D. Sahin, Aziz Gulbeden, Fatih Emekcci, Divyakant Agrawal, and Amr El Abbadi. PRISM: Indexing multi-dimensional data in P2P networks using reference vectors. In Proc. of ACM Multimedia, pages 946--955, Singapore, 2005.
  39. Gerard Salton and Chris Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513--523, 1988.
  40. I. S. Shiganov. Refinement of the upper bound of the constant in the central limit theorem. Journal of Mathematical Sciences, 35(3):2545--2550, 1986.
  41. Chunqiang Tang, Sandhya Dwarkadas, and Zhichen Xu. On scaling latent semantic indexing for large peer-to-peer systems. In Proc. of SIGIR, pages 112--121, Sheffield, UK, 2004.
  42. Santosh Vempala. Random projection: A new approach to VLSI layout. In Proc. of FOCS, pages 389--395, Palo Alto, CA, 1998.
  43. Santosh Vempala. The Random Projection Method. American Mathematical Society, Providence, RI, 2004.
  44. William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Springer-Verlag, New York, NY, fourth edition, 2002.
  45. Clement T. Yu, K. Lam, and Gerard Salton. Term weighting in information retrieval using the term precision model. Journal of the ACM, 29(1):152--170, 1982.

• Published in

  KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  August 2006, 986 pages
  ISBN: 1595933395
  DOI: 10.1145/1150402
  Copyright © 2006 ACM


  Publisher: Association for Computing Machinery, New York, NY, United States
