ABSTRACT
There has been considerable interest in random projections, an approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space. Let A ∈ R^(n×D) be our n points in D dimensions. The method multiplies A by a random matrix R ∈ R^(D×k), reducing the D dimensions down to just k to speed up the computation. R typically consists of entries drawn from the standard normal N(0, 1). It is well known that random projections preserve pairwise distances (in expectation). Achlioptas proposed sparse random projections by replacing the N(0, 1) entries in R with entries in {-1, 0, 1} with probabilities {1/6, 2/3, 1/6}, achieving a threefold speedup in processing time. We recommend using R with entries in {-1, 0, 1} with probabilities {1/(2√D), 1 − 1/√D, 1/(2√D)}, achieving a significant √D-fold speedup with little loss in accuracy.
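The scheme described in the abstract can be sketched in a few lines of NumPy. Here the nonzero entries of R are ±√s with s = √D (equivalently, entries in {-1, 0, 1} scaled by √s) so that each entry has unit variance, and the projection is divided by √k so that Euclidean distances are preserved in expectation; the variable names, sizes, and random seed below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, k = 50, 10_000, 500

# n points in D dimensions (synthetic data for illustration).
A = rng.standard_normal((n, D))

# Very sparse projection matrix: entries +sqrt(s), 0, -sqrt(s)
# with probabilities 1/(2s), 1 - 1/s, 1/(2s), where s = sqrt(D).
# Each entry then has mean 0 and variance 1.
s = np.sqrt(D)
R = rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)],
               size=(D, k),
               p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

# Projected points, n x k; dividing by sqrt(k) makes
# squared distances unbiased estimates of the originals.
B = A @ R / np.sqrt(k)

# Compare one pairwise distance before and after projection.
orig = np.linalg.norm(A[0] - A[1])
proj = np.linalg.norm(B[0] - B[1])
print(orig, proj)
```

Because only about a 1/√D fraction of R's entries are nonzero, the matrix multiplication touches roughly √D-fold fewer terms than a dense Gaussian projection, which is the source of the claimed speedup.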
- Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671--687, 2003.
- Dimitris Achlioptas, Frank McSherry, and Bernhard Schölkopf. Sampling techniques for kernel methods. In Proc. of NIPS, pages 335--342, Vancouver, BC, Canada, 2001.
- Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proc. of STOC, pages 20--29, Philadelphia, PA, 1996.
- Rosa Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. In Proc. of FOCS (also to appear in Machine Learning), pages 616--623, New York, NY, 1999.
- Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: Applications to image and text data. In Proc. of KDD, pages 245--250, San Francisco, CA, 2001.
- Jeremy Buhler and Martin Tompa. Finding motifs using random projections. Journal of Computational Biology, 9(2):225--242, 2002.
- Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. of STOC, pages 380--388, Montreal, Quebec, Canada, 2002.
- G. P. Chistyakov and F. Götze. Limit distributions of studentized means. The Annals of Probability, 32(1A):28--77, 2004.
- Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60--65, 2003.
- Susan T. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments and Computers, 23(2):229--236, 1991.
- Richard Durrett. Probability: Theory and Examples. Duxbury Press, Belmont, CA, second edition, 1995.
- William Feller. An Introduction to Probability Theory and Its Applications (Volume II). John Wiley & Sons, New York, NY, second edition, 1971.
- Xiaoli Zhang Fern and Carla E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proc. of ICML, pages 186--193, Washington, DC, 2003.
- Dmitriy Fradkin and David Madigan. Experiments with random projections for machine learning. In Proc. of KDD, pages 517--522, Washington, DC, 2003.
- P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory A, 44(3):355--362, 1987.
- Navin Goel, George Bebis, and Ara Nefian. Face recognition experiments with random projection. In Proc. of SPIE, pages 426--437, Bellingham, WA, 2005.
- Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115--1145, 1995.
- F. Götze. On the rate of convergence in the multivariate CLT. The Annals of Probability, 19(2):724--739, 1991.
- Piotr Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. of FOCS, pages 189--197, Redondo Beach, CA, 2000.
- Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. of STOC, pages 604--613, Dallas, TX, 1998.
- W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189--206, 1984.
- Samuel Kaski. Dimensionality reduction by random mapping: Fast similarity computation for clustering. In Proc. of IJCNN, pages 413--418, Piscataway, NJ, 1998.
- Man Lan, Chew Lim Tan, Hwee-Boon Low, and Sam Yuan Sung. A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In Proc. of WWW, pages 1032--1033, Chiba, Japan, 2005.
- Erich L. Lehmann and George Casella. Theory of Point Estimation. Springer, New York, NY, second edition, 1998.
- Will E. Leland, Murad S. Taqqu, Walter Willinger, and Daniel V. Wilson. On the self-similar nature of Ethernet traffic. IEEE/ACM Transactions on Networking, 2(1):1--15, 1994.
- Edda Leopold and Jörg Kindermann. Text categorization with support vector machines: How to represent texts in input space? Machine Learning, 46(1-3):423--444, 2002.
- Henry C. M. Leung, Francis Y. L. Chin, S. M. Yiu, Roni Rosenfeld, and W. W. Tsang. Finding motifs with insufficient number of strong binding sites. Journal of Computational Biology, 12(6):686--701, 2005.
- Ping Li, Trevor J. Hastie, and Kenneth W. Church. Improving random projections using marginal information. In Proc. of COLT, Pittsburgh, PA, 2006.
- Jessica Lin and Dimitrios Gunopulos. Dimensionality reduction by random projection and latent semantic indexing. In Proc. of SDM, San Francisco, CA, 2003.
- Bing Liu, Yiming Ma, and Philip S. Yu. Discovering unexpected information from your competitors' web sites. In Proc. of KDD, pages 144--153, San Francisco, CA, 2001.
- Kun Liu, Hillol Kargupta, and Jessica Ryan. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering, 18(1):92--106, 2006.
- B. F. Logan, C. L. Mallows, S. O. Rice, and L. A. Shepp. Limit distributions of self-normalized sums. The Annals of Probability, 1(5):788--809, 1973.
- Chris D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, 1999.
- M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323--351, 2005.
- Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. In Proc. of PODS, pages 159--168, Seattle, WA, 1998.
- Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. Randomized algorithms and NLP: Using locality sensitive hash function for high speed noun clustering. In Proc. of ACL, pages 622--629, Ann Arbor, MI, 2005.
- Jason D. Rennie, Lawrence Shih, Jaime Teevan, and David R. Karger. Tackling the poor assumptions of naive Bayes text classifiers. In Proc. of ICML, pages 616--623, Washington, DC, 2003.
- Ozgur D. Sahin, Aziz Gulbeden, Fatih Emekci, Divyakant Agrawal, and Amr El Abbadi. PRISM: Indexing multi-dimensional data in P2P networks using reference vectors. In Proc. of ACM Multimedia, pages 946--955, Singapore, 2005.
- Gerard Salton and Chris Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513--523, 1988.
- I. S. Shiganov. Refinement of the upper bound of the constant in the central limit theorem. Journal of Mathematical Sciences, 35(3):2545--2550, 1986.
- Chunqiang Tang, Sandhya Dwarkadas, and Zhichen Xu. On scaling latent semantic indexing for large peer-to-peer systems. In Proc. of SIGIR, pages 112--121, Sheffield, UK, 2004.
- Santosh Vempala. Random projection: A new approach to VLSI layout. In Proc. of FOCS, pages 389--395, Palo Alto, CA, 1998.
- Santosh Vempala. The Random Projection Method. American Mathematical Society, Providence, RI, 2004.
- William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Springer-Verlag, New York, NY, fourth edition, 2002.
- Clement T. Yu, K. Lam, and Gerard Salton. Term weighting in information retrieval using the term precision model. Journal of the ACM, 29(1):152--170, 1982.
Index Terms
- Very sparse random projections