
Very sparse random projections

Published: 20 August 2006

ABSTRACT

There has been considerable interest in random projections, an approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space. Let A ∈ R^{n×D} be our n points in D dimensions. The method multiplies A by a random matrix R ∈ R^{D×k}, reducing the D dimensions down to just k to speed up the computation. R typically consists of entries drawn from the standard normal N(0,1). It is well known that random projections preserve pairwise distances (in expectation). Achlioptas proposed sparse random projections by replacing the N(0,1) entries in R with entries in {-1, 0, 1} with probabilities {1/6, 2/3, 1/6}, achieving a threefold speedup in processing time. We recommend using R with entries in {-1, 0, 1} with probabilities {1/(2√D), 1 − 1/√D, 1/(2√D)}, achieving a significant √D-fold speedup with little loss in accuracy.
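The scheme described in the abstract is straightforward to sketch in code. Below is a minimal NumPy illustration (function and variable names are my own, not from the paper): with s = √D, entries of R take the values ±√s with probability 1/(2s) each and 0 otherwise, so every entry has mean 0 and variance 1, and projected distances (after scaling by 1/√k) estimate the original distances without bias.

```python
import numpy as np

def very_sparse_projection(D, k, rng=None):
    """Draw R in R^{D x k}: entries are +sqrt(s), 0, -sqrt(s) with
    probabilities 1/(2s), 1 - 1/s, 1/(2s), where s = sqrt(D).
    Each entry has mean 0 and unit variance."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.sqrt(D)
    signs = rng.choice([1.0, 0.0, -1.0], size=(D, k),
                       p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return np.sqrt(s) * signs

def project(A, R):
    """Map the rows of A from D down to k = R.shape[1] dimensions;
    the 1/sqrt(k) factor makes projected distances unbiased."""
    k = R.shape[1]
    return (A @ R) / np.sqrt(k)
```

Since only about a 1/√D fraction of the entries of R are nonzero, the product A @ R needs roughly nk√D operations instead of nkD; a practical implementation would store R in a sparse format (e.g. scipy.sparse) to realize that speedup.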

References

  1. Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671--687, 2003.
  2. Dimitris Achlioptas, Frank McSherry, and Bernhard Schölkopf. Sampling techniques for kernel methods. In Proc. of NIPS, pages 335--342, Vancouver, BC, Canada, 2001.
  3. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proc. of STOC, pages 20--29, Philadelphia, PA, 1996.
  4. Rosa Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. In Proc. of FOCS (also to appear in Machine Learning), pages 616--623, New York, NY, 1999.
  5. Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: Applications to image and text data. In Proc. of KDD, pages 245--250, San Francisco, CA, 2001.
  6. Jeremy Buhler and Martin Tompa. Finding motifs using random projections. Journal of Computational Biology, 9(2):225--242, 2002.
  7. Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. of STOC, pages 380--388, Montreal, Quebec, Canada, 2002.
  8. G. P. Chistyakov and F. Götze. Limit distributions of studentized means. The Annals of Probability, 32(1A):28--77, 2004.
  9. Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60--65, 2003.
  10. Susan T. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments and Computers, 23(2):229--236, 1991.
  11. Richard Durrett. Probability: Theory and Examples. Duxbury Press, Belmont, CA, second edition, 1995.
  12. William Feller. An Introduction to Probability Theory and Its Applications (Volume II). John Wiley & Sons, New York, NY, second edition, 1971.
  13. Xiaoli Zhang Fern and Carla E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proc. of ICML, pages 186--193, Washington, DC, 2003.
  14. Dmitriy Fradkin and David Madigan. Experiments with random projections for machine learning. In Proc. of KDD, pages 517--522, Washington, DC, 2003.
  15. P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory A, 44(3):355--362, 1987.
  16. Navin Goel, George Bebis, and Ara Nefian. Face recognition experiments with random projection. In Proc. of SPIE, pages 426--437, Bellingham, WA, 2005.
  17. Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115--1145, 1995.
  18. F. Götze. On the rate of convergence in the multivariate CLT. The Annals of Probability, 19(2):724--739, 1991.
  19. Piotr Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. of FOCS, pages 189--197, Redondo Beach, CA, 2000.
  20. Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. of STOC, pages 604--613, Dallas, TX, 1998.
  21. W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189--206, 1984.
  22. Samuel Kaski. Dimensionality reduction by random mapping: Fast similarity computation for clustering. In Proc. of IJCNN, pages 413--418, Piscataway, NJ, 1998.
  23. Man Lan, Chew Lim Tan, Hwee-Boon Low, and Sam Yuan Sung. A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In Proc. of WWW, pages 1032--1033, Chiba, Japan, 2005.
  24. Erich L. Lehmann and George Casella. Theory of Point Estimation. Springer, New York, NY, second edition, 1998.
  25. Will E. Leland, Murad S. Taqqu, Walter Willinger, and Daniel V. Wilson. On the self-similar nature of Ethernet traffic. IEEE/ACM Transactions on Networking, 2(1):1--15, 1994.
  26. Edda Leopold and Jörg Kindermann. Text categorization with support vector machines: How to represent texts in input space? Machine Learning, 46(1-3):423--444, 2002.
  27. Henry C. M. Leung, Francis Y. L. Chin, S. M. Yiu, Roni Rosenfeld, and W. W. Tsang. Finding motifs with insufficient number of strong binding sites. Journal of Computational Biology, 12(6):686--701, 2005.
  28. Ping Li, Trevor J. Hastie, and Kenneth W. Church. Improving random projections using marginal information. In Proc. of COLT, Pittsburgh, PA, 2006.
  29. Jessica Lin and Dimitrios Gunopulos. Dimensionality reduction by random projection and latent semantic indexing. In Proc. of SDM, San Francisco, CA, 2003.
  30. Bing Liu, Yiming Ma, and Philip S. Yu. Discovering unexpected information from your competitors' web sites. In Proc. of KDD, pages 144--153, San Francisco, CA, 2001.
  31. Kun Liu, Hillol Kargupta, and Jessica Ryan. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering, 18(1):92--106, 2006.
  32. B. F. Logan, C. L. Mallows, S. O. Rice, and L. A. Shepp. Limit distributions of self-normalized sums. The Annals of Probability, 1(5):788--809, 1973.
  33. Chris D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, 1999.
  34. M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323--351, 2005.
  35. Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. In Proc. of PODS, pages 159--168, Seattle, WA, 1998.
  36. Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. Randomized algorithms and NLP: Using locality sensitive hash function for high speed noun clustering. In Proc. of ACL, pages 622--629, Ann Arbor, MI, 2005.
  37. Jason D. Rennie, Lawrence Shih, Jaime Teevan, and David R. Karger. Tackling the poor assumptions of naive Bayes text classifiers. In Proc. of ICML, pages 616--623, Washington, DC, 2003.
  38. Ozgur D. Sahin, Aziz Gulbeden, Fatih Emekcci, Divyakant Agrawal, and Amr El Abbadi. PRISM: Indexing multi-dimensional data in P2P networks using reference vectors. In Proc. of ACM Multimedia, pages 946--955, Singapore, 2005.
  39. Gerard Salton and Chris Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513--523, 1988.
  40. I. S. Shiganov. Refinement of the upper bound of the constant in the central limit theorem. Journal of Mathematical Sciences, 35(3):2545--2550, 1986.
  41. Chunqiang Tang, Sandhya Dwarkadas, and Zhichen Xu. On scaling latent semantic indexing for large peer-to-peer systems. In Proc. of SIGIR, pages 112--121, Sheffield, UK, 2004.
  42. Santosh Vempala. Random projection: A new approach to VLSI layout. In Proc. of FOCS, pages 389--395, Palo Alto, CA, 1998.
  43. Santosh Vempala. The Random Projection Method. American Mathematical Society, Providence, RI, 2004.
  44. William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Springer-Verlag, New York, NY, fourth edition, 2002.
  45. Clement T. Yu, K. Lam, and Gerard Salton. Term weighting in information retrieval using the term precision model. Journal of the ACM, 29(1):152--170, 1982.

• Published in

  KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  August 2006, 986 pages
  ISBN: 1595933395
  DOI: 10.1145/1150402
  Copyright © 2006 ACM


  Publisher: Association for Computing Machinery, New York, NY, United States
