
Sparse distance metric learning


Abstract

Nearest neighbour classification requires a good distance metric. Previous approaches try to learn a quadratic distance metric so that observations of different classes are well separated. For high-dimensional problems, where many uninformative variables are present, it is attractive to select a sparse distance metric, both to increase predictive accuracy and to aid interpretation of the result. We investigate the \(\ell _1\)-regularized metric learning problem, making a connection with the Lasso algorithm in the linear least squares setting. We show that the fitted transformation matrix is close to the desired transformation matrix in \(\ell _1\)-norm by assuming a version of the compatibility condition.
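To make the connection with the Lasso concrete, the following minimal sketch (not the authors' code; the simulated data, the regularization level alpha and the use of scikit-learn's Lasso are assumptions for illustration) exploits that each squared distance \((x_i-x_j)^{T}\mathbf{M }(x_i-x_j)=\mathrm{tr}(\mathbf{M }(x_i-x_j)(x_i-x_j)^{T})\) is linear in the entries of \(\mathbf{M }\), so the \(\ell _1\)-regularized fit reduces to a Lasso regression on vectorized outer products. The sketch omits any symmetry or positive semidefiniteness constraint on \(\mathbf{M }\).

```python
# Minimal sketch: l1-regularized quadratic metric learning as a Lasso problem.
# Assumed for illustration: simulated data, alpha=0.05, no symmetry/PSD constraint.
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 40, 5
X = rng.normal(size=(n, p))
M_true = np.zeros((p, p))
M_true[0, 0] = M_true[1, 1] = 1.0            # sparse ground-truth metric

pairs = list(combinations(range(n), 2))       # all N = n(n-1)/2 pairs {i, j}
# feature vector for pair (i, j): vec((x_i - x_j)(x_i - x_j)^T)
Z = np.array([np.outer(X[i] - X[j], X[i] - X[j]).ravel() for i, j in pairs])
# observed distances d(i,j) = tr(M (x_i - x_j)(x_i - x_j)^T) + noise
d = Z @ M_true.ravel() + 0.1 * rng.normal(size=len(pairs))

lasso = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
lasso.fit(Z, d)
M_hat = lasso.coef_.reshape(p, p)             # fitted transformation matrix
print(np.round(M_hat, 2))
```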


References

  • Bertsekas DP (1999) Nonlinear programming, 2nd edn. Athena Scientific, Nashua, New Hampshire

  • Bian W, Tao D (2011) Learning a distance metric by empirical loss minimization. In: Proceedings of the twenty-second international joint conference on artificial intelligence, vol 2, IJCAI'11. Association for the Advancement of Artificial Intelligence Press, pp 1186–1191

  • Bian W, Tao D (2012) Constrained empirical risk minimization framework for distance metric learning. IEEE Trans Neural Netw Learn Syst 23(8):1194–1205

  • Bickel P, Ritov Y, Tsybakov A (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37:1705–1732

  • Breiman L (2001) Random forests. Mach Learn 45:5–32

  • Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data. Springer, Berlin

  • Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Friedman JH, Hastie T, Tibshirani R (2009) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22

  • Goldberger J, Roweis S, Hinton G, Salakhutdinov R (2004) Neighborhood component analysis. In: Advances in neural information processing systems 17. MIT Press, Cambridge, pp 513–520

  • Hix S, Noury A, Roland G (2006) Dimensions of politics in the European Parliament. Am J Polit Sci 50:494–511

  • Negahban S, Wainwright MJ (2011) Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann Stat 39:1069–1097

  • Soifer A, Grünbaum B, Johnson P, Rousseau C (2008) The mathematical coloring book: mathematics of coloring and the colorful life of its creators. Springer, New York

  • van de Geer S, Bühlmann P (2009) On the conditions used to prove oracle results for the Lasso. Electron J Stat 3:1360–1392

  • Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244

  • Xing EP, Ng AY, Jordan MI, Russell S (2002) Distance metric learning, with application to clustering with side-information. In: Advances in neural information processing systems 15. MIT Press, Cambridge, pp 505–512


Appendix: Proofs

1.1 Additional lemmata

Lemma 6.1

Suppose Assumptions 3.1 and 3.2 hold and suppose the covariates are bounded, that is, there exists some \(c>0\) such that \( |X_{i,h}|\le c \) for all \( h = 1, \ldots , p \) and \( i=1, \ldots ,n\). Let

$$\begin{aligned} \lambda _0 := 8 c^2 \delta ^2 \sqrt{\frac{t^2 + 4 \log (p)+2 \log (n)}{(n-1)/2}}. \end{aligned}$$

Then, for \(N={n(n-1)/2}\),

$$\begin{aligned} \mathbf P \left( \max _{1 \le h\le p, 1\le \ell \le p}|2N^{-1} \sum _{ij} \varepsilon _{i,j} (X_{ih}-X_{jh})(X_{i\ell } - X_{j\ell })| < \lambda _0\right) \ge 1 - 2 \exp \left( -\frac{t^2}{2}\right) \!. \end{aligned}$$

Proof

The main difficulty here is the fact that the pairwise distances are not independent. When the number of observations \(n\) is odd, we can pick a set of \((n-1)/{2}\) pairwise distances whose indices do not overlap and whose noise terms are hence independent. If \(n\) is even, we can pick a set of \(n/2\) such pairwise distances. For each such set of pairwise distances, the contribution from the noise term can be bounded with high probability. The contribution of the noise to the set of all pairwise distances can then also be bounded by applying a union bound.

We first consider the case when \(n\) is odd. We use a result from graph theory to prove the claim. A complete graph with \(n\) vertices is a graph in which there is an edge between any two vertices; a pairwise distance can thus be represented as an edge of the complete graph. An edge colouring of a graph is an assignment of a colour to each edge such that no two adjacent edges share the same colour. Under Assumption 3.2, this implies that the pairwise distances corresponding to edges of the same colour are independent. The edge chromatic number is the minimum number of colours required to colour the edges of a graph. A complete graph with \(n\) vertices has edge chromatic number \(n\) when \(n\) is odd and \(n-1\) when \(n\) is even (a constructive proof is given in Soifer et al. 2008, p. 133), from which we can deduce that, when \(n\) is odd, there is a colouring of the edges such that each colour class consists of exactly \((n-1)/2\) edges. Therefore, the set of all pairwise distances can be partitioned into \(n\) sets \(G_1,\ldots ,G_n\), each consisting of \((n-1)/{2}\) pairwise distances in which no indices overlap: a colouring of the edges of the complete graph with \(n\) colours yields such a partition, with the edges of each colour forming one set.
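As an illustration (not part of the paper's proof), one explicit colouring with this property assigns edge \(\{i,j\}\) the colour \((i+j) \bmod n\). The minimal Python sketch below (the helper name partition_pairs is hypothetical) constructs the resulting partition for odd \(n\) and checks that every colour class consists of \((n-1)/2\) index-disjoint pairs.

```python
# Sketch of the edge colouring used in the proof, for vertices labelled 0..n-1
# with n odd: edge {i, j} of the complete graph K_n gets colour (i + j) mod n.
# Each colour class is then a matching of (n - 1)/2 pairs with no shared index.
from itertools import combinations

def partition_pairs(n):
    """Partition the edges of K_n (n odd) into n matchings of size (n-1)/2."""
    assert n % 2 == 1
    classes = {c: [] for c in range(n)}
    for i, j in combinations(range(n), 2):
        classes[(i + j) % n].append((i, j))
    return classes

G = partition_pairs(7)
for c, edges in G.items():
    assert len(edges) == 3                            # (7 - 1)/2 pairs per colour
    assert len({v for e in edges for v in e}) == 6    # no vertex repeated
print(G[0])   # [(1, 6), (2, 5), (3, 4)]
```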

Let \(V_{h,\ell ,G_k} = \frac{2}{n-1} \sum _{ \{i,j\} \in G_k} 2\varepsilon _{i,j} (X_{ih}-X_{jh})(X_{i\ell } - X_{j\ell })\).

By the boundedness of \(X\) and \(\varepsilon \),

$$\begin{aligned} - 8c^2\delta ^2\le 2\varepsilon _{i,j} (X_{ih}-X_{jh})(X_{i\ell } - X_{j\ell }) \le 8c^2\delta ^2. \end{aligned}$$

Therefore, by Hoeffding’s inequality,

$$\begin{aligned} P\left( |V_{h,\ell ,G_k}|\ge \lambda _0\right)&\le 2 \exp \left[ -\frac{2 \lambda _0^2 \left( \frac{n-1}{2}\right) ^2}{\frac{n-1}{2} (16c^2\delta ^2)^2}\right] \\&\le 2 \exp \left( - \frac{t^2 +2 \log n + 4\log p}{2}\right) \!. \end{aligned}$$
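The second inequality follows by substituting the definition of \(\lambda _0\) into the exponent:

$$\begin{aligned} \frac{2 \lambda _0^2 \left( \frac{n-1}{2}\right) ^2}{\frac{n-1}{2}\, (16c^2\delta ^2)^2} = \frac{\lambda _0^2\,(n-1)}{256\, c^4\delta ^4} = \frac{64\, c^4\delta ^4\, \frac{2(t^2+4\log p + 2\log n)}{n-1}\,(n-1)}{256\, c^4\delta ^4} = \frac{t^2 + 4\log p + 2\log n}{2}. \end{aligned}$$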

By a union bound over all choices of \(h\) and \(\ell \),

$$\begin{aligned} P\left( \max _{1 \le h\le p, 1\le \ell \le p}|V_{h,\ell ,G_k}| \ge \lambda _0 \right)&\le 2p^2 \exp \left( -\frac{t^2+4\log p + 2 \log n}{2}\right) \\&= 2 \exp \left( -\frac{t^2+2\log n }{2}\right) \!. \end{aligned}$$

Finally, by a union bound over the \(n\) sets \(G_k\),

$$\begin{aligned} P\left( \max _k \max _{1 \le h\le p, 1\le \ell \le p}|V_{h,\ell ,G_k}| \ge \lambda _0 \right)&\le 2n \exp \left( -\frac{t^2+2\log n }{2} \right) \\&= 2 \exp \left( -\frac{t^2}{2}\right) \!. \end{aligned}$$

And thus

$$\begin{aligned}&\mathbf P \left( \max _{1 \le h\le p, 1\le \ell \le p}|2 N^{-1} \sum _{ij} \varepsilon _{i,j} \left( X_{ih}-X_{jh} \right) \left( X_{i\ell } - X_{j\ell }\right) | < \lambda _0 \right) \\&\quad = \mathbf P \left( \max _{1 \le h\le p, 1\le \ell \le p}n^{-1} |\sum _k V_{h,\ell ,G_k}| < \lambda _0 \right) \\&\quad \ge \mathbf P \left( \max _k \max _{1 \le h\le p, 1\le \ell \le p}|V_{h,\ell ,G_k}| < \lambda _0 \right) \\&\quad \ge 1-2\exp \left( -\frac{t^2}{2}\right) \!. \end{aligned}$$

When \(n\) is even, we can decompose the complete graph into \(n-1\) sets of disjoint pairs, each set containing \(n/2\) pairwise distances. In this case we can derive a slightly stronger bound than when \(n\) is odd. \(\square \)

The proofs of Lemmata 6.2 and 6.3 follow closely the ones given in Chapter 6 of Bühlmann and van de Geer (2011), with modifications to handle the matrix notation.

Lemma 6.2

Assume that \(\max _{1 \le h\le p, 1\le \ell \le p}|2 N^{-1} \sum _{ij} \varepsilon _{i,j} (X_{ih}-X_{jh})(X_{i\ell } - X_{j\ell })| < \lambda _0\). Then

$$\begin{aligned} N^{-1} \sum _{ij} (\hat{d}(i,j)-d^{*}(i,j))^2 + \lambda \Vert \hat{\mathbf{M }}\Vert _1 \le \lambda _0 \Vert \hat{\mathbf{M }} - \mathbf{M }^{*}\Vert _1 + \lambda \Vert \mathbf{M }^{*}\Vert _1. \end{aligned}$$

Proof

Since \(\hat{\mathbf{M }}\) minimizes \(L(\mathbf{M })\), we have that \(L(\hat{\mathbf{M }}) \le L(\mathbf M^{*} )\) and thus

$$\begin{aligned}&\sum _{ij} \frac{ (d(i,j) - \hat{d}(i,j))^2}{N} + \lambda \Vert \hat{\mathbf{M }}\Vert _1 \le \sum _{ij} \frac{ (d(i,j) - d^{*}(i,j))^2}{N} + \lambda \Vert \mathbf{M }^{*}\Vert _1\\&- 2 \sum _{ij} \frac{\hat{d}(i,j)\, d(i,j)}{N} + \sum _{ij} \frac{\hat{d}(i,j)^2}{N}+\lambda \Vert \hat{\mathbf{M }}\Vert _1 \le - 2 \sum _{ij} \frac{d^{*}(i,j)\, d(i,j)}{N} + \sum _{ij} \frac{d^{*}(i,j)^2}{N} + \lambda \Vert \mathbf{M }^{*}\Vert _1\\&\sum _{ij} \frac{\left( \hat{d}(i,j)-d^{*}(i,j)\right) ^2}{N} + \lambda \Vert \hat{\mathbf{M }}\Vert _1 \le \sum _{ij} 2 (d(i,j) - d^{*}(i,j)) \frac{(\hat{d}(i,j) - d^{*}(i,j))}{N}+ \lambda \Vert \mathbf{M }^{*}\Vert _1 \\&\sum _{ij} \frac{\left( \hat{d}(i,j)-d^{*}(i,j)\right) ^2}{N} + \lambda \Vert \hat{\mathbf{M }}\Vert _1 \le \sum _{ij} 2 \varepsilon _{i,j} \frac{(\hat{d}(i,j) - d^{*}(i,j))}{N} + \lambda \Vert \mathbf{M }^{*}\Vert _1\\&\quad \quad = \sum _{ij} 2 \varepsilon _{i,j} \,\mathrm{tr}\left( (\hat{\mathbf{M }}-\mathbf{M }^{*})(X_i - X_j)(X_i-X_j)^{T}\right) /N+ \lambda \Vert \mathbf{M }^{*}\Vert _1 \\&\quad \quad = \mathrm{tr}\left( (\hat{\mathbf{M }}-\mathbf{M }^{*})\sum _{ij} 2 \varepsilon _{i,j} (X_i - X_j)(X_i-X_j)^{T}\right) /N + \lambda \Vert \mathbf{M }^{*}\Vert _1\\&\quad \quad \le \lambda _0 \Vert \hat{\mathbf{M }}-\mathbf{M }^{*}\Vert _1 + \lambda \Vert \mathbf{M }^{*}\Vert _1, \end{aligned}$$

which completes the proof. \(\square \)
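For clarity, the final inequality in the display above uses an entrywise Hölder-type bound: writing \(\mathbf{A } = \hat{\mathbf{M }}-\mathbf{M }^{*}\) and \(\mathbf{B } = 2N^{-1}\sum _{ij} \varepsilon _{i,j}(X_i-X_j)(X_i-X_j)^{T}\) (both symmetric),

$$\begin{aligned} |\mathrm{tr}(\mathbf{A }\mathbf{B })| = \Big |\sum _{h,\ell } \mathbf{A }_{h\ell }\,\mathbf{B }_{h\ell }\Big | \le \Vert \mathbf{A }\Vert _1 \max _{1\le h\le p,\, 1\le \ell \le p} |\mathbf{B }_{h\ell }| < \lambda _0 \Vert \hat{\mathbf{M }}-\mathbf{M }^{*}\Vert _1. \end{aligned}$$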

Lemma 6.3

Assume that \(\max _{1 \le h\le p, 1\le \ell \le p}|2 N^{-1} \sum _{ij} \varepsilon _{i,j} (X_{ih}-X_{jh})(X_{i\ell } - X_{j\ell })| < \lambda _0\) holds. By picking \(\lambda \ge 2 \lambda _0\),

$$\begin{aligned} 2 \sum _{ij} \frac{(\hat{d}(i,j) - d^{*}(i,j))^2}{N} + \lambda \Vert \hat{\mathbf{M }}_{S^c}\Vert _1 \le 3 \lambda \Vert \hat{\mathbf{M }}_{S} - \mathbf{M }_{S}^{*} \Vert _1. \end{aligned}$$

Proof

By the triangle inequality,

$$\begin{aligned} \Vert \hat{\mathbf{M }}\Vert _1 \ge \Vert \mathbf{M }^{*}_S\Vert _1 - \Vert \hat{\mathbf{M }}_S - \mathbf{M }^{*}_S\Vert _1 + \Vert {\hat{\mathbf{M}}}_{S^c}\Vert _1. \end{aligned}$$

Also note that, since \(\mathbf{M }^{*}_{S^c} = 0\), we can expand \(\Vert \hat{\mathbf{M }} - \mathbf{M }^{*}\Vert _1 = \Vert \hat{\mathbf{M }}_S - \mathbf{M }^{*}_S \Vert _1 + \Vert \hat{\mathbf{M }}_{S^c}\Vert _1.\) We can hence further extend the result of Lemma 6.2:

$$\begin{aligned}&2\sum _{ij} \frac{\left( \hat{d}(i,j)-d^{*}(i,j) \right) ^2}{N} + 2\lambda \Vert \hat{\mathbf{M }}\Vert _1 \le 2\lambda _0 \Vert \hat{\mathbf{M }} - \mathbf{M }^{*}\Vert _1 + 2\lambda \Vert \mathbf{M }^{*}\Vert _1\\&2\sum _{ij} \frac{\left( \hat{d}(i,j)-d^{*}(i,j)\right) ^2}{N} + 2\lambda \left( \Vert \mathbf{M }^{*}_S\Vert _1 - \Vert \hat{\mathbf{M }}_S - \mathbf{M }^{*}_S\Vert _1 + \Vert {\hat{\mathbf{M}}}_{S^c}\Vert _1 \right) \le \lambda \Vert \hat{\mathbf{M }} - \mathbf{M }^{*}\Vert _1 + 2\lambda \Vert \mathbf{M }^{*}\Vert _1\\&2\sum _{ij} \frac{\left( \hat{d}(i,j)-d^{*}(i,j)\right) ^2}{N} + 2\lambda \left( \Vert \mathbf{M }^{*}_S\Vert _1 - \Vert \hat{\mathbf{M }}_S - \mathbf{M }^{*}_S\Vert _1 + \Vert {\hat{\mathbf{M}}}_{S^c}\Vert _1 \right) \\&\qquad \qquad \le \lambda \left( \Vert \hat{\mathbf{M }}_S - \mathbf{M }^{*}_S\Vert _1 + \Vert \hat{\mathbf{M }}_{S^c}\Vert _1 \right) + 2\lambda \Vert \mathbf{M }^{*}\Vert _1\\&2\sum _{ij} \frac{\left( \hat{d}(i,j)-d^{*}(i,j)\right) ^2}{N} + 2\lambda \Vert {\hat{\mathbf{M}}}_{S^c}\Vert _1 \le 3 \lambda \Vert \hat{\mathbf{M }}_S - \mathbf{M }^{*}_S\Vert _1 + \lambda \Vert \hat{\mathbf{M }}_{S^c}\Vert _1\\&2\sum _{ij} \frac{\left( \hat{d}(i,j)-d^{*}(i,j)\right) ^2}{N}+ \lambda \Vert \hat{\mathbf{M }}_{S^c}\Vert _1 \le 3 \lambda \Vert \hat{\mathbf{M }}_S - \mathbf{M }^{*}_S\Vert _1, \end{aligned}$$

which completes the proof. \(\square \)

1.2 Proof of Theorem 3.4

Finally, we can prove the performance bound for the regularized method. By Lemma 6.1, it holds with probability at least \(1 - 2 \exp (-t^2/2)\) that

$$\begin{aligned} \max _{1 \le h\le p, 1\le \ell \le p}|2N^{-1} \sum _{ij} \varepsilon _{i,j} \left( X_{ih}-X_{jh}\right) \left( X_{i\ell } - X_{j\ell }\right) | < \frac{\lambda }{2}. \end{aligned}$$

Hence, using Lemma 6.3, it holds with probability at least \(1 - 2 \exp (-t^2/2)\) that

$$\begin{aligned} 2 \sum _{ij} \frac{\left( \hat{d}(i,j) - d^{*}(i,j)\right) ^2}{N} + \lambda \Vert \hat{\mathbf{M }} - \mathbf{M }^{*}\Vert _1 \;\le \; 4 \lambda \Vert \hat{\mathbf{M }}_S - \mathbf{M }^{*}_S\Vert _1. \end{aligned}$$

Using the compatibility condition, there exists some \(\psi >0\) such that

$$\begin{aligned} \Vert \hat{\mathbf{M }}_S - \mathbf{M }^{*}_S\Vert _1 \le (\sqrt{N}\psi )^{-1} \sqrt{\sum _{ij} \left( \hat{d}(i,j) - d^{*}(i,j)\right) ^2}. \end{aligned}$$

Hence

$$\begin{aligned} 2 \sum _{ij} \frac{\left( \hat{d}(i,j) - d^{*}(i,j)\right) ^2}{N} + \lambda \Vert \hat{\mathbf{M }} - \mathbf{M }^{*}\Vert _1&\le \frac{4 \lambda }{\sqrt{N}\psi } \sqrt{\sum _{ij} \left( \hat{d}(i,j) - d^{*}(i,j)\right) ^2} \\&\le \sum _{ij} \frac{\left( \hat{d}(i,j) - d^{*}(i,j)\right) ^2}{N} + \frac{4\lambda ^2}{\psi ^2}, \end{aligned}$$

which completes the proof. \(\square \)
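The last step uses the elementary inequality \(4ab \le a^2 + 4b^2\) with \(a = \sqrt{\sum _{ij}(\hat{d}(i,j)-d^{*}(i,j))^2/N}\) and \(b = \lambda /\psi \). Subtracting the squared-error term from both sides of the final display gives the oracle-type bound of Theorem 3.4 (stated here up to the paper's exact constants):

$$\begin{aligned} \sum _{ij} \frac{\left( \hat{d}(i,j)-d^{*}(i,j)\right) ^2}{N} + \lambda \Vert \hat{\mathbf{M }}-\mathbf{M }^{*}\Vert _1 \le \frac{4\lambda ^2}{\psi ^2}, \end{aligned}$$

so that, in particular, \(\Vert \hat{\mathbf{M }}-\mathbf{M }^{*}\Vert _1 \le 4\lambda /\psi ^2\) on the event of Lemma 6.1.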


Cite this article

Choy, T., Meinshausen, N. Sparse distance metric learning. Comput Stat 29, 515–528 (2014). https://doi.org/10.1007/s00180-013-0437-2
