
Learning a Distance Metric from Relative Comparisons between Quadruplets of Images


Abstract

This paper addresses the problem of learning a distance metric from meaningful and discriminative distance constraints in contexts where rich information about the data is available. Classic metric learning approaches focus on constraints that involve pairs or triplets of images. We propose a general Mahalanobis-like distance metric learning framework that exploits distance constraints over up to four different images. We show how integrating such constraints can lead to unsupervised or semi-supervised learning tasks in some applications. We also show the benefit of this type of constraint on recognition performance in rich contexts such as relative attributes, class taxonomies and temporal webpage analysis.


Notes

  1. Note that the decomposition of \(\mathbf M \) is not unique.

  2. The authors use the constraint \(\sum _{(\mathcal {I}_i,\mathcal {I}_j) \in \mathcal {D}} \sqrt{ D^2_\mathbf{M }(\mathcal {I}_i,\mathcal {I}_j)}\) instead of the usual squared Mahalanobis distance to avoid learning a matrix \(\mathbf M \) that is rank 1.

  3. Different ways to set the parameter \(\gamma \) exist. It can for example be determined with prior knowledge about the page or it can be chosen heuristically following the observation in Adar et al. (2009a): human users tend to visit more frequently webpages that often change. In other words, human users can be considered as intelligent web crawlers with a good crawling strategy. For instance, a page that is visited everyday by a lot of unique visitors can be assumed to be different everyday (in this case \(\gamma = 24\) h). This popularity information can be obtained from services that provide detailed statistics about the visits to a website (e.g., Google Analytics).

  4. We use the publicly available code of Oliva and Torralba (2001) in MATLAB to compute GIST descriptors. In particular, we choose the following setting: 8 oriented edge responses at 4 different scales. The computation time of the GIST descriptor of a page version (screen capture of about \(1000 \times 1000\) pixels) using a \(10 \times 10\) grid is 3.2 s.

  5. We also tried to include the Jaccard distance of words (similar to Dice’s coefficient of words used in Adar et al. (2009b), with the exception that it satisfies the properties of a distance metric), but it does not improve performance.
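For reference, a minimal Python sketch of this distance is given below (whitespace tokenization is our simplifying assumption, not necessarily the preprocessing used in the paper). Unlike Dice’s coefficient, the Jaccard distance satisfies the triangle inequality.

```python
def jaccard_distance(text_a, text_b):
    """Jaccard distance between the word sets of two page versions:
    1 - |A intersection B| / |A union B|, a proper distance metric."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    if not words_a and not words_b:
        return 0.0  # two empty documents are considered identical
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)
```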

  6. www.cnn.com, www.bbc.co.uk, www.npr.org, www.nytimes.com, www.finance.yahoo.com, www.npr.org/music, www.bu.edu, www.ocw.mit.edu.

  7. We minimize the number of common versions used for training among the different splits: i.e., the first training split contains the first 5 days, the second one the 6th–10th days, the third one the 11th–15th days.

  8. We experimented with different values of m (i.e., \(m=4, 8\) and 10), and this setting returned the best recognition performance for all the distance metrics. All the distance metrics benefit from greater values of m, which means that they need to focus on highly detailed small regions of pages.

  9. The accuracies reported with zero annotated pair samples per class correspond to those of Sect. 5.5, Fig. 8 and Table 2.

  10. http://www.image-net.org/challenges/LSVRC/2010/.

  11. We use the eigendecomposition \(\mathbf M = \mathbf UDU ^{\top }\) where \(\mathbf D \) is a diagonal matrix, and we formulate \(\mathbf L = \mathbf D ^{1/2} \mathbf U ^{\top }\).
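A minimal numpy sketch of this decomposition follows (assuming \(\mathbf M \) is symmetric positive semi-definite; clipping tiny negative eigenvalues is our own numerical safeguard, not a step prescribed by the paper).

```python
import numpy as np

def linear_map_from_metric(M):
    """Return L such that M = L^T L, via the eigendecomposition M = U D U^T
    and L = D^{1/2} U^T (M assumed symmetric positive semi-definite)."""
    eigvals, U = np.linalg.eigh(M)            # M = U diag(eigvals) U^T
    eigvals = np.clip(eigvals, 0.0, None)     # remove tiny negative values due to numerical noise
    return np.sqrt(eigvals)[:, None] * U.T    # D^{1/2} U^T

# The learned metric then satisfies (x_i - x_j)^T M (x_i - x_j) = ||L x_i - L x_j||_2^2.
```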

  12. We report the results for 10 nearest neighbor classification (which performs better than 1-NN, 5-NN and 50-NN).

  13. As mentioned in Verma et al. (2012, footnote 1), there exist several ways to define a CSL and one of these ways is chosen in Verma et al. (2012).

  14. It is not necessary to discuss the sign of \(D_{kl}\) since \(\mathcal {I}_k\) was annotated to have a stronger presence of \(a_m\) than \(\mathcal {I}_l\); we thus infer \(D_{kl} > 0\).

  15. A different setup is used in Parkash and Parikh (2012) where additional feedback improves recognition.

  16. For instance, if we have \((k) \prec (i) \prec (e) \prec (f) \sim (g) \prec (h) \prec (j) \prec (l)\), the classes (i) and (j) are the second closest classes of (f) and (g). The classes (k) and (l) are their third closest classes.

References

  • Adar, E., Teevan, J., & Dumais, S. (2009). Resonance on the web: Web dynamics and revisitation patterns. In ACM CHI conference on human factors in computing systems (CHI).

  • Adar, E., Teevan, J., Dumais, S., & Elsas, J. (2009). The web changes everything: Understanding the dynamics of web content. In ACM WSDM conference series web search and data mining (WSDM).

  • Agarwal, S., Wills, J., Cayton, L., Lanckriet, G., Kriegman, D.J., & Belongie, S. (2007). Generalized non-metric multidimensional scaling. In International conference on artificial intelligence and statistics (AISTATS) (pp. 11–18).

  • Avila, S., Thome, N., Cord, M., Valle, E., & Araújo, A. de A. (2013). Pooling in image representation: The visual codeword point of view. Computer Vision and Image Understanding (CVIU), 117(5), 453–465.

  • Ben Saad, M., & Gançarski, S. (2011). Archiving the web using page changes pattern: A case study. In Joint conference on digital library (JCDL).

  • Borg, I., & Groenen, P. (2005). Modern multidimensional scaling: Theory and applications. Springer series in statistics. New York: Springer.

  • Boyd, S., & Vandenberghe, L. (2008). Subgradient. Notes for EE364b. Stanford University, Winter 2006–2007. http://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf.

  • Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.

  • Cai, D., Yu, S., Wen, J., & Ma, W. (2003). VIPS: A vision-based page segmentation algorithm. Microsoft Technical Report MSR-TR-2003-79.

  • Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation, 19(5), 1155–1178.

  • Chapelle, O., & Keerthi, S. S. (2010). Efficient algorithms for ranking with svms. Information Retrieval, 13(3), 201–215.

  • Chechik, G., Sharma, V., Shalit, U., & Bengio, S. (2010). Large scale online learning of image similarity through ranking. The Journal of Machine Learning Research (JMLR), 11, 1109–1135.

  • Cord, M., & Cunningham, P. (2008). Machine learning techniques for multimedia. Santa Clara, CA: Springer.

  • Davis, J. V., Kulis, B., Jain, P., Sra, S., & Dhillon, I. S. (2007). Information-theoretic metric learning. In International conference on machine learning (ICML).

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Douze, M., Jégou, H., Sandhawalia, H., Amsaleg, L., & Schmid, C. (2009). Evaluation of gist descriptors for web-scale image search. In ACM international conference on image and video retrieval (CIVR).

  • Finley, T., & Joachims, T. (2005). Supervised clustering with support vector machines. In International conference on machine learning (ICML) (pp. 217–224). New York: ACM.

  • Finley, T., & Joachims, T. (2008). Supervised k-means clustering. Cornell Computing and Information Science Technical Report.

  • Frome, A., Singer, Y., & Malik, J. (2006). Image retrieval and classification using local distance functions. In Advances in neural information processing systems (NIPS).

  • Frome, A., Singer, Y., Sha, F., & Malik, J. (2007). Learning globally-consistent local distance functions for shape-based image retrieval and classification. In IEEE international conference on computer vision (ICCV).

  • Goh, H., Thome, N., Cord, M., & Lim, J. (2012). Unsupervised and supervised visual codes with restricted Boltzmann machines. In European conference on computer vision (ECCV).

  • Guillaumin, M., Verbeek, J., & Schmid, C. (2009). Is that you? Metric learning approaches for face identification. In IEEE international conference on computer vision (ICCV).

  • Hocking, T. D., Schleiermacher, G., Janoueix-Lerosey, I., Boeva, V., Cappo, J., Delattre, O., et al. (2013). Learning smoothing models of copy number profiles using breakpoint annotations. BMC Bioinformatics, 14(1), 164.

  • Hwang, S.J., Grauman, K., & Sha, F. (2011). Learning a tree of metrics with disjoint visual features. In Advances in neural information processing systems (NIPS).

  • Hwang, S.J., Grauman, K., & Sha, F. (2013). Analogy-preserving semantic embedding for visual object categorization. In International conference on machine learning (ICML).

  • Jain, P., Kulis, B., & Grauman, K. (2008). Fast image search for learned metrics. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 133–142). New York: ACM.

  • Joachims, T. (2005). A support vector method for multivariate performance measures. In Proceedings of the 22nd international conference on machine learning (pp. 377–384). New York: ACM.

  • Joachims, T. (2006). Training linear svms in linear time. In Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 217–226). New York: ACM.

  • Joachims, T., Finley, T., & Yu, C. N. J. (2009). Cutting-plane training of structural svms. Machine Learning, 77(1), 27–59.

  • Keerthi, S. S., & DeCoste, D. (2005). A modified finite Newton method for fast solution of large scale linear svms. Journal of Machine Learning Research, 6(1), 341.

  • Kendall, M. G., & Gibbons, J. D. (1990). Rank correlation methods (5th ed.). New York: Oxford University Press.

  • Kruskal, J. B. (1964). Nonmetric multidimensional scaling: A numerical method. Psychometrika, 29(2), 115–129.

  • Kulis, B. (2012). Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4), 287–364.

  • Kumar, M., Torr, P., & Zisserman, A. (2007). An invariant large margin nearest neighbour classifier. In IEEE international conference on computer vision (ICCV).

  • Kumar, N., Berg, A., Belhumeur, P., & Nayar, S. (2009). Attribute and simile classifiers for face verification. In IEEE international conference on computer vision (ICCV).

  • Lajugie, R., Bach, F., & Arlot, S. (2014). Large-margin metric learning for constrained partitioning problems. In International conference on machine learning (ICML) (pp. 297–305).

  • Lampert, C. H., Nickisch, H., & Harmeling, S. (2009). Learning to detect unseen object classes by between-class attribute transfer. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Law, M. T., Sureda Gutierrez, C., Thome, N., Gançarski, S., & Cord, M. (2012). Structural and visual similarity learning for web page archiving. In 10th workshop on content-based multimedia indexing (CBMI).

  • Law, M. T., Thome, N., & Cord, M. (2013). Quadruplet-wise image similarity learning. In IEEE international conference on computer vision (ICCV) (pp. 249–256).

  • Law, M. T., Thome, N., Gançarski, S., & Cord, M. (2012). Structural and visual comparisons for web page archiving. In ACM symposium on document engineering (DocEng).

  • Luo, P., Fan, J., Liu, S., Lin, F., Xiong, Y., & Liu, J. (2009). Web article extraction for web printing: A dom+ visual based approach. In ACM symposium on document engineering (DocEng). New York: ACM.

  • McFee, B., & Lanckriet, G. (2009). Partial order embedding with multiple kernels. In International conference on machine learning (ICML) (pp. 721–728). New York: ACM.

  • McFee, B., & Lanckriet, G. (2010). Metric learning to rank. In International conference on machine learning (ICML).

  • Mensink, T., Verbeek, J., Perronnin, F., & Csurka, G. (2012). Metric learning for large-scale image classification: Generalizing to new classes at near-zero cost. In European conference on computer vision (ECCV).

  • Mignon, A., & Jurie, F. (2012). PCCA: A new approach for distance learning from sparse pairwise constraints. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision (IJCV), 42(3), 145–175.

  • Parikh, D., & Grauman, K. (2011). Relative attributes. In IEEE international conference on computer vision (ICCV).

  • Parkash, A., & Parikh, D. (2012). Attributes for classifier feedback. In European conference on computer vision (ECCV).

  • Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., & Poggio, T. (2007). Robust object recognition with cortex-like mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29(3), 411–426.

  • Shaw, B., Huang, B. C., & Jebara, T. (2011). Learning a distance metric from a network. In Advances in neural information processing systems (NIPS) (pp. 1899–1907).

  • Shepard, R. N. (1962a). The analysis of proximities: Multidimensional scaling with an unknown distance function. I. Psychometrika, 27(2), 125–140.

  • Shepard, R. N. (1962b). The analysis of proximities: Multidimensional scaling with an unknown distance function. ii. Psychometrika, 27(3), 219–246.

  • Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In IEEE international conference on computer vision (ICCV).

  • Song, R., Liu, H., Wen, J., & Ma, W. (2004). Learning block importance models for web pages. In World wide web conference (WWW).

  • Spengler, A., & Gallinari, P. (2010). Document structure meets page layout: Loopy random fields for web news content extraction. In ACM symposium on document engineering (DocEng).

  • Tenenbaum, J., De Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.

  • Theriault, C., Thome, N., & Cord, M. (2013). Extended coding and pooling in the hmax model. IEEE Transactions on Image Processing, 22(2), 764–777.

  • Torresani, L., & Lee, K. (2007). Large margin component analysis. In Advances in neural information processing systems (NIPS).

  • Verma, N., Mahajan, D., Sellamanickam, S., & Nair, V. (2012). Learning hierarchical similarity metrics. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Weinberger, K., & Chapelle, O. (2008). Large margin taxonomy embedding with an application to document categorization. Advances in Neural Information Processing Systems (NIPS), 21, 1737–1744.

  • Weinberger, K., & Saul, L. (2009). Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research (JMLR), 10, 207–244.

  • Xing, E., Ng, A., Jordan, M., & Russell, S. (2002). Distance metric learning, with application to clustering with side-information. In Advances in neural information processing systems (NIPS).

  • Yang, J., Yu, K., Gong, Y., & Huang, T. (2009). Linear spatial pyramid matching using sparse coding for image classification. In IEEE conference on computer vision and pattern recognition (CVPR).

Acknowledgments

This work was partially supported by the SCAPE Project, co-funded by the European Union under FP7 ICT 2009.4.1 (Grant Agreement no. 270137).

Author information

Corresponding author

Correspondence to Marc T. Law.

Additional information

Communicated by K. Kise.

Appendix 1: Solver for the Vector Optimization Problem

We describe here the optimization process when the goal is to learn a dissimilarity function \(\mathscr {D}_\mathbf{w }\) parameterized by a vector \(\mathbf w \).

1.1 Primal Form of the Optimization Problem

We first rewrite Eq. (14) in the primal form in order to use the efficient and scalable primal Newton method (Chapelle and Keerthi 2010).

The first two constraints of Eq. (14) over \(\mathcal {S}\) and \(\mathcal {D}\) try to satisfy Eqs. (9) and (10). They are equivalent to \(y_{ij}(\mathscr {D}_\mathbf{w } (\mathcal {I}_i, \mathcal {I}_j) - b) \ge 1 - \xi _{ij}\) where \(y_{ij} = 1 \Longleftrightarrow (\mathcal {I}_i, \mathcal {I}_j) \in \mathcal {D}\) and \(y_{ij} = -1 \Longleftrightarrow (\mathcal {I}_i, \mathcal {I}_j) \in \mathcal {S}\). Equation (14) can then be rewritten equivalently:

$$\begin{aligned} \begin{aligned} \min _{(\mathbf w , b)} \frac{1}{2} (\Vert \mathbf w \Vert _2^2 + b^2)&+ C_p \sum _{(\mathcal {I}_i, \mathcal {I}_j) \in \mathcal {S} \cup \mathcal {D}} L_1(y_{ij},\mathscr {D}_\mathbf{w } (\mathcal {I}_i, \mathcal {I}_j) - b) \\&+ C_q \sum _{q \in \mathcal {N}} L_{\delta _q}(1,\mathscr {D}_\mathbf{w } (\mathcal {I}_k, \mathcal {I}_l)- \mathscr {D}_\mathbf{w } (\mathcal {I}_i, \mathcal {I}_j)) \\ {}&s.t. \quad \mathbf w \in \mathscr {C}^d, b \in \mathscr {C} \end{aligned} \end{aligned}$$
(27)

where \(L_1\) and \(L_{\delta _q}\) are loss functions and \(y_{ij} \in \{ -1 ; 1 \}\). In particular, for Eqs. (14) and (27) to be strictly equivalent, \(L_1\) and \(L_{\delta _q}\) have to be the classic hinge loss function \(L_\delta (y,t) = \max (0,\delta -yt)\). We actually use a differentiable approximation of this function to obtain good convergence properties (Chapelle 2007; Chapelle and Keerthi 2010).

For convenience, we rewrite some variables:

  • \(\varvec{\omega } = [\mathbf w ^\top , b]^\top \) is the concatenation of \(\mathbf w \) and b in a single \((d+1)\)-dimensional vector. We note \(e = d+1\) and then have \(\varvec{\omega } \in \mathbb {R}^e\).

  • \(\mathbf c _{ij} = [(\varPsi (\mathcal {I}_{i}, \mathcal {I}_{j}))^\top ,-1]^\top \) is the concatenation vector of \(\varPsi (\mathcal {I}_{i}, \mathcal {I}_{j})\) and \(-1\). We also have \(\mathbf c _{ij} \in \mathbb {R}^e\).

  • \(p = (\mathcal {I}_{i}, \mathcal {I}_{j}) \Longleftrightarrow \mathbf c _p = \mathbf c _{ij}\) and \(y_p = y_{ij}\).

  • \(q = (\mathcal {I}_i, \mathcal {I}_j, \mathcal {I}_k, \mathcal {I}_l) \Longleftrightarrow \mathbf z _q = \mathbf x _{kl} - \mathbf x _{ij}\).

Equation (27) can be rewritten equivalently with these variables:

$$\begin{aligned} \begin{aligned} \min _{\varvec{\omega } \in \mathscr {C}^e} \frac{1}{2} \Vert \varvec{\omega } \Vert _2^2&+ C_p \sum _{p \in \mathcal {S} \cup \mathcal {D}} L_1(y_{p},\varvec{\omega }^\top \mathbf c _p) \\&+ C_q \sum _{q \in \mathcal {N}} L_{\delta _q}(1, \varvec{\omega }^\top \mathbf z _q) \end{aligned} \end{aligned}$$
(28)

By choosing such a regularization, our scheme may be compared to a RankSVM (Chapelle and Keerthi 2010), with the exception that the loss function \(L_{\delta _q}\) works on quadruplets. The complexity of this convex problem w.r.t. \(\varvec{\omega }\) is linear in the number of constraints (i.e., the cardinality of \(\mathcal {N} \cup \mathcal {D} \cup \mathcal {S}\)). It can be solved with a classic or stochastic (sub)gradient descent w.r.t. \(\varvec{\omega }\) depending on the number of constraints. The number of parameters to learn is small and grows linearly with the input space dimension, limiting overfitting (Mignon and Jurie 2012). It can also be extended to kernels (Chapelle and Keerthi 2010).
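For concreteness, the sketch below (a minimal numpy illustration with our own variable names, not the solver used in the experiments) evaluates the objective of Eq. (28) with plain hinge losses and performs one projected (sub)gradient step, projecting onto \(\mathbb {R}_+^e\) when \(\mathscr {C}^e = \mathbb {R}_+^e\).

```python
import numpy as np

def objective(omega, C_pairs, y_pairs, Z_quads, deltas, Cp, Cq):
    """Objective of Eq. (28) with hinge losses L_delta(y, t) = max(0, delta - y*t).
    C_pairs: (n_pairs, e) rows c_p with labels y_pairs in {-1, +1};
    Z_quads: (n_quads, e) rows z_q with margins deltas (delta_q)."""
    pair_loss = np.maximum(0.0, 1.0 - y_pairs * (C_pairs @ omega)).sum()
    quad_loss = np.maximum(0.0, deltas - Z_quads @ omega).sum()
    return 0.5 * omega @ omega + Cp * pair_loss + Cq * quad_loss

def projected_subgradient_step(omega, C_pairs, y_pairs, Z_quads, deltas,
                               Cp, Cq, lr=1e-3, nonneg=True):
    """One (sub)gradient step on Eq. (28), followed by projection onto R_+^e."""
    active_p = y_pairs * (C_pairs @ omega) < 1.0   # pairs with non-zero hinge loss
    active_q = Z_quads @ omega < deltas            # quadruplets with non-zero hinge loss
    grad = (omega
            - Cp * (y_pairs[active_p][:, None] * C_pairs[active_p]).sum(axis=0)
            - Cq * Z_quads[active_q].sum(axis=0))
    omega = omega - lr * grad
    return np.maximum(omega, 0.0) if nonneg else omega
```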

We describe in the following how to apply the Newton method (Keerthi and DeCoste 2005; Chapelle 2007; Chapelle and Keerthi 2010) to solve Eq. (28) with good convergence properties. The primal Newton method (Chapelle and Keerthi 2010) is known to be fast for training SVM classifiers and RankSVMs. Since our vector model is an extension of the RankSVM model, learning is also fast.

1.2 Loss Functions

Let us first describe loss functions that are appropriate for the Newton method. Since the hinge loss function is not differentiable, we use differentiable approximations of \(L_1\) and \(L_{\delta _q}\) inspired by the Huber loss function.

For simplicity, we also constrain the domain of \(\delta _q\) to be 0 or 1 (i.e., \(\delta _q \in \{ 0, 1\}\)). The set \(\mathcal {N}\) can then be partitioned into two sets \(\mathcal {A}\) and \(\mathcal {B}\) such that for all:

  • \(q \in \mathcal {N}, \delta _q = 1 \Longleftrightarrow q \in \mathcal {A}\)

  • \(q \in \mathcal {N}, \delta _q = 0 \Longleftrightarrow q \in \mathcal {B}\)

In Eq. (28), we consider \(t_p = \varvec{\omega }^\top \mathbf c _p\) or \(t_q = \varvec{\omega }^\top \mathbf z _q\). Without loss of generality, let us consider \(t_r\) with \(r \in \varvec{\beta }\) (with \(\varvec{\beta } = \mathcal {S}\), \(\mathcal {D}\), \(\mathcal {A}\) or \(\mathcal {B}\)) and \(y \in \{ -1, +1 \}\). Our loss functions are written:

$$\begin{aligned} L_1^h(y,t_r)= & {} \left\{ \begin{array}{l l l l} 0 &{}\text { if } &{} y t_r > 1 + h &{} \text {set: } \varvec{\beta }_{1,y}^0\\ \frac{(1+h-yt_r)^2}{4h} &{}\text { if } &{} | 1 - y t_r | \le h &{} \text {set: } \varvec{\beta }_{1,y}^Q \\ 1 - y t_r &{} \text { if } &{} y t_r < 1 - h &{} \text {set: } \varvec{\beta }_{1,y}^L\\ \end{array} \right. \end{aligned}$$
(29)
$$\begin{aligned} L_0^h(y,t_r)= & {} \left\{ \begin{array}{l l l l} 0 &{}\text { if } &{} y t_r > 0 &{} \text {set: } \varvec{\beta }_{0,y}^0\\ \frac{t_r^2}{4h} &{}\text { if } &{} | - h - y t_r | \le h &{} \text {set: } \varvec{\beta }_{0,y}^Q\\ - h - y t_r &{} \text { if } &{} y t_r < - 2 h &{} \text {set: } \varvec{\beta }_{0,y}^L\\ \end{array} \right. \end{aligned}$$
(30)

where \(h \in [0.01,0.5]\). In all our experiments, we set \(h = 0.05\).
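A direct Python transcription of these two smoothed losses is given below, for scalar inputs with \(y \in \{-1, +1\}\) (the function names are ours).

```python
def huber_hinge(y, t, h=0.05):
    """L_1^h of Eq. (29): differentiable approximation of max(0, 1 - y*t)."""
    yt = y * t
    if yt > 1.0 + h:                  # zero-loss region
        return 0.0
    if abs(1.0 - yt) <= h:            # quadratic region
        return (1.0 + h - yt) ** 2 / (4.0 * h)
    return 1.0 - yt                   # linear region (yt < 1 - h)

def huber_hinge_no_margin(y, t, h=0.05):
    """L_0^h of Eq. (30): differentiable approximation of max(0, -y*t)."""
    yt = y * t
    if yt > 0.0:                      # zero-loss region
        return 0.0
    if yt >= -2.0 * h:                # quadratic region (|-h - yt| <= h)
        return t ** 2 / (4.0 * h)
    return -h - yt                    # linear region (yt < -2h)
```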

As described in Chapelle (2007), \(L_1^h\) is inspired by the Huber loss function; it is a differentiable approximation of the hinge loss (\(L_1(y,t) = \max (0,1-yt)\)) as \(h \rightarrow 0\). Similarly, \(L_0^h\) is a differentiable approximation, as \(h \rightarrow 0\), of \(L_0(y,t) = \max (0,-yt)\), the adaptation of the hinge loss without a safety margin. Given a set \(\varvec{\beta }\) and \(y \in \{ -1, +1\}\), we can define three disjoint subsets:

  • \(\varvec{\beta }_{i,y}^0\) is the subset of elements in \(\varvec{\beta }\) that have zero loss in \(L_i^h(y,\cdot )\).

  • \(\varvec{\beta }_{i,y}^Q\) is the subset of elements in \(\varvec{\beta }\) that are in the quadratic part of \(L_i^h(y,\cdot )\).

  • \(\varvec{\beta }_{i,y}^L\) is the subset of elements in \(\varvec{\beta }\) that are in the non-zero linear part of \(L_i^h(y,\cdot )\).

1.3 Gradient and Hessian Matrices

(Algorithm 2: global learning scheme; the figure is not reproduced here, see the description below.)

By considering \(L_1 = L_1^h\) and \(L_0 = L_0^h\) in Eq. (28), the gradient \(\triangledown \in \mathbb {R}^e\) of Eq. (28) w.r.t. \(\varvec{\omega }\) is:

$$\begin{aligned} \begin{aligned} \triangledown&= \varvec{\omega } + \frac{C_p}{2h} \sum _{p \in ({\mathcal {S} \cup \mathcal {D}})_{1,y_p}^Q} (\varvec{\omega }^{\top } \mathbf c _p - (1+h)y_p) \mathbf c _p \\&\quad - C_p \sum _{p \in ({\mathcal {S} \cup \mathcal {D}})_{1,y_p}^L} y_p \mathbf c _p + \frac{C_q}{2h} \sum _{q \in {\mathcal {A}}_{1,1}^Q} (\varvec{\omega }^{\top } \mathbf z _q - (1+h)) \mathbf z _q \\&\quad + \frac{C_q}{2h} \sum _{q \in {\mathcal {B}}_{0,1}^Q} (\varvec{\omega }^{\top } \mathbf z _q) \mathbf z _q - C_q \sum _{q \in (\mathcal {A}_{1,1}^L \cup \mathcal {B}_{0,1}^L)} \mathbf z _q \end{aligned} \end{aligned}$$
(31)

and the Hessian matrix \(\mathbf H \in \mathbb {R}^{e \times e}\) of Eq. (28) w.r.t. \(\varvec{\omega }\) is:

$$\begin{aligned} \mathbf H = \mathbf I _e + \frac{C_p}{2h} \sum _{p \in ({\mathcal {S} \cup \mathcal {D}})_{1,y_p}^Q} \mathbf c _p \mathbf c _p^{\top } + \frac{C_q}{2h} \sum _{q \in (\mathcal {A}_{1,1}^Q \cup \mathcal {B}_{0,1}^Q)} \mathbf z _q \mathbf z _q^{\top } \end{aligned}$$
(32)

where \(\mathbf I _e \in \mathbb {R}^{e \times e}\) is the identity matrix. \(\mathbf H \) is the sum of a positive definite matrix (\(\mathbf I _e\)) and of positive semi-definite matrices. \(\mathbf H \) is then positive definite, and thus invertible (because every positive definite matrix is invertible).

Proof

\(\mathbf H \) can be written \(\mathbf H = \mathbf I _e + \mathbf B \) with \(\mathbf B \in \mathbb {R}^{e \times e}\) a positive semi-definite matrix. For any vector \(\mathbf z \in \mathbb {R}^{e}\), we have \(\mathbf z ^{\top } \mathbf H \mathbf z = \mathbf z ^{\top }{} \mathbf I _e\mathbf z + \mathbf z ^{\top } \mathbf B \mathbf z \). By definition of positive (semi-)definiteness, for all nonzero \(\mathbf z \in \mathbb {R}^{e}\), \(\mathbf z ^{\top }{} \mathbf I _e\mathbf z > 0\) and \(\mathbf z ^{\top } \mathbf B \mathbf z \ge 0\). Hence, for all nonzero \(\mathbf z \in \mathbb {R}^{e}\), \(\mathbf z ^{\top } \mathbf H \mathbf z > 0\), and \(\mathbf H \) is a positive definite matrix. \(\square \)

The global learning scheme is described in Algorithm 2. The step size \(\eta _t > 0\) can be set to 1 and left unchanged, as in Chapelle (2007), or optimized at each iteration by line search (see Sect. 9.5.2 in Boyd and Vandenberghe 2004). The parameter \(\epsilon \ge 0\) determines the stopping criterion by controlling the \(\ell _2\)-norm of the difference of \(\varvec{\omega }\) between iterations t and \(t-1\).

Complexity. Computing the Hessian takes \(O(\sigma e^2)\) time (where \(\sigma = |({\mathcal {S} \cup \mathcal {D}})_{1,y_p}^Q|+|(\mathcal {A}_{1,1}^Q \cup \mathcal {B}_{0,1}^Q)|\)), and solving the linear system is \(O(e^3)\) because of the inversion of \(\mathbf H _t \in \mathbb {R}^{e \times e}\). This can be prohibitive if e is large, but we restrict \(e \le 1001\) in our experiments; the inversion of \(\mathbf H _t\) is then very fast. Other optimization methods are proposed in Chapelle and Keerthi (2010) (e.g., a truncated Newton method) if e is large.

Note that the Newton method is designed for unconstrained problems, where using \(\mathbf H ^{-1}\) at each iteration allows faster convergence to the global minimum. When \(\mathscr {C}^e\) is \(\mathbb {R}_+^e\), Eq. (28) is a constrained problem, and the minimum of the unconstrained problem is not necessarily the minimum of the constrained problem. However, since our loss functions in Eq. (28) are linear almost everywhere on their domain, the Hessian of the problem is close to the identity matrix and is affected almost exclusively by the regularization term. Applying a projected Newton method is therefore not a major issue in our case. If computing the inverse of the Hessian is too expensive, the Hessian can be omitted and a classic projected gradient method can be used.
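A minimal numpy sketch of one iteration of this scheme is given below: it assembles the gradient of Eq. (31) and the Hessian of Eq. (32), takes the Newton step with \(\eta _t = 1\), and projects onto \(\mathbb {R}_+^e\). The variable names are ours and the sketch is illustrative, not the authors' implementation.

```python
import numpy as np

def projected_newton_step(omega, C_pairs, y_pairs, Z_A, Z_B, Cp, Cq, h=0.05, nonneg=True):
    """One projected Newton iteration for Eq. (28).
    C_pairs: rows c_p with labels y_pairs in {-1, +1};
    Z_A / Z_B: rows z_q of quadruplets with delta_q = 1 / delta_q = 0."""
    e = omega.shape[0]
    t_p, t_a, t_b = C_pairs @ omega, Z_A @ omega, Z_B @ omega

    # Region masks of the smoothed losses (Eqs. 29-30).
    pQ = np.abs(1.0 - y_pairs * t_p) <= h        # pairs, quadratic part
    pL = y_pairs * t_p < 1.0 - h                 # pairs, linear part
    aQ = np.abs(1.0 - t_a) <= h                  # delta_q = 1 quadruplets, quadratic part
    aL = t_a < 1.0 - h                           # delta_q = 1 quadruplets, linear part
    bQ = (t_b <= 0.0) & (t_b >= -2.0 * h)        # delta_q = 0 quadruplets, quadratic part
    bL = t_b < -2.0 * h                          # delta_q = 0 quadruplets, linear part

    # Gradient, Eq. (31).
    grad = (omega
            + (Cp / (2 * h)) * C_pairs[pQ].T @ (t_p[pQ] - (1 + h) * y_pairs[pQ])
            - Cp * C_pairs[pL].T @ y_pairs[pL]
            + (Cq / (2 * h)) * Z_A[aQ].T @ (t_a[aQ] - (1 + h))
            + (Cq / (2 * h)) * Z_B[bQ].T @ t_b[bQ]
            - Cq * (Z_A[aL].sum(axis=0) + Z_B[bL].sum(axis=0)))

    # Hessian, Eq. (32).
    H = (np.eye(e)
         + (Cp / (2 * h)) * C_pairs[pQ].T @ C_pairs[pQ]
         + (Cq / (2 * h)) * (Z_A[aQ].T @ Z_A[aQ] + Z_B[bQ].T @ Z_B[bQ]))

    omega = omega - np.linalg.solve(H, grad)     # Newton step with eta_t = 1
    return np.maximum(omega, 0.0) if nonneg else omega
```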


Cite this article

Law, M.T., Thome, N. & Cord, M. Learning a Distance Metric from Relative Comparisons between Quadruplets of Images. Int J Comput Vis 121, 65–94 (2017). https://doi.org/10.1007/s11263-016-0923-4
