
Pattern Recognition

Volume 96, December 2019, 106981

Deep point-to-subspace metric learning for sketch-based 3D shape retrieval

https://doi.org/10.1016/j.patcog.2019.106981

Highlights

  • A representative-view selection (RVS) module is designed to identify the most representative views of a 3D shape and reduce redundancy.

  • A deep point-to-subspace metric learning (DPSML) module is proposed to calculate the query-adaptive similarity for sketch-based 3D shape retrieval.

  • The representation learning problem is formulated as a classification problem with a specially designed classifier and training loss.

  • State-of-the-art performance is achieved on the SHREC 2013, 2014 and 2016 benchmarks.

Abstract

One key issue in managing a large-scale 3D shape dataset is to identify an effective way to retrieve a shape of interest. The sketch-based query, which enjoys flexibility in representing the user’s intention, has received growing interest in recent years due to the popularization of touchscreen technology. Essentially, the sketch depicts an abstraction of a shape in a certain view, while the shape contains the full 3D information. Matching between them is a cross-modality retrieval problem, and the state-of-the-art solution is to project the sketch and the 3D shape into a common space in which the cross-modality similarity can be calculated as a feature similarity/distance. However, for a given query, only some viewpoints of the 3D shape are representative. Thus, blindly projecting a 3D shape into a feature vector without considering the query will inevitably introduce query-unrepresentative information. To handle this issue, in this work we propose a Deep Point-to-Subspace Metric Learning (DPSML) framework that projects a sketch into a feature vector and a 3D shape into a subspace spanned by a few selected basis feature vectors. The similarity between them is defined as the distance between the query feature vector and its closest point in the subspace, obtained by solving an optimization problem on the fly. Note that the closest point is query-adaptive and can reflect the viewpoint information that is representative of the given query. To efficiently learn such a deep model, we formulate it as a classification problem with a special classifier design. To reduce the redundancy of 3D shapes, we also introduce a Representative-View Selection (RVS) module that selects the most representative views of a 3D shape. By conducting extensive experiments on various datasets, we show that the proposed method outperforms its competitive baseline methods and attains state-of-the-art performance.

Introduction

With the rapid development of 3D sensing techniques, 3D shape data has received increasing research interest in the field of computer vision. As the volume of 3D shape data grows significantly, shape retrieval has become a crucial problem for 3D shape data management [1], [2], [3], [4], [5], [6]. In early work, each 3D shape was first labeled with a keyword, which was then used as the query for retrieval [7], [8]. However, keyword labeling is a time-consuming process and is also impractical for real-world applications, especially when dealing with large-scale datasets. Subsequently, considerable research was devoted to content-based 3D shape retrieval techniques that use a 3D shape as the query. However, acquiring the query shape itself is difficult due to the nature of the 3D modality. Recently, the prevalence of touchscreen devices (e.g., smartphones and tablet computers) has made the hand-drawn sketch a convenient way of representing the user’s intention. Compared with using a keyword or a 3D shape as the query, sketch-based 3D shape retrieval is more straightforward and thus easier to implement in practical applications [9], [10], [11], [12].

Hand-drawn sketches usually contain limited information and only reflect certain views of 3D shapes. As a result, obtaining discriminative 3D shape features that reduce the cross-modality discrepancy with sketches becomes a key issue. To extract 3D shape features, different 3D shape representations have been proposed. Recently, point-cloud based [13], [14], [15], [16] and multi-view based [17], [18], [19], [20] representations have gradually become the dominant choices. In particular, the multi-view based representations have achieved state-of-the-art performance so far [17], [18], [19], [20]. For this type of representation, the 3D shape is first rendered into a family of 2D views, as shown in Fig. 1. On top of that, one can then leverage well-established 2D image deep models (e.g., AlexNet [21], VGG [22] and ResNet [23]), pre-trained on large-scale datasets (e.g., ImageNet [24]), for feature extraction.
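As a concrete illustration of this multi-view pipeline, the sketch below extracts one feature vector per rendered view with a pre-trained 2D backbone. The backbone choice (ResNet-18), the number of views and the tensor shapes are assumptions made for the example, not the exact configuration used in the paper.

```python
import torch
import torchvision.models as models

# Assume the rendered views of one 3D shape arrive as a batch of
# ImageNet-normalized images of shape (num_views, 3, 224, 224).
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()  # drop the classifier; keep the 512-d pooled feature
backbone.eval()

views = torch.randn(12, 3, 224, 224)  # stand-in for 12 rendered views
with torch.no_grad():
    view_feats = backbone(views)       # (12, 512): one feature vector per view
```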

Despite the promising prospects of sketch-based 3D shape retrieval, there still exist three major challenges that hinder its development. First, free-hand sketch drawing is a subjective activity, resulting in large variation among different individuals. Second, the sketch and the 3D shape have a large cross-modality discrepancy, which makes it difficult to obtain modality-independent features. Third, a sketch usually reflects a certain view of a 3D shape, and the visual appearance of different views may vary significantly. The existing methods that address these problems can be coarsely categorized into traditional-descriptor based [2], [25] and deep-learned-descriptor based [26], [27] approaches. The first kind commonly applies hand-crafted or shallow-learned features to describe both sketches and 3D shapes for similarity measurement. Nevertheless, it is difficult to design discriminative feature descriptors that apply to both sketches and 3D shapes due to the large cross-modality discrepancy [11]. In contrast, the second kind, which is based on deep-learned features, is considered more robust and more discriminative; it can better accommodate the cross-modality discrepancy and attain improved retrieval accuracy.

As mentioned above, the query sketch is only representative of some views of a 3D shape, and the unrepresentative views contribute little or are even harmful to retrieval. However, many existing methods [20], [28], [29], [30] treat all views equally without considering the viewpoint information. To resolve this problem, we propose a Deep Point-to-Subspace Metric Learning (DPSML) framework. First, a Representative-View Selection (RVS) module is applied to obtain the most representative views of a 3D shape, and a subspace spanned by the feature vectors of the selected views is generated to describe the 3D shape. Then, the similarity between a sketch and a 3D shape is defined as the distance between the sketch feature vector and its closest point in the spanned subspace, obtained by solving an optimization problem on the fly. Note that the closest point is query-adaptive and can reflect the viewpoint information captured by the query sketch. Moreover, to learn a deep model efficiently, we formulate the representation learning problem as a classification problem without the pairwise sample learning process used by many existing methods [29], [31]. In summary, the proposed DPSML is an end-to-end framework, and its effectiveness and robustness are extensively demonstrated by experiments on three widely used benchmark datasets, i.e., SHREC 2013, 2014 and 2016.
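To make the point-to-subspace distance concrete, here is a minimal sketch, assuming the sketch feature is a vector q and the features of the selected views are stacked as the columns of a matrix B; the closest point in the spanned subspace then follows from an ordinary least-squares problem. The learned projections and any constraints on the coefficients are omitted.

```python
import numpy as np

def point_to_subspace_distance(q, B):
    """Distance from a query feature q (shape (d,)) to the subspace
    spanned by the columns of B (shape (d, k)), the selected view features.

    Solves min_w ||q - B w||_2 in closed form; the minimizer gives the
    query-adaptive closest point B w* in the subspace.
    """
    w, *_ = np.linalg.lstsq(B, q, rcond=None)
    closest = B @ w
    return np.linalg.norm(q - closest), closest

# Toy usage with hypothetical 256-d features and 4 selected views.
rng = np.random.default_rng(0)
q = rng.standard_normal(256)        # sketch feature
B = rng.standard_normal((256, 4))   # one column per selected view
dist, _ = point_to_subspace_distance(q, B)
```

Because the coefficients w are re-solved for every query, the recovered closest point adapts to whichever viewpoint the sketch actually depicts.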

The rest of this paper is organized as follows. Section 2 reviews the related work most relevant to the proposed method and gives a method overview. Section 3 presents a detailed explanation of the proposed framework. Section 4 provides the details of the benchmark datasets, the evaluation metrics and the implementation. The experimental results and comparisons with state-of-the-art methods, along with a discussion, are provided in Section 5. Finally, Section 6 concludes this work.


Related works

The work in [12], [32] provided a comprehensive survey and comparison of sketch-based 3D shape retrieval methods. In the following, we restrict the review to representative methods closely related to this work. More specifically, we cover traditional sketch-based 3D shape retrieval methods (e.g., hand-crafted or shallow-learned features) and deep-learned descriptors for 3D shape retrieval in Sections 2.1.1 and 2.1.2, respectively.

Methodology

As shown in Fig. 2, the proposed framework mainly contains three modules. First, the feature extraction module is described in Section 3.1. Then, the details of the proposed RVS module are given in Section 3.2. Last, the DPSML framework is described in detail in Section 3.3.

Experimental setups

To demonstrate the effectiveness of the proposed method, we evaluate it on three public benchmark datasets, i.e., SHREC 2013 [12], [33], SHREC 2014 [32], [46] and SHREC 2016 [47]. We first introduce the experimental setups, including the details of the benchmark datasets and the evaluation metrics used. Next, we present the implementation details of our framework. Then, we calculate all the metrics to investigate the performance and compare our results against state-of-the-art methods.
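For reference, the following is a generic implementation of mean average precision (mAP), one of the standard retrieval metrics reported on these benchmarks; the official SHREC evaluation scripts may differ in details.

```python
import numpy as np

def mean_average_precision(ranked_labels, query_labels):
    """Standard retrieval mAP.

    ranked_labels[i]: class labels of the gallery, sorted by decreasing
    similarity to query i; query_labels[i]: the class of query i.
    """
    aps = []
    for ranks, q in zip(ranked_labels, query_labels):
        rel = (np.asarray(ranks) == q).astype(float)
        if rel.sum() == 0:          # no relevant gallery item for this query
            aps.append(0.0)
            continue
        prec_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append(float((prec_at_k * rel).sum() / rel.sum()))
    return float(np.mean(aps))
```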

Evaluation on the SHREC 2013 dataset

Our proposed method is based on efficient point-to-subspace learning. To further improve the retrieval accuracy, a modified center learning method is used as part of the loss function. To demonstrate the effectiveness of the RVS module, we compare the performance of the proposed method under different fusion operations, i.e., average pooling and FC-layer based fusion (sketched below). We also report the results with and without the “center learning” method described in Section 3.3.2.
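The two fusion baselines can be sketched as follows; the feature dimension, the number of views and the FC layer size are assumptions made for illustration, not the paper’s exact design.

```python
import torch
import torch.nn as nn

view_feats = torch.randn(12, 512)   # hypothetical per-view features of one shape

# Baseline 1: average pooling collapses the views into a single vector.
avg_feat = view_feats.mean(dim=0)                # (512,)

# Baseline 2: an FC layer learns a fixed fusion of the concatenated views.
fc_fuse = nn.Linear(12 * 512, 512)
fc_feat = fc_fuse(view_feats.flatten())          # (512,)
```

Unlike either fixed fusion, the subspace representation defers the combination of views to query time.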

Conclusions

In this paper, we propose a novel DPSML framework for sketch-based 3D shape retrieval. First, the raw features for both sketches and 3D shapes (represented by 12 rendered views) are extracted via pre-trained deep models (AlexNet, VGG and ResNet). Second, an RVS module is introduced to reduce the redundancy of the rendered views, resulting in a set of the most representative views. Then, the sketch is projected into a feature point and the 3D shape is projected into a subspace spanned by …

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61403265 and 61602499. It was partially supported by the Key Research and Development Program of Sichuan Province (No. 2019YFG0409). Lingqiao Liu was supported in part by ARC DECRA Fellowship DE170101259. This work was also partially supported by the Fundamental Research Funds for the Central Universities (No. 18lgzd06).


References (53)

  • P. Shilane et al.

    The Princeton Shape Benchmark

    Proceedings of the Shape Modeling Applications

    (2004)
  • J.W. Tangelder et al.

    A survey of content based 3D shape retrieval methods

    Proceedings of Shape Modeling Applications

    (2004)
  • M. Eitz et al.

    Sketch-based shape retrieval

    ACM Trans. Graphics (TOG)

    (2012)
  • B. Gong et al.

    Learning semantic signatures for 3D object retrieval

    IEEE Trans. Multimedia

    (2013)
  • T. Furuya et al.

    Ranking on cross-domain manifold for sketch-based 3D model retrieval

    Proceedings of the Cyberworlds (CW)

    (2013)
  • Z. Wu et al.

    3D ShapeNets: a deep representation for volumetric shapes

    Proceedings of the Computer Vision and Pattern Recognition (CVPR)

    (2015)
  • A. Garcia Garcia et al.

    PointNet: a 3D convolutional neural network for real-time object class recognition

    Proceedings of the International Joint Conference on Neural Networks (IJCNN)

    (2016)
  • X. Liu, Z. Han, Y. Liu, M. Zwicker, Point2Sequence: learning the shape representation of 3D point clouds with an...
  • J. Xie et al.

    Learning barycentric representations of 3D shapes for sketch-based 3D shape retrieval

    Proceedings of the Computer Vision and Pattern Recognition (CVPR)

    (2017)
  • A. Kanezaki et al.

    RotationNet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints

    Proceedings of the Computer Vision and Pattern Recognition (CVPR)

    (2018)
  • Y. Feng et al.

    GVCNN: group-view convolutional neural networks for 3D shape recognition

    Proceedings of the Computer Vision and Pattern Recognition (CVPR)

    (2018)
  • H. You et al.

    PVNet: a joint convolutional network of point cloud and multi-view for 3D shape recognition

    Proceedings of the ACM International Conference on Multimedia

    (2018)
  • A. Krizhevsky et al.

    ImageNet classification with deep convolutional neural networks

    Proceedings of the Neural Information Processing Systems (NeurIPS)

    (2012)
  • K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556v1...
  • K. He et al.

    Deep residual learning for image recognition

    Proceedings of the Computer Vision and Pattern Recognition (CVPR)

    (2016)
  • J. Deng et al.

    ImageNet: a large-scale hierarchical image database

    Proceedings of the Computer Vision and Pattern Recognition (CVPR)

    (2009)

    Yinjie Lei received his M.S. degree from Sichuan University (SCU), China, with the area of Image Processing in 2009, and the Ph.D. degree in Computer Vision from University of Western Australia (UWA), Australia in 2013. He is currently an associate professor with the college of Electronics and Information Engineering at SCU. He serves as the vice dean of the College of Electronics and Information Engineering at SCU since 2017. His research interests mainly include deep learning, 3D biometrics, object recognition and semantic segmentation.

    Ziqin Zhou received her bachelor’s degree from Sichuan University (SCU), China, in 2017. She is currently pursuing the M.S. degree with the Electronics and Information Engineering at SCU. Her current research interests include 3D shape analysis and neural architecture search.

    Pingping Zhang received his B.E. degree in mathematics and applied mathematics, Henan Normal University (HNU), Xinxiang, China, in 2012. He is currently a Ph.D. candidate in the School of Information and Communication Engineering, Dalian University of Technology (DUT), Dalian, China. His research interests include deep learning, saliency detection, object tracking and semantic segmentation.

    Yulan Guo received the B.Eng. and Ph.D. degrees from National University of Defense Technology (NUDT) in 2008 and 2015, respectively. He was a visiting Ph.D. student with the University of Western Australia from 2011 to 2014. He is currently an Assistant Professor with the College of Electronic Science, NUDT. He has authored over 60 articles in journals and conferences, such as the IEEE TPAMI and IJCV. His current research interests focus on 3D vision, particularly on 3D feature learning, 3D modeling, 3D object recognition, and 3D biometrics. Dr. Guo received the NUDT Outstanding Doctoral Dissertation Award in 2015 and the CAAI Outstanding Doctoral Dissertation Award in 2016. He served as an associate editor for IET Computer Vision, a guest editor for IEEE TPAMI, a PC member for several international conferences (e.g., ACM MM, IJCAI, AAAI), and a reviewer for over 30 international journals and conferences.

    Zijun Ma received her bachelor’s degree from Sichuan University (SCU), China in 2018. She is currently pursuing the M.S. degree with the Electronics and Information Engineering at SCU. Her current research interests include 3D shape analysis and deep model compression.

    1. The second author contributed equally to this work as the first author.
