Abstract
Real-time SLAM is a prerequisite for online virtual and augmented reality (VR and AR) applications on mobile devices. Observing that efficient feature matching is crucial for both 3D mapping and camera localization in feature-based SLAM, we propose a clustering-forest-based metric for feature matching. Instead of relying on a predefined cluster number as in the k-means-based feature hierarchy, the proposed forest learns the underlying feature distribution by itself, and the affinity estimation is based on efficient forest traversals. To account for spatial consistency, each matching feature pair is assigned a confidence score by virtue of contextual leaf assignments, which reduces the number of RANSAC iterations. Furthermore, an incremental forest growth scheme is presented for robust exploration of new scenes. This framework facilitates fast SLAM for VR and AR applications on mobile devices.
1 Introduction
Simultaneous localization and mapping (SLAM) plays an important role in VR and AR applications on mobile devices (Fig. 1). SLAM has developed rapidly in recent years with the introduction of several systems, such as PTAM [8], LSD-SLAM [6], and ORB-SLAM [10]. Feature-based SLAM is known to be effective for global 3D mapping and camera localization, in particular because it is more invariant to viewpoint and illumination changes than direct SLAM methods. A group of image features, including SIFT [9], SURF [1], BRIEF [4], ORB [14], and bag of words [7], have been used in feature-based SLAM. The ORB feature has an obvious advantage over the others for real-time SLAM because it is fast to extract. However, without GPU and PC support, ORB-SLAM reaches only limited frame rates on mobile devices [15], which is not enough for online applications.
Because feature matching is the time-consuming step in both map generation and camera localization in feature-based SLAM, we investigate an adaptation of ORB-SLAM that uses a clustering forest to establish feature correspondences quickly (see Fig. 2). Compared with the hierarchical vocabulary tree [10], there is no need to predefine the number of clusters in the training phase of the feature forest. Moreover, feature affinity estimation requires only a limited number of binary comparisons during forest traversals. Taking spatial consistency into account, we propose a confidence score for feature matching by virtue of feature contexts. Matching pairs with similar contextual leaf assignments are assumed to be reliable. Furthermore, in contrast to the fixed vocabulary tree, we present an incremental adaptation of the forest to accommodate newly explored keyframes. The main contribution of this paper is a forest-based method for efficient feature matching, and hence for fast SLAM on mobile devices.
2 Feature Forest
The clustering forest works in an unsupervised manner without prior labeling and is known for learning the underlying data distribution by itself. The optimal node splitting parameters are learned by maximizing the information gain I as in the density forest [5]. We use the trace operator [12] to avoid the rank deficiency of the covariance matrix \(\sigma (F)\) of the high-dimensional ORB feature set F. Here we measure the information gain by the Hamming metric.
where \(|\cdot |\) returns the cardinality of the feature set \(F_{k}\) in the left and right child nodes. The ORB feature is a 256-dimensional binary vector, with each 8-bit byte serving as a feature channel. The binary split function is \(\phi (s,\rho ,\tau )=[\Vert f_{(s)}-\rho \Vert _h<\tau ],\) where \([\cdot ]\) is the indicator function and \(\Vert \cdot \Vert _h\) is the Hamming distance. Features whose channel \(f_{(s)}, s\in [1,32]\), has a Hamming distance to byte \(\rho \) lower than threshold \(\tau \) are assigned to the left child node.
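The byte-level split test can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `hamming_byte` and `split_test` are hypothetical, and the channel index is 0-based here, whereas the text indexes channels from 1 to 32.

```python
def hamming_byte(a: int, b: int) -> int:
    """Hamming distance between two 8-bit values (number of differing bits)."""
    return bin((a ^ b) & 0xFF).count("1")


def split_test(feature: bytes, s: int, rho: int, tau: int) -> bool:
    """phi(s, rho, tau): True sends the feature to the left child node.

    feature is a 32-byte ORB descriptor, s a 0-based channel index
    (an assumption of this sketch), rho the reference byte, tau the threshold.
    """
    return hamming_byte(feature[s], rho) < tau


# A descriptor whose first channel differs from rho in exactly one bit.
descriptor = bytes([0b10110000] + [0] * 31)
go_left = split_test(descriptor, s=0, rho=0b10100000, tau=2)
```

Each branch node therefore costs one XOR and one bit count on a single byte, rather than a distance over the full 256-bit descriptor.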
The forest is composed of five independent decision trees learned from randomly selected feature subsets. Tree growth terminates when the number of instances inside a leaf node falls below a predefined threshold \(\gamma \); we set \(\gamma =50.\) Each tree has approximately 10 layers. The binary decision trees in the feature forest are deeper than the vocabulary tree, but forest traversals remain fast because only a binary test is evaluated at each branch node. Since the parameters of the hierarchical forest model consist only of the binary tests in the branch nodes, together with the mean representor \(f_{\ell }\) and instance number \(n_{\ell }\) of each leaf node, the model is easy to load into the memory of a mobile device.
2.1 Affinity Estimation
Given the feature forest, it is straightforward to estimate pairwise affinities of ORB features. Two ORB features that reach the same leaf node are assumed to be similar and are assigned a distance of 0; otherwise the distance is 1. The forest with \(n_T\) trees yields the distance matrix \(D=\frac{1}{n_T}\sum _{k=1}^{n_T}D_k\), where \(D_k(f_i, f_j)=0\) if \(\ell (f_i)=\ell (f_j)\) in the k-th tree and \(D_k(f_i, f_j)=1\) otherwise. \(\ell (f)\) denotes the leaf node of feature f. Given the distance matrix D between the ORB feature set \(F_n\) of the newly explored frame and \(F_o\) of the already stored keyframes, each feature \(f_i\in F_n\) is matched to \(\arg \min _{f_j\in F_o} D(f_i,f_j)\).
That is, the feature pair with the smallest pairwise distance is assumed to be the matching pair.
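As an illustration of the forest distance, the sketch below assumes that every feature has already been passed down each tree and is represented by its list of leaf indices, one per tree. The names `forest_distance` and `best_matches` are hypothetical. The distance is 0 per tree when two features share a leaf and 1 otherwise, averaged over the trees.

```python
from typing import List, Sequence


def forest_distance(leaves_a: Sequence[Sequence[int]],
                    leaves_b: Sequence[Sequence[int]]) -> List[List[float]]:
    """D[i][j] = fraction of trees in which feature i and feature j
    fall into different leaves. leaves_x[i][k] is the leaf index that
    feature i reaches in tree k."""
    n_trees = len(leaves_a[0])
    dist = []
    for la in leaves_a:
        row = []
        for lb in leaves_b:
            differing = sum(1 for k in range(n_trees) if la[k] != lb[k])
            row.append(differing / n_trees)
        dist.append(row)
    return dist


def best_matches(dist: List[List[float]]) -> List[int]:
    """For every row (new feature), index of the stored feature with
    the smallest forest distance."""
    return [min(range(len(row)), key=row.__getitem__) for row in dist]


# Toy example with 3 trees: two new features, three stored features.
new_leaves = [[3, 7, 1], [4, 2, 9]]
old_leaves = [[3, 7, 5], [4, 2, 9], [0, 0, 0]]
D = forest_distance(new_leaves, old_leaves)
matches = best_matches(D)
```

This explicit double loop is only for clarity. In practice the matrix does not have to be filled entry by entry, because features sharing a leaf can be looked up directly.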
Note that the pairwise distance entries are determined by the binary functions \(\phi \) stored in the branch nodes. The depth \(\nu \) of a balanced tree depends on the cardinality of the training data F, with \(\nu =\log _2 |F|.\) The time cost of the pairwise distance matrix between ORB feature sets \(F_i\) and \(F_j\) is \(O((|F_i|+|F_j|)\cdot \nu \cdot n_T)\). In our experiments, \(\nu \in [9,12]\) and \(n_T=5\). This is lower than the cost of the common pairwise distance computation of ORB features, which has a complexity of \(O(|F_i|\cdot |F_j|).\)
Similar to the vocabulary tree [10], the feature forest stores the direct and inverse indices between leaf nodes and features on keyframes. There are approximately \(|F|/\gamma \) leaf nodes, so a leaf index can be encoded in \(\log _2 (|F|/\gamma )\) bits. On the keyframes of already explored scenes, the direct index maps each ORB feature to its leaf nodes in the feature forest, as shown in Fig. 3. The inverse index, in turn, stores for each leaf node all the ORB features of keyframes that reach it. For the correspondence estimation between the newly explored frame \(F_n\) and the stored keyframes, only the forest traversals of \(F_n\) are needed, with a complexity of \(O(|F_n|\cdot \nu \cdot n_T)\) byte-based binary comparisons. The online distance matrix update for the newly explored frame is therefore far cheaper than the common pairwise distance computation with a complexity of \(O(|F_n|\cdot |F_o|).\) It is also cheaper than the vocabulary tree, which requires \(O(|F_n|\cdot k \cdot \nu )\) Hamming distance computations on 256-dimensional features, with k clusters at each split.
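The role of the inverse index can be illustrated with a small sketch. It assumes the index is a dictionary keyed by (tree, leaf) pairs; this layout and the names `build_inverse_index` and `vote_candidates` are assumptions of the example, not details given in the paper. A query feature only touches the stored features that share a leaf with it, and the forest distance to such a feature is one minus its vote count divided by the number of trees.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def build_inverse_index(leaf_ids: Sequence[Sequence[int]]) -> Dict[Tuple[int, int], List[int]]:
    """Map (tree k, leaf) -> ids of stored features that reach that leaf.
    leaf_ids[i][k] is the leaf of stored feature i in tree k."""
    index: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for i, leaves in enumerate(leaf_ids):
        for k, leaf in enumerate(leaves):
            index[(k, leaf)].append(i)
    return index


def vote_candidates(index: Dict[Tuple[int, int], List[int]],
                    query_leaves: Sequence[int]) -> Dict[int, int]:
    """For one query feature, count in how many trees each stored
    feature shares a leaf with it. Features never seen get no entry."""
    votes: Dict[int, int] = defaultdict(int)
    for k, leaf in enumerate(query_leaves):
        for i in index.get((k, leaf), ()):
            votes[i] += 1
    return dict(votes)


stored = [[3, 7, 5], [4, 2, 9], [3, 2, 0]]   # 3 stored features, 3 trees
inverse = build_inverse_index(stored)
votes = vote_candidates(inverse, [3, 2, 9])  # leaves reached by the query
```

Stored features that never share a leaf with the query are never visited, which is why the online cost is dominated by the traversals of the new frame.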
2.2 Matching Confidence
Because of spatial consistency and perspective geometry, the correspondences of neighboring ORB features in one frame tend to be close to each other in other frames or in the 3D map. We therefore no longer treat all matching pairs equally, as traditional feature-based SLAM does. Instead, we assign a confidence score to each matching feature pair \((f_i, f_j)\).
where the function \(\theta _k(\mathcal {N}(f))\) returns the leaf indices of the surrounding context \(\mathcal {N}(f)\) of feature f with respect to the k-th decision tree. The direct index of ORB features described in Sect. 2.1 is used to obtain the leaf index set of the feature context \(\mathcal {N}(f)\). The confidence score is computed from the intersection \(\wedge \) of the contextual leaf assignments of the corresponding features \(f_i\) and \(f_j\). Since the decision trees in the feature forest are constructed almost independently, we consider all decision trees in the forest when measuring the consistency of contextual leaf assignments. Z is a normalization constant. In our experiments, the size of the context patch is set to \(1\%\) of the image size. A matching pair is denoted as a triplet \(\langle f_i,f_j, \alpha (f_i,f_j)\rangle \).
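The following sketch shows one plausible reading of the score described above. It assumes that the score counts the leaf indices shared by the two contexts in each tree and that Z is the sum over trees of the larger context set size, so that the score lies in [0, 1]. Both choices, and the name `confidence_score`, are assumptions made for illustration.

```python
from typing import Sequence, Set


def confidence_score(ctx_i: Sequence[Set[int]], ctx_j: Sequence[Set[int]]) -> float:
    """ctx_i[k] is the set of leaf indices, in tree k, of the features
    surrounding f_i (its context); likewise ctx_j for f_j.

    Assumed form: alpha = (1/Z) * sum_k |ctx_i[k] & ctx_j[k]|,
    with Z = sum_k max(|ctx_i[k]|, |ctx_j[k]|) chosen for this sketch."""
    shared = sum(len(a & b) for a, b in zip(ctx_i, ctx_j))
    z = sum(max(len(a), len(b)) for a, b in zip(ctx_i, ctx_j))
    return shared / z if z else 0.0


# Two trees; the contexts agree on 2 of 3 leaves in tree 0 and on both in tree 1.
ctx_a = [{1, 2, 3}, {4, 5}]
ctx_b = [{2, 3, 9}, {4, 5}]
alpha = confidence_score(ctx_a, ctx_b)
```

A pair whose neighborhoods land in the same leaves across most trees receives a score close to 1, while an isolated coincidental match receives a score close to 0.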
Feature pairs with large confidence scores are likely to be correct matches. The feature matches are sorted according to their confidence scores, and the 3D mapping and camera localization preferentially use the pairs with high scores. For instance, the RANSAC process for camera localization prefers matching pairs with large confidence scores. We observe that the weighted RANSAC using the confidence scores tends to terminate after a small number of iterations.
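The paper does not spell out the weighting scheme used in RANSAC. One common way to let high-confidence pairs dominate the hypotheses is to draw each minimal sample with probability proportional to the score, which the sketch below illustrates. The function `sample_matches`, the small epsilon, and the sampling rule itself are assumptions of this example.

```python
import random
from typing import List, Sequence, Tuple

Match = Tuple[int, int, float]  # (index of f_i, index of f_j, confidence score)


def sample_matches(matches: Sequence[Match], size: int, rng: random.Random) -> List[Match]:
    """Draw `size` distinct matches without replacement, each draw
    proportional to the confidence score of the remaining matches."""
    pool = list(matches)
    chosen: List[Match] = []
    for _ in range(min(size, len(pool))):
        # The epsilon keeps the weights valid when every score is zero.
        weights = [m[2] + 1e-9 for m in pool]
        pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(pick))
    return chosen


candidates = [(0, 0, 0.9), (1, 1, 0.8), (2, 5, 0.0), (3, 3, 0.7)]
sample = sample_matches(candidates, 3, random.Random(0))
```

Because unreliable pairs are rarely drawn, an outlier-free minimal sample is found earlier, which is consistent with the observed reduction in iterations.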
2.3 Online Forest Refinement
The feature forest is trained offline. As scene exploration goes on, more and more keyframes and ORB features are located and stored. In this work, we present an online forest refinement scheme with incremental tree growth to accommodate the newly added features on the keyframes, which facilitates adaptation to the new scene. Similar to [13], we incrementally split the leaf nodes with the available online data. A candidate leaf node is split during online forest refinement when two criteria are met: (1) the number of newly added features in the leaf node is larger than a predefined threshold, namely \(\gamma ,\) the same as the predefined leaf size; (2) the deviation of the mean of the newly added features \(F_{n,\ell }\) from the offline-learned leaf node representor \(f_\ell \) is large enough.
We measure this deviation relative to the distance between \(f_\ell \) and the representor \(f_\ell '\) of its sibling node. When \(\Vert f_\ell - \bar{F}{_{n,\ell }}\Vert >\beta \Vert f_\ell -f_\ell '\Vert ,\) the second criterion is met. The constant coefficient \(\beta \) is set to 0.5. The leaf nodes of the feature forest are incrementally split, and the tree grows, when the above two criteria are met, as shown in Fig. 3(c). The optimal splitting parameters are determined by maximizing the information gain as described in Sect. 2. To take into account the features assigned to the leaf node in the training phase, we employ the weighted covariance matrix to estimate the information gain. The following weights are assigned to the newly added features \(F_{n,\ell }\) and the offline-learned leaf node representor \(f_\ell \).
Different from the unweighted information gain estimation in the training phase (Sect. 2), the trace of the covariance matrix \(\sigma (F_k)\) of the child node is defined as
The center of the leaf node is computed as a weighted mean, \(\bar{ F} =\sum _{i=1}^{|F|} u_i f_i.\) Note that incremental tree growth changes the tree configuration, so the direct and inverse indices are updated accordingly. We keep a dynamic leaf node index list, so that the features in the already explored keyframes can be reassigned to the leaf nodes created by online splitting. Since leaf node splitting handles only a limited number of instances, the leaf-splitting-based forest refinement is efficient enough for online adaptation to new scenes.
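The two split criteria of this section can be checked with a few lines of code. The sketch below assumes that the norm is the Hamming distance and that the mean of the newly added features has already been binarized into a 32-byte descriptor; neither detail is stated explicitly in the text. The names `hamming` and `should_split` are hypothetical.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Hamming distance between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def should_split(new_count: int, new_mean: bytes, leaf_rep: bytes,
                 sibling_rep: bytes, gamma: int = 50, beta: float = 0.5) -> bool:
    """Return True when both online split criteria hold.

    (1) more than gamma newly added features reached the leaf;
    (2) their mean deviates from the leaf representor by more than
        beta times the distance between the leaf and its sibling."""
    enough_features = new_count > gamma
    large_drift = hamming(leaf_rep, new_mean) > beta * hamming(leaf_rep, sibling_rep)
    return enough_features and large_drift


leaf = bytes(32)                        # offline leaf representor f_l
sibling = bytes([0xFF] + [0] * 31)      # sibling representor, distance 8
drifted = bytes([0x1F] + [0] * 31)      # mean of new features, distance 5
split_now = should_split(60, drifted, leaf, sibling)
```

With \(\gamma =50\) and \(\beta =0.5\) as in the text, the toy leaf above is split because 60 new features arrived and their mean moved more than halfway toward the sibling.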
3 Experimental Results
We perform experiments on a mobile device to evaluate the proposed method. We use a Samsung Galaxy S7 with a 1.6 GHz Snapdragon 820 processor and 4 GB of RAM. The stereo gray images are captured by a uSens Fingo camera, as shown in Fig. 1(a, b). The proposed method establishes the feature correspondences in both the 3D mapping and tracking processes using the feature forest. The proposed system runs in real time and achieves up to 60 FPS without GPU or PC support.
Given the feature correspondences, the 3D maps and continuous camera locations are obtained as shown in Figs. 1 and 4. We test one virtual scene with a colored balloon and several white blocks. With the hand-held mobile phone, we can freely explore the virtual environment, as shown in the supplemental video. We illustrate the feature matching between keyframes in Fig. 5. The proposed method obtains ORB feature matches robustly under viewpoint and illumination variations.
We report the precision and recall rates of the proposed feature forest (FF) and the incremental feature forest (IFF) with online refinement on public SLAM datasets, including New College [16], Bicocca25b [3], Ford2 [11], and Malaga6L [2], as listed in Table 1. The proposed IFF method improves on both the comparable bag-of-words (BoW) method [7] and the FF method.
We also report the precision and recall of the proposed FF and IFF methods on different types of indoor scenes, including the table/chair, the plant, and the poster, as shown in Table 2. We observe that the posters, which have abundant textures, yield higher precision and recall rates than the other types of objects. The IFF approach with online refinement improves on the original feature forest. We believe the reason is that adaptation to the new scene enables more accurate affinity estimation and feature matching.
4 Conclusion
This paper presents a random-forest-based fast feature matching technique for SLAM on mobile devices. The proposed method combines the offline feature forest with online incremental forest adaptation to estimate feature affinities and matching confidences. The matching confidence scores reduce the candidate search space and facilitate real-time SLAM for VR and AR applications on mobile devices.
References
1. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006). https://doi.org/10.1007/11744023_32
2. Blanco, J.L., Moreno, F.A., Gonzalez, J.: A collection of outdoor robotic datasets with centimeter-accuracy ground truth. Auton. Robots 27(4), 327 (2009)
3. Bonarini, A., Burgard, W., Fontana, G., Matteucci, M., Sorrenti, D.G., Tardos, J.D.: Rawseeds: robotics advancement through web-publishing of sensorial and elaborated extensive data sets. In: Proceedings of IROS, vol. 6 (2006)
4. Calonder, M., Lepetit, V., Strecha, C., Fua, P.: BRIEF: binary robust independent elementary features. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 778–792. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15561-1_56
5. Criminisi, A., Shotton, J., Konukoglu, E.: Decision forests for classification, regression, density estimation, manifold learning and semi-supervised learning. Microsoft Research Cambridge, Technical report MSRTR-2011-114 5(6), 12 (2011)
6. Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: large-scale direct monocular SLAM. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part II. LNCS, vol. 8690, pp. 834–849. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10605-2_54
7. Gálvez-López, D., Tardos, J.D.: Bags of binary words for fast place recognition in image sequences. IEEE Trans. Robot. 28(5), 1188–1197 (2012)
8. Klein, G., Murray, D.: Parallel tracking and mapping for small AR workspaces. In: IEEE and ACM International Symposium on Mixed and Augmented Reality, pp. 225–234. IEEE (2007)
9. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
10. Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Robot. 31(5), 1147–1163 (2015)
11. Pandey, G., McBride, J.R., Eustice, R.M.: Ford campus vision and lidar data set. Int. J. Robot. Res. 30(13), 1543–1552 (2011)
12. Pei, Y., Kim, T.K., Zha, H.: Unsupervised random forest manifold alignment for lipreading. In: IEEE International Conference on Computer Vision, pp. 129–136 (2013)
13. Ristin, M., Guillaumin, M., Gall, J., Van Gool, L.: Incremental learning of random forests for large-scale image classification. IEEE Trans. Pattern Anal. Mach. Intell. 38(3), 490–503 (2016)
14. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative to SIFT or SURF. In: IEEE International Conference on Computer Vision, pp. 2564–2571. IEEE (2011)
15. Shridhar, M., Neo, K.Y.: Monocular SLAM for real-time applications on mobile platforms (2015)
16. Smith, M., Baldwin, I., Churchill, W., Paul, R., Newman, P.: The new college vision and laser data set. Int. J. Robot. Res. 28(5), 595–599 (2009)
© 2018 Springer Nature Switzerland AG
Cite this paper
Guo, Y., Pei, Y. (2018). Incremental Feature Forest for Real-Time SLAM on Mobile Devices. In: Lai, JH., et al. Pattern Recognition and Computer Vision. PRCV 2018. Lecture Notes in Computer Science(), vol 11256. Springer, Cham. https://doi.org/10.1007/978-3-030-03398-9_37
DOI: https://doi.org/10.1007/978-3-030-03398-9_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03397-2
Online ISBN: 978-3-030-03398-9
eBook Packages: Computer Science, Computer Science (R0)