Multimedia engineering 2016: research progress and development trends in image retrieval

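Yu Junqing1, Wu Zebin1, Wu Fei2, Sun Lifeng3 (1. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074; 2. College of Computer Science, Zhejiang University, Hangzhou 310058; 3. Department of Computer Science and Technology, Tsinghua University, Beijing 100084)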

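Abstract
Objective Content-based image retrieval uses features extracted from images to find images similar to a query image as accurately as possible at low time and space cost. Method This paper reviews research progress and open challenges in image retrieval, in China and internationally, from three perspectives (shallow features, deep features, and feature fusion) and discusses future development trends. Result The scale-invariant feature transform (SIFT) lacks spatial geometric information and color information and expresses high-level semantics poorly, whereas convolutional neural network (CNN) features often lack sufficient low-level information. To enrich the information carried by a descriptor, SIFT is usually fused with CNN and other features. The main fusion approaches are concatenation, kernel fusion, graph fusion, index-level fusion, and score-level fusion. Fusion effectively exploits the complementarity of different features and improves retrieval accuracy. Conclusion Compared with SIFT, CNN features are weaker in both generalizability and geometric invariance, which remains a challenge for the image retrieval field.
Keywords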
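As a rough illustration of two of the fusion types named above, the following sketch shows concatenation (which works directly on the feature vectors) and score-level fusion (which works on the retrieval results). The abstract does not say how scores are normalized or weighted, so the min-max normalization, the equal default weights, and all function names and toy values here are assumptions made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def concatenate_features(*descriptors):
    """Concatenation fusion: L2-normalize each descriptor, then join them."""
    fused = []
    for descriptor in descriptors:
        fused.extend(l2_normalize(descriptor))
    return fused


def min_max_normalize(scores):
    """Rescale a {image_id: score} dict to [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {image_id: 0.0 for image_id in scores}
    return {image_id: (s - lo) / (hi - lo) for image_id, s in scores.items()}


def fuse_scores(score_dicts, weights=None):
    """Score-level fusion: weighted sum of normalized per-feature scores.

    An image missing from one score dict simply receives no contribution
    from that feature (an assumption of this sketch).
    Returns (image_id, fused_score) pairs, best match first.
    """
    if weights is None:
        weights = [1.0 / len(score_dicts)] * len(score_dicts)
    fused = {}
    for scores, weight in zip(score_dicts, weights):
        for image_id, s in min_max_normalize(scores).items():
            fused[image_id] = fused.get(image_id, 0.0) + weight * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical similarity scores of three database images to one query.
sift_scores = {"a": 0.9, "b": 0.2, "c": 0.5}
cnn_scores = {"a": 0.3, "b": 0.8, "c": 0.7}

ranking = fuse_scores([sift_scores, cnn_scores])
fused_descriptor = concatenate_features([3.0, 4.0], [0.0, 2.0])
```

In this toy example image "c" ranks first after fusion even though neither feature alone ranks it first, which is the kind of complementarity the abstract attributes to fusion.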
Multimedia technology 2016: advances and trends in image retrieval

Yu Junqing1, Wu Zebin1, Wu Fei2, Sun Lifeng3 (1. Computer Department of Huazhong University of Science and Technology, Wuhan 430074, China; 2. Computer Department of Zhejiang University, Hangzhou 310058, China; 3. Computer Department of Tsinghua University, Beijing 100084, China)

Abstract
Objective Content-based image retrieval uses features extracted from an image to retrieve similar images from a large-scale dataset accurately and with low memory and time consumption. The scale-invariant feature transform (SIFT) is robust to translation, scaling, rotation, viewpoint change, and occlusion, and it is fast to extract; it is therefore widely used in both theory and practice. However, SIFT has shortcomings, such as a lack of spatial geometric information and color information. Convolutional neural networks (CNNs) have good domain transferability, and deep features from a pre-trained CNN can be applied to various domains. CNN deep features have recently attracted considerable attention and outperform SIFT. In contrast to SIFT, however, CNN features lack shallow (low-level) information. Thus, SIFT is usually fused with CNN features and other shallow features.
Method This report reviews recent advances and challenges in image retrieval, both internationally and in China, covering shallow features, deep features, and feature fusion. Future development trends are also explored. For shallow features, we mainly review SIFT and its variants, the encoding methods, and the development of these methods. For deep features, we divide the descriptors into categories according to the type of CNN layer used: fully connected layer, convolutional layer, and softmax layer. Many features can be extracted from a convolutional layer, and many pooling methods have been proposed for them.
Result The encoding methods for SIFT mainly include bag of features (BOF), vector of locally aggregated descriptors (VLAD), Fisher vector (FV), and triangulation embedding (TE), and they mostly consist of two steps: embedding and aggregation (or pooling). For CNN features, features from the fully connected layer are typically used because of their good transferability and accuracy. However, deep features from the convolutional layer have recently become an increasingly attractive option because they can be combined effectively with a variety of pooling methods, such as sum-pooling, max-pooling, VLAD-pooling, and FV-pooling, and they perform well in image classification and retrieval. The fusion methods can be divided into five main types: concatenation, kernel fusion, graph fusion, index-level fusion, and score-level fusion. Concatenation, kernel fusion, and index-level fusion work directly on the different features, whereas graph fusion and score-level fusion work on the retrieval results obtained with the different features. Fusion exploits the complementarity of different features and can effectively improve image retrieval accuracy.
Conclusion SIFT and CNN features are complementary: SIFT contains rich low-level information, and CNN features contain rich high-level semantic information; SIFT has good invariance properties, which CNN features lack. Fusion is an effective way to make the most of the information in an image. However, time and space consumption inevitably increase, and a good algorithm for distinguishing good features from bad ones has yet to be developed. At present, the generalizability and geometric invariance of CNN features are inferior to those of SIFT; this issue continues to be a challenge for image retrieval researchers. The generalizability of CNN features is limited by the domain and statistical differences between the source task (usually ImageNet) and the target task. Fine-tuning is a good strategy for this problem; however, it requires an additional labeled dataset similar to the target task. Enhancing the geometric invariance of CNN features inevitably increases the space consumption and extraction time of the descriptor, and only scale invariance is usually considered for simplicity, ignoring other aspects of invariance. Moreover, the number of CNN features extracted from one image is usually much smaller than the number of SIFT features; thus, the information available for encoding may be insufficient. The most commonly used CNNs are designed for image classification, not for image retrieval. Image retrieval, however, is a more fine-grained task: a retrieval algorithm needs to find similar images, not just images from the same class. Thus, a CNN trained specifically for image retrieval may be a good future research direction. More work is still needed to strike a better balance among generalizability, invariance, memory consumption, and extraction time for an effective and efficient image retrieval descriptor.
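The sum-pooling and max-pooling of convolutional features mentioned in the Result paragraph can be sketched as follows. This is a minimal illustration, not the method of any particular paper in the review: the feature map layout (a list of C channels, each an H x W grid of activations), the cosine-similarity comparison, and all names and toy values are assumptions.

```python
import math


def sum_pool(feature_map):
    """Sum all spatial activations in each channel: C x H x W -> C values."""
    return [sum(sum(row) for row in channel) for channel in feature_map]


def max_pool(feature_map):
    """Keep the largest spatial activation in each channel: C x H x W -> C values."""
    return [max(max(row) for row in channel) for channel in feature_map]


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two global descriptors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Toy convolutional feature map: 2 channels, each 2 x 2 (hypothetical values).
query_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 5.0], [1.0, 0.0]],
]
database_map = [
    [[2.0, 2.0], [2.0, 4.0]],
    [[0.0, 4.0], [2.0, 0.0]],
]

query_sum = sum_pool(query_map)   # one value per channel
query_max = max_pool(query_map)   # one value per channel
similarity = cosine_similarity(query_sum, sum_pool(database_map))
```

Either pooling choice collapses the spatial grid into one global descriptor per image, which keeps the descriptor compact but discards spatial layout.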
Keywords
