Generating Descriptive Visual Words and Visual Phrases for Large-Scale Image Applications

ABSTRACT
The Bag-of-visual Words (BoW) image representation has been applied to various problems in multimedia and computer vision. The basic idea is to represent images as visual documents composed of repeatable and distinctive visual elements, which are comparable to the words in text. However, extensive experiments show that the commonly used visual words are not as expressive as text words, which hinders their effectiveness in various applications. In this paper, Descriptive Visual Words (DVWs) and Descriptive Visual Phrases (DVPs) are proposed as the visual correspondences to text words and phrases, where visual phrases refer to frequently co-occurring visual word pairs. Since images are the carriers of visual objects and scenes, a novel descriptive visual element set can be composed of the visual words and word combinations that are effective in representing certain visual objects or scenes. Based on this idea, a general framework is proposed for generating DVWs and DVPs from classic visual words for various applications. In a large-scale image database containing 1506 object and scene categories, the visual words and visual word pairs descriptive of certain scenes or objects are identified as the DVWs and DVPs. Experiments show that the DVWs and DVPs are compact and descriptive, and thus more comparable to text words than classic visual words are. We apply the identified DVWs and DVPs in several applications, including image retrieval, image re-ranking, and object recognition. The DVW and DVP combination outperforms the classic visual words by 19.5% and 80% in image retrieval and object recognition tasks, respectively. The DVW- and DVP-based image re-ranking algorithm, DWPRank, outperforms the state-of-the-art VisualRank by 12.4% in accuracy and is about 11 times faster.
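The two core notions above, the BoW histogram and visual phrases as frequently co-occurring visual word pairs, can be illustrated with a minimal sketch. This is a toy example, not the paper's actual quantization or mining procedure: the feature coordinates, word ids, and the 5-pixel neighborhood radius are invented for illustration, and real systems would first quantize local descriptors (e.g. SIFT) against a learned codebook.

```python
from collections import Counter
from itertools import combinations
from math import dist  # Python 3.8+

# Toy "image": each local feature is (x, y, visual_word_id).
# In practice the word ids come from quantizing local descriptors
# against a codebook; here they are given directly.
features = [
    (10, 10, 3), (12, 11, 7), (50, 52, 3),
    (51, 50, 7), (90, 90, 1), (91, 92, 3),
]

def bow_histogram(feats, vocab_size):
    """Classic Bag-of-visual-Words: a histogram of word occurrences."""
    h = [0] * vocab_size
    for _, _, w in feats:
        h[w] += 1
    return h

def cooccurring_pairs(feats, radius):
    """Candidate visual phrases: unordered word pairs whose features
    lie within `radius` pixels of each other."""
    pairs = Counter()
    for (x1, y1, w1), (x2, y2, w2) in combinations(feats, 2):
        if dist((x1, y1), (x2, y2)) <= radius:
            pairs[tuple(sorted((w1, w2)))] += 1
    return pairs

print(bow_histogram(features, 8))      # -> [0, 1, 0, 3, 0, 0, 0, 2]
print(cooccurring_pairs(features, 5))  # (3, 7) co-occurs twice; (1, 3) once
```

Pairs that co-occur frequently across many images of the same object or scene category would then be retained as descriptive visual phrases, while isolated or uninformative words are discarded.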
- S. Battiato, G. M. Farinella, G. Gallo, and D. Ravi. Spatial hierarchy of textons distribution for scene classification. Proc. Multimedia Modeling, pp. 333--342, 2009.
- S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Proc. International World-Wide Web Conference, pp. 107--117, 1998.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. Proc. CVPR, pp. 710--719, 2009.
- C. Fellbaum. WordNet: an electronic lexical database. Bradford Books, 1998.
- B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315(5814): 972--976, Feb. 2007.
- A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. Proc. VLDB, pp. 518--529, 1999.
- W. H. Hsu, L. S. Kennedy, and S. F. Chang. Video search reranking through random walk over document-level context graph. Proc. ACM Multimedia, pp. 971--980, 2007.
- Y. Jing and S. Baluja. VisualRank: applying PageRank to large-scale image search. PAMI, 30(11): 1877--1890, Nov. 2008.
- F. Jurie and B. Triggs. Creating efficient codebooks for visual recognition. Proc. ICCV, pp. 17--21, 2005.
- S. Lazebnik and M. Raginsky. Supervised learning of quantizer codebook by information loss minimization. PAMI, 31(7): 1294--1309, July 2009.
- S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. Proc. CVPR, pp. 2169--2178, 2006.
- D. Liu, G. Hua, P. Viola, and T. Chen. Integrated feature selection and higher-order spatial feature extraction for object categorization. Proc. CVPR, pp. 1--8, 2008.
- J. Liu, W. Lai, X. Hua, Y. Huang, and S. Li. Video search re-ranking via multi-graph propagation. Proc. ACM Multimedia, pp. 208--217, 2007.
- D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2): 91--110, Nov. 2004.
- M. Marszalek and C. Schmid. Spatial weighting for bag-of-features. Proc. CVPR, pp. 2118--2125, 2006.
- F. Moosmann, E. Nowak, and F. Jurie. Randomized clustering forests for image classification. PAMI, 30(9): 1632--1646, Sep. 2008.
- D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. Proc. CVPR, pp. 2161--2168, 2006.
- F. Perronnin and C. Dance. Fisher kernels on visual vocabulary for image categorization. Proc. CVPR, pp. 1--8, 2007.
- F. Perronnin. Universal and adapted vocabularies for generic visual categorization. PAMI, 30(7): 1243--1256, July 2008.
- S. Savarese, J. Winn, and A. Criminisi. Discriminative object class models of appearance and shape by correlatons. Proc. CVPR, pp. 2033--2040, 2006.
- Z. Si, H. Gong, Y. N. Wu, and S. C. Zhu. Learning mixed templates for object recognition. Proc. CVPR, 2009.
- J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. Proc. ICCV, pp. 1470--1477, 2003.
- X. Tian, L. Yang, J. Wang, Y. Yang, X. Wu, and X. Hua. Bayesian video search reranking. Proc. ACM Multimedia, pp. 131--140, 2008.
- A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI, 30(11): 1958--1970, Nov. 2008.
- P. Viola and M. Jones. Robust real-time face detection. Proc. ICCV, pp. 7--14, 2001.
- C. Wang, D. Blei, and L. Fei-Fei. Simultaneous image classification and annotation. Proc. CVPR, 2009.
- F. Wang, Y. G. Jiang, and C. W. Ngo. Video event detection using motion relativity and visual relatedness. Proc. ACM Multimedia, pp. 239--248, 2008.
- J. Winn, A. Criminisi, and T. Minka. Object categorization by learning universal visual dictionary. Proc. ICCV, pp. 17--21, 2005.
- Z. Wu, Q. F. Ke, and J. Sun. Bundling features for large-scale partial-duplicate web image search. Proc. CVPR, 2009.
- D. Xu and S. F. Chang. Video event recognition using kernel methods with multilevel temporal alignment. PAMI, 30(11): 1985--1997, Nov. 2008.
- L. Yang, P. Meer, and D. J. Foran. Multiple class segmentation using a unified framework over mean-shift patches. Proc. CVPR, pp. 1--8, 2007.
- J. Yuan, Y. Wu, and M. Yang. Discovery of collocation patterns: from visual words to visual phrases. Proc. CVPR, pp. 1--8, 2007.
- Y. T. Zheng, M. Zhao, S. Y. Neo, T. S. Chua, and Q. Tian. Visual synset: a higher-level visual representation. Proc. CVPR, pp. 1--8, 2008.
- X. Zhou, X. D. Zhuang, S. C. Yan, S. F. Chang, M. H. Johnson, and T. S. Huang. SIFT-bag kernel for video event analysis. Proc. ACM Multimedia, pp. 229--238, 2008.