ABSTRACT
Current research on the automatic generation of semantic descriptions centers mainly on improving annotation accuracy for individual tags or attributes. In this paper, we focus on generating more informative descriptions for images. We propose to generate layered, semantically meaningful descriptions and to summarize key aspects of the data from the component detectors. In particular, the output descriptions include the superclass, class, attributes, and the location of the object region that may interest users. We propose to integrate ROI (Region of Interest) identification and hierarchical semantic element detection into a joint framework. The joint optimization of the ROI localizer and the hierarchical concept detector makes them mutually beneficial. In this way, we create a discriminative image description generation framework based on a tightly coupled multi-layer optimization. The output descriptions contain richer information about the image content, with layered contextual information, thereby enabling better management and usage of image data. Experiments on two public benchmark datasets demonstrate that the proposed method obtains state-of-the-art performance.
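To make the shape of the proposed output concrete, the following is a minimal sketch (not the paper's implementation) of the layered description record the abstract describes: a superclass, a class, a set of attributes, and an ROI location. All names and the bounding-box convention are illustrative assumptions.

```python
# Hypothetical sketch of a layered image description: superclass,
# class, attributes, and the ROI (region of interest) location.
# The (x, y, w, h) bounding-box convention is an assumption.
from dataclasses import dataclass


@dataclass
class LayeredDescription:
    superclass: str        # coarse concept, e.g. "animal"
    object_class: str      # fine concept, e.g. "dog"
    attributes: list       # e.g. ["furry", "brown"]
    roi: tuple             # (x, y, width, height) of the ROI

    def summary(self) -> str:
        """Render the layered description as one readable line."""
        attrs = ", ".join(self.attributes)
        x, y, w, h = self.roi
        return (f"{self.object_class} ({self.superclass}; {attrs}) "
                f"at region x={x}, y={y}, w={w}, h={h}")


desc = LayeredDescription("animal", "dog", ["furry", "brown"], (32, 48, 120, 90))
print(desc.summary())
```

The `summary` method illustrates how the hierarchical layers (superclass, class, attributes) and the ROI location could be combined into a single human-readable description.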
Index Terms
- Describing Images with Hierarchical Concepts and Object Class Localization