Automatic Image Annotation via Deep Learning

Overview

This project addresses a fundamental research problem in multimedia content analysis: automatic image annotation, which labels the semantic content of images with a set of keywords. Research on image annotation has great scientific merit because it directly addresses a long-standing goal of the multimedia community: enabling computers to understand and represent visual information as humans do. Moreover, image annotation has great commercial potential. If images were well annotated, many mature techniques for retrieving textual data could be transferred to the image domain, making commercial multimedia search engines possible.

Despite more than fifteen years of extensive study, the overall results of automatic image annotation remain far from satisfactory and practical due to two difficulties: 1) how to extract visual features from images that represent the semantic meaning of textual keywords, a classical problem in multimedia content analysis known as the semantic gap; 2) how to distinguish thousands of keywords in the resulting visual feature space, essentially a multi-label learning task involving a large number of labels, a well-known challenge in machine learning. Most machine learning techniques and multimedia content analysis systems fail on real-world image annotation tasks even when they demonstrate impressive results in laboratory settings.

However, image annotation is not a hard problem for humans, even children. This project therefore seeks a new way to advance image annotation research by drawing on the human visual system and perception process. Accordingly, the project adopts deep models, whose laminar structure resembles that of the human cerebral cortex and whose information flow resembles that of the visual areas. To date, deep models have not been applied to the image annotation task. In this project, we will work on two subtasks: 1) design a novel deep architecture to bridge the semantic gap between visual features and textual keywords; 2) design new learning techniques adapted to the real-world image annotation task.
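As background for the multi-label aspect of the second difficulty: annotation is commonly framed as scoring every vocabulary keyword independently for an image and keeping the keywords that score above a threshold. The minimal NumPy sketch below illustrates only that framing with a linear scorer; the `annotate` helper, the weights, and the toy vocabulary are hypothetical illustrations, not this project's deep architecture.

```python
import numpy as np

def sigmoid(z):
    # Map raw scores to (0, 1) so each keyword gets an independent probability.
    return 1.0 / (1.0 + np.exp(-z))

def annotate(features, W, b, vocabulary, threshold=0.5):
    """Score every keyword independently and keep those above the threshold.

    features: (d,) visual feature vector for one image
    W: (d, k) per-keyword weight matrix, b: (k,) per-keyword bias
    """
    scores = sigmoid(features @ W + b)
    return [kw for kw, s in zip(vocabulary, scores) if s >= threshold]

# Toy example: a 4-dimensional visual feature and a 3-keyword vocabulary.
rng = np.random.default_rng(0)
vocabulary = ["sky", "water", "person"]
W = rng.normal(size=(4, 3))
b = np.zeros(3)
features = rng.normal(size=4)
print(annotate(features, W, b, vocabulary))
```

Because each keyword is thresholded separately, an image can receive zero, one, or many labels, which is exactly what makes the task multi-label rather than ordinary multi-class classification; the difficulty the project targets is doing this reliably when the vocabulary holds thousands of keywords.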

Award

  • Best Paper Award at the ACM International Conference on Internet Multimedia Computing and Service 2010 for the paper entitled "Fuzzy-Based Contextual Cueing for Region-level Annotation".
  • Qualcomm Award at the ACM International Conference on Multimedia 2011 for the paper entitled "Bilinear Deep Learning for Image Classification".

Publication

  • Sheng-hua Zhong, Yan Liu, Yang Liu. Robust Image Classification with Human Visual Cortex-Like Mechanisms. Under review in IEEE Transactions on Multimedia (TMM). [PDF]
  • Sheng-hua Zhong, Yan Liu, Yang Liu, Changsheng Li. Water Reflection Recognition Based on Motion Blur Invariant Moments in Curvelet Space. Accepted by IEEE Transactions on Image Processing (TIP). [PDF]
  • Sheng-hua Zhong, Yan Liu, Feifei Ren, Jinghuan Zhang, Tongwei Ren. Video Saliency Detection via Dynamic Consistent Spatio-Temporal Attention Modelling. In Proceedings of the 27th Conference on Artificial Intelligence (AAAI'13), 2013. [PDF]
  • Jonathan I. Flombaum, Sheng-hua Zhong, Zheng Ma, Colin Wilson, Yan Liu. What Is the Marginal Advantage of Extrapolation during Multiple Object Tracking? Insights from a Kalman Filter Model. In Proceedings of the 13th Annual Meeting of the Vision Sciences Society (VSS'13), 2013.
  • Sheng-hua Zhong, Yan Liu, Gangshan Wu. S-SIFT: A Shorter SIFT without Least Discriminative Visual Orientation. In Proceedings of the 2012 IEEE/WIC/ACM International Conference on Web Intelligence (WI'12), 2012. [PDF]
  • Sheng-hua Zhong, Yan Liu, Yao Zhang, Fu-lai Chung. Attention Modeling for Face Recognition via Deep Learning. In Proceedings of the 34th Annual Meeting of the Cognitive Science Society (CogSci'12), 2012. [PDF]
  • Sheng-hua Zhong, Yan Liu, Fu-lai Chung, Gangshan Wu. Semiconducting Bilinear Deep Learning for Incomplete Image Recognition. In Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR'12), 2012. [PDF]
  • Yan Liu, Sheng-hua Zhong, Wenjie Li. Query-oriented Multi-document Summarization via Unsupervised Deep Learning. In Proceedings of the 26th Conference on Artificial Intelligence (AAAI'12), 2012. [PDF]
  • Sheng-hua Zhong, Yan Liu, Yang Liu. Bilinear Deep Learning for Image Classification. In Proceedings of the ACM International Conference on Multimedia (ACM MM'11), 2011. [PDF]
  • Sheng-hua Zhong, Yan Liu, Yang Liu, Fu-lai Chung. Region Level Annotation by Fuzzy Based Contextual Cueing Label Propagation. In Multimedia Tools and Applications. (Invited Paper) [PDF]
  • Sheng-hua Zhong, Yan Liu, Ling Shao, Gangshan Wu. Unsupervised Saliency Detection Based on 2D Gabor and Curvelets Transforms. In Proceedings of the ACM International Conference on Internet Multimedia Computing and Service (ACM ICIMCS'11), 2011. [PDF]
  • Sheng-hua Zhong, Yan Liu, Ling Shao, Fu-lai Chung. Water Reflection Recognition via Minimizing Reflection Cost Based on Motion Blur Invariant Moments. In Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR'11), 2011. [PDF]
  • Sheng-hua Zhong, Yan Liu, Yang Liu, Fu-lai Chung. Fuzzy-Based Contextual Cueing for Region-level Annotation. In Proceedings of the ACM International Conference on Internet Multimedia Computing and Service (ACM ICIMCS'10), 2010. (Best Paper Award) [PDF]
  • Sheng-hua Zhong, Yan Liu, Yang Liu, Fu-lai Chung. A Semantic No-Reference Image Sharpness Metric Based on Top-Down and Bottom-Up Saliency Map Modeling. In Proceedings of the IEEE International Conference on Image Processing (ICIP'10), 2010. [PDF]

Presentation

  • Query-oriented Multi-document Summarization via Unsupervised Deep Learning. In Proceedings of the 26th Conference on Artificial Intelligence (AAAI'12), 2012. [PPT][Presentation MP3]
  • Bilinear Deep Learning for Image Classification. [PPT]
  • Water Reflection Recognition via Minimizing Reflection Cost Based on Motion Blur Invariant Moments. [PPT]
  • Fuzzy-Based Contextual Cueing for Region-level Annotation. [PPT]
  • A Semantic No-Reference Image Sharpness Metric Based on Top-Down and Bottom-Up Saliency Map Modeling. [PPT]

Demonstration

Dataset