Objectives: develop tools and techniques for leveraging user-generated multimedia as a training resource for automatic semantic labeling.
The dominant element in the video retrieval paradigm based on semantic labeling is the availability of a large vocabulary of robust detectors. Scaling up the number of detectors will only be possible if the fundamental problem in automatic indexing based on supervised machine learning is resolved: the lack of a large and diverse set of manually labeled visual examples to model the diversity in object and scene appearance adequately. A new direction in tackling this fundamental problem is to employ user-tagged visual data provided by online services such as YouTube and Flickr. These annotations are less accurate than the manual labels used in current semantic video retrieval practice, but the number of training samples is several orders of magnitude larger.
Intuitively, if different persons label visually similar images and videos with the same tags, these tags are likely to reflect objective aspects of the visual content. We will study how this intuition can be exploited to obtain relevant labels for visual content. To that end, several data-mining strategies will be explored, covering textual, visual, social, lexical, and multimodal approaches. All phases of the research will be evaluated in the TRECVID benchmark.
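The intuition above can be sketched as a simple neighbor-voting scheme: a tag attached to an image is scored by how often the same tag appears among the image's visually nearest neighbors, discounted by the tag's overall frequency in the collection. This is an illustrative sketch only, not the proposal's method; the function and data layout (`tag_relevance`, a list of `(feature_vector, tag_set)` pairs) are assumptions introduced here for clarity.

```python
from collections import Counter
import math

def tag_relevance(query_vec, query_tags, corpus, k=3):
    """Illustrative neighbor-voting sketch (names are hypothetical).

    A tag is deemed relevant to an image when it occurs among the image's
    k visually nearest neighbors more often than its corpus-wide prior
    frequency would predict.  `corpus` is a list of (feature_vector,
    tag_set) pairs contributed by different users.
    """
    # Rank the collection by Euclidean distance to the query's features.
    neighbors = sorted(corpus, key=lambda item: math.dist(item[0], query_vec))[:k]
    # Count how often each tag occurs among the k nearest neighbors.
    votes = Counter(t for _, tags in neighbors for t in tags)
    # Corpus-wide tag frequencies serve as the chance-level baseline.
    prior = Counter(t for _, tags in corpus for t in tags)
    n = len(corpus)
    # Score = neighbor votes minus expected votes under the tag's prior.
    return {t: votes[t] - k * prior[t] / n for t in query_tags}

# Tiny usage example: two visual clusters, each consistently tagged.
corpus = [
    ((0.1, 0.0), {"dog"}), ((0.0, 0.1), {"dog"}), ((0.2, 0.1), {"dog"}),
    ((5.0, 5.0), {"car"}), ((5.1, 5.0), {"car"}), ((5.0, 5.1), {"car"}),
]
scores = tag_relevance((0.0, 0.0), {"dog", "car"}, corpus, k=3)
# "dog" agrees with the query's visual neighbors, so it outscores "car".
```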