Widespread visual sensors and unprecedented
connectivity have left us awash with visual data, from online photo
collections and home videos to news footage, medical images, and surveillance
feeds. Which images and videos
among them warrant human attention?
I present two problem settings in which this question is critical:
supervised learning of object categories, and unsupervised video summarization.
In the first setting, the challenge is
to sift through candidate training images and select those that, if labeled by a
human, would be most informative to the recognition system. To address this challenge, we introduce
a novel large-scale active learning algorithm that efficiently indexes millions
of unlabeled instances according to their informativeness. We use it to deploy a “live
learning” system that actively requests crowd-sourced annotations on images
crawled from the Web, yielding state-of-the-art accuracy on the PASCAL object
detection benchmark with minimal human intervention.
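To make the active-learning idea concrete, here is a minimal sketch of pool-based uncertainty sampling, not the talk's hashing-based indexing method: unlabeled points are scored by their distance to the current linear decision boundary, and labels are requested for the most ambiguous ones. All names here (`w`, `b`, `pool`, `select_queries`) are illustrative, and the data is synthetic.

```python
import numpy as np

def select_queries(w, b, pool, k):
    """Return indices of the k pool points closest to the hyperplane w.x + b = 0,
    i.e. the points the current classifier is least certain about."""
    margins = np.abs(pool @ w + b) / np.linalg.norm(w)
    return np.argsort(margins)[:k]

rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 2))        # unlabeled feature vectors
w, b = np.array([1.0, -1.0]), 0.0        # current classifier parameters
queries = select_queries(w, b, pool, 5)  # 5 most informative points to label
```

In a live-learning loop, the selected points would be sent to crowd workers for annotation, the classifier retrained, and the selection repeated; the contribution described in the talk is making that selection step efficient over millions of instances.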
In the second setting, the challenge is to sift
through a long-running video and select only the essential parts needed to
summarize it for a human viewer.
Unlike traditional keyframe selection techniques, we propose an
object-driven approach that predicts the impact each object has on generating
the “story” of the video. Using
novel cues indicative of importance in an egocentric camera’s view,
our approach turns hours of video into a compact storyboard summary that a human
can interpret in just seconds.
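As an illustrative sketch of the final selection step, not the talk's egocentric method, the snippet below turns per-frame importance scores into a compact storyboard: it greedily keeps the highest-scoring frames while enforcing a minimum temporal gap so the summary spans the video. The scores here are synthetic, and the function name and parameters are assumptions.

```python
import numpy as np

def storyboard(scores, k, min_gap):
    """Pick up to k frame indices in descending score order,
    keeping selected frames at least min_gap frames apart."""
    chosen = []
    for idx in np.argsort(scores)[::-1]:  # frames from most to least important
        if all(abs(idx - c) >= min_gap for c in chosen):
            chosen.append(int(idx))
        if len(chosen) == k:
            break
    return sorted(chosen)

rng = np.random.default_rng(1)
scores = rng.random(500)                       # per-frame "importance" (synthetic)
frames = storyboard(scores, k=6, min_gap=30)   # indices of storyboard keyframes
```

In the approach described in the talk, the scores themselves come from predicting each object's impact on the video's story, which is where the real difficulty lies; the greedy selection above only shows how such scores could be compacted into a viewable summary.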
Both domains demonstrate the importance of isolating
the key visual data that deserves human attention, and suggest exciting new
applications for large-scale visual learning.
This talk describes work with Yong Jae Lee, Sudheendra
Vijayanarasimhan, and Prateek Jain.