Abstract: Advances in camera miniaturization and mobile computing have enabled the
development of wearable camera systems that can capture both the
user’s view of the scene (the egocentric, or first-person, view) and
their gaze behavior. In contrast
to the established third-person video paradigm, the egocentric paradigm
makes it possible to easily collect examples of naturally-occurring
human behavior, such as activities of daily living, from a consistent
vantage point. Moreover, a variety of egocentric cues can be extracted
from these videos and used
for weakly-supervised learning of objects and activities. We focus on
activities requiring hand-eye coordination and model the spatio-temporal
relationship between the gaze point, the scene
objects, and the action label. We demonstrate that gaze measurement can
provide a powerful cue for recognition. In addition, we present an
inference method that can predict gaze locations and use the predicted
gaze to infer action labels. We demonstrate improvements
in action recognition rates and gaze prediction accuracy relative to
state-of-the-art methods on a new dataset containing egocentric videos
of daily activities together with gaze data. We will also describe some applications in
psychology, where we are developing methods
for automating the measurement of children’s behavior, as part of a
large effort targeting autism and other behavioral disorders. This is
joint work with Alireza Fathi, Yin Li, and Agata Rozga.