ABSTRACT
Computer vision has seen major success in learning to recognize objects from massive “disembodied” Web photo collections labeled by human annotators. Yet cognitive science tells us that perception develops in the context of acting and moving in the world, and without intensive supervision. Meanwhile, many realistic vision tasks require not only categorizing a well-composed, human-taken photo, but also intelligently deciding where to look in the first place. Motivated by these challenges, we explore ways to learn visual representations from unlabeled video accompanied by multi-modal sensory data such as egomotion and sound. Moving beyond passively captured video to agents that control their own first-person cameras, we investigate how an agent can learn to move intelligently to acquire its visual observations. We present reinforcement learning approaches for active, exploratory look-around behavior, and show promising results for transferring the learned policies to novel perception tasks and unseen environments.