What does it mean to understand an image or video? One common answer in computer vision has been that understanding means naming things: this part of the image corresponds to a refrigerator and that part to a person, for instance. While important, this ability is not enough: humans can effortlessly reason about the rich world that images depict and what they can do in it. For example, if a friend shows you the way to their kitchen so you can get something, they won't worry that you'll get lost walking back or that you'll have trouble figuring out how to open their refrigerator or cabinets. While both are ordinary feats for humans (or even a dog or cat), they are currently far beyond the abilities of computers.
In my talk, I'll discuss my efforts toward bridging this gap. In the first part, I'll discuss the task of navigation, or getting from one place to another. In particular, our goal is to take a single demonstration of a path and retrace it, either forwards or backwards, under noisy actuation and a changing environment. Rather than build an explicit model of the world, we learn a network that attends to a sequence of memories in order to make decisions. In the second part, I'll discuss learning about ordinary everyday interactions using web videos from YouTube. In particular, I'll discuss our new work on learning about everyday tasks, like making pancakes, with minimal human supervision. Instead of labeling videos densely, we can take advantage of the natural supervision inherent in a collection of instructional videos showing multiple tasks.
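To make the "attending to a sequence of memories" idea concrete, here is a minimal NumPy sketch of soft attention over stored memory embeddings. This is an illustrative toy, not the architecture from the talk: the function name, the embedding dimension, and the scaled dot-product scoring are all assumptions made for the example.

```python
import numpy as np

def attend_over_memories(query, memories):
    """Soft attention over a sequence of stored memory embeddings.

    query:    (d,) embedding of the current observation
    memories: (n, d) embeddings of observations along the demonstrated path
    Returns an attention-weighted summary of the memories and the weights.
    """
    d = query.shape[0]
    scores = memories @ query / np.sqrt(d)   # scaled dot-product similarity
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    summary = weights @ memories             # (d,) weighted combination
    return summary, weights

# Toy usage: the current view is close to the second stored memory,
# so that memory should receive above-uniform attention weight.
rng = np.random.default_rng(0)
memories = rng.normal(size=(5, 8))
query = memories[1] + 0.1 * rng.normal(size=8)
summary, weights = attend_over_memories(query, memories)
```

In a real system, the summary vector would feed a small policy head that outputs the next action; the point of the sketch is only that decisions are made by softly retrieving relevant moments of the demonstration rather than by querying an explicit world model.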