The ability to predict how an environment changes in response to applied forces is fundamental for a robot to achieve specific goals. Traditionally in robotics, this problem is addressed with pre-specified models or physics simulators, taking advantage of prior knowledge of the problem structure. While these models are general and broadly applicable, they depend on accurate estimation of model parameters such as object shape, mass, and friction. On the other hand, learning-based methods, such as Predictive State Representations or more recent deep learning approaches, learn these models directly from raw perceptual information in a model-free manner. These methods operate on raw data without any intermediate parameter estimation, but lack the structure and generality of model-based techniques.
In this talk, I will present some work that tries to bridge the gap between these two paradigms by proposing a specific class of deep visual models (SE3-Nets) that explicitly encode strong physical and 3D geometric priors (specifically, rigid body physics) in their structure. As opposed to deep models that reason about motion at the pixel level, we show that the physical priors implicit in our network architecture enable it to reason about dynamics at the object level: our network learns to identify objects in the scene and to predict a rigid body rotation and translation per object. We apply our model to the task of visuomotor control of a Baxter manipulator based on raw RGBD data and show that our method can achieve real-time, robust control without any external supervision. Finally, I will present some preliminary results on extending our approach to handle more dynamic tasks with long-term planning.
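The core prediction step described above, blending per-object rigid transforms weighted by learned object masks, can be sketched as follows. This is a minimal NumPy illustration, not the actual SE3-Nets implementation: the function name `se3_blend`, the array shapes, and the hard mask assignments in the toy example are all assumptions for clarity (in the real model, the masks and SE(3) parameters come from a learned encoder-decoder operating on RGBD input).

```python
import numpy as np

def se3_blend(points, masks, rotations, translations):
    """Predict the next point cloud by blending K per-object rigid
    (SE(3)) transforms, weighted by per-point object masks.

    points:       (N, 3) input 3D points (e.g. from an RGBD frame)
    masks:        (N, K) soft assignment of each point to K objects
                  (each row sums to 1)
    rotations:    (K, 3, 3) rotation matrix per object
    translations: (K, 3) translation vector per object
    """
    # Apply each object's rigid transform to all points: shape (K, N, 3)
    transformed = np.einsum('kij,nj->kni', rotations, points) \
        + translations[:, None, :]
    # Blend the K candidate motions per point using the mask weights
    return np.einsum('nk,kni->ni', masks, transformed)

# Toy example: two "objects", one rotating 90 degrees about z,
# one translating along z.
points = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 2.0]])
theta = np.pi / 2
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
rotations = np.stack([Rz, np.eye(3)])
translations = np.array([[0.0, 0.0, 0.0],
                         [0.0, 0.0, 1.0]])
masks = np.array([[1.0, 0.0],   # point 0 belongs to object 0
                  [0.0, 1.0]])  # point 1 belongs to object 1
next_points = se3_blend(points, masks, rotations, translations)
```

Predicting motion as a small set of rigid transforms, rather than an unconstrained per-pixel flow field, is what gives the model its object-level structure: every point assigned to the same object must move consistently.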