Human intelligence goes beyond pattern recognition. From a single image, we can explain what we see, reconstruct the scene in 3D, predict what will happen next, and plan our actions accordingly. In this talk, I will present our recent work on physical scene understanding—reverse-engineering these capacities to build machines that are versatile, data-efficient, and generalize better. The core idea is to exploit the scene's compositional structure by integrating deep recognition networks with generative, approximate simulation engines. I will focus on a few topics: building object representations that capture both geometry and physics; learning compact, interpretable dynamics models for planning and control; and perception and reasoning beyond vision.