Rapid advances in 2D perception have led to systems that accurately detect objects in real-world images. However, these systems make predictions in 2D, ignoring the 3D structure of the world. Concurrently, advances in 3D shape prediction have mostly focused on synthetic benchmarks and isolated objects. In this talk, I present our efforts to unify advances in these two areas. In particular, I will present our recent work on augmenting state-of-the-art 2D recognition systems with the ability to infer 3D shapes from real-world images in the wild. I will then turn to embodied question answering, where 3D shape cues are used both for semantic navigation and for question answering, fused with 2D cues in an end-to-end manner.