In this talk, I will discuss the task of learning to infer 3D structure without explicit supervision and present two recent attempts in this direction. I will first describe a differentiable ray consistency formulation that enables learning single-view 3D prediction models from indirect multi-view supervision. I will show that this formulation can leverage different kinds of observations (foreground labels, depth, or semantics) as a supervisory signal, and I will examine its application in diverse scenarios. I will then present a method that learns to assemble shapes from volumetric primitives and show that it yields interpretable and coherent abstractions in an unsupervised manner. Finally, I will demonstrate that these representations can be leveraged for applications such as shape parsing, manipulation, and retrieval.