The past few years have witnessed remarkable advances in 2D image understanding driven by deep learning. However, in scenarios involving interactions among humans, robots, and the world, we must understand not only 2D visual information but also the 3D geometry of the scene from images. This talk focuses on our recent progress in recovering 3D object geometry, such as the 3D structure and pose of rigid and articulated objects, from monocular imagery by leveraging data-driven representations learned from both 2D and 3D data. In particular, I will first introduce an approach to 3D human pose reconstruction based on a sparse representation of 3D human poses and CNN-based 2D pose predictions. I will discuss how to jointly optimize structural and viewpoint parameters with convex programming and how to account for the uncertainty in 2D pose predictions with an EM algorithm. Next, I will show that, using semantic keypoints and CAD models, it is feasible to estimate the 6-DoF pose of a rigid object in a cluttered image with the precision required for a robot to grasp the object. Finally, I will describe how to build object-category models from images of different instances with a multi-image matching algorithm that optimizes the cycle consistency of feature correspondences.
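To give a flavor of the sparse-representation idea, here is a minimal sketch: a 3D pose is modeled as a linear combination of basis shapes, and with the camera rotation held fixed, fitting the coefficients to 2D keypoints reduces to least squares. This is a deliberately simplified version (no convex relaxation, no joint rotation optimization, translation assumed removed); all names and shapes are illustrative, not the talk's implementation.

```python
import numpy as np

def fit_shape_coefficients(W, B, R):
    """Fit sparse-representation coefficients c so that R @ sum_i c[i]*B[i]
    approximates the 2D keypoints W (translation assumed already removed).

    W: 2 x p observed 2D keypoints
    B: k x 3 x p basis of 3D shapes
    R: 2 x 3 matrix, first two rows of a rotation (weak-perspective camera)
    """
    # Each basis shape projects to a 2 x p matrix; flatten each projection
    # into one column so the fit becomes an ordinary least-squares problem.
    A = np.stack([(R @ B[i]).ravel() for i in range(B.shape[0])], axis=1)
    c, *_ = np.linalg.lstsq(A, W.ravel(), rcond=None)
    return c

# Toy usage: recover known coefficients from a synthetic frontal projection.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 3, 15))                # 4 basis shapes, 15 joints
c_true = np.array([1.0, -0.5, 0.2, 0.0])
S = np.tensordot(c_true, B, axes=1)                # 3 x 15 combined pose
R = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # frontal weak-perspective view
W = R @ S                                          # 2 x 15 projected keypoints
c_est = fit_shape_coefficients(W, B, R)
```

In the talk, rotation and coefficients are optimized jointly via convex programming rather than with the rotation fixed as here; the sketch only isolates the linear part of that problem.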
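The cycle-consistency criterion used in the multi-image matching part can also be illustrated briefly: if pairwise correspondences are represented as permutation matrices, a consistent set must compose to the identity around any cycle of images. The sketch below (function names and the small example are assumptions for illustration, not the talk's code) builds consistent maps that factor through a per-image labeling and shows that corrupting one map breaks the cycle.

```python
import numpy as np

def perm_matrix(perm):
    """Permutation matrix P with P @ e_i = e_{perm[i]} (point i maps to perm[i])."""
    n = len(perm)
    P = np.zeros((n, n))
    P[perm, np.arange(n)] = 1.0
    return P

def cycle_error(P_ij, P_jk, P_ki):
    """Frobenius distance of the cycle composition i -> j -> k -> i from identity."""
    n = P_ij.shape[0]
    return np.linalg.norm(P_ki @ P_jk @ P_ij - np.eye(n))

# Consistent case: each pairwise map factors as P_ij = A_j @ A_i.T, where A_i
# maps a shared "universe" of points to image i, so every cycle closes exactly.
A_i = perm_matrix([1, 2, 0, 3])
A_j = perm_matrix([0, 3, 1, 2])
A_k = perm_matrix([3, 2, 1, 0])
P_ij, P_jk, P_ki = A_j @ A_i.T, A_k @ A_j.T, A_i @ A_k.T
err_good = cycle_error(P_ij, P_jk, P_ki)           # cycle closes

# Inconsistent case: swap two correspondences in one pairwise map.
P_ij_noisy = perm_matrix([1, 0, 2, 3]) @ P_ij
err_bad = cycle_error(P_ij_noisy, P_jk, P_ki)      # cycle no longer closes
```

The matching algorithm in the talk optimizes correspondences across all image pairs so that this kind of cycle error is minimized jointly, rather than checking a single triplet as done here.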