Supervised learning paradigm for training Deep Convolutional Neural Networks (DCNN) rests on the availability of large amounts of manually annotated images, which are necessary for training deep models with millions of parameters. In this talk, I will present novel techniques for mitigating the required manual annotation, by generating large object instance datasets through compositing textured 3D models onto commonly encountered background scenes to synthesize training images. The generated training data augmented with real world annotations outperforms models trained only on real data. Non-textured 3D models are subsequently used for keypoint learning and matching, and 3D object pose estimation from RGB images. The proposed methods showcase promising results with regards to generalization on new and standard benchmark datasets. In the final part of the talk, I will discuss how these perception capabilities can be leveraged and encoded in a spatial map, in order to enable an agent to successfully navigate towards a target object.