Recognizing objects and understanding scenes in 3D is critical for intelligent systems that interact with the 3D world. For instance, in robot manipulation and navigation, robots need to understand the 3D world in order to perform their tasks. In augmented reality, 3D scene understanding is essential for delivering a realistic user experience. In this talk, I will present our efforts in 3D object recognition and scene understanding from RGB-D videos. I will describe a new convolutional neural network for end-to-end 6D object pose estimation, i.e., estimating the 3D rotation and 3D translation of an object. The network handles both textured and texture-less objects, and is highly robust to occlusions between objects. For scene understanding, I will show that the convolutional neural network can be extended to a recurrent neural network for semantic mapping on RGB-D videos, where the network interacts with KinectFusion to jointly reconstruct a 3D scene and recognize semantics in the scene.
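As background for the abstract above, a 6D pose pairs a 3D rotation with a 3D translation and acts on points as a rigid transform. The sketch below is illustrative only (not the network described in the talk) and assumes a common convention of representing the rotation as a unit quaternion in (w, x, y, z) order; all function names are hypothetical.

```python
import numpy as np

def quat_rotate(q, p):
    """Rotate 3D point p by a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    u = np.array([x, y, z])
    # Compact expansion of q * p * q^-1 valid for unit quaternions
    return p + 2.0 * np.cross(u, np.cross(u, p) + w * p)

def apply_pose(q, t, p):
    """Apply a 6D pose (3D rotation q, 3D translation t) to point p."""
    return quat_rotate(q, p) + t

# Example: 90-degree rotation about the z-axis, then translate by (1, 0, 0)
angle = np.pi / 2
q = np.array([np.cos(angle / 2), 0.0, 0.0, np.sin(angle / 2)])
t = np.array([1.0, 0.0, 0.0])
p = np.array([1.0, 0.0, 0.0])
print(apply_pose(q, t, p))  # ≈ [1, 1, 0]
```

In this example the rotation takes (1, 0, 0) to (0, 1, 0), and the translation then shifts it to (1, 1, 0); a pose-estimation network regresses exactly these two quantities per object.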