Three mini-talks by GRASP Ph.D. students on deep learning. Presenters are listed below in alphabetical order.
Deep Learning for Human Behavior Understanding from First-Person Cameras
Humans make many complex personal and social behavioral decisions while interacting with their surroundings: where to look and move next, which objects to pick up, and with whom to interact. A first-person camera placed at the person’s head captures such decisions, giving us the ability to indirectly tap into the person’s mind and see the world from their perspective.
In this work, we study the holistic correlation between a person’s visual attention and their actions from the first-person perspective. To do this, we first introduce a deep fully convolutional EgoNet model that holistically integrates visual appearance and 3D spatial cues to detect objects that are important to the person. We then introduce a Visual-Spatial Network and a Cross-Model EgoSupervision technique that allow us to design deep networks that learn to detect a person’s interactions with objects and other people from first-person data in an unsupervised fashion. Finally, we show that we can build on these methods to develop exciting applications, such as assessing a person’s skill level in complex activities like basketball.
Gedas Bertasius is a fourth-year Ph.D. student in the CIS Department at the University of Pennsylvania. His research focuses on applying deep learning and graphical model methods to various computer vision problems such as edge detection, semantic segmentation, first-person object detection, and human behavior modeling from first-person videos. He is currently working with Jianbo Shi and is supported by an NSF IGERT Fellowship. His research has been published at the ICCV, CVPR, RSS, and AISTATS conferences, and his recent work on first-person vision was featured in New Scientist and Impact magazines.
Polar Transformer Networks
Convolutional neural network (CNN) feature maps are equivariant with respect to translation: a translation of the input causes a corresponding translation of the output. Attempts to generalize equivariance have concentrated on rotations. In this talk, we present the Polar Transformer Networks, which combine the idea of the spatial transformer with the canonical coordinate representations of groups (the polar transform) to realize a network that is invariant to translation and equivariant to rotation and scale. A conventional CNN is used to predict the origin of a polar transform. The polar transform is performed in a differentiable way, and the resulting polar representation is fed into a second CNN. The model is trained end-to-end with a classification loss. We apply the method to variations of MNIST obtained by perturbing it with clutter, translation, rotation, and scaling. We achieve state-of-the-art performance on rotated MNIST, with fewer parameters and faster training than previous methods, and we outperform all tested methods on the SIM2MNIST dataset, which we introduce.
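The key step in the pipeline above is resampling the image onto a (log-)polar grid centered at the predicted origin, so that rotation and scaling of the input become shifts in the resampled image. A minimal numpy sketch of that resampling, with illustrative names of our own choosing (the network itself uses differentiable bilinear sampling rather than the nearest-neighbor sampling used here for brevity):

```python
import numpy as np

def log_polar_transform(image, origin, n_radii, n_angles, r_min=1.0):
    """Resample `image` on a log-polar grid centered at `origin`.

    In log-polar coordinates, a rotation about the origin becomes a shift
    along the angle axis and a scaling becomes a shift along the radius
    axis, so an ordinary CNN applied to the resampled image is equivariant
    to rotation and scale. Nearest-neighbor sampling is used for brevity.
    """
    h, w = image.shape
    cy, cx = origin
    # Largest radius that still reaches the farthest image border.
    max_r = np.hypot(max(cy, h - 1 - cy), max(cx, w - 1 - cx))
    # Log-spaced radii and uniformly spaced angles define the sampling grid.
    radii = np.exp(np.linspace(np.log(r_min), np.log(max_r), n_radii))
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    rr, aa = np.meshgrid(radii, angles, indexing="ij")
    # Convert polar grid coordinates back to pixel indices and sample.
    ys = np.clip(np.rint(cy + rr * np.sin(aa)).astype(int), 0, h - 1)
    xs = np.clip(np.rint(cx + rr * np.cos(aa)).astype(int), 0, w - 1)
    return image[ys, xs]
```

In the full model, the first CNN's predicted origin feeds this transform, and the second CNN operates on the resulting `n_radii` by `n_angles` image, where ordinary translational weight sharing delivers the rotation and scale equivariance.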
Carlos Esteves is a Ph.D. student at the GRASP Laboratory, University of Pennsylvania, under the supervision of Dr. Kostas Daniilidis. He received his B.S. degree in computer engineering (2007) and M.S. degree in electrical and computer engineering (2010) from the Aeronautics Institute of Technology (ITA) in Brazil. Before starting his Ph.D. studies, he was a researcher at the Brazilian Aeronautics and Space Institute (IAE) and worked on defense projects for 3.5 years at Denel Dynamics in South Africa.
Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose
Since the emergence of deep learning, we have witnessed impressive advances on many 2D image understanding tasks, including object detection, segmentation, and edge detection. However, neural networks have been less successful at tasks related to 3D understanding, which are ubiquitous in robotic applications. In this talk, I will discuss how a convolutional neural network can be used in an end-to-end fashion for the problem of articulated 3D pose estimation from a single color image. More specifically, I will introduce a general, model-free scheme based on coarse-to-fine volumetric prediction that allows for direct and accurate estimation of 3D pose. The proposed network is trained exclusively on images with 3D pose ground truth and outperforms all previous model-based and hybrid approaches.
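To give a feel for what volumetric prediction means here: instead of regressing joint coordinates directly, the network outputs a per-joint likelihood volume over discretized 3D space, and a joint location is read out of that volume. A minimal numpy sketch of one such readout, using a soft-argmax (a common differentiable variant; the function name and shapes are illustrative, not the talk's code):

```python
import numpy as np

def soft_argmax_3d(volume):
    """Read a 3D joint location out of a predicted likelihood volume.

    `volume` holds per-voxel scores over a discretized (depth, height,
    width) space for one joint. A softmax turns the scores into a
    probability distribution, and the expected voxel coordinate gives a
    sub-voxel 3D estimate while keeping the readout differentiable.
    """
    d, h, w = volume.shape
    # Softmax over all voxels (shift by max for numerical stability).
    p = np.exp(volume - volume.max())
    p /= p.sum()
    # Coordinate grids for each axis of the volume.
    zz, yy, xx = np.meshgrid(
        np.arange(d), np.arange(h), np.arange(w), indexing="ij"
    )
    # Expected coordinate under the softmax distribution.
    return np.array([(p * zz).sum(), (p * yy).sum(), (p * xx).sum()])
```

The coarse-to-fine aspect of the talk's scheme stacks several such volumes at increasing depth resolution, refining the estimate stage by stage.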
Georgios Pavlakos is a Ph.D. student in Computer and Information Science at the University of Pennsylvania, working under the supervision of Prof. Kostas Daniilidis. He received his B.S. degree in Electrical and Computer Engineering from the National Technical University of Athens in 2014. His research interests lie at the intersection of computer vision, machine learning, and robotics, and include reconstruction and pose estimation of objects and humans from single images.