In recent years, deep learning has emerged as a powerful method for learning feature representations from complex input data, and it has been highly successful in computer vision, speech recognition, and language modeling. These successes typically rely on a large amount of supervision (e.g., class labels). While many deep learning algorithms focus on a discriminative task and extract only task-relevant features that are invariant to other factors, complex sensory data is often generated by intricate interactions among underlying factors of variation (for example, pose, morphology, and viewpoint for 3D object images). In this work, we tackle the problem of learning deep representations that disentangle the underlying factors of variation and allow for complex reasoning and inference involving multiple factors. Specifically, we develop deep generative models with higher-order interactions among groups of hidden units, where each group learns to encode a distinct factor of variation. We present several successful instances of deep architectures and their learning methods, including the supervised and weakly supervised settings. Our models achieve strong performance in emotion recognition, face verification, data-driven modeling of 3D objects, and video game prediction. I will also present other related ongoing work.