A true interpretation of an image is one in which the latent factors that produce it, such as 3D geometry, reflectance, lighting, and camera, as well as the semantic correspondence of objects in the image, are known. Such decompositions have many practical applications in graphics and entertainment, including motion capture and AR/VR. Moreover, the ability to disentangle the factors of variation is fundamental to general intelligence and to self-supervised learning from observations. However, even when the structure of the factors is known (i.e., the image rendering equation), this is an extremely challenging problem, since inverse problems are fundamentally underconstrained. What makes it even more challenging is that, in most cases, ground-truth values of these latent factors are not attainable at large scale for images in the wild.
In this talk I discuss how we might address these challenges for two problems: end-to-end single-view 3D mesh recovery of human bodies, and the decomposition of unconstrained face images into shape, reflectance, and illuminance. The key insights are in setting up a structured auto-encoder that reflects the underlying model of the world, and in learning data-driven priors.
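The structured auto-encoder idea can be sketched as follows. This is a hypothetical toy illustration, not the talk's actual models: the encoder regresses interpretable latent factors from an image, and the decoder is a fixed, known rendering function, so the reconstruction loss is self-supervised and the latents are forced to carry physical meaning. The shading basis, factor dimensions, and linear encoder here are all invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8                                        # toy image resolution
L = 4                                            # lighting-code dimension

# Fixed, hypothetical shading basis standing in for the known rendering model.
basis = rng.normal(scale=0.1, size=(L, H * W))

def render(albedo, light):
    """Fixed image-formation model: image = albedo * shading(light).
    The decoder is NOT learned; it encodes our structural knowledge."""
    shading = np.exp(light @ basis)              # positive per-pixel shading
    return albedo * shading

class Encoder:
    """Untrained linear encoder regressing latent factors from an image."""
    def __init__(self):
        self.Wa = rng.normal(scale=0.01, size=(H * W, H * W))
        self.Wl = rng.normal(scale=0.01, size=(H * W, L))

    def __call__(self, image):
        albedo = np.maximum(image @ self.Wa, 0.0)  # albedo kept non-negative
        light = image @ self.Wl
        return albedo, light

# Self-supervised objective: reconstruct the input through the fixed renderer,
# so no ground-truth albedo or lighting labels are needed during training.
enc = Encoder()
image = rng.uniform(size=H * W)
albedo, light = enc(image)
recon = render(albedo, light)
loss = float(np.mean((recon - image) ** 2))
print(f"reconstruction loss: {loss:.4f}")
```

In a real system the encoder would be a deep network and the renderer a differentiable graphics model, with data-driven priors regularizing the otherwise underconstrained inversion.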