The ultimate goal of computer vision is to build models that understand our (visual) world. Explainable AI goes a step further, seeking models whose decisions can also be understood by a human user. Such image and video understanding is a difficult inverse problem: it requires learning a metric in image space that reflects object relations in the real world. To avoid the need for tedious annotations, we follow a self-supervised strategy for metric and representation learning. We present a divide-and-conquer approach to representation learning that exploits transitivity to discover reliable relationships for training. In addition, the talk will present a widely applicable strategy based on deep reinforcement learning to improve the surrogate tasks underlying self-supervision.
Thereafter, we will discuss the learning of explainable models by disentangling representations into diverse object characteristics. This yields a generative model for image and video synthesis, controlled visual retargeting, and unsupervised learning of semantic parts and registration. Time permitting, we will also cover a variety of applications of this research, ranging from behavior analysis in neuroscience to visual analytics in the digital humanities.