What is the function of past data that we can store in memory so that, come future data, we can best process it to solve an inference task? I will first formalize a set of desirable properties a representation should have, and derive a variational principle related to the Information Bottleneck of Tishby, Bialek and Pereira. Unfortunately, the corresponding (IB) Lagrangian cannot be computed, let alone optimized. I will then show that there exists a different IB Lagrangian, defined on the model parameters rather than the representation, that is in principle unrelated to the first one, but is instead related to the empirical loss used when training deep networks, and can be computed and easily optimized. I will then show that the latter bounds the former, so by optimizing a function of past (training) data, we can guarantee desirable properties of the representation of future (test) data, such as sufficiency, minimality, invariance, and disentanglement.

That addresses the issue of how to compute an optimal representation for a given task. What if the task is not fully known ahead of time? It is common practice today to train a model on one task (say, finding cats and dogs in images), and then fine-tune it for another (say, detecting tumors in a mammogram). Sometimes it works. Sometimes it does not. Worse, it is impossible to predict whether it will. I will introduce a new framework to compute the (asymmetric) distance between tasks, and introduce the notion of Task Accessibility, which can predict whether fine-tuning will work, regardless of how “close” two tasks are. Indeed, there are tasks that are quite close, yet it is not possible to fine-tune from one to the other. This universal phenomenon of task inaccessibility is observed in biological systems (critical learning periods) as well as in artificial neural networks, and has nothing to do with biology. Instead, it has to do with the dynamics of learning, which we are only now beginning to uncover.
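For concreteness, the two objectives contrasted above can be sketched in symbols. This is a sketch, assuming the standard IB formulation and its weight-space analogue; here $x$ is the input, $y$ the task variable, $z$ the representation, $w$ the network weights, $\mathcal{D}$ the training set, and $\beta$ a trade-off parameter:

```latex
% Classical IB Lagrangian, defined on the representation z of future (test) data:
% trade off sufficiency for the task (low H(y|z)) against minimality (low I(z;x)).
\mathcal{L}_{\mathrm{IB}}\bigl(p(z \mid x)\bigr)
  \;=\; H(y \mid z) \;+\; \beta\, I(z; x)

% Analogous Lagrangian defined on the weights, computable from past (training) data:
% empirical cross-entropy loss plus the information the weights store about the dataset.
\mathcal{L}\bigl(q(w \mid \mathcal{D})\bigr)
  \;=\; H_{p,q}(\mathcal{D} \mid w) \;+\; \beta\, I(w; \mathcal{D})
```

The second quantity involves only training data and the model, so it can be estimated and optimized in practice; the claim in the abstract is that it bounds the first, which is what transfers guarantees from training to test representations.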