Now that a significant fraction of human knowledge has been shared through the Internet, scraped and squashed into the weights of Large Language Models (LLMs), do we still need embodiment and interaction with the physical world to build representations? Is there a dichotomy between LLMs and “large world models”? What is the role of visual perception in learning such models? Can perceptual agents trained by passive observation learn world models suitable for control?
To begin tackling these questions, I will first address the issue of controllability of LLMs. LLMs are stochastic dynamical systems, for which the notion of controllability is well established: The state (“of mind”) of an LLM can be trivially steered by a suitable choice of input given enough time and memory. However, the space of interest for control of an LLM is not that of words, but that of “meanings” expressible as sentences that a human could have spoken and would understand. Unfortunately, unlike controllability, the notions of meaning and understanding are not usually formalized in a way that is relatable to LLMs in use today.
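To fix ideas, a minimal way to write this down (the notation here is my own, not taken from the talk) is to model an autoregressive LLM as a discrete-time stochastic dynamical system whose state is the current token context:

\[
x_{t+1} = f(x_t, u_t, w_t), \qquad x_t \in \mathcal{X},
\]

where \(x_t\) is the context window (the “state of mind”), \(u_t\) the user-supplied input tokens, \(w_t\) the sampling noise, and \(f\) one step of autoregressive generation. Controllability in the classical sense asks whether, for any pair of states \(x, x'\), there exists a finite input sequence steering \(x\) to \(x'\) with positive probability; at the level of tokens this holds almost vacuously, which is why the substantive question shifts to the space of meanings.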
I will propose a simplistic definition of meaning that reflects the functional characteristics of a trained LLM. I will show that a well-trained LLM establishes a topology in the space of meanings, represented by equivalence classes of trajectories of the underlying dynamical model (the LLM). Then, I will describe both necessary and sufficient conditions for controllability in such a space of meanings.
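One way to make this concrete (again a sketch in my own notation, not the talk’s formalism): write \(\sigma = (x_0, x_1, \dots, x_T)\) for a complete trajectory of the model, i.e., a sentence it could generate, and call two trajectories equivalent when the trained model treats them as interchangeable contexts,

\[
\sigma \sim \sigma' \;\iff\; P_\theta(\,\cdot \mid \sigma) = P_\theta(\,\cdot \mid \sigma'),
\]

so that a “meaning” is an equivalence class \([\sigma]\) and the space of meanings is the quotient of trajectories by \(\sim\). A distance between the conditional distributions (for instance KL divergence) then induces the topology mentioned above, and controllability becomes the question of whether a suitable prompt can drive the model from any meaning class to any other.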
I will then highlight the relation between meanings induced by a trained LLM upon the set of sentences that could be uttered, and “physical scenes” underlying sets of images that could be observed. In particular, a physical scene can be defined uniquely and inferred as an abstract concept without the need for embodiment, a view aligned with J. Koenderink’s characterization of images as “controlled hallucinations.”
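The analogy can be spelled out in the same style (my notation, offered only as an illustration): if image formation is written as \(I = h(S, g)\), with \(S\) the scene and \(g\) the nuisance variables of viewpoint, illumination, and so on, then the scene can be identified with the equivalence class of images

\[
S \;\simeq\; \{\, h(S, g) \;:\; g \in G \,\},
\]

an abstract object that is determined by, and in principle inferable from, passively observed images alone, with no action in the world required.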
Lastly, I will show that popular models ostensibly used to represent the 3D scene (Neural Radiance Fields, or NeRFs) can at most represent the images on which they are trained, but not the underlying physical scene. However, composing a NeRF with a Latent Diffusion Model or another inductively-trained generative model yields a viable representation of the physical scene. Such a model class, which can be learned through passive observation, is a first, albeit rudimentary, Foundational Model of physical scenes, in the sense of being sufficient for any downstream inference task based on visual data.
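As a rough illustration of the kind of composition described above (every name in this sketch is a hypothetical placeholder, not an actual library API): the NeRF contributes view synthesis fit to the training images, while the inductively-trained generative model contributes a prior over natural images, so that renderings from novel viewpoints can be scored for plausibility rather than merely reproducing the training views.

    # Sketch only: hypothetical placeholder classes, not a real NeRF or diffusion API.
    import numpy as np

    class NeRF:
        """Stand-in for a trained radiance field: maps a camera pose to a rendered image."""
        def render(self, pose: np.ndarray) -> np.ndarray:
            # A real NeRF would ray-march through a learned density/color field.
            return np.zeros((64, 64, 3))

    class ImagePrior:
        """Stand-in for an inductively-trained generative model (e.g., a latent diffusion model)."""
        def log_likelihood(self, image: np.ndarray) -> float:
            # A real model would return (an approximation of) log p(image).
            return 0.0

    def scene_plausibility(nerf: NeRF, prior: ImagePrior, novel_poses: list) -> float:
        """Score a candidate scene representation: renderings from poses never seen in
        training should remain plausible under the image prior, not just fit the data."""
        return float(np.mean([prior.log_likelihood(nerf.render(p)) for p in novel_poses]))

The point of the sketch is the division of labor: the radiance field alone is only accountable to its training views, whereas the composed model is additionally accountable to a prior learned from passive observation, which is what lets it stand in for the scene rather than for a particular set of images.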