Fall 2025 GRASP on Robotics: Alan Yuille, Johns Hopkins University, “3D Vision Language Models and Interactive World Models”
October 24, 2025 @ 10:30 am - 11:45 am
This event was in-person only in Wu and Chen Auditorium…
ABSTRACT
Vision Language Models (VLMs) are extremely successful, but their performance degrades on questions involving spatial relations and 3D world knowledge. Inspired by cognitive science, we develop 3D VLMs that are 3D-aware and 3D-explicit, which helps us diagnose their failure modes. We present two approaches, both of which involve developing datasets with 3D annotations for training the 3D VLMs. The first approach was developed on realistic-synthetic datasets, with the 3D VLM built on a 3D image parser. These 3D VLMs significantly outperform conventional VLMs on questions involving 3D/6D reasoning (Xingrui Wang et al., CVPR 2025 highlight) and physical reasoning (Xingrui Wang et al., ICLR 2025). This work is extended to complex images, taking VLMs as base models, and is evaluated on a comprehensive 3D reasoning benchmark (W. Ma et al., ICCV 2025). We develop a 3D VLM that significantly outperforms conventional VLMs when asked questions requiring 3D knowledge (Wufei Ma et al., CVPR 2025 highlight). We further extend this approach to develop a 3D VLM that performs even better and is also 3D-explicit (Wufei Ma et al., NeurIPS 2025). We discuss the bigger picture, which involves the need for world models (J. Chen et al., ICLR 2025), analysis by synthesis (T. Zheng et al., NeurIPS 2025), and early detection of cancer from radiology reports (P. Bassi et al., MICCAI 2025).