This event will be in-person ONLY in Wu and Chen Auditorium.
Autonomous agents need a world model that explains observations, predicts what comes next, and chooses actions over long horizons. Think of catching a ball: the robot must infer where the ball is now and where it will be next, even when it slips out of view, and move to intercept it. Recently, large diffusion-based video models trained on internet-scale data have shown promising results for world modeling; however, they remain brittle: forecasting errors accumulate over time, especially during long open-loop rollouts without geometric grounding or corrective feedback. In this talk, we present our recent research toward a more robust video generation foundation. Instead of diffusion, we build on scalable normalizing flows, a different family of generative models based on invertible transformations. We will detail the mathematical formulation, explain how these models can be trained end to end, and describe how we construct a practical video model from this framework. We will conclude by outlining research directions that follow from this approach and steps toward a truly robust world model.
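As general background (the talk's specific formulation may differ), a normalizing flow models data through an invertible map $f_\theta$ sending an observation $x$ to a latent $z = f_\theta(x)$ with a simple base density $p_Z$; the exact log-likelihood then follows from the change-of-variables formula and can be maximized end to end:

\[
\log p_X(x) \;=\; \log p_Z\!\left(f_\theta(x)\right) \;+\; \log \left|\det \frac{\partial f_\theta(x)}{\partial x}\right|.
\]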