Generative models, particularly those based on diffusion, autoregressive, and transformer
architectures, have emerged as powerful tools for modeling complex, high-dimensional data
distributions. Building on these capabilities, generative models enable machines not only to interpret
sensory inputs but also to synthesize realistic outputs across modalities, including images, human motion,
and 3D structure. In this talk, I will present recent advances in generative modeling and discuss their
applications in robotics and embodied intelligence, with a focus on three representative domains:
robotic grasping, dance generation, and scene synthesis. I will discuss current limitations of generative
models in these tasks, including challenges in real-time inference, physical consistency, and
generalization to unseen environments. I will also highlight ongoing and future work aimed at
addressing these issues, such as developing more efficient architectures, integrating multi-modal
inputs, and enabling closed-loop deployment in real-world systems.