
Advancing Multimodal Models Beyond Human Supervision

Seminar

Location: EER 3.646
Speaker: XuDong Wang

To advance AI toward true artificial general intelligence, it is crucial to incorporate a wider range of sensory inputs, including physical interaction, spatial navigation, and social dynamics. However, replicating the success of self-supervised Large Language Models (LLMs) in the other modalities of our physical and digital environments remains a significant challenge. In this talk, I will discuss how self-supervised learning methods can be harnessed to advance multimodal models beyond the need for human supervision. First, I will highlight a series of research efforts on self-supervised visual scene understanding that leverage self-supervised models to "segment anything" without the 1.1 billion labeled segmentation masks required by the popular supervised approach, the Segment Anything Model (SAM). Second, I will demonstrate how generative and understanding models can work together synergistically, complementing and enhancing each other. Finally, I will explore the increasingly important techniques for learning from unlabeled or imperfect data within the context of data-centric representation learning. All of these research directions are unified by the same core idea: advancing multimodal models beyond human supervision.

Biography

XuDong Wang is a final-year Ph.D. student in the Berkeley AI Research (BAIR) lab at UC Berkeley, advised by Prof. Trevor Darrell, and a Research Scientist on the Llama Research team at GenAI, Meta. He was previously a researcher at Google DeepMind (GDM) and the International Computer Science Institute (ICSI), and a research intern at Meta’s Fundamental AI Research (FAIR) labs and Generative AI (GenAI) Research team. His research focuses on self-supervised learning, multimodal models, and machine learning, with an emphasis on developing foundational AI systems that go beyond the constraints of human supervision. By advancing self-supervised learning techniques for multimodal models—minimizing reliance on human-annotated data—he aims to build intelligent systems capable of understanding and interacting with their environment in ways that mirror, and potentially surpass, the complexity, adaptability, and richness of human intelligence. He is a recipient of the William Oldham Fellowship at UC Berkeley, awarded for outstanding graduate research in EECS.