Learning to Reach Goals from Suboptimal Demonstrations via World Models

Advisor

Wong, Alexander
Shafiee, Javad

Publisher

University of Waterloo

Abstract

A central challenge for training autonomous agents is the scarcity of high-quality, long-horizon demonstrations. Unlike fields such as natural language or computer vision, where abundant internet data exists, many robotics and decision-making domains lack large, diverse, and high-quality datasets. One underutilized resource is suboptimal demonstrations, which are easier to collect and potentially more abundant. This limitation is particularly pronounced in goal-conditioned reinforcement learning (GCRL), where agents must learn to reach diverse goal states from limited demonstrations. While methods such as contrastive reinforcement learning (CRL) show promising scaling behavior when given access to abundant, high-quality training demonstrations, they struggle when demonstrations are suboptimal. In particular, when training demonstrations are short or exploratory, CRL fails to generalize beyond them, and the resulting policy exhibits lower success rates. To overcome this, we explore the use of self-supervised representation learning to extract general-purpose representations from demonstrations. The intuition is that if an agent can first learn robust representations of environment dynamics, without relying on demonstration optimality, it can then use these representations to guide reinforcement learning more effectively. Such representations can serve as a bridge between noisy demonstrations and goal-directed control, allowing policies to learn faster. In this thesis, we propose World Model Contrastive Reinforcement Learning (WM-CRL), which augments CRL with representations from a world model (WM). The world model is trained to predict future state embeddings from past state–action pairs, thereby encoding the dynamics of the environment. Because the world model only learns environment dynamics, it can leverage both high- and low-quality demonstrations. Integrating these world model embeddings into CRL's framework helps the agent better capture environment dynamics and select actions that more effectively reach its goals. We evaluate WM-CRL on tasks from the OGBench benchmark, spanning multiple locomotion and manipulation environments and datasets of varying quality. Our results show that WM-CRL can substantially improve performance over CRL in suboptimal-data settings, such as stitching short trajectories or learning from exploratory behavior. However, we find that our method offers limited benefit when abundant expert demonstrations are available. Ablation studies further reveal that success depends critically on the stability of world model training and on how its embeddings are integrated into the agent's architecture.
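For illustration only, the sketch below shows one way the idea described in the abstract could be wired up: a small latent forward model trained to predict the next state's embedding from a state–action pair, with that embedding concatenated into the input of a goal-conditioned critic. All module names, layer sizes, and the squared-error training objective here are assumptions made for exposition, not the thesis's actual WM-CRL implementation, and CRL's contrastive pairing of states and goals is omitted.

```python
# Minimal sketch (hypothetical names and sizes), assuming a PyTorch setup:
# a latent forward model encodes the environment dynamics, and its embedding
# augments the input of a goal-conditioned critic.
import torch
import torch.nn as nn


class LatentForwardModel(nn.Module):
    """Encodes states and predicts the next state's embedding from (state, action)."""

    def __init__(self, obs_dim: int, act_dim: int, embed_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )
        self.dynamics = nn.Sequential(
            nn.Linear(embed_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, obs: torch.Tensor, action: torch.Tensor):
        z = self.encoder(obs)
        z_next_pred = self.dynamics(torch.cat([z, action], dim=-1))
        return z, z_next_pred


def world_model_loss(model: LatentForwardModel, obs, action, next_obs):
    """Squared error between the predicted and encoded next-state embedding.
    The target embedding is computed without gradients so the prediction head,
    rather than the target, absorbs the error."""
    _, z_next_pred = model(obs, action)
    with torch.no_grad():
        z_next_target = model.encoder(next_obs)
    return ((z_next_pred - z_next_target) ** 2).mean()


class GoalConditionedCritic(nn.Module):
    """Critic whose input is augmented with the world-model embedding."""

    def __init__(self, obs_dim: int, act_dim: int, goal_dim: int, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + goal_dim + embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs, action, goal, wm_embedding):
        # Detach so the critic loss does not disturb the world model's representation.
        x = torch.cat([obs, action, goal, wm_embedding.detach()], dim=-1)
        return self.net(x)
```

Note that a purely latent prediction objective like this can collapse to trivial embeddings without additional safeguards (for example, target networks or stop-gradient schemes), which is consistent with the abstract's observation that the stability of world model training is critical.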
