A long-standing goal in robotics is to build robots that can perform a wide range of daily tasks from onboard sensor perception and instructions specified only via natural language. While substantial advances have recently been achieved in language-driven robotics by leveraging end-to-end learning from pixels, there is no clear, well-understood process for making the various design choices, due to the underlying variation in setups. In this paper, we conduct an extensive study of the most critical challenges in learning language-conditioned policies from offline free-form imitation datasets. We further identify architectural and algorithmic techniques that improve performance, such as a hierarchical decomposition of robot control learning, a multimodal transformer encoder, discrete latent plans, and a self-supervised contrastive loss that aligns video and language representations.

By combining the results of our investigation with our improved model components, we present a novel approach that significantly outperforms the state of the art on the challenging CALVIN benchmark for language-conditioned, long-horizon robot manipulation.

Technical Approach

We build upon relabeled imitation learning to distill many reusable behaviors from undirected offline imitation datasets into a goal-directed policy. Our method consists only of standard supervised learning subroutines and learns perception, language understanding, and task-agnostic control end-to-end as a single neural network. We systematically compare key components of language-conditioned robotic imitation learning, such as observation and action spaces, auxiliary losses for aligning visuo-lingual representations, language models, and latent plan representations, and we analyze the effect of other choices, such as data augmentation and optimization. We propose four improvements to these key components: (i) a multimodal transformer encoder that learns to recognize and organize behaviors during robotic interaction into a global categorical latent plan, (ii) a hierarchical division of robot control learning that learns local policies in the gripper camera frame conditioned on the global plan, (iii) balancing terms within the KL loss, and (iv) a self-supervised contrastive visuo-lingual alignment loss. We integrate the best performing improved components in a unified framework, Hierarchical Universal Language Conditioned Policies (HULC).
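The self-supervised contrastive visuo-lingual alignment loss mentioned above can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration of a symmetric InfoNCE-style objective over a batch of paired video and language embeddings, not the exact HULC implementation; the function name, the temperature value, and the symmetric two-direction formulation are illustrative choices.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, lang_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss aligning video and language embeddings.

    video_emb, lang_emb: (batch, dim) tensors; row i of each is a positive pair,
    all other rows in the batch serve as negatives. (Illustrative sketch.)
    """
    video_emb = F.normalize(video_emb, dim=-1)
    lang_emb = F.normalize(lang_emb, dim=-1)
    # Pairwise cosine similarities, scaled by a temperature.
    logits = video_emb @ lang_emb.t() / temperature  # (batch, batch)
    labels = torch.arange(video_emb.size(0), device=video_emb.device)
    # Cross-entropy in both directions: video-to-language and language-to-video.
    loss_v2l = F.cross_entropy(logits, labels)
    loss_l2v = F.cross_entropy(logits.t(), labels)
    return 0.5 * (loss_v2l + loss_l2v)
```

Minimizing this loss pulls matching video and language representations together while pushing apart non-matching pairs from the same batch.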

Network architecture
Figure: Overview of the architecture to learn language conditioned policies. First, the language instructions and the visual observations are encoded. During training, a multimodal transformer encodes sequences of observations to learn to recognize and organize high-level behaviors through a posterior. Its temporally contextualized features are provided as input to a contrastive visuo-lingual alignment loss. The plan sampler network receives the initial state and the latent language goal and predicts the distribution over plans for achieving the goal. Both prior and posterior distributions are predicted as a vector of multiple categorical variables and are trained by minimizing their KL divergence. The local policy network receives the latent language instruction, the gripper camera observation and the global latent plan to generate a sequence of relative actions in the gripper camera frame to achieve the goal.
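The KL term between the prior and posterior over the vector of categorical latents, with the balancing terms mentioned in the approach, can be sketched as follows. This is a hedged illustration assuming a DreamerV2-style KL balancing scheme (stop-gradients on alternating sides, mixed by a coefficient); the function name, logit shapes, and the `alpha` value are assumptions for illustration, not the confirmed HULC hyperparameters.

```python
import torch
import torch.nn.functional as F

def balanced_kl_loss(post_logits, prior_logits, alpha=0.8):
    """Balanced KL between posterior and prior over categorical latents.

    post_logits, prior_logits: (batch, n_vars, n_classes) logits for a
    vector of independent categorical variables. The balancing lets the
    prior move toward the posterior faster than the posterior is pulled
    toward the prior. (Illustrative sketch.)
    """
    def kl(p_logits, q_logits):
        # KL(p || q) for categoricals, averaged over batch and latent variables.
        log_p = F.log_softmax(p_logits, dim=-1)
        log_q = F.log_softmax(q_logits, dim=-1)
        return (log_p.exp() * (log_p - log_q)).sum(-1).mean()

    # Stop-gradient on one side of each term (KL balancing).
    kl_train_prior = kl(post_logits.detach(), prior_logits)  # updates the prior
    kl_reg_post = kl(post_logits, prior_logits.detach())     # regularizes the posterior
    return alpha * kl_train_prior + (1 - alpha) * kl_reg_post
```

At inference time, only the prior (the plan sampler) is needed: a plan is sampled from the predicted categorical distribution and passed to the local policy.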

Qualitative Results

Language Conditioned Long-Horizon Rollouts of HULC trained and tested on environment D


Coming soon.


A PyTorch implementation of this project, including trained models, can be found in our GitHub repository for academic usage and is released under the MIT license.


What Matters in Language Conditioned Robotic Imitation Learning
Oier Mees, Lukas Hermann, Wolfram Burgard