Learning Compositional World Models for Robot Imagination

1HKUST, 2MIT, 3UCSD, 4UCF, 5UMass Amherst, 6MIT-IBM Watson AI Lab

ICML 2024


Text-to-video models have demonstrated substantial potential in robotic decision-making, enabling the imagination of realistic plans of future actions as well as accurate environment simulation. However, one major issue in such models is generalization -- models are limited to synthesizing videos subject to language instructions similar to those seen at training time. This is heavily limiting in decision-making, where we seek a powerful world model to synthesize plans of unseen combinations of objects and actions in order to solve previously unseen tasks in new environments. To resolve this issue, we introduce RoboDreamer, an innovative approach for learning a compositional world model by factorizing the video generation. We leverage the natural compositionality of language to parse instructions into a set of lower-level primitives, which we condition a set of models on to generate videos. We illustrate how this factorization naturally enables compositional generalization, by allowing us to formulate a new natural language instruction as a combination of previously seen components. We further show how such a factorization enables us to add additional multimodal goals, allowing us to specify a video we wish to generate given both natural language instructions and a goal image. Our approach can successfully synthesize video plans on unseen goals in the RT-X, enables successful robot execution in simulation, and substantially outperforms monolithic baseline approaches to video generation.

Interpolate start reference image.

Compositional World Models. Given language instructions and multimodal instructions such as goal images and sketches, our approach factorizes the generation into a composition of diffusion models conditioned on inferred components. This enables our approach to generalize to both new combinations of language and multimodal input.

Interpolate start reference image.

Overall framework of RoboDreamer. On the left, We leverage the natural compositionally of language to parse instructions into components like action phrases and relation phrases. On the right, we show how RoboDreamer composes multiple components.

Qualitative Results

This section contains video generation under Seen tasks and Unseen tasks; multimodal generation; RLBench Results; generation under Partial Description; Other Datasets.

Seen Task Generation

move green can near water bottle

move apple near orange can

move 7up can near paper bowl

move green chip bag near apple

move rxbar chocolate near coke can

move green chip bag near coke can

move blue plastic bottle near redbull can

move green can near orange

move sponge near blue plastic bottle

move apple near green chip bag

move orange near blue rxbar blueberry

move green can near rxbar blueberry

Unseen Task Generation

move pepsi can near water bottle

move sponge near orange

move rxbar chocolate near pepsi can

move sponge near orange can

move pepsi can near blue plastic bottle

move blue chip bag near coke can

move apple near sponge

move rxbar chocolate near pepsi can

move pepsi can near apple

move apple near blue chip bag

move 7up can near green can

move sponge near apple

MultiModal Generation (Goal Image)

pick pepsi can from bottom drawer

open top drawer

pick apple from white bowl

move rxbar bluecherry near blue plastic bottle

pick banana from white bowl

place orange into middle drawer

MultiModal Generation (Goal Sketch)

pick apple from top drawer

open middle drawer

place blue plastic bottle upright

place pepsi can upright

pick apple

place 7up can into middle drawer


Left shows the synthetic video and Right shows the results.

stack blocks

lift block

take shoes out of box

take shoes out of box

lamp off

close box

lamp off

stack blocks

lift block

close box

take shoes out of box

stack blocks

Partial Description

move pepsi can

move blue chip bag

move rxbar chocolate

move sponge

move rxbar chocolate

move pepsi can

Other Datasets

put the ranch bottle into the pot

pick up red cup and put on blue napkin

place steak meat on the table

close microwave

place the burger meat in the oven

place the pan over the yellow cloth

pick up orange fruit

pick up the blue cup and put it into the brown cup

sweep the green cloth to the left side of the table

move purple cloth to the left of the table

open top drawer

close bottom drawer