Language-guided robotic manipulation is a challenging task that requires an embodied agent to follow abstract user instructions to accomplish various complex manipulation tasks.
Previous work trivially fits the data without revealing the relation between instructions and low-level executable actions; such models are prone to memorizing surface patterns of the data instead of acquiring transferable knowledge, and are thus fragile to dynamic environment changes.
To address this issue, we propose a PrImitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R) that focuses solely on the prediction of task-relevant waypoints.
Specifically, PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module. The former performs primitive action parsing and primitive-driven waypoint prediction, while the latter focuses on decoding low-level actions. Additionally, we design an asynchronous hierarchical executor (AHE), which runs different modules of the model at different execution frequencies, thereby reducing computational redundancy and improving execution efficiency.
PIVOT-R is a primitive-driven waypoint-aware world model with an asynchronous hierarchical executor. It focuses solely on predicting the waypoints relevant to the manipulation task, which makes the key steps of a task easier to predict than with other methods. In addition, PIVOT-R assigns different execution frequencies to different modules, yielding higher execution efficiency and lower redundancy.

It mainly consists of a waypoint-aware world model (WAWM) and an action prediction module, where the two modules cooperate through an asynchronous hierarchical executor (AHE). In WAWM, we first use a pre-trained VLM to perform low-frequency primitive action parsing on user instructions and provide waypoint indications for the scene prediction module. Then, the scene prediction module learns to model world knowledge based on waypoints and manipulation trajectories. Finally, a lightweight action prediction module performs high-frequency action prediction and execution.
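The core idea of the AHE is that slower modules run only every k control steps while their latest outputs are cached for the faster ones. Below is a minimal Python sketch of this scheduling; the module interfaces (`vlm.parse`, `scene_predictor.predict`, `action_head.decode`) and the periods are hypothetical placeholders, not the paper's actual API or settings.

```python
class AsyncHierarchicalExecutor:
    """Minimal sketch of an asynchronous hierarchical executor (AHE).

    Each module runs at its own frequency; the cached outputs of the
    slower modules are reused by the faster ones. Interfaces and
    periods here are illustrative assumptions, not the paper's setup.
    """

    def __init__(self, vlm, scene_predictor, action_head,
                 vlm_period=16, waypoint_period=4):
        self.vlm = vlm                          # pre-trained VLM: primitive action parsing
        self.scene_predictor = scene_predictor  # waypoint-aware scene prediction
        self.action_head = action_head          # lightweight low-level action decoder
        self.vlm_period = vlm_period            # low frequency
        self.waypoint_period = waypoint_period  # mid frequency
        self.primitive = None                   # cached primitive action
        self.waypoint = None                    # cached predicted waypoint

    def step(self, t, obs, instruction):
        # Low frequency: parse the instruction into the current primitive action.
        if t % self.vlm_period == 0:
            self.primitive = self.vlm.parse(instruction, obs)
        # Mid frequency: predict the next task-relevant waypoint.
        if t % self.waypoint_period == 0:
            self.waypoint = self.scene_predictor.predict(obs, self.primitive)
        # High frequency: decode a low-level action at every control step.
        return self.action_head.decode(obs, self.primitive, self.waypoint)
```

Running the VLM at every control step would dominate latency, so refreshing its output only occasionally amortizes the cost while the lightweight action head keeps up with the control-loop rate.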
We conducted real-world experiments with three tasks: (i) "Pick up": pick up the correct object from the table; (ii) "Put on": pick up the object and place it on the correct color block; (iii) "Push to": push the object to the correct color block. We collected 400, 200, and 200 demonstrations for the three tasks, respectively. PIVOT-R achieved a 6% improvement over the best baseline.
"Push to": Push coffee to pink block.
"Pick up": Pick up the juice in the front row.
"Put on": Pick up starbucks and put it on yellow block.
We choose SeaWave, an open-source benchmark for multi-level instruction tasks, as our experimental platform, and use its data as demonstrations for imitation learning. Its key advantage is that it provides progressive tasks, which facilitates a comprehensive comparison and analysis of the model's capabilities. It supports 8 skills, covering daily operations such as grasping and placing objects and opening and closing doors, with more than 3,000 different instructions. The SeaWave dataset covers four levels of language instructions; we train on this dataset and test on a held-out test split. Results are shown below.
Level 1: Pick up Bernachon.
Level 2: Give me the yogurt.
Level 3: I'm thirsty, can you use a cup to pour the red-packaged drink for me?
Level 4: The drink in the lower right corner looks good. Can I have it?
We also perform experiments in unseen scenarios on Level 2, 3, and 4 tasks. The new scenarios include unseen backgrounds (i.e., two unseen tables), changed light intensity, and more distractors (i.e., more objects). The results are shown below.
Changed Background · Changed Lights · More Distractors
We propose PIVOT-R, a primitive-driven waypoint-aware world model. PIVOT-R focuses on the execution of primitive actions, and predicting key future waypoints greatly improves performance. It achieves state-of-the-art results on the SeaWave benchmark, and our experiments show that it is robust to environment changes. An asynchronous hierarchical executor keeps model execution fast enough for control. In addition, we show that PIVOT-R has the potential to complete unseen instructions and tasks under the guidance of a high-level VLM. Finally, we show that PIVOT-R can further improve its performance from human demonstrations. These results illustrate the potential of PIVOT-R.
@misc{zhang2024pivotrprimitivedrivenwaypointawareworld,
title={PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation},
author={Kaidong Zhang and Pengzhen Ren and Bingqian Lin and Junfan Lin and Shikui Ma and Hang Xu and Xiaodan Liang},
year={2024},
eprint={2410.10394},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2410.10394},
}