Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies
With data from only one scene, our humanoid robot autonomously performs skills in the open world, in the wild. All skills shown in the videos are autonomous. Videos are played at 4x speed.
¹Stanford University, ²Simon Fraser University, ³UPenn, ⁴UIUC, ⁵CMU
Humanoid robots capable of autonomous operation in diverse environments have long been a goal for roboticists. However, autonomous manipulation by humanoid robots has largely been restricted to specific scenarios, primarily due to the difficulty of acquiring generalizable skills. Recent advances in 3D visuomotor policies, such as the 3D Diffusion Policy (DP3), have shown promise in extending these capabilities to wilder environments. However, 3D visuomotor policies often rely on camera calibration and point cloud segmentation, which present challenges for deployment on mobile robots like humanoids. In this work, we introduce the Improved 3D Diffusion Policy (iDP3), a novel 3D visuomotor policy that eliminates these constraints by leveraging egocentric 3D visual representations. We demonstrate that iDP3 enables a full-sized humanoid robot to autonomously perform skills in diverse real-world scenarios, using only data collected in the lab.
Our system mainly consists of four parts: the humanoid robot platform, the data collection system, the learning method, and the real-world deployment. For the learning part, we develop the Improved 3D Diffusion Policy (iDP3) as a visuomotor policy for general-purpose robots. To collect data from humans, we leverage Apple Vision Pro to build a whole-upper-body teleoperation system.
We use the Apple Vision Pro (AVP) to teleoperate the robot's upper body; it provides precise tracking of the human's hand, wrist, and head poses. The robot follows these poses accurately using Relaxed IK, and the robot's vision is streamed back to the AVP. Unlike other works, we also enable the waist DoF, which gives the robot a more flexible workspace.
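The sketch below outlines one control cycle of such a teleoperation loop. The `avp`, `ik_solver`, and `robot` objects and their methods are hypothetical placeholders standing in for the Apple Vision Pro pose streamer, a Relaxed-IK-style solver, and the humanoid's command interface; the retargeting transform and control rate are illustrative assumptions, not the exact values used in our system.

```python
import time
import numpy as np

# Fixed human-to-robot retargeting transform (hypothetical identity here;
# a real system would calibrate this mapping).
HUMAN_TO_ROBOT = np.eye(4)
CONTROL_HZ = 60  # assumed control rate

def retarget(pose_4x4: np.ndarray) -> np.ndarray:
    """Map a human pose (4x4 homogeneous matrix) into the robot's frame."""
    return HUMAN_TO_ROBOT @ pose_4x4

def teleop_step(avp, ik_solver, robot) -> None:
    """One teleoperation cycle: human poses in, joint commands out.

    `avp`, `ik_solver`, and `robot` are hypothetical interfaces standing in
    for the Apple Vision Pro streamer, a Relaxed-IK-style solver, and the
    humanoid's joint command interface; they are not the exact APIs used.
    """
    head, wrists, fingers = avp.get_latest_poses()  # human head/wrist/hand poses
    targets = {
        "head": retarget(head),
        "left_ee": retarget(wrists["left"]),
        "right_ee": retarget(wrists["right"]),
    }
    # Whole-upper-body IK over arms, neck, and (unlike most prior setups)
    # the waist, which enlarges the reachable workspace.
    q = ik_solver.solve(targets, use_waist=True)
    robot.command(arm_waist_neck=q, hands=fingers)

def teleop_loop(avp, ik_solver, robot) -> None:
    while True:
        teleop_step(avp, ik_solver, robot)
        time.sleep(1.0 / CONTROL_HZ)
```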
Existing 3D visuomotor policies typically depend on precise camera calibration and fine-grained point cloud segmentation, which limits their deployment on mobile platforms such as humanoid robots.
We introduce the Improved 3D Diffusion Policy (iDP3), a novel 3D visuomotor policy that eliminates these constraints, making it practical for general-purpose robots.
iDP3 leverages egocentric 3D visual representations, which lie in the camera frame, instead of the world frame as in the 3D Diffusion Policy and other 3D policies.
Leveraging egocentric 3D visual representations presents challenges in eliminating redundant point clouds, such as backgrounds or tabletops, especially without relying on foundation models. To mitigate this issue, we scale up the vision input to capture the entire scene.
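The sketch below illustrates what such an egocentric 3D representation looks like in code: depth pixels are back-projected into the camera frame using only the camera intrinsics (no extrinsic calibration or world-frame transform), coarsely cropped by depth rather than segmented, and subsampled to a fixed, relatively large number of points. The parameter values and the sampling scheme are illustrative assumptions, not the exact ones used by iDP3.

```python
import numpy as np

def egocentric_point_cloud(
    depth: np.ndarray,          # (H, W) depth in meters, e.g. from a RealSense L515
    fx: float, fy: float,       # camera intrinsics: focal lengths
    cx: float, cy: float,       # camera intrinsics: principal point
    num_points: int = 4096,     # assumed "scaled-up" point budget
    max_depth: float = 2.0,     # coarse depth crop instead of segmentation (assumed value)
) -> np.ndarray:
    """Back-project a depth image into a camera-frame point cloud.

    No camera-to-world extrinsics and no object/background segmentation are
    used: points stay in the egocentric (camera) frame, and the whole scene
    is kept, only coarsely cropped by depth.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)

    # Keep valid, reasonably close points; discard the rest.
    mask = (points[:, 2] > 0.05) & (points[:, 2] < max_depth)
    points = points[mask]

    # Subsample to a fixed number of points (uniform random here; farthest
    # point sampling is another common choice).
    idx = np.random.choice(len(points), num_points, replace=len(points) < num_points)
    return points[idx]
```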
Our egocentric 3D representations demonstrate impressive view invariance. As shown below, iDP3 consistently grasps objects even under large view changes, while Diffusion Policy (with a finetuned R3M and data augmentation) struggles to grasp even the training objects, succeeding only occasionally under minor view changes. Notably, unlike other works that target view generalization, we do not incorporate any specific designs for equivariance or invariance.
We also evaluated on several unseen objects. While Diffusion Policy can occasionally handle some of them thanks to its color-jitter augmentation, it does so with a very low success rate. In contrast, iDP3 naturally handles a wide range of objects, thanks to its use of 3D representations.
We find that iDP3 generalizes to a wide range of unseen real-world environments with robust and smooth behavior, while Diffusion Policy exhibits very jittery behavior in new scenes and even fails to grasp the training object.
We visualize our egocentric 3D representations, together with the corresponding images. These videos also highlight the complexity of diverse real-world scenes.
This work presents a real-world imitation learning system that enables a full-sized humanoid robot to autonomously perform practical manipulation tasks in diverse real-world environments, using training data collected solely in the lab. The key is the usage of the Improved 3D Diffusion Policy (iDP3), a new 3D visuomotor policy for general-purpose robots. Through extensive experiments, we demonstrate the impressive generalization capabilities of iDP3 in the real world.
(1) While our data collection system using Apple Vision Pro is easy to set up, it becomes physically tiring for humans after about 30 minutes, making large-scale data collection difficult.
(2) The RealSense L515 depth sensor produces noisy point clouds, showing the need for better perception techniques.
(3) Collecting fine-grained manipulation skills, like turning a screw, is time-consuming due to teleoperation limitations with Apple Vision Pro.
(4) We avoided using the robot's lower body, as maintaining balance is still a challenge.
Overall, scaling up high-quality data is the main bottleneck of our system.
@article{ze2024humanoid_manipulation,
  title   = {Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies},
  author  = {Yanjie Ze and Zixuan Chen and Wenhao Wang and Tianyi Chen and Xialin He and Ying Yuan and Xue Bin Peng and Jiajun Wu},
  year    = {2024},
  journal = {arXiv preprint arXiv:2410.10803}
}