Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies

With data from only one scene, our humanoid robot autonomously performs skills in the wild. All skills shown in the videos are autonomous. Videos are at 4x speed.


Stanford University · Simon Fraser University · UPenn · UIUC · CMU


Abstract

Humanoid robots capable of autonomous operation in diverse environments have long been a goal for roboticists. However, autonomous manipulation by humanoid robots has largely been restricted to specific scenarios, primarily due to the difficulty of acquiring generalizable skills. Recent advances in 3D visuomotor policies, such as the 3D Diffusion Policy (DP3), have shown promise in extending these capabilities to wilder environments. However, 3D visuomotor policies often rely on camera calibration and point cloud segmentation, which present challenges for deployment on mobile robots like humanoids. In this work, we introduce the Improved 3D Diffusion Policy (iDP3), a novel 3D visuomotor policy that eliminates these constraints by leveraging egocentric 3D visual representations. We demonstrate that iDP3 enables a full-sized humanoid robot to autonomously perform skills in diverse real-world scenarios, using only data collected in the lab.



Summary

Overview of Our System

Our system mainly consists of four parts: the humanoid robot platform, the data collection system, the learning method, and the real-world deployment. For the learning part, we develop the Improved 3D Diffusion Policy (iDP3) as a visuomotor policy for general-purpose robots. To collect data from humans, we leverage Apple Vision Pro to build a whole-upper-body teleoperation system.
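
At deployment time, the learning and deployment parts reduce to a simple closed loop: the robot observes an egocentric point cloud together with its proprioception, iDP3 predicts a short sequence of future actions, and the robot executes some of them before predicting again (receding-horizon control, as is standard for diffusion-based policies). The sketch below only illustrates this loop; IDP3Policy, get_observation, and send_joint_targets are hypothetical placeholder names, and the horizon and action dimension are illustrative values, not those of our released code.

# Minimal sketch of the closed-loop deployment described above. All names
# (IDP3Policy, get_observation, send_joint_targets) and all dimensions are
# illustrative placeholders, not the API of the released code.
import numpy as np


class IDP3Policy:
    """Stand-in for a trained iDP3 checkpoint: point cloud + proprioception -> action chunk."""

    def __init__(self, horizon: int = 16, action_dim: int = 25):
        self.horizon = horizon
        self.action_dim = action_dim

    def predict(self, points: np.ndarray, proprio: np.ndarray) -> np.ndarray:
        # A real policy would run the diffusion denoising steps here.
        return np.zeros((self.horizon, self.action_dim))


def get_observation():
    """Placeholder: return (N, 3) egocentric points and the current joint positions."""
    return np.random.rand(4096, 3), np.zeros(25)


def send_joint_targets(action: np.ndarray) -> None:
    """Placeholder: forward one action (upper-body joint targets) to the controller."""
    pass


def deployment_loop(policy: IDP3Policy, num_steps: int = 100, execute_k: int = 8) -> None:
    for _ in range(num_steps):
        points, proprio = get_observation()
        action_chunk = policy.predict(points, proprio)
        # Execute only the first few predicted actions before re-planning.
        for action in action_chunk[:execute_k]:
            send_joint_targets(action)


if __name__ == "__main__":
    deployment_loop(IDP3Policy(), num_steps=2)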


Whole-Upper-Body Teleoperation

We use the Apple Vision Pro (AVP) to teleoperate the robot's upper body; it provides precise tracking of the human hand, wrist, and head poses. The robot follows these poses accurately using Relaxed IK, and the robot's vision is streamed back to the AVP. Unlike other works, we also enable the waist degrees of freedom, which gives the robot a more flexible workspace.
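
Conceptually, each teleoperation step maps the tracked poses to robot targets: wrist poses go to the IK solver for the arms and waist, hand keypoints are retargeted to the robot hands, and the head pose drives the neck. The following is a minimal sketch of that data flow with hypothetical stand-ins (read_avp_poses, RelaxedIKSolver, retarget_hand); it is not the interface of our released teleoperation code.

# Illustrative data flow for whole-upper-body teleoperation. The tracker,
# IK solver, and retargeting functions below are hypothetical stand-ins;
# the released teleop code has its own interfaces.
from dataclasses import dataclass
import numpy as np


@dataclass
class UpperBodyTargets:
    arm_joints: np.ndarray   # IK solution for the arms (and waist)
    hand_joints: np.ndarray  # retargeted finger joint angles
    head_rpy: np.ndarray     # neck orientation to follow the operator's head


def read_avp_poses():
    """Placeholder: 4x4 head/wrist poses and per-hand keypoints from the Vision Pro."""
    eye = np.eye(4)
    return {"head": eye, "left_wrist": eye, "right_wrist": eye,
            "hand_keypoints": np.zeros((2, 25, 3))}  # two hands, illustrative keypoint count


class RelaxedIKSolver:
    """Placeholder wrapper around a Relaxed-IK-style solver."""

    def solve(self, left_wrist: np.ndarray, right_wrist: np.ndarray) -> np.ndarray:
        return np.zeros(15)  # e.g. 7 joints per arm plus a waist joint (illustrative)


def retarget_hand(keypoints: np.ndarray) -> np.ndarray:
    """Placeholder: map human hand keypoints to robot hand joint angles."""
    return np.zeros(12)


def teleop_step(ik: RelaxedIKSolver) -> UpperBodyTargets:
    poses = read_avp_poses()
    arm_joints = ik.solve(poses["left_wrist"], poses["right_wrist"])
    hand_joints = np.concatenate([retarget_hand(k) for k in poses["hand_keypoints"]])
    head_rpy = np.zeros(3)  # a real system would extract this from poses["head"]
    return UpperBodyTargets(arm_joints, hand_joints, head_rpy)


if __name__ == "__main__":
    print(teleop_step(RelaxedIKSolver()))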


Improved 3D Diffusion Policies (iDP3)

Existing 3D visuomotor policies are inherently dependent on precise camera calibration and fine-grained point cloud segmentation, which limits their deployment on mobile platforms such as humanoid robots.
We introduce the Improved 3D Diffusion Policy (iDP3), a novel 3D visuomotor policy that eliminates these constraints and thus enables deployment on general-purpose robots. iDP3 leverages egocentric 3D visual representations, which lie in the camera frame rather than the world frame used by the 3D Diffusion Policy and other 3D policies.
Egocentric 3D representations make it harder to eliminate redundant points, such as backgrounds or tabletops, especially without relying on foundation models. To mitigate this issue, we scale up the vision input to capture the entire scene (see the sketch below).
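
The sketch below illustrates this idea under simplified assumptions: the depth image is back-projected into the camera frame with the camera intrinsics (no extrinsic calibration), loosely cropped to a workspace box, and downsampled to a relatively large, fixed number of points without any segmentation. The intrinsics, crop bounds, point count, and the choice of farthest-point sampling are illustrative placeholders rather than the exact settings used by iDP3.

# Sketch: egocentric (camera-frame) point cloud for the policy, with no
# extrinsic calibration and no segmentation. Intrinsics, crop bounds, and
# the number of sampled points are illustrative assumptions.
import numpy as np


def depth_to_camera_points(depth: np.ndarray, fx: float, fy: float,
                           cx: float, cy: float) -> np.ndarray:
    """Back-project an (H, W) depth image (meters) to (N, 3) points in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels


def crop_workspace(points: np.ndarray, lo=(-1.0, -1.0, 0.1), hi=(1.0, 1.0, 1.5)) -> np.ndarray:
    """Keep points inside a rough axis-aligned box around the workspace (bounds are placeholders)."""
    mask = np.all((points >= lo) & (points <= hi), axis=1)
    return points[mask]


def farthest_point_sampling(points: np.ndarray, num_samples: int) -> np.ndarray:
    """Greedy farthest-point sampling down to a fixed-size set of points."""
    if len(points) <= num_samples:
        return points
    selected = [0]
    dists = np.linalg.norm(points - points[0], axis=1)
    for _ in range(num_samples - 1):
        idx = int(np.argmax(dists))
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(points - points[idx], axis=1))
    return points[selected]


if __name__ == "__main__":
    depth = np.random.uniform(0.3, 1.2, size=(240, 320))  # synthetic depth for the demo
    pts = depth_to_camera_points(depth, fx=300.0, fy=300.0, cx=160.0, cy=120.0)
    pts = crop_workspace(pts)
    obs = farthest_point_sampling(pts, num_samples=4096)
    print(obs.shape)  # (4096, 3): scene-level points, camera frame, no segmentation

Because everything stays in the camera frame, no world-frame calibration or object mask is needed; the policy simply sees more points of the whole scene.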

Generalization Ability 1: View Generalization

Our egocentric 3D representations demonstrate impressive view invariance. As shown below, iDP3 consistently grasps objects even under large view changes, while Diffusion Policy (with finetuned R3M and data augmentation) struggles to grasp even the training objects and succeeds only occasionally under minor view changes. Notably, unlike other works that pursue view generalization, we do not incorporate any specific designs for equivariance or invariance.

[Videos] Diffusion Policy (+ finetuned R3M, + augmentation): cannot grasp under large view changes.
[Videos] iDP3: robust to large view changes.

Generalization Ability 2: Object Generalization

We evaluate several new objects. While Diffusion Policy can occasionally handle some unseen objects thanks to its color-jitter augmentation, it does so with a very low success rate. In contrast, iDP3 naturally handles a wide range of objects, thanks to its use of 3D representations.

[Videos] Diffusion Policy (+ finetuned R3M, + augmentation): training scene, training object & new object; training scene, new object.
[Videos] iDP3: training scene, training object & new object; training scene, new object.

Generalization Ability 3: Scene Generalization

We find that iDP3 generalizes to a wide range of unseen real-world environments with robust and smooth behavior, while Diffusion Policy exhibits very jittery behavior in new scenes and even fails to grasp the training object.

[Videos] Diffusion Policy (+ finetuned R3M, + augmentation): new scene, training object & new object; new scene, new object.
[Videos] iDP3: new scene, training object & new object; new scene, new object.

Visualizations of Egocentric 3D Visual Representations

We visualize our egocentric 3D representations, together with the corresponding images. These videos also highlight the complexity of diverse real-world scenes.

[Videos] Egocentric 3D visual representations and the corresponding images for the Pick & Place, Pour, and Wipe tasks across diverse scenes.
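
To inspect such point clouds offline, a minimal viewer like the one below can be used; it assumes Open3D is installed and that a point cloud has been saved as an (N, 3) xyz or (N, 6) xyz+rgb NumPy array (the file name is a placeholder).

# Minimal offline viewer for a saved egocentric point cloud. Assumes Open3D
# is installed; "scene_points.npy" is a placeholder file name.
import numpy as np
import open3d as o3d


def show_point_cloud(path: str = "scene_points.npy") -> None:
    data = np.load(path)
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(data[:, :3])
    if data.shape[1] >= 6:
        # Colors are expected in [0, 1]; rescale if they were stored as 0-255.
        colors = data[:, 3:6]
        if colors.max() > 1.0:
            colors = colors / 255.0
        pcd.colors = o3d.utility.Vector3dVector(colors)
    o3d.visualization.draw_geometries([pcd])


if __name__ == "__main__":
    show_point_cloud()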

Conclusions

This work presents a real-world imitation learning system that enables a full-sized humanoid robot to autonomously perform practical manipulation tasks in diverse real-world environments, using training data collected solely in the lab. The key is our Improved 3D Diffusion Policy (iDP3), a new 3D visuomotor policy for general-purpose robots. Through extensive experiments, we demonstrate the impressive generalization capabilities of iDP3 in the real world.

Limitations

(1) While our data collection system using Apple Vision Pro is easy to set up, it becomes physically tiring for humans after about 30 minutes, making large-scale data collection difficult.
(2) The RealSense L515 depth sensor produces noisy point clouds, showing the need for better perception techniques.
(3) Collecting demonstrations of fine-grained manipulation skills, such as turning a screw, is time-consuming due to the limitations of Apple Vision Pro teleoperation.
(4) We avoided using the robot's lower body, as maintaining balance is still a challenge.
Overall, scaling up high-quality data collection remains the main bottleneck of our system.

BibTeX

@article{ze2024humanoid_manipulation,
  title   = {Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies},
  author  = {Yanjie Ze and Zixuan Chen and Wenhao Wang and Tianyi Chen and Xialin He and Ying Yuan and Xue Bin Peng and Jiajun Wu},
  year    = {2024},
  journal = {arXiv preprint arXiv:2410.10803}
}