Introducing DNAct: Diffusion Guided Multi-Task 3D Policy Learning.
We combine neural rendering pre-training and diffusion models to learn a generalizable policy with strong 3D semantic scene understanding.
DNAct leverages NeRF as a 3D pre-training approach. By distilling 2D features from foundation models into a 3D space, we pre-train a 3D encoder to learn a unified representation of semantics and geometry via volumetric rendering.
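The feature distillation step can be illustrated with a minimal NumPy sketch: per-sample densities and feature vectors along a camera ray are alpha-composited NeRF-style into a rendered 2D feature, which is then matched against a foundation-model feature at that pixel. Function names, shapes, and the MSE distillation loss here are illustrative assumptions, not DNAct's actual interface.

```python
import numpy as np

def render_features_along_ray(sigmas, feats, deltas):
    """Alpha-composite per-sample features along one ray (NeRF-style).

    sigmas: (S,) densities; feats: (S, D) per-sample feature vectors;
    deltas: (S,) spacing between samples. Illustrative shapes only.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                           # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))    # transmittance
    weights = alphas * trans                                          # compositing weights
    return weights @ feats                                            # (D,) rendered feature

S, D = 64, 8
rng = np.random.default_rng(0)
sigmas = rng.uniform(0.0, 2.0, S)
feats = rng.normal(size=(S, D))
deltas = np.full(S, 0.05)

rendered = render_features_along_ray(sigmas, feats, deltas)
target = rng.normal(size=D)              # stand-in for a 2D foundation-model feature
distill_loss = np.mean((rendered - target) ** 2)
```

Supervising the rendered feature with this loss (alongside standard RGB rendering losses) is what pushes the 3D encoder to carry both semantics and geometry.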
Our 3D pre-training approach brings out-of-domain generalization ability! We show this by using only out-of-distribution data from five unseen tasks in the pre-training stage, denoted as DNAct*. It outperforms baselines that rely on in-domain data by over 20%.
Another key insight is formulating representation learning as an action reconstruction problem with a diffusion model. By adding a diffusion objective, we optimize the learned representation to distinguish different modes in the demonstration data, which improves robustness.