MoE-DP: An MoE-Enhanced Diffusion Policy for Robust Long-Horizon Robotic Manipulation with Skill Decomposition and Failure Recovery

Abstract

MoE-DP Framework

Diffusion policies have emerged as a powerful framework for robotic visuomotor control, yet they often lack the robustness to recover from subtask failures in long-horizon, multi-stage tasks, and their learned representations of observations are often difficult to interpret. In this work, we propose the Mixture of Experts-Enhanced Diffusion Policy (MoE-DP), whose core idea is to insert a Mixture of Experts (MoE) layer between the visual encoder and the diffusion model. This layer decomposes the policy's knowledge into a set of specialized experts, which are dynamically activated to handle different phases of a task. We demonstrate through extensive experiments that MoE-DP exhibits a strong capability to recover from disturbances, significantly outperforming standard baselines in robustness. On a suite of six long-horizon simulation tasks, this leads to a 36% average relative improvement in success rate under disturbed conditions. This enhanced robustness is further validated in the real world, where MoE-DP also shows significant performance gains. We further show that MoE-DP learns an interpretable skill decomposition, in which distinct experts correspond to semantic task primitives (e.g., approaching, grasping). This learned structure can be leveraged for inference-time control, allowing subtasks to be rearranged without any re-training.

Real-world tasks under disturbed conditions

DP (Baseline) — Duck place drawer close

MoE-DP (Ours) — Duck place drawer close

DP (Baseline) — Duck place bowl transport

MoE-DP (Ours) — Duck place bowl transport

DP (Baseline) — Pick two cubes

MoE-DP (Ours) — Pick two cubes

Simulation tasks under disturbed conditions

DP (Baseline) — Coffee Preparation

MoE-DP (Ours) — Coffee Preparation

DP (Baseline) — Kitchen Cleanup

MoE-DP (Ours) — Kitchen Cleanup

DP (Baseline) — Hammer Cleanup

MoE-DP (Ours) — Hammer Cleanup

DP (Baseline) — Table Cleanup

MoE-DP (Ours) — Table Cleanup

DP (Baseline) — Kitchen

MoE-DP (Ours) — Kitchen

DP (Baseline) — Mug Cleanup

MoE-DP (Ours) — Mug Cleanup

MoE-DP can rearrange the order of subtasks under VLM guidance

Normal order — Duck drawer

VLM-guided (rearranged order) — Duck drawer

Normal order — Duck strawberry

VLM-guided (rearranged order) — Duck strawberry

Normal order — Pick two cubes

VLM-guided (rearranged order) — Pick two cubes

Method

Overview of MoE-DP with high-level guidance

In its autonomous mode, the system encodes the observation inputs (images and robot state) into a feature vector, which is then fed to an MoE layer. The MoE’s router automatically selects the appropriate expert for the current observation, and the output of the selected expert serves as a conditioning input for the Diffusion Policy during action generation. While the router typically operates autonomously, the architecture supports high-level control: an external agent, such as a human operator or a Vision-Language Model (VLM), can guide the policy by overriding the router’s default selection. This capability enables flexible behaviors, such as reordering subtasks to generalize to novel sequences not seen during training.
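To make this data flow concrete, below is a minimal PyTorch-style sketch of the conditioning path, assuming a pre-computed observation feature vector; the module names, feature dimension, number of experts, and the override_expert argument are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the MoE conditioning path in MoE-DP (illustrative, not the released code).
import torch
import torch.nn as nn

class MoEConditioner(nn.Module):
    def __init__(self, feat_dim=256, num_experts=8):
        super().__init__()
        self.router = nn.Linear(feat_dim, num_experts)   # scores one expert per observation
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU(),
                           nn.Linear(feat_dim, feat_dim)) for _ in range(num_experts)]
        )

    def forward(self, obs_feat, override_expert=None):
        # Autonomous mode: the router picks the expert for the current observation.
        # Guided mode: a human or VLM overrides the router's choice (e.g., to reorder subtasks).
        if override_expert is None:
            expert_idx = self.router(obs_feat).argmax(dim=-1)                 # (B,)
        else:
            expert_idx = torch.full((obs_feat.shape[0],), override_expert, dtype=torch.long)
        cond = torch.stack([self.experts[i](f) for i, f in zip(expert_idx.tolist(), obs_feat)])
        return cond, expert_idx   # `cond` conditions the diffusion model's denoising network

# Usage: obs_feat would come from the visual/state encoder.
obs_feat = torch.randn(4, 256)                       # placeholder encoder features
moe = MoEConditioner()
cond, chosen = moe(obs_feat)                         # router-selected experts
cond_guided, _ = moe(obs_feat, override_expert=2)    # expert forced by a human or VLM
```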

Overview of the VLM-based planning and control framework

Our system leverages a VLM for high-level task planning in two stages. First, in the skill summarization stage (①), the VLM builds a textual knowledge base of the robot’s capabilities by analyzing annotated frames from a demonstration that follows the same execution sequence as the training data. Second, in the task execution stage (②), the VLM uses this knowledge base, the high-level goal, and a real-time image to reason about the current task stage and predict the appropriate expert to activate. This hierarchical architecture enables the system to dynamically plan and rearrange the order of subtasks without any re-training, translating abstract goals into concrete robotic actions.
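As a rough illustration of this two-stage loop, the sketch below assumes a generic query_vlm(image, prompt) helper plus camera, robot, and policy objects exposing the expert override described above; all of these names and prompts are hypothetical placeholders rather than the actual VLM interface.

```python
# Illustrative sketch of the two-stage VLM guidance loop (hypothetical interfaces and prompts).

def summarize_skills(annotated_frames, query_vlm):
    """Stage ① (skill summarization): map each expert index to a textual skill description."""
    skill_db = {}
    for expert_idx, frame in annotated_frames:   # frames follow the same order as the training demo
        skill_db[expert_idx] = query_vlm(
            image=frame,
            prompt="In one sentence, describe the manipulation skill the robot is performing.",
        )
    return skill_db

def run_guided_episode(goal, skill_db, policy, camera, robot, query_vlm, max_steps=500):
    """Stage ② (task execution): the VLM picks the expert at each step; the policy acts under it."""
    for _ in range(max_steps):
        image = camera.capture()
        expert_idx = int(query_vlm(
            image=image,
            prompt=(f"Goal: {goal}\nKnown skills: {skill_db}\n"
                    "Which expert index should be active now? Answer with a single integer."),
        ))
        action = policy.act(image, override_expert=expert_idx)  # bypass the router with the VLM's choice
        robot.apply(action)
```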