Diffusion policies have emerged as a powerful framework for robotic visuomotor control, yet they often lack the robustness to recover from subtask failures in long-horizon, multi-stage tasks, and their learned observation representations are difficult to interpret. In this work, we propose the Mixture of Experts-Enhanced Diffusion Policy (MoE-DP), whose core idea is to insert a Mixture of Experts (MoE) layer between the visual encoder and the diffusion model. This layer decomposes the policy's knowledge into a set of specialized experts, which are dynamically activated to handle different phases of a task. Extensive experiments demonstrate that MoE-DP exhibits a strong capability to recover from disturbances, significantly outperforming standard baselines in robustness: on a suite of 6 long-horizon simulation tasks, it achieves a 36% average relative improvement in success rate under disturbed conditions. This enhanced robustness is further validated in the real world, where MoE-DP also shows significant performance gains. We further show that MoE-DP learns an interpretable skill decomposition in which distinct experts correspond to semantic task primitives (e.g., approaching, grasping). This learned structure can be leveraged for inference-time control, allowing subtasks to be rearranged without any re-training.
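To make this idea concrete, a minimal sketch of such an MoE layer is given below in PyTorch. The hard top-1 routing, the two-layer expert MLPs, and all names (`MoELayer`, `feat_dim`, `num_experts`) are illustrative assumptions, not the exact architecture used in MoE-DP.

```python
import torch
import torch.nn as nn


class MoELayer(nn.Module):
    """Routes each encoded observation to one specialized expert MLP (sketch)."""

    def __init__(self, feat_dim: int, num_experts: int, hidden_dim: int = 256):
        super().__init__()
        # The router scores every expert from the encoded observation.
        self.router = nn.Linear(feat_dim, num_experts)
        # Each expert is a small MLP operating on the same feature space (assumed).
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(feat_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, feat_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, obs_feat: torch.Tensor):
        # obs_feat: (B, feat_dim) feature from the visual encoder and robot state.
        logits = self.router(obs_feat)          # (B, num_experts)
        expert_idx = logits.argmax(dim=-1)      # hard top-1 routing (assumed)
        cond = torch.stack(
            [self.experts[i](x) for x, i in zip(obs_feat, expert_idx)]
        )
        return cond, expert_idx                 # conditioning vector and chosen expert
```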
In its autonomous mode, the system encodes the observation inputs (images and robot state) into a feature vector, which is then fed to an MoE layer. The MoE’s router automatically selects the appropriate expert for the current observation, and the output of the selected expert then serves as the conditioning input for the Diffusion Policy during action generation. While the router typically operates autonomously, the architecture also supports high-level control: an external agent, such as a human operator or a Vision-Language Model (VLM), can guide the policy by overriding the router’s default selection. This capability enables flexible behaviors, such as reordering subtasks to generalize to novel sequences not seen during training.
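A single control step of this pipeline could look as follows; `encoder`, `diffusion_policy.sample`, and the `expert_override` argument are hypothetical interfaces, and the function reuses the `MoELayer` sketch above only to illustrate how an external agent can bypass the router.

```python
import torch


def moe_dp_step(encoder, moe_layer, diffusion_policy, images, robot_state,
                expert_override=None):
    """One control step: encode -> route (or override) -> generate actions."""
    # Fuse camera images and proprioception into a single observation feature.
    obs_feat = encoder(images, robot_state)        # (B, feat_dim)
    # Autonomous mode: the router picks the expert for the current observation.
    cond, expert_idx = moe_layer(obs_feat)
    if expert_override is not None:
        # High-level control: a human operator or VLM forces a specific expert,
        # e.g. to re-order subtasks at inference time.
        cond = moe_layer.experts[expert_override](obs_feat)
        expert_idx = torch.full_like(expert_idx, expert_override)
    # The selected expert's output conditions the diffusion-based action sampler.
    return diffusion_policy.sample(cond)           # (B, horizon, action_dim)
```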
Our system leverages a VLM for high-level task planning in two stages. First, in the skill summarization stage (①), the VLM builds a textual knowledge base of the robot’s capabilities by analyzing annotated frames from a demonstration that follows the same execution sequence as the training data. Second, in the task execution stage (②), the VLM uses this knowledge base, a high-level goal, and a real-time image to reason about the current task stage and predict the appropriate expert to activate. This hierarchical architecture enables the system to dynamically plan and rearrange the order of subtasks without any re-training, translating abstract goals into concrete robotic actions.
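Under simplified assumptions, the two-stage planner reduces to two calls against a generic VLM interface; `vlm.query`, the function names, and the prompt wording below are placeholders, not the system's actual API or prompts.

```python
def summarize_skills(vlm, annotated_frames):
    """Stage ①: build a textual knowledge base mapping expert indices to sub-skills.

    `annotated_frames` is assumed to be a list of (frame, expert_index) pairs
    taken from one demonstration that follows the training execution sequence.
    """
    frames = [frame for frame, _ in annotated_frames]
    labels = ", ".join(
        f"frame {i}: expert {idx}" for i, (_, idx) in enumerate(annotated_frames)
    )
    prompt = (
        "These frames come from a single robot demonstration. "
        f"Each frame is annotated with the active expert ({labels}). "
        "For each expert index, describe the sub-skill it performs."
    )
    return vlm.query(prompt, images=frames)   # e.g. "expert 0: approach, expert 1: grasp, ..."


def select_expert(vlm, skill_knowledge, goal, current_image):
    """Stage ②: pick the expert to activate for the current observation."""
    prompt = (
        f"Robot skills:\n{skill_knowledge}\n"
        f"High-level goal: {goal}\n"
        "Given the current camera image, which expert index should be active now? "
        "Answer with a single integer."
    )
    return int(vlm.query(prompt, images=[current_image]).strip())
```

In deployment, the returned index would simply be passed as the `expert_override` argument of the control step sketched earlier.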