Union-of-Experts: A Novel Approach to Mixture-of-Experts for Efficient AI Models

Expert Association: Hierarchical Routing for Equivalently Decomposed Transformers

The Mixture-of-Experts (MoE) concept has proven to be a promising approach for improving the performance of AI models while keeping computational costs manageable. MoE models train specialized "experts" for different parts of the input data, increasing model capacity without a proportional increase in compute. A new research article now introduces an approach called Union-of-Experts (UoE), which aims to address the weaknesses of existing MoE models and improve the interaction between experts.

A major problem with conventional MoE models is that the experts operate in isolation, with limited interaction and knowledge exchange between them. Furthermore, MoE architectures have not been effectively extended to attention blocks, which makes further efficiency gains difficult. UoE takes a different approach: it decomposes the transformer into an equivalent group of experts and applies dynamic routing to both the input data and the experts.
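
To make the idea of an equivalent group of experts concrete, here is a minimal sketch assuming a standard two-layer GELU MLP that is split into experts by slicing its weight matrices, in the style of tensor parallelism. The sizes, the activation, and the slicing scheme are illustrative assumptions rather than the paper's exact construction; the point is that with all experts active, the summed expert outputs reproduce the original layer exactly.

```python
# Illustrative sketch (not the authors' code): an MLP decomposed into experts
# by slicing W1 column-wise and W2 row-wise; sizes and GELU are assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_hidden, E = 16, 64, 4              # illustrative dimensions, 4 experts
W1, b1 = torch.randn(d_model, d_hidden), torch.randn(d_hidden)
W2, b2 = torch.randn(d_hidden, d_model), torch.randn(d_model)
x = torch.randn(8, d_model)                   # 8 tokens

# Original MLP: y = GELU(x W1 + b1) W2 + b2
y_full = F.gelu(x @ W1 + b1) @ W2 + b2

# Expert i owns one slice of the hidden dimension
chunk = d_hidden // E
y_experts = sum(
    F.gelu(x @ W1[:, i * chunk:(i + 1) * chunk] + b1[i * chunk:(i + 1) * chunk])
    @ W2[i * chunk:(i + 1) * chunk, :]
    for i in range(E)
) + b2                                        # output bias is added once

print(torch.allclose(y_full, y_experts, atol=1e-5))  # True: the decomposition is exact
```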

Core Innovations of Union-of-Experts

UoE is characterized by four key innovations. First, equivalent expert decomposition is performed for both MLP blocks and attention blocks, based on matrix partitioning as used in tensor parallelism. Second, two routing paradigms, patch-wise data selection and expert selection, apply routing at different levels; a rough illustration follows below. Third, the UoE architecture comprises Selective Multi-Head Attention (SMHA) and a Union-of-MLP-Experts (UoME). Fourth, the routing and computation operations of UoE have a parallel implementation whose efficiency was optimized based on an analysis of hardware processing.
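
As a rough illustration of the two routing levels, the sketch below scores token-expert affinities with a single linear router: expert selection keeps the top-k experts per token, and patch-wise data selection keeps the top-p tokens (patches) per expert. The router, the scores, and the selection sizes are assumptions for illustration, not the routing functions described in the paper.

```python
# Two-level routing sketch (illustrative assumptions, not the paper's routers).
import torch

torch.manual_seed(0)
T, d_model, E = 12, 16, 4        # tokens (patches), model width, experts
k, p = 2, 6                      # experts kept per token, tokens kept per expert
x = torch.randn(T, d_model)
router = torch.randn(d_model, E)  # hypothetical linear router

affinity = x @ router                            # (T, E) token-expert scores
expert_choice = affinity.topk(k, dim=1).indices  # expert selection: top-k experts per token
patch_choice = affinity.topk(p, dim=0).indices   # patch-wise data selection: top-p tokens per expert

print("experts chosen for token 0:", expert_choice[0].tolist())
print("tokens routed to expert 0:", patch_choice[:, 0].tolist())
```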

The SMHA component shares some similarities with NSA (DeepSeek) and MoBA (Moonshot.AI), but was developed independently over the course of a year. UoME, in contrast, is a novel architecture that not only adopts the multi-expert and selective-routing paradigms of existing MoE models but also lets the activated experts function as a cohesive whole, much like a single MLP of the same scale.
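
The "cohesive whole" property can be illustrated by continuing the decomposition sketch above: when only k of the experts are activated, their summed contributions form exactly one smaller MLP built from the selected hidden slices, rather than a weighted mixture of independent expert outputs. The router and the sizes below are, again, illustrative assumptions.

```python
# Sketch of the "union" view (illustrative, not the authors' implementation).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_hidden, E, k = 16, 64, 4, 2
chunk = d_hidden // E
W1, b1 = torch.randn(d_model, d_hidden), torch.randn(d_hidden)
W2, b2 = torch.randn(d_hidden, d_model), torch.randn(d_model)
x = torch.randn(8, d_model)

# Hypothetical router: pick the top-k experts from the mean token representation
topk = (x.mean(dim=0) @ torch.randn(d_model, E)).topk(k).indices.tolist()

# View 1: the union of the selected experts is one smaller MLP over their hidden slices
cols = torch.cat([torch.arange(i * chunk, (i + 1) * chunk) for i in topk])
y_union = F.gelu(x @ W1[:, cols] + b1[cols]) @ W2[cols, :] + b2

# View 2: the sum of the selected experts' individual contributions
y_sum = sum(
    F.gelu(x @ W1[:, i * chunk:(i + 1) * chunk] + b1[i * chunk:(i + 1) * chunk])
    @ W2[i * chunk:(i + 1) * chunk, :]
    for i in topk
) + b2

print(torch.allclose(y_union, y_sum, atol=1e-5))  # True: activated experts act as one MLP
```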

Performance and Potential

Initial experiments show promising results. The UoE model outperforms full-attention models, state-of-the-art MoEs, and efficient transformers, including the recently introduced DeepSeek-V3 architecture, across several image and language processing tasks. Applying the concepts of equivalent decomposition and routing to the full transformer offers clear advantages: the equivalent decomposition makes more efficient use of model capacity, and dynamic routing ensures that the most relevant experts are activated for a given input.

The development of UoE represents an important step in the advancement of MoE models. By improving the interaction between experts and extending the MoE idea to attention blocks, UoE opens up new possibilities for scaling and increasing the efficiency of AI models. Further research and development in this area could lead to additional breakthroughs. In particular, the combination of equivalent decomposition, dynamic routing, and integration into both MLP and attention blocks sets UoE apart from existing approaches.

Bibliography:
- https://www.arxiv.org/abs/2503.02495
- http://paperreading.club/page?id=289074