HybriMoE: Optimizing Mixture-of-Experts Inference with Hybrid CPU-GPU Resource Management

Efficient MoE Inference: HybriMoE Optimally Utilizes CPU and GPU

Mixture-of-Experts (MoE) models have shown great promise because they increase model capacity without a proportional increase in computational cost. However, their sheer size places heavy demands on memory, so on resource-constrained platforms experts often must be offloaded, which incurs significant performance penalties. Hybrid CPU-GPU inference has been proposed to exploit CPU compute and thereby reduce the cost of loading experts onto the GPU. This approach faces two major challenges, however: first, expert activation patterns in MoE models are highly unstable, which makes the fixed allocation strategies of prior work inefficient; second, hybrid CPU-GPU scheduling for MoE is inherently complex because of varying expert sizes and structures and uneven workload distribution.

To address these challenges, the authors developed HybriMoE, a hybrid CPU-GPU inference framework that improves resource utilization through novel CPU-GPU scheduling and cache management. HybriMoE introduces three key innovations: a dynamic intra-layer scheduling strategy that balances workloads between CPU and GPU, an impact-oriented inter-layer prefetching algorithm, and a score-based caching algorithm that mitigates the instability of expert activations.

Dynamic Resource Allocation and Predictive Caching

HybriMoE's dynamic intra-layer scheduling distributes the computational load flexibly between CPU and GPU based on the model's current demands. Instead of statically assigning experts to specific hardware, HybriMoE estimates each activated expert's computation time and transfer cost and dynamically decides whether executing it on the CPU or the GPU is more efficient. This keeps both devices busy and minimizes latency.
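
The paper's scheduler is more sophisticated, but the underlying decision can be sketched with a simple greedy cost model. The following is a minimal illustration, not HybriMoE's actual code: the Expert fields, the throughput constants, and the function names are hypothetical, and a real system would profile these costs on the target hardware.

```python
from dataclasses import dataclass

@dataclass
class Expert:
    layer: int
    index: int
    on_gpu: bool          # weights currently resident in GPU memory?
    weight_bytes: int     # size of the expert's weights
    num_tokens: int       # tokens routed to this expert in this step

# Hypothetical throughput constants; in practice these would be
# profiled offline on the target CPU, GPU, and interconnect.
CPU_TOKENS_PER_S = 2_000.0
GPU_TOKENS_PER_S = 50_000.0
PCIE_BYTES_PER_S = 16e9

def estimated_cost(expert: Expert, device: str) -> float:
    """Rough latency model: compute time plus, for the GPU, the cost of
    copying the expert's weights over PCIe if they are not cached."""
    if device == "cpu":
        return expert.num_tokens / CPU_TOKENS_PER_S
    transfer = 0.0 if expert.on_gpu else expert.weight_bytes / PCIE_BYTES_PER_S
    return transfer + expert.num_tokens / GPU_TOKENS_PER_S

def schedule_layer(experts: list[Expert]) -> dict[str, list[Expert]]:
    """Greedy intra-layer schedule: assign each activated expert to the
    device that would finish it earliest, tracking per-device busy time
    so the CPU and GPU queues stay balanced."""
    busy = {"cpu": 0.0, "gpu": 0.0}
    plan: dict[str, list[Expert]] = {"cpu": [], "gpu": []}
    # Place the most expensive experts first so large tasks do not end
    # up serialized behind small ones.
    for e in sorted(experts, key=lambda e: e.num_tokens, reverse=True):
        finish = {d: busy[d] + estimated_cost(e, d) for d in ("cpu", "gpu")}
        device = min(finish, key=finish.get)
        plan[device].append(e)
        busy[device] = finish[device]
    return plan
```

In such a scheme, an expert with many routed tokens tends to justify the PCIe transfer to the GPU, while a cold expert handling only a few tokens is cheaper to run in place on the CPU, which is the trade-off the prose above describes.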

The impact-oriented inter-layer prefetching algorithm anticipates which experts upcoming layers will need and proactively loads them into GPU memory. By staging the weights ahead of time, transfer latency is hidden and inference is accelerated. The score-based caching algorithm keeps the experts most likely to be reused resident in the cache while evicting less valuable ones. Together, this cache management compensates for the instability of expert activations and improves end-to-end performance; a sketch of the idea follows.
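
The following is a minimal sketch of what a score-based expert cache might look like. The class name, the decay rule, and the scoring scheme are illustrative assumptions, not HybriMoE's published policy.

```python
class ScoreBasedExpertCache:
    """Toy score-based expert cache. Every cached expert carries a
    score that decays each step and is bumped when the expert is
    activated (or hinted by a prefetcher), so eviction targets experts
    that have been cold for a while rather than strict LRU order."""

    def __init__(self, capacity: int, decay: float = 0.9):
        self.capacity = capacity
        self.decay = decay
        self.scores: dict[tuple[int, int], float] = {}  # (layer, expert) -> score

    def step(self) -> None:
        # Age every entry once per decode step.
        for key in self.scores:
            self.scores[key] *= self.decay

    def touch(self, layer: int, expert: int, weight: float = 1.0) -> None:
        """Record an activation, or a prefetch hint where `weight` is
        e.g. the router's gating score for the expert."""
        key = (layer, expert)
        if key not in self.scores and len(self.scores) >= self.capacity:
            self.evict()
        self.scores[key] = self.scores.get(key, 0.0) + weight

    def evict(self) -> tuple[int, int]:
        # Drop the lowest-scoring expert; the caller would free its
        # GPU buffer and mark the weights as offloaded.
        victim = min(self.scores, key=self.scores.get)
        del self.scores[victim]
        return victim
```

In this sketch, a prefetcher would simply call touch with the predicted gating weights of upcoming layers' experts, so experts that are likely to be needed soon accumulate score, and the resulting transfers overlap with current-layer computation.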

Significant Performance Gains in Practice

HybriMoE was implemented on top of the kTransformers framework and evaluated on three widely used MoE-based large language models (LLMs). The results show an average speedup of 1.33x in the prefill stage and 1.70x in the decode stage compared with existing hybrid MoE inference frameworks. These gains highlight HybriMoE's potential to significantly improve the efficiency of MoE models and to enable their deployment on resource-constrained platforms.

HybriMoE represents a notable advance in MoE inference. By combining dynamic resource scheduling with intelligent cache management, it addresses the core challenges of hybrid CPU-GPU inference and delivers substantial performance gains, opening up new possibilities for deploying MoE models across a range of applications.

Sources:
- http://www.arxiv.org/abs/2504.05897
- https://mengli.me/publication/dac-2025-hybrimoe/
- https://mengli.me/news/dac-2025-accepted/
- https://arxiv.org/list/cs.DC/recent
- https://openreview.net/forum?id=N5fVv6PZGz
- https://getianao.github.io/papers/raise22ccgrid.pdf
- https://www.researchgate.net/publication/362119407_RAISE_Efficient_GPU_Resource_Management_via_Hybrid_Scheduling