MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

arXiv:2504.02263
Authors: Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, Xin Liu
Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. However, its sparsely activated architecture shifts feed-forward networks (FFNs) from being compute-intensive to memory-intensive during inference, leading to substantially lower GPU utilization and increased operational costs. We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models. MegaScale-Infer disaggregates attention and FFN modules within each model layer, enabling independent scaling, tailored parallelism strategies, and heterogeneous deployment for both modules. To fully exploit disaggregation in the presence of MoE’s sparsity, MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and shuttles them between attention and FFNs for inference. Combined with distinct model parallelism for each module, MegaScale-Infer effectively hides communication overhead and maximizes GPU utilization. To adapt to disaggregated attention and FFN modules and minimize data transmission overhead (e.g., token dispatch), MegaScale-Infer provides a high-performance M2N communication library that eliminates unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization. Experimental results indicate that MegaScale-Infer achieves up to 1.90x higher per-GPU throughput than state-of-the-art solutions.
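To build intuition for why ping-pong pipeline parallelism hides communication, the toy cost model below simulates one decode step. It is a rough sketch, not the paper's implementation: the per-micro-batch timings, the function names, and the assumption of a single fixed dispatch/gather cost per hop are all illustrative.

```python
def pingpong_makespan(num_microbatches: int, t_attn: float,
                      t_comm: float, t_ffn: float) -> float:
    """Estimate a decode step's makespan (ms) when micro-batches are
    shuttled ("ping-ponged") between disaggregated attention and FFN
    nodes, so each module works on one micro-batch while the other is
    in flight or being processed elsewhere.

    All timings are hypothetical per-micro-batch costs.
    """
    attn_free = 0.0  # time when the attention nodes next become idle
    ffn_free = 0.0   # time when the FFN (expert) nodes next become idle
    done = 0.0
    for _ in range(num_microbatches):
        attn_done = attn_free + t_attn        # attention on this micro-batch
        attn_free = attn_done                 # attention can start the next one
        arrive = attn_done + t_comm           # M2N dispatch to expert nodes
        ffn_done = max(ffn_free, arrive) + t_ffn
        ffn_free = ffn_done
        done = ffn_done + t_comm              # gather results back to attention
    return done


def sequential_makespan(num_microbatches: int, t_attn: float,
                        t_comm: float, t_ffn: float) -> float:
    """Baseline with no overlap: each micro-batch finishes the whole
    attention -> dispatch -> FFN -> gather round trip before the next
    micro-batch starts."""
    return num_microbatches * (t_attn + t_comm + t_ffn + t_comm)


if __name__ == "__main__":
    # Illustrative numbers only: 1 ms attention, 1 ms FFN, 0.2 ms per hop.
    args = (2, 1.0, 0.2, 1.0)
    print(f"ping-pong:  {pingpong_makespan(*args):.1f} ms")
    print(f"sequential: {sequential_makespan(*args):.1f} ms")
```

With two micro-batches in flight, the second micro-batch's attention computation overlaps the first one's dispatch and FFN work, so the pipelined makespan is well below the sequential baseline; real schedules must additionally balance attention and FFN durations, which MegaScale-Infer addresses with per-module parallelism and heterogeneous deployment.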