Huawei Unveils Pangu Ultra: A 135 Billion Parameter Dense Language Model

Pangu Ultra: A New Milestone in the Field of Large Language Models
The development of large language models (LLMs) is progressing rapidly. With Pangu Ultra, Huawei presents a 135-billion-parameter model with a dense transformer architecture, trained entirely on Ascend Neural Processing Units (NPUs). The model is remarkable not only for its sheer size but also for the optimization and system-stability challenges that training at this scale entails.
Challenges and Solutions in Training
Training a language model of this size confronts developers with immense challenges. Instabilities during training, which manifest for example as sudden loss spikes, can significantly impair the model's performance. To address this, the developers of Pangu Ultra introduced a technique called depth-scaled sandwich normalization, which substantially stabilizes the training process and makes efficient training of models at this scale possible.
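The core idea of sandwich normalization is to normalize both the input and the output of each sublayer inside the residual branch; the depth scaling then shrinks the initial gain of the post-sublayer norm so that, early in training, each layer perturbs the residual stream only slightly. The sketch below illustrates this pattern in PyTorch; the 1/sqrt(2L) initialization is a placeholder assumption, as the paper's exact gain schedule is not reproduced here.

```python
import math
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm with a configurable initial gain."""
    def __init__(self, dim: int, init_gain: float = 1.0, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.full((dim,), float(init_gain)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

class SandwichSublayer(nn.Module):
    """One sublayer (attention or FFN) wrapped in sandwich normalization:
    norm -> sublayer -> norm, all inside the residual branch. The post-norm
    gain starts small, scaled by total depth, so each layer's initial
    contribution to the residual stream is damped."""
    def __init__(self, dim: int, sublayer: nn.Module, num_layers: int):
        super().__init__()
        self.pre_norm = RMSNorm(dim)
        # Placeholder depth-scaled initialization (~1/sqrt(2L)); the
        # paper's exact schedule for this gain may differ.
        self.post_norm = RMSNorm(dim, init_gain=1.0 / math.sqrt(2 * num_layers))
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))

# Usage: wrap a feed-forward sublayer (depth chosen arbitrarily here).
ffn = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
block = SandwichSublayer(1024, ffn, num_layers=64)
out = block(torch.randn(2, 16, 1024))  # output shape matches input
```

Compared with plain pre-norm, the extra normalization after each sublayer bounds the magnitude of what is added back to the residual stream, which is one plausible mechanism for suppressing loss spikes in very deep networks.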
The Training Process of Pangu Ultra
Pangu Ultra was trained on 13.2 trillion tokens. This enormous corpus of diverse, high-quality text forms the basis for the model's capabilities. To make training efficient, 8,192 Ascend NPUs were used in parallel, and alongside depth-scaled sandwich normalization, further system-level optimizations were implemented to fully utilize the available compute.
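For a sense of scale, the common 6·N·D rule of thumb (roughly six FLOPs per parameter per training token for a dense transformer) puts the pre-training compute in the ballpark of 10^25 FLOPs:

```python
# Back-of-envelope pre-training compute via the common 6*N*D approximation
# (~6 FLOPs per parameter per training token for a dense transformer).
params = 135e9    # 135 billion parameters
tokens = 13.2e12  # 13.2 trillion training tokens
total_flops = 6 * params * tokens
print(f"~{total_flops:.2e} FLOPs")  # ~1.07e+25 FLOPs
```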
After pre-training, Pangu Ultra underwent an additional stage, post-training, aimed specifically at improving the model's logical reasoning abilities. This phase further enhanced performance across a range of tasks.
Performance Comparison and Results
Pangu Ultra was evaluated on various benchmarks and compared with other leading LLMs. The results show that Pangu Ultra clearly outperforms dense LLMs such as Llama 405B and Mistral Large 2. Notably, it also competes with DeepSeek-R1, a considerably larger sparse Mixture-of-Experts model. These results underscore the efficiency of Pangu Ultra's dense design and the capability of the Ascend NPUs.
Availability and Outlook
Pangu Ultra and the associated system architecture are available to Huawei's commercial customers. This powerful LLM opens up new possibilities for a wide range of applications, from text generation and processing to complex reasoning tasks. The development of Pangu Ultra demonstrates the potential of Ascend NPUs for training extremely large language models and underscores the importance of innovative optimization strategies for the further advancement of AI.