SPAM Optimizer Improves Stability and Efficiency of Large Language Model Training
Large language models (LLMs) have made impressive progress in recent years, enabling applications across a wide variety of fields. However, training these models remains resource-intensive and prone to instability. A major cause is gradient and loss spikes: sudden, extreme jumps in gradient magnitude or loss that disrupt the learning process and often force costly measures such as restoring checkpoints or restarting experiments.
Recent research investigates the causes and effects of these spikes and introduces a new optimizer, Spike-Aware Adam with Momentum Reset (SPAM), designed to improve the stability and efficiency of LLM training.
Gradient Spikes: A Widespread Problem
The study analyzes gradient spikes during LLM training and shows that they occur across a range of architectures and datasets. These spikes can be up to 1,000 times larger than typical gradients and significantly impair model performance, destabilizing training and slowing progress.
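One way to make this concrete is to monitor how large each step's gradient is relative to its recent history. The sketch below is an illustrative detection heuristic rather than the paper's measurement procedure; the threshold `factor` and smoothing coefficient `beta` are assumed values.

```python
import torch

def detect_gradient_spike(grad, running_mean_sq, factor=50.0, beta=0.99):
    """Flag a gradient whose magnitude far exceeds its recent running average.

    `factor` and `beta` are illustrative assumptions; the paper characterizes
    spikes relative to typical gradient magnitudes, not with this exact rule.
    """
    gnorm_sq = grad.pow(2).mean().item()  # current mean squared gradient entry
    is_spike = running_mean_sq > 0 and gnorm_sq > factor * running_mean_sq
    # Update the running estimate so future steps are compared against it.
    running_mean_sq = beta * running_mean_sq + (1 - beta) * gnorm_sq
    return is_spike, running_mean_sq
```

Logging such a flag per parameter group during training makes it easy to see how frequently spikes occur and which layers they hit.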
SPAM: A New Approach to Optimization
SPAM was developed to address this problem. The optimizer relies on two main mechanisms: momentum reset and spike-aware gradient clipping. Periodically resetting the momentum prevents a spike from lingering in the optimizer's accumulated state, while spike-aware gradient clipping limits the magnitude of extreme gradient entries before they enter the update.
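The following is a minimal sketch of how these two mechanisms could be combined in an Adam-style update, written against PyTorch tensors. It is not the authors' reference implementation; the spike threshold `theta` and the interval `reset_interval` are illustrative assumptions rather than the paper's tuned defaults.

```python
import torch

def spam_like_step(param, grad, exp_avg, exp_avg_sq, step,
                   lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                   theta=5000.0, reset_interval=500):
    """One Adam-style update with momentum reset and spike-aware clipping.

    `theta` and `reset_interval` are placeholder values for illustration.
    """
    beta1, beta2 = betas

    # Momentum reset: periodically discard the accumulated moments so that a
    # past spike cannot keep distorting future updates via the optimizer state.
    if step % reset_interval == 1:
        exp_avg.zero_()
        exp_avg_sq.zero_()

    # Spike-aware clipping: scale back gradient entries whose squared value
    # greatly exceeds the running second-moment estimate.
    spike_mask = (exp_avg_sq > 0) & (grad.pow(2) > theta * exp_avg_sq)
    clipped = torch.sign(grad) * (theta * exp_avg_sq).sqrt()
    grad = torch.where(spike_mask, clipped, grad)

    # Standard Adam moment updates and bias-corrected parameter step.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt().add_(eps)
    param.data.addcdiv_(exp_avg / (1 - beta1 ** step), denom, value=-lr)
```

A full optimizer would wrap this per-tensor arithmetic in a `torch.optim.Optimizer` subclass that tracks `step`, `exp_avg`, and `exp_avg_sq` for every parameter.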
Increased Efficiency Through Sparse Momentum
Another advantage of SPAM is its support for sparse momentum: only a subset of the momentum terms is stored and updated, which reduces memory requirements and makes training more efficient. This is particularly relevant for large models, whose memory demands otherwise restrict training to powerful hardware.
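As a rough illustration of the idea, optimizer state can be kept only for a selected subset of each parameter's entries. The class below is a sketch under the assumption of random subset selection; the `density` value and all names are placeholders, not the paper's actual selection rule.

```python
import torch

class SparseMomentumState:
    """Adam-style moments stored only for a subset of a parameter's entries.

    The `density` value and the random index selection are illustrative
    assumptions standing in for the optimizer's actual selection rule.
    """

    def __init__(self, param, density=0.05):
        numel = param.numel()
        k = max(1, int(density * numel))
        self.idx = torch.randperm(numel)[:k]   # entries whose moments we track
        self.exp_avg = torch.zeros(k)
        self.exp_avg_sq = torch.zeros(k)

    def update(self, grad, beta1=0.9, beta2=0.999):
        # Only the tracked slice of the gradient touches the stored moments.
        g = grad.reshape(-1)[self.idx]
        self.exp_avg.mul_(beta1).add_(g, alpha=1 - beta1)
        self.exp_avg_sq.mul_(beta2).addcmul_(g, g, value=1 - beta2)
        return self.idx, self.exp_avg, self.exp_avg_sq


# Optimizer-state memory now scales with `density` rather than with the full
# parameter count (here: 5% of a 1024x1024 weight matrix).
weight = torch.randn(1024, 1024)
state = SparseMomentumState(weight, density=0.05)
```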
Experimental Results
In a range of experiments covering both pre-training and fine-tuning of LLMs, SPAM delivered compelling results. In pre-training of models with 60 million to 1 billion parameters, as well as in 4-bit LLM pre-training, SPAM outperformed Adam and its variants, and it also proved effective in reinforcement learning and time-series forecasting.
Under memory constraints in particular, SPAM showed advantages over other memory-efficient optimizers such as GaLore and Adam-Mini.
Conclusion
The research underscores the importance of mitigating gradient spikes for stable and efficient LLM training. The SPAM optimizer offers a promising approach to this problem and could help advance the development and application of LLMs. The implementation of SPAM is publicly available, allowing researchers and developers to test the method and integrate it into their own projects.
Bibliography
Huang, T., Zhu, Z., Jin, G., Liu, L., Wang, Z., & Liu, S. (2025). SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training. arXiv preprint arXiv:2501.06842.
https://openreview.net/forum?id=L9eBxTCpQG
https://openreview.net/pdf/a03ba443f00476de6b012cc5af169958f87fa80c.pdf
https://paperreading.club/page?id=277927