Adaptive Computation Pruning Boosts Efficiency of Forgetting Transformers

The ever-growing size of Transformer models drives up demand for compute and memory, which makes deploying them in resource-constrained environments difficult. A promising way to address this challenge is "pruning": the targeted removal of less relevant components or computations. A new research article introduces a pruning method called "Adaptive Computation Pruning" (ACP) for the Forgetting Transformer (FoX).
The Forgetting Transformer extends the classic Transformer by integrating a "forget gate" into the softmax attention mechanism, which lets the model down-weight stale information and focus on the relevant context. A notable characteristic of FoX is that many attention heads tend to forget quickly, so their output at any given position depends primarily on the local context.
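In code, the forget gate can be viewed as a data-dependent decay bias added to the attention logits. The following single-head sketch is a minimal illustration, assuming a scalar per-timestep gate whose log is precomputed; the actual FoX implementation fuses this into a hardware-efficient FlashAttention-style kernel, and the function name here is ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def fox_attention(q, k, v, log_f):
    """Minimal single-head sketch of forgetting attention (FoX).

    q, k, v: (T, d) query/key/value matrices.
    log_f:   (T,) log of the forget gate (a sigmoid output, so log_f <= 0).
    The gate contributes a decay bias D[i, j] = sum_{l=j+1..i} log_f[l]
    to the causal attention logits before the softmax.
    """
    T, d = q.shape
    c = torch.cumsum(log_f, dim=0)            # c[t] = sum_{l<=t} log_f[l]
    D = c.unsqueeze(1) - c.unsqueeze(0)       # D[i, j] = c[i] - c[j]
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    logits = (q @ k.T) / d**0.5 + D
    logits = logits.masked_fill(~causal, float("-inf"))
    return F.softmax(logits, dim=-1) @ v
```

For heads whose gates stay well below 1, the bias D decays rapidly with the distance between query and key positions, and this locality is exactly what ACP exploits.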
ACP leverages this characteristic of FoX by dynamically skipping computations whose input-output dependencies have been strongly decayed by the forget gate. Specifically, a dynamically adjusted threshold guarantees that the pruned attention weights remain negligible, so computational cost drops significantly without compromising model performance.
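The skipping decision itself can be illustrated with a simplified sketch. The paper implements ACP inside a FlashAttention-style kernel that processes attention in query/key tiles; the sketch below shows only the tile-level test, uses a fixed `thresh` where the paper derives a dynamically adjusted, provably safe one, and ignores the query-key score range that the full criterion also accounts for. All names here are illustrative.

```python
import torch

def acp_skip_mask(c, block, thresh):
    """Illustrative tile-skipping test in the spirit of ACP (not the
    authors' kernel). c: (T,) cumulative log forget gates; c is
    non-increasing because each log gate is <= 0. Entry [qi, kj] of the
    result is True if query block qi may skip key block kj entirely:
    even the largest decay bias in that tile falls below `thresh`, so
    every attention weight in it is guaranteed to be negligible.
    """
    nb = c.shape[0] // block
    skip = torch.zeros(nb, nb, dtype=torch.bool)
    for qi in range(nb):
        for kj in range(qi):                # strictly-past key blocks only
            i_min = qi * block              # c[i] is largest at the first query row
            j_max = (kj + 1) * block - 1    # -c[j] is largest at the last key column
            skip[qi, kj] = bool(c[i_min] - c[j_max] < thresh)
    return skip
```

A kernel consulting this mask never loads the skipped key/value tiles at all, which is where the savings in FLOPs and memory traffic come from.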
Applying ACP during the pre-training of language models with FoX yields promising results. The number of FLOPs (floating-point operations) in softmax attention is reduced by approximately 70%, consistently across model sizes and context lengths, which translates into a training-throughput increase of about 10% to 35%. Longer context lengths lead to greater savings, since attention accounts for a growing share of total compute as sequences get longer, and the speedups come with no loss in performance.
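The gap between the 70% attention-level FLOP reduction and the 10% to 35% end-to-end gain follows from attention's share of total training compute. A back-of-the-envelope Amdahl's-law estimate makes this concrete; the shares below are assumed for illustration, not figures from the paper.

```python
def end_to_end_speedup(attn_share, attn_flop_cut=0.70):
    """Amdahl's-law estimate: if attention is `attn_share` of total compute
    and ACP removes `attn_flop_cut` of attention FLOPs, the overall
    speedup is 1 / (1 - attn_share * attn_flop_cut)."""
    return 1.0 / (1.0 - attn_share * attn_flop_cut)

# Attention's share of compute grows with context length (shares assumed):
for share in (0.15, 0.25, 0.35):
    print(f"attention share {share:.0%} -> ~{end_to_end_speedup(share) - 1:.0%} faster")
```

This reproduces the rough 10% to 35% range and explains why longer contexts benefit more.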
ACP rests on the observation that the forget gate renders certain computations in the attention mechanism numerically negligible. By identifying and skipping these computations, the computational cost is reduced without affecting the model's accuracy, and the dynamic adjustment of the threshold ensures that only truly irrelevant computations are removed.
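A quick numeric check shows why such weights are safe to drop; the value below is hypothetical, chosen only to illustrate the scale involved.

```python
import math

# A cumulative log forget gate of -20 (hypothetical) scales the pre-softmax
# attention weight down by a factor of exp(-20) relative to an undecayed
# position with the same query-key score:
decay_bias = -20.0
print(math.exp(decay_bias))  # ~2.1e-09, numerically negligible
```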
These results suggest that ACP is a promising approach for making Transformer models more efficient. Speedups that come at no cost in quality open up new possibilities for applying large language models in resource-constrained environments. Future research could explore applying ACP to other areas of deep learning and optimizing the method for different hardware architectures.
The implementation of ACP is publicly available, so other researchers can use and build on it, which fosters collaboration and progress in efficient deep learning.
Bibliography:
Lin, Z., Obando-Ceron, J., He, X. O., & Courville, A. (2025). Adaptive Computation Pruning for the Forgetting Transformer. arXiv preprint arXiv:2504.06949. https://arxiv.org/abs/2504.06949
https://arxiv.org/html/2504.06949v1
https://www.researchgate.net/publication/390638983_Adaptive_Computation_Pruning_for_the_Forgetting_Transformer
https://github.com/zhixuan-lin/arctic-fox
https://www.themoonlight.io/review/adaptive-computation-pruning-for-the-forgetting-transformer
https://www.themoonlight.io/fr/review/adaptive-computation-pruning-for-the-forgetting-transformer
https://openaccess.thecvf.com/content/CVPR2024/papers/Ilhan_Resource-Efficient_Transformer_Pruning_for_Finetuning_of_Large_Models_CVPR_2024_paper.pdf
https://chatpaper.com/chatpaper/fr/paper/127976
https://papers.nips.cc/paper_files/paper/2023/file/ced46a50befedcb884ccf0cbe8c3ad23-Paper-Conference.pdf
https://www.sciencedirect.com/science/article/abs/pii/S0951832023005197