Boosting Large Language Model Efficiency with Sleep-Time Compute

Large language models (LLMs) have made impressive progress in recent years and are being deployed in a growing number of areas. Scaling test-time compute, the computation spent while answering a request, is a key lever for the performance of these models, particularly on complex reasoning tasks. However, this scaling comes with high latency and significant cost. A new approach, called "Sleep-Time Compute," promises to address this trade-off.

The concept of Sleep-Time Compute is based on the idea of letting a model prepare before specific requests arrive. Much like a student who studies the material before an exam, the model uses idle periods, when no requests need to be processed, to perform relevant computation in advance. By anticipating likely user requests and pre-computing useful information, the computational effort during the actual inference phase, i.e., answering the request, can be significantly reduced.
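The split between an idle-time preparation step and a cheaper query-time step can be sketched as follows. This is a minimal illustration, not the paper's implementation: `llm` is a hypothetical stand-in for a model call, and the prompts are invented for the example.

```python
def llm(prompt: str) -> str:
    # Stand-in for a real model call (hypothetical); returns a canned
    # string so the sketch runs without any API access.
    return f"[model output for: {prompt[:40]}...]"

def sleep_time_compute(context: str) -> str:
    """Run while the system is idle: anticipate likely questions about
    the context and pre-compute useful inferences, producing an
    enriched context for later queries."""
    notes = llm(
        "Re-read the following context and derive facts, intermediate "
        "results, and answers to questions a user is likely to ask:\n"
        + context
    )
    return context + "\n\nPre-computed notes:\n" + notes

def answer(enriched_context: str, query: str) -> str:
    """Run at query time: with the notes already available, the model
    needs far less fresh reasoning (a smaller test-time budget)."""
    return llm(enriched_context + "\n\nQuestion: " + query)

# Usage: enrich once while idle, then answer incoming queries cheaply.
enriched = sleep_time_compute("Alice has 3 apples; Bob has twice as many.")
print(answer(enriched, "How many apples do Alice and Bob have together?"))
```

The key design point is that `sleep_time_compute` runs once per context, off the critical path, while `answer` is the only step whose latency the user sees.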

To demonstrate the effectiveness of this approach, modified versions of two reasoning benchmarks, Stateful GSM-Symbolic and Stateful AIME, were used. The results show that Sleep-Time Compute can reduce the compute required at inference time by a factor of 5 without affecting the accuracy of the results. Furthermore, scaling up the Sleep-Time Compute budget increased accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME.

Another advantage of Sleep-Time Compute becomes apparent when multiple related queries target the same context. To study this, Multi-Query GSM-Symbolic was developed, an extension of GSM-Symbolic with multiple related queries per context. By amortizing the sleep-time computation across these queries, the average computational cost per query was reduced by a factor of 2.5.
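The amortization argument is simple arithmetic: the one-time sleep-time cost is shared by all queries on a context, while only the (now much smaller) query-time cost is paid per query. The token counts below are invented for illustration (chosen to yield a 2.5x ratio like the reported result), not measurements from the paper.

```python
def avg_cost_per_query(sleep_tokens: int, query_tokens: int, n_queries: int) -> float:
    """Average per-query cost when a one-time sleep-time budget is
    amortized over n_queries related queries on the same context."""
    return sleep_tokens / n_queries + query_tokens

# Hypothetical numbers: 1000 tokens/query without preparation vs.
# a 3000-token one-time preparation plus 100 tokens/query afterwards.
baseline = 1000
with_sleep = avg_cost_per_query(sleep_tokens=3000, query_tokens=100, n_queries=10)
print(with_sleep)             # 400.0 tokens per query on average
print(baseline / with_sleep)  # 2.5x cheaper per query
```

Note that with a single query (`n_queries=1`) the same numbers give 3100 tokens, i.e. more than the baseline; amortization is what makes the preparation pay off.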

The effectiveness of Sleep-Time Compute depends strongly on how predictable user requests are: the better the possible requests can be anticipated, the more efficiently the idle-time compute can be used. Further analyses confirm this, showing a clear relationship between the predictability of user requests and the benefit of Sleep-Time Compute.

Finally, Sleep-Time Compute was applied to a realistic software development task in a case study. The results of this study underline the potential of the approach to improve the efficiency of LLMs in practically relevant scenarios.

Sleep-Time Compute represents a promising method for increasing the efficiency and performance of large language models. By strategically using compute power during periods of low utilization, both the latency and the cost of inference can be reduced while simultaneously improving the accuracy of the results. The predictability of user requests plays a crucial role in the effectiveness of the approach. Future research will focus on further optimizing Sleep-Time Compute and applying it across a range of domains.