Probabilistic Inference Improves LLM Inference Scaling

Scaling LLM Inference Time: A Probabilistic Approach

Large Language Models (LLMs) have achieved impressive performance gains by increasing model size and training data volume. However, recent developments indicate diminishing returns from these scaling methods. This has motivated the exploration of methods that instead scale the computation spent at inference time, i.e., when the model is actually applied.

Previous approaches to inference-time scaling, most of which rely on reward models, treat the task as a search problem: find the output that maximizes the reward model's score. This makes them susceptible to so-called "reward hacking," where the search exploits approximation errors in the reward model and converges on outputs that score highly without actually being correct.

A new line of research instead treats inference-time scaling as a probabilistic inference task. Rather than searching directly for the mode of the state distribution of a state-space model with an approximate likelihood function, it uses sampling methods to explore the typical set of that distribution.
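Concretely, this framing can be written as a posterior over partial solutions. The following is a minimal sketch of such a state-space formulation; the notation (states x_t, binary observations o_t, reward model r̂) is illustrative rather than taken verbatim from the paper:

```latex
\[
p(x_{1:T} \mid o_{1:T} = 1) \;\propto\;
p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1})
\prod_{t=1}^{T} p(o_t = 1 \mid x_t),
\qquad
p(o_t = 1 \mid x_t) \approx \hat{r}(x_t).
\]
```

Here the transitions p(x_t | x_{t-1}) are next-step proposals generated by the LLM, and the observation o_t = 1 encodes "this partial solution lies on a correct path," with its likelihood approximated by a reward model r̂. Mode-seeking search returns the arg max of this posterior; a sampling method instead draws from the posterior itself, which is less brittle when r̂ is imperfect.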

Specifically, a novel approach to inference-time scaling is proposed, based on particle-based Monte Carlo methods. These methods approximate the probability distribution over solutions by moving a set of "particles" through the state space. Each particle represents a candidate partial solution; at every step, the particles are extended by the LLM, weighted by the approximate likelihood (the reward model's score), and resampled so that computation concentrates on promising trajectories. A sketch of this loop follows below.
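To make the mechanics concrete, here is a minimal, generic particle-filtering loop in this spirit. It is a sketch under stated assumptions, not the paper's implementation: llm_propose_step and reward_score are hypothetical stand-ins for an LLM step generator and a process reward model.

```python
import random


def particle_filter(prompt, llm_propose_step, reward_score,
                    num_particles=8, max_steps=10):
    """Particle-filtering loop for step-by-step LLM reasoning (sketch).

    llm_propose_step(text) -> (next_step_text, finished)  # hypothetical LLM call
    reward_score(text)     -> non-negative float          # hypothetical reward model
    """
    # Each particle is a (partial_solution, finished) pair.
    particles = [(prompt, False)] * num_particles

    for _ in range(max_steps):
        # 1. Propose: extend every unfinished particle by one reasoning step.
        proposals = []
        for text, done in particles:
            if done:
                proposals.append((text, True))
            else:
                step, now_done = llm_propose_step(text)
                proposals.append((text + step, now_done))

        # 2. Weight: the reward model plays the role of an approximate likelihood.
        weights = [reward_score(text) for text, _ in proposals]
        total = sum(weights)
        if total <= 0:  # degenerate case: fall back to uniform weights
            probs = [1.0 / len(proposals)] * len(proposals)
        else:
            probs = [w / total for w in weights]

        # 3. Resample: duplicate high-weight particles and drop low-weight ones,
        #    concentrating compute on promising trajectories.
        particles = random.choices(proposals, weights=probs, k=num_particles)

        if all(done for _, done in particles):
            break

    # Return the candidate the reward model scores highest.
    return max(particles, key=lambda p: reward_score(p[0]))[0]
```

Unlike best-of-n sampling, which carries every rollout to completion before scoring, the resampling step prunes weak trajectories early and reallocates their budget to copies of strong ones. The randomness in resampling is also what keeps the procedure exploring the typical set instead of collapsing onto the reward model's (possibly hacked) mode.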

Empirical evaluations show that this probabilistic approach exhibits a 4 to 16 times better scaling rate compared to deterministic search methods on various challenging mathematical reasoning tasks.

The results show, for example, that Qwen2.5-Math-1.5B-Instruct can surpass the accuracy of GPT-4o in only 4 rollouts using this method, while Qwen2.5-Math-7B-Instruct reaches o1-level accuracy in only 32 rollouts.

This research not only introduces an effective method for scaling inference time but also connects the extensive literature on probabilistic inference with LLM inference-time scaling, laying the groundwork for more robust algorithms in future work. This opens up new possibilities for using LLMs more efficiently, especially in computationally intensive applications.

The results suggest that particle-based Monte Carlo methods are a promising approach for scaling the inference time of LLMs. By framing the task as probabilistic inference and using sampling methods, the pitfalls of traditional search methods can be avoided. The improved scaling rate and the accuracies achieved indicate significant potential for future developments in this area.

For companies like Mindverse, which specialize in developing AI solutions, these advances in LLM research are of great importance. Using LLMs more efficiently through improved inference-time scaling enables more powerful and cost-effective AI applications, such as chatbots, voicebots, AI search engines, and knowledge systems. Integrating these new methods into Mindverse's existing platforms could lead to significant improvements in performance and efficiency, providing added value to customers.

Bibliography:
https://arxiv.org/abs/2502.01618
https://probabilistic-inference-scaling.github.io/
http://paperreading.club/page?id=281551
https://www.en.wisostat.statistik.uni-muenchen.de/research/workshop/costnet2020/pdf_files/costnet/abstractbyshkin.pdf
https://proceedings.mlr.press/v33/wood14.html
https://paperswithcode.com/paper/probabilistic-inference-in-language-models
https://www.econ.upf.edu/~omiros/papers/CUP_chapter.pdf
https://researchportal.tuni.fi/files/13050803/nurminen_1499.pdf
https://typeset.io/pdf/probabilistic-topological-maps-47csbfu5gz.pdf
https://www.aanda.org/articles/aa/full_html/2023/03/aa45098-22/aa45098-22.html