Process-Based Self-Rewarding Improves Reasoning in Large Language Models

Self-Learning Language Models Through Process-Based Self-Reward

Large language models (LLMs) have demonstrated impressive performance across a wide range of applications. To improve these models further, they are often trained on human-annotated preference data. However, this approach is limited by the cost and capacity of human annotation. As an alternative, the concept of self-reward has been developed, in which LLMs evaluate their own outputs and generate training data from them.

Previous approaches to self-reward have proven ineffective for mathematical reasoning and can even degrade performance. New research (arXiv:2503.03746) therefore proposes process-based self-reward for language models. This approach integrates multi-step thinking, step-by-step evaluation by the LLM itself acting as a judge (LLM-as-a-Judge), and step-by-step preference optimization within the self-reward paradigm.

Process-based self-reward allows the LLM to solve complex mathematical problems step by step and to evaluate each individual step itself. Through iterative application of this process, the model's mathematical reasoning ability improves. Initial results show that this approach can significantly increase LLM performance on several mathematical reasoning benchmarks.

How Process-Based Self-Reward Works

Process-based self-reward is based on three core components:

First, multi-step thinking: Instead of solving a problem in a single pass, the problem is broken down into smaller, more manageable sub-steps. The LLM generates a solution for each sub-step and then evaluates it itself.
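A minimal sketch of such step-wise generation is shown below, assuming a generic `llm` callable that maps a prompt string to a completion; the prompt wording, the "Final answer:" stop marker, and the step limit are illustrative assumptions, not details taken from the paper.

```python
from typing import Callable, List

def generate_steps(llm: Callable[[str], str], problem: str, max_steps: int = 8) -> List[str]:
    """Ask the model for one reasoning step at a time until it signals a final answer."""
    steps: List[str] = []
    for _ in range(max_steps):
        prompt = (
            "Solve the problem one step at a time.\n"
            f"Problem: {problem}\n"
            "Steps so far:\n" + "\n".join(steps) + "\n"
            "Write only the next step. If the solution is complete, "
            "start your reply with 'Final answer:'."
        )
        step = llm(prompt).strip()
        steps.append(step)
        # Stop once the model declares the solution complete.
        if step.lower().startswith("final answer"):
            break
    return steps
```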

Second, step-by-step evaluation by the LLM: The LLM acts as a judge of its own work and rates the quality of each partial solution. These step-level judgments provide the self-reward signal and serve as the basis for optimizing the model.
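A sketch of such a step-level judge follows, again assuming a generic `llm` callable; the grading prompt and the 1-10 scale are assumptions for illustration, since the paper defines its own judging prompt and criteria.

```python
import re
from typing import Callable, List

def judge_step(llm: Callable[[str], str], problem: str,
               previous_steps: List[str], candidate_step: str) -> float:
    """Have the same model rate one reasoning step on a 1-10 scale and parse the score."""
    prompt = (
        "You are a strict grader. Rate the correctness and usefulness of the candidate "
        "step on a scale from 1 to 10. Reply with the number only.\n"
        f"Problem: {problem}\n"
        "Previous steps:\n" + "\n".join(previous_steps) + "\n"
        f"Candidate step: {candidate_step}\n"
    )
    reply = llm(prompt)
    # Extract the first number in the reply; fall back to 0 if none is found.
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else 0.0
```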

Third, step-by-step optimization of preferences: Based on the self-evaluations, preference data over individual steps is constructed and the model's preferences are gradually adjusted. This iterative process allows the model to continuously improve its performance.
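The following sketch shows how step-level judgments could be turned into preference data; the chosen/rejected dictionary format mirrors what common DPO-style preference trainers expect, and the margin threshold is an illustrative assumption rather than the paper's exact recipe.

```python
from typing import Dict, List, Tuple

def step_preference_pairs(problem: str, shared_steps: List[str],
                          judged_candidates: List[Tuple[str, float]],
                          margin: float = 2.0) -> List[Dict[str, str]]:
    """Turn judged candidate next-steps (text, score) that share the same step prefix
    into a chosen/rejected pair for DPO-style preference training."""
    if len(judged_candidates) < 2:
        return []
    prompt = f"Problem: {problem}\nSteps so far:\n" + "\n".join(shared_steps)
    ranked = sorted(judged_candidates, key=lambda c: c[1], reverse=True)
    (best, best_score), (worst, worst_score) = ranked[0], ranked[-1]
    # Keep only pairs where the self-judge sees a clear quality gap.
    if best_score - worst_score < margin:
        return []
    return [{"prompt": prompt, "chosen": best, "rejected": worst}]
```

In each iteration, preference data constructed this way would be used to update the model, and the next iteration starts from the improved model, which is what drives the continuous improvement described above.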

Potential and Outlook

Process-based self-reward has the potential to significantly expand the capabilities of LLMs in the area of mathematical reasoning. Through the iterative application of this approach, LLMs may even be able to surpass human capabilities. This development is particularly relevant for companies like Mindverse, which specialize in the development of AI-based solutions. Process-based self-reward could form the basis for new, more powerful chatbots, voicebots, AI search engines, and knowledge systems.

Research in this area is still in its early stages, but the initial results are promising. Future research could focus on the application of process-based self-reward in other areas, such as text generation or code creation. Furthermore, investigating the ethical implications of this technology is of great importance.

Bibliography:
- https://arxiv.org/abs/2503.03746
- https://arxiv.org/html/2401.10020v1
- https://huggingface.co/papers/2401.10020
- https://raw.githubusercontent.com/mlresearch/v235/main/assets/yuan24d/yuan24d.pdf
- https://openreview.net/forum?id=0NphYCmgua
- https://dl.acm.org/doi/10.5555/3692070.3694459
- https://github.com/raulc0399/self-rewarding-language-models
- https://www.marktechpost.com/2024/01/22/this-ai-paper-from-meta-and-nyu-introduces-self-rewarding-language-models-that-are-capable-of-self-alignment-via-judging-and-training-on-their-own-generations/
- https://github.com/YiyangZhou/CSR
- https://proceedings.neurips.cc/paper_files/paper/2024/file/5c20c00504e0c049ec2370d0cceaf3c4-Paper-Conference.pdf