Self-Training Rerankers Improve Code Generation Models


Self-Learning Systems for Improving Code Generation

Generating high-quality code for complex programming tasks remains challenging, especially for current decoder-based models. Their outputs are stochastic, and even small errors can render an entire solution unusable. Sampling multiple candidate solutions and choosing among them can significantly improve the quality of the final result. An effective way to do this is to pair a code generation model with a so-called reranker model, which selects the best solution from the generated candidates.
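
The selection step can be illustrated with a short sketch. The following Python snippet is purely illustrative; function names such as generate_candidates and score_candidate are placeholders rather than part of any specific implementation:

```python
from typing import Callable, List


def rerank_best_of_n(
    prompt: str,
    generate_candidates: Callable[[str, int], List[str]],
    score_candidate: Callable[[str, str], float],
    n: int = 8,
) -> str:
    """Sample n candidate programs and return the one the reranker scores highest."""
    candidates = generate_candidates(prompt, n)  # stochastic outputs of the generator
    scored = [(score_candidate(prompt, c), c) for c in candidates]
    _, best_solution = max(scored, key=lambda pair: pair[0])
    return best_solution
```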

A promising approach in this area is iterative self-training of reranker models using Proximal Policy Optimization (PPO). The procedure aims to improve both the accuracy of the reranking and the code generation process as a whole. Whereas traditional PPO setups focus on optimizing a generative model against a fixed reward model, this approach centers on developing a robust reward/reranker model: the reranker raises the quality of the generated code at selection time and catches errors that the reward signal alone might overlook during PPO alignment of the generator.
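
At a high level, the training loop might be sketched as follows. This is one possible reading of the described procedure, not the authors' reference implementation; the helper callables sample_solutions, train_reranker, and ppo_align_generator, as well as the task interface, are assumptions made for illustration:

```python
def iterative_self_training(
    tasks,                 # assumed interface: each task has .prompt and .run_tests(solution) -> bool
    generator,
    reranker,
    sample_solutions,      # callable: (generator, prompt) -> list of candidate programs
    train_reranker,        # callable: (reranker, dataset) -> updated reranker
    ppo_align_generator,   # callable: (generator, reranker, tasks) -> PPO-aligned generator
    num_iterations=3,
):
    """Alternate between labeling sampled code, retraining the reranker, and PPO-aligning the generator."""
    dataset = []  # accumulated (prompt, solution, passed_tests) triples
    for _ in range(num_iterations):
        # 1. Sample candidate solutions and label them by executing the tasks' unit tests.
        for task in tasks:
            for solution in sample_solutions(generator, task.prompt):
                dataset.append((task.prompt, solution, task.run_tests(solution)))
        # 2. Retrain the reward/reranker model on the labeled candidates.
        reranker = train_reranker(reranker, dataset)
        # 3. Use the reranker as the reward signal to PPO-align the generator.
        generator = ppo_align_generator(generator, reranker, tasks)
    return generator, reranker
```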

Iterative self-training refines the training dataset by re-evaluating the model's outputs, identifying incorrect solutions that nevertheless receive high reranker scores (hard negatives), and feeding them back into the training loop. This process leads to continuous improvement in model performance: with each cycle, the model becomes better at judging the quality of generated code proposals and selecting the best solutions.
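
The hard-negative mining step might look roughly like the sketch below; the reranker and run_tests callables and the score threshold are illustrative assumptions:

```python
def mine_hard_negatives(prompt, candidates, reranker, run_tests, threshold=0.8):
    """Return (prompt, solution, label) triples for candidates the reranker confidently mis-ranks."""
    hard_negatives = []
    for solution in candidates:
        score = reranker(prompt, solution)     # reranker's estimate that the code is correct
        passed = run_tests(solution)           # ground truth from executing unit tests
        if score >= threshold and not passed:  # highly scored, yet actually wrong
            hard_negatives.append((prompt, solution, 0))  # label 0 = negative example
    return hard_negatives
```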

Evaluation and Results

Evaluations on the MultiPL-E dataset demonstrate the potential of this approach. A model with 13.4 billion parameters surpassed a 33 billion parameter model in code generation quality while being three times faster. Furthermore, it achieved performance comparable to GPT-4 and even surpassed it in one programming language. These results underscore the effectiveness of iterative self-training for reranker models and open up new possibilities for the development of more efficient and powerful code generation systems.

The combination of code generation models and self-learning rerankers represents an important step towards more reliable automated code generation. By iteratively improving the reranking process, even complex programming tasks can be solved more efficiently and with higher quality. Future research could focus on extending this approach to other programming languages and more complex code generation scenarios.

The Role of AI Partners like Mindverse

Companies like Mindverse, which act as AI partners and develop customized solutions such as chatbots, voicebots, AI search engines, and knowledge systems, play a crucial role in the further development and application of such innovative technologies. By providing all-in-one content tools for AI text, content, images, and research, they enable the efficient implementation and integration of advanced AI models in various application areas.

