Active Learning Improves Efficiency of Process Reward Model Training

Training large language models (LLMs) demands vast amounts of data and compute. A promising approach to improving their performance is the use of process reward models (PRMs), which provide step-by-step feedback on a model's reasoning rather than judging only the final answer, enabling more precise and effective optimization. Creating training data for PRMs, however, is a significant challenge: step-level labels are costly both in human effort and in the compute needed for automated annotation.
Active learning offers a practical solution to this problem. Instead of annotating large amounts of data at random, it selects the most informative examples for labeling, concentrating the annotation budget on the data points from which the model stands to learn the most. This can significantly reduce annotation effort without compromising model performance; a generic version of the loop is sketched below.
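To make this concrete, here is a minimal, self-contained sketch of the generic pool-based active learning loop described above. The `annotate` oracle, the `informativeness` score, and the round and batch sizes are placeholders for illustration, not details taken from ActPRM.

```python
import random
from typing import Callable, List, Tuple

def active_learning_loop(
    unlabeled_pool: List[str],
    annotate: Callable[[str], int],           # expensive oracle (human or strong model)
    informativeness: Callable[[str], float],  # higher = more useful to label next
    rounds: int = 5,
    batch_size: int = 10,
) -> List[Tuple[str, int]]:
    """Pool-based active learning: each round, label only the most informative examples."""
    labeled: List[Tuple[str, int]] = []
    pool = list(unlabeled_pool)
    for _ in range(rounds):
        # Rank the remaining pool and send only the top of it to the annotator.
        pool.sort(key=informativeness, reverse=True)
        batch, pool = pool[:batch_size], pool[batch_size:]
        labeled.extend((x, annotate(x)) for x in batch)
        # A full implementation would retrain the model on `labeled` here and
        # recompute `informativeness` from the updated model (omitted in this sketch).
    return labeled

# Toy usage with stand-in oracle and scoring function.
pool = [f"example-{i}" for i in range(100)]
data = active_learning_loop(pool, annotate=lambda x: random.randint(0, 1),
                            informativeness=lambda x: random.random())
print(len(data))  # 50 labeled examples instead of 100
```

The key point is that the expensive `annotate` call is only ever made on the examples the selection strategy ranks highest, which is where the annotation savings come from.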
A recent example of active learning applied to PRMs is ActPRM. This approach uses the uncertainty of the PRM itself to guide which data points get annotated: during training, a forward pass scores each candidate, only the examples the PRM is most uncertain about are retained, and those are then labeled by a capable but computationally expensive model.
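The selection step can be illustrated with a short sketch, assuming the PRM emits a per-step probability that a reasoning step is correct and using binary entropy as the uncertainty measure. The stand-in scorer, the entropy criterion, and the annotation budget are illustrative assumptions, not the exact mechanism reported for ActPRM.

```python
import torch

# Hypothetical stand-in for a trained PRM head: maps encoded reasoning steps to a
# per-step probability of being correct. A random linear scorer keeps the sketch
# self-contained; in practice this would be the PRM itself.
def prm_step_probs(step_encodings: torch.Tensor) -> torch.Tensor:
    scorer = torch.randn(step_encodings.shape[-1], 1)
    return torch.sigmoid(step_encodings @ scorer).squeeze(-1)

def select_for_annotation(step_encodings: torch.Tensor, budget: int) -> torch.Tensor:
    """After a forward pass, keep only the `budget` steps the PRM is least sure about
    (highest binary entropy); only these are sent to the expensive annotator."""
    with torch.no_grad():
        p = prm_step_probs(step_encodings).clamp(1e-6, 1 - 1e-6)
    entropy = -(p * p.log() + (1 - p) * (1 - p).log())
    return entropy.topk(budget).indices

# Toy usage: from 1,000 candidate steps, annotate only the 100 most uncertain.
candidates = torch.randn(1000, 64)
to_annotate = select_for_annotation(candidates, budget=100)
print(to_annotate.shape)  # torch.Size([100])
```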
Compared to conventional fine-tuning, ActPRM shows a marked improvement in annotation efficiency. Experiments indicate that it can cut annotation effort by up to 50% without compromising model performance, and in some cases the resulting models even outperform those trained on substantially more annotated data.
The advantages of ActPRM go beyond annotation efficiency. Targeted selection of training data can also improve the quality of the PRM itself: ActPRM was used to filter a pool of more than one million mathematical-reasoning trajectories, of which roughly 60% were judged relevant and kept for training. The resulting PRM achieved new top scores among models of comparable size on benchmarks such as ProcessBench (75.0%) and PRMBench (65.5%). One plausible form of such PRM-based filtering is sketched after this paragraph.
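As a rough illustration of using a PRM to filter training trajectories, the sketch below keeps a trajectory only if every step clears a score threshold, so a single weak step discards the whole chain. The per-step scores, the threshold, and the min-step rule are assumptions for illustration; the text does not specify the exact criterion ActPRM applies.

```python
from typing import Dict, List

def filter_trajectories(
    scored: Dict[str, List[float]],   # trajectory id -> per-step PRM correctness scores
    min_step_score: float = 0.5,
) -> List[str]:
    """Keep a trajectory only if the PRM rates every one of its steps above the
    threshold; the weakest step decides whether the whole chain survives."""
    return [tid for tid, steps in scored.items() if steps and min(steps) >= min_step_score]

# Toy usage: one weak step drops trajectory "traj-b".
scores = {
    "traj-a": [0.90, 0.80, 0.95],
    "traj-b": [0.70, 0.40, 0.90],
    "traj-c": [0.85, 0.60],
}
print(filter_trajectories(scores))  # ['traj-a', 'traj-c']
```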
Combining PRMs with active learning is a promising direction for training LLMs. By spending annotation resources where they matter most, it enables more capable and robust models and opens the way to new applications of artificial intelligence.