Automated Curriculum Learning Improves Efficiency of Large Language Model Post-Training

Large language models (LLMs) have made impressive progress in recent years, particularly through the application of reinforcement learning (RL) in post-training. These techniques have improved the ability of LLMs to solve complex tasks and apply logical reasoning. However, one often overlooked aspect is the heterogeneity of the training data. Training data for LLMs often comes from various sources and exhibits varying degrees of difficulty. This diversity presents a challenge: How can the training process be optimally designed across different data distributions to maximize learning efficiency?

Curriculum learning offers a promising solution to this problem. Inspired by the way humans learn, the idea is to start with simple tasks and gradually progress to more complex ones. A new approach called DUMP (Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training) applies this idea at the level of entire data distributions: rather than relying on a hand-crafted schedule, it automatically constructs a curriculum during RL-based post-training of LLMs.

The core idea of DUMP is that the magnitude of the policy advantages indicates how much a model can still benefit from further training on a specific data distribution. Simply put: the larger the advantages, the more learning potential remains in that distribution. DUMP uses the Upper Confidence Bound (UCB) principle to dynamically adjust the sampling probabilities across distributions, prioritizing those with either a high average advantage magnitude (exploitation) or a low sample count (exploration). The result is an adaptive and theoretically grounded training schedule.
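As a rough illustration of this exploitation/exploration trade-off, the snippet below scores a single distribution with a standard UCB term. The function name, the +1 smoothing, and the coefficient c are illustrative assumptions, not the paper's exact formula.

```python
import math

def ucb_priority(mean_abs_advantage, n_sampled, n_total, c=1.0):
    """UCB-style priority for one data distribution.

    mean_abs_advantage: running mean of the absolute policy advantages
        observed on samples from this distribution (exploitation term).
    n_sampled: number of prompts drawn from this distribution so far.
    n_total: total number of prompts drawn across all distributions.
    c: exploration coefficient (illustrative default).
    """
    exploration = c * math.sqrt(math.log(n_total + 1) / (n_sampled + 1))
    return mean_abs_advantage + exploration
```

Distributions that still produce large advantages, or that have rarely been sampled, receive a high score and are therefore drawn more often in the next batch.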

In practice, DUMP is implemented with Group Relative Policy Optimization (GRPO) as the underlying RL algorithm. Experiments on logic-reasoning datasets of varying difficulty and from different sources demonstrate the effectiveness of the approach: DUMP significantly improves both convergence speed and final performance, underscoring the value of distribution-aware curriculum strategies in LLM post-training.
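The per-prompt signal that DUMP aggregates comes from group-relative advantages of the kind GRPO computes. The sketch below shows that computation under common assumptions (normalizing each completion's reward by the group mean and standard deviation); the function name and the epsilon are illustrative, not taken from the paper.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages in the spirit of GRPO: each completion's
    reward is normalized against the other completions sampled for the
    same prompt, so no learned value function is required."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four completions for one prompt, scored 0/1 by a rule-based verifier.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # -> approximately [ 1. -1. -1.  1.]
```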

How DUMP Works

DUMP operates iteratively. In each iteration, the model's performance is evaluated on the various data distributions. Based on the observed policy advantages and on how often each distribution has already been sampled, DUMP assigns each distribution a priority using the UCB principle. Distributions with high priority are sampled more frequently in the next training step. Repeating this process continuously adapts the training schedule to the model's learning progress; a minimal sketch of one such iteration follows.
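The sketch reuses the ucb_priority helper from above. The trainer.step interface and the running-mean bookkeeping are assumptions made for illustration, not the actual DUMP implementation.

```python
import random

def training_iteration(datasets, stats, trainer, batch_size=64, c=1.0):
    """One DUMP-style scheduling step.

    datasets: dict mapping distribution name -> list of prompts.
    stats: dict mapping distribution name -> (mean_abs_advantage, n_sampled).
    trainer: object whose step(batch) returns (name, |advantage|) pairs;
        this interface is a placeholder, not the paper's code.
    """
    total = sum(n for _, n in stats.values()) + 1
    # 1) Score every distribution with the UCB priority defined earlier.
    priorities = {name: ucb_priority(adv, n, total, c)
                  for name, (adv, n) in stats.items()}
    # 2) Sample the next batch in proportion to those priorities.
    names = list(datasets)
    chosen = random.choices(names, weights=[priorities[n] for n in names],
                            k=batch_size)
    batch = [(name, random.choice(datasets[name])) for name in chosen]
    # 3) Run one RL update and fold the observed advantages back into stats.
    for name, abs_adv in trainer.step(batch):
        mean_adv, n = stats[name]
        stats[name] = ((mean_adv * n + abs_adv) / (n + 1), n + 1)
    return stats
```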

Advantages of DUMP

The application of DUMP offers several advantages for the post-training of LLMs:

Increased Learning Efficiency: Prioritizing distributions with high learning potential reduces training time and lets the model improve faster.
Improved Performance: Because the training schedule adapts to the varying difficulty of the data, DUMP leads to higher overall model performance.
Theoretical Foundation: The UCB principle gives the dynamic adaptation of the training schedule a solid theoretical basis.
Adaptability: DUMP can be combined with different RL algorithms and datasets.

Future Research

Research in the field of curriculum learning for LLMs is still in its early stages. Future work could focus on developing even more complex curriculum strategies that, for example, consider the dependencies between different distributions. The application of DUMP to other task areas beyond logic reasoning is also a promising field of research.

DUMP represents an important step towards more efficient and effective post-training of LLMs. By considering the heterogeneity of the training data and applying dynamic curriculum strategies, LLMs can reach their full potential and become even more powerful tools for a variety of applications.

Bibliography:

Wang, Z., Cui, G., Wan, K., & Zhao, W. (2025). DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training. arXiv preprint arXiv:2504.09710.
Romac, C., Portelas, R., Hofmann, K., & Oudeyer, P.-Y. (2021). TeachMyAgent: A Benchmark for Automatic Curriculum Learning in Deep RL. Proceedings of the 38th International Conference on Machine Learning, PMLR 139.
https://github.com/mbzuai-oryx/Awesome-LLM-Post-training
https://arxiv.org/abs/2203.04166
https://epub.uni-luebeck.de/items/0cb02afc-3e66-4547-ad31-1c5412479a38
https://hal.science/hal-03173198/file/TeachMyAgent.pdf
https://www.researchgate.net/publication/389510129_LLM_Post-Training_A_Deep_Dive_into_Reasoning_Large_Language_Models
https://www.researchgate.net/publication/383060918_The_AI_Scientist_Towards_Fully_Automated_Open-Ended_Scientific_Discovery/fulltext/66bb1ea5299c327096c417ac/The-AI-Scientist-Towards-Fully-Automated-Open-Ended-Scientific-Discovery.pdf
https://www.astralcodexten.com/p/open-thread-376/comments