DataDecide Project Offers Insights into Efficient Language Model Pretraining

Pretraining large language models (LLMs) is a computationally intensive and costly process, so selecting the right training data is crucial for the efficiency and quality of the resulting model. The DataDecide project from the Allen Institute for AI (AI2) provides valuable insights into this process and offers concrete recommendations for choosing training data, benchmarks, and evaluation methods.

DataDecide: A Comprehensive Experiment

DataDecide is an open-source project that releases a comprehensive collection of models, data, and evaluations. As part of the project, controlled pretraining experiments were conducted on 25 different corpora that differ in source, deduplication, and filtering, and span up to 100 billion tokens. Models with up to 1 billion parameters were trained, each configuration with three different random seeds, yielding more than 30,000 model checkpoints. Evaluation was carried out on ten downstream tasks.
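To make the scale of this sweep concrete, here is a minimal sketch of the experimental grid in Python. The recipe names and the particular model sizes listed are illustrative placeholders, not the project's actual identifiers.

```python
from itertools import product

# Illustrative reconstruction of the experimental grid. Corpus names and
# the list of model sizes are placeholders; the project covers 25 data
# recipes, sizes up to 1B parameters, and 3 random seeds per configuration.
corpora = [f"recipe_{i:02d}" for i in range(25)]      # 25 pretraining corpora
model_sizes = ["150M", "300M", "530M", "750M", "1B"]  # illustrative subset
seeds = [0, 1, 2]                                     # 3 random seeds

runs = list(product(corpora, model_sizes, seeds))
print(f"{len(runs)} training runs")  # 25 * 5 * 3 = 375 runs in this sketch

# Each run is checkpointed repeatedly during training, which is how the
# full release grows to more than 30,000 checkpoints.
```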

Predicting Model Performance Across Scales

A central goal of DataDecide is to determine how well the performance of smaller models predicts the performance of larger ones. The results show that ranking data recipes at a small scale (e.g., 150 million parameters) is a strong basis for predicting which recipes yield the best models at larger target scales (e.g., 1 billion parameters). In about 80% of comparisons, the ranking observed with the smaller models agreed with the actual ranking at the larger model size.
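As a minimal sketch of how such agreement can be scored, the snippet below computes the fraction of corpus pairs whose ordering at the small scale matches their ordering at the large scale. The recipe names and scores are invented for illustration, and this is only one plausible formalization of the ranking agreement the project reports.

```python
from itertools import combinations

def decision_accuracy(small_scores: dict, large_scores: dict) -> float:
    """Fraction of corpus pairs ranked the same way at both scales."""
    pairs = list(combinations(small_scores, 2))
    agree = sum(
        (small_scores[a] > small_scores[b]) == (large_scores[a] > large_scores[b])
        for a, b in pairs
    )
    return agree / len(pairs)

# Toy benchmark scores, purely illustrative.
small = {"recipe_a": 0.41, "recipe_b": 0.38, "recipe_c": 0.45}  # 150M models
large = {"recipe_a": 0.58, "recipe_b": 0.52, "recipe_c": 0.61}  # 1B models

print(decision_accuracy(small, large))  # 1.0 on this toy data; ~0.8 in the paper
```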

Scaling Laws and Their Limits

Although scaling laws play an important role in LLM research, none of the eight scaling-law extrapolation methods investigated surpassed the prediction accuracy of simply comparing models at a single small scale. DataDecide nevertheless provides a valuable foundation for developing and evaluating future scaling laws.
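For context, a common form of scaling-law extrapolation fits a power law with an irreducible-error term to small-scale results and extrapolates to the target scale. The sketch below illustrates this generic approach with invented numbers; it is not one of the eight specific methods evaluated in the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_millions, a, b, c):
    """Loss as a function of model size: L(N) = a * N^(-b) + c."""
    return a * n_millions ** (-b) + c

# Toy (parameters in millions, validation loss) pairs; illustrative only.
sizes = np.array([4.0, 20.0, 60.0, 150.0, 300.0])
losses = np.array([4.8, 4.1, 3.7, 3.4, 3.2])

# Fit on small scales, then extrapolate to the 1B-parameter target.
(a, b, c), _ = curve_fit(power_law, sizes, losses, p0=(5.0, 0.3, 2.0), maxfev=10000)
print(f"extrapolated loss at 1B parameters: {power_law(1000.0, a, b, c):.2f}")
```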

Efficient Metrics for Data Selection

Using continuous likelihood metrics (for example, the probability a model assigns to the correct answer) as proxies in small experiments proved particularly efficient. With these metrics, benchmarks such as MMLU, ARC, HellaSwag, MBPP, and HumanEval became more than 80% predictable at the target scale of 1 billion parameters, at only 0.01% of the compute needed to fully train the large models.
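As a rough sketch of what a continuous likelihood metric looks like in practice, the snippet below scores an answer by the mean log-probability a small causal LM assigns to its tokens. The model checkpoint ("gpt2" as a stand-in) and the exact normalization (per token vs. per character) are assumptions here; the paper examines several metric variants.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a stand-in; any small causal LM checkpoint works for the sketch.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def answer_log_likelihood(prompt: str, answer: str) -> float:
    """Mean log-probability of the answer tokens given the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # The logits at position i predict the token at position i + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    positions = range(prompt_len - 1, full_ids.shape[1] - 1)
    token_lls = [log_probs[i, full_ids[0, i + 1]].item() for i in positions]
    return sum(token_lls) / len(token_lls)

# Rank answer options by likelihood instead of discrete exact-match accuracy.
for option in [" 4", " 5"]:
    print(option, answer_log_likelihood("Question: What is 2 + 2? Answer:", option))
```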

Implications for Practice

The results of the DataDecide project offer practical recommendations for LLM developers. Ranking smaller models is an effective, cost-efficient way to predict the performance of larger models. Continuous likelihood metrics, combined with a small set of informative benchmarks, enable efficient evaluation of candidate training data. And by releasing its models, data, and evaluations, DataDecide promotes transparency and open exchange in LLM research, contributing to the development of more efficient training methods.
