Synthetic Data Diversity through Meta-Prompting: MetaSynth Enables Effective Domain Adaptation of Language Models
The use of synthetic data to improve and adapt large language models (LLMs) is becoming increasingly important. Smaller LLMs such as Phi-3.5 and Phi-4 in particular already rely on synthetically generated data. A persistent problem, however, is the limited diversity of this data, which restricts its usefulness for improving other models.
A new method called MetaSynth promises a remedy. MetaSynth uses so-called meta-prompting to increase the diversity of synthetic data: one language model acts as an orchestrator, a kind of conductor that coordinates multiple specialized "expert" LLM agents, which jointly generate a broader range of data.
Diversity through Collaboration: How MetaSynth Works
The concept behind MetaSynth is to combine the strengths of several LLMs. Instead of relying on a single model with a single prompt, MetaSynth orchestrates the collaboration of multiple specialized LLM agents: through targeted meta-prompting, each agent is instructed to contribute its expertise, which broadens the diversity of the generated data.
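The following sketch illustrates this orchestration pattern in broad strokes. The roles, prompt texts, and the generic call_llm helper are assumptions made for illustration; they are not the paper's actual scaffolding or prompts.

```python
# Minimal sketch of meta-prompting orchestration. `call_llm(role, prompt)` is a
# placeholder for whatever LLM API is available; roles and prompts are illustrative.
from typing import Callable

def metasynth_round(
    call_llm: Callable[[str, str], str],
    domain: str,
    seed_topic: str,
    expert_roles: list[str],
) -> str:
    """One generation round: an orchestrator LLM delegates to expert agents,
    then merges their contributions into a single synthetic document."""
    # 1) The orchestrator drafts a plan with a distinct sub-task for each expert.
    plan = call_llm(
        "orchestrator",
        f"You coordinate {len(expert_roles)} domain experts writing a diverse, "
        f"factual {domain} document about '{seed_topic}'. "
        f"Give each expert a distinct, non-overlapping sub-task.",
    )

    # 2) Each expert agent produces its contribution from the orchestrator's plan.
    contributions = []
    for role in expert_roles:
        contributions.append(
            call_llm(role, f"Plan from the orchestrator:\n{plan}\n\n"
                           f"As the {role}, write your assigned section.")
        )

    # 3) The orchestrator merges the expert outputs into one coherent document.
    return call_llm(
        "orchestrator",
        "Combine the following expert sections into one coherent document, "
        "removing redundancy:\n\n" + "\n\n".join(contributions),
    )

# Example usage with hypothetical expert roles for the finance domain:
# doc = metasynth_round(call_llm, "finance", "credit risk modelling",
#                       ["market analyst", "regulatory expert", "quant researcher"])
```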
Effective Domain Adaptation with Limited Data
The effectiveness of MetaSynth was demonstrated in experiments in which a pre-trained LLM (Mistral-7B-v0.3) was adapted to the finance and biomedicine domains. With only 25 million synthetic tokens generated by MetaSynth, the model's performance in these domains improved significantly without degrading its general capabilities.
By comparison, the same model performed worse when trained on data generated with simple template prompts. This underscores the importance of data diversity for effective domain adaptation. The results suggest that a few million tokens of diverse synthetic data, generated without mixing in any real data, are sufficient for effective domain adaptation with MetaSynth.
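For contrast, a single-model template-prompt generator of the kind used as a baseline can look roughly like the sketch below; the template text is an assumption, not the paper's actual prompt. Because every call reuses one fixed template, the outputs tend to share phrasing and structure, which is precisely the diversity limitation MetaSynth targets.

```python
# Illustrative single-template generation baseline (template text is hypothetical).
TEMPLATE = (
    "Write a detailed article about {topic} in the {domain} domain. "
    "Use clear headings and a professional tone."
)

def template_generate(call_llm, domain: str, topics: list[str]) -> list[str]:
    """Generate one document per topic from a single fixed prompt template."""
    return [call_llm("writer", TEMPLATE.format(topic=t, domain=domain)) for t in topics]
```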
Measuring Data Diversity
The diversity of the data generated with MetaSynth was evaluated using seven automated metrics. The results show that the diversity of the synthetic data comes close to the diversity of the large pre-training corpora of LLMs. This suggests that MetaSynth is a promising approach for generating high-quality synthetic data for the training and adaptation of LLMs.
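The seven metrics are not listed here. As an example of how such automated diversity scores are computed, the sketch below implements distinct-n, a common lexical-diversity measure (the ratio of unique to total n-grams in a corpus); it is purely illustrative and not necessarily one of the metrics used in the paper.

```python
# distinct-n: higher values mean less n-gram repetition across the corpus.
def distinct_n(documents: list[str], n: int = 2) -> float:
    total, unique = 0, set()
    for doc in documents:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i : i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Example: distinct_n(["the cat sat", "the cat ran"], n=2) -> 0.75
```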
Continual Pre-training with MetaSynth
These gains were achieved through continual pre-training of Mistral-7B-v0.3 on the data generated by MetaSynth, with improvements over the base model of up to 4.08% in finance and 13.75% in biomedicine. The results highlight the potential of MetaSynth for the continued improvement of LLMs.
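As a rough illustration of what such continual pre-training looks like in practice, the sketch below uses the Hugging Face Trainer on a plain-text synthetic corpus. The file name, hyperparameters, and output path are placeholders, not the settings reported in the paper.

```python
# Sketch of continual (domain-adaptive) pre-training with Hugging Face Transformers.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "mistralai/Mistral-7B-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Synthetic domain corpus generated beforehand; the path is illustrative.
dataset = load_dataset("text", data_files={"train": "metasynth_finance.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mistral-7b-finance-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```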
Conclusion: MetaSynth as a Promising Approach for the Future of AI
MetaSynth offers an innovative solution for generating diverse synthetic data and thus enables effective domain adaptation of LLMs. The method demonstrates that by skillfully combining meta-prompting and the collaboration of multiple specialized LLMs, the limitations of previous approaches can be overcome. The results of the study suggest that MetaSynth is an important step towards a more efficient and flexible use of LLMs in various application areas.