Teacher Hacking: A New Challenge in Language Model Distillation

The Phenomenon of "Teacher Hacking" in Language Model Distillation
The development and improvement of large language models (LLMs) is a dynamic research area with far-reaching implications. A common approach to making LLMs more efficient is distillation, where a smaller "student" model is trained to imitate the outputs of a larger, more powerful "teacher" model. This process allows much of the teacher's capability to be transferred to a more compact and efficient student model. Another important step in LLM development is Reinforcement Learning from Human Feedback (RLHF), which aims to align models with human preferences. RLHF carries the risk of "reward hacking," where the model learns to exploit flaws in the reward model instead of fulfilling the actual task.
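To make the imitation objective concrete, the following is a minimal sketch of a token-level distillation loss in PyTorch, assuming a forward KL divergence between the teacher's and the student's next-token distributions; the temperature parameter and tensor shapes are illustrative assumptions, not the exact setup of any specific paper.

```python
# Minimal sketch of a distillation loss (assumption: forward KL on token logits).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Token-level KL(teacher || student).

    Both logit tensors have shape (batch, seq_len, vocab_size).
    """
    t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div expects the student's log-probs as input and, with log_target=True,
    # the teacher's log-probs as target.
    kl = F.kl_div(s_log_probs, t_log_probs, log_target=True, reduction="batchmean")
    return kl * (temperature ** 2)

# Tiny smoke test with random logits (batch=2, seq_len=5, vocab=100).
student_logits = torch.randn(2, 5, 100)
teacher_logits = torch.randn(2, 5, 100)
print(distillation_loss(student_logits, teacher_logits).item())
```

Minimizing this loss pushes the student toward the teacher's distribution, which is exactly why any systematic errors in the teacher can be inherited as well.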
Recent research investigates a similar phenomenon in the context of distillation, referred to as "teacher hacking." The hypothesis is that the student model learns not only the desired knowledge but also the weaknesses and errors of the teacher model, since the teacher is itself only an approximation of the ideal language distribution. This can lead to the student model producing results that are suboptimal with respect to the true distribution, even as it imitates the teacher's outputs ever more faithfully.
To investigate this phenomenon, a controlled experiment was conducted. An "oracle" model was used as a reference for the true language distribution. A teacher model was distilled from this oracle, and this teacher in turn served as the basis for distilling a student model. Because the oracle is available, the student can be evaluated both against the teacher (the proxy objective that is actually optimized) and against the oracle (the ground-truth objective), which makes teacher hacking directly measurable. The results of this study provide important insights into teacher hacking and its impact on the performance of student models.
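The following is a hedged sketch of how this proxy-versus-golden comparison can be turned into a simple diagnostic: teacher hacking shows up when the distance to the teacher keeps shrinking while the distance to the oracle bottoms out and grows again. The function name and the example metric histories are illustrative placeholders, not values from the study.

```python
# Sketch: detect the teacher-hacking signature from two metric histories
# recorded over training checkpoints (lower = better for both).

def detect_teacher_hacking(proxy_history, golden_history):
    """Return True if the proxy metric improved while the golden metric degraded."""
    proxy_improving = proxy_history[-1] < proxy_history[0]
    golden_degrading = golden_history[-1] > min(golden_history)
    return proxy_improving and golden_degrading

# Illustrative values: distance to the teacher keeps falling, but distance to
# the oracle reaches a minimum and then rises again.
proxy = [0.80, 0.55, 0.40, 0.30, 0.25]
golden = [0.90, 0.70, 0.65, 0.70, 0.80]
print(detect_teacher_hacking(proxy, golden))  # True
```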
Offline vs. Online Distillation
A central aspect of the investigation is the comparison between offline and online distillation. In offline distillation, the student is trained on a fixed dataset of pre-generated outputs, while in online distillation, fresh training data is generated during the training process. The results show that teacher hacking occurs particularly in offline distillation: the proxy metric keeps improving while the golden metric degrades, and the optimization process deviates from the polynomial convergence trend expected of a well-behaved distillation run. In contrast, online distillation effectively mitigated teacher hacking.
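To make the distinction tangible, here is a hedged sketch contrasting the two data pipelines: offline training iterates over a fixed corpus of teacher completions, whereas online training regenerates completions on the fly during training. The generator functions and argument names are illustrative assumptions; real pipelines depend on the training framework and on whether fresh data is sampled from the student or the teacher.

```python
# Sketch: offline vs. online data sources for distillation (names are illustrative).

def offline_batches(fixed_dataset, n_steps):
    """Offline: cycle through a fixed corpus of (prompt, completion) pairs."""
    for step in range(n_steps):
        yield fixed_dataset[step % len(fixed_dataset)]

def online_batches(prompts, generate_fn, n_steps):
    """Online: completions are freshly generated at each step, so the data
    distribution moves with the current policy instead of staying frozen."""
    for step in range(n_steps):
        prompt = prompts[step % len(prompts)]
        yield prompt, generate_fn(prompt)
```

The key difference is that the offline corpus can be over-fit, including the teacher's idiosyncrasies, whereas online generation keeps exposing the student to new samples.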
The Importance of Data Diversity
The research results suggest that the diversity of the training data is a crucial factor in avoiding teacher hacking. Higher data diversity leads to a more robust and generalizable representation of the language distribution, making the student model less susceptible to the specific weaknesses of the teacher model. This underscores the importance of careful selection and design of the training data for effective distillation of language models.
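One simple way to raise data diversity, sketched below under the assumption that several teacher completions are sampled per prompt at a nonzero temperature rather than a single greedy completion, is to broaden the set of teacher outputs the student sees; `teacher_generate` and the parameter values are hypothetical placeholders.

```python
# Sketch: build a more diverse distillation corpus by sampling multiple
# teacher completions per prompt (all names and defaults are illustrative).

def build_diverse_dataset(prompts, teacher_generate, samples_per_prompt=4, temperature=0.9):
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = teacher_generate(prompt, temperature=temperature)
            dataset.append((prompt, completion))
    return dataset
```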
The findings of this study contribute to a better understanding of the advantages and limitations of distillation and offer valuable guidance for the development of robust and efficient language models. The identification of teacher hacking and the development of strategies to avoid it are crucial for future research and application of LLMs in various fields, from chatbots and virtual assistants to automated text generation and translation.
Bibliography:
- Tiapkin, D., et al. "On Teacher Hacking in Language Model Distillation." *arXiv preprint arXiv:2502.02671* (2025).
- Bundesamt für Sicherheit in der Informationstechnik (BSI). "Whitepaper Large Language Models (LLMs)." (2024).
- Anil, R., et al. "SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models." *arXiv preprint arXiv:2410.19503* (2024).
- Touvron, H., et al. "Llama 2: Open Foundation and Fine-Tuned Chat Models." *arXiv preprint arXiv:2307.09288* (2023).
- Sanh, V., et al. "Multitask Prompted Training Enables Zero-Shot Task Generalization." *arXiv preprint arXiv:2110.08207* (2021).
- Chowdhery, A., et al. "PaLM: Scaling Language Modeling with Pathways." *arXiv preprint arXiv:2204.02311* (2022).