Supervised Training and Translationese in Large Language Models


Large language models (LLMs) have made remarkable progress in machine translation, demonstrating impressive performance across many languages. Nevertheless, translationese, characterized by overly literal and unnatural translations, remains a persistent challenge in LLM-based translation systems. Although LLMs are pre-trained on massive corpora of natural utterances, they still exhibit translationese errors and produce unexpectedly unnatural translations. Recent research attributes this largely to biases introduced during supervised fine-tuning (SFT).

The discrepancy between training data and real language usage plays a crucial role. Training data for machine translation often consists of carefully crafted, highly accurate translations, which do not always reflect natural language flow. This focus on literal accuracy in training can lead to LLMs failing to adequately capture and reproduce idiomatic expressions, slang, or culturally specific nuances. The result is translations that may be grammatically correct but sound stiff and unnatural, typical characteristics of translationese.

A recent study investigates the causes of translationese in LLMs and proposes methods to minimize it. The researchers argue that biases in supervised training are the main cause of this phenomenon. They systematically analyze the occurrence of translationese in LLM-generated translations and investigate its origin in the training process. The results show that the quality of the training data has a significant impact on the naturalness of the translations.

To reduce the negative effects of translationese, the researchers propose various strategies. One approach is to optimize the "gold references," i.e., the ideal translations in the training data. By "polishing" these references, for example, by considering stylistic aspects and idiomatic expressions, the naturalness of the translations can be improved. Another approach is filtering unnatural training instances. By removing examples that already exhibit characteristics of translationese, the LLM can be prevented from learning and reproducing these patterns.
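To make the filtering idea concrete, the following is a minimal Python sketch that scores target-side sentences with the perplexity of a general-purpose language model and drops training pairs whose targets score above a threshold. The model choice, the threshold, and the helper names are illustrative assumptions for this article; the study's actual filtering criterion may differ.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Hypothetical naturalness proxy: perplexity of the target sentence under a
# general-purpose LM. Fluent, idiomatic text tends to score lower than stiff,
# overly literal renderings.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(sentence: str) -> float:
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

# Toy parallel corpus of (source, target) pairs: the first target is a literal,
# translationese-style rendering, the second is idiomatic.
corpus = [
    ("Das ist ein Kinderspiel.", "That is a child's play."),
    ("Das ist ein Kinderspiel.", "That's a piece of cake."),
]

PPL_THRESHOLD = 150.0  # illustrative cut-off; would need tuning in practice

filtered = [(src, tgt) for src, tgt in corpus if perplexity(tgt) < PPL_THRESHOLD]
print(f"kept {len(filtered)} of {len(corpus)} pairs")
```

The same kind of naturalness signal could, in principle, also be used to decide which gold references to send for polishing rather than discarding them outright, though that is an extrapolation beyond what this summary covers.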

Empirical evaluations show that these approaches significantly reduce translationese while improving the naturalness of the translations, results that were validated by both human evaluations and automatic metrics. The research underscores the need for training-aware adjustments to optimize the translation performance of LLMs, paving the way for translations that are more fluent and consistent with the target language.
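As a complement to model-based metrics, simple surface statistics are often used as rough translationese indicators. The sketch below is a hedged illustration rather than the study's evaluation protocol: it compares a literal and a polished rendering using the length ratio against the source and the type-token ratio, with example sentences invented for demonstration.

```python
def length_ratio(source: str, translation: str) -> float:
    # Translationese often mirrors the source length almost token for token.
    return len(translation.split()) / max(len(source.split()), 1)

def type_token_ratio(translation: str) -> float:
    # Lower lexical variety can hint at flat, repetitive phrasing.
    tokens = translation.lower().split()
    return len(set(tokens)) / max(len(tokens), 1)

source   = "Er hat die Entscheidung getroffen, die Frage zu beantworten."
baseline = "He has made the decision to give an answer to the question."
polished = "He decided to answer the question."

for name, hyp in (("baseline", baseline), ("polished", polished)):
    print(f"{name}: length ratio {length_ratio(source, hyp):.2f}, "
          f"type-token ratio {type_token_ratio(hyp):.2f}")
```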

For companies like Mindverse, which specialize in AI-powered content creation, these findings are of great importance. The development of customized solutions such as chatbots, voicebots, AI search engines, and knowledge systems requires a deep understanding of the challenges and opportunities of machine translation. By integrating the latest research findings, companies like Mindverse can further improve the quality of their AI-based translation tools and offer their customers even more precise and natural translations.

Bibliography:

- Li, Y., Zhang, R., Wang, Z., Zhang, H., Cui, L., Yin, Y., Xiao, T., & Zhang, Y. (2025). Lost in Literalism: How Supervised Training Shapes Translationese in LLMs. arXiv preprint arXiv:2503.04369.
- Federal Office for Information Security (BSI). (2024, December 6). Working Paper on Large Language Models (LLMs).
- Pęzik, P., Wróblewska, A., & Kocoń, J. (2023). The Impact of Training Data on the Quality of Neural Machine Translation. In Book of Abstracts of the 13th Language Resources and Evaluation Conference (p. 1).
- Klyueva, N., Zaytseva, T., & Rubtsova, Y. (2018). Machine Translation of Phraseological Units. In Proceedings of the 2018 International Conference on Information Science and Communications Technologies: Applications, Trends and Opportunities (ICISCT) (pp. 1-4). IEEE.
- Tufis, D., Ion, R. A., & Celikkaya, N. B. (2019, November). CONSILR 2019: Proceedings of the 12th International Conference on Software, Knowledge, Information Management and Applications.
- Carpuat, M., Simard, M., Blackwood, G., Costa-jussà, M. R., & Way, A. (2017). Data-driven Approaches to Machine Translation Evaluation. In Statistical Machine Translation.
- Radford, A., Kim, J. W., Xu, T., Brock, A., McLeavy, N., Sutskever, I., & Zaremba, W. (2022). Language Models are Unsupervised Multitask Learners.
- Large language model. (n.d.). In Wikipedia. Retrieved October 26, 2024, from https://en.wikipedia.org/wiki/Large_language_model