New Resources and Models for Historical Turkish NLP
Foundations for Processing Historical Turkish Texts: Resources and Models
The rapid advancements in Natural Language Processing (NLP), particularly through large language models (LLMs), have significantly improved the automatic processing of many languages. However, the focus has primarily been on widely spoken languages like English. Historical languages, especially those with limited digital resources, have often been neglected. This article highlights the challenges and advancements in NLP for historical Turkish and introduces new resources and models that form the basis for future research.
The Challenges of Historical Turkish
Ottoman Turkish, the longest-lasting historical variant of Turkish, changed significantly in vocabulary and syntax over six centuries. Most extant documents originate from the 18th to 20th centuries, so this article focuses on that period. As the digitization of historical texts progresses, the need for automated analysis methods grows. However, the development of such methods is hampered by the lack of annotated data, dictionaries, and linguistic references. Although modern Turkish and historical Turkish are related, differences in semantics, vocabulary, and grammar make it difficult to apply modern NLP techniques directly to historical texts.
New Resources for NLP in Historical Turkish
To address these challenges, several resources were developed as part of this study:
- HisTR: The first named entity recognition (NER) dataset for historical Turkish, consisting of 812 manually annotated sentences from the 17th to 19th centuries.
- OTA-BOUN: The first manually annotated dependency treebank for historical Turkish, with 514 sentences from various literary works, including part-of-speech tags and dependency relations.
- Ottoman Text Corpus (OTC): A cleaned text corpus from the 15th to 20th centuries, encompassing texts of various genres and suitable for diverse linguistic purposes.
Transformer-based Models
In addition to the resources, transformer-based models for dependency parsing, part-of-speech tagging, and named-entity recognition were trained and evaluated. These models serve as a benchmark for future research and provide initial insights into the performance of NLP methods for historical Turkish. The results show promising progress, but also highlight existing challenges, such as adaptation to different domains and language variations across different time periods.
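The source does not state which metrics were used for the dependency parsing benchmark. As an illustration only, the sketch below assumes the treebank is distributed in CoNLL-U format and scores a predicted parse with unlabeled and labeled attachment scores (UAS and LAS), a common choice for dependency parsing. The three-token sentence is a made-up modern Turkish example, not taken from OTA-BOUN, and `Token`, `parse_conllu`, and `attachment_scores` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str    # surface form of the word
    upos: str    # universal part-of-speech tag
    head: int    # 1-based index of the head token, 0 for the root
    deprel: str  # dependency relation to the head


def parse_conllu(block):
    """Parse one CoNLL-U sentence block into a list of Token objects."""
    tokens = []
    for line in block.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not cols[0].isdigit():
            continue
        tokens.append(Token(cols[1], cols[3], int(cols[6]), cols[7]))
    return tokens


def attachment_scores(gold, pred):
    """Return (UAS, LAS) for two parses of the same tokenized sentence."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    # UAS: fraction of tokens whose predicted head is correct.
    uas_hits = sum(g.head == p.head for g, p in zip(gold, pred))
    # LAS: fraction of tokens whose head and relation label are both correct.
    las_hits = sum(
        g.head == p.head and g.deprel == p.deprel for g, p in zip(gold, pred)
    )
    return uas_hits / len(gold), las_hits / len(gold)


# Toy gold annotation (illustrative, not from the treebank).
GOLD = "\n".join([
    "# text = Ali kitabı okudu",
    "1\tAli\tAli\tPROPN\t_\t_\t3\tnsubj\t_\t_",
    "2\tkitabı\tkitap\tNOUN\t_\t_\t3\tobj\t_\t_",
    "3\tokudu\toku\tVERB\t_\t_\t0\troot\t_\t_",
])

# Toy prediction: token 1 has a wrong head, token 2 has a wrong label.
PRED = "\n".join([
    "1\tAli\tAli\tPROPN\t_\t_\t2\tnsubj\t_\t_",
    "2\tkitabı\tkitap\tNOUN\t_\t_\t3\tnmod\t_\t_",
    "3\tokudu\toku\tVERB\t_\t_\t0\troot\t_\t_",
])

gold_tokens = parse_conllu(GOLD)
pred_tokens = parse_conllu(PRED)
uas, las = attachment_scores(gold_tokens, pred_tokens)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # prints "UAS=0.67 LAS=0.33"
```

In practice the scores would be aggregated over all tokens in a held-out test set rather than a single sentence. Comparing them across genres or centuries is one way to make the domain and time-period effects mentioned above measurable.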
Outlook and Availability
This study provides important foundations for NLP in historical Turkish. The presented resources and models, along with text preprocessing tools, are publicly available at https://huggingface.co/bucolin. They are intended to serve as a basis for future research and to promote the development of more powerful NLP methods for historical texts. Further research is needed to address domain adaptation and language variation and to improve model performance.
Bibliography
Özateş, Ş. B., Tıraş, T. E., Adak, E. E., Doğan, B., Karagöz, F. B., Genç, E. E., & Taşdemir, E. F. B. (2025). Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models. arXiv preprint arXiv:2501.04828v1.
Çöltekin, Ç., Doğruöz, A. S., & Çetinoğlu, Ö. (2023). Resources for Turkish natural language processing: A critical survey. Language Resources and Evaluation, 57, 449–488.
Gökçeoğlu, M., Çöltekin, Ç., & Sever, H. (2022). Multi-label Text Classification on Ottoman Turkish. Proceedings of the International Conference on Computational Linguistics.