AfriHate: A Multilingual Dataset for Hate Speech Detection in African Languages

AfriHate: A Milestone in Detecting Hate Speech and Offensive Language in African Languages

Hate speech and offensive language are global problems, and detecting and moderating them effectively requires a deep understanding of the socio-cultural context. In the Global South in particular, content is often either under-moderated or censored through keyword matching that ignores context. Moderation also tends to focus on prominent personalities, while targeted hate campaigns against minorities are overlooked. These deficits stem primarily from the lack of high-quality data in local languages and from the lack of involvement of local communities in data collection, annotation, and moderation.

AfriHate, a multilingual collection of datasets on hate speech and offensive language in 15 African languages, addresses this problem. Each dataset was annotated by native speakers familiar with the local culture. The languages include Algerian Arabic, Amharic, Igbo, Kinyarwanda, Hausa, Moroccan Arabic, Nigerian Pidgin, Oromo, Somali, Swahili, Tigrinya, Twi, isiXhosa, Yorùbá, and isiZulu. This allows for a nuanced view of hate speech and offensive language that goes beyond simple keyword detection and takes cultural context into account.

The Challenges of Data Collection and Annotation

Creating AfriHate involved several challenges. Collecting data on hate speech and offensive language is complex and time-consuming. Researchers often resort to keywords, hashtags, or user accounts to build datasets, but insights from moderators and affected communities are also essential. Accounting for different dialects and ensuring balanced representation of different population groups pose further hurdles. Data collection is particularly difficult for low-resource languages.

Classification Baselines and the Influence of LLMs

As part of the project, various classification baselines were run with and without Large Language Models (LLMs). The results show that performance depends strongly on the language. Multilingual models can improve performance in low-resource settings. While LLMs offer potential, they also present challenges, such as a tendency to hallucinate and biases inherited from their training data. The findings underscore the need to consider the nuances of individual languages and adapt models accordingly.
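To make the per-language comparison concrete, the following is a minimal sketch of how a baseline's predictions could be scored separately for each language. It assumes macro-averaged F1 as the metric and a three-way label set (hate, offensive, neutral); both are illustrative assumptions rather than details confirmed here, and the rows are toy data invented for the example, not AfriHate results.

```python
from collections import defaultdict

# Label set assumed for illustration only.
LABELS = ("hate", "offensive", "neutral")


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that never appears in gold or pred contributes 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def macro_f1_by_language(rows):
    """rows: iterable of (language, gold_label, predicted_label) triples."""
    grouped = defaultdict(lambda: ([], []))
    for lang, gold, pred in rows:
        grouped[lang][0].append(gold)
        grouped[lang][1].append(pred)
    return {lang: macro_f1(g, p) for lang, (g, p) in grouped.items()}


# Toy predictions, invented purely to exercise the functions.
toy_rows = [
    ("hau", "hate", "hate"),
    ("hau", "offensive", "offensive"),
    ("hau", "neutral", "neutral"),
    ("twi", "hate", "neutral"),
    ("twi", "offensive", "offensive"),
    ("twi", "neutral", "neutral"),
]
scores = macro_f1_by_language(toy_rows)
for lang, score in sorted(scores.items()):
    print(f"{lang}: macro-F1 = {score:.3f}")
```

Reporting one score per language, rather than a single pooled number, keeps a strong result on a better-resourced language from masking weak performance elsewhere.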

The Significance of AfriHate for Research and Society

The release of AfriHate, including the datasets, individual annotations, and manually curated lexica for hate speech and offensive language, represents an important contribution to the research community. These resources provide a basis for further research on hate speech and offensive language, African languages, and the study of dissent. They enable the development of tools for the automatic moderation of online content, improving both efficiency and contextual understanding. Furthermore, they can contribute to the development of AI-powered sentiment analysis systems, which can be useful for various applications, such as social media monitoring. Ultimately, AfriHate contributes to creating a safer online environment and protecting freedom of expression in the digital space.
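Because the individual annotations are released alongside the datasets, users can choose their own aggregation strategy. The sketch below shows one common option, a simple majority vote that leaves ties unresolved. The function name, the label strings, and the tie-handling policy are hypothetical choices made for illustration, not the procedure used by the AfriHate authors.

```python
from collections import Counter


def majority_label(annotations, tie_label=None):
    """Return the most frequent label, or tie_label if there is no clear winner."""
    counts = Counter(annotations).most_common()
    if not counts:
        return tie_label
    # A tie for first place means the annotators did not reach a majority.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]


# Hypothetical per-post annotations from three native-speaker annotators.
posts = {
    "post_1": ["hate", "hate", "neutral"],
    "post_2": ["offensive", "neutral", "hate"],
    "post_3": ["neutral", "neutral", "neutral"],
}
resolved = {post_id: majority_label(labels) for post_id, labels in posts.items()}
print(resolved)
```

Posts without a clear majority come back as `None` here, so they can be reviewed manually or studied as cases of genuine annotator disagreement instead of being silently forced into one class.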

Bibliography:
Muhammad, S. H., et al. (2025). AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages. arXiv preprint arXiv:2501.08284v1.
"AfriHate: Hate and Offensive Speech Detection for African Languages." IRCAI Global Top 100 List.
Aliyu, S. M., Wajiga, G. M., & Murtala, M. (2024). A multilingual dataset for offensive language and hate speech detection for Hausa, Yoruba and Igbo languages. arXiv preprint arXiv:2406.02169.
Yimam, S. M. Curriculum Vitae. University of Hamburg.
"2021 African Language Awardees." Lacuna Fund.
Nehdi, T. M. LinkedIn Post.
"Nehdi/TuniziBigBench · Datasets at Hugging Face."
ChatPaper. "AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages."
Ousidhoum, N., et al. (2023). Multilingual Hate Speech and Offensive Language Detection. ResearchGate.