COIG-P: A New Large-Scale Chinese Preference Dataset

The development of large language models (LLMs) is progressing rapidly, and aligning these models with human values and preferences is crucial for their successful deployment. In Chinese natural language processing, however, high-quality preference datasets for training such models have been scarce. COIG-P (Chinese Open Instruction Generalist - Preference), a new, large-scale, high-quality Chinese preference dataset, now fills this gap.
Challenges and Solutions
Previous Chinese preference datasets suffered from several problems: small size, limited topic diversity, and a lack of data validation. Moreover, creating such datasets was difficult to scale because of the heavy reliance on human annotation. COIG-P addresses these challenges with a fully automated, LLM-based pipeline for dataset creation that requires no human annotation.
For COIG-P, 92,000 high-quality Chinese queries were first collected and carefully filtered. Then 15 leading LLMs generated response pairs in a chosen-rejected format, each pair consisting of a preferred and a dispreferred response. These pairs were subsequently scored automatically by LLM judges, without any human intervention.
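The pipeline above can be sketched in a few lines. This is a minimal, hypothetical illustration: the `score` function here is a length-based stand-in, whereas the real pipeline queries LLM judges, and the pairing and thresholding details are assumptions, not the paper's exact procedure.

```python
from itertools import combinations

def score(query: str, response: str) -> float:
    """Placeholder judge: the real pipeline asks LLM judges to rate the response."""
    return float(len(response))  # length heuristic, for illustration only

def build_preference_pairs(query, candidate_responses, margin=2.0):
    """Pair up candidate responses; keep pairs whose score gap is at least `margin`."""
    scored = [(r, score(query, r)) for r in candidate_responses]
    pairs = []
    for (r1, s1), (r2, s2) in combinations(scored, 2):
        if abs(s1 - s2) >= margin:
            chosen, rejected = (r1, r2) if s1 > s2 else (r2, r1)
            pairs.append({"query": query, "chosen": chosen, "rejected": rejected})
    return pairs

pairs = build_preference_pairs(
    "什么是机器学习？",
    ["机器学习是人工智能的一个分支，让计算机从数据中学习规律。", "不知道。"],
)
```

In the real pipeline, each of the many candidate responses comes from a different LLM, and the score-gap threshold is what makes the chosen-rejected distinction reliable without human review.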
Scope and Structure of COIG-P
COIG-P comprises 1,009,000 Chinese preference pairs covering six domains: chat, code, math, logic, novels, and role-playing. This broad thematic diversity enables comprehensive training of LLMs and improves their ability to handle a wide range of requests.
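A single preference pair in such a dataset can be pictured as a small record. The field names below are assumptions for illustration, not the released schema; consult the official COIG-P release for the exact format.

```python
# Illustrative shape of one preference record (field names are hypothetical).
record = {
    "domain": "math",          # one of: chat, code, math, logic, novel, role-play
    "query": "计算 12 × 8。",
    "chosen": "12 × 8 = 96。",
    "rejected": "12 × 8 = 86。",
}

def validate(rec: dict) -> bool:
    """Basic sanity checks a preference record should pass."""
    required = {"domain", "query", "chosen", "rejected"}
    assert required <= rec.keys(), f"missing fields: {required - rec.keys()}"
    assert rec["chosen"] != rec["rejected"], "a pair needs two distinct responses"
    return True
```

Records of this shape are what preference-optimization methods such as DPO consume during alignment training.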
The Chinese Reward Model (CRM) and CRBench
To reduce the cost of having LLMs score response pairs, the researchers trained an 8-billion-parameter Chinese Reward Model (CRM). In addition, they curated a Chinese Reward Benchmark (CRBench) to evaluate the CRM's performance. Tests on AlignBench show that COIG-P delivers significantly better results than other Chinese preference datasets, improving the performance of models such as Qwen2/2.5 and Infinity-Instruct-3M-0625 by 2% to 12%.
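A reward model exposes a simple interface: it maps a (query, response) pair to a scalar reward, and reward models are commonly trained with a Bradley-Terry objective on preference pairs. The sketch below assumes that standard setup; the `reward` function is a placeholder heuristic, whereas the real CRM is an 8B-parameter neural network.

```python
import math

# Placeholder for a reward model's scoring interface (illustration only;
# the real CRM is an 8B-parameter LLM, not a length heuristic).
def reward(query: str, response: str) -> float:
    return 1.0 if len(response) > 10 else -1.0

def preference_prob(query: str, chosen: str, rejected: str) -> float:
    """Bradley-Terry probability that `chosen` beats `rejected` under the reward."""
    diff = reward(query, chosen) - reward(query, rejected)
    return 1.0 / (1.0 + math.exp(-diff))

p = preference_prob(
    "介绍一下长城。",
    "长城是中国古代修建的防御工程，全长超过两万公里。",
    "不知道。",
)
```

Because scoring with a dedicated 8B model is far cheaper than querying a frontier LLM judge for every pair, this interface is what makes large-scale automatic annotation affordable.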
Evaluation and Results
The results on CRBench demonstrate the strong and robust evaluation capability of the CRM. In experiments, the CRM was used to identify low-quality response pairs in a held-out COIG-P test set, and its judgments proved comparable to GPT-4o's while being more efficient and cost-effective.
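Using a reward model to flag low-quality pairs amounts to keeping only pairs where the chosen response outscores the rejected one by a margin. This is a minimal sketch of that idea; the toy length-based reward and the margin value are assumptions, not the paper's actual filtering criterion.

```python
def filter_pairs(pairs, reward_fn, margin=2.0):
    """Keep pairs where the chosen response outscores the rejected one by `margin`."""
    kept = []
    for p in pairs:
        gap = reward_fn(p["query"], p["chosen"]) - reward_fn(p["query"], p["rejected"])
        if gap >= margin:
            kept.append(p)
    return kept

# Toy reward: longer answers score higher (illustration only).
toy_reward = lambda q, r: float(len(r))

candidates = [
    {"query": "解释递归。", "chosen": "递归是函数调用自身来解决子问题的方法。", "rejected": "不会。"},
    {"query": "你好", "chosen": "你好！", "rejected": "你好。"},  # no clear winner
]
good = filter_pairs(candidates, toy_reward, margin=2.0)
```

With a trained reward model in place of `toy_reward`, the same loop scales to filtering a million-pair corpus at a fraction of the cost of an LLM-judge pass.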
Outlook and Significance
COIG-P represents a significant advancement in the field of Chinese natural language processing. The dataset and the associated reward model provide developers with valuable resources for training and improving LLMs. The automated creation of COIG-P also opens up new possibilities for scaling and diversifying preference datasets in the future.