Unveiling the Mechanisms of Long Chain-of-Thought Reasoning in Large Language Models

Large language models (LLMs) continue to impress with their ability to solve complex tasks and generate human-like text. A key driver of this success is "Chain-of-Thought" (CoT) reasoning, a process in which the model works through a problem step by step and verbalizes its solution path. Particularly fascinating is "long" CoT reasoning, which enables LLMs to handle multi-step problems, correct their own errors, and even backtrack. Researchers are intensively investigating how long CoT reasoning emerges in LLMs and how it can be optimized.
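A constructed example (not taken from the study) illustrates what such a trace can look like; note the intermediate steps and the self-check that distinguish a long CoT from a plain one-shot answer:

```
Question: What is 17 × 24?
Model (long CoT): First, 17 × 20 = 340. Next, 17 × 4 = 68.
So 340 + 68 = 408. Wait, let me double-check by dividing:
408 ÷ 24 = 17. That matches, so the answer is 408.
```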

A recent study examines the mechanisms behind long CoT reasoning and identifies the key factors that enable LLMs to generate such complex thought processes. The researchers conducted extensive experiments with supervised fine-tuning (SFT) and reinforcement learning (RL) and reached several notable findings.

SFT, fine-tuning the model on labeled demonstrations, is not strictly necessary, but it simplifies training and improves efficiency. Interestingly, reasoning ability tends to grow as training compute increases; however, this growth is not guaranteed. "Reward shaping," the deliberate design of reward signals during RL training, therefore plays a crucial role in stabilizing the growth of CoT length.
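As a rough illustration of what such reward shaping can look like, the sketch below combines a correctness reward with a length-dependent cosine term that dampens uncontrolled growth. The function, constants, and schedule are hypothetical choices for illustration, not the study's actual reward.

```python
import math

def shaped_reward(is_correct: bool, cot_length: int, max_length: int = 4096) -> float:
    """Hypothetical shaped reward for RL on chain-of-thought traces.

    Correct answers earn more when reached with a short trace; incorrect
    answers are penalized less as the trace grows (the model explored).
    A cosine schedule interpolates between these endpoints, which is one
    way to keep CoT length from growing without bound.
    """
    t = min(cot_length, max_length) / max_length  # normalized length in [0, 1]
    cos_term = math.cos(t * math.pi)              # 1 at t=0, -1 at t=1
    if is_correct:
        # reward shrinks from 1.0 toward 0.5 as the trace gets longer
        return 0.75 + 0.25 * cos_term
    # penalty shrinks in magnitude from -1.0 toward -0.5 for longer traces
    return -0.75 - 0.25 * cos_term

# Example: a short correct trace (~0.995) beats a maximally long one (0.5)
print(shaped_reward(True, 256), shaped_reward(True, 4096))
```

The design intent: a correct answer reached quickly earns the highest reward, while a wrong answer after a long trace is penalized least, so the model is not discouraged from reasoning longer on problems it finds hard.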

Another important factor is scaling up verifiable reward signals in RL. The study found that noisy solutions extracted from the web, combined with filtering mechanisms, hold great potential, especially for tasks outside the training distribution (out-of-distribution, OOD), such as STEM reasoning (science, technology, engineering, and mathematics).
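A minimal sketch of such a filtering step, assuming answers mined from noisy web solutions and a simple majority-vote heuristic (both the heuristic and the threshold are illustrative assumptions, not the study's pipeline):

```python
import re
from collections import Counter

def extract_final_answer(solution_text: str) -> str | None:
    """Pull a final numeric answer out of a noisy, web-mined solution.
    Very crude heuristic: take the last number in the text."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution_text)
    return numbers[-1] if numbers else None

def filter_noisy_labels(mined_solutions: list[str], min_agreement: float = 0.6) -> str | None:
    """Keep a mined reference answer only if enough independent solutions
    agree on it; otherwise discard the sample entirely. This mimics the
    idea of filtering noisy but verifiable reward signals before RL."""
    answers = [a for a in map(extract_final_answer, mined_solutions) if a is not None]
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= min_agreement else None

# Example: three mined solutions, two agree on "42", so the sample is kept
print(filter_noisy_labels(["...so x = 42", "thus 42", "answer: 41"]))  # -> "42"
```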

The research also shows that basic abilities such as error correction are already present in base models. However, effectively eliciting these abilities for complex tasks via RL requires substantial compute, and measuring how these abilities develop calls for a nuanced approach.
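To make the measurement problem concrete, one crude probe is to count surface markers of self-correction in generated traces. This is an illustrative heuristic, not the study's methodology, and it shows exactly why nuance is needed: marker frequency alone does not reveal whether a correction actually helped.

```python
# Hypothetical probe for self-correction behaviour in generated CoT traces.
CORRECTION_MARKERS = ("wait", "let me check", "actually", "on second thought")

def correction_rate(cot_traces: list[str]) -> float:
    """Fraction of traces containing at least one self-correction marker."""
    hits = sum(
        any(marker in trace.lower() for marker in CORRECTION_MARKERS)
        for trace in cot_traces
    )
    return hits / len(cot_traces) if cot_traces else 0.0

traces = [
    "Compute 340 + 68 = 408. Wait, let me check: 408 / 24 = 17. Correct.",
    "The answer is 399.",
]
print(correction_rate(traces))  # -> 0.5
```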

The results of this study provide valuable practical guidance for optimizing training strategies to improve long CoT reasoning in LLMs. A better understanding of these mechanisms is crucial to unlock the full potential of LLMs and make them usable for even more complex tasks. This is particularly relevant for companies like Mindverse, which specialize in the development of AI solutions, including chatbots, voicebots, AI search engines, and knowledge systems. By integrating the latest research findings, these systems can be made even more powerful and efficient to meet the demands of the future.

Bibliography:
- https://huggingface.co/papers
- https://arxiv.org/html/2401.14295v3
- https://arxiv.org/abs/2401.14295
- https://www.thoughtworks.com/en-de/insights/blog/generative-ai/demystifying-deepseek
- https://openreview.net/forum?id=b2XfOm3RJa
- https://aclanthology.org/2025.coling-main.719.pdf
- https://github.com/RUCAIBox/LLMSurvey
- https://www.sciencedirect.com/science/article/pii/S0885230824000834
- https://qdata.github.io/deep2Read/fmadapt/L20/
- https://aclanthology.org/2023.emnlp-main.41.pdf