Intuitive Thinking versus Analytical Approach: S1-Bench Evaluates System-1 Capabilities of Large Language Models
Large language models (LLMs) have made remarkable progress on complex reasoning tasks in recent years. Through explicit intermediate reasoning steps, known as "chain-of-thought" reasoning, they can work through complex problems and draw logical conclusions. However, this focus on analytical thinking, also called System-2 thinking, may obscure a weakness in another area: fast, intuitive System-1 thinking. To address this gap, researchers developed S1-Bench, a new benchmark that evaluates LLMs on simple tasks requiring intuitive thinking.
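To make the contrast concrete, here is a minimal, purely illustrative sketch; the question and prompt wording are assumptions for illustration, not items from S1-Bench. It contrasts a direct, System-1-style prompt with an explicit chain-of-thought prompt for the same trivial question:

```python
# Purely illustrative: two prompting styles for the same trivially simple
# question. The prompt wording is hypothetical, not taken from S1-Bench.
question = "Which is larger, 7 or 3?"

# System-1 style: ask for the answer directly, with no intermediate steps.
direct_prompt = f"{question}\nAnswer with a single word."

# System-2 style: elicit explicit chain-of-thought reasoning.
cot_prompt = f"{question}\nLet's think step by step, then state the answer."

print(direct_prompt)
print(cot_prompt)
```

A reasoning-tuned model given the second prompt will typically produce a long trace for a question the first prompt answers in one word, which is exactly the inefficiency S1-Bench is designed to expose.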
System-1 thinking is characterized by speed, automaticity, and low cognitive load. It is the kind of thinking we use to make everyday decisions, recognize faces, or solve simple arithmetic problems. System-2 thinking, by contrast, is slow, analytical, and requires conscious effort. While LLMs excel at System-2 tasks, their performance on System-1 tasks has received little systematic attention so far.
S1-Bench offers a set of simple, diverse, and intuitively answerable questions spanning multiple domains and languages, designed to probe the System-1 capabilities of LLMs. The initial results of a comprehensive evaluation of 22 LLMs are revealing: the models are markedly less efficient than smaller, traditional language models, producing responses that were on average 15.5 times longer. Moreover, the LLMs often identified the correct answer early on but continued reasoning unnecessarily, and some models accumulated numerous errors along the way.
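As a rough illustration of the two reported effects, here is a minimal Python sketch of how one might measure a response-length ratio and detect an early correct answer inside a reasoning trace. It assumes whitespace tokenization and verbatim answer matching; neither is the paper's actual evaluation protocol.

```python
# Hypothetical sketch of two S1-Bench-style measurements (not the paper's
# evaluation code): average response-length ratio against a small baseline
# model, and how early the correct answer first appears in a trace.

def length_ratio(reasoning_outputs: list[str], baseline_outputs: list[str]) -> float:
    """Average length of reasoning-model outputs relative to baseline outputs."""
    avg_reasoning = sum(len(o.split()) for o in reasoning_outputs) / len(reasoning_outputs)
    avg_baseline = sum(len(o.split()) for o in baseline_outputs) / len(baseline_outputs)
    return avg_reasoning / avg_baseline

def first_answer_position(trace: str, gold_answer: str) -> float | None:
    """Fraction of the trace already generated when the gold answer first appears.

    Uses a crude verbatim match; real evaluation would need answer extraction.
    A value well below 1.0 means the model found the answer early and then
    kept reasoning anyway.
    """
    idx = trace.lower().find(gold_answer.lower())
    if idx == -1:
        return None  # the answer never appears verbatim in the trace
    return idx / len(trace)

# Example: the model states the correct answer near the start of the trace,
# then continues to "verify" it at length.
trace = ("The answer is 4. But let me double-check: 2 plus 2... "
         "considering edge cases... yes, after careful analysis, 4.")
print(first_answer_position(trace, "4"))  # ~0.12 -> early hit, long tail
```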
These results point to rigid thinking patterns in current LLMs. They underscore the need for further development toward balanced thinking that encompasses both System-1 and System-2 capabilities and adapts flexibly to the complexity of the task. The ability to switch between intuitive and analytical thinking is crucial for genuinely intelligent AI systems. S1-Bench provides a valuable tool for measuring progress in this area and for steering the development of LLMs toward a more comprehensive cognitive architecture.
The development of S1-Bench is an important step in highlighting the limitations of current LLMs and directing research towards more flexible and efficient AI models. Future research should focus on identifying the causes of the observed weaknesses in System-1 tasks and developing new training methods that promote a balance between intuitive and analytical thinking.
Bibliography:
- Zhang, Wenyuan, et al. "S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models." arXiv preprint arXiv:2504.10368 (2025).
- ChatPaper. chatpaper.com/chatpaper/?id=3&date=1744646400&page=1.
- "L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution." ResearchGate.
- "System-2 Research." GitHub, open-thought/system-2-research.