VCR-Bench: A New Benchmark for Video Reasoning in AI

Artificial intelligence (AI) is making rapid progress, particularly in visual understanding and reasoning. Chain-of-Thought (CoT) reasoning, a method that lets AI models work through a problem step by step, has significantly improved the performance of large language models (LLMs) and large vision-language models (LVLMs). But what about applying CoT to videos? Here, there is a gap in how the capabilities of current AI systems are evaluated and understood.
Existing benchmarks for video understanding mostly focus on simple tasks such as object recognition or action classification. They do not provide a sufficient basis for evaluating the complex thought processes of AI models when analyzing videos; in particular, they cannot distinguish whether errors stem from perception or from the actual reasoning process. To close this gap, VCR-Bench was developed, a new benchmark designed specifically to comprehensively evaluate the video CoT capabilities of LVLMs.
VCR-Bench: Structure and Methodology
VCR-Bench comprises 859 videos of varying length and content, along with 1,034 carefully curated question-answer pairs. Each pair is manually annotated with a step-by-step CoT rationale, and each step is tagged as either a perception step or a reasoning step. This detailed annotation allows for a precise analysis of the strengths and weaknesses of AI models.
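To make the annotation scheme concrete, here is a minimal sketch of how such a sample could be represented in code. The field names, types, and example values are illustrative assumptions, not the benchmark's actual release format.

```python
# Illustrative sketch of a VCR-Bench-style annotated sample.
# Field names and values are assumptions for clarity, not the real data schema.
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class CoTStep:
    text: str                                  # one step of the annotated rationale
    kind: Literal["perception", "reasoning"]   # step-level tag described in the article

@dataclass
class VCRBenchSample:
    video_path: str           # path or ID of the source video
    question: str             # question about the video
    answer: str               # reference answer
    task_area: str            # one of the seven task dimensions
    rationale: List[CoTStep]  # step-by-step CoT annotation

sample = VCRBenchSample(
    video_path="videos/0001.mp4",
    question="Why does the person open the umbrella?",
    answer="Because it starts to rain.",
    task_area="causal reasoning",
    rationale=[
        CoTStep("Raindrops appear on the window in the second scene.", "perception"),
        CoTStep("The person reacts to the rain by opening the umbrella.", "reasoning"),
    ],
)
```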
To test LVLMs on different aspects of video reasoning, seven task dimensions have been defined. They cover, among other things, the understanding of temporal sequences, spatial relationships, causality, and intentions. The entire CoT process is evaluated with a CoT score, which is computed from the step-by-step tagged CoT rationales.
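The paper's exact scoring protocol is not reproduced here, but the basic idea of a step-tagged CoT score can be sketched: compare a model's rationale against the annotated reference steps and aggregate separately per tag. The word-overlap matching heuristic and the simple averaging below are my own simplifying assumptions; the benchmark itself judges step correctness with a much more careful protocol.

```python
# Minimal sketch of a step-tagged CoT scoring scheme in the spirit of the article.
# The matching heuristic and the aggregation are assumptions, not VCR-Bench's method.
from typing import List, Tuple

Step = Tuple[str, str]  # (text, tag), tag in {"perception", "reasoning"}

def step_matches(predicted: str, reference: str) -> bool:
    # Toy heuristic: word-overlap ratio; a placeholder for proper step judging.
    p, r = set(predicted.lower().split()), set(reference.lower().split())
    return len(p & r) / max(len(r), 1) >= 0.5

def cot_score(predicted: List[Step], reference: List[Step]) -> dict:
    scores = {}
    for tag in ("perception", "reasoning"):
        refs = [t for t, k in reference if k == tag]
        preds = [t for t, k in predicted if k == tag]
        # Recall over tagged reference steps: how many were covered by the model.
        hit = sum(any(step_matches(p, r) for p in preds) for r in refs)
        scores[tag] = hit / len(refs) if refs else 1.0
    scores["overall"] = (scores["perception"] + scores["reasoning"]) / 2
    return scores
```

Splitting the score by tag is what allows the analysis described below, i.e. telling whether a model fails at perceiving the video or at reasoning over what it perceived.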
Initial Results and Insights
Comprehensive tests of current LVLMs on VCR-Bench show that there is still considerable room for improvement. Even the best-performing model achieves only a CoT score of 62.8% and an accuracy of 56.7%, and most models score below 40%. Interestingly, the models perform worse on average on perception steps than on the actual reasoning steps, which suggests that processing spatio-temporal information in videos is a central challenge for current LVLMs.
The results also show a strong positive correlation between the CoT score and accuracy, which supports the validity of the VCR-Bench framework and underscores the importance of CoT reasoning for solving complex video reasoning tasks.
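As a small illustration, such a relationship can be checked with a plain Pearson correlation over per-model results. The numbers below are hypothetical placeholders (only the top model's 62.8% / 56.7% figures come from the article).

```python
# Quick check of the CoT-score/accuracy relationship via Pearson correlation.
# Per-model numbers are made-up placeholders, not actual VCR-Bench results.
from statistics import mean, stdev

cot_scores = [62.8, 55.1, 48.3, 39.7, 31.2]   # CoT scores in %
accuracies = [56.7, 50.4, 44.9, 35.8, 29.5]   # accuracies in %

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

print(f"Pearson r = {pearson(cot_scores, accuracies):.3f}")  # close to 1 = strong positive correlation
```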
Outlook and Significance for AI Research
VCR-Bench gives AI research a valuable tool for evaluating and further developing LVLMs. The benchmark makes it possible to uncover the actual weaknesses of models on complex video reasoning tasks and to target improvements accordingly. VCR-Bench is intended to serve as a standardized evaluation framework and to drive the development of more capable AI systems for video understanding. This is relevant not only for academic research but also for numerous application areas such as intelligent assistance systems, autonomous vehicles, and automated video analysis.
Bibliography:
Qi, Y., Zhao, Y., Zeng, Y., Bao, X., Huang, W., Chen, L., Chen, Z., Zhao, J., Qi, Z., & Zhao, F. (2025). VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning. arXiv preprint arXiv:2504.07956.
anonymous. (2023). Chain-of-Verification Reduces Hallucination in Large Language Models. arXiv preprint arXiv:2311.16103.
anonymous. (2025). Improving Factuality and Reasoning in Language Models through Chain-of-Verification. arXiv preprint arXiv:2503.12605.
anonymous. (2024). Chain of thought prompting elicits reasoning in large language models.
anonymous. (2024). Video-of-Thought.
anonymous. (2025). Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
anonymous. (2024). Advances in Neural Information Processing Systems 37 (NeurIPS 2024).
anonymous. (2024). A Comprehensive Survey on Chain-of-Thought Prompting for Large Language Models: Techniques, Applications, and Challenges. IEEE Transactions on Pattern Analysis and Machine Intelligence.
anonymous. (2024). Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Capacity. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Scofield7419. (2024). Video-of-Thought. GitHub repository.
anonymous. (2024). Chain-of-Thought as a Cognitive Probe for Large Language Models: A Diagnostic Tool for Revealing Strengths and Weaknesses. IEEE Access.