VisualPuzzles Benchmark Tests Multimodal Reasoning in AI Models

Evaluating AI models' multimodal reasoning, that is, their ability to combine and process information from different modalities such as text and images, remains a significant challenge. Existing benchmarks often entangle reasoning with domain-specific knowledge, which makes it difficult to isolate and measure general reasoning ability, especially in scenarios that require no expert knowledge. A new benchmark called VisualPuzzles addresses this problem by focusing on visual reasoning while minimizing the need for specialized knowledge.

VisualPuzzles comprises questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning. A significant portion of the questions originates from logic puzzles of the Chinese Civil Service Examination, manually translated into English. This design makes it possible to assess whether AI models can carry out complex reasoning without relying on extensive prior knowledge.
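
To make the setup concrete, here is a minimal sketch of how such a multiple-choice benchmark could be represented and scored per reasoning category. The field names, file names, and the `model_predict` interface are illustrative assumptions, not the benchmark's actual schema.

```python
from collections import defaultdict

# Hypothetical item format: each puzzle pairs an image with a
# multiple-choice question and one of the five reasoning categories.
puzzles = [
    {"image": "puzzle_001.png", "question": "Which figure completes the sequence?",
     "options": ["A", "B", "C", "D"], "answer": "C", "category": "inductive"},
    {"image": "puzzle_002.png", "question": "Which cube matches the unfolded net?",
     "options": ["A", "B", "C", "D"], "answer": "A", "category": "spatial"},
]

def evaluate(model_predict, puzzles):
    """Score a model per category. `model_predict` is any callable
    mapping (image_path, question, options) -> chosen option."""
    correct, total = defaultdict(int), defaultdict(int)
    for p in puzzles:
        guess = model_predict(p["image"], p["question"], p["options"])
        total[p["category"]] += 1
        correct[p["category"]] += int(guess == p["answer"])
    return {cat: correct[cat] / total[cat] for cat in total}

# Trivial baseline that always picks the first option:
print(evaluate(lambda img, q, opts: opts[0], puzzles))
```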

VisualPuzzles Compared to Other Benchmarks

Comparisons with established benchmarks such as MMMU show that VisualPuzzles requires substantially less domain-specific knowledge while demanding more complex reasoning. This allows a more precise evaluation of a model's actual multimodal reasoning ability: whereas benchmarks like MMMU often test factual recall, VisualPuzzles focuses on drawing logical conclusions from visual and textual information.

Challenges for Current AI Models

Evaluations show that even state-of-the-art multimodal language models lag significantly behind human performance on VisualPuzzles. Notably, strong performance on knowledge-intensive benchmarks does not reliably predict success on knowledge-light reasoning tests like VisualPuzzles. This underscores the need to evaluate the reasoning abilities of AI models separately from their recall of domain knowledge.
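
One way to quantify this observation is to check whether model rankings on a knowledge-heavy benchmark predict rankings on a knowledge-light one. The sketch below computes a Spearman rank correlation over placeholder scores; the numbers are invented purely for illustration and are not published results.

```python
def rank(xs):
    """Assign ranks (1 = lowest value); ties are ignored for brevity."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for r, i in enumerate(order, start=1):
        ranks[i] = float(r)
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation, assuming no ties."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank(xs), rank(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Placeholder accuracies for five hypothetical models:
knowledge_scores = [71.2, 68.5, 64.0, 60.3, 55.1]  # knowledge-heavy benchmark
reasoning_scores = [41.0, 47.5, 39.2, 44.8, 42.1]  # knowledge-light benchmark
print(f"Spearman rho: {spearman(knowledge_scores, reasoning_scores):.2f}")
```

A coefficient near zero, as in this toy example, would indicate that the two benchmarks measure largely distinct capabilities.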

Influence of Model Size and Inference Methods

The study also examined the influence of model size and of inference strategies such as "thinking" modes, which spend additional computation at inference time. The results show inconsistent improvements across models and task types, and no clear correlation between model size and performance emerged. This suggests that simply scaling models up does not necessarily improve multimodal reasoning.
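
As a rough illustration, a "thinking" mode can be approximated by prompting a model to reason step by step and granting it a larger token budget before a final answer is extracted. The `query_model` function below is a hypothetical stand-in for any multimodal chat API, not a real client.

```python
def query_model(image_path: str, prompt: str, max_tokens: int) -> str:
    # Hypothetical placeholder: replace with a real multimodal model client.
    raise NotImplementedError("replace with your model client")

def answer_direct(image_path, question, options):
    # Direct mode: ask for the answer with a minimal token budget.
    prompt = f"{question}\nOptions: {', '.join(options)}\nAnswer with one letter."
    return query_model(image_path, prompt, max_tokens=8)

def answer_with_thinking(image_path, question, options):
    # "Thinking" mode: elicit step-by-step reasoning first, then take the
    # final letter from the last line of the reply.
    prompt = (f"{question}\nOptions: {', '.join(options)}\n"
              "Reason step by step, then give only the letter on the last line.")
    reply = query_model(image_path, prompt, max_tokens=1024)
    return reply.strip().splitlines()[-1].strip()
```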

Different Solution Approaches

Analyzing how models arrive at their answers reveals reasoning patterns that differ from those observed on more knowledge-oriented benchmarks. This highlights what sets VisualPuzzles apart: its focus on the reasoning process itself. The benchmark thus offers a perspective on AI evaluation that goes beyond factual knowledge and domain-specific expertise.

Outlook

VisualPuzzles represents an important step toward more robust and meaningful benchmarks for multimodal reasoning. The evaluation results highlight the challenges that remain in building AI models with genuine reasoning capabilities. Future research can build on VisualPuzzles to advance both the development and the evaluation of multimodal reasoning in AI systems and to gain a deeper understanding of their cognitive abilities.

Bibliography

Song, Y., Ou, T., Kong, Y., Li, Z., Neubig, G., & Yue, X. (2025). VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge. arXiv preprint arXiv:2504.10342.

Kast, D. (2024). Multimodal Inference with Deep Neural Networks. Dissertation, University of Hamburg.