MCTS-Guided Data Selection Improves Visual Reasoning in AI

Efficient Learning for Visual Reasoning: How ThinkLite-VL Achieves More with Less Data

Improving the visual reasoning abilities of AI models typically requires large amounts of training data. A new approach, presented in the paper "SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement", shows that substantial performance gains can be achieved with far less data. The key lies in selecting the right training examples, using a method called Monte Carlo Tree Search (MCTS).

The Challenge of Data Selection

Training vision-language models (VLMs) for visual reasoning through Reinforcement Fine-Tuning (RFT) is computationally intensive and typically requires extensive datasets. The efficiency of this process depends heavily on the quality and difficulty of the training data: more challenging examples can accelerate the model's learning, but identifying genuinely challenging examples is often difficult.

MCTS: A New Approach to Evaluating Data Difficulty

The researchers propose an innovative application of MCTS to quantify the difficulty of training examples. MCTS, a search method best known from game-playing AI systems, makes it possible to measure the effort a model needs to solve a particular problem: the more MCTS iterations required to reach a correct solution, the harder the underlying problem is judged to be. This enables a targeted selection of training data that best promotes learning.
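To make the idea concrete, here is a minimal Python sketch of MCTS-based difficulty scoring. The `generate_step` and `is_correct` callables are hypothetical stand-ins for the model's reasoning-step generator and the answer checker, and the selection rule is the standard UCB1 variant; none of this is taken from the paper's released code.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                        # partial reasoning trace so far
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

def ucb(node: Node, c: float = 1.4) -> float:
    """Standard UCB1 score; unvisited nodes are explored first."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )

def mcts_difficulty(question, answer, generate_step, is_correct,
                    max_iters: int = 50, branch: int = 3) -> int:
    """Return the number of MCTS iterations needed to reach a correct
    answer; `generate_step` and `is_correct` are hypothetical hooks."""
    root = Node(state=question)
    for it in range(1, max_iters + 1):
        # Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: sample candidate next reasoning steps from the model.
        for _ in range(branch):
            node.children.append(Node(state=generate_step(node.state),
                                      parent=node))
        # Simulation: evaluate one child's final answer.
        child = random.choice(node.children)
        reward = 1.0 if is_correct(child.state, answer) else 0.0
        # Backpropagation: update statistics along the path to the root.
        n = child
        while n is not None:
            n.visits += 1
            n.value += reward
            n = n.parent
        if reward == 1.0:
            return it           # solved: fewer iterations = easier problem
    return max_iters            # budget exhausted: treated as hardest
```

The returned iteration count serves directly as the difficulty score: easy questions are solved within a few iterations, while hard ones exhaust the search budget.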

ThinkLite-VL: Proof of the Efficiency of Data-Reduced Training

Starting from a pool of 70,000 open-source training examples, the researchers demonstrate the effectiveness of their method. With MCTS-based difficulty scoring, they filtered the pool down to 11,000 particularly challenging examples and used them to fine-tune the Qwen2.5-VL-7B-Instruct model. The resulting model, ThinkLite-VL, improves on the base model's average performance by 7%, despite the drastically reduced training set and without any knowledge distillation.
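Given such a difficulty score, the filtering step itself is straightforward. The sketch below keeps only the hardest examples; the threshold and the dataset field names are illustrative assumptions, not values from the paper.

```python
def select_hard_samples(dataset, score_fn, min_iters: int = 30):
    """Keep examples whose MCTS difficulty score meets the threshold.
    `score_fn` is a difficulty scorer such as mcts_difficulty above;
    field names and threshold are hypothetical."""
    selected = []
    for ex in dataset:
        iters = score_fn(ex["question"], ex["answer"])
        if iters >= min_iters:
            selected.append(ex)
    return selected

# Usage sketch: in the paper, filtering of this style reduces a pool of
# ~70k examples to the ~11k hardest ones used for fine-tuning.
# hard_pool = select_hard_samples(pool_70k, score_fn=my_difficulty)
```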

Impressive Results Compared to the Competition

ThinkLite-VL achieves compelling benchmark results, surpassing comparable 7B visual reasoning models. Particularly noteworthy is its performance on the MathVista benchmark, where ThinkLite-VL reaches 75.1% accuracy, outperforming even much larger or proprietary models such as Qwen2.5-VL-72B, GPT-4o, and o1. This highlights the potential of MCTS-based data selection for developing more efficient and powerful AI models.

Conclusion: Less is More

The study shows that targeted selection of training data via MCTS can contribute significantly to improving the visual reasoning of AI models. ThinkLite-VL demonstrates that notable performance gains are possible with less, but carefully selected, data. This approach opens new perspectives for developing resource-efficient yet powerful AI systems.

Bibliography:
- https://arxiv.org/list/cs.CV/new
- https://chatpaper.com/chatpaper/zh-CN?id=4&date=1744300800&page=1
- https://arxiv.org/pdf/2406.07394
- https://github.com/gabrielchua/daily-ai-papers
- https://nips.cc/virtual/2024/poster/96309
- https://ml-research.github.io/people/kkersting/
- https://openreview.net/pdf/893aca5e4e6ee8109c7e6c341856e5dfc2c7a12b.pdf
- https://paperswithcode.com/paper/sra-mcts-self-driven-reasoning-aurmentation
- https://github.com/ThreeSR/Awesome-Inference-Time-Scaling
- https://www.ml.informatik.tu-darmstadt.de/
- https://huggingface.co/papers/2504.07934