Supervised Fine-Tuning vs. Reinforcement Learning: New Insights into Training Visual Language Models

Training large visual language models (LVLMs) is a complex undertaking. The currently prevailing approach, Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), is being called into question by new research findings. A recent study suggests that SFT can undermine subsequent RL by creating so-called "pseudo-reasoning paths." These mimic the reasoning processes of expert models but often produce lengthy, hesitant, and uninformative steps that lead to incorrect conclusions.
To systematically investigate these effects, the multimodal dataset "VLAA-Thinking" was developed. This dataset was created through a six-step process that includes image descriptions, distillation of reasoning processes, rewriting of answers, and their verification. VLAA-Thinking provides high-quality, step-by-step visual reasoning paths for SFT as well as a more challenging RL split from the same data source. Using this dataset, extensive experiments were conducted comparing SFT, RL, and their combinations.
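For readers who want to inspect the data themselves, the dataset is published on the Hugging Face Hub (see the sources below). The following is a minimal sketch of loading it with the `datasets` library; the repository id is taken from the linked dataset card, while the available configs, splits, and field names depend on the actual release and may require adjustment.

```python
# Minimal sketch: loading VLAA-Thinking from the Hugging Face Hub.
# The repository id comes from the dataset card linked in the sources;
# a specific config name may be required depending on the release.
from datasets import load_dataset

dataset = load_dataset("UCSC-VLAA/VLAA-Thinking")

# Inspect the splits and one example to see which fields
# (image, question, reasoning trace, answer, ...) are provided.
print(dataset)
first_split = next(iter(dataset.values()))
print(first_split[0])
```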
The results show that while SFT helps models learn reasoning formats, it often forces them into imitative, rigid thinking patterns that hinder further learning. In contrast, the RL approach, based on Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating perception and cognition signals, promotes more authentic and adaptive reasoning behavior.
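The summary above does not give the exact reward formulas, so the following is only a rough sketch of what such a mixed reward module could look like: a weighted sum of a perception signal (how well the answer is grounded in the image) and a cognition signal (whether the reasoning is well-formed and the final answer is correct). The function name, inputs, and weights are illustrative assumptions, not the paper's definition.

```python
def mixed_reward(perception_score: float, cognition_score: float,
                 w_perception: float = 0.5, w_cognition: float = 0.5) -> float:
    """Illustrative mixed reward: a weighted sum of two signals.

    perception_score: grounding quality in [0, 1], e.g. whether the answer
                      matches what is actually visible in the image.
    cognition_score:  reasoning quality in [0, 1], e.g. a format check on the
                      reasoning trace plus correctness of the final answer.
    The 0.5/0.5 weights are placeholders; the paper may weight differently.
    """
    return w_perception * perception_score + w_cognition * cognition_score

# Example: a well-grounded answer (0.9) with a sloppy reasoning trace (0.4)
print(mixed_reward(0.9, 0.4))  # 0.65
```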
Of particular note is the VLAA-Thinker model, based on Qwen2.5VL 3B. It achieved the top position on the Open LMM Reasoning Leaderboard among 4B-scale LVLMs, surpassing the previous state of the art by 1.8%. These results are promising and could provide valuable insights for the development of reasoning-capable LVLMs.
The Challenge of Reasoning for AI Models
The development of AI models that can solve complex reasoning tasks is a central challenge of current research. While SFT teaches models to recognize and reproduce existing patterns, RL aims to foster independent problem-solving behavior through reward signals. However, the study shows that combining both methods does not always lead to the desired results: the "pseudo-reasoning paths" learned through SFT can hinder RL training, because the model tends to fall back on these familiar paths instead of exploring new, potentially more effective solutions.
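To make the reward-driven side of this comparison concrete: GRPO, named above, does not train a separate value model but scores each sampled response relative to the other responses generated for the same prompt. The sketch below shows this standard group-relative normalization; it reflects the general GRPO recipe rather than any detail specific to this study.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: each response's reward is normalized by the
    mean and standard deviation of all rewards sampled for the same prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four responses to the same image-question pair, scored by the reward module
print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))
```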
Outlook and Significance for the Future of LVLMs
The findings of this study are particularly relevant for companies like Mindverse, which specialize in the development of AI solutions. Building chatbots, voicebots, AI search engines, and knowledge systems requires models that not only retrieve information but also understand complex relationships and draw conclusions. Optimizing training methods for LVLMs is therefore crucial for progress in this field. The presented results suggest that a critical review of the common SFT-then-RL paradigm is necessary and that alternative approaches, such as the presented RL approach with GRPO and a mixed reward module, are promising.
The research findings underscore the importance of continuous development and optimization of training methods for AI models. The development of robust, flexible, and truly reasoning-capable LVLMs remains an exciting challenge for the future.
Sources:
- https://arxiv.org/abs/2504.11468
- https://arxiv.org/html/2504.11468v1
- https://twitter.com/HEI/status/1912743816307634557
- https://huggingface.co/datasets/UCSC-VLAA/VLAA-Thinking
- https://www.researchgate.net/publication/390670599_VLM-R1_A_Stable_and_Generalizable_R1-style_Large_Vision-Language_Model
- https://www.reddit.com/r/LocalLLaMA/comments/1k0cpx4/sft_can_significantly_undermine_subsequent_rl_by/
- https://medium.com/@sahin.samia/deepseek-r1-explained-pioneering-the-next-era-of-reasoning-driven-ai-3eeb5ac4d4a0
- https://huggingface.co/papers?q=Open%20LMM%20Reasoning%20Leaderboard
- https://github.com/LightChen233/Awesome-Long-Chain-of-Thought-Reasoning
- https://x.com/cihangxie/status/1911840306661888247