Reinforcement Learning Enhances Deliberation in Vision-Language Models

Self-Reflection in Focus: VL-Rethinker Optimizes Decision-Making of Image-Text Models through Reinforcement Learning

Artificial intelligence (AI) is developing rapidly, especially in the field of multimodal models that can process both text and images. A promising approach to improving these models is the integration of "slow-thinking" mechanisms that enable deliberate reflection on and revision of intermediate decisions. A recent example is VL-Rethinker, an image-text model trained with reinforcement learning (RL) to strengthen its self-reflection capabilities.

The Challenge: Fast Decisions vs. Thorough Analysis

Previous "fast-thinking" models, such as GPT-4, are characterized by rapid decision-making. However, they reach their limits with more complex tasks that require deeper analysis. "Slow-thinking" models like GPT-o1 and DeepSeek-R1, on the other hand, which integrate explicit reflection, show significantly better results in demanding mathematical and scientific benchmarks. In the multimodal domain, i.e., the combined processing of image and text, however, the advantages of "slow-thinking" models have so far been limited. VL-Rethinker addresses this very point and aims to strengthen self-reflection in image-text models.

The Approach: Reinforcement Learning and Forcing Reflection

VL-Rethinker trains the model with the GRPO algorithm (Group Relative Policy Optimization), a variant of reinforcement learning. To address GRPO's "vanishing advantages" problem, in which a batch carries no learning signal when all sampled responses receive the same reward, a novel technique called "Selective Sample Replay" (SSR) is used: SSR retains rollouts with non-zero advantages and replays them in later updates, sampled in proportion to the magnitude of their advantage. Furthermore, VL-Rethinker integrates a mechanism called "Forced Rethinking": at the end of an initial solution attempt, a textual "rethinking trigger" is appended that explicitly prompts the model to review its answer before finalizing it. This combination of SSR and Forced Rethinking promotes self-reflection and leads to a significant improvement in performance; both mechanisms are sketched below.
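To make the SSR idea concrete, the following Python sketch shows group-normalized advantages in the style of GRPO together with a replay buffer that keeps only informative rollouts. This is a minimal sketch under stated assumptions: the names (Rollout, SelectiveSampleReplay, build_batch), the buffer capacity, and the exact batch-refill policy are illustrative, not the paper's implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Rollout:
    prompt: str
    response: str
    advantage: float

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: each reward is normalized against its group.

    If every response in a group receives the same reward (all correct or
    all incorrect), every advantage is ~0 and the group contributes no
    learning signal -- the "vanishing advantages" problem SSR addresses.
    """
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

class SelectiveSampleReplay:
    """Keep rollouts with non-zero advantage and replay them, weighted by
    advantage magnitude, to refill batches that have lost their signal."""

    def __init__(self, capacity=4096):  # capacity is an assumed hyperparameter
        self.buffer = []
        self.capacity = capacity

    def add(self, rollouts):
        self.buffer.extend(r for r in rollouts if abs(r.advantage) > 1e-6)
        self.buffer = self.buffer[-self.capacity:]  # retain the most recent

    def sample(self, k):
        # Replay probability proportional to |advantage|.
        weights = [abs(r.advantage) for r in self.buffer]
        return random.choices(self.buffer, weights=weights, k=k)

def build_batch(fresh_rollouts, ssr, batch_size):
    """Drop zero-advantage rollouts, then top the batch up from replay."""
    batch = [r for r in fresh_rollouts if abs(r.advantage) > 1e-6]
    ssr.add(batch)
    if len(batch) < batch_size and ssr.buffer:
        batch += ssr.sample(batch_size - len(batch))
    return batch
```

The key point is the weighted sampling: rollouts that carried a larger advantage signal are revisited more often, so the policy update does not degenerate to zero even when most fresh groups are uniformly right or wrong.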
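Forced Rethinking can be illustrated just as compactly. In this sketch, the TextGenerator interface, the trigger wording, and the two-step generation are assumptions for illustration; the paper injects its own trigger texts during RL rollouts.

```python
from typing import Protocol

class TextGenerator(Protocol):
    """Minimal stand-in for a vision-language model's text interface."""
    def generate(self, prompt: str) -> str: ...

# Hypothetical trigger phrase; the paper uses its own trigger texts.
RETHINK_TRIGGER = "Wait, let me re-examine my reasoning before answering."

def rollout_with_forced_rethinking(model: TextGenerator, prompt: str,
                                   trigger: str = RETHINK_TRIGGER) -> str:
    """Append a rethinking trigger after the first complete answer and let
    the model continue generating, so the rollout always contains an
    explicit reflection phase before the final answer."""
    first_attempt = model.generate(prompt)
    continuation = model.generate(prompt + first_attempt + "\n" + trigger + "\n")
    return first_attempt + "\n" + trigger + "\n" + continuation
```

Because the outcome reward is computed on the answer that survives the reflection phase, the intended effect is that the model is reinforced for rethinking that actually corrects or confirms its solution, not merely for producing reflection boilerplate.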

Successes and Outlook

The results of VL-Rethinker are promising. On benchmarks such as MathVista, MathVerse, and MathVision, the model achieves new state-of-the-art scores of 80.3%, 61.8%, and 43.9%, respectively. VL-Rethinker also achieves strong results on multidisciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, closing the gap to leading "slow-thinking" models like GPT-o1. VL-Rethinker thus demonstrates the potential of reinforcement learning and targeted self-reflection for optimizing image-text models. Future research could focus on further refining the rethinking mechanisms and extending the approach to other multimodal tasks.

Bibliography:
- https://www.arxiv.org/abs/2504.02587
- https://arxiv.org/pdf/2504.02587
- https://huggingface.co/papers/2504.02587
- https://huggingface.co/papers
- https://github.com/TIGER-AI-Lab
- https://www.esann.org/sites/default/files/proceedings/2024/ES2024-181.pdf
- https://proceedings.neurips.cc/paper_files/paper/2024/file/ed45d6a03de84cc650cae0655f699356-Paper-Conference.pdf
- https://www.researchgate.net/publication/390213937_ViLBench_A_Suite_for_Vision-Language_Process_Reward_Modeling/download
- https://papers.nips.cc/paper_files/paper/2024/file/5c20c00504e0c049ec2370d0cceaf3c4-Paper-Conference.pdf
- https://github.com/yaotingwangofficial/Awesome-MCoT