Data Augmentation Improves Visual Reasoning in Vision-Language Models

Visual Reasoning through Data Augmentation: A New Approach to Reinforcement Learning for Vision-Language Models

Combining visual perception and language processing is one of the great challenges in Artificial Intelligence. Vision-Language Models (VLMs) are designed to bridge these two worlds, enabling complex tasks such as image captioning, visual question answering, and interaction with the physical world through robots. A promising approach to improving VLMs is Reinforcement Learning (RL), which trains a model to learn good action strategies through rewards and penalties. A new research paper presents a method that improves the RL training of VLMs through targeted data augmentation.

The presented method, called NoisyRollout, aims to improve the exploration capability of VLMs while increasing their robustness to imperfect visual perception. Conventional RL methods for VLMs often struggle to explore the search space effectively and thus to find optimal strategies. In addition, errors in visual perception, such as noise or blur, can degrade the reasoning steps that build on it.

NoisyRollout addresses these challenges by injecting targeted noise into the training process. Specifically, during training the model is presented with both unaltered images and moderately noised versions of them. This mixture of clean and distorted images yields greater diversity in visual perception and in the reasoning trajectories the model produces. Confronted with different visual interpretations of the same scene, the model learns to draw conclusions that are robust rather than dependent on individual pixels or features.
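The following Python sketch illustrates this core idea under stated assumptions: Gaussian pixel noise serves as the distortion, and policy.generate(image, prompt) is a hypothetical sampling interface standing in for the actual VLM rollout code. It is a minimal illustration of mixing clean and noisy rollouts, not the paper's implementation.

    import torch

    def add_gaussian_noise(image: torch.Tensor, sigma: float) -> torch.Tensor:
        """Return a distorted copy of `image` (shape (C, H, W), values in
        [0, 1]) with additive Gaussian noise of standard deviation `sigma`."""
        noisy = image + sigma * torch.randn_like(image)
        return noisy.clamp(0.0, 1.0)

    def mixed_rollouts(policy, image, prompt, n_clean, n_noisy, sigma):
        """Sample one group of responses from the clean image and one from a
        noised copy, then pool them into a single rollout group.
        `policy.generate` is a hypothetical interface, assumed here."""
        clean_group = [policy.generate(image, prompt) for _ in range(n_clean)]
        noisy_image = add_gaussian_noise(image, sigma)
        noisy_group = [policy.generate(noisy_image, prompt) for _ in range(n_noisy)]
        # Pooling both groups means their rewards share one baseline, so
        # trajectories induced by the distorted view still shape the update.
        return clean_group + noisy_group

The pooling step reflects the paper's general description of combining clean and noisy rollouts during training; the group sizes and the exact way rewards are normalized are details best taken from the original work.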

Another important aspect of NoisyRollout is its noise-annealing schedule: the strength of the noise is gradually reduced over the course of training. At the beginning, the model benefits from strong noise signals, which promote exploration. As training progresses, the noise is scaled down to keep training stable and scalable and to allow convergence to a good strategy.
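As a rough illustration, such annealing can be implemented as a simple decay of the noise strength over training steps. The linear form and the start and end values below are assumptions chosen for clarity; the paper describes the schedule only in general terms.

    def annealed_sigma(step: int, total_steps: int,
                       sigma_start: float = 0.15, sigma_end: float = 0.0) -> float:
        """Noise strength for the current training step, decayed linearly.
        sigma_start and sigma_end are illustrative values, not taken from
        the paper."""
        frac = min(step / max(total_steps, 1), 1.0)  # training progress in [0, 1]
        return sigma_start + frac * (sigma_end - sigma_start)

Early in training, annealed_sigma returns values near sigma_start, so the distorted rollouts differ markedly from the clean ones and encourage exploration; toward the end it approaches sigma_end, so updates are increasingly driven by clean inputs and training can converge stably.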

The results of the study show that NoisyRollout achieves strong performance with a comparatively small amount of training data (2.1K samples). On five benchmarks covering both reasoning and perception tasks, NoisyRollout surpasses existing open-source RL-trained models. Particularly noteworthy are the gains on out-of-domain benchmarks, which suggest improved generalization, while in-domain performance remains comparable to, or better than, that of conventional methods.

The presented method offers a promising way to improve VLMs by combining reinforcement learning with data augmentation. The targeted injection of noise into the training process promotes exploration and increases robustness to imperfect visual perception. The results suggest that NoisyRollout has the potential to drive the development of more capable and robust VLMs, opening up new applications in fields such as robotics, image understanding, and human-computer interaction.

Bibliography:

Liu, X., Ni, J., Wu, Z., Du, C., Dou, L., Wang, H., Pang, T., & Shieh, M. Q. (2025). NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation. arXiv preprint arXiv:2504.13055.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1-67.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.

Thorben, J., Ranftl, R., & Kolesnikov, A. (2022). A Study on Data Augmentation for GAN Training. arXiv preprint arXiv:2210.04561.

Wang, Z., Yuan, Z., Yu, W., Yu, J., Liu, Z., & Sun, M. (2024). Masked Image Modeling with Local Multi-Scale Reconstruction. arXiv preprint arXiv:2405.17416.

Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N. A. (2020). Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8342-8360).