InternVL3: A New Open-Source Multimodal Model Achieves State-of-the-Art Performance

A New Milestone for Open-Source Multimodal Models: InternVL3
The development of multimodal AI models, capable of processing both text and visual information, is advancing rapidly. A promising new contribution to this field is InternVL3, an open-source model that is attracting attention due to its innovative training approach and impressive performance results.
Unlike many existing multimodal models, which retrofit visual capabilities onto pre-trained text-only models, InternVL3 follows a native multimodal approach: text and image data are learned jointly from the very first stage of pre-training rather than in a separate, later adaptation phase. This simplifies the training pipeline and avoids many of the alignment problems that arise when visual data is integrated into a text-based model after the fact.
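To make the idea of native multimodal pre-training more concrete, here is a minimal sketch of a single training step in which projected image embeddings and text embeddings are concatenated into one sequence and optimized with a single next-token prediction loss. This is an illustrative reconstruction, not the authors' training code: the module names (vision_encoder, projector, language_model), the Hugging-Face-style inputs_embeds/logits interface, and the simple prepend-the-image layout are all assumptions.

```python
import torch
import torch.nn.functional as F

def joint_pretraining_step(vision_encoder, projector, language_model,
                           pixel_values, input_ids, labels, optimizer):
    """Hypothetical native multimodal pre-training step: visual and text
    tokens share one sequence and one next-token prediction loss."""
    # Encode the image into patch embeddings and project them into the
    # language model's embedding space (modules and shapes assumed).
    patch_embeds = projector(vision_encoder(pixel_values))   # (B, N_img, D)
    # Look up text token embeddings (interface assumed, LLaMA-style).
    text_embeds = language_model.embed_tokens(input_ids)     # (B, N_txt, D)

    # Place visual tokens before the text tokens so both modalities are
    # modeled jointly from the start of training.
    inputs_embeds = torch.cat([patch_embeds, text_embeds], dim=1)

    # Visual positions carry no next-token target, so they are masked out
    # (-100 is ignored by cross-entropy); text labels are used as-is.
    ignore = torch.full(patch_embeds.shape[:2], -100,
                        dtype=labels.dtype, device=labels.device)
    full_labels = torch.cat([ignore, labels], dim=1)

    logits = language_model(inputs_embeds=inputs_embeds).logits  # (B, N, V)
    # Standard causal shift: token t predicts token t+1.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        full_labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because image and text tokens are optimized together in one loss from the outset, no separate "vision alignment" stage has to be grafted onto an already trained text model, which is the simplification the native approach aims for.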
A core component of InternVL3 is Variable Visual Position Encoding (V2PE). Instead of advancing the position index uniformly for every token, V2PE assigns visual tokens smaller, variable position increments than text tokens, which lets the model handle much longer multimodal contexts and capture the relationships between text and image elements more precisely. This is complemented by advanced post-training techniques such as Supervised Fine-Tuning (SFT) and Mixed Preference Optimization (MPO), which further improve the model's performance.
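The following sketch illustrates the core idea of V2PE: text tokens advance the position index by 1, while visual tokens advance it by a smaller, possibly fractional increment, so long runs of image tokens consume far less of the positional range. The function name and the specific increment of 0.25 are illustrative choices, not the reference implementation.

```python
def v2pe_position_ids(token_types, visual_increment=0.25):
    """Assign position indices in the spirit of Variable Visual Position
    Encoding (V2PE): text tokens step the position by 1, visual tokens by a
    smaller (possibly fractional) stride, compressing the positional
    footprint of long image-token runs.

    token_types: sequence of "text" or "image" markers, one per token.
    visual_increment: stride used for image tokens (illustrative value).
    """
    positions = []
    current = 0.0
    for t in token_types:
        positions.append(current)
        current += 1.0 if t == "text" else visual_increment
    return positions

# Example: two text tokens, four image-patch tokens, two more text tokens.
tokens = ["text", "text", "image", "image", "image", "image", "text", "text"]
print(v2pe_position_ids(tokens))
# -> [0.0, 1.0, 2.0, 2.25, 2.5, 2.75, 3.0, 4.0]
```

In practice the visual increment would be chosen (or varied during training) so that long interleaved image-text sequences still fit within the position range the underlying language model can handle.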
The developers of InternVL3 have also placed great emphasis on scalability and an optimized training infrastructure, which allows the model to be trained on large amounts of data and thus to reach its full potential. In extensive evaluations, InternVL3 delivers convincing performance across a wide range of multimodal tasks. Particularly noteworthy is its score of 72.2 on the MMMU benchmark, which makes InternVL3 the current state of the art among open-source MLLMs.
Compared to proprietary models such as ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, InternVL3 delivers competitive performance while retaining strong pure-text capabilities. This underscores the potential of open-source models to compete with commercial solutions in multimodal AI.
Also notable is the developers' decision to release both the training data and the model weights publicly. In the spirit of open science, this allows the research community to build on InternVL3 and advance the development of future multimodal models. For companies like Mindverse, which specialize in AI-based content solutions, open-source models such as InternVL3 open up exciting possibilities: integrating them into applications such as chatbots, voicebots, or AI search engines could lead to innovative solutions and a better user experience.
InternVL3 represents an important step in the development of powerful and accessible multimodal AI models. The combination of an innovative training approach, impressive performance, and a commitment to open science makes this model a promising candidate for future applications in various fields.