MM-IFEngine Advances Multimodal Instruction Following for AI

Interaction with Artificial Intelligence (AI) is rapidly evolving, moving away from simple text commands towards more complex, multimodal instructions that combine image and text. A crucial factor for the success of this development is the ability of AI models to accurately understand and execute such multimodal instructions – the so-called "Instruction Following" (IF) capability. A research team has now introduced MM-IFEngine, a new method that promises significant advances in this area.

Previous approaches to multimodal instruction following suffered from several limitations: training data was scarce, benchmarks were too simple, and evaluation methods were imprecise, especially for tasks with exact output requirements. MM-IFEngine addresses these challenges with a multi-stage approach.

At the heart of MM-IFEngine is a pipeline for generating high-quality image-instruction pairs. This pipeline enables the creation of extensive, diverse, and high-quality training data. The result is the MM-IFInstruct-23k dataset, which is ideally suited for Supervised Fine-Tuning (SFT). Furthermore, this dataset was expanded to MM-IFDPO-23k to enable Direct Preference Optimization (DPO).
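The article does not spell out the dataset schemas, but conceptually the step from SFT data to DPO data amounts to pairing each reference answer with a dispreferred alternative. A minimal sketch, with hypothetical field names (the actual MM-IFInstruct-23k and MM-IFDPO-23k formats may differ):

```python
from dataclasses import dataclass

# Illustrative record layouts only -- the real MM-IFInstruct-23k /
# MM-IFDPO-23k schemas are not described in this article, so these
# field names are assumptions.

@dataclass
class SFTExample:
    image_path: str      # input image
    instruction: str     # multimodal instruction with constraints
    response: str        # constraint-satisfying reference answer


@dataclass
class DPOExample:
    image_path: str
    instruction: str
    chosen: str          # preferred (constraint-satisfying) response
    rejected: str        # dispreferred response


def to_dpo_example(sft: SFTExample, rejected: str) -> DPOExample:
    """Pair an SFT reference answer with a weaker response for DPO."""
    return DPOExample(
        image_path=sft.image_path,
        instruction=sft.instruction,
        chosen=sft.response,
        rejected=rejected,
    )
```

In this framing, the SFT dataset trains the model to imitate constraint-satisfying answers, while the DPO expansion additionally teaches it to prefer those answers over ones that violate the constraints.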

To evaluate the performance of AI models in multimodal instruction following, the team developed MM-IFEval. This benchmark is characterized by its complexity and diversity: it combines compose-level constraints on the output responses with perception-level constraints that are tied to the input images. Evaluation runs through a comprehensive pipeline that integrates both rule-based checks and judge models.
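To make the rule-based side of such a pipeline concrete, here is a small sketch of how compose-level constraints can be verified programmatically. The specific constraint types and scoring are illustrative assumptions, not MM-IFEval's actual rules; perception-level constraints tied to the image would instead be delegated to a judge model:

```python
import re

# Hypothetical rule-checkable, compose-level constraints.

def check_max_words(response: str, limit: int) -> bool:
    """Constraint: the answer must not exceed `limit` words."""
    return len(response.split()) <= limit

def check_required_keyword(response: str, keyword: str) -> bool:
    """Constraint: the answer must mention a given keyword."""
    return keyword.lower() in response.lower()

def check_bullet_count(response: str, n: int) -> bool:
    """Constraint: the answer must contain exactly `n` bullet points."""
    return len(re.findall(r"^\s*[-*] ", response, flags=re.MULTILINE)) == n

def rule_based_score(response: str, checks) -> float:
    """Fraction of constraints satisfied by the response."""
    results = [check(response) for check in checks]
    return sum(results) / len(results)
```

For example, a response consisting of two bullets under a word limit passes both checks, so `rule_based_score` returns 1.0; a model-based judge would then handle constraints that rules cannot decide, such as whether the answer matches the image content.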

Experiments with SFT and DPO show that fine-tuning Multi-modal Large Language Models (MLLMs) on MM-IFInstruct-23k and MM-IFDPO-23k yields significant gains on various IF benchmarks: for example, improvements of 10.2% on MM-IFEval, 7.6% on MIA, and 12.3% on IFEval.

The development of MM-IFEngine and MM-IFEval represents an important step towards more effective and precise multimodal instruction following. The availability of extensive and high-quality training data and a robust benchmark makes it possible to train and evaluate the capabilities of MLLMs more effectively. This opens up new possibilities for the development of AI systems that can handle complex tasks in various application areas.

The publication of the data and evaluation code on GitHub underscores the research team's commitment to open science and allows other researchers to build on these results and further advance the field of multimodal instruction following.

Bibliography:
Ding, S., Wu, S., Zhao, X., Zang, Y., Duan, H., Dong, X., Zhang, P., Cao, Y., Lin, D., & Wang, J. (2025). MM-IFEngine: Towards Multimodal Instruction Following. arXiv preprint arXiv:2504.07957.
Inoue, N., Takagi, I., & Joty, S. (2023). Towards Flexible Multi-Modal Document Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11751-11761).
Alayrac, J. B., Donahue, J., Epstein, D., Snajder, J., Clark, I., Gururangan, S., ... & Simonyan, K. (2024). MMT-IF: A Challenging Multi-modal Multi-turn Instruction Following Foundation Model Benchmark. Research at Google.
Huang, J., Wang, C., Xue, H., Zhu, C., & Zhao, W. (2024). Multimodal Instruction Following with Frozen Vision-Language Models. Advances in Neural Information Processing Systems, 37.
Wu, S., Zhang, P., Cao, Y., Zang, Y., Duan, H., Dong, X., ... & Wang, J. (2024). MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment. arXiv preprint arXiv:2407.01509.
Zhang, P., Wu, S., Zang, Y., Duan, H., Dong, X., Cao, Y., ... & Wang, J. (2024). MM-IF: A Challenging Multi-modal Multi-turn Instruction Following Foundation Model Benchmark. arXiv preprint arXiv:2409.18216.