Complex-Edit: A New Benchmark for AI Image Editing

AI-powered image editing has made tremendous progress in recent years. But how well do AI models actually handle complex editing tasks? A new benchmark called Complex-Edit aims to answer precisely this question. Developed by the research team behind the accompanying paper (Yang et al., 2025), Complex-Edit provides a systematic testing environment for evaluating instruction-based image editing models on tasks of controllable complexity.

GPT-4 for Generating Complex Instructions

A distinctive feature of Complex-Edit is how the test instructions are generated. The benchmark uses GPT-4, a large language model, to automatically create diverse and complex editing instructions. The process follows a "Chain-of-Edit" pipeline: first, individual atomic editing steps are generated, which are then chained together into more complex instructions; the number of chained steps controls the complexity level of each task. This yields a wide range of challenges for the AI models, from simple adjustments to elaborate image manipulations.
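
To make the idea more concrete, here is a minimal sketch of how such a Chain-of-Edit pipeline could be wired up. It assumes the openai Python client and a chat-capable model; the prompts, the model name, and the helper functions are illustrative and do not reproduce the authors' exact implementation.

```python
# Minimal sketch of a Chain-of-Edit style pipeline (illustrative only).
# Assumes the openai Python client and an API key in the environment.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send a single prompt to the language model and return its reply."""
    reply = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

def generate_atomic_edits(image_caption: str, n: int = 3) -> list[str]:
    """Step 1: propose n simple, atomic editing instructions for the image."""
    text = ask(
        f"The image shows: {image_caption}\n"
        f"List {n} simple, atomic editing instructions, one per line."
    )
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]

def compose_complex_instruction(atomic_edits: list[str]) -> str:
    """Step 2: chain the atomic edits into one coherent complex instruction."""
    joined = "; ".join(atomic_edits)
    return ask(
        "Combine the following atomic edits into a single, natural-sounding "
        f"editing instruction that applies all of them: {joined}"
    )

atomic = generate_atomic_edits("a dog sitting on a beach at sunset")
print(compose_complex_instruction(atomic))
```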

Diverse Metrics and Automated Evaluation

To objectively measure the performance of the AI models, Complex-Edit uses a range of metrics. These cover different aspects of an edit, such as how faithfully the instruction is carried out, how well important elements of the original image are preserved, and the overall aesthetic quality of the result. In addition, an automated evaluation pipeline based on Vision-Language Models (VLMs) is used, enabling efficient evaluation at scale.
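
As an illustration of what such a VLM-based judging step can look like, the sketch below asks a vision-capable chat model to score an edit along three plausible dimensions. The prompt, the metric names, and the 0-10 scale are assumptions made for this sketch, not the benchmark's exact evaluation protocol.

```python
# Sketch of a VLM-as-judge evaluation step (illustrative; the benchmark's
# exact prompts, scales, and metric names may differ). Assumes the openai
# Python client and a vision-capable chat model.
import base64
import json
from openai import OpenAI

client = OpenAI()

def to_data_url(path: str) -> str:
    """Encode a local image as a base64 data URL for the chat API."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def score_edit(original: str, edited: str, instruction: str) -> dict:
    """Ask the VLM to rate instruction following, identity preservation,
    and perceptual quality of an edit, each on a 0-10 scale."""
    prompt = (
        f'The editing instruction was: "{instruction}". The first image is the '
        "original, the second is the edited result. Return JSON with integer "
        "scores from 0 to 10 for the keys instruction_following, "
        "identity_preservation, and perceptual_quality."
    )
    reply = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-language model with an API works
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": to_data_url(original)}},
                {"type": "image_url", "image_url": {"url": to_data_url(edited)}},
            ],
        }],
    )
    return json.loads(reply.choices[0].message.content)
```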

Key Findings from the Benchmark

The initial results of the Complex-Edit benchmark already provide interesting insights into the strengths and weaknesses of current AI models:

- Open-source models lag significantly behind proprietary, closed-source models, and the gap widens as instruction complexity increases.
- As instructions become more complex, it becomes harder for the models to preserve important elements of the original image and to maintain aesthetic quality.
- Decomposing a complex instruction into individual, sequentially executed steps leads to a drop in performance across the various metrics.
- A simple "Best-of-N" strategy, in which the best of several generated images is selected, improves results both for direct editing and for the stepwise approach (a minimal sketch of this selection strategy follows this list).
- A "curse of synthetic data" is evident: models trained on synthetic data tend to produce increasingly artificial-looking images as instruction complexity grows. Interestingly, this phenomenon can also be observed in the outputs of GPT-4.

Outlook

Complex-Edit offers a valuable tool for tracking progress in AI-powered image editing and for driving the development of more capable models. The insights from the benchmark can help researchers better understand the challenges of processing complex instructions and target improvements to their models accordingly. Especially for companies like Mindverse, which work on customized AI solutions, benchmarks like Complex-Edit provide an important basis for developing and optimizing image editing functions in chatbots, voicebots, and other AI applications.

Bibliography:
Yang, S., Hui, M., Zhao, B., Zhou, Y., Ruiz, N., & Xie, C. (2025). Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark. arXiv preprint arXiv:2504.13143. https://arxiv.org/abs/2504.13143
https://arxiv.org/html/2504.13143v1
https://paperreading.club/page?id=300247
https://neurips.cc/virtual/2024/poster/97473
https://github.com/wangkai930418/awesome-diffusion-categorized
https://openreview.net/forum?id=1dpmeH6IHa&noteId=Ood0ELpm0d
https://www.researchgate.net/publication/384235529_Emu_Edit_Precise_Image_Editing_via_Recognition_and_Generation_Tasks
https://huggingface.co/papers?q=image%20editing
https://www.mdpi.com/2076-3417/15/3/1079
https://github.com/showlab/Awesome-Video-Diffusion