Visual In-Context Learning Enables Universal Image Generation with VisualCloze

Visual In-Context Learning: A New Approach to Universal Image Generation

The rapid advances in generative AI, particularly in diffusion models, have revolutionized image generation. Despite impressive results, development has mostly focused on specialized models, each trained for a single task. This is inefficient when a variety of requirements must be served. Universal models offer a potential solution but face three challenges: generalizing across task instructions, covering a sufficiently dense task distribution, and designing a unified architecture.

A promising approach to these challenges is VisualCloze, a universal framework for image generation. VisualCloze supports a wide range of in-domain tasks, generalization to unseen tasks, the consolidation of multiple tasks into a single step, and even reverse generation. In contrast to existing methods that rely on language-based task instructions, which can lead to ambiguity and limited generalization, VisualCloze uses visual in-context learning: the model learns to identify a task from a small set of visual demonstrations.
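To make the "cloze" idea concrete, the in-context examples and the query can be laid out as a single grid image whose last cell is left blank for the model to fill. The following sketch is purely illustrative: the function name, layout, and cell size are assumptions, not the paper's actual API.

```python
import numpy as np

def build_incontext_grid(example_pairs, query_image, cell_size=64):
    """Arrange in-context examples and a query into one grid image.

    Each row holds one (condition, target) example pair; the final row
    holds the query condition next to a blank cell that the model must
    fill, in the spirit of a visual cloze task. Hypothetical layout.
    """
    rows = [np.concatenate([cond, tgt], axis=1) for cond, tgt in example_pairs]
    blank = np.zeros_like(query_image)              # cell to be generated
    rows.append(np.concatenate([query_image, blank], axis=1))
    grid = np.concatenate(rows, axis=0)
    # Mask marks the region the generator must complete (1 = fill in).
    mask = np.zeros(grid.shape[:2], dtype=np.uint8)
    mask[-cell_size:, cell_size:] = 1
    return grid, mask

# Tiny demo with random 64x64 RGB "images".
rng = np.random.default_rng(0)
pair = (rng.random((64, 64, 3)), rng.random((64, 64, 3)))
query = rng.random((64, 64, 3))
grid, mask = build_incontext_grid([pair], query)
print(grid.shape, int(mask.sum()))  # (128, 128, 3) 4096
```

Because the task is conveyed by the example row itself, no textual instruction is needed; the model infers what transformation maps the condition column to the target column.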

The Challenge of Task Distribution and the Solution with Graph200K

Another problem for universal image generation models is the inherent sparsity of visual task distributions, which makes it difficult to learn knowledge that transfers between tasks. To address this, the authors built Graph200K, a graph-structured dataset that links a large number of interrelated tasks. Graph200K increases task density and promotes knowledge transfer across tasks, leading to more robust generalization.
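One way to see why a graph structure densifies the task distribution: if each image carries several annotation modalities, every ordered pair of modalities defines a potential translation task. The modality names below are illustrative assumptions, not the dataset's actual schema.

```python
from itertools import permutations

# Hypothetical annotation modalities attached to each image; any ordered
# pair (source -> target) defines one task edge in the task graph.
modalities = ["rgb", "depth", "canny", "pose", "style"]

# With 5 modalities, 5 * 4 = 20 directed task edges share the same
# underlying images, so knowledge learned on one edge can transfer.
task_edges = list(permutations(modalities, 2))
print(len(task_edges))  # 20
```

The quadratic growth in shared edges is what turns a handful of annotation types into a dense, interconnected task space.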

The Strength of Inpainting Models

A notable aspect of VisualCloze is how it exploits the generative capabilities of pre-trained inpainting models. The authors observe that their unified formulation of image generation shares a consistent objective with image inpainting: completing a masked region of an image. This lets VisualCloze leverage the strong generative priors of inpainting models without modifying the model architecture, which simplifies training and makes efficient use of existing resources.
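The shared objective can be sketched as a reconstruction loss restricted to the masked cell of the grid, which is exactly what an inpainting model already optimizes. This is a minimal illustration of the idea, not the paper's exact loss.

```python
import numpy as np

def inpainting_loss(pred, target, mask):
    """Mean squared error restricted to the masked region.

    An inpainting model minimizes exactly this kind of masked
    reconstruction objective; here the mask covers the blank cell of an
    in-context grid. Illustrative sketch only.
    """
    m = mask[..., None].astype(float)           # broadcast over channels
    denom = m.sum() * pred.shape[-1]
    return float(((pred - target) ** 2 * m).sum() / denom)

rng = np.random.default_rng(1)
target = rng.random((128, 128, 3))
mask = np.zeros((128, 128), dtype=np.uint8)
mask[64:, 64:] = 1                              # only the blank cell counts
print(inpainting_loss(target, target, mask))    # 0.0 for a perfect fill
```

Because the loss ignores the unmasked context cells, a pre-trained inpainting model can be reused as-is: the in-context examples simply become part of the visible context it conditions on.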

Diverse Application Possibilities

VisualCloze opens up a variety of applications. Beyond common image generation tasks, it allows several tasks to be combined in a single step, for example applying style transfer and object removal simultaneously. Furthermore, VisualCloze supports reverse generation, i.e., reconstructing the conditions that led to a given image, a capability that can be valuable for analyzing and understanding generated images.
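Reverse generation falls out of the same cloze formulation almost for free: instead of masking the result cell, one masks the condition cell, so the model must reconstruct the input that produced a given image. The mask helper below is a hypothetical sketch of this symmetry.

```python
import numpy as np

def make_mask(h, w, cell, reverse=False):
    """Build a fill-in mask for the last row of an in-context grid.

    Forward generation hides the result cell; reverse generation hides
    the condition cell instead. Illustrative layout, not the paper's API.
    """
    mask = np.zeros((h, w), dtype=np.uint8)
    if reverse:
        mask[-cell:, :cell] = 1     # hide the condition, keep the result
    else:
        mask[-cell:, cell:] = 1     # hide the result, keep the condition
    return mask

fwd = make_mask(128, 128, 64)
rev = make_mask(128, 128, 64, reverse=True)
print(int(fwd.sum()), int(rev.sum()))  # 4096 4096, opposite cells
```

The same trained model serves both directions; only the placement of the mask changes.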

Future Developments

VisualCloze represents an important step towards universal image generation models. Future research could focus on expanding the range of tasks, improving generalization capabilities, and developing more efficient training methods. The combination of visual in-context learning with graph-based datasets like Graph200K offers a promising foundation for the development of powerful and flexible image generation systems.

Sources:
- Li, Zhong-Yu et al. "VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning." arXiv preprint arXiv:2504.07960 (2025).
- https://arxiv.org/abs/2312.03584
- https://arxiv.org/abs/2311.13601
- https://huggingface.co/papers/2412.01824
- https://hal.science/hal-03933089v2/file/ecir-2023-vf-authors.pdf
- https://openreview.net/pdf/ec15a23348aa44671b854fbba130455ae69f9bd8.pdf
- https://chatpaper.com/chatpaper/zh-CN?id=4&date=1744300800&page=1
- https://ivonajdenkoska.github.io/contextdiffusion/main.html
- https://openreview.net/pdf/2c2260fc62d6a180e12e943725968e430205fe0a.pdf
- https://www.reddit.com/r/StableDiffusion/comments/1dujfkk/what_is_the_image_model_equivalent_of_in_context/
- https://openai.com/index/introducing-4o-image-generation/