CoMPaSS Enables Precise Multi-Object Control in Text-to-Image Generation

The generation of images from text descriptions has made enormous progress in recent years thanks to diffusion models. Despite impressive results, however, existing methods offer only limited control over the generated objects, especially over their spatial arrangement and orientation. A new method called CoMPaSS (Control Multi-object Pose and Scene Synthesis) addresses this problem and enables targeted control of the orientation of multiple objects in a scene.
CoMPaSS extends existing text-to-image diffusion models with the ability to precisely control the orientation of individual objects. The core of the method lies in the use of so-called "compass tokens." These tokens, each assigned to an object, encode the desired spatial orientation and are passed to the diffusion model along with the text tokens. A compact neural network, the "compass encoder," generates these tokens based on the specified object orientation.
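To make the mechanism concrete, the following is a minimal sketch of what such a compass encoder could look like in PyTorch. The class name CompassEncoder, the dimensions, and the choice to represent orientation as a single azimuth angle via its sine and cosine are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the "compass token" idea, assuming a PyTorch-style
# pipeline. Names, dimensions, and the angle encoding are assumptions.
import torch
import torch.nn as nn

class CompassEncoder(nn.Module):
    """Maps a per-object orientation angle to a token in the text-embedding space."""

    def __init__(self, token_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        # Small MLP: (sin, cos) of the azimuth angle -> one token of width token_dim.
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, token_dim),
        )

    def forward(self, azimuth: torch.Tensor) -> torch.Tensor:
        # azimuth: (num_objects,) angles in radians.
        feats = torch.stack([torch.sin(azimuth), torch.cos(azimuth)], dim=-1)
        return self.mlp(feats)  # (num_objects, token_dim)

# Usage: append one compass token per object to the prompt embedding, so the
# diffusion model receives orientation alongside the text tokens.
text_tokens = torch.randn(1, 77, 768)               # stand-in for a CLIP text encoding
compass = CompassEncoder()
orient_tokens = compass(torch.tensor([0.0, 1.57]))  # two objects: 0° and ~90°
conditioning = torch.cat([text_tokens, orient_tokens.unsqueeze(0)], dim=1)
```

Encoding the angle as a sine/cosine pair (rather than a raw scalar) is a common way to keep the representation continuous across the 0°/360° boundary.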
CoMPaSS is trained on a synthetically generated dataset of scenes containing one or two 3D objects in front of a neutral background. Direct training, however, proved to cause undesirable entanglement between the objects and insufficient orientation control. To counteract this, the developers intervene in the generation process and restrict the cross-attention maps of each compass token to the image region of the corresponding object. This intervention significantly improves the precision of the orientation control.
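The attention intervention can likewise be sketched: before the softmax, the attention logits from image patches to a compass token are set to negative infinity outside that token's object region, so the token receives zero attention weight there. The function below, its tensor shapes, and the mask source are assumptions for illustration only, not the authors' code.

```python
# Sketch of restricting each compass token's cross-attention to its object's
# region. Mask source (e.g. a per-object box from the synthetic scene) and
# shapes are illustrative assumptions.
import torch

def masked_cross_attention(q, k, v, compass_token_ids, object_masks):
    """
    q: (B, N_pix, d)    image-patch queries
    k, v: (B, N_tok, d) keys/values over text + compass tokens
    compass_token_ids:  indices of the compass tokens in the token sequence
    object_masks: (B, num_compass, N_pix) binary masks, 1 inside the
                  corresponding object's region, 0 elsewhere
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5  # (B, N_pix, N_tok)
    # Forbid each compass token from attending outside its object's region
    # by pushing those attention logits to -inf before the softmax.
    for i, tok in enumerate(compass_token_ids):
        outside = object_masks[:, i, :] == 0   # (B, N_pix)
        scores[:, :, tok] = scores[:, :, tok].masked_fill(outside, float("-inf"))
    attn = scores.softmax(dim=-1)
    return attn @ v

# Toy usage: 2 compass tokens at indices 77 and 78, each tied to one image half.
B, N_pix, N_tok, d = 1, 64, 79, 768
q, k, v = (torch.randn(B, N_pix, d) for _ in range(3))
masks = torch.zeros(B, 2, N_pix)
masks[:, 0, :32] = 1  # object 1 owns the left half
masks[:, 1, 32:] = 1  # object 2 owns the right half
out = masked_cross_attention(q, k, v, [77, 78], masks)
```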
The results show that the trained model possesses remarkable generalization capabilities. It can control the orientation of complex objects not included in the training dataset and also generate scenes with more than two objects. Furthermore, CoMPaSS can be combined with personalization methods to precisely control the orientation of new objects in various contexts.
The evaluation of CoMPaSS includes extensive tests and a user study. The results demonstrate that the method achieves significantly improved orientation control and text fidelity compared to existing approaches. CoMPaSS opens up new possibilities for generating realistic and detailed images from text descriptions and could in the future be used in application areas such as the creation of virtual environments or the automated generation of product visualizations.
The development of CoMPaSS underscores the ongoing effort to improve the control and precision of text-to-image AI models. The ability to specifically control the spatial arrangement of objects is an important step towards even more realistic and flexible image generation.
Bibliography:
- https://arxiv.org/abs/2504.06752
- https://arxiv.org/html/2504.06752v1
- https://huggingface.co/papers/date/2025-04-11
- https://x.com/RishubhParihar/status/1895557951567589669
- https://paperreading.club/page?id=298461
- https://www.reddit.com/r/ninjasaid13/comments/1jvnhvn/250406752_compass_control_multi_object/
- https://www.researchgate.net/publication/387140297_CoMPaSS_Enhancing_Spatial_Understanding_in_Text-to-Image_Diffusion_Models
- https://openreview.net/forum?id=tMKz4IgSZQ
- https://link.springer.com/book/10.1007/b101929
- https://cvpr.thecvf.com/Conferences/2025/AcceptedPapers