Boosting Generative Model Training with Pretrained Representations

More Efficient Training of Generative Models through Embedded Representation Warmup
Generative models, especially diffusion models, have made impressive progress in recent years in generating high-dimensional data such as images. Despite their ability to produce realistic and detailed content, they still lag behind self-supervised learning methods in training efficiency and in the quality of the representations they learn. A significant bottleneck is the insufficient use of semantically rich representations during training, which considerably slows convergence.
Current research shows that the early layers of neural networks play a particularly important role in processing and transforming features. These layers form the so-called "representation processing region," where semantic and structural patterns are learned as the basis for subsequent generation. Using this region inefficiently slows the learning process and leads to suboptimal results.
To address this challenge, the "Embedded Representation Warmup" (ERW) framework was developed. ERW is a two-stage, modular approach that initializes the early layers of the diffusion model with high-quality pretrained representations. In the first phase, the warmup, representations from self-supervised encoders such as DINOv2 are integrated into the diffusion model's early layers, minimizing the effort of learning representations from scratch.
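To make the warmup phase concrete, the following is a minimal sketch of the idea, not the authors' implementation: the early blocks of a diffusion transformer are trained to match the features of a frozen pretrained encoder via a small projection head and a cosine-similarity alignment loss. The module names, the projector, the loss choice, and the toy stand-in for the frozen encoder are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EarlyBlocks(nn.Module):
    """Stand-in for the first few transformer blocks of a diffusion model."""

    def __init__(self, dim: int = 768, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return x


def warmup_step(early_blocks, projector, frozen_encoder, tokens, optimizer):
    """One warmup step: align early-layer outputs with frozen pretrained features."""
    with torch.no_grad():
        target = frozen_encoder(tokens)        # semantic reference features (no gradients)
    pred = projector(early_blocks(tokens))     # map early-layer output into the encoder's feature space
    loss = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage with random tokens; in practice `tokens` would be patch/latent tokens
# of real images and `frozen_encoder` a pretrained model such as DINOv2.
dim = 768
early = EarlyBlocks(dim=dim, depth=4)
proj = nn.Linear(dim, dim)
frozen = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)).eval()
for p in frozen.parameters():
    p.requires_grad_(False)
opt = torch.optim.AdamW(list(early.parameters()) + list(proj.parameters()), lr=1e-4)
tokens = torch.randn(2, 16, dim)
print(warmup_step(early, proj, frozen, tokens, opt))
```

After this phase, the aligned early blocks serve as the initialization for the diffusion model's representation processing region.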
In the second phase, full training, the model proceeds with standard diffusion training while the influence of the pretrained representations is gradually reduced, allowing the model to focus on refining the generation process. This yields faster convergence and better performance.
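One simple way to realize this gradual reduction, shown below as an assumed sketch rather than the paper's exact schedule, is to anneal the weight of the alignment term toward zero over the course of training, so the standard diffusion objective dominates more and more. The cosine schedule and the initial weight `lambda_0` are illustrative choices.

```python
import math


def alignment_weight(step: int, total_steps: int, lambda_0: float = 0.5) -> float:
    """Cosine decay of the alignment-loss weight from lambda_0 down to zero."""
    progress = min(step / max(total_steps, 1), 1.0)
    return lambda_0 * 0.5 * (1.0 + math.cos(math.pi * progress))


def combined_loss(diffusion_loss, alignment_loss, step, total_steps):
    """Standard diffusion objective plus a decaying alignment term."""
    return diffusion_loss + alignment_weight(step, total_steps) * alignment_loss


# The alignment term dominates early and fades out as training progresses:
for step in (0, 25_000, 50_000, 100_000):
    print(step, round(alignment_weight(step, total_steps=100_000), 3))
```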
Theoretical analyses indicate that ERW's effectiveness depends on integrating it precisely into the representation processing region: targeted initialization of the relevant layers maximizes the benefit of the pretrained representations. Empirical studies show that ERW not only increases training speed substantially, achieving up to a 40-fold acceleration over state-of-the-art methods such as REPA, but also improves the quality of the learned representations.
The implications of this research are far-reaching. ERW offers a promising tool for optimizing the training of generative models and enables more efficient use of resources. The combination of pre-trained representations and targeted training opens up new possibilities for developing more powerful and efficient generative AI systems. Publishing the code and weights on platforms like Hugging Face facilitates further research and application of ERW in practice.
Bibliography:
- Liu, D., Sun, P., Li, X., & Lin, T. (2025). Efficient Generative Model Training via Embedded Representation Warmup. arXiv preprint arXiv:2504.10188.
- https://arxiv.org/html/2504.10188v1
- https://paperreading.club/page?id=299285
- https://aclanthology.org/2024.emnlp-main.454.pdf
- https://chatpaper.com/chatpaper/zh-CN?id=5&date=1744646400&page=1
- https://proceedings.neurips.cc/paper_files/paper/2024/file/04a80267ad46fc730011f8760f265054-Paper-Conference.pdf
- https://www.researchgate.net/publication/389589405_Efficient_and_scalable_huge_embedding_model_training_via_distributed_cache_management
- https://opus4.kobv.de/opus4-haw/files/3916/2306.00637.pdf
- https://openreview.net/pdf?id=ZgDNrpS46k
- https://nips.cc/virtual/2024/papers.html