Refining Diffusion Models with a Weak-to-Strong Approach

From Weak to Strong Diffusion Models: A Reflection Approach

Diffusion models have established themselves as powerful tools for content generation, from images and videos to text and audio. Their goal is to match the learned distribution as closely as possible to the distribution of real-world data. This is typically achieved through score matching, in which the model learns to estimate the score of the data distribution (the gradient of its log density), so that samples drawn from the model agree with the training data. Despite impressive progress, challenges remain: limitations in training-data quality, modeling strategy, and model architecture can all lead to discrepancies between the generated outputs and the real data.

A promising approach to bridging this gap is the "Weak-to-Strong Diffusion" (W2SD) framework. W2SD leverages the estimated difference between an existing weak model and a strong model (the "weak-to-strong difference") to approximate the discrepancy between an ideal model and the strong model. Simply put, the measurable gap between two accessible models serves as a proxy for the unmeasurable gap to an ideal model.
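The core approximation can be sketched in a few lines. This is a toy illustration, not the paper's actual API: the two "models" below are stand-in linear denoisers, and all names are hypothetical.

```python
# Toy sketch of the weak-to-strong difference. Each "model" here is a
# stand-in denoiser that maps a latent to a noise/score estimate; the
# real framework operates on full diffusion models.

def weak_model(x: float) -> float:
    # hypothetical weak model: a coarser estimate
    return 0.5 * x

def strong_model(x: float) -> float:
    # hypothetical strong model: a better estimate
    return 0.8 * x

def w2s_difference(x: float) -> float:
    """Estimated weak-to-strong difference, used by W2SD as a proxy for
    the inaccessible gap between an ideal model and the strong model."""
    return strong_model(x) - weak_model(x)

delta = w2s_difference(1.0)  # ~0.3 for these toy models
```

The key point is that both terms of the difference are computable, whereas the ideal model is not; W2SD bets that the two gaps point in a similar direction.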

At the heart of the W2SD approach is a reflection operation that alternates between denoising and inversion, guided by the weak-to-strong difference. Denoising removes noise from the data, while inversion is the reverse process, i.e., adding noise back. Through this iterative procedure, the latent variables are steered along the sampling trajectories toward the real data distribution. In theory, this improves the fit to the real data distribution and thus the quality of the generated content.
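The alternation above can be sketched with toy one-dimensional score functions. This is a minimal illustration of the reflection idea under simplifying assumptions (linear scores, fixed step size), not the paper's exact sampler; all function names are hypothetical.

```python
def strong_score(x: float) -> float:
    # hypothetical strong model's score estimate
    return 0.8 * x

def weak_score(x: float) -> float:
    # hypothetical weak model's score estimate
    return 0.5 * x

def denoise_strong(x: float, step: float = 0.1) -> float:
    # one strong-model denoising step: moves the latent toward the data
    return x - step * strong_score(x)

def invert_weak(x: float, step: float = 0.1) -> float:
    # one weak-model inversion step: moves the latent back toward noise
    return x + step * weak_score(x)

def reflection_step(x: float) -> float:
    """One reflection: denoise with the strong model, then invert with
    the weak model. The two moves almost cancel; the residual is
    proportional to the weak-to-strong difference and nudges the latent
    toward the stronger model's distribution."""
    return invert_weak(denoise_strong(x))

x = 1.0
for _ in range(10):
    x = reflection_step(x)
# x shrinks by a factor of (1 - 0.1*0.8) * (1 + 0.1*0.5) = 0.966 per step
```

Because the strong model's denoising step is larger than the weak model's inversion step, the composition makes net progress toward the data even though each inversion partially undoes the denoising.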

The flexibility of W2SD is a key advantage. By strategically selecting the weak and strong model pair, the approach can be adapted to various use cases. For example, different versions of the same model (e.g., DreamShaper vs. SD1.5) or different expert models within a Mixture-of-Experts (MoE) architecture can be paired. The choice of pair determines how the weak-to-strong difference is computed and therefore which shortcomings the reflection corrects.
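This pluggability can be expressed as a small factory over interchangeable denoisers. The sketch below is an assumed interface for illustration only; the model names are stand-ins, not real checkpoint loaders.

```python
from typing import Callable

# A "denoiser" here is any callable mapping a latent to an estimate;
# in practice these would be full diffusion models.
Denoiser = Callable[[float], float]

def make_w2sd_guidance(weak: Denoiser, strong: Denoiser) -> Denoiser:
    """Build a guidance signal from any chosen weak/strong pair."""
    return lambda x: strong(x) - weak(x)

# Pairing strategy 1: two checkpoints of the same architecture
# (stand-ins for SD1.5 and DreamShaper, not real model loaders).
sd15 = lambda x: 0.5 * x
dreamshaper = lambda x: 0.8 * x
guidance_ckpt = make_w2sd_guidance(weak=sd15, strong=dreamshaper)

# Pairing strategy 2: two experts within a hypothetical MoE model.
expert_generalist = lambda x: 0.6 * x
expert_specialist = lambda x: 0.9 * x
guidance_moe = make_w2sd_guidance(weak=expert_generalist,
                                  strong=expert_specialist)
```

The factory makes explicit that W2SD is agnostic to where the quality gap comes from: only the relative strength of the pair matters.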

Extensive experiments demonstrate the effectiveness of W2SD, with improvements in human preference, aesthetic quality, and prompt adherence. W2SD achieves state-of-the-art results across modalities (e.g., image, video), architectures (e.g., UNet-based, DiT-based, MoE), and benchmarks. For example, Juggernaut-XL with W2SD raises its HPSv2 win rate to as much as 90% relative to the original model. Notably, the performance gain significantly outweighs the additional computational cost.

The cumulative improvements from various Weak-to-Strong differences underscore the practical utility and applicability of W2SD. By combining different model pairs, the strengths of various models can be effectively leveraged, and the weaknesses of individual models can be compensated for. This opens up new possibilities for the development of even more powerful diffusion models and the generation of content that comes even closer to real data in quality and diversity.

Bibliography:
- http://www.arxiv.org/abs/2502.00473
- https://arxiv.org/html/2502.00473v1
- https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/04633.pdf
- https://paperreading.club/page?id=281673
- https://pixart-alpha.github.io/PixArt-sigma-project/
- https://openreview.net/pdf/620820e80add47ee89c60c5b49b9ba36d6f24d4f.pdf
- https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/04633-supp.pdf
- https://www.researchgate.net/publication/386076902_PIXART-Sigma_Weak-to-Strong_Training_of_Diffusion_Transformer_for_4K_Text-to-Image_Generation
- https://github.com/dair-ai/ML-Papers-of-the-Week
- https://huggingface.co/papers/2404.01294