Scaling Visual Tokenizers: Impacts on Image and Video Reconstruction and Generation
Visual Tokenization in Focus: Scaling for Reconstruction and Generation
The world of artificial intelligence is constantly evolving, and AI-driven image and video generation has made enormous progress in recent years. A key concept in this area is visual tokenization: compressing pixel information into a latent space that forms the basis for modern generative models. While scaling transformer-based generators has driven recent advances, the tokenizer component itself has rarely been scaled. This raises the question of how design decisions in autoencoder development affect both the reconstruction objective and downstream generative performance.
Recent research investigates the scaling of autoencoders to answer these questions. The researchers replace the typical convolutional backbone with an enhanced Vision Transformer architecture for tokenization, called ViTok. By training ViTok on large-scale image and video datasets that go far beyond ImageNet-1K, the data constraints that previously limited tokenizer scaling are removed. The study examines how scaling the autoencoder's bottleneck, encoder, and decoder affects reconstruction and generation.
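To make the architectural idea concrete, the following is a minimal PyTorch sketch of a ViT-style tokenizer: images are patchified into tokens, a transformer encoder compresses them through a narrow linear bottleneck, and a transformer decoder reconstructs the pixels. The class name and all hyperparameters are illustrative assumptions, not the official ViTok implementation.

```python
import torch
import torch.nn as nn

class ViTAutoencoder(nn.Module):
    """Minimal sketch of a ViT-style tokenizer (illustrative, not the official ViTok code)."""

    def __init__(self, img_size=256, patch=16, dim=768, depth=6, heads=12, latent_ch=16):
        super().__init__()
        self.n_tokens = (img_size // patch) ** 2
        # Patchify: non-overlapping patches become a sequence of tokens.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.n_tokens, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        # The bottleneck: project each token down to latent_ch channels and back.
        self.to_latent = nn.Linear(dim, latent_ch)
        self.from_latent = nn.Linear(latent_ch, dim)
        self.unpatchify = nn.ConvTranspose2d(dim, 3, kernel_size=patch, stride=patch)

    def encode(self, x):
        tokens = self.patchify(x).flatten(2).transpose(1, 2) + self.pos
        return self.to_latent(self.encoder(tokens))  # (B, n_tokens, latent_ch)

    def decode(self, z):
        h = self.decoder(self.from_latent(z))
        side = int(self.n_tokens ** 0.5)
        h = h.transpose(1, 2).reshape(z.shape[0], -1, side, side)
        return self.unpatchify(h)

    def forward(self, x):
        z = self.encode(x)
        return self.decode(z), z

x = torch.randn(2, 3, 256, 256)
recon, z = ViTAutoencoder()(x)  # recon: (2, 3, 256, 256), z: (2, 256, 16)
```

In a setup like this, the scaling experiments amount to independently varying the encoder's depth and width, the decoder's depth and width, and the per-token latent dimension `latent_ch`.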
The Influence of Scaling
It turns out that the size of the bottleneck correlates strongly with reconstruction performance: the larger the bottleneck, the better the reconstruction. The relationship to generation, however, is more complex, as simply enlarging the bottleneck does not necessarily improve generative performance. The researchers also scaled the encoder and decoder separately. They found that scaling the encoder yields only minimal gains for reconstruction or generation, while scaling the decoder improves reconstruction but has inconsistent effects on generation.
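A small worked example clarifies what "scaling the bottleneck" means here. Treating the total bottleneck size as the number of latent tokens times the channels per token (consistent with the sketch above; the paper's exact accounting may differ), widening the per-token channel dimension grows it linearly:

```python
def bottleneck_floats(img_size=256, patch=16, latent_ch=16):
    # Total floats in the latent: tokens along height * tokens along width * channels.
    return (img_size // patch) ** 2 * latent_ch

# Illustrative sweep over the per-token channel dimension at fixed 256p / patch 16:
for ch in (4, 8, 16, 32, 64):
    print(f"latent_ch={ch:2d} -> bottleneck of {bottleneck_floats(latent_ch=ch):,} floats")
```

Per the findings above, growing this number reliably helps reconstruction, but a larger latent also gives the downstream generator a harder distribution to model, which is one plausible reading of the mixed generation results.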
ViTok: A Lightweight Autoencoder
Building on these findings, ViTok was developed as a lightweight autoencoder. It achieves performance competitive with state-of-the-art autoencoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) and surpasses existing autoencoders on 16-frame 128p video reconstruction on UCF-101, all with 2-5x fewer FLOPs. Combined with diffusion transformers, ViTok delivers competitive image generation on ImageNet-1K and sets new state-of-the-art marks for class-conditional video generation on UCF-101.
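As a usage sketch, pairing such a tokenizer with a latent diffusion transformer could look like the following. Here `dit.sample(...)` is an assumed interface standing in for whatever sampler the generator exposes; it is not a real library call.

```python
import torch

@torch.no_grad()
def generate_images(tokenizer, dit, class_label, n=4):
    # `tokenizer` is a trained autoencoder like the ViTAutoencoder sketch above;
    # `dit` is a diffusion transformer trained on its latents (hypothetical handle).
    tokenizer.eval()
    latents = dit.sample(class_label, n)  # (n, n_tokens, latent_ch) -- assumed API
    return tokenizer.decode(latents)      # the frozen decoder maps latents to pixels
```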
Conclusion
The scaling of visual tokenizers is a complex topic with far-reaching implications for the reconstruction and generation of images and videos. The results show that naively scaling the tokenizer is not always the best strategy: bottleneck size, encoder capacity, and decoder capacity each affect reconstruction and generation differently. ViTok, a lightweight autoencoder built on these findings, offers a promising alternative to existing approaches and opens up new possibilities for more efficient and powerful generative AI models.
Bibliography:
- Philippe Hansen-Estruch et al., "Learnings from Scaling Visual Tokenizers for Reconstruction and Generation", arXiv:2501.09755 [cs.CV], 2025.
- Keyu Tian et al., "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction", arXiv:2404.02905v2 [cs.CV], 2024.
- Shengju Qian et al., "What Makes for Good Tokenizers in Vision Transformer?", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, pp. 13011-13023, Nov. 2023.
- https://vitok.github.io/
- Hugging Face Papers