GigaTok: Scaling Visual Tokenizers for Improved Autoregressive Image Generation

Large-Scale Visual Tokenizers: GigaTok Revolutionizes Autoregressive Image Generation
Autoregressive (AR) image generation has made enormous progress in recent years. A central component of this technology is the visual tokenizer, which converts images into compact, discrete tokens. These tokens serve as the basis for autoregressive models, which generate images by predicting one token after another. Previous research has shown that while scaling up visual tokenizers improves image reconstruction quality, it often degrades the quality of the generated images. A new model called GigaTok addresses this problem with an approach that improves image reconstruction, generation, and representation learning simultaneously as visual tokenizers are scaled up.
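To make the two-stage setup concrete, the following is a minimal, self-contained PyTorch sketch, not GigaTok's architecture: a toy tokenizer maps an image to discrete token ids via nearest-codebook lookup, and a small causal transformer is trained to predict each token from the previous ones. All module names, sizes, and the random data are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTokenizer(nn.Module):
    """Toy stage 1: image -> discrete token ids via a conv encoder + nearest-codebook lookup."""
    def __init__(self, codebook_size=64, dim=16):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=8, stride=8)   # 32x32 image -> 4x4 grid
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, img):
        feats = self.encoder(img).flatten(2).transpose(1, 2)         # (B, 16, dim)
        dists = (feats.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)  # (B, 16, K)
        return dists.argmin(-1)                                       # (B, 16) token ids

class ToyARModel(nn.Module):
    """Toy stage 2: predict the next token id from previous ones with a causal transformer."""
    def __init__(self, codebook_size=64, dim=32, seq_len=16):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, dim)
        self.pos = nn.Parameter(torch.zeros(seq_len, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, tokens):
        x = self.embed(tokens) + self.pos[: tokens.size(1)]
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.backbone(x, mask=mask))

tokenizer, ar_model = ToyTokenizer(), ToyARModel()
img = torch.randn(2, 3, 32, 32)
tokens = tokenizer(img)                               # stage 1: image -> discrete tokens
logits = ar_model(tokens[:, :-1])                     # stage 2: predict token t from tokens < t
loss = F.cross_entropy(logits.reshape(-1, 64), tokens[:, 1:].reshape(-1))
```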
The Challenge of Scaling
The growing complexity of the latent space when scaling tokenizers has been identified as the main cause of the dilemma between reconstruction and generation. GigaTok addresses this challenge with a technique called semantic regularization, which aligns the tokenizer's features with semantically consistent features from a pre-trained visual encoder. This constrains the complexity of the latent space during scaling and leads to simultaneous improvements in reconstruction and autoregressive generation.
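The idea behind semantic regularization can be illustrated with a small PyTorch sketch: intermediate tokenizer features are projected and pulled toward the features of a frozen pre-trained encoder with a cosine-alignment loss. The projection head, feature shapes, and the placeholder teacher features below are assumptions; GigaTok's exact choice of encoder, feature layer, and loss weighting may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def semantic_regularization_loss(tokenizer_feats, teacher_feats, proj):
    """Cosine-alignment loss between projected tokenizer features and frozen
    teacher features of shape (B, N, D_teacher); better alignment -> lower loss."""
    student = F.normalize(proj(tokenizer_feats), dim=-1)   # (B, N, D_teacher)
    teacher = F.normalize(teacher_feats, dim=-1)
    return 1.0 - (student * teacher).sum(-1).mean()        # 1 - mean cosine similarity

# Hypothetical shapes: 256 tokens, tokenizer width 512, teacher width 768.
proj = nn.Linear(512, 768)                 # learned projection head
tok_feats = torch.randn(4, 256, 512)       # features from the tokenizer
teacher_feats = torch.randn(4, 256, 768)   # placeholder standing in for frozen-encoder output

loss = semantic_regularization_loss(tok_feats, teacher_feats, proj)
# The total training objective would combine this term with the usual reconstruction and
# quantization losses, e.g. total = recon_loss + lambda_sem * loss (lambda_sem is a weight).
```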
Three Key Practices for Scaling Tokenizers
Building on semantic regularization, GigaTok follows three key practices for scaling tokenizers:
1. Using 1D tokenizers for better scalability
2. Prioritizing decoder scaling when expanding the encoder and decoder
3. Using an entropy loss to stabilize training of billion-scale tokenizers (a sketch of one such regularizer follows this list)
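For the entropy loss mentioned in point 3, a common formulation for vector-quantized codebooks penalizes uncertain per-token code assignments while rewarding diverse code usage across the batch. The sketch below shows this generic form; it is an assumption-based illustration rather than GigaTok's exact regularizer, and the logits, batch size, and codebook size are made up.

```python
import torch
import torch.nn.functional as F

def codebook_entropy_loss(logits):
    """Generic entropy regularizer for VQ codebooks: encourage each token's code
    assignment to be confident (low per-token entropy) while the batch as a whole
    uses many codes (high entropy of the average assignment distribution)."""
    probs = F.softmax(logits, dim=-1)                          # (B*N, K) soft code assignments
    log_probs = F.log_softmax(logits, dim=-1)
    per_token_entropy = -(probs * log_probs).sum(-1).mean()    # want this low
    avg_probs = probs.mean(0)                                  # average code usage over the batch
    usage_entropy = -(avg_probs * (avg_probs + 1e-8).log()).sum()  # want this high
    return per_token_entropy - usage_entropy

# Hypothetical example: 8 images x 256 tokens each, codebook of 1024 entries.
logits = torch.randn(8 * 256, 1024)   # e.g. negative distances to each codebook entry
loss = codebook_entropy_loss(logits)
```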
Results and Significance of GigaTok
According to the developers, by scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in image reconstruction, autoregressive generation, and the representation quality of the downstream autoregressive models. These results underscore the potential of GigaTok to significantly advance autoregressive image generation and to open up new possibilities in areas such as computer vision and creative image editing.
Outlook
GigaTok represents an important step in the development of powerful visual tokenizers. The combination of semantic regularization with the three scaling practices allows for a better balance between reconstruction and generation. Future research could focus on refining semantic regularization and exploring new scaling strategies to further increase the performance of visual tokenizers and unlock new applications.
Further Research and Development
The development of GigaTok is part of ongoing research on autoregressive image generation, in which the continuous improvement of visual tokenizers plays a crucial role. Future work could investigate alternative regularization methods and even more efficient scaling strategies to further improve the performance and applicability of autoregressive models.