Computational Limits of Visual Autoregressive Models


Visual generative models have made enormous progress in recent years, finding applications in areas such as image enhancement, augmented reality, medical diagnostics, and creative tools. Methods like Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion models have significantly expanded the possibilities of image generation, improving realism, diversity, and overall quality.

VAR models represent a notable paradigm shift in image generation. Instead of the conventional "next-token prediction," VAR models introduce a coarse-to-fine "next-scale prediction" approach. This allows autoregressive transformers to learn visual distributions more efficiently and outperform diffusion-based alternatives. Moreover, VAR models demonstrate robust zero-shot capabilities in tasks like image inpainting and editing.
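To make the coarse-to-fine idea concrete, here is a minimal sketch of "next-scale prediction": the model autoregressively emits entire token maps at increasing resolutions, each conditioned on all coarser maps, rather than emitting one token at a time. The `predict_scale` stand-in, the scale schedule, and the codebook size 4096 are illustrative assumptions, not the real VAR architecture.

```python
import numpy as np

def predict_scale(context_tokens, side):
    """Stand-in for the VAR transformer: predicts a side x side token map
    conditioned on all previously generated (coarser) scales."""
    # A real model would run attention over the flattened context tokens;
    # here we just draw indices from an assumed codebook of size 4096.
    rng = np.random.default_rng(len(context_tokens))
    return rng.integers(0, 4096, size=(side, side))

def generate(scales=(1, 2, 4, 8, 16)):
    """Coarse-to-fine generation: 1x1 -> 2x2 -> ... -> 16x16 token maps."""
    context = []
    for side in scales:
        token_map = predict_scale(context, side)
        context.append(token_map)  # the next scale conditions on all prior scales
    return context                 # the final map would feed the VQVAE decoder

maps = generate()
print([m.shape for m in maps])  # [(1, 1), (2, 2), (4, 4), (8, 8), (16, 16)]
```

One scale's prediction replaces many sequential next-token steps, which is where the efficiency gain over conventional token-by-token autoregression comes from.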

Despite their strengths, there is a critical need to investigate the computational limits of VAR models and to develop efficient algorithms. While VAR models improve on previous autoregressive methods, their computational complexity of O(n⁴) remains a challenge, where n is the height and width of the final (largest) VQ code map: that map contains n² tokens, and self-attention over n² tokens requires O((n²)²) = O(n⁴) operations. This work investigates these computational limits and the potential for more efficient VAR algorithms.
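The quartic scaling follows from a simple count, sketched below (illustrative arithmetic only; constants and lower-order terms are ignored):

```python
def attention_pairs(n):
    """Pairwise query-key interactions for plain self-attention over the
    largest n x n VQ code map: n^2 tokens, hence (n^2)^2 = n^4 pairs."""
    tokens = n * n
    return tokens * tokens

for n in (16, 32, 64):
    print(n, attention_pairs(n))
# doubling n multiplies the attention cost by 16 (= 2^4)
```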

A central question is whether VAR model computations can be performed faster than O(n⁴). This work answers in the affirmative and identifies the conditions under which VAR computation achieves truly sub-quartic, in fact almost quadratic, time complexity.

A crucial factor is the norm of the input matrices in the attention mechanism of VAR models. A critical threshold for this norm is established: above it, assuming the Strong Exponential Time Hypothesis (SETH) from fine-grained complexity theory, no sub-quartic-time algorithm for VAR models can exist.

Specifically, two results are presented. First, if the hidden dimension d = O(log n) and the entries of the input matrices are bounded in magnitude by R = o(log n), there exists an algorithm with runtime O(n²⁺ᵒ⁽¹⁾) that approximates the output of the VAR model up to an additive error of 1/poly(n). Second, if d = O(log n) and R = Θ(log n), it is impossible, assuming SETH, to approximate the output of the VAR model up to an additive error of 1/poly(n) in truly sub-quartic time O(n⁴⁻ᵟ) for any constant δ > 0.

These results are based on a detailed analysis of the three main components of VAR models: the VAR transformer, the feature map reconstruction block, and the VQVAE decoder. By leveraging low-rank approximations guided by the derived criteria, efficient constructions are presented that support the theoretical results.
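The low-rank mechanism behind the fast regime can be sketched as follows: when the entries of the attention inputs are small, exp(QKᵀ) is well approximated entrywise by a truncated Taylor series, so the huge attention matrix factors as U @ Wᵀ with few columns and never needs to be materialized. The code below is a hedged toy for hidden dimension d = 1 only; the degree-8 truncation and the input ranges are illustrative assumptions, not the paper's construction.

```python
import math
import numpy as np

def low_rank_attention(Q, K, V, degree=8):
    """Approximate (softmax-free) attention exp(Q K^T) @ V for d = 1 via a
    truncated Taylor series: exp(qk) = sum_p (q^p / p!) * k^p.
    Cost is linear in the number of tokens T, not quadratic."""
    U = np.stack([Q[:, 0] ** p / math.factorial(p) for p in range(degree)], axis=1)
    W = np.stack([K[:, 0] ** p for p in range(degree)], axis=1)
    # (T x deg) @ (deg x dv): the T x T attention matrix is never formed.
    return U @ (W.T @ V)

rng = np.random.default_rng(0)
T = 100
Q = rng.uniform(-0.5, 0.5, (T, 1))  # small entries: the "low-norm" regime
K = rng.uniform(-0.5, 0.5, (T, 1))
V = rng.normal(size=(T, 4))
exact = np.exp(Q @ K.T) @ V
approx = low_rank_attention(Q, K, V)
print(np.max(np.abs(exact - approx)))  # small additive error
```

When the entries grow to the Θ(log n) threshold, this polynomial truncation ceases to give a small additive error with low degree, which is exactly where the hardness result takes over.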

This work makes a significant contribution to the understanding of the computational efficiency of VAR models from a theoretical perspective. The presented techniques offer promising approaches for developing scalable and efficient image generation methods within the framework of VAR models. They pave the way for future research on optimizing VAR algorithms and improving their practical applicability.
