LLaVA-Mini: Efficient Multimodal Model Achieves High Performance with Single Vision Token

Efficient Multimodal Models for Image and Video Processing: LLaVA-Mini and the Single Vision Token Concept

The development of large multimodal models (LMMs) such as GPT-4 has driven strong interest in efficient solutions in this area. Traditional LMMs encode visual inputs into so-called vision tokens (continuous representations) and feed these, together with the text instruction, into the context of a large language model (LLM). The large number of parameters and, above all, the large number of context tokens, particularly vision tokens, lead to significant computational costs. Previous approaches to increasing the efficiency of LMMs have mainly focused on replacing the LLM backbone with smaller models, while the crucial question of how many tokens enter the context has often been neglected.
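To make the token problem concrete, the following is a minimal sketch, in PyTorch, of how a conventional LLaVA-style model builds its input: all patch features from the vision encoder are projected into the LLM embedding space and concatenated with the text embeddings. The class name, dimensions, and the figure of 576 patches (a 24x24 grid) are illustrative assumptions, not the exact implementation of any specific model.

```python
# Illustrative sketch of the conventional LLaVA-style input construction.
# Module names and dimensions are assumptions, not the authors' exact code.
import torch
import torch.nn as nn

class NaiveMultimodalInput(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, num_patches=576):
        super().__init__()
        self.num_patches = num_patches
        # Projects vision-encoder patch features into the LLM embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features, text_embeds):
        # patch_features: (batch, 576, vision_dim) from a CLIP-style encoder
        # text_embeds:    (batch, T, llm_dim) embedded text instruction
        vision_tokens = self.projector(patch_features)          # (batch, 576, llm_dim)
        # All 576 vision tokens enter the LLM context alongside the text,
        # so the sequence length is dominated by the image.
        return torch.cat([vision_tokens, text_embeds], dim=1)   # (batch, 576 + T, llm_dim)

x = NaiveMultimodalInput()(torch.randn(1, 576, 1024), torch.randn(1, 32, 4096))
print(x.shape)  # torch.Size([1, 608, 4096])
```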

LLaVA-Mini takes a different approach: it is an efficient LMM that operates with a minimal number of vision tokens. To achieve a high compression ratio while preserving visual information, an analysis was first conducted of how LMMs actually process vision tokens. It showed that most vision tokens matter primarily in the early layers of the LLM backbone, where their visual information is fused into the text tokens. Based on this observation, LLaVA-Mini introduces a so-called modality pre-fusion: visual information is merged into the text tokens before they enter the LLM. This allows the vision tokens passed to the LLM backbone to be compressed extremely aggressively, ultimately down to a single token per image.
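One plausible reading of this pre-fusion idea is sketched below: before the LLM backbone, the text tokens cross-attend to the full set of vision tokens so that the visual detail they need is already absorbed into them. The number of layers, the dimensions, and the use of plain multi-head attention are assumptions for illustration; the actual pre-fusion module in LLaVA-Mini may be structured differently.

```python
# Hedged sketch of modality pre-fusion: text tokens absorb visual information
# *before* the LLM backbone, so only a heavily compressed vision representation
# must enter the LLM context. Layer count and dimensions are assumptions.
import torch
import torch.nn as nn

class ModalityPreFusion(nn.Module):
    def __init__(self, dim=4096, num_heads=8, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, text_embeds, vision_tokens):
        # text_embeds:   (batch, T, dim)  - embedded instruction tokens
        # vision_tokens: (batch, N, dim)  - projected patch features, e.g. N = 576
        fused = text_embeds
        for attn in self.layers:
            # Text queries attend to all vision tokens, mimicking what the
            # early LLM layers would otherwise have to do.
            delta, _ = attn(query=fused, key=vision_tokens, value=vision_tokens)
            fused = fused + delta
        return fused  # text tokens enriched with visual information
```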

LLaVA-Mini is a universal, large multimodal model that can efficiently process images, high-resolution images, and videos. Tests with eleven image-based and seven video-based benchmarks show that LLaVA-Mini, with only one vision token, surpasses the performance of LLaVA-v1.5, which operates with 576 tokens. Efficiency analyses show that LLaVA-Mini reduces the number of floating-point operations (FLOPs) by 77%, achieves latencies of under 40 milliseconds, and can process over 10,000 video frames on a GPU with 24 GB of memory.
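A rough back-of-envelope calculation illustrates why shrinking 576 vision tokens to one pays off so strongly. The assumed instruction length of 64 tokens is arbitrary, and this is not the paper's exact FLOP accounting.

```python
# Back-of-envelope illustration of the context-length reduction per image.
text_tokens = 64                      # assumed instruction length
baseline_ctx = 576 + text_tokens      # LLaVA-v1.5: 576 vision tokens per image
mini_ctx = 1 + text_tokens            # LLaVA-Mini: 1 vision token per image

print(f"context shrinks by {1 - mini_ctx / baseline_ctx:.0%}")              # ~90%
# Self-attention cost grows roughly quadratically with context length:
print(f"attention cost shrinks by {1 - (mini_ctx / baseline_ctx) ** 2:.0%}")  # ~99%
```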

The architecture of LLaVA-Mini rests on two components. A compression module, built on cross-attention with learnable compression queries, reduces the number of vision tokens. The modality pre-fusion integrates visual information into the text tokens early on, which removes the need for a large number of vision tokens inside the LLM backbone: a single token per image suffices, sharply cutting computational cost and memory requirements.
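The compression step can be pictured as a small set of learnable queries that cross-attend to all vision tokens, in the spirit of a Perceiver-style resampler. The sketch below shows the extreme case of a single query producing a single vision token; the hyperparameters and module layout are assumptions, not the released LLaVA-Mini code.

```python
# Sketch of a query-based compression module: learnable queries cross-attend
# to all vision tokens and return that many compressed tokens (here, one).
import torch
import torch.nn as nn

class VisionTokenCompressor(nn.Module):
    def __init__(self, dim=4096, num_heads=8, num_queries=1):
        super().__init__()
        # Learnable compression queries, e.g. one query -> one vision token.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_tokens):
        # vision_tokens: (batch, N, dim), e.g. N = 576 projected patch features
        b = vision_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)   # (batch, num_queries, dim)
        compressed, _ = self.cross_attn(q, vision_tokens, vision_tokens)
        return compressed                                 # (batch, num_queries, dim)

compressor = VisionTokenCompressor(num_queries=1)
out = compressor(torch.randn(2, 576, 4096))
print(out.shape)  # torch.Size([2, 1, 4096])
```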

The advantages of LLaVA-Mini are manifold. The reduced computational cost enables faster response times. The lower memory requirement allows for the processing of longer videos and larger images. The performance of LLaVA-Mini is comparable to, or even better than, that of models operating with a significantly higher number of vision tokens. These efficiency improvements open up new possibilities for the use of LMMs in resource-constrained environments, such as on mobile devices or in real-time applications.

The development of LLaVA-Mini represents an important step towards more efficient LMMs. By focusing on reducing the number of tokens rather than only shrinking the LLM backbone, it opens a new path for optimizing performance and resource consumption. Future research could focus on further improving the compression methods and on adapting the approach to other LMM architectures. LLaVA-Mini demonstrates the potential of efficient yet powerful multimodal models and could form the basis for future innovations in this field.
