OTTER: A Novel Vision-Language-Action Model for Robotics

OTTER: A New Approach for Vision-Language-Action Models
Robotics is currently experiencing rapid progress, driven by increasingly powerful AI models. One particularly promising direction is Vision-Language-Action (VLA) models, which generate robot actions from visual observations and language instructions. A novel approach in this field is OTTER (A Vision-Language-Action Model with Text-Aware Visual Feature Extraction), which offers an efficient and effective way to control robots through natural language and visual information.
Challenges of Conventional VLA Models
Existing VLA models are typically built by fine-tuning pre-trained Vision-Language Models (VLMs), with visual and linguistic features fed independently into downstream policies. This degrades the semantic alignments established during pre-training, and fine-tuning such large models is resource-intensive and requires extensive datasets.
OTTER's Innovative Approach: Text-Aware Visual Feature Extraction
OTTER takes a different path. Instead of processing all visual features, it selectively extracts only the task-relevant visual features that are semantically aligned with the language instruction and passes them to a policy transformer. This allows the pre-trained vision-language encoders to remain frozen, preserving the rich semantic knowledge gained during large-scale pre-training and putting it to effective use.
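To make this concrete, the sketch below shows one way such text-aware feature extraction could be wired up in PyTorch. It is not the authors' implementation: the module names, the embedding dimension, and the use of cross-attention (text tokens attending to image patches from a frozen encoder) are illustrative assumptions; the official code on the project page is the authoritative reference.

```python
import torch
import torch.nn as nn


class TextAwareVisualExtractor(nn.Module):
    """Illustrative sketch: pool frozen visual patch features using frozen
    text-token features as queries, so only instruction-relevant visual
    information reaches the policy. Dimensions are hypothetical."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        # Cross-attention: text tokens attend to image patches.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens:  (B, T, dim)  frozen CLIP text-token embeddings
        # patch_tokens: (B, P, dim)  frozen CLIP image-patch embeddings
        fused, _ = self.cross_attn(query=text_tokens, key=patch_tokens, value=patch_tokens)
        return fused  # (B, T, dim) task-relevant visual features


class SmallPolicyTransformer(nn.Module):
    """A small trainable policy head that maps fused tokens to an action."""

    def __init__(self, dim: int = 512, action_dim: int = 7, n_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.encoder(tokens)                 # (B, T, dim)
        return self.action_head(h.mean(dim=1))   # (B, action_dim), e.g. end-effector deltas


if __name__ == "__main__":
    B, T, P, D = 2, 12, 196, 512
    text_tokens = torch.randn(B, T, D)    # would come from a frozen CLIP text encoder
    patch_tokens = torch.randn(B, P, D)   # would come from a frozen CLIP vision encoder

    extractor = TextAwareVisualExtractor(dim=D)
    policy = SmallPolicyTransformer(dim=D)

    actions = policy(extractor(text_tokens, patch_tokens))
    print(actions.shape)  # torch.Size([2, 7])
```

Because the encoder outputs are treated as fixed inputs, only the cross-attention and the small policy transformer receive gradients during training, which is what keeps the trainable footprint small.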
Advantages of OTTER
A key advantage of OTTER is its strong zero-shot generalization: the model can handle new objects and environments without having been explicitly trained on them, which is made possible by preserving the pre-trained semantic alignments. OTTER also requires significantly less training data and compute than other VLA models. Its architecture, a frozen, pre-trained CLIP model (400 million parameters) combined with a relatively small policy network (approximately 20-30 million parameters), allows training on a single workstation within 12 hours.
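The sketch below illustrates this training setup under stated assumptions: a generic frozen Transformer stands in for the CLIP encoder, and a stand-in policy head is the only module whose parameters are handed to the optimizer. The sizes and module choices are placeholders, not OTTER's actual configuration.

```python
import torch
import torch.nn as nn


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients so pre-trained weights stay fixed during training."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()


def count_params(module: nn.Module, trainable_only: bool = False) -> int:
    """Count parameters, optionally restricted to those that will be updated."""
    return sum(p.numel() for p in module.parameters()
               if p.requires_grad or not trainable_only)


# Stand-in for a frozen CLIP-style vision-language encoder (weights never updated).
encoder = freeze(nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=12))

# Stand-in for the small trainable policy network.
policy = nn.Sequential(
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=4),
    nn.Linear(512, 7),  # e.g. a 7-DoF end-effector action
)

# Only the policy parameters are passed to the optimizer; the encoder is untouched.
optimizer = torch.optim.AdamW(
    (p for p in policy.parameters() if p.requires_grad), lr=3e-4)

print(f"frozen encoder parameters:   {count_params(encoder):,}")
print(f"trainable policy parameters: {count_params(policy, trainable_only=True):,}")
```

Keeping the large encoder out of the optimizer is what makes single-workstation training feasible: gradients and optimizer state are only needed for the few tens of millions of policy parameters.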
Experimental Results
OTTER has demonstrated strong performance in both simulated and real-world experiments, significantly outperforming existing VLA models and showing impressive zero-shot generalization to new objects and environments. These results underscore OTTER's potential for real-world robotics applications.
Future Prospects
OTTER opens up new possibilities for the development of robust and flexible robot systems. The ability to derive complex actions from visual and linguistic information is an important step towards more intuitive and efficient human-robot interaction. The open-source availability of the code, checkpoints, and dataset allows the research community to build upon OTTER's results and further advance the development of VLA models. The promising results suggest that text-aware visual feature extraction will be a key concept for future VLA models.
Bibliography: Huang, H., Liu, F., Fu, L., Wu, T., Mukadam, M., Malik, J., Goldberg, K., & Abbeel, P. (2025). OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction. arXiv preprint arXiv:2503.03734. https://arxiv.org/abs/2503.03734 Project page: https://ottervla.github.io/