FAST and FAST+: Efficient Action Tokenization for Vision-Language-Action Models

Robotics is currently undergoing a profound transformation, driven by advances in artificial intelligence. Vision-Language-Action models (VLAs) play a central role in this shift: they enable robots to perform complex tasks by linking visual observations, language instructions, and actions. A crucial factor in the efficiency and performance of VLAs is action tokenization, i.e., the conversion of continuous motion sequences into discrete symbols that the models can process.

The Challenge of Action Tokenization

Traditional methods for action tokenization, which discretize each action dimension independently at every timestep, reach their limits with complex, high-frequency robot motions. At high control frequencies, consecutive actions are strongly correlated, so per-timestep binning produces long, highly redundant token sequences in which each token carries little new information. As a result, these schemes cannot adequately capture the subtleties of dexterous manipulation, and for tasks that demand high precision and fast reactions, such as folding laundry or grasping small objects, they often fail entirely.
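
To make this baseline concrete, here is a minimal sketch of such dimension-wise binning; the chunk length, action dimensionality, and bin count are illustrative assumptions, not values from the FAST paper.

```python
import numpy as np

# Hypothetical action chunk: 50 timesteps x 7 DoF, normalized to [-1, 1].
actions = np.random.uniform(-1.0, 1.0, size=(50, 7))

# Naive dimension-wise discretization: one token per (timestep, dimension),
# mapping each value to one of 256 uniform bins.
n_bins = 256
edges = np.linspace(-1.0, 1.0, n_bins - 1)
tokens = np.digitize(actions, edges)

# 50 * 7 = 350 tokens for a single one-second chunk at 50 Hz, most of them
# nearly identical to their temporal neighbors.
print(tokens.shape, tokens.size)
```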

FAST: A New Approach to Action Tokenization

To address this challenge, FAST (Frequency-space Action Sequence Tokenization) was developed: a novel action tokenization method based on the Discrete Cosine Transform (DCT). Much like JPEG image compression, FAST uses the DCT to compress action sequences efficiently and convert them into a sequence of discrete tokens. These tokens represent the actions in frequency space, preserving the important structure of the motion patterns while substantially reducing the amount of data.
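
The following sketch illustrates the core idea under stated assumptions: a DCT is applied along the time axis of an action chunk, and the coefficients are quantized so that near-zero high-frequency components drop out. The chunk shape and quantization scale are hypothetical hyperparameters, not the values used by FAST.

```python
import numpy as np
from scipy.fft import dct

# Smooth synthetic action chunk: 50 timesteps x 7 DoF
# (random-walk trajectories stand in for real robot motions).
rng = np.random.default_rng(0)
actions = np.cumsum(0.02 * rng.standard_normal((50, 7)), axis=0)

# DCT along the time axis of each action dimension: for smooth motions,
# most of the signal energy lands in the low-frequency coefficients.
coeffs = dct(actions, axis=0, norm="ortho")

# Quantize by scaling and rounding, so small high-frequency coefficients
# collapse to zero and can be compressed away downstream.
scale = 10.0  # hypothetical precision/compression trade-off
quantized = np.round(coeffs * scale).astype(int)

print((quantized != 0).sum(), "nonzero of", quantized.size, "coefficients")
```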

By combining the DCT with byte-pair encoding (BPE), a compression algorithm widely used in training large language models, FAST achieves even higher efficiency. BPE merges frequently co-occurring coefficient patterns into single vocabulary entries, so the resulting tokens are densely packed and complex action sequences can be represented with far fewer tokens than conventional methods require.
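
Below is a minimal sketch of this second stage, assuming the quantized DCT coefficients from the previous sketch have been flattened into per-chunk strings; the real tokenizer is trained on large action corpora, so the toy corpus and vocabulary size here are purely illustrative.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy corpus: each "sentence" is the flattened, quantized DCT coefficient
# sequence of one action chunk. Low-frequency structure makes patterns repeat.
corpus = ["12 -3 0 0 1 0", "12 -3 1 0 0 0", "9 4 -1 0 0 0", "12 -3 0 0 0 0"]

bpe = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=64, special_tokens=["[UNK]"])
bpe.train_from_iterator(corpus, trainer)

# Frequent coefficient patterns are merged into single tokens, shortening
# the sequence the VLA backbone must autoregressively predict.
print(bpe.encode("12 -3 0 0 1 0").tokens)
```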

FAST+: A Universal Action Tokenizer

Building on FAST, FAST+ was developed: a universal action tokenizer trained on a large corpus of over one million real-world robot action sequences. FAST+ can be used as a black-box tokenizer for a wide variety of robot embodiments, action spaces, and control frequencies. This significantly simplifies the development and training of VLA models, since the tokenization no longer has to be adapted by hand to each robot platform.
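
The released tokenizer is published on the Hugging Face Hub as physical-intelligence/fast. The sketch below follows the usage pattern shown on that model card; the exact call signatures may differ between releases, so treat it as illustrative rather than authoritative.

```python
import numpy as np
from transformers import AutoProcessor

# Load FAST+ from the Hub; trust_remote_code is needed because the
# tokenizer ships as custom code alongside the checkpoint.
tokenizer = AutoProcessor.from_pretrained(
    "physical-intelligence/fast", trust_remote_code=True
)

# A batch of one action chunk: 50 timesteps x 7 DoF, normalized to [-1, 1].
chunk = np.random.uniform(-1.0, 1.0, size=(1, 50, 7))

tokens = tokenizer(chunk)  # per-chunk lists of discrete token ids
# Decoding requires the original chunk shape (argument names as on the
# model card; verify against the current release).
recovered = tokenizer.decode(tokens, time_horizon=50, action_dim=7)

print(len(tokens[0]), "tokens for", chunk[0].size, "continuous values")
```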

Advantages of FAST and FAST+

The use of FAST and FAST+ offers numerous advantages for the training and application of VLA models:

- More efficient representation of actions: compressing the action sequences reduces the amount of data, leading to faster training and more efficient inference.
- Higher precision and dexterity: the DCT-based tokenization captures complex motion sequences more accurately, resulting in more precise and dexterous robot actions.
- Scalability: FAST and FAST+ enable the training of VLA models on large datasets, further improving model performance.
- Simplified development: a universal tokenizer such as FAST+ reduces the development effort for VLA models.

Application Examples and Future Prospects

FAST and FAST+ have already achieved impressive results in robotics. VLAs trained with FAST have successfully performed complex tasks such as folding laundry, clearing tables, and packing groceries. Moreover, FAST-based models have been shown to generalize to new environments and to follow language instructions in real time.

The development of FAST and FAST+ represents an important step towards a future where robots are able to handle complex tasks in the real world. Efficient action tokenization paves the way for more powerful and versatile VLA models, which have the potential to fundamentally change robotics and open up new fields of application.
