Query-Oriented Token Assignment Improves Long Video Comprehension

Efficient Understanding of Long Videos through Targeted Token Assignment
Understanding long videos poses a particular challenge for artificial intelligence: the sheer volume of visual data demands efficient processing methods that make the best use of computing resources without sacrificing accuracy. A promising way to address this challenge is to reduce visual redundancy by selectively keeping only the relevant image segments, represented as so-called "tokens". Existing methods often rely on post-hoc token pruning in the decoder layers, but they frequently neglect the semantic relationship between the visual tokens and the actual task, which is defined by a query.
A novel approach, QuoTA (Query-oriented Token Assignment), closes this gap by performing token assignment at the input layer, weighting visual information by its relevance to the query at hand. This query-oriented selection matters because it adapts the visual processing to the specific requirements of the task, making the most of the token budget while preserving semantically relevant content.
How QuoTA Works
QuoTA evaluates the importance of individual frames relative to the query, enabling a one-time assignment of visual tokens before the cross-modal interactions in the decoder layers. This contrasts with conventional methods, which prune tokens only after processing in the decoder layers. By weighting relevant frames early, QuoTA reduces the computational load and increases the model's efficiency.
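To make the idea concrete, here is a minimal sketch of query-conditioned token budgeting in PyTorch. The embedding source, the cosine-similarity scoring, and the proportional allocation scheme are illustrative assumptions, not the exact procedure from the paper.

```python
# Minimal sketch: distribute a fixed visual-token budget across frames by
# query relevance, before any decoder-layer processing. Assumes frame and
# query embeddings already exist (e.g. from a CLIP-style encoder); the
# scoring and allocation scheme are illustrative, not QuoTA's exact method.
import torch
import torch.nn.functional as F

def assign_token_budget(frame_embeds: torch.Tensor,
                        query_embed: torch.Tensor,
                        total_budget: int) -> torch.Tensor:
    """frame_embeds: (num_frames, dim), query_embed: (dim,).
    Returns (num_frames,) integer token counts summing to total_budget."""
    # Cosine similarity between each frame and the query.
    scores = F.cosine_similarity(frame_embeds, query_embed.unsqueeze(0), dim=-1)
    # Normalize scores to a weight distribution over frames.
    weights = torch.softmax(scores, dim=0)
    # Allocate tokens proportionally; relevant frames keep more tokens.
    budget = torch.floor(weights * total_budget).long()
    # Hand any rounding remainder to the highest-scoring frames.
    remainder = total_budget - int(budget.sum())
    if remainder > 0:
        top = torch.topk(weights, remainder).indices
        budget[top] += 1
    return budget

# Example: 8 sampled frames, 256-dim embeddings, a budget of 1024 tokens.
frames = torch.randn(8, 256)
query = torch.randn(256)
print(assign_token_budget(frames, query, 1024))  # e.g. tensor([131, 127, ...])
```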
Another important aspect of QuoTA is query decoupling via Chain-of-Thought (CoT) reasoning. This technique allows the underlying Large Video-Language Model (LVLM) to assess frame importance more precisely: by decomposing the query into a chain of intermediate reasoning steps, the model can better capture the semantic relationship between the query and the visual content.
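The decoupling step can be pictured as a prompt that asks a language model to break the question into concrete visual cues before any frame is scored. The prompt wording and the `llm` callable below are hypothetical placeholders, not the paper's actual prompt.

```python
# Hedged illustration of CoT-style query decoupling: the query is broken
# into intermediate visual cues that frames are then scored against.
# DECOUPLE_PROMPT and the `llm` callable are assumptions for illustration.
DECOUPLE_PROMPT = """Break the following video question into the key visual
cues one must look for, step by step, before answering.

Question: {query}
Visual cues (one per line):"""

def decouple_query(query: str, llm) -> list[str]:
    """Ask an LLM to decompose the query into concrete visual cues."""
    response = llm(DECOUPLE_PROMPT.format(query=query))
    # Treat each non-empty line of the response as one cue.
    return [line.strip("- ").strip() for line in response.splitlines()
            if line.strip()]
```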
QuoTA is designed to be plug-and-play, so existing LVLMs can be extended with it easily. Integrating QuoTA into established models requires no fundamental changes to the architecture and yields a seamless performance improvement.
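Conceptually, the integration is a thin layer between the frame sampler and an unchanged LVLM. The `lvlm` interface below (`embed_frames`, `embed_text`, `tokenize_frames`, `generate`) is invented for illustration and reuses the two sketches above; it is not an actual library API.

```python
# Conceptual sketch of the plug-and-play idea: QuoTA-style budgeting sits
# in front of a frozen video-language model. All method names on `lvlm`
# are hypothetical; assign_token_budget and decouple_query are the
# illustrative sketches defined earlier.
class QuoTAWrapper:
    def __init__(self, lvlm, llm, total_budget: int = 1024):
        self.lvlm = lvlm            # any existing LVLM, left unchanged
        self.llm = llm              # used only for CoT query decoupling
        self.total_budget = total_budget

    def answer(self, frames, query: str) -> str:
        # 1) Decouple the query into visual cues via CoT prompting.
        cues = decouple_query(query, self.llm)
        # 2) Score frames against the decoupled query and split the budget.
        frame_embeds = self.lvlm.embed_frames(frames)        # (N, dim)
        query_embed = self.lvlm.embed_text(" ".join(cues))   # (dim,)
        budget = assign_token_budget(frame_embeds, query_embed,
                                     self.total_budget)
        # 3) Keep each frame's assigned share of tokens, then run the
        #    unchanged decoder on the reweighted token sequence.
        tokens = self.lvlm.tokenize_frames(frames, per_frame_budget=budget)
        return self.lvlm.generate(tokens, query)
```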
Experimental Results and Outlook
Extensive tests have shown that implementing QuoTA in combination with LLaVA-Video-7B leads to an average performance increase of 3.2% across six benchmarks (including Video-MME and MLVU) while using the same visual token budget as the base model. These results underscore the potential of QuoTA for improving the understanding of long videos.
The targeted assignment of tokens based on relevance to the query represents a promising approach for the efficient processing of long videos. QuoTA offers an elegant solution that improves the performance of existing LVLMs without fundamentally changing the architecture. Future research could focus on further optimizing query decoupling and adapting to different video domains.
Bibliography:
- Luo, Y., et al. "QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension." arXiv preprint arXiv:2503.08689 (2025).
- https://huggingface.co/papers/2503.08689