Evaluating Real-Time Video Understanding with OVO-Bench

Introduction

The rapid development of large language models (LLMs) has led to impressive advances in video understanding in recent years. Models can now interpret complex scenes, answer questions about video content, and even summarize entire videos. One crucial capability, however, remains a challenge: processing video streams in real time, known as online video understanding. This article outlines the difficulties video LLMs face in this setting and introduces OVO-Bench, a new benchmark developed specifically to evaluate these capabilities.

The Importance of the Time Factor

The main difference between offline and online video understanding lies in when a question is asked. Offline models analyze the entire video before generating an answer. Online models, by contrast, must process the video stream incrementally and adapt their answers to the specific moment at which the question is posed. This temporal awareness is essential for understanding videos in real time, as the sketch below illustrates.
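To make the distinction concrete, here is a minimal Python sketch, assuming a hypothetical `video_llm.answer(frames, question)` interface and a fixed frame rate; it illustrates the two settings and is not OVO-Bench's actual API:

```python
from typing import List

def answer_offline(video_llm, frames: List, question: str) -> str:
    """Offline setting: the model sees the entire video before answering."""
    return video_llm.answer(frames, question)

def answer_online(video_llm, frames: List, question: str,
                  query_time_s: float, fps: float) -> str:
    """Online setting: only frames up to the moment the question is asked
    are visible, so the answer must reflect that specific point in time."""
    visible = frames[: int(query_time_s * fps)]
    return video_llm.answer(visible, question)
```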

OVO-Bench: A New Benchmark for Online Video LLMs

Previous benchmarks have not adequately captured this temporal awareness. To close this gap, OVO-Bench (Online-VideO-Benchmark) was developed. The benchmark explicitly focuses on the ability of video LLMs to understand events and answer questions tied to specific points in time within a video.

Three Scenarios for Evaluation

OVO-Bench tests video LLMs in three different scenarios:

* **Backward Tracking:** The model must refer back to past events to answer the question.
* **Real-Time Understanding:** The model must interpret and respond to events occurring at the current moment in the video.
* **Forward-Looking Active Response:** The model may delay its answer until enough future information has arrived to answer the question accurately (see the sketch after this list).

Together, these scenarios cover a wide range of requirements placed on online video LLMs.
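The third scenario is the least conventional, so here is a hedged sketch of how delayed answering could be driven; `video_llm.try_answer` (returning `None` for "not yet") is an assumed interface for illustration, not part of OVO-Bench:

```python
from typing import List, Optional

def forward_active_answer(video_llm, frames: List, question: str,
                          query_time_s: float, fps: float,
                          step_s: float = 1.0) -> Optional[str]:
    """Forward-looking active response, sketched: the stream keeps advancing
    past the query time, and the model may defer its answer until it judges
    that enough future evidence has accumulated."""
    t = query_time_s
    answer: Optional[str] = None
    while answer is None and int(t * fps) <= len(frames):
        visible = frames[: int(t * fps)]
        answer = video_llm.try_answer(visible, question)  # None means "not yet"
        t += step_s
    return answer
```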

Structure and Scope of OVO-Bench

OVO-Bench comprises 12 different tasks with 644 unique videos and approximately 2,800 detailed, human-curated meta-annotations with precise timestamps. The combination of automated generation and manual review ensures high data quality. A dedicated evaluation pipeline queries video LLMs systematically along the video timeline, as sketched below.
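A simplified version of such a timeline-based evaluation loop could look like the following; the sample field names and the `video_llm.answer` call are assumptions for illustration, not the benchmark's published code:

```python
def evaluate(video_llm, samples, fps: float = 1.0) -> float:
    """Query the model at each annotated timestamp and score its answers.

    Each sample is assumed to provide the decoded frames, the question,
    the timestamp at which it is asked, and the ground-truth answer.
    """
    correct = 0
    for sample in samples:
        # Cut the stream off at the annotated timestamp before querying.
        visible = sample["frames"][: int(sample["query_time_s"] * fps)]
        prediction = video_llm.answer(visible, sample["question"])
        correct += int(prediction.strip() == sample["answer"].strip())
    return correct / len(samples)
```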

Current Challenges for Video LLMs

Initial evaluations of nine video LLMs on OVO-Bench show that current models, despite strong results on traditional offline benchmarks, still struggle with online video understanding. A significant gap remains between model performance and that of human subjects.

Outlook and Future Research

OVO-Bench is intended to advance the development of video LLMs and inspire future research in the field of online video understanding. The development of models capable of capturing the temporal context of video content in real time is an important step towards comprehensive AI-powered video analysis. For companies like Mindverse, which develop customized AI solutions, OVO-Bench offers a valuable resource for evaluating and optimizing their technologies.

Bibliography

  • Li, Y. et al. (2025). OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? arXiv preprint arXiv:2501.05510.
  • Hong, W. et al. (2025). MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models. arXiv preprint arXiv:2501.02955.
  • PKU-YuanGroup. (n.d.). Video-Bench. GitHub repository. https://github.com/PKU-YuanGroup/Video-Bench
  • Huang, Z. et al. (2025). Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method. arXiv preprint arXiv:2501.00584.
  • Bouamor, H., Pino, J., & Bali, K. (Eds.). (2023). Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics.
  • EgoAlpha. (n.d.). prompt-in-context-learning. GitHub repository. https://github.com/EgoAlpha/prompt-in-context-learning/blob/main/historynews.md
  • Technologie Campus Grafenau. (n.d.). Publications. https://thd-web-lb-ext.th-deg.de/en/research/technology-campuses/tc-grafenau/tc-grafenau-publications
  • Kanazawa, N. et al. (2024). YOLO-World: Real-Time Open-Vocabulary Object Detection. https://www.researchgate.net/publication/384182925_YOLO-World_Real-Time_Open-Vocabulary_Object_Detection
  • International Conference on Robotics and Automation (ICRA). (2023). ICRA@40 Booklet. https://icra40.ieee.org/wp-content/uploads/sites/661/ICRA@40-Booklet-Final.pdf