Tarsier2: A New Standard in Video Understanding and Detailed Description
From Detailed Video Descriptions to Comprehensive Video Understanding: Tarsier2 Sets New Standards
The development of large multimodal models that can process both visual and linguistic information is advancing rapidly. A particularly demanding task in this area is detailed video description, which goes far beyond simple image captioning and requires a deep understanding of the video content. With Tarsier2, a research team now presents a Large Vision-Language Model (LVLM) that sets new standards in automated video description while demonstrating remarkable capabilities in general video understanding.
Three-Stage Upgrade for Improved Performance
Tarsier2 builds upon its predecessor model, Tarsier, and achieves significant performance gains through three key improvements:
Expansion of Training Data: The number of video-text pairs used for pre-training has been increased from 11 million to 40 million. The expansion raises not only the volume but also the diversity of the content, yielding a more robust and generalizable model.
Fine-Grained Temporal Alignment: During supervised fine-tuning, the model is trained with precise temporal alignment between visual and textual information. This allows it to capture the dynamic events in a video more accurately and translate them into detailed descriptions (see the first sketch below).
Model-Based Sampling and DPO Training: Preference data is generated automatically through model-based sampling and then used for Direct Preference Optimization (DPO) training, which steers the model toward higher-quality and more accurate descriptions (see the second sketch below).
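To make the idea of temporally aligned fine-tuning data more concrete, here is a minimal sketch in Python, assuming each described event is grounded to a frame interval. The class and field names (GroundedEvent, AlignedSample, start_frame, end_frame) and the sample content are hypothetical illustrations, not Tarsier2's actual data format.

```python
from dataclasses import dataclass

@dataclass
class GroundedEvent:
    """One described event, tied to the frame interval in which it occurs."""
    text: str
    start_frame: int
    end_frame: int

@dataclass
class AlignedSample:
    """A hypothetical fine-tuning record: the full description is decomposed
    into events, each grounded to a span of video frames."""
    video_path: str
    events: list[GroundedEvent]

    def full_description(self) -> str:
        return " ".join(event.text for event in self.events)

sample = AlignedSample(
    video_path="clip_0001.mp4",
    events=[
        GroundedEvent("A man opens the refrigerator.", start_frame=0, end_frame=24),
        GroundedEvent("He takes out a bottle of milk.", start_frame=25, end_frame=60),
        GroundedEvent("He pours the milk into a glass.", start_frame=61, end_frame=110),
    ],
)
print(sample.full_description())
```

The point of such grounding is that every sentence in the target description can be traced back to the frames that actually show it, which discourages the model from describing events it has not seen.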
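The second sketch shows the standard DPO objective that such automatically generated preference pairs would feed into: the policy is pushed to prefer the chosen description over the rejected one, relative to a frozen reference model. The dummy log-probabilities and the beta value are illustrative assumptions; this is not Tarsier2's training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss over sequence log-probabilities:
    widen the margin by which the policy prefers the chosen description over
    the rejected one, measured relative to a frozen reference model."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Dummy sequence log-probabilities for a batch of two preference pairs.
policy_chosen = torch.tensor([-12.3, -9.8])
policy_rejected = torch.tensor([-14.1, -11.5])
ref_chosen = torch.tensor([-12.9, -10.2])
ref_rejected = torch.tensor([-13.8, -11.0])

print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

In practice, the "chosen" and "rejected" descriptions would come from sampling several candidates per video with the model itself and ranking them automatically, which is what makes the preference data cheap to produce.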
Convincing Results Compared to Leading Models
Comprehensive experiments demonstrate the performance of Tarsier2. Compared with established proprietary models such as GPT-4o and Gemini 1.5 Pro, Tarsier2-7B consistently performs better on detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves the F1 score by 2.8% over GPT-4o and by 5.8% over Gemini 1.5 Pro. In side-by-side evaluations by human judges, Tarsier2-7B shows a performance advantage of 8.6% over GPT-4o and 24.9% over Gemini 1.5 Pro.
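For readers unfamiliar with the metric, the snippet below shows only the generic F1 arithmetic behind such scores, using hypothetical event counts; how DREAM-1K itself matches predicted events to reference events is defined by the benchmark, not by this sketch.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Hypothetical event-level counts for one video description:
# events the model described correctly vs. events in the reference annotation.
matched, predicted, reference = 7, 9, 10
precision = matched / predicted   # share of described events that are correct
recall = matched / reference      # share of reference events that are covered
print(round(f1(precision, recall), 3))
```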
Versatility as a Generalist in Video Understanding
Tarsier2's capabilities are not limited to video description. The model also sets new best results across 15 public benchmarks, spanning tasks such as video question answering, video grounding, hallucination tests, and embodied question answering. These results underscore Tarsier2's versatility as a robust, generalist vision-language model.
From Research to Application: Potential for AI-Powered Content Creation
Developments in the field of video-language models open up new possibilities for AI-powered content creation. Automated video descriptions can improve the accessibility of videos, support search engine optimization, and facilitate the creation of metadata. Furthermore, models like Tarsier2 offer the potential for innovative applications in areas such as video analysis, video summarization, and interactive video experiences.
Outlook: Future Research and Development
Research on Large Vision-Language Models is dynamic and promising. Future work could focus on further scaling the models, integrating additional modalities such as audio, and developing more robust evaluation methods. The advancements in this area contribute to closing the gap between human and machine perception of videos and open up new avenues for interaction with visual content.