AI Model Generates Personalized Videos with Multiple Subjects in Open-Set Conditions

Personalized Video Generation: Multiple Subjects in Open-Set Scenarios

Personalized video generation has made significant strides in recent years. It allows for the synthesis of videos containing specific concepts such as people, animals, or places. However, previous methods were often restricted to narrow applications, required time-consuming optimization for each individual subject, or supported only a single subject. A new approach, Video Alchemist, aims to address these limitations.

Video Alchemist: A New Approach to Personalized Video Generation

Video Alchemist is a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and backgrounds. Because personalization is part of the model itself, no time-consuming per-subject optimization is needed at test time. The model is based on a novel diffusion transformer module that fuses each conditioning reference image with the textual description of its subject using cross-attention layers. Developing such a comprehensive model presents two challenges: dataset creation and evaluation.
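To make the fusion idea more concrete, the following is a minimal sketch of cross-attention over subject tokens. It is not the paper's module: the binding rule (adding a word embedding to each image token), the absence of learned projections, and all names such as bind_subject and cross_attention are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bind_subject(image_tokens, word_embedding):
    """Assumed binding rule: add the subject's word embedding to each image token."""
    return [[i + w for i, w in zip(token, word_embedding)] for token in image_tokens]


def cross_attention(queries, context):
    """Single-head attention without learned projections (keys = values = context)."""
    scale = 1.0 / math.sqrt(len(context[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, context))
            for d in range(len(context[0]))
        ])
    return outputs


# Two hypothetical subjects, each with one image token and one word embedding.
dog_tokens = bind_subject([[1.0, 0.0]], [0.5, 0.5])
cat_tokens = bind_subject([[0.0, 1.0]], [0.5, 0.5])

# Video tokens act as queries and attend over the tokens of all subjects at once.
video_tokens = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(video_tokens, dog_tokens + cat_tokens)
```

Each video token receives a weighted mixture of the subject tokens, so several subjects can condition the same video without any per-subject fine-tuning. A real diffusion transformer would add learned query, key, and value projections and multiple heads.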

The Challenge of Dataset Creation

Paired datasets of reference images and videos are extremely difficult to collect. Therefore, selected frames of a video were sampled as reference images, and Video Alchemist was trained to synthesize a clip of that same target video. While models can easily denoise training videos when given reference frames from the same video, they do not generalize well to new contexts. To mitigate this issue, an automatic data construction pipeline with extensive image augmentations was developed. This pipeline extracts object segments from the target videos and applies specific data augmentations so that the model focuses on the identity of the subject in the reference images rather than on its exact appearance in a given frame. This reduces the so-called "copy-and-paste" effect, which often occurs in reconstruction-based methods.
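The segment-then-augment idea can be sketched as follows. This is only an illustration under stated assumptions: frames are small grayscale grids stored as nested lists, the mask is given, and the two augmentations (horizontal flip and brightness jitter) are generic choices, not necessarily the ones used in the paper. The function names are hypothetical.

```python
import random


def extract_segment(frame, mask):
    """Crop the bounding box of the mask and zero out background pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask is empty")
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    return [
        [frame[r][c] if mask[r][c] else 0 for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]


def augment(segment, rng):
    """Illustrative augmentations: random horizontal flip and brightness jitter."""
    out = [row[:] for row in segment]
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]
    gain = rng.uniform(0.8, 1.2)
    return [[min(255, round(p * gain)) for p in row] for row in out]


frame = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
mask = [[0, 0, 0], [0, 1, 1], [0, 0, 1]]

segment = extract_segment(frame, mask)
reference = augment(segment, random.Random(0))
```

Because the reference no longer matches the target frame pixel for pixel, a model trained on such pairs has less incentive to paste the reference into the output unchanged.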

The Challenge of Evaluation

Evaluating open-set video personalization is another challenge. Traditional video personalization evaluation methods compute similarity scores between the generated video and the reference images. However, such a whole-image metric does not carry over to multiple entities, because it cannot focus on each subject individually. To overcome these limitations, MSRVTT-Personalization was introduced, a comprehensive and robust evaluation protocol for personalization tasks. This new benchmark allows for evaluation in various conditioning modes, including conditioning on face crops, single or multiple arbitrary subjects, and combinations of foreground objects and backgrounds. Instead of relying on whole-image similarity, it evaluates subject accuracy separately for each object segment.
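The difference between a single global score and per-subject scoring can be illustrated with a small sketch. It assumes that an embedding has already been computed for each reference image and for the matching object segment in the generated video; the feature extractor, the use of cosine similarity, and the function names are assumptions, and the benchmark's actual metrics may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def per_subject_similarity(reference_embs, segment_embs):
    """Score each subject against its own segment, then average the scores."""
    scores = {
        name: cosine(reference_embs[name], segment_embs[name])
        for name in reference_embs
    }
    return scores, sum(scores.values()) / len(scores)


# Hypothetical embeddings: the dog is reproduced well, the cat is not.
references = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}
segments = {"dog": [2.0, 0.0], "cat": [1.0, 0.0]}

scores, mean_score = per_subject_similarity(references, segments)
```

Scoring each segment separately exposes the subject that failed, which a single similarity score over the whole frame would blur together with the one that succeeded.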

Results and Outlook

The reported experiments show that Video Alchemist significantly outperforms existing personalization methods in both quantitative and qualitative evaluations. The results also support the effectiveness of the proposed components and the model's ability to generate personalized videos with multiple subjects in open-set scenarios. Video Alchemist thus represents a step towards more flexible video personalization and opens up new possibilities for creative applications.

