From Short Clips to Complete Stories: Advances in Narrative Video Generation

Artificial intelligence (AI)-powered video generation has made significant strides in recent years: models can now produce high-quality clips lasting a few seconds. Generating longer videos that string coherent, informative events into a cohesive narrative, however, remains an open research challenge.

A key factor in the success of AI models is the availability of large amounts of high-quality training data. In narrative video generation, however, datasets that depict long, coherent stories are scarce. Existing video datasets are often sparsely labeled, or their descriptions are too superficial to capture the unfolding events of a narrative. Moreover, not every type of video is suitable for learning narrative structure.

Against this backdrop, the development of specialized datasets is gaining importance. One example is the recently introduced "CookGen" dataset, which focuses on cooking videos. Cooking videos are particularly well-suited for narrative video generation because they typically follow clear, step-by-step procedures with unambiguous actions and visual states. "CookGen" comprises around 200,000 video clips with an average duration of 9.5 seconds and was compiled from existing datasets such as "YouCook2" and "HowTo100M." The videos were carefully filtered, captioned, and tagged with the corresponding actions to ensure high data quality. Labeling combined automatic speech recognition (ASR), manual annotation, and state-of-the-art vision-language models (VLMs).
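
To make the structure of such a dataset more concrete, the following Python sketch shows what a CookGen-style clip record and a simple quality filter could look like. The field names, duration thresholds, and filter logic are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record layout for a CookGen-style clip; the real dataset's
# schema and field names may differ.
@dataclass
class ClipAnnotation:
    video_id: str            # source video (e.g. from YouCook2 or HowTo100M)
    start_sec: float         # clip start within the source video
    end_sec: float           # clip end within the source video
    asr_caption: str         # caption derived from automatic speech recognition
    vlm_caption: str         # caption produced by a vision-language model
    actions: List[str] = field(default_factory=list)  # tagged cooking actions

    @property
    def duration(self) -> float:
        return self.end_sec - self.start_sec


def keep_clip(clip: ClipAnnotation, min_sec: float = 2.0, max_sec: float = 30.0) -> bool:
    """Illustrative quality filter: keep clips of reasonable length that
    carry at least one caption and at least one action tag."""
    has_text = bool(clip.vlm_caption.strip() or clip.asr_caption.strip())
    return min_sec <= clip.duration <= max_sec and has_text and bool(clip.actions)


if __name__ == "__main__":
    clip = ClipAnnotation(
        video_id="youcook2_0001",
        start_sec=12.0,
        end_sec=21.5,  # roughly 9.5 s, the reported average clip length
        asr_caption="now we chop the onions finely",
        vlm_caption="A cook dices an onion on a wooden cutting board.",
        actions=["chop onion"],
    )
    print(keep_clip(clip))  # True
```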

VideoAuteur: A New Approach to Narrative Video Generation

Besides the development of datasets, the architecture of the AI models plays a crucial role. "VideoAuteur" is a new, autoregressive pipeline for generating long narrative videos. It consists of two main components: a "Long Narrative Video Director" and a visually conditioned video generation model.

The "Long Narrative Video Director" creates a coherent narrative flow by generating a sequence of visual embeddings or keyframes that represent the logical progression of the story. These embeddings serve as the basis for the video generation model to create the actual video sequences. The "Director" operates according to an interleaved procedure where actions, descriptions, and visual states are generated sequentially, with each step building upon the previous one. This allows for the creation of coherent and semantically consistent videos.

The visually conditioned video generation model uses the visual embeddings produced by the "Director" to generate the individual video clips. Because it integrates action sequences, descriptions, and visual states, this approach goes beyond traditional image-to-video methods and enables continuous visual conditioning, which improves the visual quality and coherence of the generated videos. In addition, robust error-handling mechanisms help keep the output stable.
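
The following sketch illustrates the idea of continuous visual conditioning: each clip is generated from a caption together with the director's visual embedding, rather than from a single rendered keyframe image alone. The function and its return format are hypothetical stand-ins, not the model's real interface; a real system would run a video diffusion decoder where the placeholder frames are produced.

```python
from typing import Dict, List, Sequence


def generate_clip(caption: str, visual_embedding: Sequence[float],
                  num_frames: int = 16) -> List[Dict[str, object]]:
    """Stand-in for a visually conditioned clip generator: the conditioning
    combines the caption with the director's continuous visual embedding,
    not just a keyframe image as in plain image-to-video generation."""
    conditioning = {"text": caption, "visual": list(visual_embedding)}
    # A real model would decode video frames from this conditioning; here we
    # return frame placeholders that simply record what they were conditioned on.
    return [{"frame": i, "conditioned_on": conditioning} for i in range(num_frames)]


if __name__ == "__main__":
    # One narrative step from the director: a caption plus a toy embedding.
    caption = "The cook dices an onion on a wooden cutting board."
    embedding = [0.0] * 16
    clip = generate_clip(caption, embedding, num_frames=4)
    print(len(clip), clip[0]["conditioned_on"]["text"])
```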

Outlook

The development of "CookGen" and "VideoAuteur" represents an important step towards the generation of longer, narrative videos. The combination of a specialized dataset and an innovative model architecture allows for the creation of videos that are not only visually appealing but also coherent and informative in terms of content. Research in this field is dynamic and promising. Future work could, for example, focus on expanding the scope of application to other domains beyond cooking or improving the interaction with the generated videos. The generation of videos that tell complex stories opens up new possibilities for creative applications and could fundamentally change the way we consume and create videos.

Bibliography:
https://arxiv.org/abs/2501.06173
https://arxiv.org/html/2501.06173v1
https://www.chatpaper.com/chatpaper/de/paper/97333
https://videoauteur.github.io/
https://github.com/lambert-x/VideoAuteur
https://www.alphaxiv.org/abs/2501.06173
https://www.researchgate.net/publication/381332555_Vidgen_Long-Form_Text-to-Video_Generation_with_Temporal_Narrative_and_Visual_Consistency_for_High_Quality_Story-Visualisation_Tasks
https://github.com/showlab/Awesome-Video-Diffusion
https://dl.acm.org/doi/10.1145/3607540.3617142