GS-DiT: Enhancing 4D Video Generation with Gaussian Splatting and Dense Point Tracking

From Point Tracking to Video Generation: GS-DiT Expands the Possibilities of 4D Video Control

Control over four-dimensional (4D) content, that is, 3D content plus time, is a crucial factor for the further development of video generation. It enables complex camera techniques such as multi-camera shots and dolly-zoom effects, which have been difficult to realize with existing methods. However, directly training a Video Diffusion Transformer (DiT) for 4D control requires expensive multi-view videos that are rarely available in practice.

GS-DiT, a novel framework, addresses this challenge by integrating pseudo-4D Gaussian fields into video generation. It is inspired by Monocular Dynamic novel View Synthesis (MDVS), a technique that optimizes a 4D representation and renders videos according to various 4D elements such as camera pose and object motion. GS-DiT constructs a pseudo-4D Gaussian field via dense 3D point tracking and renders this field for all video frames. A pre-trained DiT is then fine-tuned to generate videos under the guidance of the rendered video.
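The per-frame rendering step can be sketched in simplified form. The sketch below assumes a pinhole camera and splats each tracked 3D point onto a single pixel instead of rasterizing full Gaussians; all function and parameter names are illustrative, not taken from the paper:

```python
import math

def project_point(p_world, fx, fy, cx, cy, R, t):
    """Pinhole projection of one tracked 3D point into pixel coordinates.
    R is a 3x3 rotation (list of rows), t a 3-vector translation."""
    # world -> camera coordinates: X_c = R @ X_w + t
    x, y, z = (sum(R[i][j] * p_world[j] for j in range(3)) + t[i]
               for i in range(3))
    # perspective division plus intrinsics gives the pixel position
    return fx * x / z + cx, fy * y / z + cy, z

def render_frame(points, colors, fx, fy, cx, cy, R, t, w, h):
    """Render one frame by splatting each tracked point onto its nearest
    pixel, keeping only the closest point per pixel (a minimal z-buffer).
    A real Gaussian-field renderer would instead blend anisotropic
    Gaussians; this single-pixel splat is a deliberate simplification."""
    zbuf, image = {}, {}
    for p, c in zip(points, colors):
        u, v, z = project_point(p, fx, fy, cx, cy, R, t)
        px = (round(u), round(v))
        if 0 <= px[0] < w and 0 <= px[1] < h and 0 < z < zbuf.get(px, math.inf):
            zbuf[px], image[px] = z, c
    return image  # sparse {(u, v): color} map of the rendered frame
```

Rendering the same point set under different `R`, `t` per frame is what yields the guidance video for a new camera trajectory while the dynamic content stays fixed.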

Efficient Point Tracking as the Key

A central component of GS-DiT is its efficient Dense 3D Point Tracking (D3D-PT) method, which enables the construction of the pseudo-4D Gaussian field. D3D-PT surpasses SpatialTracker, the current state of the art in sparse 3D point tracking, in accuracy while accelerating inference by two orders of magnitude, which keeps the training of GS-DiT tractable.

New Possibilities for Video Control

At inference time, GS-DiT can generate videos that share identical dynamic content but follow different camera parameters, addressing a significant limitation of current video generation models. It extends the 4D controllability of Gaussian splatting beyond camera pose alone: by manipulating the Gaussian field together with the camera intrinsics, advanced cinematic effects become possible, for example adjusting the focal length over the course of a shot to achieve the desired look.
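The dolly-zoom example can be made concrete. To keep a subject at constant on-screen size while the camera pulls back, the focal length must grow proportionally to the subject distance under the pinhole model (on-screen size is proportional to f / distance). A minimal sketch, with illustrative names not taken from the paper:

```python
def dolly_zoom_focal(f0, d0, d):
    """Focal length that keeps a subject at constant on-screen size as the
    camera distance changes from d0 to d.
    Pinhole relation: on-screen size ~ focal_length / distance."""
    return f0 * d / d0

# Hypothetical per-frame focal lengths for a camera pulling back from
# 2 m to 4 m over 5 frames, starting at a 50 mm focal length: the subject
# stays the same size while the background perspective warps.
focals = [dolly_zoom_focal(50.0, 2.0, 2.0 + 0.5 * i) for i in range(5)]
```

Feeding such a per-frame intrinsics schedule, together with the matching camera poses, into the Gaussian-field rendering is what produces the dolly-zoom guidance video.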

Generalization and Creative Applications

GS-DiT demonstrates strong generalization capabilities and offers a powerful tool for creative video production. By combining Gaussian Splatting and Video Diffusion Transformers, GS-DiT opens up new avenues for creating videos with complex camera movements and dynamic content. The technology could find future applications in various fields such as film, advertising, and virtual reality.

Bibliography

Bian, W., Huang, Z., Shi, X., Li, Y., Wang, F.-Y., & Li, H. (2025). GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking. arXiv preprint arXiv:2501.02690.
Patas, J. (2025, January 7). GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking [Tweet]. X. https://x.com/janusch_patas/status/1876496980534599739
Paper Reading AI Learner. (2025, January 5). GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking. https://paperreading.club/page?id=276899
Zhao, Z. (2025, January 7). GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking [Tweet]. X. https://twitter.com/zhenjun_zhao/status/1876481819241324815
Hugging Face. (n.d.). Papers. https://huggingface.co/papers
52CV. (n.d.). ECCV-2024-Papers. GitHub. https://github.com/52CV/ECCV-2024-Papers
MrNeRF. (n.d.). awesome-3D-gaussian-splatting. GitHub. https://github.com/MrNeRF/awesome-3D-gaussian-splatting/blob/main/awesome_3dgs_papers.yaml
The IEEE/CVF Computer Vision Foundation. (n.d.). CVPR 2024 Accepted Papers. https://cvpr.thecvf.com/Conferences/2024/AcceptedPapers
ResearchGate. (n.d.). 4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion. https://www.researchgate.net/publication/386502909_4Real-Video_Learning_Generalizable_Photo-Realistic_4D_Video_Diffusion
Neural Information Processing Systems. (n.d.). NeurIPS 2024. https://nips.cc/virtual/2024/papers.html