Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis

CVPR 2024

Zhan Li1,2     Zhang Chen1,†     Zhong Li1,†     Yi Xu1
1 OPPO US Research Center     2 Portland State University
† Corresponding authors


Real-time interactive demo and spiral path rendering. Scenes are from Google Immersive [1], Technicolor [2], and Neural 3D Video [3] datasets.
*The visualization of the Gaussians is inspired by Luma AI.

Abstract

Novel view synthesis of dynamic scenes has been an intriguing yet challenging problem. Despite recent advancements, simultaneously achieving high-resolution photorealistic results, real-time rendering, and compact storage remains a formidable task. To address these challenges, we propose Spacetime Gaussian Feature Splatting as a novel dynamic scene representation, composed of three pivotal components.

First, we formulate expressive Spacetime Gaussians by enhancing 3D Gaussians with temporal opacity and parametric motion/rotation. This enables Spacetime Gaussians to capture static, dynamic, and transient content within a scene. Second, we introduce splatted feature rendering, which replaces spherical harmonics with neural features. These features facilitate the modeling of view- and time-dependent appearance while keeping the model size small. Third, we use the guidance of training error and coarse depth to sample new Gaussians in areas that are challenging for existing pipelines to converge.
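For concreteness, the Python sketch below evaluates a single Spacetime Gaussian at a query time t: a 1D Gaussian in time gates the opacity, and polynomial bases drive the time-varying center and rotation. The function and argument names, and the polynomial degrees implied by the coefficient arrays, are illustrative assumptions rather than our released implementation.

import numpy as np

def evaluate_stg(mu0, motion_coeffs, q0, rot_coeffs, mu_t, sigma_t, t):
    # mu0:           (3,) spatial center at the temporal center mu_t
    # motion_coeffs: (K, 3) polynomial coefficients of the time-varying center
    # q0:            (4,) base rotation quaternion
    # rot_coeffs:    (L, 4) polynomial coefficients of the time-varying rotation
    # mu_t, sigma_t: temporal center and temporal scale of this Gaussian
    dt = t - mu_t
    # Temporal opacity: a 1D Gaussian in time fades the Gaussian in and out,
    # letting it model transient content.
    temporal_opacity = np.exp(-0.5 * (dt / sigma_t) ** 2)
    # Parametric motion: the center follows a polynomial trajectory in dt.
    center = mu0 + sum(c * dt ** (k + 1) for k, c in enumerate(motion_coeffs))
    # Parametric rotation: the quaternion also varies polynomially, then is renormalized.
    quat = q0 + sum(c * dt ** (k + 1) for k, c in enumerate(rot_coeffs))
    quat = quat / np.linalg.norm(quat)
    return center, quat, temporal_opacity

In the full method, this temporal term scales the spatial opacity used during splatting.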

Experiments on several established real-world datasets demonstrate that our method achieves state-of-the-art rendering quality and speed, while retaining compact storage. At 8K resolution, our lite-version model can render at 60 FPS on an Nvidia RTX 4090 GPU.


Video

 

Real-Time 8K Demo

 

 

Method


(a) Our method leverages a set of Spacetime Gaussians (STG) to represent dynamic scenes. On top of a 3D Gaussian [4], each STG is further equipped with temporal opacity, polynomial motion/rotation, and time-dependent features. (b) The features are splatted onto the 2D image plane and then converted to a color image via an MLP.
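As a rough sketch of step (b), the tiny decoder below maps a splatted image-space feature map, concatenated with per-pixel view directions, to RGB using 1x1 convolutions (equivalent to a per-pixel MLP). The 9-channel feature size, hidden width, and activations are assumptions for illustration, not the exact architecture of the paper.

import torch
import torch.nn as nn

class FeatureToColorDecoder(nn.Module):
    # Hypothetical decoder: splatted features + view direction -> RGB image.
    def __init__(self, feat_dim=9, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim + 3, hidden, kernel_size=1),  # acts as a per-pixel MLP layer
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 3, kernel_size=1),
            nn.Sigmoid(),  # keep colors in [0, 1]
        )

    def forward(self, feat_map, view_dirs):
        # feat_map:  (B, feat_dim, H, W) features after splatting to the image plane
        # view_dirs: (B, 3, H, W) unit viewing direction for every pixel
        return self.net(torch.cat([feat_map, view_dirs], dim=1))

Because such features are much lower-dimensional than full spherical harmonics coefficients, per-Gaussian storage stays small while the decoder recovers view- and time-dependent appearance.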

We also observe that areas with sparse Gaussians at initialization struggle to converge to high rendering quality. Therefore, we introduce a Gaussian sampling strategy guided by training error and coarse depth: during training, we sample new Gaussians along the rays of pixels with large errors, as sketched below.
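A minimal sketch of this sampling step, under assumed hyperparameters (error threshold, per-iteration cap, and placement at a single coarse depth), is given below; names and signatures are hypothetical.

import numpy as np

def propose_guided_samples(error_map, rays_o, rays_d, coarse_depth,
                           error_thresh=0.1, max_new=1000):
    # error_map:    (H, W) per-pixel training error for the current view
    # rays_o/d:     (H, W, 3) ray origins and unit directions for this view
    # coarse_depth: (H, W) coarse depth used to place candidates near the surface
    ys, xs = np.where(error_map > error_thresh)
    if len(ys) > max_new:
        # Keep only the pixels with the largest errors.
        order = np.argsort(error_map[ys, xs])[::-1][:max_new]
        ys, xs = ys[order], xs[order]
    # One candidate Gaussian center per selected ray, placed at the coarse depth.
    centers = rays_o[ys, xs] + coarse_depth[ys, xs][:, None] * rays_d[ys, xs]
    return centers

The returned centers would then be initialized as new Gaussians and optimized jointly with the existing ones.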

Results

On the Neural 3D Video dataset [3], our method achieves state-of-the-art rendering quality and speed, while retaining compact storage.

Visual comparisons with other methods

Ours vs. Dynamic 3DGS [5]
Ours vs. K-Planes [6]
Ours vs. HexPlane [7]
Ours vs. MixVoxels-L [8]
Ours vs. HyperReel [9]
Ours vs. NeRFPlayer [10]
Ours vs. HyperReel [9]
Ours vs. HyperReel [9]

Ablation on guided sampling

Ours vs. Ours without guided sampling
Ours vs. Ours without guided sampling

Comparisons on temporal consistency

From the test-view video results of each method, we take a vertical column of 150 pixels across 250 frames and concatenate these columns horizontally. The resulting image patch is equivalent to a slice in height-time space. Our results are sharper than those of Dynamic 3DGS [5] and contain less temporal noise.
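A small helper like the one below (hypothetical, NumPy-based) produces such a height-time slice from a stack of rendered frames.

import numpy as np

def height_time_slice(frames, x, y0, height=150):
    # frames: (T, H, W, 3) rendered video, e.g. T = 250 frames
    # x:      image column to sample
    # y0:     top row of the vertical window of `height` pixels
    columns = [f[y0:y0 + height, x, :] for f in frames]  # each column: (height, 3)
    return np.stack(columns, axis=1)                      # slice image: (height, T, 3)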

BibTeX

@article{li2023spacetime,
  title={Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis},
  author={Li, Zhan and Chen, Zhang and Li, Zhong and Xu, Yi},
  journal={arXiv preprint arXiv:2312.16812},
  year={2023}
}

 

References

[1] Michael Broxton, John Flynn, Ryan Overbeck, Daniel Erickson, Peter Hedman, Matthew DuVall, Jason Dourgarian, Jay Busch, Matt Whalen, and Paul Debevec. Immersive Light Field Video with a Layered Mesh Representation. ACM Transactions on Graphics, 2020.
[2] Neus Sabater, Guillaume Boisson, Benoit Vandame, Paul Kerbiriou, Frederic Babon, Matthieu Hog, Remy Gendrot, Tristan Langlois, Olivier Bureller, Arno Schubert, et al. Dataset and pipeline for multi-view light-field video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017.
[3] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3D video synthesis from multi-view video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[4] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 2023.
[5] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3D Gaussians: Tracking by persistent dynamic view synthesis. In International Conference on 3D Vision (3DV), 2024.
[6] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[7] Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[8] Feng Wang, Sinan Tan, Xinghang Li, Zeyue Tian, Yafei Song, and Huaping Liu. Mixed neural voxels for fast multiview video synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[9] Benjamin Attal, Jia-Bin Huang, Christian Richardt, Michael Zollhoefer, Johannes Kopf, Matthew O’Toole, and Changil Kim. HyperReel: High-fidelity 6-DoF video with ray-conditioned sampling. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[10] Liangchen Song, Anpei Chen, Zhong Li, Zhang Chen, Lele Chen, Junsong Yuan, Yi Xu, and Andreas Geiger. NeRFPlayer: A streamable dynamic scene representation with decomposed neural radiance fields. IEEE Transactions on Visualization and Computer Graphics, 2023.