Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

Zhejiang University · Ant Research
ICCV 2025

Diffuman4D enables high-fidelity free-viewpoint rendering of human performances from sparse-view videos.

Interactive Demo

The 4DGS model on the left is reconstructed from the sparse input videos only; the one on the right is reconstructed from the sparse input videos plus our generated videos.

Overview

How it works

Given sparse-view videos, Diffuman4D (1) generates 4D-consistent multi-view videos conditioned on these inputs, and (2) reconstructs a high-fidelity 4DGS model of the human performance using both the input and the generated videos.
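
At a high level, the pipeline can be summarized in a few lines of pseudocode. The function and argument names below are illustrative placeholders rather than the released API, with the two stages passed in as callables.

def diffuman4d(input_videos, input_cams, target_cams, skeletons,
               generate_videos, fit_4dgs):
    """Schematic two-stage pipeline; `generate_videos` stands in for the
    spatio-temporal diffusion model and `fit_4dgs` for the 4DGS optimizer."""
    # Stage 1: generate 4D-consistent videos at the target viewpoints,
    # conditioned on the sparse input videos and the human skeletons.
    generated_videos = generate_videos(input_videos, input_cams, target_cams, skeletons)
    # Stage 2: reconstruct a 4DGS model from both the input and the generated
    # videos; the model can then be rendered from arbitrary viewpoints.
    return fit_4dgs(input_videos + generated_videos, input_cams + target_cams)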

Motivation and Solution

  1. Sparse-view input videos inevitably lead to noisy 4DGS reconstructions.
  2. Diffuman4D addresses the sparse-view challenge by generating 4D-consistent multi-view videos conditioned on the input videos.
  3. The generated videos enable high-quality 4DGS reconstructions, allowing free-viewpoint rendering of humans in motion.

Method

We introduce a spatio-temporal diffusion model that tackles human novel view synthesis from sparse-view videos through three key components:

  1. Skeleton-Plücker conditioning: The encoded skeleton latents and Plücker coordinates are concatenated with the image latents at input views or the noise latents at target views, forming input samples or target samples, respectively (see the first sketch after this list).
  2. Sliding iterative denoising: The samples across all views and timestamps form a sample grid, which is denoised by our model using a sliding iterative mechanism and then decoded into the target videos (see the second sketch below).
  3. 4DGS reconstruction: A 4DGS model is reconstructed from the input and target videos using LongVolcap, enabling real-time novel view rendering of human performances with complex clothing and motions.
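
To make the first component concrete, here is a minimal sketch of Skeleton-Plücker conditioning: per-pixel Plücker ray coordinates are computed from the camera parameters and concatenated channel-wise with the skeleton latents and either the encoded image latents (input views) or noise latents (target views). The tensor shapes, channel layout, and helper names are illustrative assumptions, not the paper's exact configuration.

import torch

def plucker_map(K: torch.Tensor, c2w: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Per-pixel Plucker coordinates (d, o x d) for a pinhole camera, shape [6, h, w]."""
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32) + 0.5,
        torch.arange(w, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # [h, w, 3] homogeneous pixels
    dirs = pix @ torch.linalg.inv(K).T                         # camera-space ray directions
    dirs = dirs @ c2w[:3, :3].T                                # rotate into world space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)              # unit direction d
    origin = c2w[:3, 3].expand_as(dirs)                        # camera center o
    moment = torch.cross(origin, dirs, dim=-1)                 # moment o x d
    return torch.cat([dirs, moment], dim=-1).permute(2, 0, 1)  # [6, h, w]

def build_sample(latent: torch.Tensor, skeleton_latent: torch.Tensor,
                 K: torch.Tensor, c2w: torch.Tensor) -> torch.Tensor:
    """Concatenate a latent (encoded image at input views, noise at target views)
    with the skeleton latent and the Plucker map along the channel dimension."""
    _, h, w = latent.shape
    return torch.cat([latent, skeleton_latent, plucker_map(K, c2w, h, w)], dim=0)

# Usage (names such as vae_encode are hypothetical):
# input_sample  = build_sample(vae_encode(image), skeleton_latent, K, c2w)     # input view
# target_sample = build_sample(torch.randn(4, h, w), skeleton_latent, K, c2w)  # target view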
Figure: the Diffuman4D pipeline.
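
The second component, sliding iterative denoising over the (view, time) sample grid, is sketched below. The window size, the alternating sweeps along the view and time axes, and the denoise_step callable are illustrative assumptions rather than the exact schedule used by the model.

import torch

def sliding_iterative_denoise(grid: torch.Tensor, denoise_step, num_steps: int,
                              window: int = 4) -> torch.Tensor:
    """grid: [V, T, C, h, w] sample grid over all V views and T timestamps.
    denoise_step(samples, step) applies one diffusion denoising step to a window."""
    V, T = grid.shape[:2]
    for step in range(num_steps):
        if step % 2 == 0:
            # Even iterations: slide the window along the view axis (multi-view consistency).
            for v0 in range(0, V, window):
                grid[v0:v0 + window] = denoise_step(grid[v0:v0 + window], step)
        else:
            # Odd iterations: slide the window along the time axis (temporal consistency).
            for t0 in range(0, T, window):
                grid[:, t0:t0 + window] = denoise_step(grid[:, t0:t0 + window], step)
    return grid  # fully denoised latents, ready to be decoded into the target videos

In this sketch, alternating the sliding axis means every sample in the grid is denoised exactly once per iteration while sharing context with its view and time neighbors.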

Comparisons

Acknowledgments

We would like to thank Haotong Lin, Jiaming Sun, Yunzhi Yan, Zhiyuan Yu, and Zehong Shen for their insightful discussions. We appreciate the support from Yifan Wang, Yu Zhang, Siyu Zhang, Yinji Shentu, Dongli Tan, and Peishan Yang in producing the self-captured demos. We also extend our gratitude to Ye Zhang for testing recent camera-control generative models and avatar reconstruction methods.

BibTeX

@inproceedings{jin2025diffuman4d,
  title={Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models},
  author={Jin, Yudong and Peng, Sida and Wang, Xuan and Xie, Tao and Xu, Zhen and Yang, Yifan and Shen, Yujun and Bao, Hujun and Zhou, Xiaowei},
  booktitle={ICCV},
  year={2025}
}