Abstract
Spatio-temporal coherency is a major challenge in synthesizing high-quality videos, particularly human videos that contain rich global and local deformations. To address this challenge, previous approaches have used separate features in the generation process to represent appearance and motion. However, without strict mechanisms to guarantee such disentanglement, separating motion from appearance has remained difficult, resulting in spatial distortions and temporal jittering that break spatio-temporal coherency. Motivated by this, we propose LEO, a novel framework for human video synthesis that places emphasis on spatio-temporal coherency. Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolates motion from appearance. We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM). The former bridges a space of motion codes with the space of flow maps and synthesizes video frames in a warp-and-inpaint manner. The latter learns to capture the motion prior in the training data by synthesizing sequences of motion codes. Extensive quantitative and qualitative analysis suggests that LEO significantly improves the coherent synthesis of human videos over previous methods on the TaichiHD, FaceForensics, and CelebV-HQ datasets. In addition, the effective disentanglement of appearance and motion in LEO enables two additional tasks: infinite-length human video synthesis and content-preserving video editing. Project page: https://wyhsirius.github.io/LEO-project/.
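To illustrate why a flow-map representation isolates motion from appearance, the following is a minimal NumPy sketch of the warping step only. It is not LEO's implementation: the function names `warp_with_flow` and `animate`, the backward-flow convention, and the nearest-neighbour sampling are assumptions made for illustration, and the learned components described above (the animator that maps motion codes to flow maps, the inpainting, and the LMDM) are not modeled.

```python
import numpy as np


def warp_with_flow(image, flow):
    """Backward-warp `image` (H, W, C) with a dense flow map (H, W, 2).

    Assumed convention: flow[y, x] = (dx, dy) means output pixel (x, y)
    is copied from source pixel (x + dx, y + dy). Nearest-neighbour
    sampling keeps the sketch short.

    Returns the warped image and a boolean mask of output pixels whose
    source location fell outside the image; in a warp-and-inpaint scheme
    these are the pixels that would have to be inpainted.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)

    # Pixels that would sample from outside the source image.
    missing = (src_x < 0) | (src_x >= w) | (src_y < 0) | (src_y >= h)

    # Clip so the gather below is always in bounds, then blank the holes.
    src_x = np.clip(src_x, 0, w - 1)
    src_y = np.clip(src_y, 0, h - 1)
    warped = image[src_y, src_x].copy()
    warped[missing] = 0
    return warped, missing


def animate(image, flows):
    """Render one frame per flow map, always warping the same source image."""
    return [warp_with_flow(image, flow)[0] for flow in flows]
```

In this sketch every frame draws its pixel values from the single source image, so appearance is fixed once and the sequence of flow maps carries only motion.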









Data Availability
The datasets used and analyzed during the current study are available in the following public-domain resources:
• FaceForensics (Rössler et al., 2018): https://github.com/ondyari/FaceForensics
• CelebV-HQ (Zhu et al., 2022): https://celebv-hq.github.io
• TaichiHD (Siarohin et al., 2019): https://github.com/AliaksandrSiarohin/first-order-model
The models and source data generated and analyzed during the current study are available from the corresponding author upon reasonable request.
References
Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., & Dekel, T. (2022). Text2live: Text-driven layered image and video editing. In ECCV.
Bergman, A., Kellnhofer, P., Yifan, W., Chan, E., Lindell, D., & Wetzstein, G. (2022). Generative neural articulated radiance fields. NeurIPS, 35, 19900–19916.
Bhagat, S., Uppal, S., Yin, Z., & Lim, N. (2020). Disentangling multiple features in video sequences using gaussian processes in variational autoencoders. In ECCV.
Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., & Jampani, V. (2023a). Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127
Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., & Kreis, K. (2023b). Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR.
Brock, A., Donahue, J., & Simonyan, K. (2019). Large scale GAN training for high fidelity natural image synthesis. In ICLR.
Brooks, T., Hellsten, J., Aittala, M., Wang, T.-C., Aila, T., Lehtinen, J., Liu, M.-Y., Efros, A. A., & Karras, T. (2022). Generating long videos of dynamic scenes. Advances in Neural Information Processing Systems, 35, 31769–31781.
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR.
Chan, C., Ginosar, S., Zhou, T., & Efros, A. A. (2019). Everybody dance now. In ICCV.
Chen, X., Wang, Y., Zhang, L., Zhuang, S., Ma, X., Yu, J., Wang, Y., Lin, D., Qiao, Y., & Liu, Z. (2023). Seine: Short-to-long video diffusion model for generative transition and prediction. In ICLR.
Chen, H., Zhang, Y., Cun, X., Xia, M., Wang, X., Weng, C., & Shan, Y. (2024). Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In CVPR.
Chu, C., Zhmoginov, A., & Sandler, M. (2017). CycleGAN: a master of steganography. arXiv preprint arXiv:1712.02950
Clark, A., Donahue, J., & Simonyan, K. (2019). Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571
Denton, E. L., & Birodkar, V. (2017). Unsupervised learning of disentangled representations from video. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), NeurIPS.
Esser, P., Rombach, R., & Ommer, B. (2021). Taming transformers for high-resolution image synthesis. In CVPR.
Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). Slowfast networks for video recognition. In CVPR.
Ge, S., Hayes, T., Yang, H., Yin, X., Pang, G., Jacobs, D., Huang, J.-B., & Parikh, D. (2022). Long video generation with time-agnostic vqgan and time-sensitive transformer. In ECCV.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In NIPS.
Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., & Salimans, T. (2022a). Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303
Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., & Fleet, D. J. (2022b). Video diffusion models. arXiv preprint arXiv:2204.03458
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. NeurIPS, 33, 6840–6851.
Huang, X., Liu, M.-Y., Belongie, S., & Kautz, J. (2018). Multimodal unsupervised image-to-image translation. In ECCV.
Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A. (2017). Image-to-Image Translation with Conditional Adversarial Networks. In CVPR.
Jang, Y., Kim, G., & Song, Y. (2018). Video Prediction with Appearance and Motion Conditions. In ICML.
Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In CVPR.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., & Aila, T. (2020). Analyzing and improving the image quality of StyleGAN. In CVPR.
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In ICLR.
Li, Y., & Mandt, S. (2018). Disentangled sequential autoencoder. In ICML.
Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., & Yang, M.-H. (2018). Flow-grounded spatial-temporal video prediction from still images. In ECCV.
Luo, Z., Chen, D., Zhang, Y., Huang, Y., Wang, L., Shen, Y., Zhao, D., Zhou, J., & Tan, T. (2023). Videofusion: Decomposed diffusion models for high-quality video generation. In CVPR.
Ma, X., Wang, Y., Jia, G., Chen, X., Liu, Z., Li, Y.-F., Chen, C., & Qiao, Y. (2024). Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048
Menapace, W., Siarohin, A., Skorokhodov, I., Deyneka, E., Chen, T.-S., Kag, A., Fang, Y., Stoliar, A., Ricci, E., Ren, J., & Tulyakov, S. (2024). Snap video: Scaled spatiotemporal transformers for text-to-video synthesis. In CVPR.
Nichol, A. Q., & Dhariwal, P. (2021). Improved denoising diffusion probabilistic models. In ICML.
Ohnishi, K., Yamamoto, S., Ushiku, Y., & Harada, T. (2018). Hierarchical video generation from orthogonal information: Optical flow and texture. In AAAI.
Pan, J., Wang, C., Jia, X., Shao, J., Sheng, L., Yan, J., & Wang, X. (2019). Video generation from single semantic label map. arXiv preprint arXiv:1903.04480
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., & Sutskever, I. (2021). Zero-shot text-to-image generation. In ICML.
Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., & Nießner, M. (2018). Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179
Saito, M., Matsumoto, E., & Saito, S. (2017). Temporal generative adversarial nets with singular value clipping. In ICCV.
Saito, M., Saito, S., Koyama, M., & Kobayashi, S. (2020). Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan. IJCV.
Shen, X., Li, X., & Elhoseiny, M. (2023). Mostgan-v: Video generation with temporal motion styles. In CVPR.
Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., & Sebe, N. (2019). First order motion model for image animation. In NeurIPS.
Siarohin, A., Woodford, O., Ren, J., Chai, M., & Tulyakov, S. (2021). Motion representations for articulated animation. In CVPR.
Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., & Taigman, Y. (2023). Make-a-video: Text-to-video generation without text-video data. In ICLR.
Skorokhodov, I., Tulyakov, S., & Elhoseiny, M. (2022). Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In CVPR.
Song, J., Meng, C., & Ermon, S. (2021). Denoising diffusion implicit models. In ICLR.
Tian, Y., Ren, J., Chai, M., Olszewski, K., Peng, X., Metaxas, D. N., & Tulyakov, S. (2021). A good image generator is what you need for high-resolution video synthesis. In ICLR.
Tulyakov, S., Liu, M.-Y., Yang, X., & Kautz, J. (2018). MoCoGAN: Decomposing motion and content for video generation. In CVPR.
Van Den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). Neural discrete representation learning. In NeurIPS.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In NeurIPS.
Villegas, R., Babaeizadeh, M., Kindermans, P.-J., Moraldo, H., Zhang, H., Saffar, M. T., Castro, S., Kunze, J., & Erhan, D. (2023). Phenaki: Variable length video generation from open domain textual descriptions. In ICLR.
Vondrick, C., Pirsiavash, H., & Torralba, A. (2016). Generating videos with scene dynamics. In NIPS.
Walker, J., Marino, K., Gupta, A., & Hebert, M. (2017). The pose knows: Video forecasting by generating pose futures. In ICCV.
Wang, Y. (2021). Learning to Generate Human Videos. Theses: Inria - Sophia Antipolis; Université Cote d’Azur.
Wang, Y., Bilinski, P., Bremond, F., & Dantcheva, A. (2020). G3AN: Disentangling appearance and motion for video generation. In CVPR.
Wang, Y., Bilinski, P., Bremond, F., & Dantcheva, A. (2020). Imaginator: Conditional spatio-temporal gan for video generation. In WACV.
Wang, Y., Bremond, F., & Dantcheva, A. (2021). Inmodegan: Interpretable motion decomposition generative adversarial network for video generation. arXiv preprint arXiv:2101.03049
Wang, T. Y., Ceylan, D., Singh, K. K., & Mitra, N. J. (2021). Dance in the wild: Monocular human animation with neural dynamic appearance synthesis. In 3DV.
Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., & Guo, Y. (2023). Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103
Wang, T.-C., Liu, M.-Y., Tao, A., Liu, G., Kautz, J., & Catanzaro, B. (2019). Few-shot video-to-video synthesis. In NeurIPS.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Liu, G., Tao, A., Kautz, J., & Catanzaro, B. (2018). Video-to-video synthesis. In NeurIPS.
Wang, Y., Yang, D., Bremond, F., & Dantcheva, A. (2022). Latent image animator: Learning to animate images via latent space navigation. In ICLR.
Xie, J., Gao, R., Zheng, Z., Zhu, S.-C., & Wu, Y. N. (2020). Motion-based generator model: Unsupervised disentanglement of appearance, trackable and intrackable motions in dynamic patterns. In AAAI.
Yan, W., Zhang, Y., Abbeel, P., & Srinivas, A. (2021). Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157
Yang, Z., Li, S., Wu, W., & Dai, B. (2022). 3dhumangan: Towards photo-realistic 3d-aware human image generation. arXiv preprint.
Yang, C., Wang, Z., Zhu, X., Huang, C., Shi, J., & Lin, D. (2018). Pose guided human video generation. In ECCV.
Yu, S., Sohn, K., Kim, S., & Shin, J. (2023). Video probabilistic diffusion models in projected latent space. In CVPR.
Yu, S., Tack, J., Mo, S., Kim, H., Kim, J., Ha, J.-W., & Shin, J. (2022). Generating videos with dynamics-aware implicit generative adversarial networks. In ICLR.
Zakharov, E., Shysheya, A., Burkov, E., & Lempitsky, V. (2019). Few-shot adversarial learning of realistic neural talking head models. In ICCV.
Zhang, L., & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models.
Zhang, D. J., Wu, J. Z., Liu, J.-W., Zhao, R., Ran, L., Gu, Y., Gao, D., & Shou, M. Z. (2023). Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818
Zhao, L., Peng, X., Tian, Y., Kapadia, M., & Metaxas, D. (2018). Learning to forecast and refine residual motion for image-to-video generation. In ECCV.
Zheng, Z., Zheng, L., & Yang, Y. (2018). A discriminatively learned cnn embedding for person reidentification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(1), 1–20.
Zhu, H., Wu, W., Zhu, W., Jiang, L., Tang, S., Zhang, L., Liu, Z., & Loy, C. C. (2022). CelebV-HQ: A large-scale video facial attributes dataset. In ECCV.
Ethics declarations
Ethics Statement
In this work, we aim to synthesize high-quality human-centric videos by combining a pretrained image animator with a proposed latent motion diffusion model. Our approach can be used for digital humans, online education, and data synthesis for other computer vision tasks. We note that our framework mainly focuses on learning to model the motion distribution in a pretrained image animator rather than directly modeling appearance. Therefore, our framework is not biased towards any specific gender, race, region, or social class. It works equally well irrespective of differences between subjects.
Additional information
Communicated by Kwan-Yee Kenneth Wong.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, Y., Ma, X., Chen, X. et al. LEO: Generative Latent Image Animator for Human Video Synthesis. Int J Comput Vis 133, 1277–1289 (2025). https://doi.org/10.1007/s11263-024-02231-3
DOI: https://doi.org/10.1007/s11263-024-02231-3
