LEO: Generative Latent Image Animator for Human Video Synthesis

Wang, Yaohui; Ma, Xin; Chen, Xinyuan; Chen, Cunjian; Dantcheva, Antitza; Dai, Bo; Qiao, Yu

doi:10.1007/s11263-024-02231-3

LEO: Generative Latent Image Animator for Human Video Synthesis

Published: 27 September 2024

Volume 133, pages 1277–1289 (2025)
Cite this article

International Journal of Computer Vision Aims and scope Submit manuscript

Yaohui Wang ORCID: orcid.org/0009-0002-9487-6187¹,
Xin Ma^1,2,
Xinyuan Chen¹,
Cunjian Chen²,
Antitza Dantcheva³,
Bo Dai¹ &
…
Yu Qiao¹

660 Accesses
27 Citations
Explore all metrics

Abstract

Spatio-temporal coherency is a major challenge in synthesizing high quality videos, particularly in synthesizing human videos that contain rich global and local deformations. To resolve this challenge, previous approaches have resorted to different features in the generation process aimed at representing appearance and motion. However, in the absence of strict mechanisms to guarantee such disentanglement, a separation of motion from appearance has remained challenging, resulting in spatial distortions and temporal jittering that break the spatio-temporal coherency. Motivated by this, we here propose LEO, a novel framework for human video synthesis, placing emphasis on spatio-temporal coherency. Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance. We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM). The former bridges a space of motion codes with the space of flow maps, and synthesizes video frames in a warp-and-inpaint manner. LMDM learns to capture motion prior in the training data by synthesizing sequences of motion codes. Extensive quantitative and qualitative analysis suggests that LEO significantly improves coherent synthesis of human videos over previous methods on the datasets TaichiHD, FaceForensics and CelebV-HQ. In addition, the effective disentanglement of appearance and motion in LEO allows for two additional tasks, namely infinite-length human video synthesis, as well as content-preserving video editing. Project page: https://wyhsirius.github.io/LEO-project/.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+

from $39.99 /Month

Starting from 10 chapters or articles per month
Access and download chapters and articles from more than 300k books and 2,500 journals
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 5

Fig. 6

LaMD: Latent Motion Diffusion for Image-Conditional Video Generation

Article 03 March 2025

Video Frame Interpolation for Large Motion with Generative Prior

Multimodal Media Creation: Integrating LLMs for High-Quality Video Generation

Data Availability

The datasets used during and analyzed during the current study are available in the following public domain resources: $\bullet $ FaceForensics (Rössler et al., 2018) https://github.com/ondyari/FaceForensics

$\bullet $ CelebV-HQ (Zhu et al., 2022) https://celebv-hq.github.io

$\bullet $ TaichiHD (Siarohin et al., 2019) https://github.com/AliaksandrSiarohin/first-order-model

The models and source data generated during and analyzed during the current study are available from the corresponding author upon reasonable request.

References

Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., & Dekel, T. (2022). Text2live: Text-driven layered image and video editing. In ECCV.
Bergman, A., Kellnhofer, P., Yifan, W., Chan, E., Lindell, D., & Wetzstein, G. (2022). Generative neural articulated radiance fields. NeurIPS, 35, 19900–19916.
Google Scholar
Bhagat, S., Uppal, S., Yin, Z., & Lim, N. (2020). Disentangling multiple features in video sequences using gaussian processes in variational autoencoders. In ECCV.
Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., & Jampani, V. (2023a). Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127
Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., & Kreis, K. (2023b). Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR.
Brock, A., Donahue, J., & Simonyan, K. (2019). Large scale GAN training for high fidelity natural image synthesis. In ICLR.
Brooks, T., Hellsten, J., Aittala, M., Wang, T.-C., Aila, T., Lehtinen, J., Liu, M.-Y., Efros, A. A., & Karras, T. (2022). Generating long videos of dynamic scenes. Advances in Neural Information Processing Systems, 35, 31769–31781.
Google Scholar
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR.
Chan, C., Ginosar, S., Zhou, T., & Efros, A. A. (2019). Everybody dance now. In ICCV.
Chen, X., Wang, Y., Zhang, L., Zhuang, S., Ma, X., Yu, J., Wang, Y., Lin, D., Qiao, Y., & Liu, Z. (2023). Seine: Short-to-long video diffusion model for generative transition and prediction. In ICLR.
Chen, H., Zhang, Y., Cun, X., Xia, M., Wang, X., Weng, C., & Shan, Y. (2024). Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In CVPR.
Chu, C., Zhmoginov, A., & Sandler, M. (2017). CycleGAN: a master of steganography. arXiv preprint arXiv:1712.02950
Clark, A., Donahue, J., & Simonyan, K. (2019). Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571
Denton, E. L., & Birodkar, V. (2017). Unsupervised learning of disentangled representations from video. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), NeurIPS.
Esser, P., Rombach, R., & Ommer, B. (2021). Taming transformers for high-resolution image synthesis. In CVPR.
Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). Slowfast networks for video recognition. In CVPR.
Ge, S., Hayes, T., Yang, H., Yin, X., Pang, G., Jacobs, D., Huang, J.-B., & Parikh, D. (2022). Long video generation with time-agnostic vqgan and time-sensitive transformer. In ECCV.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In NIPS.
Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., & Salimans, T. (2022a). Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303
Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., & Fleet, D. J. (2022b). Video diffusion models. arXiv preprint arXiv:2204.03458
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. NeurIPS, 33, 6840–6851.
Google Scholar
Huang, X., Liu, M.-Y., Belongie, S., & Kautz, J. (2018). Multimodal unsupervised image-to-image translation. In ECCV.
Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A. (2017). Image-to-Image Translation with Conditional Adversarial Networks. In CVPR.
Jang, Y., Kim, G., & Song, Y. (2018). Video Prediction with Appearance and Motion Conditions. In ICML.
Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In CVPR.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., & Aila, T. (2020). Analyzing and improving the image quality of StyleGAN. In CVPR.
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In ICLR.
Li, Y., & Mandt, S. (2018). Disentangled sequential autoencoder. ICML.
Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., & Yang, M.-H. (2018). Flow-grounded spatial-temporal video prediction from still images. In ECCV.
Luo, Z., Chen, D., Zhang, Y., Huang, Y., Wang, L., Shen, Y., Zhao, D., Zhou, J., & Tan, T. (2023). Videofusion: Decomposed diffusion models for high-quality video generation. In CVPR.
Ma, X., Wang, Y., Jia, G., Chen, X., Liu, Z., Li, Y.-F., Chen, C., & Qiao, Y. (2024). Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048
Menapace, W., Siarohin, A., Skorokhodov, I., Deyneka, E., Chen, T.-S., Kag, A., Fang, Y., Stoliar, A., Ricci, E., Ren, J., & Tulyakov, S. (2024). Snap video: Scaled spatiotemporal transformers for text-to-video synthesis. In CVPR.
Nichol, A. Q., & Dhariwal, P. (2021). Improved denoising diffusion probabilistic models. In ICML.
Ohnishi, K., Yamamoto, S., Ushiku, Y., & Harada, T. (2018). Hierarchical video generation from orthogonal information: Optical flow and texture. In AAAI.
Pan, J., Wang, C., Jia, X., Shao, J., Sheng, L., Yan, J., & Wang, X. (2019). Video generation from single semantic label map. arXiv preprint arXiv:1903.04480
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). (2018). Improving language understanding by generative pre-training.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Google Scholar
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., & Sutskever, I. (2021). Zero-shot text-to-image generation. In ICML.
Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., & Nießner, M. (2018). Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179
Saito, M., Matsumoto, E., & Saito, S. (2017). Temporal generative adversarial nets with singular value clipping. In ICCV.
Saito, M., Saito, S., Koyama, M., & Kobayashi, S. (2020). Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan. IJCV.
Shen, X., Li, X., & Elhoseiny, M. (2023). Mostgan-v: Video generation with temporal motion styles. In CVPR.
Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., & Sebe, N. (2019). First order motion model for image animation. In NeurIPS.
Siarohin, A., Woodford, O., Ren, J., Chai, M., & Tulyakov, S. (2021). Motion representations for articulated animation. In CVPR.
Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., & Taigman, Y. (2023). Make-a-video: Text-to-video generation without text-video data. In ICLR.
Skorokhodov, I., Tulyakov, S., & Elhoseiny, M. (2022). Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In CVPR.
Song, J., Meng, C., & Ermon, S. (2021). Denoising diffusion implicit models. In ICLR.
Tian, Y., Ren, J., Chai, M., Olszewski, K., Peng, X., Metaxas, D. N., & Tulyakov, S. (2021). A good image generator is what you need for high-resolution video synthesis. In ICLR.
Tulyakov, S., Liu, M.-Y., Yang, X., & Kautz, J. (2018). MoCoGAN: Decomposing motion and content for video generation. In CVPR.
Van Den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). Neural discrete representation learning. NeurIPS.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In NeurIPS.
Villegas, R., Babaeizadeh, M., Kindermans, P.-J., Moraldo, H., Zhang, H., Saffar, M. T., Castro, S., Kunze, J., & Erhan, D. (2023). Phenaki: Variable length video generation from open domain textual descriptions. In ICLR.
Vondrick, C., Pirsiavash, H., & Torralba, A. (2016). Generating videos with scene dynamics. In NIPS.
Walker, J., Marino, K., Gupta, A., & Hebert, M. (2017). The pose knows: Video forecasting by generating pose futures. In ICCV.
Wang, Y. (2021). Learning to Generate Human Videos. Theses: Inria - Sophia Antipolis; Université Cote d’Azur.
Wang, Y., Bilinski, P., Bremond, F., & Dantcheva, A. (2020). G3AN: Disentangling appearance and motion for video generation. In CVPR.
Wang, Y., Bilinski, P., Bremond, F., & Dantcheva, A. (2020). Imaginator: Conditional spatio-temporal gan for video generation. In WACV.
Wang, Y., Bremond, F., & Dantcheva, A. (2021). Inmodegan: Interpretable motion decomposition generative adversarial network for video generation. arXiv preprint arXiv:2101.03049
Wang, T. Y., Ceylan, D., Singh, K. K., & Mitra, N. J. (2021). Dance in the wild: Monocular human animation with neural dynamic appearance synthesis. In 3DV.
Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., & Guo, Y. (2023). Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103
Wang, T.-C., Liu, M.-Y., Tao, A., Liu, G., Kautz, J., & Catanzaro, B. (2019). Few-shot video-to-video synthesis. In NeurIPS.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Liu, G., Tao, A., Kautz, J., & Catanzaro, B. (2018). Video-to-video synthesis. In NeurIPS.
Wang, Y., Yang, D., Bremond, F., & Dantcheva, A. (2022). Latent image animator: Learning to animate images via latent space navigation. In ICLR.
Xie, J., Gao, R., Zheng, Z., Zhu, S.-C., & Wu, Y. N. (2020). Motion-based generator model: Unsupervised disentanglement of appearance, trackable and intrackable motions in dynamic patterns. In AAAI.
Yan, W., Zhang, Y., Abbeel, P., & Srinivas, A. (2021). Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157
Yang, Z., Li, S., Wu, W., & Dai, B. (2022). 3dhumangan: Towards photo-realistic 3d-aware human image generation. arXiv preprint.
Yang, C., Wang, Z., Zhu, X., Huang, C., Shi, J., & Lin, D. (2018). Pose guided human video generation. In ECCV.
Yu, S., Sohn, K., Kim, S., & Shin, J. (2023). Video probabilistic diffusion models in projected latent space. In CVPR.
Yu, S., Tack, J., Mo, S., Kim, H., Kim, J., Ha, J.-W., & Shin, J. (2022). Generating videos with dynamics-aware implicit generative adversarial networks. In ICLR.
Zakharov, E., Shysheya, A., Burkov, E., & Lempitsky, V. (2019). Few-shot adversarial learning of realistic neural talking head models. In ICCV.
Zhang, L., & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models.
Zhang, D. J., Wu, J. Z., Liu, J.-W., Zhao, R., Ran, L., Gu, Y., Gao, D., & Shou, M. Z. (2023). Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818
Zhao, L., Peng, X., Tian, Y., Kapadia, M., & Metaxas, D. (2018). Learning to forecast and refine residual motion for image-to-video generation. In ECCV.
Zheng, Z., Zheng, L., & Yang, Y. (2018). A discriminatively learned cnn embedding for person reidentification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(1), 1–20.
Article MathSciNet MATH Google Scholar
Zhu, H., Wu, W., Zhu, W., Jiang, L., Tang, S., Zhang, L., Liu, Z., & Loy, C. C. (2022). CelebV-HQ: A large-scale video facial attributes dataset. In ECCV.

Download references

Author information

Authors and Affiliations

Shanghai Artificial Intelligence Laboratory, 701 Yunjin Road, Shanghai, 200030, China
Yaohui Wang, Xin Ma, Xinyuan Chen, Bo Dai & Yu Qiao
Monash University, Wellington Road, Clayton, VIC, 3800, Australia
Xin Ma & Cunjian Chen
Inria, Université Côte d’Azur, 2004 Rte des Lucioles, Valbonne, 06902, France
Antitza Dantcheva

Authors

Yaohui Wang
View author publications
Search author on:PubMed Google Scholar
Xin Ma
View author publications
Search author on:PubMed Google Scholar
Xinyuan Chen
View author publications
Search author on:PubMed Google Scholar
Cunjian Chen
View author publications
Search author on:PubMed Google Scholar
Antitza Dantcheva
View author publications
Search author on:PubMed Google Scholar
Bo Dai
View author publications
Search author on:PubMed Google Scholar
Yu Qiao
View author publications
Search author on:PubMed Google Scholar

Corresponding author

Correspondence to Yaohui Wang.

Ethics declarations

Ethic Statement

In this work, we aim to synthesize high-quality human-centric videos by combining a pretrained image animator with a proposed latent motion diffusion model. Our approach can be used for digital human, online education, and data synthesis for other computer vision tasks, etc. We note that our framework mainly focuses on learning how to model motion distribution in a pretrained image animator rather than directly model appearance. Therefore, our framework is not biased towards any specific gender, race, region, or social class. It works equally well irrespective of the difference in subjects.

Additional information

Communicated by Kwan-Yee Kenneth Wong.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Wang, Y., Ma, X., Chen, X. et al. LEO: Generative Latent Image Animator for Human Video Synthesis. Int J Comput Vis 133, 1277–1289 (2025). https://doi.org/10.1007/s11263-024-02231-3

Download citation

Received: 31 January 2024
Accepted: 20 August 2024
Published: 27 September 2024
Version of record: 27 September 2024
Issue date: March 2025
DOI: https://doi.org/10.1007/s11263-024-02231-3

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+

from $39.99 /Month

Starting from 10 chapters or articles per month
Access and download chapters and articles from more than 300k books and 2,500 journals
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

LEO: Generative Latent Image Animator for Human Video Synthesis

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

LaMD: Latent Motion Diffusion for Image-Conditional Video Generation

Video Frame Interpolation for Large Motion with Generative Prior

Multimodal Media Creation: Integrating LLMs for High-Quality Video Generation

Explore related subjects

Data Availability

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Ethic Statement

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now