Stylized image generation based on multi-attribute decomposition

Yang, Shuai; Sun, Xinyue; Guo, Jing; Wang, Kai; Ke, Yongzhen; Zhang, Xingjian

doi:10.1007/s10044-025-01577-9

Original Article
Published: 22 November 2025

Volume 28, article number 202 (2025)
Pattern Analysis and Applications

Shuai Yang¹,
Xinyue Sun²,
Jing Guo¹,
Kai Wang¹,
Yongzhen Ke¹,³ &
Xingjian Zhang¹


Abstract

Pre-trained latent diffusion models have achieved excellent results in text-to-image generation, producing high-quality images and encouraging users to steer generation with creative text prompts. To give users finer control over generated details, a common practice is to use a reference image to “stylize” the result. Although the “text + single style image” approach helps users express what they want, this seemingly natural combination hides several problems. The semantics of the text description and the style expressed by the reference image do not always agree, and the two often conflict. For example, the text may ask for “color prominence” while the reference image has a concise modernist style with medium tones. Such a divergence places the style transfer model in a dilemma. The key issue is that a single style image can rarely express all of a user’s style requirements, which limits fine-grained control over the generation process. We therefore aim to resolve the style conflict between text and style images by letting users provide two reference images for style control and include, in the text, control information about the attributes of these two style images. Specifically, we propose a multi-attribute decomposition style transfer method that extracts attribute features from the style images and then uses a lightweight module to fuse these features, trained by fine-tuning. Experimental results show that our method enables attribute-controllable style generation while maintaining good style alignment with the reference images. The code is available at https://gitee.com/yongzhenke/SIG-MAD.
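The abstract does not describe the fusion module in detail, so the following is only an illustrative sketch of one way attribute features from two style images could be combined by a lightweight module. Everything in it is an assumption made for illustration: the names `fuse_attributes` and `FusionAdapter`, the weighted-average blend, and the residual bottleneck design are hypothetical and are not taken from the paper or its repository.

```python
import random


def linear(x, weights, bias):
    """Dense layer on a plain list; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def fuse_attributes(feat_a, feat_b, weight_a, weight_b):
    """Blend two attribute feature vectors by normalized, non-negative weights.

    Assumption: the weights stand in for the per-image attribute control
    that the text prompt would provide.
    """
    total = weight_a + weight_b
    if weight_a < 0 or weight_b < 0 or total <= 0:
        raise ValueError("weights must be non-negative and not both zero")
    return [(weight_a * a + weight_b * b) / total for a, b in zip(feat_a, feat_b)]


class FusionAdapter:
    """Hypothetical lightweight module: down-project, ReLU, up-project, residual."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.w_down = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                       for _ in range(bottleneck)]
        self.b_down = [0.0] * bottleneck
        # Zero-initialized up-projection: before any training the adapter
        # returns its input unchanged, so it cannot disturb a frozen model.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]
        self.b_up = [0.0] * dim

    def __call__(self, x):
        hidden = relu(linear(x, self.w_down, self.b_down))
        delta = linear(hidden, self.w_up, self.b_up)
        return [xi + di for xi, di in zip(x, delta)]


# Toy features: image A supplies one attribute, image B another.
fused = fuse_attributes([1.0, 0.0], [0.0, 1.0], weight_a=3.0, weight_b=1.0)
adapter = FusionAdapter(dim=2, bottleneck=1)
conditioned = adapter(fused)
```

In this sketch only the adapter's small weight matrices would be updated during fine-tuning, which is one common reading of "lightweight"; the authors' actual module may differ.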

Data availability

The code and data are available at https://gitee.com/yongzhenke/SIG-MAD.

Author information

Authors and Affiliations

School of Computer Science and Technology, Tiangong University, Tianjin, China
Shuai Yang, Jing Guo, Kai Wang, Yongzhen Ke & Xingjian Zhang
School of Software Engineering, Tiangong University, Tianjin, China
Xinyue Sun
National Demonstration Center for Experimental Engineering Training Education, Tiangong University, Tianjin, China
Yongzhen Ke

Contributions

Shuai Yang: Conceptualization, Formal analysis, Methodology, Software, Validation, Writing – original draft, Writing – review & editing. Xinyue Sun: Formal analysis, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing. Jing Guo: Formal analysis, Validation, Writing – review & editing. Kai Wang: Methodology, Validation, Writing – review & editing. Yongzhen Ke: Conceptualization, Methodology, Project administration, Resources, Writing – review & editing. Xingjian Zhang: Data curation, Methodology, Validation, Writing – review & editing.

Corresponding author

Correspondence to Yongzhen Ke.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, S., Sun, X., Guo, J. et al. Stylized image generation based on multi-attribute decomposition. Pattern Anal Applic 28, 202 (2025). https://doi.org/10.1007/s10044-025-01577-9

Received: 12 July 2025
Accepted: 05 November 2025
Published: 22 November 2025
Version of record: 22 November 2025
DOI: https://doi.org/10.1007/s10044-025-01577-9
