
Stylized image generation based on multi-attribute decomposition

  • Original Article
Pattern Analysis and Applications

Abstract

Pre-trained latent diffusion models have achieved excellent results in text-to-image generation, delivering high-quality images and encouraging users to steer generation with creative text prompts. To give users control over generation details, a common practice is to use a reference image to “stylize” the generated result. Although the “text + single style image” approach helps users express what they want, this seemingly natural combination hides several problems. The semantic information in the descriptive text and the style characteristics of the reference image are not always consistent, and the two often conflict. For example, the text may ask for “color prominence” while the reference image is in a concise modernist style with medium tones. Such a style divergence leaves the style transfer model in a dilemma. The key issue is that a single style image can rarely express the user’s full style requirements, which limits fine-grained control over the generation process. We therefore aim to resolve the style conflict between text and style images by letting users supply two reference images for style control and state in the text which attributes of each style image should be applied. Specifically, we propose a multi-attribute decomposition style transfer method that extracts attribute features from the style images and then uses a lightweight module, trained by fine-tuning, to fuse these features. Experimental results demonstrate that our method enables attribute-controllable style generation while maintaining good style alignment with the reference images. The code is available at https://gitee.com/yongzhenke/SIG-MAD.
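
To make the idea of attribute-level fusion concrete, the following is a minimal sketch and not the authors' implementation. It assumes each style image has already been reduced to a per-attribute feature vector, that the text supplies a scalar weight per attribute, and that the lightweight module is a residual bottleneck adapter. The names `fuse_attributes` and `adapter`, the vector sizes, and the weights are all hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def fuse_attributes(feat_a: Vector, feat_b: Vector,
                    weight_a: float, weight_b: float) -> Vector:
    """Blend two attribute feature vectors with normalized weights.

    Assumption: the weights stand in for attribute control taken from the text.
    """
    total = weight_a + weight_b
    if total <= 0:
        raise ValueError("the attribute weights must sum to a positive value")
    wa, wb = weight_a / total, weight_b / total
    return [wa * a + wb * b for a, b in zip(feat_a, feat_b)]


def adapter(x: Vector, down: Matrix, up: Matrix) -> Vector:
    """Residual bottleneck adapter: x + up(relu(down(x)))."""
    hidden = [max(0.0, h) for h in matvec(down, x)]
    delta = matvec(up, hidden)
    return [xi + di for xi, di in zip(x, delta)]


# Hypothetical attribute features: color from image 1, texture from image 2.
color_from_image_1 = [1.0, 0.0, 0.0, 0.0]
texture_from_image_2 = [0.0, 1.0, 0.0, 0.0]

fused = fuse_attributes(color_from_image_1, texture_from_image_2, 0.5, 0.5)

# Down-projection 4 -> 2 with arbitrary values; up-projection 2 -> 4 set to
# zero, so the untrained adapter leaves its input unchanged.
down = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
up = [[0.0, 0.0] for _ in range(4)]

out = adapter(fused, down, up)
print(fused, out)
```

Zero-initializing the up-projection is a common choice for adapters because fine-tuning then starts from the behavior of the frozen base model; whether the paper does the same is not stated in the abstract.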

Data availability

The code and data are available at https://gitee.com/yongzhenke/SIG-MAD.

Author information

Contributions

Shuai Yang: Conceptualization, Formal analysis, Methodology, Software, Validation, Writing – original draft, Writing – review & editing; Xinyue Sun: Formal analysis, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing; Jing Guo: Formal analysis, Validation, Writing – review & editing; Kai Wang: Methodology, Validation, Writing – review & editing; Yongzhen Ke: Conceptualization, Methodology, Project administration, Resources, Writing – review & editing; Xingjian Zhang: Data curation, Methodology, Validation, Writing – review & editing.

Corresponding author

Correspondence to Yongzhen Ke.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, S., Sun, X., Guo, J. et al. Stylized image generation based on multi-attribute decomposition. Pattern Anal Applic 28, 202 (2025). https://doi.org/10.1007/s10044-025-01577-9

  • DOI: https://doi.org/10.1007/s10044-025-01577-9
