Abstract
Pre-trained latent diffusion models have achieved excellent results in text-to-image generation, giving users high-quality images and encouraging them to control generation with creative text. To meet users’ demand for control over generation details, a common practice is to use a reference image to “stylize” the generated result. Although the “text + single style image” approach helps users express what they want, this seemingly natural combination masks several problems. The semantics of the descriptive text and the style characteristics of the reference image are not always consistent, and the two often conflict. For example, the text may ask for “color prominence” while the reference image is in a concise modernist style with medium tones. Such a divergence leaves the style transfer model in a dilemma. The key issue is that a single style image rarely captures the user’s style requirements, which limits fine-grained control over the generation process. We therefore aim to resolve the style conflict between text and style images by enabling users to provide two reference images for style control and to include, in the text, control information about the attributes of these two style images. Specifically, we propose a multi-attribute decomposition style transfer method that extracts attribute features from the style images and then uses a lightweight module for feature fusion, trained by fine-tuning. Experimental results demonstrate that our method enables attribute-controllable style generation while maintaining good style alignment with the reference image. The code is available at https://gitee.com/yongzhenke/SIG-MAD.
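The idea of fusing attribute features from two style images can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute names (`color`, `texture`), the linear blend, the residual bottleneck form of the lightweight module, and the function names `blend_attributes` and `adapter` are all assumptions made for illustration. The released code at the URL above is the authoritative reference.

```python
# Illustrative sketch only; all names and the fusion rule are hypothetical.

def blend_attributes(feats_a, feats_b, share_a):
    """Blend per-attribute feature vectors from two style images.

    feats_a, feats_b: dict mapping attribute name -> list of floats.
    share_a: dict mapping attribute name -> weight in [0, 1], the share
             of that attribute taken from image A (the rest comes from B).
    """
    fused = {}
    for name, w in share_a.items():
        a, b = feats_a[name], feats_b[name]
        fused[name] = [w * x + (1.0 - w) * y for x, y in zip(a, b)]
    return fused


def adapter(vec, w_down, w_up):
    """Residual bottleneck: vec + W_up * relu(W_down * vec).

    w_down has one row per hidden unit (each of length len(vec));
    w_up has one row per output dimension (each of length hidden size).
    """
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in w_down]
    delta = [sum(w * h for w, h in zip(row, hidden)) for row in w_up]
    return [x + d for x, d in zip(vec, delta)]


# Toy features: take color entirely from image A, texture entirely from image B.
feats_a = {"color": [1.0, 0.0], "texture": [0.5, 0.5]}
feats_b = {"color": [0.0, 1.0], "texture": [0.2, 0.8]}
share_a = {"color": 1.0, "texture": 0.0}

fused = blend_attributes(feats_a, feats_b, share_a)

# Zero-initialised up-projection: the adapter starts as the identity map,
# so only fine-tuning would move the fused features away from the blend.
w_down = [[0.1, -0.2]]
w_up = [[0.0], [0.0]]
refined = {name: adapter(vec, w_down, w_up) for name, vec in fused.items()}
```

Only the small adapter weights would be trained in a setup like this, which is one common reading of "lightweight module"; the paper's actual module may differ.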






Data availability
The code and data are available at https://gitee.com/yongzhenke/SIG-MAD.
Author information
Contributions
Shuai Yang: Conceptualization, Formal analysis, Methodology, Software, Validation, Writing – original draft, Writing – review & editing; Xinyue Sun: Formal analysis, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing; Jing Guo: Formal analysis, Validation, Writing – review & editing; Kai Wang: Methodology, Validation, Writing – review & editing; Yongzhen Ke: Conceptualization, Methodology, Project administration, Resources, Writing – review & editing; Xingjian Zhang: Data curation, Methodology, Validation, Writing – review & editing.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
About this article
Cite this article
Yang, S., Sun, X., Guo, J. et al. Stylized image generation based on multi-attribute decomposition. Pattern Anal Applic 28, 202 (2025). https://doi.org/10.1007/s10044-025-01577-9
DOI: https://doi.org/10.1007/s10044-025-01577-9
