Semantic guidance for precise style control in diffusion image generation

Wang, Kairui; Liu, Xinying; Chang, Yonghao; Zhao, Di; Xian, Tian; Geng, Xuelei

doi:10.1038/s41598-025-28715-x

Article
Open access
Published: 28 November 2025

Volume 15, article number 45581 (2025)

Scientific Reports

Kairui Wang¹,
Xinying Liu¹,
Yonghao Chang²,
Di Zhao¹,
Tian Xian¹ &
…
Xuelei Geng¹

Abstract

Diffusion models excel at generating high-quality images and are easy to scale, making them highly popular among active users. Meanwhile, diffusion-based text-to-image models have demonstrated significant potential in transferring reference styles. Recently, much research has focused on decoupling the overall style and semantics of reference images, but there has been limited research on balancing style weights from one or multiple reference images. We propose a method for extracting one or more styles from one or more reference images and fusing them together for style-diverse images. We utilize the SAM model to perform semantic segmentation on reference images, extracting the desired style images, and design a parallel decoupling adapter based on an image adapter to simultaneously decouple multiple styles. Additionally, we optimize the encoder to perform more precise style extraction from style reference images while ensuring that style information is not lost. Our method enables multi-visual style prompting without any fine-tuning, and the intensity of each style is controllable. Furthermore, our work demonstrates outstanding visual stylization results, achieving the best balance between style intensity and the controllability of textual elements.

Introduction

Diffusion-based text-to-image (T2I) generative models^1,2,3 have achieved breakthrough advancements in recent years, exhibiting unprecedented capabilities in generating photorealistic outputs that semantically align with textual inputs. These frameworks have also demonstrated efficacy in personalized customization tasks, particularly in the context of style transfer^4,5,6. Despite these achievements, fine-grained style manipulation remains inherently challenging due to the semantic ambiguity of style attributes (e.g., color palette, line dynamics, artistic genres, brushstroke patterns, and emotional tones). While these features are visually perceptible, their precise linguistic expression remains elusive. With the introduction of the IP-Adapter⁷, which enables the injection of reference images as visual prompts, current style transfer methods have evolved into a dual-input approach: a textual description specifying the semantic content and a reference image controlling the visual style.

Previous studies^8,9,10 have predominantly focused on fine-tuning diffusion models using homogeneous style datasets, ensuring stylistic consistency between generated outputs and training distributions. Recent advancements have shifted emphasis toward training-free style image generation methodologies^4,6,7. Adapter-free approaches^4,6 leverage the self-attention mechanism inherent in diffusion architectures, extracting discriminative key-value features from reference style images through shared attention operations. By contrast, adapter-based methods⁷ employ lightweight modules to distill stylistic representations from reference images, which are then incorporated into the diffusion pipeline via cross-attention mechanisms. Despite significant progress in training-free style feature integration and content-style decoupling, dynamic style balancing and precise reference image manipulation remain under-investigated. To the best of our knowledge, this work constitutes the first systematic exploration of using a single reference image to modulate stylistic consistency across semantically distinct image regions. We present a novel framework, PSC-DG, which seamlessly integrates with pre-trained diffusion models without requiring additional training or fine-tuning. This architecture enables fine-grained control over specific stylistic dimensions within the reference image, as visualized in Fig. 1.

In the pre-trained text-to-image diffusion framework, text embeddings are combined with the model’s internal representations through cross-attention layers. The IP-Adapter enhances this architecture by inserting an auxiliary layer dedicated to processing visual features into each cross-attention module. This modification enables independent manipulation of text and image representations through dedicated cross-attention paths, thereby minimizing information degradation that may occur during direct feature connection. Based on this decoupled cross-attention mechanism, we extend the framework to integrate multiple image feature sets into the decoupling process, allowing the extraction of multiple styles as visual prompts in a single step and precise adjustment of their respective weights.

Inspired by the Inpaint Anything framework¹¹, which effectively integrates the Segment Anything Model (SAM)¹² with AIGC architectures (e.g., Stable Diffusion) to enable advanced capabilities including object removal, content infilling, and scene substitution, we adapt the SAM framework for style transfer tasks. This integration facilitates fine-grained modulation of feature inputs and weights corresponding to specific content styles in the reference image, while harmonizing stylistic consistency between foreground and background elements.

Furthermore, we introduce a weight allocation strategy for segmented content-style features. Through block-wise processing in the image encoder, dynamic weights are assigned to individual segments of the content-style image, enhancing the encoder’s saliency-aware information prioritization. Style embedding extraction is augmented by leveraging CLIP¹³, a state-of-the-art model renowned for its prowess in deriving semantically rich visual features from open-domain imagery. To this end, a pretrained CLIP image encoder is adopted as our feature extraction backbone.

Distinct from previous works, this study places emphasis on the controllability of reference images. The proposed PSC-DG framework enables fine-grained modulation of style reference intensity for specific stylistic attributes within the reference image (i.e., the parametric weighting of content-style components in the generative model), while maintaining high-fidelity output quality. The contributions of this work are threefold:

We devise an IP-Adapter-based mechanism that decouples multiple image feature sets from textual information, with experimental results demonstrating the controllability of each reference image’s style information.
We introduce the novel concept of “style coordination” and propose an innovative approach that synergizes SAM with image diffusion models.
We refine the encoder to mitigate content leakage, optimizing its capacity to extract style features from images with solid-color backgrounds.

Related works

Diffusion-based text-to-image generation

Diffusion models have emerged as a transformative paradigm in computational imaging, ushering in revolutionary breakthroughs in generative modeling. The Denoising Diffusion Probabilistic Model (DDPM)¹⁴ laid the groundwork by modeling image generation as a Markovian denoising process through iterative noise corruption and reconstruction. Subsequent advancements, typified by the Denoising Diffusion Implicit Model (DDIM)¹⁵, relaxed the strict Markovian constraints, enabling deterministic sampling with reduced computational overhead. Contemporary diffusion frameworks, augmented by large-scale pretraining, have established new benchmarks in text-to-image synthesis. These methodologies typically employ the U-Net architecture¹⁶ as the core diffusion backbone, augmented with cross-attention mechanisms to integrate textual embeddings from pretrained language models. A notable milestone was the introduction of the Latent Diffusion Model (LDM)², commercialized as Stable Diffusion (SD), which revolutionized the field by compressing data via a pretrained autoencoder, thereby transferring the generative process to a latent space with reduced dimensionality. Modern text-to-image diffusion systems^17,18,19 have become indispensable tools for visual content creation, demonstrating unprecedented capabilities in generating semantically aligned, high-fidelity images. The latest iteration, SDXL¹⁷, represents a significant leap forward, achieving superior synthesis quality and efficiency through architectural scaling, refined text-image alignment, and the introduction of a dedicated post-processing refinement stage. Powerful diffusion models provide extensive stylistic prior knowledge for style transfer, enabling the capture and reproduction of texture details from different artistic genres or personalized styles while preserving the main features of the content, thereby enhancing the creative expressiveness of style transfer.

Stylized image generation

Stylized image generation has emerged as a dynamic research frontier at the intersection of computer vision and computational graphics, aiming to synthesize images infused with distinct artistic or visual styles through innovative methodologies. Early customization approaches^8,10 focused on optimizing subsets or full diffusion model parameters to encapsulate stylistic attributes from reference images. However, these methods suffer from severe overfitting, compromising text-prompt fidelity and requiring extensive fine-tuning, often spanning hours per reference image. In contrast, text-inversion techniques^20,21 project style images into learnable textual token embeddings, though this cross-modal mapping may introduce information degradation. The Diffusion Cocktail framework²² explores compositional strategies by exchanging content-style information across models to enhance generative diversity, albeit with limited efficiency and controllability. Jin et al.²³ propose a Frequency-aware Cross-Modal Attention Network (FCMNet), which constructs a dual-stream encoder-decoder by designing a Frequency-aware Cross-Modal Attention (FACMA) module, a Spatial Frequency Channel Attention (SFCA) module, and a Weighted Cross-Modal Fusion (WCMF) module. Jin et al.²⁴ propose a Tri-Party Progressive Integration Network (TriPINet). This network operates by extracting three types of features: RGB, frequency, and noise. It is designed with a gCMDA module to fuse cross-modal features and a PI-SE module to progressively integrate multi-scale features. Xie et al.²⁵ propose a novel architecture named the High-Order Graph Convolutional Transformer (HOGFormer). This architecture comprises three core modules: Chebyshev Graph Convolution (CGConv), a Graph-based Dynamic Adjacency Matrix Transformer (GDAMFormer), and High-Order Graph Convolution (HOGConv). It is designed to effectively capture both global and local information. FreeStyle²⁶ achieves text-guided style transfer using pretrained diffusion models via a dual-stream encoder and single-stream decoder architecture augmented with a feature modulation module, eliminating the need for optimization.

Recent research has shifted toward tuning-free paradigms, leveraging stylized image generation adapters to distill visual features and integrate them into diffusion’s cross-attention mechanisms^{4,5,6,7,27,28,29,30,31,32}. For instance, StyleAlign⁴ and swap self-attention⁶ manipulate the denoising process by aligning self-attention keys and values with reference blocks. T2I-Adapter³¹ and IP-Adapter utilize Transformer-based architectures³³ as image encoders, processing CLIP-derived embeddings through U-Net cross-attention layers. DEADiff²⁹ employs Q-Formers³⁴ filters trained on paired data to extract decoupled features, selectively injecting them into cross-attention layers. DiffuseST³⁵ synergizes textual and spatial features via iterative denoising, achieving high-quality style transfer by disentangling content and style injections in the target branch. InstantStyle³² preserves style fidelity during text-to-image synthesis by segregating style and content in feature space and embedding reference features into style-specific blocks. The above work has achieved good results in single-style prompt generation and one-shot stylized image generation. However, few studies have investigated the balance between style and content in images under multi-style prompts. To address this, we conducted the first study on this topic.

Method

Diffusion models are probabilistic generative models, where the generation process consists of a forward process and a reverse process. The forward process is a Markov chain, where each step injects a small amount of Gaussian noise into the latent variables. Formally, for a series of time steps $t = 1, \ldots , T$, the process is represented as:

$$\begin{aligned} z_t=\sqrt{\alpha _t}z_{t-1}+\sqrt{1-\alpha _t}\epsilon _t\end{aligned}$$

(1)

Where $z_t$ is the latent variable at time step $t$, $\alpha _t$ represents the variance schedule, and $\epsilon _t \sim \mathcal {N} ( 0, \mathcal {I} )$. The reverse process aims to reconstruct the latent representation from the noise. It is defined as:

$$\begin{aligned} \hat{z}_{t-1}=\frac{1}{\sqrt{\alpha _t}}\left( \hat{z}_t-\frac{1-\alpha _t}{\sqrt{1-\overline{\alpha }_t}}\hat{\epsilon }(\hat{z}_t,t;\theta )\right) \end{aligned}$$

(2)

Where $\hat{z}_{t-1}$ is the reconstructed latent variable at time step $t-1$, $\overline{\alpha }_t$ is the cumulative product of $\alpha _t$ up to time step $t$, and $\hat{\epsilon }(\hat{z}_t,t;\theta )$ is the noise predicted by the model parameterized by $\theta$.
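To make Eqs. (1) and (2) concrete, the following is a minimal NumPy sketch of one forward noising step and one reverse step. The linear schedule, the number of steps, and the function names are illustrative assumptions, not details taken from this work; in practice $\hat{\epsilon }$ comes from the trained U-Net, whereas here the true noise is reused purely as a sanity check.

```python
import numpy as np

def forward_step(z_prev, alpha_t, rng):
    """Eq. (1): z_t = sqrt(alpha_t) * z_{t-1} + sqrt(1 - alpha_t) * eps_t."""
    eps = rng.standard_normal(z_prev.shape)
    z_t = np.sqrt(alpha_t) * z_prev + np.sqrt(1.0 - alpha_t) * eps
    return z_t, eps

def reverse_step(z_t, eps_hat, alpha_t, alpha_bar_t):
    """Eq. (2): estimate z_{t-1} from z_t and the predicted noise eps_hat."""
    coef = (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t)
    return (z_t - coef * eps_hat) / np.sqrt(alpha_t)

# Hypothetical schedule: alpha_t = 1 - beta_t with linearly spaced beta_t.
T = 10
alphas = 1.0 - np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(alphas)  # cumulative product of alpha_t

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 4))  # toy latent

# Sanity check at t = 1, where alpha_bar_1 == alpha_1: feeding the true noise
# into the reverse step recovers z0 exactly.
z1, eps1 = forward_step(z0, alphas[0], rng)
z0_rec = reverse_step(z1, eps1, alphas[0], alpha_bars[0])

# Run the full forward chain to obtain a noisy latent z_T.
z = z0.copy()
for t in range(T):
    z, _ = forward_step(z, alphas[t], rng)
```

The exact recovery holds only at the first step with the true noise; at later steps the model predicts the accumulated noise, so Eq. (2) gives an estimate rather than an inversion of a single forward step.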

Stable Diffusion, a pretrained text-conditioned latent diffusion model (LDM), has gained widespread recognition for its capacity to synthesize high-fidelity images from textual prompts. Built upon the U-Net architecture, this network functions as a noise predictor, its core being an encoder-decoder architecture featuring skip connections that pass information between the down-sampling and up-sampling paths to preserve fine image details. Each layer incorporates a residual block, a self-attention block, and a cross-attention block. Specifically, the residual blocks facilitate the construction of deeper networks and mitigate the vanishing gradient problem, enabling more stable training. The self-attention blocks allow different image regions (or patches) of the model to communicate with one another, capturing global dependencies within the image and ensuring the internal consistency and coherence of the generated content. The cross-attention blocks are the key mechanism for achieving text conditioning; they use prompt embeddings from a text encoder (like CLIP) as the Key and Value, and the intermediate image features from the U-Net as the Query. This injects textual semantic information into the image generation process, guiding the model to denoise according to the text description. Our proposed methodology capitalizes on the Stable Diffusion framework, which adeptly disentangles content and style conditioning throughout the image synthesis pipeline, yielding visually striking and coherent outputs.

To enhance this framework, we incorporate two innovative modules, as depicted in Fig. 2. The first, a multi-decoupled cross-attention module, facilitates the injection of features derived from multiple style images into the diffusion process through cross-attention mechanisms; a detailed discussion of this component is provided in “Multi-decoupled cross-attention”. The second, a style segmentation module, partitions the reference image into content style and background style, allowing for targeted block selection during feature integration, as expounded in “Semantic-guided style feature extraction”. In “Image feature processing”, we delineate the differentiated processing strategies applied to content style and background style following content extraction from the reference image.

Multi-decoupled cross-attention

Upon extracting the style embeddings, two strategies can be employed to integrate the style conditions with the textual conditions: (1) Appending to Text: In this approach, style embeddings are concatenated with text embeddings, and the resulting composite interacts with the backbone features via the conventional text-based cross-attention mechanism. While many image-prompting techniques rely on straightforward concatenation to incorporate image feature information, this method proves suboptimal, as it does not fully leverage the rich information encapsulated within the image features. The cross-attention mechanism in traditional Latent Diffusion Models (LDMs) can be mathematically formulated as follows:

$$\begin{aligned} {\bf Z}^{\prime }=\mathrm {Attention~}({\bf Q},{\bf K},{\bf V})=\mathrm {~Softmax~}\left( \frac{{\bf Q}{\bf K}^{\top }}{\sqrt{d}}\right) {\bf V}\end{aligned}$$

(3)

Where ${\bf Q}={\bf Z}{\bf W}_{q}$, ${\bf K}=c_{t}{\bf W}_{k}$, ${\bf V}=c_{t}{\bf W}_{v}$; $c_{t}$ represents the text features and ${\bf Z}$ denotes the hidden_state associated with the image.

(2) The IP-Adapter incorporates a decoupled cross-attention mechanism: a new cross-attention module is added for the style embeddings, which then integrates features based on text conditions and features based on style conditions. After decoupling the attention in the IP-Adapter, the cross-attention between the text and ${\bf Q}$ is calculated separately from the cross-attention between the reference image and ${\bf Q}$ . The two attention matrices are then summed, calculated as follows:

$$\begin{aligned} {\bf Z}^{\textrm{new}}=\textrm{Softmax}\left( \frac{{\bf Q}{\bf K}^{\textrm{T}}}{\sqrt{d}}\right) {\bf V}+\lambda \times \textrm{Softmax}\left( \frac{{\bf Q}({\bf K}^{\prime })^{\textrm{T}}}{\sqrt{d}}\right) {\bf V}^{\prime }\end{aligned}$$

(4)

where ${\bf Q}={\bf Z}{\bf W}_q,{\bf K}=c_t{\bf W}_k,{\bf V}=c_t{\bf W}_v,{\bf K}^{\prime }=c_i{\bf W}_k^{\prime },{\bf V}^{\prime }=c_i{\bf W}_v^{\prime }$, and $c_i$ is the hidden_state obtained by encoding the reference image corresponding to the IP-Adapter with CLIP and processing it. Building on this approach, we design two types of adapters that can integrate multiple images simultaneously. (1) As a point of comparison, we consider the common concatenation method, in which features from multiple images are fused and the fused image features are then injected in the same way as in the IP-Adapter. In this case, the only change is ${\bf K}^{\prime }=c_{i}^{\prime }{\bf W}_{k}^{\prime },{\bf V}^{\prime }=c_{i}^{\prime }{\bf W}_{v}^{\prime }$, where $c_{i}^{\prime }$ represents the concatenated feature information from multiple images. (2) We directly extend the decoupled cross-attention of the IP-Adapter to the feature information from multiple images. This involves separating the cross-attention layers between the text features and the features of each individual image, with the calculation represented as follows:

$$\begin{aligned} {\bf Z}^{\textrm{new}}={}&\textrm{Softmax}\left( \frac{{\bf Q}{\bf K}^{\textrm{T}}}{\sqrt{d}}\right) {\bf V}+\lambda _1\times \textrm{Softmax}\left( \frac{{\bf Q}({\bf K}_1^{\prime })^{\textrm{T}}}{\sqrt{d}}\right) {\bf V}_1^{\prime }\\&+\ldots +\lambda _n\times \textrm{Softmax}\left( \frac{{\bf Q}({\bf K}_n^{\prime })^{\textrm{T}}}{\sqrt{d}}\right) {\bf V}_n^{\prime } \end{aligned}$$

(5)

Where ${\bf Q}={\bf Z}{\bf W}_q,{\bf K}=c_t{\bf W}_k,{\bf V}=c_t{\bf W}_v,{\bf K}_j^{\prime }=c_{ij}{\bf W}_k^{\prime },{\bf V}_j^{\prime }=c_{ij}{\bf W}_v^{\prime }$. Here, $c_{ij}$ represents the features of the $j$-th reference image after being encoded by CLIP, and $\lambda _j$ is the corresponding weight, allowing us to easily control the weights of the features for each reference image. We ultimately chose the latter approach. While the former can also effectively extract features from multiple style images and decouple them from the text prompts, the decoupling between images is suboptimal, making it difficult to harmonize the balance between different styles.
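The multi-image formulation in Eq. (5) can be sketched in a few lines of NumPy. This is a single-head toy version under stated assumptions: the tensor shapes, the random projection matrices, and the function names are hypothetical, and the image branches share one pair of projections ${\bf W}_k^{\prime }$, ${\bf W}_v^{\prime }$ as in the definitions above.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (3)."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_decoupled_cross_attention(Z, c_t, c_imgs, lambdas,
                                    W_q, W_k, W_v, W_k_img, W_v_img):
    """Eq. (5): text cross-attention plus one weighted branch per reference image."""
    Q = Z @ W_q
    out = attention(Q, c_t @ W_k, c_t @ W_v)            # text branch
    for c_ij, lam in zip(c_imgs, lambdas):              # one branch per image
        out = out + lam * attention(Q, c_ij @ W_k_img, c_ij @ W_v_img)
    return out

# Toy shapes (assumptions): 10 query tokens, 7 text tokens, 4 tokens per image.
rng = np.random.default_rng(0)
d_model, d_text, d_img, d = 8, 6, 5, 4
Z = rng.standard_normal((10, d_model))
c_t = rng.standard_normal((7, d_text))
c_imgs = [rng.standard_normal((4, d_img)), rng.standard_normal((4, d_img))]
W_q = rng.standard_normal((d_model, d))
W_k, W_v = rng.standard_normal((d_text, d)), rng.standard_normal((d_text, d))
W_k_img, W_v_img = rng.standard_normal((d_img, d)), rng.standard_normal((d_img, d))

args = (W_q, W_k, W_v, W_k_img, W_v_img)
text_only = multi_decoupled_cross_attention(Z, c_t, c_imgs, [0.0, 0.0], *args)
blended = multi_decoupled_cross_attention(Z, c_t, c_imgs, [0.7, 0.3], *args)
```

Because each image has its own softmax, changing one $\lambda _j$ rescales only that image's contribution. In the concatenated variant, all image tokens compete inside a single softmax, which is one plausible reading of why per-style balancing is harder there.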

Semantic-guided style feature extraction

In the work of Inpaint Anything, the automatic segmentation is triggered via the Segment Anything Model (SAM) to achieve content separation in images. Subsequent inpainting of resultant voids is performed using models such as LaMa³⁶, ensuring visual coherence across the image. After object removal, text prompts are processed by generative models like Stable Diffusion to synthesize contextually appropriate content for vacated regions. Inspired by this approach, our method starts with a reference image to separate the specified content style from the background style. SAM, as a segmentation architecture optimized for prompt-driven tasks, demonstrates exceptional responsiveness to localized prompts specified by user input (such as coordinates and bounding boxes). This capability allows for precise description of content-style and background-style regions, seamlessly aligning with our conceptual framework.
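The separation step can be illustrated with a short sketch that splits a reference image into a content-style image and a background-style image, given a binary mask. The mask is assumed to come from SAM prompted with a point or bounding box; SAM itself is not called here, and the function name and toy data are hypothetical.

```python
import numpy as np

def split_by_mask(image, mask):
    """Split an (H, W, 3) image into content and background images.

    mask is an (H, W) boolean array, True where the selected content lies.
    Pixels outside each region are set to black.
    """
    keep = mask[..., None]                       # broadcast mask over channels
    content = np.where(keep, image, 0).astype(image.dtype)
    background = np.where(keep, 0, image).astype(image.dtype)
    return content, background

# Toy reference image and a mask standing in for a SAM output (assumption).
image = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True                              # "content" in the top-left corner

content_img, background_img = split_by_mask(image, mask)
```

The two outputs are complementary, so they sum back to the original image; the black regions they contain are what the later feature-processing step has to account for.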

However, the separated images cannot be directly injected into the diffusion model through Multi-Decoupled Cross-Attention, because we found that the diffusion model encodes hierarchical semantic information in its layers, with specific attention modules exhibiting a preference for encoding style-related features. Specifically, the layer up_blocks.0.attentions.1 in the first upsampling block has been identified as particularly effective in capturing stylistic attributes such as color palettes, textures, and ambiance. By exploiting these layers, we implicitly extract style information while minimizing content leakage, preserving stylistic fidelity. Once these style-specific blocks are identified, features derived from the reference image are selectively injected into them, enabling seamless style transfer. This methodology addresses content leakage concerns in post-segmentation reference images, ensuring a controlled and refined stylization process.
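Selective injection amounts to giving the image branch a nonzero weight only in the style-specific layers. The sketch below shows that bookkeeping in plain Python. The layer names are illustrative examples modeled on the naming used above (written with underscores, as is common in implementations), and the helper is hypothetical rather than part of any library.

```python
# Layers assumed to carry style information (prefix match on the layer name).
STYLE_LAYERS = ("up_blocks.0.attentions.1",)

def image_scale_for_layer(layer_name, style_scale, style_layers=STYLE_LAYERS):
    """Return the image-branch weight for one cross-attention layer.

    Style layers receive style_scale; every other layer receives 0.0, so the
    reference image does not influence it.
    """
    for prefix in style_layers:
        if layer_name == prefix or layer_name.startswith(prefix + "."):
            return style_scale
    return 0.0

# Illustrative cross-attention layer names (assumption, not an exhaustive list).
layer_names = [
    "down_blocks.2.attentions.1.transformer_blocks.0.attn2",
    "mid_block.attentions.0.transformer_blocks.0.attn2",
    "up_blocks.0.attentions.0.transformer_blocks.0.attn2",
    "up_blocks.0.attentions.1.transformer_blocks.0.attn2",
    "up_blocks.1.attentions.0.transformer_blocks.0.attn2",
]
scales = {name: image_scale_for_layer(name, 0.8) for name in layer_names}
```

With several reference images, one such scale map per image would let each style keep its own weight inside the same style-specific layers.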

Image feature processing

Raw extracted background and content style features are not directly applicable for downstream tasks. During CLIP encoding, missing content in background regions, such as black pixels, receives disproportionate weight from the attention mechanism, a phenomenon we aim to mitigate. We found that moderate cropping (e.g., reducing by 40%) can preserve stylistic representations comparable to the original images. Based on this insight, we propose a masked sampling strategy that extracts background styles from unaffected regions of equivalent spatial dimensions for filling. This approach prevents the attention mechanism from over-focusing on blank regions while maintaining overall style consistency. In the CLIP image encoding pipeline, the Vision Transformer (ViT)⁹ partitions images into non-overlapping patches, processes them independently, and aggregates the results. After content-style separation, background regions appear as black pixels. During encoding and aggregation, these black areas introduce artifacts due to consecutive black patches in initial segments, with the severity of distortion increasing as the black regions expand, as shown in Fig. 8. This issue is not limited to black; similar artifacts occur with other solid colors. To address this problem, we propose an adaptive patch weighting scheme that leverages the characteristics of the ViT architecture (Fig. 3). This strategy reduces the impact of black pixel information, minimizes artifact generation, and enables robust content-style extraction (Fig. 9). The detailed method is as follows:

$$\begin{aligned} W=W+(1-W)\times K\end{aligned}$$

(6)

In this context, W represents the proportion of non-black content pixels within each patch relative to the entire patch, and K is a balancing parameter that sets the weight retained by background pixels. If a patch lacks content pixels (W = 0), its updated weight is exactly K. We therefore cannot simply set K to zero, as a sudden “loss of information” during CLIP encoding would severely affect the overall encoding results, especially when this situation occurs in the initial patches. By employing this approach, we amplify the weight of the desired content style while simultaneously reducing the weight of ineffective backgrounds, thereby enabling CLIP to pay greater attention to the style of the image.
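A minimal sketch of the patch weighting in Eq. (6) is given below. It assumes that masked-out pixels are exactly black, that the image height and width are divisible by the patch size, and that the resulting weights are later applied to the patch tokens; how the weights enter the CLIP encoder is not specified here, so that step is left out. The function name and toy image are hypothetical.

```python
import numpy as np

def patch_weights(image, patch, K):
    """Per-patch weights following Eq. (6): W <- W + (1 - W) * K.

    image: (H, W, 3) array in which masked-out pixels are pure black.
    patch: side length of the square, non-overlapping patches.
    K:     balancing parameter (weight retained by an all-black patch).
    """
    H, W_px, _ = image.shape
    non_black = image.any(axis=-1)                          # (H, W) content mask
    grid = non_black.reshape(H // patch, patch, W_px // patch, patch)
    W = grid.mean(axis=(1, 3))                              # content fraction per patch
    return W + (1.0 - W) * K

# Toy 4x4 image split into four 2x2 patches.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[0:2, 0:2] = 255      # top-left patch: fully covered by content
image[0, 2:4] = 255        # top-right patch: half covered
weights = patch_weights(image, patch=2, K=0.3)
# weights == [[1.0, 0.65], [0.3, 0.3]]
```

Fully covered patches keep weight 1, empty patches fall to K, and partially covered patches are interpolated between the two.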

Experiments

Within the experimental framework, comparative evaluations demonstrate that both fine-tuned and pretrained variants of the IP-Adapter yield highly consistent stylization results. To improve computational efficiency for subsequent comparative analyses, the pretrained IP-Adapter is adopted. Regarding the quality and complexity of generated images, SDXL exhibits superior performance, justifying its selection as the backbone diffusion model for this study.

Qualitative results

Text-Guided Image Stylization. To evaluate the robustness and generalization capability of the PSC-DG framework, we conducted a comprehensive suite of style transfer experiments, encompassing the application of diverse artistic styles across heterogeneous content categories while enabling fine-grained weight modulation for distinct content styles within reference images. Representative results of these experiments are visualized in Fig. 4. Through targeted style injection into designated attention blocks, style leakage is effectively mitigated.

Spatially Constrained Image Stylization via ControlNet Integration. We augment our framework by incorporating the ControlNet architecture to enable spatially localized style transfer, with representative results depicted in Fig. 5. Through the integration of a single style reference image and diverse textual prompts, adaptive modulation of stylistic weights for semantically coherent content regions is realized. This approach exhibits broad applicability across heterogeneous stylization scenarios while ensuring full compatibility with the ControlNet framework.

Comparison to previous methods

For the baseline, we compare our method with the latest state-of-the-art stylization methods, including InstantStyle³², Swapping Self-Attention⁶, CAST³⁷, StyleShot³⁸, and the original IP-Adapter with weight adjustment⁷, as shown in Fig. 6. The image generation prompts for each model are the same. In the case of CAST, which does not use text prompts, we employ image prompts. In this analysis, we place greater emphasis on image quality and style transfer quality.

Qualitative evaluation: Figure 6 presents comparative results with state-of-the-art methods. First, content-driven methods such as CAST avoid diffusion models and rely on simple color transfer, failing to capture complex style attributes from reference images. This limitation leads to visible artifacts in generated outputs. Second, Swapping Self-Attention significantly alters content during generation, disrupting style-content balance. While capable of partial style capture, it often produces unsatisfactory content fidelity. The IP-Adapter struggles with style-content decoupling, frequently prioritizing one at the expense of the other, either mismatching content during style adoption or sacrificing stylistic impact for content fidelity. StyleShot-generated images are marred by compositional inconsistencies, including abrupt stylistic shifts, visual transition discontinuities, and ambiguous subject-background relationships. In contrast, InstantStyle achieves commendable style-content harmony. Our proposed method builds upon InstantStyle’s style injection framework by introducing dynamic weight modulation, enabling finer-grained control over reference image stylistic attributes. Notably, our approach demonstrates superior performance in preserving content fidelity while enhancing stylistic details and overall image quality compared to InstantStyle.

Table 1 A quantitative analysis is performed in comparison to state-of-the-art methods.

Quantitative comparison. In Table 1, we compare the performance of our proposed method with four other methods. Five objective metrics are employed: ArtFID and FID are used to quantify the distribution distance between generated images and real images (lower values indicate better generation quality); SS and LPIPS are utilized to measure the perceptual similarity between images; CSD is applied to calculate the style similarity between images (higher values indicate more effective style transfer). Additionally, one subjective metric, “Preference,” is obtained through a subjective preference survey involving 100 participants. The experimental results demonstrate that our method exhibits advantages across most evaluation metrics: its ArtFID (34.2) and FID (24.6) are the minimum values, and the CSD score (0.41) is the maximum value, which confirms the superior objective quality of the generated images. Moreover, the Preference metric reaches 0.31, far exceeding that of the comparison methods, indicating a higher level of subjective acceptance among users. In summary, our method significantly outperforms the comparison methods in both objective performance and subjective preference, effectively validating the superiority of our proposed method.

Ablation study

Following CLIP-based feature extraction, an adaptive weighting scheme was applied to the extracted features. Compared to the baseline scenario without weight modulation, this approach attenuated attention to black background regions across multiple spatial locations while enhancing stylistic attention to subject content areas, as visualized in Fig. 7.

In a comparison with InstantStyle on style extraction from reference images with pure backgrounds (Fig. 8), empirical analysis reveals that background weight attenuation leads to significantly enhanced attention to content style features, resulting in generated images with closer stylistic alignment to the reference content. In contrast, InstantStyle exhibits sensitivity to dark background pixels, leading to systematic darkening of generated outputs.

Figure 8 demonstrates the effects of reducing the proportion of monochromatic backgrounds while maintaining consistent content styles. Empirical analysis reveals that high background proportions are associated with the emergence of irregular artifacts in style-transferred images, often accompanied by compromised content fidelity. As the monochromatic background ratio decreases, these artifacts gradually resolve into well-defined structures, coinciding with improvements in principal content quality. At minimal background proportions, both irregular artifact formation and content degradation are effectively mitigated.

Quantitative analysis of balancing parameter K across multiple image metrics identified an empirically determined optimal range of 0.2–0.5, as visualized in Fig. 10. Figure 11 demonstrates that increasing background weight K leads to progressive darkening of generated images, accompanied by amplified influence of background pixels. Conversely, reducing K shifts the model’s focus toward thematic content regions, enhancing stylistic alignment with target subjects. However, excessively low values (e.g., K=0) induce abrupt feature discontinuities, resulting in suboptimal image quality due to insufficient information propagation. Based on these findings, we recommend setting K within the range of 0.2–0.5 to balance stylistic consistency and content fidelity.
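As a worked illustration of how K shapes the patch weights in Eq. (6), the update can be evaluated for a background-only patch ($W=0$), a half-covered patch ($W=0.5$), and a fully covered patch ($W=1$). The symbol $W_{\textrm{new}}$ is introduced here only to distinguish the updated weight from the input proportion, and the three sample values of $W$ are illustrative assumptions.

$$\begin{aligned} W_{\textrm{new}}=W+(1-W)\times K \end{aligned}$$

$$\begin{array}{c|ccc} K & W=0 & W=0.5 & W=1 \\ \hline 0 & 0 & 0.5 & 1 \\ 0.2 & 0.2 & 0.6 & 1 \\ 0.5 & 0.5 & 0.75 & 1 \\ 1 & 1 & 1 & 1 \end{array}$$

Under this reading, $K=0$ removes background-only patches entirely, which is consistent with the abrupt discontinuities described above, while $K=1$ gives every patch the same weight and so cancels the reweighting. Values in the recommended range of 0.2–0.5 keep a nonzero floor for empty patches while still ranking content-bearing patches higher.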

Conclusion

This paper introduces PSC-DG, a novel framework addressing the challenge of fine-grained stylistic control in text-to-image synthesis. By synergistically integrating the semantic segmentation capabilities of the Segment Anything Model (SAM) with the decoupled cross-attention architecture of IP-Adapter⁷, PSC-DG enables accurate extraction and integration of multiple reference image styles, significantly enhancing output stylistic diversity. Methodologically, a multi-image decoupling mechanism is proposed to disentangle textual and multi-image feature representations, enabling granular control over per-reference image style weights. Additionally, style-specific attention layers are strategically designated for style feature infusion, mitigating content leakage while preserving stylistic expressiveness. At the preprocessing stage, distinct optimization strategies for background and content styles enhance the encoder’s capacity to distill precise style representations.

Extensive experiments validate the efficacy of PSC-DG. Style transfer results demonstrate superior performance across heterogeneous content-style combinations, enabling nuanced style weight adjustments and producing high-fidelity, coherent images. Comparative evaluations against state-of-the-art methods confirm PSC-DG’s consistent superiority in both qualitative and quantitative metrics.

Notwithstanding these contributions, the method’s reliance on reference images introduces limitations. Output quality may degrade when reference images inadequately capture target styles or represent highly niche stylistic categories. Future research should focus on refining style extraction and fusion mechanisms to address these challenges.

Data availability

No external datasets were involved in this study. The datasets generated and/or analyzed during the study are available from the corresponding author upon reasonable request.

References

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125 (2022).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10674–10685. https://doi.org/10.1109/CVPR52688.2022.01042 (IEEE, 2022).
Saharia, C. et al. Photorealistic text-to-image diffusion models with deep language understanding. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22. 36479–36494 (Curran Associates Inc., 2024).
Hertz, A., Voynov, A., Fruchter, S. & Cohen-Or, D. Style aligned image generation via shared attention. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4775–4785. https://doi.org/10.1109/CVPR52733.2024.00457 (IEEE, 2024).
Sohn, K. et al. StyleDrop: Text-to-Image Generation in Any Style. arXiv:2306.00983 (2023).
Jeong, J., Kim, J., Choi, Y., Lee, G. & Uh, Y. Visual Style Prompting with Swapping Self-Attention. https://arxiv.org/abs/2402.12974v2 (2024).
Ye, H., Zhang, J., Liu, S., Han, X. & Yang, W. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv:2308.06721 (2023).
Hu, E. J. et al. LoRA: Low-Rank Adaptation of Large Language Models. https://doi.org/10.48550/arXiv.2106.09685 (2021). arXiv:2106.09685.
Dosovitskiy, A. et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929 (2021).
Ruiz, N. et al. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 22500–22510. https://doi.org/10.1109/CVPR52729.2023.02155 (IEEE, 2023).
Yu, T. et al. Inpaint Anything: Segment Anything Meets Image Inpainting. arXiv:2304.06790 (2023).
Kirillov, A. et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4015–4026 (2023).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning. 8748–8763 (PMLR, 2021).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
Song, J., Meng, C. & Ermon, S. Denoising Diffusion Implicit Models. arXiv:2010.02502 (2022).
Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 (Navab, N., Hornegger, J., Wells, W. M. & Frangi, A. F. Eds.). 234–241. https://doi.org/10.1007/978-3-319-24574-4_28 (Springer, 2015).
Podell, D. et al. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv:2307.01952 (2023).
Zhang, L., Rao, A. & Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3836–3847 (2023).
Li, D. et al. Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation. arXiv:2402.17245 (2024).
Gal, R. et al. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. arXiv:2208.01618 (2022).
Zhang, Y. et al. Inversion-based style transfer with diffusion models. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10146–10156. https://doi.org/10.1109/CVPR52729.2023.00978 (2023).
Liu, H., Guo, Y., Wang, S. & Wen, H. Diffusion Cocktail: Fused Generation from Diffusion Models. arXiv:2312.08873 (2023).
Jin, X. et al. Fcmnet: Frequency-aware cross-modality attention networks for rgb-d salient object detection. Neurocomputing 491, 414–425. https://doi.org/10.1016/j.neucom.2022.04.015 (2022).
Jin, X., Yu, W. & Shi, W. Image manipulation localization via dynamic cross-modality fusion and progressive integration. Neurocomputing 610, 128607. https://doi.org/10.1016/j.neucom.2024.128607 (2024).
Xie, Y., Hong, C., Zhuang, W., Liu, L. & Li, J. Hogformer: High-order graph convolution transformer for 3d human pose estimation. Int. J. Mach. Learn. Cybern. 16, 599–610 (2025).
He, F. et al. Freestyle: Free lunch for text-guided style transfer using diffusion models. arXiv:2401.15636 (2024).
Li, W. et al. Styletokenizer: Defining image style by a single instance for controlling diffusion models. In European Conference on Computer Vision. 110–126 (Springer, 2024).
Wang, Z. et al. Styleadapter: A unified stylized image generation model. arXiv:2309.01770 (2023).
Qi, T. et al. Deadiff: An efficient stylization diffusion model with disentangled representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8693–8702 (2024).
Frenkel, Y., Vinker, Y., Shamir, A. & Cohen-Or, D. Implicit style-content separation using b-lora. In European Conference on Computer Vision. 181–198 (Springer, 2024).
Mou, C. et al. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. Proc. AAAI Conf. Artif. Intell. 38, 4296–4304 (2024).
Wang, H. et al. Instantstyle: Free lunch towards style-preserving in text-to-image generation. arXiv:2404.02733 (2024).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
Li, J., Li, D., Savarese, S. & Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning. Vol. 202. ICML’23. 19730–19742 (JMLR.org, 2023).
Hu, Y., Zhuang, C. & Gao, P. Diffusest: Unleashing the capability of the diffusion model for style transfer. In Proceedings of the 6th ACM International Conference on Multimedia in Asia. 1–1 (2024).
Suvorov, R. et al. Resolution-robust large mask inpainting with fourier convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2149–2159 (2022).
Zhang, Y. et al. Domain enhanced arbitrary image style transfer via contrastive learning. In ACM SIGGRAPH 2022 Conference Proceedings, SIGGRAPH ’22. 1–8. https://doi.org/10.1145/3528233.3530736 (Association for Computing Machinery, 2022).
Gao, J. et al. Styleshot: A snapshot on any style. In IEEE Transactions on Pattern Analysis and Machine Intelligence (2025).
Wright, M. & Ommer, B. Artfid: Quantitative evaluation of neural style transfer. In DAGM German Conference on Pattern Recognition. 560–576 (Springer, 2022).


Funding

This study was funded by the Shandong Provincial Natural Science Foundation (ZR2020MH291) and forms part of that program.

Author information

Authors and Affiliations

Shandong University of Science and Technology, Qingdao, 266590, China
Kairui Wang, Xinying Liu, Di Zhao, Tian Xian & Xuelei Geng
Hunan University, Changsha, 410000, China
Yonghao Chang

Authors

Kairui Wang
Xinying Liu
Yonghao Chang
Di Zhao
Tian Xian
Xuelei Geng

Contributions

W. Data organization, methodology, software, writing – first draft; writing – review and editing. L. Conceptualization, research, analysis, funding acquisition. C. Writing – first draft, visualization. G. Research, mapping. X. Formalization, conceptualization. Z. Software, validation.

Corresponding author

Correspondence to Xinying Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

Wang, K., Liu, X., Chang, Y. et al. Semantic guidance for precise style control in diffusion image generation. Sci Rep 15, 45581 (2025). https://doi.org/10.1038/s41598-025-28715-x


Received: 28 July 2025
Accepted: 12 November 2025
Published: 28 November 2025
Version of record: 30 December 2025
DOI: https://doi.org/10.1038/s41598-025-28715-x
Springer Nature Limited


Introduction