Introduction

Diffusion-based text-to-image (T2I) generative models1,2,3 have achieved breakthrough advancements in recent years, exhibiting unprecedented capabilities in generating photorealistic outputs that semantically align with textual inputs. These frameworks have also demonstrated efficacy in personalized customization tasks, particularly in the context of style transfer4,5,6. Despite these achievements, fine-grained style manipulation remains inherently challenging due to the semantic ambiguity of style attributes (e.g., color palette, line dynamics, artistic genres, brushstroke patterns, and emotional tones). While these features are visually perceptible, their precise linguistic expression remains elusive. With the introduction of the IP-Adapter7, which enables the injection of reference images as visual prompts, current style transfer methods have evolved into a dual-input approach: a textual description specifying the semantic content and a reference image controlling the visual style.

Fig. 1

Use cases of PSC-DG demo. Our method allows for the gradual increase of the specified content style expression based on the reference image, without disrupting the expression of the content generated by textual guidance.

Previous studies8,9,10 have predominantly focused on fine-tuning diffusion models using homogeneous style datasets, ensuring stylistic consistency between generated outputs and training distributions. Recent advancements have shifted emphasis toward training-free style image generation methodologies4,6,7. Adapter-free approaches4,6 leverage the self-attention mechanism inherent in diffusion architectures, extracting discriminative key-value features from reference style images through shared attention operations. By contrast, adapter-based methods7 employ lightweight modules to distill stylistic representations from reference images, which are then incorporated into the diffusion pipeline via cross-attention mechanisms. Despite significant progress in training-free style feature integration and content-style decoupling, dynamic style balancing and precise reference image manipulation remain under-investigated. To the best of our knowledge, this work constitutes the first systematic exploration of using a single reference image to modulate stylistic consistency across semantically distinct image regions. We present a novel framework, PSC-DG, which seamlessly integrates with pre-trained diffusion models without requiring additional training or fine-tuning. This architecture enables fine-grained control over specific stylistic dimensions within the reference image, as visualized in Fig. 1.

In the pre-trained text-to-image diffusion framework, text embeddings are combined with the model’s internal representations through cross-attention layers. The IP-Adapter enhances this architecture by inserting an auxiliary layer dedicated to processing visual features into each cross-attention module. This modification enables independent manipulation of text and image representations through dedicated cross-attention paths, thereby minimizing information degradation that may occur during direct feature connection. Based on this decoupled cross-attention mechanism, we extend the framework to integrate multiple image feature sets into the decoupling process, allowing the extraction of multiple styles as visual prompts in a single step and precise adjustment of their respective weights.

Inspired by the Inpaint Anything framework11, which effectively integrates the Segment Anything Model (SAM)12 with AIGC architectures (e.g., Stable Diffusion) to enable advanced capabilities including object removal, content infilling, and scene substitution, we adapt the SAM framework for style transfer tasks. This integration facilitates fine-grained modulation of feature inputs and weights corresponding to specific content styles in the reference image, while harmonizing stylistic consistency between foreground and background elements.

Furthermore, we introduce a weight allocation strategy for segmented content-style features. Through block-wise processing in the image encoder, dynamic weights are assigned to individual segments of the content-style image, enhancing the encoder’s saliency-aware information prioritization. Style embedding extraction is augmented by leveraging CLIP13, a state-of-the-art model renowned for its prowess in deriving semantically rich visual features from open-domain imagery. To this end, a pretrained CLIP image encoder is adopted as our feature extraction backbone.

Distinct from previous works, this study places emphasis on the controllability of reference images. The proposed PSC-DG framework enables fine-grained modulation of style reference intensity for specific stylistic attributes within the reference image (i.e., the parametric weighting of content-style components in the generative model), while maintaining high-fidelity output quality. The contributions of this work are threefold:

  • We devise an IP-Adapter-based mechanism that decouples multiple image feature sets from textual information, with experimental results demonstrating the controllability of each reference image’s style information.

  • We introduce the novel concept of “style coordination” and propose an innovative approach that synergizes SAM with image diffusion models.

  • We refine the encoder to mitigate content leakage, optimizing its capacity to extract style features from images with solid-color backgrounds.

Related works

Diffusion-based text-to-image generation

Diffusion models have emerged as a transformative paradigm in computational imaging, ushering in revolutionary breakthroughs in generative modeling. The Denoising Diffusion Probabilistic Model (DDPM)14 laid the groundwork by modeling image generation as a Markovian denoising process through iterative noise corruption and reconstruction. Subsequent advancements, typified by the Denoising Diffusion Implicit Model (DDIM)15, relaxed the strict Markovian constraints, enabling deterministic sampling with reduced computational overhead. Contemporary diffusion frameworks, augmented by large-scale pretraining, have established new benchmarks in text-to-image synthesis. These methodologies typically employ the U-Net architecture16 as the core diffusion backbone, augmented with cross-attention mechanisms to integrate textual embeddings from pretrained language models. A notable milestone was the introduction of the Latent Diffusion Model (LDM)2, commercialized as Stable Diffusion (SD), which revolutionized the field by compressing data via a pretrained autoencoder, thereby transferring the generative process to a latent space with reduced dimensionality. Modern text-to-image diffusion systems17,18,19 have become indispensable tools for visual content creation, demonstrating unprecedented capabilities in generating semantically aligned, high-fidelity images. The latest iteration, SDXL17, represents a significant leap forward, achieving superior synthesis quality and efficiency through architectural scaling, refined text-image alignment, and the introduction of a dedicated post-processing refinement stage. Powerful diffusion models provide extensive stylistic prior knowledge for style transfer, enabling the capture and reproduction of texture details from different artistic genres or personalized styles while preserving the main features of the content, thereby enhancing the creative expressiveness of style transfer.

Stylized image generation

Stylized image generation has emerged as a dynamic research frontier at the intersection of computer vision and computational graphics, aiming to synthesize images infused with distinct artistic or visual styles through innovative methodologies. Early customization approaches8,10 focused on optimizing subsets or full diffusion model parameters to encapsulate stylistic attributes from reference images. However, these methods suffer from severe overfitting, compromising text-prompt fidelity and requiring extensive fine-tuning per reference image, often spanning hours. In contrast, text-inversion techniques20,21 project style images into learnable textual token embeddings, though this cross-modal mapping may introduce information degradation. The Diffusion Cocktail framework22 explores compositional strategies by exchanging content-style information across models to enhance generative diversity, albeit with limited efficiency and controllability. Jin et al.23 propose a Frequency-aware Cross-Modal Attention Network (FCMNet), which constructs a dual-stream encoder-decoder by designing a Frequency-aware Cross-Modal Attention (FACMA) module, a Spatial Frequency Channel Attention (SFCA) module, and a Weighted Cross-Modal Fusion (WCMF) module. Jin et al.24 propose a Tri-Party Progressive Integration Network (TriPINet), which extracts three types of features (RGB, frequency, and noise) and uses a gCMDA module to fuse cross-modal features and a PI-SE module to progressively integrate multi-scale features. Xie et al.25 propose a novel architecture named the High-Order Graph Convolutional Transformer (HOGFormer). This architecture comprises three core modules: Chebyshev Graph Convolution (CGConv), a Graph-based Dynamic Adjacency Matrix Transformer (GDAMFormer), and High-Order Graph Convolution (HOGConv). It is designed to effectively capture both global and local information. FreeStyle26 achieves text-guided style transfer using pretrained diffusion models via a dual-stream encoder and single-stream decoder architecture augmented with a feature modulation module, eliminating the need for optimization.

Recent research has shifted toward tuning-free paradigms, leveraging stylized image generation adapters to distill visual features and integrate them into diffusion’s cross-attention mechanisms4,5,6,7,27,28,29,30,31,32. For instance, StyleAlign4 and swap self-attention6 manipulate the denoising process by aligning self-attention keys and values with reference blocks. T2I-Adapter31 and IP-Adapter utilize Transformer-based architectures33 as image encoders, processing CLIP-derived embeddings through U-Net cross-attention layers. DEADiff29 employs Q-Formers34 filters trained on paired data to extract decoupled features, selectively injecting them into cross-attention layers. DiffuseST35 synergizes textual and spatial features via iterative denoising, achieving high-quality style transfer by disentangling content and style injections in the target branch. InstantStyle32 preserves style fidelity during text-to-image synthesis by segregating style and content in feature space and embedding reference features into style-specific blocks. The above work has achieved good results in single-style prompt generation and one-shot stylized image generation. However, few studies have investigated the balance between style and content in images under multi-style prompts. To address this, we conducted the first study on this topic.

Fig. 2

Overall framework diagram. The background style image segmented by SAM is passed to the encoder after being filled with specified content, while the content style image undergoes weight processing after being input into the encoder. The resulting image features are injected solely into the style blocks through a multi-decoupled cross-attention mechanism. The final generated image is influenced by the styles of the segmented background style image and the content style image, and is controllable.

Method

Diffusion models are probabilistic generative models, where the generation process consists of a forward process and a reverse process. The forward process is a Markov chain, where each step injects a small amount of Gaussian noise into the latent variables. Formally, for a series of time steps \(\textrm{t}= 1, \ldots , T,\) the process is represented as:

$$\begin{aligned} z_t=\sqrt{\alpha _t}z_{t-1}+\sqrt{1-\alpha _t}\epsilon _t\end{aligned}$$
(1)

where \(z_t\) is the latent variable at time step t, \(\alpha _{\textrm{t}}\) represents the variance schedule, and \(\epsilon _{\textrm{t} }\thicksim \mathcal {N} ( 0, \mathcal {I} )\). The reverse process aims to reconstruct the latent representation from the noise. It is defined as:

$$\begin{aligned} \hat{z}_{t-1}=\frac{1}{\sqrt{\alpha _t}}\left( \hat{z}_t-\frac{1-\alpha _t}{\sqrt{1-\overline{\alpha }_t}}\hat{\epsilon }(\hat{z}_t,t;\theta )\right) \end{aligned}$$
(2)

where \(\hat{z}_{t-1}\) is the reconstructed latent variable at time step \(t-1\), \(\overline{\alpha }_t\) is the cumulative product of \(\alpha _t\) up to time step t, and \(\hat{\epsilon }(\hat{z}_t,t;\theta )\) is the noise predicted by the model parameterized by \(\theta\).
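As an illustration of Eqs. (1) and (2), the following minimal, dependency-free sketch applies one forward noising step and one reverse step to a small latent vector. It is not the authors' implementation: the function names are hypothetical, the learned noise predictor is replaced by the true noise (an "oracle"), and the stochastic term used by full samplers is omitted, as it is in Eq. (2). Under these assumptions, and at the first time step where the cumulative product equals \(\alpha _1\), the reverse step recovers the original latent.

```python
import math
import random


def forward_step(z_prev, alpha_t, eps):
    """Eq. (1): z_t = sqrt(alpha_t) * z_{t-1} + sqrt(1 - alpha_t) * eps_t."""
    a = math.sqrt(alpha_t)
    b = math.sqrt(1.0 - alpha_t)
    return [a * z + b * e for z, e in zip(z_prev, eps)]


def reverse_step(z_t, alpha_t, alpha_bar_t, eps_hat):
    """Eq. (2): estimate z_{t-1} from z_t and the predicted noise eps_hat."""
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    return [(z - coef * e) / math.sqrt(alpha_t) for z, e in zip(z_t, eps_hat)]


# Hypothetical values chosen only for illustration.
alpha_1 = 0.9
z0 = [0.5, -1.0, 2.0]
rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in z0]

z1 = forward_step(z0, alpha_1, eps)
# At t = 1 the cumulative product alpha_bar_1 equals alpha_1.
z0_hat = reverse_step(z1, alpha_1, alpha_1, eps)
```

In practice the noise passed to the reverse step comes from the U-Net noise predictor rather than from the forward process, so the reconstruction is only approximate.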

Stable Diffusion, a pretrained text-conditioned latent diffusion model (LDM), has gained widespread recognition for its capacity to synthesize high-fidelity images from textual prompts. Built upon the U-Net architecture, this network functions as a noise predictor, its core being an encoder-decoder architecture featuring skip connections that pass information between the down-sampling and up-sampling paths to preserve fine image details. Each layer incorporates a residual block, a self-attention block, and a cross-attention block. Specifically, the residual blocks facilitate the construction of deeper networks and mitigate the vanishing gradient problem, enabling more stable training. The self-attention blocks allow different image regions (or patches) of the model to communicate with one another, capturing global dependencies within the image and ensuring the internal consistency and coherence of the generated content. The cross-attention blocks are the key mechanism for achieving text conditioning; they use prompt embeddings from a text encoder (like CLIP) as the Key and Value, and the intermediate image features from the U-Net as the Query. This injects textual semantic information into the image generation process, guiding the model to denoise according to the text description. Our proposed methodology capitalizes on the Stable Diffusion framework, which adeptly disentangles content and style conditioning throughout the image synthesis pipeline, yielding visually striking and coherent outputs.

To enhance this framework, we incorporate two innovative modules, as depicted in Fig. 2. The first, a multi-decoupled cross-attention module, facilitates the injection of features derived from multiple style images into the diffusion process through cross-attention mechanisms; a detailed discussion of this component is provided in “Multi-decoupled cross-attention”. The second, a style segmentation module, partitions the reference image into content style and background style, allowing for targeted block selection during feature integration, as expounded in “Semantic-guided style feature extraction”. In “Image feature processing”, we delineate the differentiated processing strategies applied to content style and background style following content extraction from the reference image.

Multi-decoupled cross-attention

Upon extracting the style embeddings, two strategies can be employed to integrate the style conditions with the textual conditions: (1) Appending to Text: In this approach, style embeddings are concatenated with text embeddings, and the resulting composite interacts with the backbone features via the conventional text-based cross-attention mechanism. While many image-prompting techniques rely on straightforward concatenation to incorporate image feature information, this method proves suboptimal, as it does not fully leverage the rich information encapsulated within the image features. The cross-attention mechanism in traditional Latent Diffusion Models (LDMs) can be mathematically formulated as follows:

$$\begin{aligned} {\bf Z}^{\prime }=\mathrm {Attention~}({\bf Q},{\bf K},{\bf V})=\mathrm {~Softmax~}\left( \frac{{\bf Q}{\bf K}^{\top }}{\sqrt{d}}\right) {\bf V}\end{aligned}$$
(3)

where \({\bf Q}={\bf ZW}_{q},{\bf K}=c_{t}{\bf W}_{k},{\bf V}=c_{t}{\bf W}_{v}\), \(c_{t}\) represents the text features, and \({\bf Z}\) denotes the hidden state associated with the image.

(2) Decoupled Cross-Attention: The IP-Adapter adds a new cross-attention module for the style embeddings, which then integrates features based on text conditions and features based on style conditions. After decoupling the attention in the IP-Adapter, the cross-attention between the text and \({\bf Q}\) is calculated separately from the cross-attention between the reference image and \({\bf Q}\). The two attention outputs are then summed, with the image branch scaled by a weight \(\lambda\), calculated as follows:

$$\begin{aligned} {\bf Z}^{\textrm{new}}=\textrm{Softmax}\left( \frac{{\bf Q}{\bf K}^{\textrm{T}}}{\sqrt{d}}\right) {\bf V}+\lambda \times \textrm{Softmax}\left( \frac{{\bf Q}({\bf K}^{\prime })^{\textrm{T}}}{\sqrt{d}}\right) {\bf V}^{\prime }\end{aligned}$$
(4)

where \({\bf Q}={\bf Z}{\bf W}_q,{\bf K}=c_t{\bf W}_k,{\bf V}=c_t{\bf W}_v,{\bf K}^{\prime }=c_i{\bf W}_k^{\prime },{\bf V}^{\prime }=c_i{\bf W}_v^{\prime }\), and \(c_i\) is the hidden state obtained by encoding the IP-Adapter's reference image with CLIP and processing the result. Building on this approach, we design two types of adapters that can integrate multiple images simultaneously. (1) As a point of comparison, we use the common concatenation method, in which features from multiple images are fused and then injected in the same way as in the IP-Adapter. In this case, the only change we make is \({\bf K}^{\prime }=c_{i}^{\prime }{\bf W}_{k}^{\prime },{\bf V}^{\prime }=c_{i}^{\prime }{\bf W}_{v}^{\prime }\), where \(c_{i}^{\prime }\) represents the concatenated feature information from multiple images. (2) We apply the decoupled cross-attention of the IP-Adapter directly to the feature information from multiple images. This involves separating the cross-attention layers between the text features and the features of each individual image, with the calculation represented as follows:

$$\begin{aligned} \begin{aligned} {\bf Z}^{\textrm{new}}=&\textrm{Softmax}\left( \frac{{\bf Q}{\bf K}^{\textrm{T}}}{\sqrt{d}}\right) {\bf V}+\lambda _1\times \textrm{Softmax}\left( \frac{{\bf Q}(\mathbf {K_1}^{\mathrm {^{\prime }}})^{\textrm{T}}}{\sqrt{d}}\right) \\&\mathbf {V_1}^{\mathrm {^{\prime }}}+\ldots +\lambda _n\times \textrm{Softmax}\left( \frac{{\bf Q}(\mathbf {K_n}^{\mathrm {^{\prime }}})^{\textrm{T}}}{\sqrt{d}}\right) \mathbf {V_n}^{\mathrm {^{\prime }}}\ \end{aligned} \end{aligned}$$
(5)

where \({\bf Q}={\bf Z}{\bf W}_q,{\bf K}=c_t{\bf W}_k,{\bf V}=c_t{\bf W}_v,{\bf K}_j^{\prime }=c_{ij}{\bf W}_k^{\prime },{\bf V}_j^{\prime }=c_{ij}{\bf W}_v^{\prime }\) for \(j=1,\ldots ,n\). Here, \(c_{ij}\) represents the features of the j-th reference image after being encoded by CLIP, and the weight \(\lambda _j\) allows us to easily control the contribution of the features of each reference image. We ultimately chose the latter approach. While the former can also effectively extract features from multiple style images and decouple them from the text prompts, the decoupling between images is suboptimal, making it difficult to harmonize the balance between different styles.
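To make Eq. (5) concrete, the following dependency-free sketch computes the text branch and one attention branch per reference image, then sums them with per-image weights. It is only an illustration under stated assumptions: the inputs are taken to be already projected (the matrices \({\bf W}_q\), \({\bf W}_k\), \({\bf W}_v\) are not modeled), matrices are plain nested lists, and all function and variable names are hypothetical rather than taken from the authors' code.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Eq. (3): Softmax(Q K^T / sqrt(d)) V for row-major nested lists."""
    d = len(q[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d)
                  for k_row in k]
        probs = softmax(scores)
        out.append([sum(p * v_row[c] for p, v_row in zip(probs, v))
                    for c in range(len(v[0]))])
    return out


def multi_decoupled_attention(q, text_kv, image_kvs, lambdas):
    """Eq. (5): text branch plus one weighted branch per reference image."""
    out = attention(q, text_kv[0], text_kv[1])
    for (k_j, v_j), lam in zip(image_kvs, lambdas):
        branch = attention(q, k_j, v_j)
        out = [[o + lam * b for o, b in zip(o_row, b_row)]
               for o_row, b_row in zip(out, branch)]
    return out


# Toy inputs (hypothetical): 2 query tokens, feature dimension 2.
q = [[1.0, 0.0], [0.0, 1.0]]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
content_kv = ([[0.0, 0.0], [0.0, 0.0]], [[2.0, 0.0], [4.0, 2.0]])
background_kv = ([[1.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])

z_new = multi_decoupled_attention(
    q, text_kv, [content_kv, background_kv], lambdas=[0.8, 0.3]
)
```

Because each reference image has its own branch and its own weight, setting one weight to zero removes that image's influence without touching the text branch or the other images, which is the controllability argument made above for preferring this design over concatenation.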

Semantic-guided style feature extraction

In the work of Inpaint Anything, the automatic segmentation is triggered via the Segment Anything Model (SAM) to achieve content separation in images. Subsequent inpainting of resultant voids is performed using models such as LaMa36, ensuring visual coherence across the image. After object removal, text prompts are processed by generative models like Stable Diffusion to synthesize contextually appropriate content for vacated regions. Inspired by this approach, our method starts with a reference image to separate the specified content style from the background style. SAM, as a segmentation architecture optimized for prompt-driven tasks, demonstrates exceptional responsiveness to localized prompts specified by user input (such as coordinates and bounding boxes). This capability allows for precise description of content-style and background-style regions, seamlessly aligning with our conceptual framework.
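The separation step can be pictured with a small, dependency-free sketch. It assumes a binary mask for the specified content is already available (in the pipeline described here it would come from SAM, which is not called in this example) and shows how one reference image yields a content-style image and a background-style image, with the removed region left as black pixels. The function name and the nested-list image format are hypothetical.

```python
def split_by_mask(image, mask, fill=(0, 0, 0)):
    """Split an RGB image into content-style and background-style images.

    image: 2D list of (r, g, b) tuples.
    mask:  2D list of 0/1 values, 1 marking pixels of the specified content.
    Pixels outside each region are replaced by `fill` (black by default).
    """
    content = [[px if m else fill for px, m in zip(img_row, mask_row)]
               for img_row, mask_row in zip(image, mask)]
    background = [[fill if m else px for px, m in zip(img_row, mask_row)]
                  for img_row, mask_row in zip(image, mask)]
    return content, background


# Toy 2x2 reference image and mask (hypothetical values).
image = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 0)]]
mask = [[1, 0],
        [0, 1]]

content_style, background_style = split_by_mask(image, mask)
```

The black regions left in both outputs are what the later processing steps have to compensate for before the images are passed to the image encoder.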

However, the separated images cannot be directly injected into the diffusion model through multi-decoupled cross-attention, because we found that the diffusion model encodes hierarchical semantic information in its layers, with specific attention modules exhibiting a preference for encoding style-related features. Specifically, an attention module in the first upsampling block, the layer up_blocks.0.attentions.1, has been identified as particularly effective in capturing stylistic attributes such as color palettes, textures, and ambiance. By exploiting these layers, we implicitly extract style information while minimizing content leakage, preserving stylistic fidelity. Once these style-specific blocks are identified, features derived from the reference image are selectively injected into them, enabling seamless style transfer. This methodology addresses content leakage concerns in post-segmentation reference images, ensuring a controlled and refined stylization process.
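Selective injection can be sketched as a per-layer scale table: the image branch is active only in layers belonging to the style block and is switched off everywhere else. The sketch below is an assumption-laden illustration, not the authors' code. It assumes the style layer named in the text corresponds to the identifier up_blocks.0.attentions.1 (written with an underscore), and the example layer names are made up for the demonstration rather than read from a real model.

```python
STYLE_BLOCK = "up_blocks.0.attentions.1"  # assumed identifier of the style block


def injection_scales(layer_names, style_scale=1.0, style_block=STYLE_BLOCK):
    """Return the image-branch scale for every cross-attention layer.

    Layers inside the style block receive `style_scale`; all other layers
    receive 0.0, so reference-image features do not reach them.
    """
    return {name: (style_scale if style_block in name else 0.0)
            for name in layer_names}


# Illustrative layer names only; a real model exposes its own list.
layers = [
    "down_blocks.2.attentions.1.transformer_blocks.0.attn2",
    "mid_block.attentions.0.transformer_blocks.0.attn2",
    "up_blocks.0.attentions.0.transformer_blocks.0.attn2",
    "up_blocks.0.attentions.1.transformer_blocks.0.attn2",
]

scales = injection_scales(layers, style_scale=0.8)
```

With multiple reference images, one such table per image (each with its own scale) would keep the per-image weights independent, in line with the multi-decoupled design described earlier.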

Image feature processing

Fig. 3

Set weights for each patch of the image.

Raw extracted background and content style features are not directly applicable for downstream tasks. During CLIP encoding, missing content in background regions (such as black pixels) receives disproportionate weight from the attention mechanism, a phenomenon we aim to mitigate. We found that moderate cropping (e.g., reducing by 40%) can preserve stylistic representations comparable to the original images. Based on this insight, we propose a masked sampling strategy that extracts background styles from unaffected regions of equivalent spatial dimensions for filling. This approach prevents the attention mechanism from over-focusing on blank regions while maintaining overall style consistency. In the CLIP image encoding pipeline, the Vision Transformer (ViT)9 partitions images into non-overlapping patches, processes them independently, and aggregates the results. After content-style separation, background regions appear as black pixels. During encoding and aggregation, these black areas introduce artifacts due to consecutive black patches in initial segments, with the severity of distortion increasing as the black regions expand, as shown in Fig. 8. This issue is not limited to black; similar artifacts occur with other solid colors. To address this problem, we propose an adaptive patch weighting scheme that leverages the characteristics of the ViT architecture (Fig. 3). This strategy reduces the impact of black pixel information, minimizes artifact generation, and enables robust content-style extraction (Fig. 9). The detailed method is as follows:

$$\begin{aligned} W=W+(1-W)\times K\end{aligned}$$
(6)

In this context, W represents the proportion of non-black content pixels within each patch relative to the entire patch, and K is a balancing parameter that determines the weight of a patch containing no content pixels (W = 0). If a patch lacks content pixels, we cannot simply set its weight to zero (K = 0), as a sudden “loss of information” during CLIP encoding would severely affect the overall encoding results, especially when this situation occurs in the initial patches. By employing this approach, we amplify the weight of the desired content style while simultaneously reducing the weight of ineffective backgrounds, thereby enabling CLIP to pay greater attention to the style of the image.
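The weighting rule in Eq. (6) can be sketched as follows. The example computes, for each non-overlapping patch, the fraction W of content pixels from a binary mask and then applies the update W + (1 - W) × K. It is a minimal, dependency-free illustration: the function name, the mask format, and the toy patch size are assumptions, and the way the resulting weights are applied inside the CLIP encoder is not specified here and is therefore not shown.

```python
def patch_weights(mask, patch, k):
    """Per-patch weights following Eq. (6).

    mask:  2D list of 0/1 values, 1 marking a non-black content pixel.
    patch: side length of the square, non-overlapping patches.
    k:     balancing parameter; a patch with no content pixels gets weight k.
    """
    height, width = len(mask), len(mask[0])
    weights = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            pixels = [mask[y][x]
                      for y in range(top, min(top + patch, height))
                      for x in range(left, min(left + patch, width))]
            w = sum(pixels) / len(pixels)      # content fraction W
            row.append(w + (1.0 - w) * k)      # Eq. (6)
        weights.append(row)
    return weights


# Toy 4x4 mask split into 2x2 patches (hypothetical values).
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
]

weights = patch_weights(mask, patch=2, k=0.3)
```

A patch made entirely of content keeps weight 1, a patch made entirely of background receives K rather than 0, and mixed patches fall in between, which avoids the abrupt loss of information described above.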

Experiments

Within the experimental framework, comparative evaluations demonstrate that both fine-tuned and pretrained variants of the IP-Adapter yield highly consistent stylization results. To improve computational efficiency for subsequent comparative analyses, the pretrained IP-Adapter is adopted. Regarding the quality and complexity of generated images, SDXL exhibits superior performance, justifying its selection as the backbone diffusion model for this study.

Qualitative results

Text-Guided Image Stylization. To evaluate the robustness and generalization capability of the PSC-DG framework, we conducted a comprehensive suite of style transfer experiments, encompassing the application of diverse artistic styles across heterogeneous content categories while enabling fine-grained weight modulation for distinct content styles within reference images. Representative results of these experiments are visualized in Fig. 4. Through targeted style injection into designated attention blocks, style leakage is effectively mitigated.

Spatially Constrained Image Stylization via ControlNet Integration. We augment our framework by incorporating the ControlNet architecture to enable spatially localized style transfer, with representative results depicted in Figure 5. Through the integration of a single style reference image and diverse textual prompts, adaptive modulation of stylistic weights for semantically coherent content regions is realized. This approach exhibits broad applicability across heterogeneous stylization scenarios while ensuring full compatibility with the ControlNet framework.

Fig. 4

Qualitative results. Under the condition of providing a single style reference image along with different prompts, adjusting the style weights of different contents in the reference image achieved the desired style-controlled outcomes while also ensuring high consistency in style generation.

Fig. 5

Visualization results based on content image stylization.

Comparison to previous methods

For the baseline, we compare our method with the latest state-of-the-art stylization methods, including InstantStyle32, Swapping Self-Attention6, CAST37, StyleShot38, and the original IP-Adapter with weight adjustment7, as shown in Fig. 6. The image generation prompts for each model are the same. In the case of CAST, which does not use text prompts, we employ image prompts. In this analysis, we place greater emphasis on image quality and the quality of style transfer.

Fig. 6

A qualitative analysis is performed in comparison to state-of-the-art methods.

Qualitative evaluation. Figure 6 presents comparative results with state-of-the-art methods. First, content-driven methods such as CAST avoid diffusion models and rely on simple color transfer, failing to capture complex style attributes from reference images. This limitation leads to visible artifacts in generated outputs. Second, Swapping Self-Attention significantly alters content during generation, disrupting style-content balance. While capable of partial style capture, it often produces unsatisfactory content fidelity. The IP-Adapter struggles with style-content decoupling, frequently prioritizing one at the expense of the other: it either mismatches content during style adoption or sacrifices stylistic impact for content fidelity. StyleShot-generated images are marred by compositional inconsistencies, including abrupt stylistic shifts, visual transition discontinuities, and ambiguous subject-background relationships. In contrast, InstantStyle achieves commendable style-content harmony. Our proposed method builds upon InstantStyle’s style injection framework by introducing dynamic weight modulation, enabling finer-grained control over reference image stylistic attributes. Notably, our approach demonstrates superior performance in preserving content fidelity while enhancing stylistic details and overall image quality compared to InstantStyle.

Table 1 A quantitative analysis is performed in comparison to state-of-the-art methods.

Quantitative comparison. In Table 1, we compare the performance of our proposed method with four other methods. Five objective metrics are employed: ArtFID and FID are used to quantify the distribution distance between generated images and real images (lower values indicate better generation quality); SS and LPIPS are utilized to measure the perceptual similarity between images; CSD is applied to calculate the style similarity between images (higher values indicate more effective style transfer). Additionally, one subjective metric, “Preference,” is obtained through a subjective preference survey involving 100 participants. The experimental results demonstrate that our method exhibits advantages across most evaluation metrics: its ArtFID (34.2) and FID (24.6) are the lowest values, and its CSD score (0.41) is the highest, which supports the objective quality of the generated images. Moreover, the Preference metric reaches 0.31, far exceeding that of the comparison methods, indicating a higher level of subjective acceptance among users. In summary, our method outperforms the comparison methods on most objective metrics and in subjective preference, supporting the effectiveness of the proposed approach.

Ablation study

Fig. 7

Visualization of attention maps for CLIP image feature extraction.

Fig. 8

Comparison with InstantStyle after applying weights.

Fig. 9

Gradually reduce the solid color background of the reference image.

Following CLIP-based feature extraction, an adaptive weighting scheme was applied to the extracted features. Compared to the baseline scenario without weight modulation, this approach attenuated attention to black background regions across multiple spatial locations while enhancing stylistic attention to subject content areas, as visualized in Fig. 7.

Figure 8 compares our method with InstantStyle on style extraction from reference images with a solid-color background. Empirical analysis reveals that background weight attenuation leads to significantly enhanced attention to content style features, resulting in generated images with closer stylistic alignment to the reference content. In contrast, InstantStyle exhibits sensitivity to dark background pixels, leading to systematic darkening of generated outputs.

Figure 8 demonstrates the effects of reducing the proportion of monochromatic backgrounds while maintaining consistent content styles. Empirical analysis reveals that high background proportions are associated with the emergence of irregular artifacts in style-transferred images, often accompanied by compromised content fidelity. As the monochromatic background ratio decreases, these artifacts gradually resolve into well-defined structures, coinciding with improvements in principal content quality. At minimal background proportions, both irregular artifact formation and content degradation are effectively mitigated.

Fig. 10

Different effects under K.

Quantitative analysis of balancing parameter K across multiple image metrics identified an empirically determined optimal range of 0.2–0.5, as visualized in Fig. 10. Figure 11 demonstrates that increasing background weight K leads to progressive darkening of generated images, accompanied by amplified influence of background pixels. Conversely, reducing K shifts the model’s focus toward thematic content regions, enhancing stylistic alignment with target subjects. However, excessively low values (e.g., K=0) induce abrupt feature discontinuities, resulting in suboptimal image quality due to insufficient information propagation. Based on these findings, we recommend setting K within the range of 0.2–0.5 to balance stylistic consistency and content fidelity.

Fig. 11

Visualization results of generated images under different values of K.

Conclusion

This paper introduces PSC-DG, a novel framework addressing the challenge of fine-grained stylistic control in text-to-image synthesis. By synergistically integrating the semantic segmentation capabilities of the Segment Anything Model (SAM) with the decoupled cross-attention architecture of IP-Adapter7, PSC-DG enables accurate extraction and integration of multiple reference image styles, significantly enhancing output stylistic diversity. Methodologically, a multi-image decoupling mechanism is proposed to disentangle textual and multi-image feature representations, enabling granular control over per-reference image style weights. Additionally, style-specific attention layers are strategically designated for style feature infusion, mitigating content leakage while preserving stylistic expressiveness. At the preprocessing stage, distinct optimization strategies for background and content styles enhance the encoder’s capacity to distill precise style representations.

Extensive experiments validate the efficacy of PSC-DG. Style transfer results demonstrate superior performance across heterogeneous content-style combinations, enabling nuanced style weight adjustments and producing high-fidelity, coherent images. Comparative evaluations against state-of-the-art methods confirm PSC-DG’s consistent superiority in both qualitative and quantitative metrics.

Notwithstanding these contributions, the method’s reliance on reference images introduces limitations. Output quality may degrade when reference images inadequately capture target styles or represent highly niche stylistic categories. Future research should focus on refining style extraction and fusion mechanisms to address these challenges.