Abstract
Traditional semantic segmentation performs pixel-level classification over a fixed set of classes, so fine-tuning the segmentation model on new data causes catastrophic forgetting. Continual semantic segmentation has been introduced to address this challenge; however, replay methods based on generative adversarial networks (GANs) can guarantee neither the semantic accuracy of generated images nor distribution alignment between the original training data and the generated images. Motivated by the diffusion model, which inherently considers the entire data distribution, we propose a replay module named SDReplay with a dual-generator architecture that generates images of old classes with accurate semantics and an aligned distribution. The Structure-Preserved Generator (SPG) synthesizes high-fidelity images with precise semantic consistency by leveraging structural priors, while the Distribution-Aligned Generator (DAG) ensures distributional fidelity for old classes through token embedding optimization. Results on multiple datasets show that our approach improves the mean intersection-over-union (mIoU) by approximately 1.0%.
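For readers unfamiliar with the reported metric, the following is a minimal sketch of how mIoU is commonly computed from flattened per-pixel label lists. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration; the abstract does not specify the exact evaluation protocol used in the paper.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection-over-union across classes (illustrative sketch).

    `pred` and `target` are flattened per-pixel class indices in [0, num_classes).
    Classes that occur in neither sequence are skipped (an assumption here).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes  # pixels where both agree on class c
    pred_count = [0] * num_classes    # pixels predicted as class c
    target_count = [0] * num_classes  # pixels labeled as class c

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # ignore classes absent from both pred and target
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example: four pixels, two classes.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(f"mIoU = {score:.4f}")
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. Note that a 1.0% gain could be read as one percentage point or as a relative improvement; the abstract does not say which.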







Data Availability
No datasets were generated or analysed during the current study.
Acknowledgements
The authors would like to thank AJE (www.aje.com) for its language editing assistance during the preparation of this manuscript.
Funding
This work was supported in part by the Key R&D Program of Zhejiang Province (No. 2023C01039), the Natural Science Foundation of Zhejiang Province (No. LZ24F020001), the Opening Foundation of the Tongxiang Institute of General Artificial Intelligence (No. TAGI2-B-2024-0009), and the State Key Laboratory of Advanced Medical Materials and Devices.
Author information
Contributions
Jian Jiang: Formal analysis, Writing – original draft preparation. Yan Tian: Conceptualization, Methodology, Writing – review & editing. Yongchuan Xu: Software, Data curation, Writing – review & editing. Zhaocheng Xu: Writing – review & editing. Xun Wang: Writing – review & editing.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval
The research does not involve human participants or animals. Consent has been obtained for all data used.
Additional information
Communicated by Bing-kun Bao.
About this article
Cite this article
Jiang, J., Tian, Y., Xu, Y. et al. Sdreplay: diffusion model for continual semantic segmentation in traffic scenarios. Multimedia Systems 31, 463 (2025). https://doi.org/10.1007/s00530-025-02049-0
DOI: https://doi.org/10.1007/s00530-025-02049-0