CMFuse: a hierarchical feature fusion model combining convolution and Mamba for medical image classification

Chen, Xu; Yin, Xuesong; Huang, Qi; Shu, Ting; Ding, Jianhao; Wang, Yigang

doi:10.1007/s10586-025-05344-7

Published: 03 September 2025

Cluster Computing, Volume 28, article number 662 (2025)

Xu Chen^1,2, Xuesong Yin^1,2, Qi Huang^3, Ting Shu^4, Jianhao Ding^1,2 & Yigang Wang^1

Abstract

Medical image classification requires intelligent integration of both local details and global patterns. Most existing classification models are based on convolutional neural networks (CNNs), Transformers, or their hybrid variants. Although standard CNNs effectively capture fine-grained features such as edges and textures, they are less effective than Transformers at capturing global structures. Transformers, however, are computationally expensive and have limited spatial awareness. The emerging Mamba model offers efficient sequence processing, but its potential for medical imaging scenarios remains underutilized. To address these challenges, we propose CMFuse, a three-branch network that combines CNNs for local details, Mamba for global context, and an adaptive fusion block (CMF). The CMF block uses dynamic attention mechanisms to automatically balance local and global features while maintaining overall awareness of the lesion shape. Experiments on five medical datasets show that CMFuse achieves superior classification accuracy with lower computational complexity. Notably, it improves accuracy by 2.38% on PAD-UFES-20 and 1.89% on SMAD, demonstrating its robustness and potential in medical imaging.
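The idea of adaptively balancing a local (convolutional) feature map against a global (Mamba) feature map can be illustrated with a simple per-channel gate. The sketch below is not the CMF block described in the paper, whose internals are not given in this abstract; it is a minimal, hypothetical example in plain Python. The function names, the sigmoid gate, and the parameters `w_local`, `w_global`, and `bias` are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_means(feat):
    """Global average pooling: one mean per channel.

    `feat` is a list of channels; each channel is a flat list of
    spatial activations.
    """
    return [sum(channel) / len(channel) for channel in feat]


def gated_fusion(local_feat, global_feat, w_local, w_global, bias):
    """Blend two feature maps with a per-channel gate (hypothetical).

    The gate for channel c is computed from the pooled statistics of
    both branches. A gate near 1 favors the local branch and a gate
    near 0 favors the global branch.
    """
    if len(local_feat) != len(global_feat):
        raise ValueError("both branches must have the same channel count")
    mu_local = channel_means(local_feat)
    mu_global = channel_means(global_feat)
    fused, gates = [], []
    for c, (lc, gc) in enumerate(zip(local_feat, global_feat)):
        gate = sigmoid(w_local[c] * mu_local[c]
                       + w_global[c] * mu_global[c]
                       + bias[c])
        gates.append(gate)
        # Convex combination of the two branches, element by element.
        fused.append([gate * l + (1.0 - gate) * g for l, g in zip(lc, gc)])
    return fused, gates


# Two channels with two spatial positions each (toy values).
local_branch = [[1.0, 3.0], [0.0, 0.0]]
global_branch = [[2.0, 2.0], [4.0, 4.0]]
zeros = [0.0, 0.0]

fused, gates = gated_fusion(local_branch, global_branch, zeros, zeros, zeros)
print(gates)  # [0.5, 0.5]
print(fused)  # [[1.5, 2.5], [2.0, 2.0]]
```

With all gate parameters set to zero, the gate is exactly 0.5 and the output is the plain average of the two branches. In a trained model the gate parameters would be learned, so the balance could differ per channel and per input.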

Data availability

The public datasets analyzed in this study are available at the following websites: Kvasir Dataset: https://www.kaggle.com/datasets/yasserhessein/the-kvasir-dataset/data, COVID-19 Dataset: https://data.mendeley.com/datasets/dvntn9yhd2/1, PAD-UFES-20: https://data.mendeley.com/datasets/zr7vgbcyr2/1, ISIC2018: https://challenge.isic-archive.com/landing/2018/. The private dataset is available from the corresponding author on reasonable request.

Acknowledgements

This work was supported by Public-Welfare Technology Application Research of Zhejiang Province in China under Grant LGG22F020032, Wenzhou Basic Industrial Project in China under Grants G2023093 and 2024G0135, Startup Foundation of Hangzhou Dianzi University under Grant KYS285624344, Zhejiang Provincial Natural Science Foundation of China under Grant LY22F020019, and Key Research and Development Project of Zhejiang Province in China under Grant 2021C03137.

Author information

Authors and Affiliations

School of Media and Design, Hangzhou Dianzi University, Hangzhou, 310018, China
Xu Chen, Xuesong Yin, Jianhao Ding & Yigang Wang
Wenzhou Institute, Hangzhou Dianzi University, Wenzhou, 325038, China
Xu Chen, Xuesong Yin & Jianhao Ding
School of Biological & Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou, 310012, China
Qi Huang
School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou, 310018, China
Ting Shu

Contributions

XY conducted validation, methodology, investigation, and formal analysis. XC contributed to conceptualization, data curation, and writing, including both the original draft and review/editing. YW was responsible for validation, project administration, and funding acquisition. QH, TS, and JD provided methodology, resources, and supervision. All authors reviewed and approved the manuscript.

Corresponding author

Correspondence to Xuesong Yin.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, X., Yin, X., Huang, Q. et al. CMFuse: a hierarchical feature fusion model combining convolution and Mamba for medical image classification. Cluster Comput 28, 662 (2025). https://doi.org/10.1007/s10586-025-05344-7

Received: 02 January 2025
Revised: 27 March 2025
Accepted: 25 April 2025
Published: 03 September 2025
Version of record: 03 September 2025
DOI: https://doi.org/10.1007/s10586-025-05344-7
