CMFuse: a hierarchical feature fusion model combining convolution and Mamba for medical image classification

Cluster Computing

Abstract

Medical image classification requires integrating both local details and global patterns. Most existing classification models are based on convolutional neural networks (CNNs), Transformers, or their hybrid variants. Although standard CNNs effectively capture fine-grained features such as edges and textures, they are less effective than Transformers at capturing global structures. Transformers, however, are computationally expensive and have limited spatial awareness. The emerging Mamba model offers efficient sequence processing, but its potential for medical imaging remains underexplored. To address these challenges, we propose CMFuse, a three-branch network that combines CNNs for local details, Mamba for global context, and an adaptive fusion block (CMF). The CMF block uses dynamic attention mechanisms to automatically balance local and global features while maintaining overall awareness of the lesion shape. Experiments on five medical datasets show that CMFuse achieves superior classification accuracy with lower computational complexity. Notably, it improves accuracy by 2.38% on PAD-UFES-20 and 1.89% on SMAD, demonstrating its robustness and potential in medical imaging.
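To make the idea of adaptively balancing local and global features concrete, the following is a minimal illustrative sketch of a per-channel gated blend. It is not the CMF block described in the article, whose details are not given in the abstract. The function names (`gated_fusion`, `channel_mean`), the scalar gate weights `w_local` and `w_global`, and the sigmoid gating rule are all assumptions made for illustration only.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_mean(feature_map):
    """Global average pooling: one mean value per channel."""
    return [sum(channel) / len(channel) for channel in feature_map]


def gated_fusion(local_feats, global_feats, w_local, w_global):
    """Blend two feature maps with a per-channel gate (hypothetical scheme).

    local_feats, global_feats: lists of channels, each a flat list of floats,
        standing in for CNN-branch and Mamba-branch outputs.
    w_local, w_global: assumed per-channel scalar weights for the gate.
    Returns the fused feature map and the gate value used for each channel.
    """
    pooled_local = channel_mean(local_feats)
    pooled_global = channel_mean(global_feats)
    fused, gates = [], []
    for c in range(len(local_feats)):
        # Gate near 1 favours the local branch, near 0 the global branch.
        gate = sigmoid(w_local[c] * pooled_local[c] + w_global[c] * pooled_global[c])
        gates.append(gate)
        fused.append([
            gate * l + (1.0 - gate) * g
            for l, g in zip(local_feats[c], global_feats[c])
        ])
    return fused, gates


# Toy input: two channels with two spatial positions each.
local_branch = [[1.0, 3.0], [0.0, 0.0]]
global_branch = [[0.0, 0.0], [2.0, 4.0]]
fused, gates = gated_fusion(local_branch, global_branch, [0.0, 0.0], [0.0, 0.0])
print(gates)
print(fused)
```

With zero gate weights, every gate is exactly 0.5 and the result is a plain average of the two branches. In a trained model the weights would be learned, so the balance could shift per channel and per input.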

Data availability

The public datasets analyzed in this study are available at the following websites: Kvasir dataset: https://www.kaggle.com/datasets/yasserhessein/the-kvasir-dataset/data, COVID-19 dataset: https://data.mendeley.com/datasets/dvntn9yhd2/1, PAD-UFES-20: https://data.mendeley.com/datasets/zr7vgbcyr2/1, ISIC2018: https://challenge.isic-archive.com/landing/2018/. The private dataset is available from the corresponding author on reasonable request.

Acknowledgements

This work was supported by Public-Welfare Technology Application Research of Zhejiang Province in China under Grant LGG22F020032, Wenzhou Basic Industrial Project in China under Grants G2023093 and 2024G0135, Startup Foundation of Hangzhou Dianzi University under Grant KYS285624344, Zhejiang Provincial Natural Science Foundation of China under Grant LY22F020019, and Key Research and Development Project of Zhejiang Province in China under Grant 2021C03137.

Author information

Contributions

XY conducted validation, methodology, investigation, and formal analysis. XC contributed to conceptualization, data curation, and writing, including both the original draft and review/editing. YW was responsible for validation, project administration, and funding acquisition. QH, TS, and JD provided methodology, resources, and supervision. All authors reviewed and approved the manuscript.

Corresponding author

Correspondence to Xuesong Yin.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, X., Yin, X., Huang, Q. et al. CMFuse: a hierarchical feature fusion model combining convolution and Mamba for medical image classification. Cluster Comput 28, 662 (2025). https://doi.org/10.1007/s10586-025-05344-7

  • DOI: https://doi.org/10.1007/s10586-025-05344-7

Profiles

  1. Ting Shu
  2. Jianhao Ding