Abstract
Medical image classification requires integrating both local details and global patterns. Most existing classification models are based on convolutional neural networks (CNNs), Transformers, or hybrids of the two. Although standard CNNs effectively capture fine-grained features such as edges and textures, they are less effective than Transformers at capturing global structures. Transformers, however, are computationally expensive and have limited spatial awareness. The emerging Mamba model offers efficient sequence processing, but its potential for medical imaging remains underexplored. To address these challenges, we propose CMFuse, a three-branch network that combines CNNs for local details, Mamba for global context, and an adaptive fusion block (CMF). The CMF block uses dynamic attention mechanisms to automatically balance local and global features while maintaining overall awareness of the lesion shape. Experiments on five medical datasets show that CMFuse achieves superior classification accuracy with lower computational complexity. Notably, it improves accuracy by 2.38% on PAD-UFES-20 and 1.89% on SMAD, demonstrating its robustness and potential in medical imaging.
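The abstract does not specify how the CMF block computes its attention weights, so the following is only a minimal sketch of one generic way to balance a local and a global feature vector with input-dependent weights. It is not the CMF block itself. The function `adaptive_fuse` and the scoring vectors `w_local` and `w_global` are hypothetical names introduced for illustration, and plain Python lists stand in for the feature maps a real CNN or Mamba branch would produce.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(local_feat, global_feat, w_local, w_global):
    """Fuse two equal-length feature vectors with input-dependent weights.

    w_local and w_global are hypothetical learned scoring vectors
    (an assumption, not taken from the paper). Each branch receives a
    scalar score (a dot product), and a softmax over the two scores
    gives the mixing weights.
    """
    if len(local_feat) != len(global_feat):
        raise ValueError("branches must produce features of equal length")
    score_local = sum(w * x for w, x in zip(w_local, local_feat))
    score_global = sum(w * x for w, x in zip(w_global, global_feat))
    a_local, a_global = softmax([score_local, score_global])
    # Weighted sum of the two branches, element by element.
    fused = [a_local * l + a_global * g
             for l, g in zip(local_feat, global_feat)]
    return fused, (a_local, a_global)


# Toy inputs standing in for the outputs of the local and global branches.
local_feat = [1.0, 0.0, 2.0]
global_feat = [0.0, 1.0, 2.0]
fused, weights = adaptive_fuse(
    local_feat, global_feat, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]
)
```

Because the weights come from a softmax, they always sum to one, so the fused vector stays on the same scale as its inputs. With the toy inputs above both branches score equally and are therefore mixed evenly.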







Data availability
The public datasets analyzed in this study are available at the following websites: Kvasir dataset: https://www.kaggle.com/datasets/yasserhessein/the-kvasir-dataset/data; COVID-19 dataset: https://data.mendeley.com/datasets/dvntn9yhd2/1; PAD-UFES-20: https://data.mendeley.com/datasets/zr7vgbcyr2/1; ISIC2018: https://challenge.isic-archive.com/landing/2018/. The private dataset is available from the corresponding author on reasonable request.
Acknowledgements
This work was supported by the Public-Welfare Technology Application Research of Zhejiang Province in China under Grant LGG22F020032, the Wenzhou Basic Industrial Project in China under Grants G2023093 and 2024G0135, the Startup Foundation of Hangzhou Dianzi University under Grant KYS285624344, the Zhejiang Provincial Natural Science Foundation of China under Grant LY22F020019, and the Key Research and Development Project of Zhejiang Province in China under Grant 2021C03137.
Author information
Contributions
XY conducted validation, methodology, investigation, and formal analysis. XC contributed to conceptualization, data curation, and writing, including both the original draft and review/editing. YW was responsible for validation, project administration, and funding acquisition. QH, TS, and JD provided methodology, resources, and supervision. All authors reviewed and approved the manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, X., Yin, X., Huang, Q. et al. CMFuse: a hierarchical feature fusion model combining convolution and Mamba for medical image classification. Cluster Comput 28, 662 (2025). https://doi.org/10.1007/s10586-025-05344-7
DOI: https://doi.org/10.1007/s10586-025-05344-7


