1 Introduction

Chronic wounds are a growing burden on health care systems globally. The incidence of chronic wounds is substantial and is estimated to continue on an upward trend [1]. Diabetic foot ulcers (DFU) and arterial leg ulcers (ALU) are costly and debilitating complications of diabetes [2], with recent research suggesting an association between DFU episodes and all-cause resource utilisation and mortality [3]. Pressure ulcers (PRU) and venous leg ulcers (VLU) are the most common types of complex skin ulcers [4], with ulcer prevalence in the diabetic population estimated to be at least 13% in North America [5].

DFU has a global prevalence of approximately 6.3% among people with diabetes [6], with VLU estimated to have a prevalence of around 1.08% [7]. Global PRU prevalence is estimated to range between 5.2 and 12.3% [8]. However, the true figures are likely to be higher as cases are often underreported, especially in lower-income countries where epidemiology data can be scarce and reporting may be inconsistent [9,10,11,12].

Occurrence of chronic wounds is often linked with comorbidities including vascular deficits, diabetes, chronic kidney disease, and hypertension [13]. Diabetic peripheral neuropathy is present in the majority of DFU cases and is the main cause of DFU [14]. This condition results in nerve damage in the foot leading to a loss of sensation [15]. Patients suffering from this condition may go through prolonged periods of damage to their foot without realising it. Wound condition can worsen, leading to other serious complications. More than 50% of all DFU cases experience infection [16], which is one of the main causes of hospitalisation for diabetic patients [14]. Diabetic leg and foot ulcers are some of the most expensive chronic wound types to treat in the USA [13]. Up to 70% of VLU recur within 3 months of wound closure [17].

Patients diagnosed with DFU are up to three times more likely to die than patients without the disease, and are at risk of numerous comorbidities, including cardiovascular disease, nephropathy, neuropathy, peripheral arterial disease, and diabetic retinopathy. DFU and VLU often lead to significantly impaired quality of life [17,18,19]. Occurrence of ulcers is linked to increased risk of both amputation and mortality, particularly when associated with advanced age, anaemia, and peripheral arterial disease [17, 20, 21]. Chronic wounds place a significant emotional and physical burden on patients [22, 23], with depression associated with increased risks at initial and subsequent wound occurrence [24, 25].

Chronic wound management represents a major healthcare system cost and a significant time burden for clinicians and patients. This is particularly true for chronic wounds that are not diagnosed at an early stage and require more intensive treatment methods. Such situations may occur as a result of infection, with worst-case scenarios potentially leading to amputation [20]. These cases can result in frequent clinic and hospital visits for expert assessment [26, 27]. Even after chronic wounds have healed, recurrence rates are high, with minor or major amputation of lower extremities being common [28, 29]. The post-COVID-19 world poses further challenges and risks in the treatment of chronic wounds, especially for diabetic patients who are at higher risk [30,31,32].

New technologies to address growing clinical needs are becoming ever more prevalent in numerous medical fields [33, 34]. To address the issues associated with chronic wound prevalence, there has been an increase in research interest for fully automated non-contact remote detection and monitoring of chronic wounds [35,36,37]. Enhancing telemedicine systems to include automated monitoring of wounds can help to reduce risks to vulnerable patients and to ease pressures on overburdened healthcare systems [38]. Furthermore, the growing popularity of low-cost consumer mobile phones allows for these technologies to be distributed in poorer countries and rural areas where patients may have restricted access to healthcare settings.

Non-invasive, easy-to-use devices capable of automated detection and monitoring could help to promote patient engagement in monitoring chronic wounds [35], which may help to reduce clinic and hospital visits. Recent scientific evidence shows that convolutional neural networks (CNNs) can equal, and in some cases surpass, the performance of experienced dermatologists for detection and classification in medical domains [39,40,41,42,43,44,45]. Wound area changes over time have been shown to provide robust prediction of healing status [46]. Chronic wound segmentation allows for potential assessment of wound development and therefore healing status over time, providing superior accuracy when compared to object localisation techniques, which give a more general indicator of wound development [47].

Subjectivity in medical imaging domains is a challenging aspect of deep learning. Ground truth labelling of wound photographs requires clinical experts to manually delineate the wound regions before segmentation models can be trained, validated, and tested. For such procedures, there are currently no formal standards. Ramachandram et al. [48] found low inter-rater agreement for tissue types in wound images in a recent wound segmentation study. Their results showed Krippendorff alpha values as low as 0.014 for epithelial tissue. This issue is common in deep learning tasks throughout medical imaging domains, including MRI image quality assessment [49], dermoscopic skin lesion evaluation [50,51,52], and ultrasound diagnosis [53]. Accurate automated delineation of wound regions may also potentially be used as an assisting tool for clinicians to aid in the monitoring of healing progress. By limiting human subjectivity, such advances could help to reduce hospital/clinic burdens when treating patients.

Deep learning research in chronic wound segmentation is a relatively new domain. Early attempts to segment wounds involved the use of traditional computer vision techniques. The authors of [54] trained a cascaded two-stage classifier using two support vector machine (SVM) classifiers. Colour and texture descriptors were extracted from superpixels, which were used for the classifier training. Colour and wavelet features were used as feature descriptors to distinguish wounds from healthy tissue. The final stage refined the wound boundary using conditional random field statistical modelling. Nonnegative matrix factorisation was utilised by [55]. Their factorisation segmentation approach was used to extract wound bed features from processed wound images. Local spectral histograms were then generated by convolving a filter bank. For each image pixel, local spectral histograms were calculated to construct the feature matrix.

This paper summarises almost all of the deep learning chronic wound segmentation papers published since 2015. The objective of this work is to investigate the relationships between test set sizes used in experiments and corresponding Dice similarity coefficient (DSC) and intersection over union (IoU) test metrics.
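Since DSC and IoU are the two metrics compared throughout this review, a minimal sketch of how they are commonly computed for a pair of binary masks may be useful. The function name `dice_and_iou`, the nested-list mask representation, and the convention of scoring two empty masks as 1.0 are assumptions made for illustration; individual studies may compute or average these metrics differently.

```python
def dice_and_iou(pred, truth):
    """Return (DSC, IoU) for two equally sized binary masks (nested lists of 0/1)."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # wound pixel predicted and present in ground truth
            elif p:
                fp += 1      # predicted wound pixel that is background in ground truth
            elif t:
                fn += 1      # ground truth wound pixel that was missed
    if tp + fp + fn == 0:
        # Both masks are empty: scored as perfect agreement (assumed convention).
        return 1.0, 1.0
    dsc = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dsc, iou


pred = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
truth = [[0, 1, 1, 1],
         [0, 1, 0, 0],
         [0, 0, 0, 0]]
dsc, iou = dice_and_iou(pred, truth)
print(dsc, iou)
```

For a single pair of masks the two metrics are monotonically related, DSC = 2·IoU / (1 + IoU), so DSC is never lower than IoU. This is worth keeping in mind when comparing studies that report only one of the two.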

2 Methodology

Inclusion criteria for published articles were as follows:

  1. Studies involving deep learning architectures used to segment DFU, PRU, or VLU wounds.
  2. Written in English.
  3. Clearly states the number of images used in train and test sets.
  4. Quantified DSC or IoU metrics in test results.
  5. If the paper was part of an online challenge event, such as the Diabetic Foot Ulcer Challenge 2022 (DFUC 2022) [56], then only the winning paper was selected for review.

Exclusion criteria for published articles that were not included in our review were as follows:

  1. Article was not published in a journal, conference, or workshop.
  2. Reported on wound segmentation results only from animals.
  3. Focuses on only burn wounds. Due to the paucity of publicly available burn wound datasets, we do not include papers that focus only on those wound types in this review.

The Google, Google Scholar, Research Gate, and PubMed search engines were used to locate relevant publications. Search phrases used were: “wound segmentation”, “dfu segmentation”, “ulcer segmentation”, “pressure segmentation”, and “venous segmentation”. Additional terms, such as “deep learning”, “image segmentation”, “computer vision”, and “machine learning”, in addition to Boolean operators, were also used as search terms. Both paper title and main body text were searched when filtering results. Figure 1 shows the process of identifying relevant studies for inclusion in this review.

Fig. 1 Study selection flow diagram

3 Semantic segmentation

In this section, we discuss the most prominent chronic wound segmentation papers that focused on semantic segmentation. Semantic segmentation is defined as pixel-based segmentation of wound pixels in an image. This is distinct from instance segmentation, which involves per-wound based detection that identifies individual wound cases in an image. The vast majority of chronic wound segmentation papers use semantic segmentation, a likely consequence of the wider availability of source code and architectures that target this method. Higher computational costs are also associated with instance segmentation which may prove to be a limiting factor in model selection [57,58,59].

Wang et al. [60] trained a chronic wound segmentation CNN consisting of five encoding layers followed by four decoding layers with rectified linear unit (ReLU) activations, cross-entropy for the loss function, and L2 regularisation with a regularisation coefficient. They used 500 training images and 150 test images sourced from the NYU Wound Database. A modified version of GrabCut [61] was used to crop images to \(480 \times 640\) pixels to reduce non-wound background features. They reported a mean intersection over union (mIoU) of 0.473, with 0.950 pixel accuracy on the test set. This work is notable for being among the first to address chronic wound segmentation using deep learning techniques.

Goyal et al. [62] trained a selection of FCN models to segment DFU wounds and periwounds using a dataset of 600 DFU images with delineated masks provided by clinical experts. Two-tier transfer learning was completed using ImageNet and the Pascal VOC segmentation dataset. The DFU dataset was split into 420 images for training, 60 images for validation, and a test set comprising 120 DFU images and 105 healthy foot images. For combined segmentation of wound and periwound regions, the best reported model was FCN-32s with a DSC of 0.899. For segmentation of only ulcer regions, the best model was FCN-16s with a DSC of 0.794. For segmentation of only periwound regions, the best model was FCN-16s, with a DSC of 0.851. This work observed that FCN-AlexNet and FCN-32s were not able to accurately segment irregular boundaries. Conversely, they noted that the smaller pixel strides used in FCN-16s and FCN-8s resulted in improved segmentation of irregular DFU wound contours. They also noted in test image results that wound and periwound features would overlap due to ambiguous feature boundaries. This work is notable for being one of the few deep learning wound segmentation studies to include periwound features (see Fig. 2).

Fig. 2 Illustration of periwound delineation and separation from wound region, originally reported by Goyal et al. [62]

Li et al. [63] proposed a composite model using watershed and thresholding in pre- and post-processing stages to assist in the removal of non-wound features. They trained a segmentation network comprising 13 convolutional layers of MobileNet as the backbone, where a pair of depthwise and pointwise layers is considered as a single layer. The last convolutional layer is upsampled and fused with the previous convolutional layer, followed by a pooling layer to reduce loss of local information. The fusion result is then upsampled 16 times to ensure that the output is of equal resolution to the input. A final post-processing stage is used to perform morphological operations (hole-filling and small region removal) and additional thresholding to remove any persisting background features. For training and testing, a dataset of 950 images was used, of which 389 were collected from patients in hospital settings and 561 images were sourced from the internet. A total of 760 images were used for training and 190 used for testing. They reported an mIoU of 0.8589 and precision of 0.9470.
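The morphological post-processing described above (hole-filling and small region removal) can be sketched in plain Python as follows. This is an illustrative approximation rather than the implementation used in [63]: the function names, the use of 4-connectivity, and the `min_size` threshold are all assumptions.

```python
from collections import deque


def _flood(mask, seeds, value):
    """Return the set of 4-connected pixels equal to `value` reachable from `seeds`."""
    h, w = len(mask), len(mask[0])
    seen = set()
    queue = deque()
    for r, c in seeds:
        if mask[r][c] == value and (r, c) not in seen:
            seen.add((r, c))
            queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < h and 0 <= nc < w and (nr, nc) not in seen and mask[nr][nc] == value:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen


def fill_holes(mask):
    """Set to 1 every background pixel that is not connected to the image border."""
    h, w = len(mask), len(mask[0])
    border = [(r, c) for r in range(h) for c in range(w) if r in (0, h - 1) or c in (0, w - 1)]
    outside = _flood(mask, border, 0)
    return [[0 if (r, c) in outside else 1 for c in range(w)] for r in range(h)]


def remove_small_regions(mask, min_size):
    """Keep only foreground regions containing at least `min_size` pixels."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    visited = set()
    for r in range(h):
        for c in range(w):
            if mask[r][c] == 1 and (r, c) not in visited:
                region = _flood(mask, [(r, c)], 1)
                visited |= region
                if len(region) >= min_size:
                    for rr, cc in region:
                        out[rr][cc] = 1
    return out


mask = [[0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0, 0],
        [0, 1, 0, 1, 0, 1],
        [0, 1, 1, 1, 0, 0],
        [0, 0, 0, 0, 0, 0]]
cleaned = remove_small_regions(fill_holes(mask), min_size=2)
```

In this toy mask the enclosed background pixel at row 2, column 2 is filled, and the isolated single-pixel region at row 2, column 5 is discarded.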

Elmogy et al. [64] proposed a framework to segment and classify tissue types (slough, necrotic eschar, and granulation) in PRU wounds. First, a region of interest extractor (DeepMedic [65], a 3D CNN) uses three different colour spaces (RGB, HSV, and YCbCr) to reduce background details. Next, tissue segmentation is performed using the different spatial outputs from the region of interest models. The framework was trained and tested on 100 PRU images: 36 images sourced from a healthcare services company (IGURCO GESTION S. L., Spain), and 64 images from the Medetec [66] dataset. The dataset was split into 60 images for training, 10 images for validation, and 30 images for testing. They reported a DSC of 0.93. They also reported results from a Bland-Altman analysis to assess the degree of agreement between the ground truth delineations of three observers. The results of this analysis showed that most ratings fell within the limits of agreement (mean difference \(\pm 1.96\) standard deviations), with a mean difference close to zero indicating good agreement. The same research group would later conduct similar experiments using a slightly larger dataset of 193 PRU images (test set \(\approx 58\)), and reported a reduced DSC of 0.92 [67].
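Bland-Altman analysis conventionally summarises agreement between two sets of measurements as the mean difference (bias) together with limits of agreement placed 1.96 standard deviations either side of it. The sketch below illustrates that calculation only. The function name and the measurement values are hypothetical and are not taken from [64].

```python
import statistics


def bland_altman_limits(rater_a, rater_b):
    """Return (bias, lower limit, upper limit) for paired measurements from two raters."""
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical wound areas (cm^2) delineated by two observers on five images.
rater_a = [10.0, 12.0, 9.0, 15.0, 11.0]
rater_b = [9.0, 13.0, 9.0, 14.0, 12.0]
bias, lower, upper = bland_altman_limits(rater_a, rater_b)
print(bias, lower, upper)
```

A bias near zero with most differences inside the two limits is the pattern usually read as good agreement between observers.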

Godeiro et al. [68] tested four segmentation architectures (U-Net, Segnet, FCN8, and FCN32) using chronic wound images and proposed the use of a colour space reduction (CSR) on the CIELab space that increased DSC, accuracy, specificity, and sensitivity for all four networks. They used a dataset of 30 wound images comprising necrosis, granulation and slough tissue types. A total of 10 images were used for training, 5 for validation, and 15 for testing. Due to the small dataset size, pretrained ImageNet VGG16 weights were used to initialise the models. A watershed algorithm was used to assist in dividing images into wound, skin, and other regions. Following CIELab colour space processing, the four segmentation networks were used to obtain the different tissue types in each wound image. They reported that the U-Net model with CSR provided the highest DSC of 0.9425. This is in comparison with the U-Net without CSR, which resulted in a DSC of 0.9153.

Zahia et al. [69] trained a model capable of performing optimised segmentation of the different tissue types present in pressure injuries (granulation, slough, and necrotic). The model was developed using MATLAB Neural Network Toolbox. They used a preprocessing step to remove flash light artefacts, followed by the creation of a set of \(5 \times 5\) pixel sub-images used for training. They used a dataset of 22 images comprising stage 3 and 4 pressure injuries. A total of 17 images were used for training and 5 images were used for testing, acquired from the Igurko Hospital, Spain. An additional 4 images were purchased from The National Pressure Ulcer Advisory Panel (NPUAP) online store for validation purposes. All train and test images were \(1020 \times 1020\) pixels. Examples of infected and necrotic tissues are present, together with healing states evidencing granulation tissue. In an attempt to negate the limited number of train and test images, all images were automatically cropped to \(5 \times 5 \times 3\) pixels resulting in 270,762 RGB patches for granulation tissue, 37,146 patches for necrotic tissue, and 80,636 patches for slough. This approach ensured that there was no loss of wound texture features when training the network. They report an average DSC value of 0.9138, an average precision per class of 0.9731 for granulation tissue, 0.9659 for necrotic tissue, and 0.7790 for slough tissue. One notable limitation they observed is that deep areas within certain pressure injuries would appear dark, which the model would confuse with necrotic tissue, which is also generally very dark in colour.

Pathompatai et al. [70] proposed a region-focus training strategy for wound images in the Medetec dataset. They split full-size images into smaller patches for training, and found that network performance could be improved by introducing more challenging image patches into the training process. Challenging examples were determined by heatmap analysis. They split full images (typically \(600 \times 400\) pixels) into \(256 \times 256\) pixel patches with a 128 pixel stride (50% overlap). For the generation of additional challenging patches, they cropped from different offsets so that the new patches contained different contextual features. To test their method, they trained U-Net using 115 images for training, 29 images for validation, and 36 images for testing (all counts are for full-size images prior to splitting into patches). Expert labelling was not mentioned in the study; therefore, the ground truth masks should be considered to be a weakly supervised component. They compared models with and without region-focus training, and with different numbers of region-focus patches. Their best performing model reported an mIoU of 0.7816. They noted that although the introduction of difficult patch examples would generally improve network performance, the number of such patches would need to be tuned for different scenarios. This method could be interpreted as a form of manual attention, predating the more widespread use of automated attention mechanisms present in more recent network architectures.
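To make the patch geometry concrete, the sketch below enumerates the top-left corners of \(256 \times 256\) patches taken at a stride of 128 pixels from a \(600 \times 400\) image. How [70] handled image edges is not stated here, so shifting the final patch back to touch the border is an assumption, as is the function name `patch_origins`.

```python
def patch_origins(length, patch=256, stride=128):
    """Return patch start offsets along one image axis for the given patch size and stride."""
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        # Assumed edge handling: add one final patch aligned with the image border.
        origins.append(length - patch)
    return origins


xs = patch_origins(600)  # offsets along the image width
ys = patch_origins(400)  # offsets along the image height
patches = [(x, y) for y in ys for x in xs]
print(xs, ys, len(patches))
```

Under these assumptions a single full-size image yields 12 overlapping patches, which indicates how quickly the effective training set grows relative to the 115 full-size training images.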

Li et al. [71] proposed a segmentation framework based on human-designed feature maps and artificially assigned convolution kernels using a modified MobileNet. First, a location encoder is used to convert the 2D coordinates of the input image into a location map, which is concatenated with the input image. Next, after downsampling, the location map is fused with the output of the network backbone, which is post-processed using smooth kernels to remove small holes and small non-wound regions in the prediction. Finally, the feature maps are upsampled to the size of the input image, resulting in output maps. This work observed that in order to maintain invariance, CNNs obfuscate the location information of input maps. This is contrary to the usual spatial distribution of wounds and backgrounds in wound images, which tend to be non-uniform in nature. In their experiments, they used a dataset of 950 wound images, as used in their previous work [63]. The training set comprised 760 wound images, and the test set comprised 190 images. They reported an mIoU value of 0.8647, a maxIOU value of 0.8675, and a precision value of 0.9503.

Cui et al. [72] conducted experiments which compared DFU segmentation results from U-Net and a patch-based CNN method, originally proposed by [73]. They used a wound dataset of 445 images acquired from New York University. For training, 392 wound images were used, and 53 images were used for testing. The GrabCut tool was used to remove background features. For the patch-based CNN method, the original images were split into patches, resulting in 4500 pairs of local and global patches (\(31 \times 31\) and \(201 \times 201\), respectively). An adaptive thresholding method was used as a post-processing step to remove artefacts. They found that U-Net produced sharper wound boundaries when compared to those created by the patch-based CNN method. The reported U-Net results were 0.845 for DSC and 0.761 for mIoU.

Ohura et al. [74] trained four segmentation models (U-Net, U-Net with VGG16 backbone pretrained on ImageNet, SegNet, and LinkNet) using PRU images (\(n = 396\)) and tested on images of DFU (\(n = 20\)) and VLU (\(n = 20\)) which were collected from the Kyorin University Hospital, Japan. The number of training images used was 356, and the number of test images used was 40. The base U-Net model was found to provide the best results for sensitivity (0.993) and specificity (0.993), while the U-Net with VGG16 model provided the best AUC (0.998), DSC (0.947), and accuracy (0.989). These experiments showed that pretraining on ImageNet improved the standard U-Net with increases in AUC, DSC, and accuracy. However, sensitivity and specificity were slightly lower for the U-Net VGG16 model when compared to the base U-Net model.

Wagh et al. [75] compared results of various deep learning architectures against traditional associated hierarchical random field (AHRF) segmentation, which reformulates the image segmentation task as a graph optimisation problem. The CNN architectures they experimented with were U-Net, FCN, and DeepLabV3. They devised three experiments, each using a different dataset. Dataset 1 comprised a total of 114 wound images (95 train, 19 val) and was acquired under controlled lighting conditions. Dataset 2 comprised a total of 316 wound images (263 train, 53 val). A total of 202 of these images were acquired via internet scraping, and 114 images were taken from dataset 1. Dataset 3 comprised a total of 1442 wound images (1201 train, 241 val) and was sourced from the University of Massachusetts Medical Center. All datasets contained examples of DFU, arterial, venous, PRU, and surgical wounds. The deep extreme cut algorithm [76] was used to provide ground truth labels; therefore, these experiments should be considered as weakly supervised. For dataset 1, the best performing method was FCN, with a DSC of 0.7822. The best performing method with dataset 2 was also FCN, with a DSC of 0.8418. The best performing method with dataset 3 was DeepLabV3, with a DSC of 0.8554. They also conducted an additional experiment using a common validation set across all datasets. For validation DSC results, the best performing models were FCN on dataset 1 (0.7822), and DeepLabV3 on dataset 2 (0.8537) and dataset 3 (0.8760).

Ong et al. [77] proposed a wound segmentation model with 18.5% fewer parameters than U-Net, with the aim of performing inference on mobile devices. They added an additional layer to both upsampling and downsampling pathways. They also replaced the standard convolutions with depthwise separable convolutions (a depthwise convolution followed by a \(1 \times 1\) convolution), allowing for independent convolution of each input channel. The depthwise separable convolutions used a stride of \(2 \times 2\), which downsamples the feature maps by a factor of 2 and is used instead of the max-pooling present in the original U-Net architecture. The upsampling pathway contains transposed convolutions with strides of \(2 \times 2\). A dropout layer was also added after every depthwise separable convolution with a dropout rate of 0.2. Two parameters (“alpha” and “alpha_up”) were added to adjust the number of filters for the downsampling and upsampling pathways, respectively. For training and testing, they used a private dataset sourced from local hospitals comprising 583 wound images. A total of 467 images were used for training, and 116 images were used for testing. A nurse provided the ground truth labels, with a second confirming label quality. After experimenting with different alpha values, they reported an mIoU of 0.869, compared to 0.813 for the unmodified U-Net.
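The parameter saving offered by depthwise separable convolutions can be illustrated with a simple weight count. The sketch below compares a standard \(3 \times 3\) convolution with its depthwise separable counterpart for one layer; the channel counts are arbitrary illustrative values, biases are ignored, and the function names are hypothetical. It does not attempt to reproduce the overall 18.5% reduction reported in [77], which also reflects the added layers and the alpha settings.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one kernel x kernel filter per input/output channel pair."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depthwise separable convolution (depthwise stage plus 1x1 pointwise stage)."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes channels
    return depthwise + pointwise


standard = standard_conv_params(3, 64, 128)
separable = separable_conv_params(3, 64, 128)
print(standard, separable, round(separable / standard, 4))
```

For this single layer the separable form needs roughly 12% of the weights of the standard convolution.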

Wang et al. [78] trained a MobileNetV2 wound segmentation model (pretrained on the Pascal VOC segmentation dataset) using a new dataset of 1109 DFU images (831 used for training, 278 used for testing). They used a localisation preprocessing step to remove non-DFU wound features from images prior to segmentation. Following hole-filling and small region removal morphological post-processing algorithms, they reported a mean DSC of 0.9047. However, there are limitations to this work in that the wound images are very small patches that were then heavily padded up to \(224 \times 224\) pixels. This means that the actual wound pixels constituted very small regions of the total image. Without padding, the average width and height of the training images in this dataset are 71 and 104 pixels, respectively, while the average width and height of the test set images are 70 and 101 pixels, respectively. At such small sizes, as small as \(17 \times 18\) pixels, many DFU wound features would be lost. Heavily limiting features in such a way, and using only a small number of images for testing (\(n = 278\)), it would be easier to obtain high test metrics as the network has fewer features to learn.

Wang et al. [79] would later conduct the Foot Ulcer Segmentation Challenge (FUSC) 2021 which introduced a training set of 810 images, a validation set of 200 images, and a test set of 200 images. The newly introduced images contained examples with significantly less padding and with more of the foot and background visible compared to the previous dataset they released. The winner of this challenge, Mahbod et al. [80], achieved an image-based DSC of 0.8880, a reduction of 0.0167 (1.67 percentage points) from the original score of 0.9047 reported by [78]. This may be evidence that the task became more challenging when larger DFU wound images were introduced. It suggests that the model would need to learn more complex features that were not present in the original experiments conducted by Wang et al. [78] on the prior smaller dataset, which had significantly smaller wound images with significantly fewer features.

Zahia et al. [81] proposed an end-to-end system using 2D wound images and 3D meshes acquired using a structure sensor with the aim of providing automated PRU measurement. For their segmentation experiments, they used a dataset of 175 images for training, and 35 images for testing. They experimented with Mask R-CNN using ResNet-50 and ResNet-101 backbones. They found ResNet-50 (pretrained on MS COCO) to provide the highest performance. For segmentation, they reported a DSC of 0.83, mean sensitivity of 0.85, and a mean precision of 0.87. They attributed the challenging nature of the segmentation task to ambiguous boundaries in some of the PRU images. Such ambiguities may therefore also be reflected in the ground truth labelling, given that such boundaries are likely to be subjective when delineated by expert human annotators.

Niri et al. [82] conducted experiments using superpixels as a preprocessing stage in a wound segmentation workflow. They used an FCN32 segmentation model as part of a pipeline to classify wound tissue types. The model was trained using more than 5000 wound superpixel images with no augmentation. They used 5256 images for training, and 1530 images for testing which were divided into granulation, slough, necrosis, and unknown tissue types. The private dataset was sourced from the Hospital Nacional Dos de Mayo (Lima, Peru) and the CHRO Hospital (Orleans, France) and comprises DFU wound images. They also used 219 images from the ESCALE database in their training set, including leg ulcers, diabetic ulcers, and bed sores. The preprocessing stage of using superpixel images means that the model is not exposed to background details. They reported an accuracy of 0.9268, precision of 0.7807, and a DSC of 0.7574.

Chairat et al. [83] trained a U-Net with an EfficientNet-B2 encoder using the WoundsDB dataset of 188 cases acquired from 47 patients. The dataset was split into 132 images for the training set, 28 images for the validation set, and 28 images for the test set. Due to the high resolution of the original dataset images (\(4896 \times 3264\) pixels), random crops were performed, resulting in a training set of 1056 images and a validation set of 224 images. They reported an mIoU of 0.8674, and observed that detection of smaller wounds was generally less accurate when compared to detection of larger wounds. It was also observed that despite the high resolution of the WoundsDB images, most of the images had been acquired at distance, meaning that the images consisted mostly of skin and environment details.

Watanabe et al. [84] proposed RoleNet (role-oriented fully convolutional networks) to segment wound area using sparse estimation and classify the tissue type (granulation, necrotic, and slough). This framework was developed as part of a larger system that could be used for wound area estimation using RGB-D point cloud data taken with an iPhone X mobile phone. They employed depthwise separable and atrous convolutions in their model architecture to reduce the number of model parameters and to increase the receptive field of convolutions. Downsampling with varying strides was also implemented to prevent loss of features. Their training strategy involved training the segmentation and classification networks separately, and then the models were combined and jointly retrained. A dataset of 40 wound images was acquired from a medical book [85], and was split into train (\(n = 31\)), validation (\(n = 3\)), and test (\(n = 6\)) sets. They reported an mIoU of 0.790.

Niri et al. [86] trained a U-Net for segmentation of skin and wounds in 2D images and 3D wound meshes. Following initial segmentation of the 2D images, they devised a strategy to select the best view from the multi-view 3D meshes, which were then segmented and reprojected back to 2D to update the original 2D segmentation result. For their experiments, they created a dataset of 569 chronic wound images which included pathologies such as DFU, burns, and PRU. A total of 426 images were used for training, while 143 images were used for testing. They reported a DSC of 0.9304 and an mIoU of 0.8661. They observed that high angle and distance variations due to camera position and orientation would affect model performance.

Scebba et al. [87] observed the many challenges inherent in wound segmentation, such as the heterogeneity of wound types, tissue colours, shapes, body position, background composition and complexity, image capturing conditions, and variable specifications of capture devices. They noted that initiatives to standardise medical wound photography would likely result in additional workload burdens on clinicians, and that such standards would likely still not guarantee a consistent resolution to generalisation issues experienced in real-world settings. To address these issues, they proposed a wound segmentation method, whereby a MobileNet localisation model would first localise the wound prior to segmentation using U-Net to reduce the number of extraneous features present in the image. The localisation model predictions were automatically increased in size by 50% to allow for surrounding tissue to be present in the cropped inputs required by the segmentation model. They used five wound datasets in their experiments: SwissWOU, a private dataset of DFU (\(n = 1096\)) and systemic sclerosis digital ulcers (\(n = 63\)); DFUC 2020 [88]; Medetec [66]; SIH (second healing intention dataset); and FUSC (Foot Ulcer Segmentation Challenge) [79]. In some cases, without further justification, full datasets were not used in their experiments, e.g. only 60 images were used from the FUSC dataset, only 53 images were used from the Medetec dataset, and only 58 images were used from the SIH dataset. They experimented with a range of commonly used segmentation networks, with and without automated localisation, and with manually localised images. When tested only on the SwissWOU DFU images (10% of all patients), they found that U-Net was the best performing network, with an MCC of 0.85 and an IoU of 0.75. When tested with the SwissWOU systemic sclerosis digital ulcers, Medetec, SIH, and FUSC images, they found U-Net to be the best performing network, with a mean MCC of 0.8725 and an mIoU of 0.7875.
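The enlargement of the localisation box before cropping can be sketched as follows. The description above does not specify whether the 50% increase applies to each side length or to the box area, nor how boxes near the image border are handled. The sketch assumes that width and height are each scaled by 1.5 about the box centre and that the result is clipped to the image; the function name `enlarge_box` is hypothetical.

```python
def enlarge_box(box, image_w, image_h, scale=1.5):
    """Scale an (x0, y0, x1, y1) box about its centre and clip it to the image bounds."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w = (x1 - x0) * scale / 2
    half_h = (y1 - y0) * scale / 2
    return (max(0, round(cx - half_w)),
            max(0, round(cy - half_h)),
            min(image_w, round(cx + half_w)),
            min(image_h, round(cy + half_h)))


# A detection in the middle of a 640 x 480 image, and one touching the top-left corner.
central = enlarge_box((100, 80, 300, 200), image_w=640, image_h=480)
corner = enlarge_box((0, 0, 100, 100), image_w=640, image_h=480)
print(central, corner)
```

The enlarged crop keeps a margin of surrounding skin around the wound while still discarding most of the background.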

Xing et al. [89] proposed an improved U-Net model for DFU wound segmentation. The first improvement is the use of a coarse positioning module (CPM), which automatically crops the target area of training images to the smallest outer rectangle using the ground truth mask. This method reduces background features prior to training, and was not used on test images. The second improvement is the use of an SVM in the output layer to improve the ability of the model to generalise. The SVM replaces the softmax layer in the U-Net, and was used due to its ability to take into account training error and model complexity. SVM can also achieve better classification results and deal with nonlinearities on small datasets. For their experiments, they used the FUSC dataset, from which 610 images were used for training, 200 images were used for validation, and 200 images were used for testing. Their best performing model, using both the CPM and SVM, reported a DSC of 0.8902. Analysis of a selection of results indicated that the CPM, when used in isolation with U-Net, was able to improve detection of smaller regions and performed well in segmenting larger regions. When the SVM was tested in isolation with U-Net, its performance on smaller wounds was not as accurate as CPM, and was significantly worse when segmenting larger DFU wounds. The CPM and SVM methods combined with U-Net provided the best overall performance, but on visual inspection, the quality of smaller DFU wound regions was not as accurate as the CPM with U-Net approach. Overall, their proposed method outperformed U-Net, ResU-Net, LinkNet, and Attention U-Net.
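The coarse positioning idea, cropping a training image to the smallest rectangle enclosing the ground truth mask, can be illustrated with a short sketch. The helper names `mask_bounding_box` and `crop` are hypothetical, images and masks are represented as nested lists for simplicity, and nothing here should be read as the authors' code.

```python
def mask_bounding_box(mask):
    """Return (top, left, bottom, right) of the smallest rectangle that
    contains every non-zero mask pixel, or None for an empty mask.
    bottom and right are exclusive."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(image, box):
    """Crop a nested-list image to the given (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]
```

Because the rectangle is derived from the ground truth mask, this kind of cropping is only available at training time, which is consistent with the module not being applied to test images.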

Bose et al. [90] proposed D3MSU-Net for wound segmentation, based on the original U-Net architecture. They used dense dilated convolutions to vary the field of view for each network level. Deep multiscale supervision blocks were implemented to provide supervision to hidden layers. This involves calculating loss at the hidden layers in addition to normal loss calculation that occurs at the end of the model, followed by optimisation of the model on the aggregated loss value. They tested their model on a series of medical imaging datasets (x-ray, CT, and MRI), including the FUSC and Medetec wound datasets. For the FUSC and Medetec datasets, they reported DSC values of 0.9285 and 0.9637, respectively.

Chang et al. [91] trained five segmentation models (U-Net, DeepLabV3, PSPNet, FPN, and Mask R-CNN) with a ResNet-101 encoder using a dataset comprising 2893 PRU images. They found the DeepLabV3 model to have the best performance, with an F1-score of 0.9887, IoU of 0.9782, precision of 0.9888, recall of 0.9887, and accuracy of 0.9925. Although these results are promising, there are notable limitations to this work. The authors indicate that challenging examples were excluded from the dataset. These cases included images where wounds were dressed with ointment, covered with dressing, actively bleeding, or obscured by hematoma or pus, and images that were out of focus or acquired under poor lighting conditions. Additionally, their test set was small, comprising only 10% of the total dataset (289 images). They also excluded images where clinicians disagreed on labelling, and the exact composition of the dataset was not reported.

Curti et al. [92] created a private wound dataset entitled “Deepskin” and used semi-supervised techniques in gradual training and validation stages to train a U-Net model with an EfficientNet-B3 encoder pretrained using ImageNet weights. Semi-supervision was used to assist labelling of unlabelled images starting with 145 labelled images, with results being confirmed (although not quantified in the study) by specialists using a set of quality criteria. All images used to train the model were sourced from a single centre, which the authors note may hinder the model’s ability to generalise. The dataset comprised 1564 images (\(1440 \times 1080\) pixels) from 474 patients, which were collected over a 2 year period. The training set comprised 1407 images, and the test set comprised 157 images. Wound locations included foot, leg, chest, arm, and head. The authors note that all images had background details removed and that wounds would always occupy the centre of the image. By removing challenging examples, they achieved 0.96 in DSC, precision, and recall. This work also noted the absence of a set of standardised criteria for wound and periwound area definition among specialists.

Kendrick et al. [93] proposed a modified FCN32 architecture with a VGG16 backbone. Their training strategy involved creating patches of the training images to help de-emphasise non-wound background features. They replaced ReLU with Leaky ReLU to assist feature learning and to reduce the occurrence of dead neurons. Excessive downsampling was avoided by removal of the final three max-pooling layers, helping to retain feature map size in the lowest part of the network. Figure 3 illustrates the overall network architecture. Using the DFUC 2022 dataset, they achieved a DSC of 0.7447 and an mIoU of 0.6467. To date, this work represents the current state-of-the-art result on the largest publicly available chronic wound segmentation dataset—DFUC 2022. A limitation of this work is the lack of data understanding concerning the DFUC 2022 dataset. Currently, the composition and class distribution has not yet been fully analysed, meaning that the dataset may be both imbalanced and biased in terms of train and test distribution for wound class, size, anatomical location, and other factors.

Fig. 3

Illustration of the DFU semantic segmentation network architecture proposed by Kendrick et al. [93]. Orange layers represent convolutions with Leaky ReLU activation, red layers indicate max-pooling, and light green layers indicate skip connections using modified squeeze and excite. In the decoding path, green represents dropout, yellow represents separable convolution with dilation and softmax layer. This model represents the current state-of-the-art performance on the largest publicly available chronic wound segmentation dataset (DFUC 2022)

Liao et al. [94] proposed HarDNet-DFUS, a modified version of the HarDNet-MSEG architecture which enhances the backbone and replaces the decoder. Using the DFUC 2022 dataset, they achieved a DSC of 0.7287, placing them first in DFUC 2022. They enhanced the original HarDNet-MSEG by replacing each HarDBlk module in the encoder backbone with a new HarDBlkV2 module, and replacing RFB modules in the decoder with large window attention (Lawin transformer) which utilises an MLP decoder, an MLP-mixer, and spatial pyramid pooling (SPP) to capture multiscale features. To further increase accuracy, they adopted an ensemble strategy using fivefold cross-validation and test time augmentation (TTA). Additional augmented images were added to the test images when testing using the sub-models, with the average of their outputs used as the final prediction results.
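The combination of sub-model ensembling and test time augmentation can be sketched as below. This is an illustrative simplification: the use of a single horizontal flip as the augmentation, the function names, and the representation of models as plain callables that return nested-list probability maps are assumptions, not details taken from HarDNet-DFUS.

```python
def hflip(grid):
    """Mirror a nested-list image or probability map left to right."""
    return [list(reversed(row)) for row in grid]


def ensemble_tta_predict(models, image):
    """Average probability maps over all sub-models and over one
    augmentation (a horizontal flip, assumed here for illustration)."""
    outputs = []
    for model in models:
        outputs.append(model(image))
        # Predict on the flipped image, then flip the prediction back so
        # that it is aligned with the original image before averaging.
        outputs.append(hflip(model(hflip(image))))
    count = len(outputs)
    height, width = len(image), len(image[0])
    return [
        [sum(out[r][c] for out in outputs) / count for c in range(width)]
        for r in range(height)
    ]
```

In a cross-validation setting, `models` would hold one trained sub-model per fold, and the averaged map would be thresholded to give the final mask.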

Ramachandram et al. [95] developed a segmentation network capable of both wound segmentation (AutoTrace) and tissue type segmentation (AutoTissue) for use in a commercial mobile app. The AutoTrace segmentation model used a traditional auto-encoder design with depthwise separable convolutional layers, attention gates, and strided depthwise convolutions to downsample activations instead of using fixed max-pooling. Additive attention gates were placed at the end of each of the skip connections to regulate the flow of activations from prior layers. Attention coefficients can identify salient image regions and trim feature responses to preserve only the relevant activations. The decoder blocks consisted of bilinear upsampling followed by two depthwise separable convolution layers per block, reducing memory and computational requirements. The AutoTissue segmentation model used EfficientNetB0 as the encoder, with a four-block decoder, with each block comprising a single two-dimensional bilinear upsampling layer followed by two depthwise convolution layers. The AutoTrace model was trained using a private dataset of 467,000 images, and the AutoTissue model was trained using a second private dataset of 17,000 images. The images and clinical annotations were collected at hospitals across North America, providing them with a diverse range of inputs, including varied ethnic groups. However, the exact composition of the dataset was not disclosed in the study. They report an mIoU of 0.8644 for wound segmentation and 0.7192 for wound and tissue segmentation. A cohort of wound clinicians, by consensus, rated 91% (53/58) of the tissue segmentation results to be between fair and good for segmentation and tissue identification quality. Reporting of qualitative assessment of deep learning results is uncommon in wound related deep learning studies. The reporting of such qualitative measures in this study is useful; however, it is still very limited, with only 58 examples assessed. 
The reporting of the qualitative measures is also somewhat vague, i.e. a high percentage of ratings were found to be between fair and good. An additional limitation of this work was the small size of the test sets they used, with only 2000 image-label pairs used for testing the AutoTrace model, and 383 images for the AutoTissue model. This means that for the segmentation AutoTrace model, from a dataset total of 469,000 images (467,000 train + 2000 test), less than 0.44% of the total images were used for testing. Such small test sets may not be statistically representative and may not contain sufficiently varied examples when compared to the vast size of the corresponding training sets, meaning that evaluation metrics may not be reliable. This aspect may be especially pertinent in cases where the exact composition of the test set has not been quantified.

Marijanović et al. [96] developed three segmentation models for use with a robotic manipulator, RGB-D camera, and 3D scanner in the acquisition of wound images. They used the FUSC DFU dataset to train and test their models. A fixed-size overlapping sliding-window method was used to generate input images to the network, with each model using a different window size. Window sizes of 5, 7, and 9 were used for Model5, Model7, and Model9, respectively. The threshold for an input sub-image to be marked as wound was 50% for Model5 and Model7, and 25% for Model9. The core model architecture comprised four fully connected hidden ReLU layers and one fully connected output layer with a sigmoid activation function. A total of 810 wound images were used for training (train = 648, val = 162) and 200 wound images were used for testing. Predictions from all three models were merged to form a final prediction mask using the AND logical operator. Finally, they performed post-processing using thresholding, noise removal, and region filling morphological operations. They reported a recall value of 0.77, a precision value of 0.72, and a DSC of 0.74.
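Two steps of this pipeline, labelling sliding windows by their wound-pixel fraction and merging the three prediction masks with a logical AND, can be sketched as follows. The function names, the stride of one pixel, and the use of "greater than or equal to" for the threshold comparison are assumptions made for illustration only.

```python
def wound_fraction(mask, top, left, size):
    """Fraction of wound pixels inside a size x size window of a binary mask."""
    window = [row[left:left + size] for row in mask[top:top + size]]
    return sum(map(sum, window)) / (size * size)


def label_windows(mask, size, threshold, stride=1):
    """Label every overlapping window as wound (True) or non-wound (False)."""
    labels = {}
    for top in range(0, len(mask) - size + 1, stride):
        for left in range(0, len(mask[0]) - size + 1, stride):
            fraction = wound_fraction(mask, top, left, size)
            labels[(top, left)] = fraction >= threshold
    return labels


def merge_with_and(masks):
    """Keep a pixel only if every model predicted it as wound."""
    return [
        [int(all(pixels)) for pixels in zip(*rows)]
        for rows in zip(*masks)
    ]
```

Merging with AND is conservative: a pixel survives only when all models agree, which tends to favour precision over recall.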

Swerdlow et al. [97] trained a Mask R-CNN model with a ResNet-101 backbone for segmentation and classification of stage 1–4 PRU injuries. They used a dataset (eKare Inc. Data pressure injury wound data repository) of 969 pressure injury images, with 848 images used for training, and 121 images used for testing. This work noted the lack of publicly available datasets, referencing pressure injury images available in the Medetec dataset. However, the study did not use these images in experiments due to degraded image quality. They reported a DSC of 0.92 for stage 1 injuries, 0.85 for stage 2 injuries, 0.93 for stage 3 injuries, and 0.91 for stage 4 injuries. The study protocol ensured that all images were taken with the same camera from approximately the same distance (40–65 cm) from the wound. The authors also note that wounds smaller than \(2 \times 2\) cm were not included in the dataset.

Liu et al. [98] trained a U-Net segmentation model with a ResNet-101 backbone for use with automatic wound area measurement using a LiDAR camera. They used 528 images of PRU to train and test their model. They reported a mean DSC value of 0.8488, mIoU of 0.7773, mean precision of 0.8756, mean recall of 0.8639, and mean accuracy of 0.9807. However, this work notes that only the highest quality photographs were used, indicating that there may have been a lack of challenging examples. They initially collected 1038 PRU photographs from the National Taiwan University Hospital. They then excluded images which were blurred, overexposed, underexposed, obscured, or contained numerous other non-wound objects or features. The total number of PRU photographs used in their experiments was 528, of which 327 were used for training and internal validation, and 201 were used for external validation. All images were resized to \(512 \times 512\) pixels prior to training. This study observed that, for the model to become more robust, more challenging images would need to be used in training, ideally exhibiting more complex non-wound background details. They also observed that PRU photographs could not always reveal drainage sinus and deep dead space, especially when they were taken by nonprofessional first-line caregivers. These regions may show as black or very dark regions in the image.

Lan et al. [99] proposed FusionSegNet, which performed segmentation as a means of improving binary classification of chronic wounds. They used the FUSC dataset for training and testing their segmentation model. Using pretraining on the AZH wound dataset, they evaluated the results of three segmentation networks using the FUSC validation set. All images were resized to \(512 \times 512\) pixels for all experiments. They found that U-Net provided the highest metrics out of the three segmentation models they trained (Residual U-Net, MobileNetV2, and U-Net) and reported a precision value of 0.9009, a recall value of 0.9026, and a DSC value of 0.9010.

Oota et al. [100] proposed WSNET, a segmentation framework comprising (a) wound-domain adaptive pretraining on a large unlabelled wound dataset, and (b) a global–local architecture that utilises the full image and its patches to learn fine-grained details from heterogeneous wounds. They used a new dataset of 2686 wounds (WOUNDSEG dataset), comprising examples of DFU, pressure trauma, venous, surgical, arterial, cellulitis, and others. Three classification backbones were first pretrained (DenseNet121, DenseNet169, and MobileNet) using images of DFU (\(n = 19,773\)), PRU (\(n = 47,541\)), surgical wounds (\(n = 12,238\)), trauma wounds (\(n = 13,667\)), and venous ulcers (\(n = 32,492\)). The classifiers were then frozen, and the decoder weights were fine-tuned over the wound segmentation dataset for four segmentation models. When tested on the WOUNDSEG dataset, they reported a DSC of 0.847. They also reported results of 0.956 DSC for the Medetec dataset (using U-Net with a DenseNet169 backbone), and 0.927 DSC for the AZH dataset (using LinkNet with a DenseNet121 backbone). This work highlights the importance of large-scale same-domain pretraining. Although these results are impressive, they are not reproducible as the vast majority of the dataset is private, and the pretrained weights have also not been shared publicly.

Table 1 Summary of deep learning experiments for chronic wound segmentation between 2015 and 2023

4 Instance segmentation

In this section, we discuss all the chronic wound segmentation papers that specifically reported on the use of instance segmentation methods. Instance segmentation focuses on identification of individual wounds per image, as opposed to semantic segmentation, which involves detection of wound pixels in an image. Our investigation found only four studies which specifically indicated that the experiments utilised instance segmentation. Instance segmentation is generally regarded as a more challenging task compared to semantic segmentation, as it requires localisation of wounds in addition to segmentation of wound regions. We note that there may be ambiguity as to what constitutes instance segmentation in the literature due to the models used in studies and the methods of evaluating those models. For example, Mask R-CNN is considered an instance segmentation network; however, studies that use this model architecture may not evaluate the results on an instance basis.
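To make the distinction concrete, the sketch below evaluates predictions on an instance basis: each predicted wound is matched to at most one ground truth wound, and a match counts only if the IoU reaches a threshold. The greedy matching strategy, the representation of each instance as a set of `(row, col)` pixels, and the function names are assumptions chosen for brevity; benchmark tools such as the COCO evaluation code follow a more elaborate protocol.

```python
def mask_iou(a, b):
    """IoU of two instances, each given as a set of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_instances(predicted, ground_truth, iou_threshold=0.5):
    """Greedily match predictions (assumed sorted by confidence) to
    ground truth instances and return (true positives, precision, recall)."""
    unmatched = list(range(len(ground_truth)))
    true_positives = 0
    for pred in predicted:
        best_index, best_iou = None, 0.0
        for index in unmatched:
            iou = mask_iou(pred, ground_truth[index])
            if iou > best_iou:
                best_index, best_iou = index, iou
        if best_index is not None and best_iou >= iou_threshold:
            # Each ground truth wound can be claimed by one prediction only.
            unmatched.remove(best_index)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return true_positives, precision, recall
```

A pixel-level DSC computed on the union of all predicted masks would not penalise two wounds merged into one prediction, whereas this instance-level matching would.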

Wijesinghe et al. [105] developed a mobile app capable of severity stage classification of diabetic retinopathy and DFU wound segmentation. For the DFU segmentation task, they trained and tested a Mask R-CNN model using 400 DFU images sourced from a diabetic clinic in Sri Lanka. Using ten images for evaluation, they reported a mean average precision (mAP) of 0.87 at an IoU threshold of 0.5.

Gamage et al. [106] trained a Mask R-CNN network with a ResNet-101 backbone (pretrained with the MS COCO dataset) on neuropathic ulcers for instance segmentation. They used a dataset of 400 images, with 360 images used for training and 40 images used for testing. Their training strategy involved three stages: (1) during the first 30 epochs, only the head layers (the region proposal network, classifier and mask heads) of the network are trained; (2) in the next 50 epochs, the upper 4+ layers are trained; and (3) in the last 20 epochs, all layers in the network are trained. They achieved 0.5084 mAP (for IoU = 0.5 to 0.95), 0.8632 average precision (AP) (for IoU = 0.5), and 0.6157 AP (for IoU = 0.75).

Privalov et al. [101] trained a Mask R-CNN network for DFU instance segmentation. They pretrained their model using the MS COCO dataset. They used 295 wound images for training and 35 images for testing. They reported a DSC of 0.7910. They also performed inter- and intra-rater analyses using one-way analysis of variance (ANOVA) to analyse the variance between and within group means. This revealed no statistically significant differences for all raters for the network in the first round (\(F=1.424\) and \(p>0.228\)) and the second round (\(F=0.9969\) and \(p>0.411\)) for segmentation. The repeated measure analyses revealed no statistically significant differences in the quality of segmentation for the four medical experts (\(F=6.05\) and \(p>0.09\)). However, they observed some intra-rater variability.

Evidence of the effect of reducing the size of the train and test sets on deep learning wound segmentation models was highlighted by Cao et al. [107]. They reported a series of baseline results using 4000 DFU images trained on a selection of segmentation networks. They then trained using their own instance segmentation model (a variant of Mask R-CNN) with a reduced dataset of 1426 images and observed a significant increase in mAP (0.6940 to 0.8570). However, it was not clear if the increases in performance metrics were due to the new model they created or a result of simply reducing the number of images used in their experiments. The study indicated that only DFU wound images with clear ulcer edges and non-blurry cases were used in the experiments with the proposed model, suggesting that challenging examples were excluded.

5 Meta-analysis

A total of 40 chronic wound deep learning segmentation papers were included in this review, with a total of 43 experiments from the papers summarised in Table 1. Figure 4 summarises the number of publications in chronic wound deep learning research covered by this review and meta-analysis.

Fig. 4

Graph showing the number of deep learning chronic wound segmentation publications between 2015 and 2023. Note that the 2023 figure only includes papers published up to March 2023

The total number of experiments that reported DSC values is 34, and the mean DSC value for all experiments is 0.8733. In terms of test sets, all experiments reporting DSC values can be divided into two distinct groups: (1) experiments that use between 5–289 test images; (2) experiments that use between 1530–2686 test images. A total of 38 experiments used test sets ranging from 5–289 images, and a total of five experiments used test sets ranging from 1530–2686 images. The mean DSC for experiments in the 5–289 test images range is 0.8872. The mean DSC for experiments in the 1530–2686 test images range is 0.7695. These mean DSC values indicate a negative correlation between test set size and reported DSC, with a difference of 0.1177 between those experiments that use \(< 300\) test images and those that use \(> 1500\) test images. Figure 5 shows the relationship between test set sizes and DSC values. The trend line demonstrates a negative correlation, i.e. smaller test sets resulted in higher DSC, and larger test sets resulted in lower DSC values.

Fig. 5

Scatter chart showing the negative correlation between test set sizes used in the literature and corresponding DSC values

The total number of experiments that reported mIoU values is 16, and the mean mIoU value for all experiments is 0.7976. In terms of test sets, the experiments that reported mIoU values can be divided into two distinct groups: (1) experiments that use between 6–289 test images; (2) experiments that use between 2000–2686 test images. A total of 13 experiments used test sets ranging from 6–289 images, and a total of 3 experiments used test sets ranging from 2000–2686 images. The mIoU for experiments in the 6–289 test images range is 0.8106. The mIoU for experiments in the 2000–2686 test images range is 0.7414. As with the previous DSC analysis, these mIoU values indicate a negative correlation between test set size and reported mIoU, with a difference of 0.0692 between those experiments that use \(< 300\) test images and those that use \(\geq 2000\) test images. Figure 6 shows the relationship between test set sizes and mIoU values. The trend line demonstrates that higher mIoU values correlate with smaller test sets, and that lower mIoU values correlate with larger test sets. Although this trend is less prominent when compared to the DSC results (see Fig. 5), there is still a clear correlation.

Fig. 6

Scatter chart showing the negative correlation between test set sizes used in the literature and corresponding mIoU values

Of the 43 experiments summarised in Table 1, 18 experiments reported a DSC value of \(> 0.9\), all of which used test sets that comprised \(< 300\) images. The highest DSC value among all experiments that used \(\geq 1500\) test set images is 0.8470. Only one experiment reported an mIoU value of \(> 0.9\), which used a test set of \(< 300\) images, while there were no experiments which reported an mIoU \(> 0.9\) when \(> 300\) test set images were used.

The mean test set size for all experiments is 344. The mean test set size of the experiments that comprise \(< 300\) images is 120. The mean test set size of the experiments that comprise \(> 1500\) images is 2044. The total number of experiments that use \(< 300\) test images is 38. The total number of experiments that use \(> 1500\) test images is 5. The total number of experiments that reported DSC values is 34. The total number of experiments that reported mIoU values is 16. The total number of experiments that reported both DSC and mIoU is 7.
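The overall mean test set size can be cross-checked against the two group means using a count-weighted average. The helper name `pooled_mean` is introduced here for illustration; because the group means quoted above are themselves rounded, the check is approximate.

```python
def pooled_mean(groups):
    """Combine per-group means into an overall mean.

    groups -- iterable of (number of experiments, group mean) pairs
    """
    groups = list(groups)
    total = sum(count * mean for count, mean in groups)
    return total / sum(count for count, _ in groups)


# 38 experiments with a mean of 120 test images and 5 experiments with a
# mean of 2044 test images, as reported in the text.
overall = pooled_mean([(38, 120), (5, 2044)])
```

Rounding `overall` to the nearest integer gives 344, in agreement with the reported mean test set size for all 43 experiments.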

These findings show that, on average, experiments that use smaller test sets (\(< 300\) images) reported significantly higher DSC and mIoU values compared to experiments that use \(> 1500\) test set images. We observe that the \(R^{2}\) values for both graphs (Figs. 5 and 6) might be considered relatively low (\(R^{2} = 0.2163\) for DSC, \(R^{2} = 0.0551\) for mIoU) in some fields of study. However, \(R^{2}\) values are domain dependent, and in the absence of similar studies in this field, we report these measures as baselines which can be compared against in future works. A possible limitation of this analysis is that the number of experiments that reported the use of test sets with \(< 300\) images (\(n = 38\)) significantly outnumbers those experiments that reported the use of larger test sets (\(n = 5\)). Additionally, the total number of experiments present in the analysis may not yet be sufficient to form a comprehensive assessment. However, given that few experiments in the literature provide composition details of test sets (discussed further in Sect. 5.3), we suggest that the trends shown in Figs. 5 and 6 are at least partly indicative of issues prevalent in results reported in the literature.
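For readers who wish to reproduce this kind of trend line analysis on their own collection of results, the coefficient of determination for a simple least-squares line can be computed as sketched below. The function name is hypothetical, and the sketch assumes an ordinary least-squares fit of the metric against test set size; the fitting procedure used to produce Figs. 5 and 6 is not specified beyond the reported \(R^{2}\) values.

```python
def linear_fit_r2(xs, ys):
    """Fit y = intercept + slope * x by least squares and return
    (slope, intercept, R^2). Requires at least two distinct x values
    and at least two distinct y values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot
```

A negative `slope` corresponds to the downward trend described above, while \(R^{2}\) indicates how much of the variance in the metric the line accounts for.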

For all experiments reported in Table 1, the smallest training set used is ten images, and the largest training set size is 467,000 images. The smallest validation set used is three images, and the largest validation set size is 200 images. The smallest test set used is five images, and the largest test set size is 2686 images. The mean total dataset size (train, validation, and test) for all experiments in Table 1 is 8712 images. Table 2 shows a summary of the total number of datasets used in the experiments reviewed in the literature.

Table 2 Summary of datasets used in the experiments in the reviewed literature

The mean width and height of all images used in the experiments detailed in Table 1 is 585 and 506 pixels, respectively. The mean total number of pixels used in the experiments reported in Table 1 is 296,010 pixels per image.

A summary of the deep learning model architectures used in the experiments detailed in Table 1 is shown in Fig. 7. These figures clearly show that U-Net (20 experiments) is the most commonly used model architecture in chronic wound segmentation deep learning research, followed by FCN (5 experiments) and LinkNet (4 experiments).

Fig. 7

Graph summarising the different model architectures used in the experiments in the reviewed literature

5.1 Test sets

One of the main observations in the field of deep learning for wound segmentation is that most studies use very small test sets (\(< 300\) images) and that test metrics, such as DSC and mIoU, often correspond to test set size (see Figs. 5 and 6). Studies that report results from smaller test sets, on average, show significantly higher test metrics, while studies that use larger test sets, on average, show significantly lower test metrics. The use of very small test sets in the majority of chronic wound deep learning studies means that those segmentation models may not be sufficiently challenged during testing. Such models are therefore unlikely to generalise well in real-world settings. Comparison of evaluation and test metrics across existing studies should therefore be regarded tentatively, especially in cases where small test sets have been used.

5.2 Qualitative assessment

Deep learning chronic wound segmentation methods have been shown to provide high levels of accuracy in laboratory settings [108]. However, very few studies focus on expert qualitative assessment. This is understandable, given that obtaining access to clinical experts can be difficult. However, given that the aim of these studies is to work towards the development of systems that will be used in real-world settings, expert qualitative assessment should be regarded as a key factor in results analysis. Research groups who do not have access to clinical experts should consider collaboration with those groups who do in order to advance the field, and to promote the idea of human-in-the-loop in the development of chronic wound segmentation models.

A recent study reported very poor inter-rater agreement in determining the presence and regions of various tissue types, with results as low as 0.014 (Krippendorff alpha value) for epithelial tissue [48]. For the largest publicly available chronic wound dataset (DFUC 2022), we analysed the inter-rater reliability of the expert coders on \(\approx 20\%\) of the data. For the delineation between experts, the mean DSC is 0.6982 (\(sd = 0.2545\)), with an mIoU of 0.5877 (\(sd = 0.2670\)), a recall of 0.9112 (\(sd = 0.1907\)), a precision of 0.6383 (\(sd = 0.2846\)), and an accuracy of 0.9869 (\(sd = 0.0291\)). The sd values from these results indicate significant variation in terms of mean DSC, mIoU, recall, and precision. Variability in expert labelling used for ground truth masks may mean that using only ground truth masks during testing could be insufficient when assessing the true accuracy of wound segmentation model prediction results. An example from the DFUC 2022 dataset showing expert rater delineation variability is shown in Fig. 8.
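The agreement measures listed above can all be derived from pixel-wise counts between two binary delineations, as in the following sketch. The function name, the nested-list mask representation, and the convention of returning 1.0 when a denominator is zero are assumptions. Note that recall and precision are asymmetric, so one expert's mask has to be treated as the reference.

```python
def agreement_metrics(reference, candidate):
    """Pixel-wise agreement between two binary masks of equal size."""
    tp = fp = fn = tn = 0
    for ref_row, cand_row in zip(reference, candidate):
        for ref, cand in zip(ref_row, cand_row):
            if cand and ref:
                tp += 1
            elif cand:
                fp += 1
            elif ref:
                fn += 1
            else:
                tn += 1

    def ratio(numerator, denominator):
        # Assumed convention: two empty masks agree perfectly.
        return numerator / denominator if denominator else 1.0

    return {
        "dsc": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }
```

When one rater draws a more generous boundary that fully contains the other rater's region, recall stays high while precision, DSC, and IoU fall. Accuracy remains high whenever background pixels dominate the image, which is why it is a weak indicator of delineation agreement.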

Fig. 8

Illustration showing delineation variability between two clinical experts—a shows the original wound image; b shows the delineation from expert A; c shows the delineation from expert B. Example taken from the DFUC 2022 dataset. Note that images are cropped for illustrative purposes

5.3 Data characterisation

Another notable limitation of current deep learning research in chronic wound segmentation is the lack of information regarding how well balanced datasets are when used in experiments. Data characterisation can be a vital component in deep learning research experiments [109]. The exact composition of chronic wound datasets used in deep learning segmentation experiments is often poorly understood and rarely reported. Examples of balancing may include any combination of the following:

  1. Balance by wound class / tissue composition.
  2. Balance by wound size / severity.
  3. Balance by patient clinical data (e.g. ethnicity, age, etc.).
  4. Balance by image quality (e.g. lighting or image sharpness).
  5. Balance by feature similarity.

Studies that do not attempt to address at least a few of these possible imbalances may produce unreliable results. For example, a test set that includes only higher quality images means that the results may be biased as only easy to segment examples are used for inference. Another pertinent example is when test sets contain images that have high visual similarity with other images in either the train or test set [110]. Removal of duplicate images present within and across train and test sets is also an important factor [111]. Imbalances such as these are unlikely to allow for sufficient testing of the ability of a trained model to generalise, and resulting test metrics are likely to overestimate a model’s true ability. In studies where small test sets are used, these factors may be magnified considerably. These findings are consistent with recent studies in other deep learning domains which note a lack of transparency and dataset characterisation [112, 113]. Thorough data understanding and transparency should be seen as a critical component of all deep learning research in order to obtain reliable results, which in turn will help to establish clinical confidence in model predictions.
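As a simple illustration of the duplicate check mentioned above, the sketch below flags test images whose file contents are byte-identical to a training image. The function name and the dictionary-based inputs (file name mapped to raw bytes) are hypothetical. This approach detects exact duplicates only; visually similar images that have been resized, recompressed, or cropped would require a similarity-based method, which is not shown here.

```python
import hashlib


def find_cross_split_duplicates(train_images, test_images):
    """Return {test name: [train names]} for byte-identical images.

    train_images, test_images -- dicts mapping file name to raw bytes
    """
    train_by_digest = {}
    for name, data in train_images.items():
        digest = hashlib.sha256(data).hexdigest()
        train_by_digest.setdefault(digest, []).append(name)

    duplicates = {}
    for name, data in test_images.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in train_by_digest:
            duplicates[name] = train_by_digest[digest]
    return duplicates
```

Any test image returned by such a check would need to be removed or reassigned before evaluation, since its presence in both splits inflates test metrics.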

5.4 Experiment design

Studies that use models pretrained on large general datasets, such as Pascal VOC or ImageNet, often do not report on the effect of the pretraining, i.e. they do not compare models trained with and without pretraining. This is common among most wound segmentation papers, whereby the effect of an experiment component is usually not tested in isolation. Another example of this phenomenon is in the use of augmentation methods during training phases [114]. Reporting of results from studies which conduct experiments on individual augmentation methods in wound segmentation for different datasets would be highly useful for future research. Similar data have been published in other medical imaging domains [115], which should act as a motivator to make progress in the field of deep learning in chronic wound research.
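One way to organise such isolated comparisons is to enumerate an ablation grid in which each run changes a single component. The sketch below is purely illustrative: the function name, the configuration keys, and the example option names are hypothetical and do not correspond to any study in this review.

```python
def ablation_configs(pretraining_options, augmentations):
    """List training runs that vary one component at a time.

    For every pretraining option, one run uses no augmentation and one
    further run is created per individual augmentation method.
    """
    configs = []
    for pretrained in pretraining_options:
        configs.append({"pretrained": pretrained, "augmentation": None})
        for augmentation in augmentations:
            configs.append(
                {"pretrained": pretrained, "augmentation": augmentation}
            )
    return configs


# Hypothetical option names, used only to show the shape of the grid.
runs = ablation_configs([None, "imagenet"], ["flip", "rotate"])
```

Comparing each augmented run against the unaugmented run with the same pretraining setting isolates the contribution of that augmentation, and comparing the two unaugmented runs isolates the contribution of pretraining.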

5.5 Challenging cases

Several studies investigated in this review (\(n = 6\): [87, 91, 97, 98, 103, 107]) noted that challenging examples had been removed prior to training and testing models. Such cases may include photographs with poor or inconsistent lighting, blurry features, small wounds, or wounds that are present on curvatures of the anatomy. In those studies that reported the removal of challenging examples, experiments had essentially been set up with ideal conditions in order to obtain the best results. We posit that reliable real-world wound segmentation systems should be designed to be highly robust and be able to handle such challenging cases, and that this aspect may be one of the key facets in moving the field forward, especially in terms of home monitoring of wounds where capture conditions are less likely to be controlled.

5.6 Reproducibility

Our findings show that not all published research works in wound segmentation are easily reproducible. Reproducibility and transparency in terms of source code sharing are vital in this domain to allow other researchers to fully benefit from the findings of others, and to help build on existing methods. Model architectures may be reproducible without publicly available source code by using the model details recorded in publications. However, this depends heavily on the quality of the details within each paper, and incurs a significant time commitment as model architectures have to be reconstructed from scratch. Of the 40 papers we reviewed, ten papers indicated that they had modified model architectures, of which two papers [90, 94] provided online repository links to their source code.
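Alongside source code, the software environment also affects whether a result can be reproduced, since training behaviour can depend on specific library versions. The following minimal sketch shows one possible way to record this information using only the Python standard library. The function name `environment_snapshot` and the example package names are assumptions for illustration; a project would list whichever packages it actually depends on.

```python
import json
import platform
from importlib import metadata


def environment_snapshot(packages: list[str]) -> dict:
    """Collect interpreter, platform, and installed package versions.

    Packages that are not installed are recorded as None rather than raising.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Hypothetical dependency list; replace with the packages a project really uses.
snapshot = environment_snapshot(["numpy", "torch"])
snapshot_json = json.dumps(snapshot, indent=2, sort_keys=True)
```

Saving `snapshot_json` next to the trained weights and published results gives other researchers a record of the environment used.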

5.7 Chronic wound dataset availability

Sharing of chronic wound datasets has improved in recent years [116,117,118]; however, compared to other medical domains, the total number of publicly available datasets is still very small. One of the main reasons for this is the difficulty in accessing clinical experts who are able to provide sufficient ground truth labelling [119]. Additionally, General Data Protection Regulation (GDPR) compliance may present barriers to data sharing. Scenarios that may prevent GDPR compliance include the presence of patient Personally Identifiable Information (PII) within dataset images and associated clinical data. Model explainability also falls under the remit of GDPR, which poses additional ethical issues [120]. Debiasing of deep learning datasets may also fall under GDPR's special categories of data when patient images are processed for use in training deep learning segmentation models [121]. The lack of publicly available ethnically diverse datasets also poses a problem for chronic wound segmentation research, especially the lack of datasets that include darker skin tones [122]. Of those studies that claim to have used ethnically diverse wound datasets, the exact composition of those datasets was not reported, so it is unknown how well those models were able to generalise across different ethnic groups.

Table 3 Summary of publicly available chronic wound datasets

Due to the significant range of features found across different chronic wounds at different stages of development, availability of highly heterogeneous wound datasets should be seen as critical to the progress of scientific investigation in the field. Of the 43 experiments summarised in Table 1, 24 papers used private datasets. This includes datasets that were claimed to be publicly available but were either not found online or had been unsuccessfully requested. These findings indicate that the vast majority of deep learning studies in chronic wounds have not shared their datasets. The scarcity of public datasets and the abundance of private datasets represent a serious challenge for research in this domain and may be a significant limiting factor in research progress. A summary of publicly available chronic wound datasets is shown in Table 3.

6 Recommendations

This review highlights some of the major challenges present in current deep learning research in chronic wound segmentation. Based on our findings in the review and meta-analysis, we propose a number of recommendations for researchers working in this domain:

  1. Future research works should focus on significantly increasing test set sizes to provide a more meaningful assessment of chronic wound segmentation that better reflects the large variety of wound features found in real-world settings. Taking into account the current publicly available chronic wound datasets, we recommend a train and test split of 50:50.

  2. Researchers should share their source code with the research community so that others may more easily build on research progress. Exact details of environment setup should also be included, as training environments can be highly dependent on specific library versions.

  3. Researchers should attempt to integrate multiple publicly available chronic wound datasets into their experiments. This will help to build a better understanding of the ability of a given model, especially in cases where datasets from multiple sources are used as test sets.

  4. Researchers should seek to collaborate in cases where access to clinical expertise is limited. This will help to increase the currently limited reporting of qualitative measures.

  5. Rather than exclude challenging cases from experiments, researchers should include such examples and devise methods that are able to better accommodate these types of images in order to make models more robust.

  6. Researchers should seek to better understand the nature of the data they are working with. This will help to improve the scope of understanding in the field and may reveal aspects that have not previously been appreciated or considered.
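The 50:50 split and the use of multiple data sources recommended in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the seed value, and the representation of each dataset as a list of image identifiers are all hypothetical. Each source is split separately so that every dataset contributes to both the train and test sets, and a fixed seed keeps the split repeatable.

```python
import random


def split_half(image_ids: list[str], seed: int = 0) -> tuple[list[str], list[str]]:
    """Shuffle identifiers with a fixed seed and split them 50:50.

    With an odd count, the extra image goes to the test half.
    """
    ordered = sorted(image_ids)  # sort first so input order does not affect the split
    random.Random(seed).shuffle(ordered)
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def split_sources(sources: dict[str, list[str]], seed: int = 0):
    """Apply the 50:50 split to each source dataset and pool the results."""
    train, test = [], []
    for name in sorted(sources):
        source_train, source_test = split_half(sources[name], seed)
        train.extend((name, image_id) for image_id in source_train)
        test.extend((name, image_id) for image_id in source_test)
    return train, test
```

This sketch splits at the image level. If several images come from the same patient or wound, grouping by patient before splitting would be a reasonable extension, although that grouping is an assumption beyond what is recommended above.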

From the above suggestions, we note that simply increasing test set size may still not be sufficient to determine a model’s true ability to generalise if the composition of the test set is not fully understood. However, in the absence of a means of fully understanding the composition of the test set, increasing the test set size may help to mitigate some of the issues associated with the use of small test sets, especially if test sets from multiple sources are used.
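To illustrate why test set size matters, the sketch below estimates a percentile bootstrap confidence interval for a mean per-image score such as the Dice coefficient. It is an assumption-laden example rather than an analysis from this review: the scores are made-up values, the function name and its parameters are hypothetical, and the bootstrap is only one of several ways to quantify this uncertainty. For the same spread of scores, the interval narrows as the number of test images grows.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-image scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the test set with replacement and record the mean each time.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower_index = int(n_resamples * alpha / 2)
    upper_index = n_resamples - lower_index - 1
    return means[lower_index], means[upper_index]


# Made-up per-image scores for a small test set of 10 images.
small_test = [0.60, 0.90, 0.70, 0.95, 0.50, 0.85, 0.80, 0.65, 0.90, 0.75]
# The same score distribution repeated to mimic a test set of 400 images.
large_test = small_test * 40

small_low, small_high = bootstrap_ci(small_test)
large_low, large_high = bootstrap_ci(large_test)
```

A narrower interval does not address bias in how the test images were selected, which is why understanding the composition of the test set remains necessary.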

In the long term, the most accurate overall assessment of the ability of a model to generalise would come from a combined analysis of dataset understanding, test metrics, and qualitative feedback obtained from clinical experts in wound care.

7 Conclusions

In this review and meta-analysis of chronic wound segmentation methods in deep learning, we identify, for the first time, some of the major issues that present significant obstacles in this research domain. The most notable of these issues is the use of very small test sets in the vast majority of studies, combined with a lack of data understanding. Models evaluated on such test sets are unlikely to perform well on out-of-distribution examples, meaning that they will likely not generalise well to the wide range of chronic wound features found in real-world settings. This work represents the most substantial review of deep learning in chronic wound segmentation to date, and provides researchers with key insights into possible areas of research progress.